How to add a new column to a Spark RDD? - apache-spark

I have an RDD with MANY columns (e.g., hundreds). How do I add one more column at the end of this RDD?
For example, if my RDD is like below:
123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
......
29, 94, 956, ..., 758
how can I add a column to it, whose value is the sum of the second and the third columns?
Thank you very much.

You do not have to use Tuple* objects at all to add a new column to an RDD.
It can be done by mapping each row, taking its original contents plus the elements you want to append, for example:
import org.apache.spark.sql.Row  // this approach assumes an RDD[Row]

val rdd = ...
val withAppendedColumnsRdd = rdd.map(row => {
  val originalColumns = row.toSeq.toList
  val secondColValue = originalColumns(1).asInstanceOf[Int]
  val thirdColValue = originalColumns(2).asInstanceOf[Int]
  val newColumnValue = secondColValue + thirdColValue
  Row.fromSeq(originalColumns :+ newColumnValue)
  // Row.fromSeq(originalColumns ++ List(newColumnValue1, newColumnValue2, ...)) // or add several new columns
})

If you have an RDD of Tuple4, apply a map and convert it to a Tuple5:
val rddTuple4RDD = ...........
val rddTuple5RDD = rddTuple4RDD.map(r => Tuple5(r._1, r._2, r._3, r._4, r._2 + r._3))
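For reference, the same idea in PySpark is a one-liner (a sketch, not from the original answers, assuming each element of the hypothetical rdd is a plain tuple of ints):
# PySpark sketch: append the sum of the 2nd and 3rd columns as a new column
with_extra_col = rdd.map(lambda row: row + (row[1] + row[2],))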

Related

Ranking of single datapoint against reference dataset

I have the following hypothetical dataframe:
data = {'score1': [60, 30, 80, 120],
        'score2': [20, 21, 19, 18],
        'score3': [12, 43, 71, 90]}
# Create the pandas DataFrame
df = pd.DataFrame(data)
# calculating the ranks
df['score1_rank'] = df['score1'].rank(pct = True)
df['score2_rank'] = df['score2'].rank(pct = True)
df['score3_rank'] = df['score3'].rank(pct = True)
I then have individual datapoints I would like to rank against the reference, for example:
data_to_test = {'score1': [12],
                'score2': [4],
                'score3': [6]}
How could I compare these new values against this reference?
Thank you for any help!
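One possible approach (a sketch, not from the original thread, assuming a percentile rank relative to the reference scores is what is wanted): for each score column, compute the fraction of reference values at or below the new value.
import pandas as pd

data = {'score1': [60, 30, 80, 120],
        'score2': [20, 21, 19, 18],
        'score3': [12, 43, 71, 90]}
df = pd.DataFrame(data)

data_to_test = {'score1': [12], 'score2': [4], 'score3': [6]}
test_df = pd.DataFrame(data_to_test)

# For each column, the share of reference values <= the test value
# approximates the test value's percentile rank against the reference.
ranks = {col: (df[col] <= test_df[col].iloc[0]).mean() for col in df.columns}
print(ranks)  # all three test values fall below the reference range, so every rank is 0.0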

Spark Core: How to fetch the max-n rows of an RDD without using rdd.max()

I have an RDD having below elements:
('09', [25, 66, 67])
('17', [66, 67, 39])
('04', [25])
('08', [120, 122])
('28', [25, 67])
('30', [122])
I need to fetch the elements whose list has the maximum number of elements, which is 3 in the RDD above. The output should be filtered into another RDD, without using the max function or Spark DataFrames:
('09', [25, 66, 67])
('17', [66, 67, 39])
max_len = uniqueRDD.max(lambda x: len(x[1]))
maxRDD = uniqueRDD.filter(lambda x : (len(x[1]) == len(max_len[1])))
I am able to do this with the above lines of code, but Spark Streaming won't support it, as max_len is a tuple and not an RDD.
Can someone suggest? Thanks in advance
Does this work for you? I tried filtering on the streaming rdds as well. Seems to work.
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark import SparkContext, SQLContext
from pyspark.sql.functions import *
from pyspark.streaming import StreamingContext
sc = SparkContext('local')
sqlContext = SQLContext(sc)
ssc = StreamingContext(sc,1)
data1 = [
    ('09', [25, 66, 67]),
    ('17', [66, 67, 39]),
    ('04', [25]),
    ('08', [120, 122]),
    ('28', [25, 67]),
    ('30', [122])
]
df1Columns = ["id", "list"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)
df1.show(20, truncate=False)
uniqueRDD = df1.rdd
max_len = uniqueRDD.map(lambda x: len(x[1])).max(lambda x: x)
maxRDD = uniqueRDD.filter(lambda x : (len(x[1]) == max_len))
print("printing out maxlength = ", max_len)
dStream = ssc.queueStream([uniqueRDD])
resultStream = dStream.filter(lambda x : (len(x[1]) == max_len))
print("Printing the filtered streaming result")
def printResultStream(rdd):
    mylist = rdd.collect()
    for ele in mylist:
        print(ele)
resultStream.foreachRDD(printResultStream)
ssc.start()
ssc.awaitTermination()
ssc.stop()
Here's the output:
+---+------------+
|id |list |
+---+------------+
|09 |[25, 66, 67]|
|17 |[66, 67, 39]|
|04 |[25] |
|08 |[120, 122] |
|28 |[25, 67] |
|30 |[122] |
+---+------------+
printing out maxlength = 3
Printing the filtered streaming result
Row(id='09', list=[25, 66, 67])
Row(id='17', list=[66, 67, 39])
You can try something like this:
dStream = ssc.queueStream([uniqueRDD, uniqueRDD, uniqueRDD])
def maxOverRDD(input_rdd):
    if not input_rdd.isEmpty():
        reduced_rdd = input_rdd.reduce(lambda acc, value: value if (len(acc[1]) < len(value[1])) else acc)
        internal_result = input_rdd.filter(lambda x: len(x[1]) == len(reduced_rdd[1]))
        return internal_result
result = dStream.transform(maxOverRDD)
print("Printing the finalStream")
result.foreachRDD(printResultStream)
The output would look like this (it is repeated because the same RDD is provided 3 times in the stream):
Printing the finalStream
Row(id='09', list=[25, 66, 67])
Row(id='17', list=[66, 67, 39])
Row(id='09', list=[25, 66, 67])
Row(id='17', list=[66, 67, 39])
Row(id='09', list=[25, 66, 67])
Row(id='17', list=[66, 67, 39])

I want to do the same transformation in Python as I did in Scala

I'm new to Python.
Scala Code:
rdd1 contains strings in the following format:
val rdd1 = sc.parallelize(Seq("[Canada,47;97;33;94;6]", "[Canada,59;98;24;83;3]", "[Canada,77;63;93;86;62]"))
val resultRDD = rdd1.map { r =>
  val Array(country, values) = r.replaceAll("\\[|\\]", "").split(",")
  country -> values
}.reduceByKey((a, b) => a.split(";").zip(b.split(";")).map {
  case (i1, i2) => i1.toInt + i2.toInt
}.mkString(";"))
Output:
Country,Values // I have put the column names to make sure the output is in two columns
Canada,183;258;150;263;71
Edit: the OP wants to use map instead of flatMap, so I adjusted flatMap to map; you just need to take the first item out of the list comprehension, hence map(lambda x: [...][0]).
Side note: the above change is only valid in this particular case, where the list comprehension returns a list with a single item. For more general cases, you might need two map()s to replace what flatMap() does.
One way with an RDD is to use a list comprehension to strip, split and convert the string into a key-value pair, with the country as key and a tuple of numbers as value. Because we use a list comprehension, we take flatMap (adjusted to map, per the edit above) on each RDD element, then use reduceByKey to do the calculation and mapValues to convert the resulting tuple back into a string:
rdd1.map(lambda x: [ (e[0], tuple(map(int, e[1].split(';')))) for e in [x.strip('][').split(',')] ][0]) \
.reduceByKey(lambda x,y: tuple([ x[i]+y[i] for i in range(len(x))]) ) \
.mapValues(lambda x: ';'.join(map(str,x))) \
.collect()
output after the map step (formerly flatMap):
[('Canada', (47, 97, 33, 94, 6)),
('Canada', (59, 98, 24, 83, 3)),
('Canada', (77, 63, 93, 86, 62))]
output after reduceByKey:
[('Canada', (183, 258, 150, 263, 71))]
output after mapValues:
[('Canada', '183;258;150;263;71')]
You can do something like this
import pyspark.sql.functions as f
from pyspark.sql.functions import col
myRDD = sc.parallelize([('Canada', '47;97;33;94;6'), ('Canada', '59;98;24;83;3'),('Canada', '77;63;93;86;62')])
df = myRDD.toDF()
>>> df.show(10)
+------+--------------+
| _1| _2|
+------+--------------+
|Canada| 47;97;33;94;6|
|Canada| 59;98;24;83;3|
|Canada|77;63;93;86;62|
+------+--------------+
df.select(
    col("_1").alias("country"),
    f.split("_2", ";").alias("values"),
    f.posexplode(f.split("_2", ";")).alias("pos", "val")
)\
.drop("val")\
.select(
    "country",
    f.concat(f.lit("position"), f.col("pos").cast("string")).alias("name"),
    f.expr("values[pos]").alias("val")
)\
.groupBy("country").pivot("name").agg(f.sum("val"))\
.show()
+-------+---------+---------+---------+---------+---------+
|country|position0|position1|position2|position3|position4|
+-------+---------+---------+---------+---------+---------+
| Canada| 183.0| 258.0| 150.0| 263.0| 71.0|
+-------+---------+---------+---------+---------+---------+

CombineByKey output does not work with flatMapValues to derive (Key, Value) pairs

I am working on a MapReduce problem where I want to filter every map partition's output. I want to keep only those keys which occur more than a threshold number of times in their map partition.
So I have an RDD of (key, value<tuple>) pairs.
For every value of the tuple I want to get the count of its occurrences throughout the RDD, split by map partition. Then I'll filter on this count.
Eg: RDD: {(key1, ("a","b","c")),
(key2, ("a","d")),
(key3, ("b","c"))}
Using flatMapValues I have been able to reduce this to:
{(key1, a), (key1, b), (key1, c), (key2, a), (key2, d), (key3, b), (key3, c)}
Now, using a combineByKey step, I have been able to get the count of each value in the respective partition.
Suppose there were two partitions; then it will return something like:
("a", [1, 1]), ("b", [1, 1]), ("c", [1, 1]), ("d", 1)
Now I want to flatten these (key, value) pairs so that each count in the value tuple becomes an individual key-value pair, i.e. what flatMapValues did for me before, but I am unable to use flatMapValues.
Python code:
from itertools import count
import pyspark
import sys
import json
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.rdd import RDD
sc = SparkContext('local[*]', 'assignment2_task1')
RDD = sc.textFile(sys.argv[1])
rdd1_original = RDD.map(lambda x: x.split(",")).map(lambda x: (x[0], [x[1]])).reduceByKey(lambda x, y: x + y)
rdd3_candidate = rdd1_original.flatMapValues(lambda a: a).map(lambda x: (x[1], 1)).combineByKey(lambda value: (value),lambda x, y: (x + y), lambda x, y: (x,y))
new_rdd = rdd3_candidate.flatMapValues(lambda a:a)
print(new_rdd.collect())
Expected answer:
[("a",1),("a", 1), ("b", 1), ("b", 1), ("c", 1), ("c", 1), ("d", 1)
Current error:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/Users/yashphogat/Downloads/spark-2.33-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 253, in main
process()
File "/Users/yashphogat/Downloads/spark-2.33-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 248, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/yashphogat/Downloads/spark-2.33-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 379, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/Users/yashphogat/Downloads/spark-2.33-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 55, in wrapper
return f(*args, **kwargs)
File "/Users/yashphogat/Python_Programs/lib/python3.6/site-packages/pyspark/rdd.py", line 1967, in <lambda>
flat_map_fn = lambda kv: ((kv[0], x) for x in f(kv[1]))
TypeError: 'int' object is not iterable
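One way to avoid this error (a sketch, not from the original thread): make every branch of combineByKey produce a list, so that flatMapValues always receives an iterable, even for keys that appear in only one partition.
# Hypothetical fix: combiners are always lists of per-partition counts
rdd3_candidate = (
    rdd1_original.flatMapValues(lambda a: a)
    .map(lambda x: (x[1], 1))
    .combineByKey(
        lambda value: [value],                # createCombiner: start a per-partition count
        lambda acc, value: [acc[0] + value],  # mergeValue: add within the partition
        lambda acc1, acc2: acc1 + acc2,       # mergeCombiners: concatenate partition counts
    )
)
new_rdd = rdd3_candidate.flatMapValues(lambda a: a)  # every value is now a list, so this works
This should yield the expected (key, count-per-partition) pairs such as ("a", 1), ("a", 1), ..., ("d", 1), which can then be filtered against the threshold.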

Spark / PySpark: Group by any item of nested list

I'm still new to Spark / PySpark and have the following question. I got a nested list with IDs in it:
result = [[411, 44, 61], [42, 33], [1, 100], [44, 42]]
What I'm trying to achieve is that if any item of a sublist matches an item in another sublist, the two should be merged. The result should look like this:
merged_result = [[411, 44, 61, 42, 33, 44, 42], [1, 100]]
The first list in "result" matches the fourth list. The fourth list matches the second, so all three should be merged into one list. The third list doesn't match any other list, so it stays the same.
I could achieve this by writing loops with Python.
result_after_matching = []
for i in result:
    new_list = i
    for s in result:
        if any(x in i for x in s):
            new_list = new_list + s
    result_after_matching.append(set(new_list))
#merged_result = [[411, 44, 61, 42], [42,33,44], [1, 100], [44,42,33,411,61]]
As this is not the desired output, I would need to repeat the loop and do another set() over the "merged_result":
set([[411,44,61,42,33], [42,33,44,411,61],[1,100], [44,42,33,411,61]])
-> [[411, 44, 61, 42, 33], [1,100]]
As the list of lists and the sublists get bigger and bigger over time as new data comes in, this will not be the function to use.
Can anyone tell me if there is a function in Spark / PySpark to match / merge / group / reduce these nested lists more easily and quickly?
Thanks a lot in advance!
MG
Most rdd or dataframe based solutions will probably be fairly inefficient. This is because the nature of your problem requires every element of your data set to be compared to every other element potentially multiple times. This makes it so that distributing the work across a cluster is at best inefficient.
Perhaps a different way to do this would be to reformulate this as a graph problem. If you treat each item in a list as a node on a graph, and each list as a subgraph, then the connected components of a parent graph constructed from the subgraphs will be the desired result. Here is an example using the networkx package in python:
import networkx as nx
result = [[411, 44, 61], [42, 33], [1, 100], [44, 42]]
g = nx.DiGraph()
for subgraph in result:
    g.add_path(subgraph)
u = g.to_undirected()
output = []
for component in nx.connected_component_subgraphs(u):
    output.append(component.nodes())
print(output)
# [[33, 42, 411, 44, 61], [1, 100]]
This should be fairly efficient, but if your data is very large it will make sense to use a more scalable graph analysis tool. Spark does have a graph processing library called GraphX:
https://spark.apache.org/docs/latest/graphx-programming-guide.html
Unfortunately, the PySpark implementation is lagging behind a bit, so if you intend to use something like this, you might be stuck with Scala Spark or a different framework entirely for now.
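If staying in PySpark is a requirement, one option (an assumption, not mentioned in the thread: it needs the separate graphframes package and an active SparkSession named spark) is GraphFrames, which exposes connected components over DataFrames and maps directly onto this problem:
from graphframes import GraphFrame

result = [[411, 44, 61], [42, 33], [1, 100], [44, 42]]

# Vertices: every distinct id; edges: consecutive items within each sublist
vertices = spark.createDataFrame([(v,) for s in result for v in s], ["id"]).distinct()
edges = spark.createDataFrame(
    [(s[i], s[i + 1]) for s in result for i in range(len(s) - 1)], ["src", "dst"]
)

spark.sparkContext.setCheckpointDir("/tmp/graphframes-cc")  # required by connectedComponents
g = GraphFrame(vertices, edges)
components = g.connectedComponents()  # DataFrame with columns id, component
components.show()
Grouping the resulting DataFrame by the component column recovers the merged lists.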
I think you can use the aggregate action on the RDD. Below is an example implementation in Scala. Note that I've used recursion to make it more readable, but to improve performance it's a good idea to reimplement these functions iteratively.
def overlap(s1: Seq[Int], s2: Seq[Int]): Boolean =
  s1.exists(e => s2.contains(e))

def mergeSeq(s1: Seq[Int], s2: Seq[Int]): Seq[Int] =
  s1.union(s2).distinct

def mergeSeqWithSeqSeq(s: Seq[Int], ss: Seq[Seq[Int]]): Seq[Seq[Int]] = ss match {
  case Nil => Seq(s)
  case h +: tail =>
    if (overlap(h, s)) mergeSeqWithSeqSeq(mergeSeq(h, s), tail)
    else h +: mergeSeqWithSeqSeq(s, tail)
}

def mergeSeqSeqWithSeqSeq(s1: Seq[Seq[Int]], s2: Seq[Seq[Int]]): Seq[Seq[Int]] = s1 match {
  case Nil => s2
  case h +: tail => mergeSeqWithSeqSeq(h, mergeSeqSeqWithSeqSeq(tail, s2))
}

// rdd is an RDD[Seq[Int]] holding the nested lists from the question
val result = rdd
  .aggregate(Seq.empty[Seq[Int]])(
    { case (ss, s) => mergeSeqWithSeqSeq(s, ss) },
    { case (s1, s2) => mergeSeqSeqWithSeqSeq(s1, s2) }
  )
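For readers following along in PySpark, a rough translation of the same aggregate idea might look like this (a sketch, assuming rdd holds the nested lists from the question, e.g. sc.parallelize(result)):
def overlap(s1, s2):
    return any(e in s2 for e in s1)

def merge_seq(s1, s2):
    # union of two lists, preserving order and dropping duplicates
    return list(dict.fromkeys(list(s1) + list(s2)))

def merge_seq_with_seq_seq(s, ss):
    # fold the list s into the already-merged groups ss
    out, current = [], list(s)
    for group in ss:
        if overlap(group, current):
            current = merge_seq(group, current)
        else:
            out.append(group)
    return out + [current]

def merge_seq_seq_with_seq_seq(s1, s2):
    acc = s2
    for group in s1:
        acc = merge_seq_with_seq_seq(group, acc)
    return acc

# rdd is assumed to be sc.parallelize(result) from the question
merged = rdd.aggregate(
    [],                                           # zero value: no groups yet
    lambda ss, s: merge_seq_with_seq_seq(s, ss),  # seqOp: fold each list into the partition's groups
    merge_seq_seq_with_seq_seq,                   # combOp: merge groups across partitions
)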
