Difference between RDD.foreach() and RDD.map() - apache-spark

I am learning Spark in Python and wondering: can anyone explain the difference between the action foreach() and the transformation map()?
rdd.map() returns a new RDD, like the built-in map function in Python. However, I would like to see an example of rdd.foreach() and understand the differences. Thanks!

A very simple example would be rdd.foreach(print) which would print the value of each row in the RDD but not modify the RDD in any way.
For example, this produces an RDD with the numbers 1 - 10:
>>> rdd = sc.parallelize(xrange(0, 10)).map(lambda x: x + 1)
>>> rdd.take(10)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The map call computes a new value for each row and returns it, so I get a new RDD. However, assigning the result of foreach would be useless, because foreach doesn't return anything and doesn't modify the RDD in any way:
>>> rdd = sc.parallelize(range(0, 10)).foreach(lambda x: x + 1)
>>> type(rdd)
<class 'NoneType'>
Conversely, calling map with a function that returns None, like print, isn't very useful:
>>> rdd = sc.parallelize(range(0, 10)).map(print)
>>> rdd.take(10)
0
1
2
3
4
5
6
7
8
9
[None, None, None, None, None, None, None, None, None, None]
The print call returns None, so mapping it just gives you a bunch of None values that you don't want and don't need to save, so returning them is a waste. (Note: the lines with 0, 1, 2, etc. are the print calls being executed; they don't show up until you call take, since the RDD is evaluated lazily. The contents of the RDD itself are just a bunch of None values.)
More simply, call map if you care about the return value of the function. Call foreach if you don't.

map is a transformation: when you perform a map you apply a function to each element in the RDD and get back a new RDD on which additional transformations or actions can be called.
foreach is an action: it applies a function to each element but does not return a value. This is particularly useful when you have to perform some calculation on an RDD and log the result somewhere else, for example write to a database or call a REST API with each element of the RDD.
For example, let's say that you have an RDD with many queries that you wish to log in another system. The queries are stored in an RDD.
queries = <code to load queries or a transformation applied to other RDDs>
Then you want to save those queries to the other system via a call to its API:
import urllib2

def log_search(q):
    response = urllib2.urlopen('http://www.bigdatainc.org/save_query/' + q)

queries.foreach(log_search)
Now log_search has been executed on each element of the RDD. If you had used map instead, nothing would have happened yet, until you called an action.
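To make that last point concrete, a small sketch reusing the queries RDD and log_search function from above (nothing new is assumed here):
# map is lazy: this only records the transformation, no HTTP calls happen yet
logged = queries.map(log_search)
# an action such as count() forces evaluation, and log_search runs on the executors
logged.count()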

Related

PySpark Reduce on RDD with only single element

Is there any way to deal with RDDs that have only a single element (this can sometimes happen for what I am doing)? When that's the case, reduce stops working, as the operation requires two inputs.
I am working with key-value pairs such as:
(key1, 10),
(key2, 20),
And I want to aggregate their values, so the result should be:
30
But there are cases where the RDD contains only a single key-value pair, so reduce does not work there. Example:
(key1, 10)
This will return nothing.
If you do a .values() before doing reduce, it should work even if there is only 1 element in the RDD:
from operator import add
rdd = sc.parallelize([('key1', 10),])
rdd.values().reduce(add)
# 10
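As a side note, a sketch assuming the values are numeric: the same .values() trick composes with other aggregations, and RDD.sum() also copes with the single-element case:
from operator import add
rdd = sc.parallelize([('key1', 10), ('key2', 20)])
rdd.values().reduce(add)
# 30
rdd.values().sum()
# 30; sum() also works when the RDD holds a single pair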

RDD and PipelinedRDD type

I am a bit new to PySpark (more into Spark-Scala). Recently I came across the observation below. When I create an RDD using the parallelize() method, the return type is RDD. But when I create an RDD using the range() method, it is of type PipelinedRDD. For example:
>>> listRDD =sc.parallelize([1,2,3,4,5,6,7])
>>> print(listRDD.collect())
[1, 2, 3, 4, 5, 6, 7]
>>> print(type(listRDD))
<class 'pyspark.rdd.RDD'>
>>> rangeRDD =sc.range(1,8)
>>> print(rangeRDD.collect())
[1, 2, 3, 4, 5, 6, 7]
>>> print(type(rangeRDD))
<class 'pyspark.rdd.PipelinedRDD'>
I checked how both RDDs are constructed and found:
1) Internally both are using the parallelize method only.
>>> rangeRDD.toDebugString()
b'(8) PythonRDD[25] at collect at <stdin>:1 []\n | ParallelCollectionRDD[24] at parallelize at PythonRDD.scala:195 []'
>>> listRDD.toDebugString()
b'(8) PythonRDD[26] at RDD at PythonRDD.scala:53 []\n | ParallelCollectionRDD[21] at parallelize at PythonRDD.scala:195 []'
2) PipelinedRDD is a subclass of the RDD class; that much I understand.
But is there any generic logic for when it will be of type PipelinedRDD and when it will be of type RDD?
Thanks all in advance.
sc.range is indeed calling the parallelize method internally - it is defined here. You can see that sc.range calls sc.parallelize with xrange as input, and sc.parallelize has a separate code branch for an xrange input type: it calls itself with an empty list as the argument and then applies mapPartitionsWithIndex here, which is the final output of the sc.parallelize call and of the sc.range call in turn. So you can see how a regular object is first created, just as with a plain sc.parallelize call (though the object is an empty list), but the final output is the result of applying a mapping function on top of it.
It seems that the main reason for this behavior is to avoid materializing the data, which would happen otherwise (if the input doesn't have len implemented, it is iterated over and converted into a list right away).
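To illustrate the general pattern this implies (a small sketch; the type names are as printed by the PySpark versions discussed here): an RDD becomes a PipelinedRDD as soon as a Python-level transformation such as map is applied on top of it, which is exactly what sc.range does internally via mapPartitionsWithIndex.
>>> listRDD = sc.parallelize([1, 2, 3, 4, 5, 6, 7])
>>> type(listRDD)
<class 'pyspark.rdd.RDD'>
>>> type(listRDD.map(lambda x: x))
<class 'pyspark.rdd.PipelinedRDD'>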

Is there an inbuilt function to compare RDDs on specific criteria, or is it better to write a UDF

How do I count the occurrences of elements of a child RDD that occur in a parent RDD?
Say,
I have two RDDs
Parent RDD -
['2 3 5']
['4 5 7']
['5 4 2 3']
Child RDD
['2 3','5 3','4 7','5 7','5 3','2 3']
I need something like -
[['2 3',2],['5 3',2],['4 7',1],['5 7',1],['5 3',2] ...]
It's actually finding the frequent-item candidate set from the parent set.
Now, the child RDD can initially contain string elements or even lists, i.e.
['1 2','2 3'] or [[1,2],[2,3]]
as that's the data structure I would implement according to whatever fits best.
Question -
Are there built-in functions which could do something similar to what I am trying to achieve with these two RDDs? Any transformations?
Or do I need to write a UDF that parses each element of the child and compares it to the parent? My data is large, so I doubt that would be efficient.
In case I end up writing a UDF, should I use the foreach function of the RDD?
Or is the RDD framework not a good fit for a custom operation like this, and would DataFrames work here?
I am trying to do this in PySpark. Help or guidance is greatly appreciated!
It's easy enough if you use sets, but the trick is with grouping, as sets cannot be used as keys.
The alternative used here is to order the set elements and generate a string as the corresponding key:
rdd = sc.parallelize(['2 3 5', '4 5 7', '5 4 2 3'])\
    .map(lambda l: l.split())\
    .map(set)
childRdd = sc.parallelize(['2 3','5 3','4 7','5 7','5 3','2 3'])\
    .map(lambda l: l.split())\
    .map(set)

# A small utility function to make strings from sets.
# The point is order, so that grouping can match keys;
# that's because sets aren't ordered.
def setToString(theset):
    lst = list(theset)
    lst.sort()
    return ''.join(lst)
Now find the pairs where the child is a subset of the parent:
childRdd.cartesian(rdd)\
    .filter(lambda l: set(l[0]).issubset(set(l[1])))\
    .map(lambda pair: (setToString(pair[0]), pair[1]))\
    .countByKey()
For the above example, the last line returns:
defaultdict(int, {'23': 4, '35': 4, '47': 1, '57': 1})
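If you would rather keep the result as an RDD of (key, count) pairs (closer to the output format in the question, and without pulling a dict onto the driver the way countByKey does), a sketch using reduceByKey instead, reusing the same RDDs and the setToString helper:
from operator import add
counts = childRdd.cartesian(rdd)\
    .filter(lambda l: set(l[0]).issubset(set(l[1])))\
    .map(lambda pair: (setToString(pair[0]), 1))\
    .reduceByKey(add)
counts.collect()
# [('23', 4), ('35', 4), ('47', 1), ('57', 1)]  (order may vary)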

Pyspark applying foreach

I'm a newbie in PySpark and I intend to play a bit with a couple of functions to better understand how I could use them in more realistic scenarios. For a while I have been trying to apply a specific function to each number in an RDD. My problem is basically that, when I try to print what I grabbed from my RDD, the result is None.
My code:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('test')
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")

changed = []

def div_two(n):
    opera = n / 2
    return opera

numbers = [8, 40, 20, 30, 60, 90]
numbersRDD = sc.parallelize(numbers)
changed.append(numbersRDD.foreach(lambda x: div_two(x)))
#result = numbersRDD.map(lambda x: div_two(x))

for i in changed:
    print(i)
I would appreciate a clear explanation of why this comes out as None in the list, and what the right approach would be to achieve this using foreach, if it's possible.
Thanks.
Your function definition of div_two seems fine, though it can be reduced to
def div_two(n):
    return n / 2
And you have converted the array of integers to an RDD, which is good too.
The main issue is that you are trying to append the result of foreach to the changed array. But if you look at the definition of foreach
def foreach(self, f) Inferred type: (self: RDD, f: Any) -> None
it says that the return type is None, and that's what is getting printed.
You don't need an array variable for printing the changed elements of an RDD. You can simply write a function for printing and pass that function to foreach:
def printing(x):
    print(x)

numbersRDD.map(div_two).foreach(printing)
You should get the results printed.
You can still add the RDD to an array variable, but an RDD is a distributed collection in itself and an array is a collection too. So if you add RDDs to an array you will have a collection of collections, which means you have to write two loops:
changed.append(numbersRDD.map(div_two))

def printing(x):
    print(x)

for i in changed:
    i.foreach(printing)
The main difference between your code and mine is that I have used map (which is a transformation) instead of foreach (which is an action) when adding the RDD to the changed variable, and I have used two loops for printing the elements of the RDD.
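One more hedged sketch: if the goal is just to look at the changed values on the driver (for example in local testing), collecting the mapped RDD avoids foreach altogether:
changed_values = numbersRDD.map(div_two).collect()
print(changed_values)
# e.g. [4.0, 20.0, 10.0, 15.0, 30.0, 45.0] under Python 3 division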

dot product of a combination of elements of an RDD using pySpark

I have an RDD where each element is a tuple of the form
[ (index1,SparseVector({idx1:1,idx2:1,idx3:1,...})) , (index2,SparseVector() ),... ]
I would like to take the dot product of each pair of values in this RDD using the SparseVector1.dot(SparseVector2) method provided by the mllib.linalg.SparseVector class. I am aware that Python has an itertools.combinations module that can be used to generate the combinations of dot products to be calculated. Could someone provide a code snippet to achieve this? I can only think of doing an RDD.collect() so that I receive a list of all elements in the RDD and then running itertools.combinations on this list, but as per my understanding this would perform all the calculations on the driver and wouldn't be distributed per se. Could someone please suggest a more distributed way of achieving this?
def computeDot(sparseVectorA, sparseVectorB):
    """
    Function to compute the dot product of two SparseVectors
    """
    return sparseVectorA.dot(sparseVectorB)

# Use the cartesian function on the RDD to create tuples containing
# 2-combinations of all the rows in the original RDD
combinationRDD = (originalRDD.cartesian(originalRDD))

# The records in combinationRDD will be of the form
# [(Index, SV1), (Index, SV1)], therefore you need to
# filter out all the records where the indices are equal, giving an
# RDD of the form [(Index1, SV1), (Index2, SV2)] and so on,
# then use the map function to apply the SparseVector's dot function
dottedRDD = (combinationRDD
             .filter(lambda x: x[0][0] != x[1][0])
             .map(lambda x: computeDot(x[0][1], x[1][1]))
             .cache())
The solution to this question should be along these lines.
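A minimal end-to-end sketch of that pipeline, with made-up indices and SparseVector contents purely for illustration (the indices are kept alongside the dot product here so the output is readable):
from pyspark.mllib.linalg import SparseVector

originalRDD = sc.parallelize([
    (0, SparseVector(4, {0: 1.0, 2: 1.0})),
    (1, SparseVector(4, {1: 1.0, 2: 1.0})),
    (2, SparseVector(4, {0: 1.0, 3: 1.0})),
])

dottedRDD = (originalRDD.cartesian(originalRDD)
             .filter(lambda x: x[0][0] != x[1][0])
             .map(lambda x: (x[0][0], x[1][0], computeDot(x[0][1], x[1][1])))
             .cache())
dottedRDD.collect()
# e.g. [(0, 1, 1.0), (0, 2, 1.0), (1, 0, 1.0), ...] -- each pair appears in both orders;
# changing the filter to x[0][0] < x[1][0] keeps each unordered combination only once.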
