join() in pyspark does not produce expected results - apache-spark

num_of_words = (doc_title,num) #number of words in a document
lines = (doc_title,word,num_of_occurrences) #number of occurrences of a specific word in a document
When I called lines.join(num_of_words), I was expecting to get something like:
(doc_title,(word,num_of_occurrences,num))
but I got instead:
(doc_title,(word,num))
and num_of_occurrences was omitted. What did I do wrong here? How am I supposed to join these two RDDs to get the result I'm expecting?

In the API docs of Spark for the join method:
join(other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.
So the join method can only be used on pairs (or at least will only return you a result of the described form).
A way to overcome this would be to have tuples of (doc_title, (word, num_occurrences)) instead of (doc_title, word, num_occurrences).
Working example:
num_of_words = sc.parallelize([("harry potter", 4242)])
lines = sc.parallelize([("harry potter", ("wand", 100))])
result = lines.join(num_of_words)
print(result.collect())
# [('harry potter', (('wand', 100), 4242))]
(Note that sc.parallelize only turns a local Python collection into a Spark RDD, and that collect() does the exact opposite.)
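To get from the question's three-element records to the flat (doc_title, (word, num_of_occurrences, num)) shape, one option is to nest the fields before the join and flatten them afterwards. The sketch below emulates this on plain Python lists so it runs without a cluster; local_join is a hypothetical stand-in for RDD.join, the sample rows are made up, and the PySpark calls named in the comments (map, join, mapValues) are what would take the place of each step on real RDDs. The dropped field in the question is consistent with join reading only the first two elements of each record as key and value, though that is an inference from the output rather than something the docs state.

```python
def local_join(left, right):
    """Plain-Python stand-in for RDD.join on lists of (key, value) pairs."""
    right_by_key = {}
    for k, v in right:
        right_by_key.setdefault(k, []).append(v)
    # Emit (k, (v1, v2)) for every matching pair, as the join docs describe.
    return [(k, (v1, v2)) for k, v1 in left for v2 in right_by_key.get(k, [])]

# Hypothetical sample data in the shapes from the question.
num_of_words = [("harry potter", 4242)]
lines = [("harry potter", "wand", 100), ("harry potter", "owl", 7)]

# Step 1: nest the non-key fields so each record is a (key, value) pair.
# PySpark equivalent: lines.map(lambda t: (t[0], (t[1], t[2])))
pairs = [(title, (word, n)) for title, word, n in lines]

# Step 2: join on doc_title.
# PySpark equivalent: pairs_rdd.join(num_of_words_rdd)
joined = local_join(pairs, num_of_words)

# Step 3: flatten ((word, n), num) into (word, n, num).
# PySpark equivalent: joined_rdd.mapValues(lambda v: (v[0][0], v[0][1], v[1]))
result = [(title, (word, n, num)) for title, ((word, n), num) in joined]
print(result)
```

The flattening step is optional; it only matters if downstream code expects a single flat tuple per word.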

Related

CombineByKey object is not iterable [duplicate]

If I have an RDD, how can I tell whether the data is in key:value format? Is there a way to check this, the way type(object) tells me an object's type? I tried print type(rdd.take(1)), but it just says <type 'list'>.
Let's say I have data like (x,1),(x,2),(y,1),(y,3) and I use groupByKey and get (x,(1,2)),(y,(1,3)). Is there a way to define (1,2) and (1,3) as values where x and y are keys? Or does a value have to be a single value? I noticed that if I use reduceByKey with a sum function to get ((x,3),(y,4)), it becomes much easier to treat this data as key-value pairs.
Python is a dynamically typed language and PySpark doesn't use any special type for key-value pairs. The only requirement for an object to be considered valid data for PairRDD operations is that it can be unpacked as follows:
k, v = kv
Typically you would use a two element tuple due to its semantics (immutable object of fixed size) and similarity to Scala Product classes. But this is just a convention and nothing stops you from something like this:
key_value.py
class KeyValue(object):
    def __init__(self, k, v):
        self.k = k
        self.v = v
    def __iter__(self):
        # Yield exactly two items so that `k, v = kv` works
        for x in [self.k, self.v]:
            yield x
from operator import add
from key_value import KeyValue
rdd = sc.parallelize(
    [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)])
rdd.reduceByKey(add).collect()
## [('bar', 0), ('foo', 3)]
and make an arbitrary class behave like a key-value pair. So once again, if something can be correctly unpacked as a pair of objects, then it is a valid key-value pair. Implementing the __len__ and __getitem__ magic methods should work as well. Probably the most elegant way to handle this is to use namedtuples.
Also, rdd.take(n) returns a list of length n, so type(rdd.take(1)) will always be list, whatever the elements are.
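As a rough illustration of the namedtuple suggestion, the sketch below uses plain Python only, with a local dictionary standing in for reduceByKey. The KeyValue type and the sample records are hypothetical; the point is just that a two-field namedtuple unpacks as k, v, and that the value side can be any object, including a tuple.

```python
from collections import namedtuple
from functools import reduce
from operator import add

# Hypothetical record type; any two-field namedtuple unpacks as k, v.
KeyValue = namedtuple("KeyValue", ["k", "v"])

records = [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)]

# The unpacking that pair operations rely on.
k, v = records[0]

# Inspect one element rather than the list that wraps it.
# PySpark: type(rdd.first()) looks at an element; rdd.take(1) returns a list.
element_type = type(records[0])

# Local stand-in for groupByKey: collect the values seen for each key.
grouped = {}
for key, value in records:
    grouped.setdefault(key, []).append(value)

# Local stand-in for reduceByKey(add).
totals = {key: reduce(add, values) for key, values in grouped.items()}

# A value may itself be a tuple; the pair is still (key, value).
nested_key, nested_value = KeyValue("x", (1, 2))

print(k, v, element_type.__name__)
print(totals)
print(nested_key, nested_value)
```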

How can you store the results from a forEach in Spark

DataSet#foreach(f) applies the function f to each row in the dataset. In a clustered environment, the data is split across the cluster. How can the results from each of these functions be collected?
For example, say the function would count the number of characters stored in each row. How can you create a DataSet or RDD that contains the results of each of these functions applied to each row?
The definition for foreach looks something like :
final def foreach(f: (A) ⇒ Unit): Unit
f : The function that is applied for its side-effect to every element.
The result of function f is discarded
foreach in Scala is generally used to denote the usage of a function that involves a side-effect, e.g. printing to STDOUT.
If you want to return something by applying a particular function, you'll have to use map
final def map[B](f: (A) ⇒ B): List[B]
I copied the syntax from the documentation for List but it'll be something similar for RDDs as well.
As you can see, it applies the function f to elements of type A and returns a collection of type B, where A and B can also be the same data type.
val rdd = sc.parallelize(Array(
"String1",
"String2",
"String3" ))
rdd.foreach(x => (x, x.length) )
// Nothing happens
rdd.map(x => (x, x.length) ).collect
// Array[(String, Int)] = Array((String1,7), (String2,7), (String3,7))
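The same distinction can be seen without Spark. The sketch below is a plain-Python analogy, not the Scala or Dataset API: for_each is a hypothetical helper that mimics foreach by discarding whatever the function returns, while the built-in map keeps the results.

```python
rows = ["String1", "String2", "String3"]

def for_each(items, f):
    """Mimics foreach: call f on every item and discard what it returns."""
    for item in items:
        f(item)

# The tuples are built and immediately thrown away; nothing comes back.
discarded = for_each(rows, lambda s: (s, len(s)))

# map keeps each result, so the per-row values can be collected.
kept = list(map(lambda s: (s, len(s)), rows))

print(discarded)
print(kept)
```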

Reading and learning Spark API?

I am learning Spark by example, but I don't know a good way to understand the API. For instance, take the classic word count example:
val input = sc.textFile("README.md")
val words = input.flatMap(x => x.split(" "))
val result = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
When I read the reduceByKey API, I see:
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
The API states: Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
In the programming guide: When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
OK, from the example I know that (x, y) is (V, V), and that it should be the value part of the map. I give a function to compute V and I get RDD[(K, V)]. My question is: in reduceByKey(func: (V, V) ⇒ V), why are there two Vs? Are the first and second V in (V, V) the same or not?
I am asking this, and chose this title, because I still don't know how to read the API correctly, or maybe I am missing some basic Spark concept.
In the code below:
reduceByKey((x, y) => x + y)
for more clarity, you could read it as something like this:
reduceByKey((sum, addend) => sum + addend)
So, for every key, that function is applied repeatedly across the values with that key, carrying the running result forward.
Basically, (func: (V, V) ⇒ V) means that you have a function with two inputs of a certain type (let's say Int) which returns a single output of the same type.
Usually the data set will be of the form ("key1",val11),("key2",val21),("key1",val12),("key2",val22) and so on.
The same key will appear with multiple values in the RDD[(K,V)].
When you use reduceByKey, the function is applied to the values of each key.
For example, consider the following program:
val data = Array(("key1",2),("key1",20),("key2",21),("key1",2),("key2",10),("key2",33))
val rdd = sc.parallelize(data)
val res = rdd.reduceByKey((x,y) => x+y)
res.foreach(println)
You will get the output as
(key2,64)
(key1,24)
Here the sequence of values for each key is passed to the function; for key1 that is (2,20,2).
In the end, you will have a single value for each key.
You can use the Spark shell to try out the APIs.
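To see concretely what the two V arguments receive, here is a small plain-Python sketch of the same reduction. reduce_by_key and traced_add are hypothetical helpers written for this illustration; real Spark merges values within each partition first and does not guarantee this exact call order, which is why the function needs to be associative.

```python
from functools import reduce

data = [("key1", 2), ("key1", 20), ("key2", 21),
        ("key1", 2), ("key2", 10), ("key2", 33)]

def reduce_by_key(pairs, func):
    """Local stand-in for reduceByKey: group values by key, then reduce each group."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

calls = []

def traced_add(running, nxt):
    """A (V, V) => V function that records the arguments it was given."""
    calls.append((running, nxt))
    return running + nxt

result = reduce_by_key(data, traced_add)
print(calls)   # both arguments are Ints: a running result and the next value
print(result)
```

Both arguments have the same type V: the first is either a value or the result of an earlier call, and the second is another value (or partial result) for the same key.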

pyspark fold method output

I'm surprised at this output from fold, I can't imagine what it's doing.
I would expect that something.fold(0, lambda a,b: a+1) would return the number of elements in something, since the fold starts at 0 and adds 1 for each element.
sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+1 )
8
I'm coming from Scala, where fold works the way I've described. So how is fold supposed to work in PySpark? Thanks for your thoughts.
To understand what's going on here, let's look at the definition of Spark's fold operation. Since you're using PySpark, I'm going to show the Python version of the code, but the Scala version exhibits the exact same behavior (you can also browse the source on GitHub):
def fold(self, zeroValue, op):
    """
    Aggregate the elements of each partition, and then the results for all
    the partitions, using a given associative function and a neutral "zero
    value."
    The function C{op(t1, t2)} is allowed to modify C{t1} and return it
    as its result value to avoid object allocation; however, it should not
    modify C{t2}.
    >>> from operator import add
    >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
    15
    """
    def func(iterator):
        acc = zeroValue
        for obj in iterator:
            acc = op(obj, acc)
        yield acc
    vals = self.mapPartitions(func).collect()
    return reduce(op, vals, zeroValue)
(For comparison, see the Scala implementation of RDD.fold).
Spark's fold operates by first folding each partition and then folding the results. The problem is that an empty partition gets folded down to the zero element, so the final driver-side fold ends up folding one value for every partition rather than one value for each non-empty partition. This means that the result of fold is sensitive to the number of partitions:
>>> sc.parallelize([1,25,8,4,2], 100).fold(0,lambda a,b:a+1 )
100
>>> sc.parallelize([1,25,8,4,2], 50).fold(0,lambda a,b:a+1 )
50
>>> sc.parallelize([1,25,8,4,2], 1).fold(0,lambda a,b:a+1 )
1
In this last case, the single partition is folded down to one value, and that value is then folded with the zero value at the driver. Because the function ignores its second argument, that driver-side step yields 0 + 1 = 1.
It seems that Spark's fold() operation actually requires the fold function to be commutative in addition to associative. There are actually other places in Spark that impose this requirement, such as the fact that the ordering of elements within a shuffled partition can be non-deterministic across runs (see SPARK-5750).
I've opened a Spark JIRA ticket to investigate this issue: https://issues.apache.org/jira/browse/SPARK-6416.
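The partition sensitivity can be reproduced without a cluster. The sketch below mirrors the structure of the quoted fold source in plain Python; split_into_partitions is a hypothetical round-robin splitter (Spark slices collections differently), but the result for this particular function depends only on how many partitions there are, not on what each one contains.

```python
from functools import reduce

def split_into_partitions(items, num_partitions):
    """Hypothetical splitter: deal items out round-robin; Spark's own slicing differs."""
    return [items[i::num_partitions] for i in range(num_partitions)]

def simulated_fold(items, num_partitions, zero_value, op):
    """Mirror of the quoted source: fold each partition, then fold the results."""
    per_partition = []
    for part in split_into_partitions(items, num_partitions):
        acc = zero_value
        for obj in part:
            acc = op(obj, acc)       # same argument order as the quoted source
        per_partition.append(acc)    # empty partitions still contribute zero_value
    return reduce(op, per_partition, zero_value)

data = [1, 25, 8, 4, 2]

# The function from the question: ignores b, so it is not commutative.
count_like = lambda a, b: a + 1
counts = {n: simulated_fold(data, n, 0, count_like) for n in (1, 8, 50, 100)}
print(counts)

# A commutative, associative function with a true zero is unaffected.
sums = {n: simulated_fold(data, n, 0, lambda a, b: a + b) for n in (1, 8, 50, 100)}
print(sums)
```

With count_like the driver-side reduce adds 1 once per partition result, so the answer is simply the partition count; with addition every layout gives 40.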
Let me try to give some simple examples to explain Spark's fold method. I will be using PySpark here.
rdd1 = sc.parallelize(list([]),1)
The line above creates an empty RDD with one partition.
rdd1.fold(10, lambda x,y:x+y)
This yields 20 as output.
rdd2 = sc.parallelize(list([1,2,3,4,5]),2)
The line above creates an RDD with the values 1 to 5, split across 2 partitions.
rdd2.fold(10, lambda x,y:x+y)
This yields 45 as output.
Here the zero value is 10. It is used as the starting value once in each of the 2 partitions and once more on the driver when the partition results are combined, so it is added 3 times in total: 3*10 = 30. The elements themselves contribute 1+2+3+4+5 = 15.
The final output that fold emits is 30+15 = 45.
The same process shows why the fold on rdd1 yielded 20: the one (empty) partition contributes 10, and the driver adds another 10.
reduce fails on an empty RDD, for example rdd1.reduce(lambda x,y:x+y):
ValueError: Can not reduce() empty RDD
fold can be used instead if the RDD might be empty:
rdd1.fold(0, lambda x,y:x+y)
As expected, this yields 0 as output.
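The arithmetic above can be checked with a few lines of plain Python. fold_partitions is a hypothetical helper, and the partition layouts are assumed for illustration (Spark decides the actual split); for addition the split does not affect the result, only the number of partitions does.

```python
from functools import reduce
from operator import add

def fold_partitions(partitions, zero_value, op):
    """Fold each partition from zero_value, then fold those results from zero_value again."""
    partials = [reduce(op, part, zero_value) for part in partitions]
    return reduce(op, partials, zero_value)

# Assumed layouts: one empty partition, and five elements over two partitions.
empty_one_partition = [[]]
five_in_two_partitions = [[1, 2], [3, 4, 5]]

rdd1_result = fold_partitions(empty_one_partition, 10, add)
rdd2_result = fold_partitions(five_in_two_partitions, 10, add)
rdd1_zero = fold_partitions(empty_one_partition, 0, add)

# For addition: (number of partitions + 1) * zero value + sum of elements.
print(rdd1_result, rdd2_result, rdd1_zero)
```

This is why the zero value should be a true identity for the operation (0 for addition): any other value is counted once per partition plus once on the driver.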
