If I have an RDD, how do I tell whether the data is in key:value format? Is there a way to check, something like how type(object) tells me an object's type? I tried print type(rdd.take(1)), but it just says <type 'list'>.
Let's say I have data like (x,1),(x,2),(y,1),(y,3) and I use groupByKey and get (x,(1,2)),(y,(1,3)). Is there a way to treat (1,2) and (1,3) as values where x and y are keys? Or does a key have to map to a single value? I noticed that if I use reduceByKey with a sum function to get the data ((x,3),(y,4)), then it becomes much easier to see this data as key-value pairs.
Python is a dynamically typed language and PySpark doesn't use any special type for key-value pairs. The only requirement for an object to be considered valid data for PairRDD operations is that it can be unpacked as follows:
k, v = kv
Typically you would use a two-element tuple due to its semantics (an immutable object of fixed size) and its similarity to Scala Product classes. But this is just a convention, and nothing stops you from doing something like this:
key_value.py
class KeyValue(object):
    def __init__(self, k, v):
        self.k = k
        self.v = v

    def __iter__(self):
        # Yield the key first, then the value, so `k, v = kv` works.
        for x in [self.k, self.v]:
            yield x
from operator import add

from key_value import KeyValue

rdd = sc.parallelize(
    [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)])

rdd.reduceByKey(add).collect()
## [('bar', 0), ('foo', 3)]
and make an arbitrary class behave like a key-value pair. So once again: if something can be correctly unpacked as a pair of objects, then it is a valid key-value. Implementing the __len__ and __getitem__ magic methods should work as well. Probably the most elegant way to handle this is to use namedtuples.
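For example, a namedtuple-based pair could look like the sketch below (the Record type and its field names are just illustrative choices, not anything PySpark requires; any two-field namedtuple unpacks as k, v = record):

from collections import namedtuple
from operator import add

Record = namedtuple("Record", ["key", "value"])

rdd = sc.parallelize(
    [Record("foo", 1), Record("foo", 2), Record("bar", 0)])

rdd.reduceByKey(add).collect()
## [('bar', 0), ('foo', 3)]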
Also, rdd.take(n) returns a list of n elements, so type(rdd.take(1)) will always be list regardless of what the RDD contains.
Related
Can someone explain why NamedTuples and immutable structs are separate, instead of NamedTuples being an anonymous struct the way there are anonymous functions function (x) x^2 end? They look like they have the same structure conceptually (I would also like to know whether they have a different memory layout), though they have different methods for accessing their fields (example below). It seems very plausible to implement the NamedTuple methods for structs, but I may just not be aware of a good reason not to do that.
struct X; a; b; c; end
Xnt = NamedTuple{(:a, :b, :c), Tuple{Any, Any, Any}}

t1 = (10, 20.2, 30im)
# t1[1]           indexing by position
# t1[1:2]         slicing
# for el in t1    iteration

x1 = X(t1...)
# x1.a            getting a field

xnt1 = Xnt(t1)
# xnt1.a          getting a field
# xnt1[:a]        indexing by field
# xnt1[1]         indexing by position
# for el in xnt1  iteration
Every single NamedTuple instance with the same names and field types is of the same type. Different structs (types) can have the same number and type of fields but are different types.
A named tuple has a name for each field (column) in the tuple, which makes named tuples a lightweight alternative to a dataframe. When you define the named tuple you pass the field names as a list, so named tuples act as a specification contract for the expected fields and reduce the chance of code breaking.
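In Python, for instance, collections.namedtuple from the standard library works this way (a minimal sketch; the Point type and its fields are illustrative):

from collections import namedtuple

# The field names are declared once, up front, like a lightweight schema.
Point = namedtuple("Point", ["x", "y"])

p = Point(1.0, 2.5)
print(p.x)   # access by name
print(p[1])  # access by position

# Point(1.0) would raise a TypeError, because the declared field y is missing.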
I have used * and ** for passing arguments to a function in Python 2 (that part is not in question), usually with a list, set, or dictionary.
def func(a, b, c):
    pass

l = ['a', 'b', 'c']
func(*l)

d = {'a': 'a', 'b': 'b', 'c': 'c'}
func(**d)
However, in Python 3 there are new objects that replace what used to be plain lists, for example dict_keys, dict_values, range, map and so on.
While migrating my Python 2 code to Python 3, I need to decide whether these new objects support the operations the former objects did; if not, I have to change the code, e.g. by casting back to the original type, for instance list(dict_keys).
d = {'a': 'a', 'b': 'b'}
print(list(d.keys())[0])  # type-cast to use list indexing
For iteration I could figure it out the way shown below.
import collections.abc

isinstance(d.keys(), collections.abc.Iterable)     # dict_keys
isinstance(d.values(), collections.abc.Iterable)   # dict_values
isinstance(map(str, d), collections.abc.Iterable)  # map
isinstance(range(3), collections.abc.Iterable)     # range
It seems clear how to tell whether one of the new objects is iterable, but, as in the title of the question, what about the asterisk operation for positional/keyword arguments?
So far, every object that replaced a list supports the asterisk operation in my testing, but I need a clear criterion rather than testing by hand.
I have tried a few ways, but there is no common criterion:
Are they all Iterable classes? No: generators don't support it, in my testing.
Are they all Iterator classes? No: generators don't support it, in my testing.
Are they all Container classes? No: the map class is not a Container.
Do they all have a common superclass? No: there is no common superclass (tested with Class.mro()).
How can I know whether an object supports the asterisk (*, **) operation for positional/keyword arguments?
Every iterable "supports" the starred expression; even generators and maps do. However, saying "an object supports *" is misleading, because the star means "unpack my iterable and pass each element, in order, to the parameters of an interface". So really it is the * operator that supports iterables.
And this is maybe where your problem comes in: the iterable you use with * has to have as many elements as the interface has parameters. See for example the following snippets:
# a function that takes three args
def fun(a, b, c):
    return a, b, c

# dicts with some arbitrary values:
d = {'x': 10, 'y': 20, 'z': 30}           # 3 elements
d2 = {'x': 10, 'y': 20, 'z': 30, 't': 0}  # 4 elements
You can pass d to fun in many ways:
fun(*d) # valid
fun(*d.keys()) # valid
fun(*d.values()) # valid
You cannot, however, pass d2 to fun, since it has more elements than fun takes arguments:
fun(*d2) # invalid
You can also pass maps to fun using a starred expression. But remember, the result of the map has to have as many elements as fun takes arguments.
def sq(x):
    return x**2

sq_vals = map(sq, d.values())
fun(*sq_vals)  # Result: (100, 400, 900)
The same holds for a generator, as long as it produces as many elements as your function takes arguments:
def genx():
    for i in range(3):
        yield i

fun(*genx())  # Result: (0, 1, 2)
In order to check whether you can unpack an iterable into a function's interface using a starred expression, you need to check whether your iterable has the same number of elements as the function has parameters; one way to make that check explicit is sketched below.
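The sketch below is only a rough programmatic version of that rule (the helper name can_star_unpack is an illustrative choice; it assumes only positional parameters, and counting the elements consumes one-shot iterables such as generators):

import inspect
from collections.abc import Iterable

def can_star_unpack(obj, func):
    # Is obj iterable, and does it yield exactly as many elements
    # as func has parameters?
    if not isinstance(obj, Iterable):
        return False
    n_params = len(inspect.signature(func).parameters)
    return len(list(obj)) == n_params  # consumes one-shot iterables

def fun(a, b, c):
    return a, b, c

print(can_star_unpack(range(3), fun))          # True
print(can_star_unpack({'x': 1, 'y': 2}, fun))  # False: only 2 elements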
If you want to make your function safe against arguments of different lengths, you could, for example, redefine your function the following way:
# this version of fun takes a minimum of 3 arguments:
def fun2(a, b, c, *args):
    return a, b, c

fun2(*range(10))  # valid
fun(*range(10))   # TypeError
The single asterisk form (*args) is used to pass a non-keyworded, variable-length argument list, and the double asterisk form is used to pass a keyworded, variable-length argument list.
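For the ** side the criterion is different: the object has to be a mapping whose string keys match the function's parameter names. A minimal sketch of both sides, assuming Python 3 (the names are illustrative):

from collections.abc import Mapping

def fun(a, b, c):
    return a, b, c

d = {'a': 1, 'b': 2, 'c': 3}

# * works with any iterable of the right length: dict views, map, range, generators...
print(fun(*d.values()))          # (1, 2, 3)
print(fun(*map(str, range(3))))  # ('0', '1', '2')

# ** needs a mapping with string keys matching the parameter names.
print(fun(**d))                  # (1, 2, 3)

# The dict itself is a Mapping; its key view is not.
print(isinstance(d, Mapping), isinstance(d.keys(), Mapping))  # True False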
I've been trying to replicate the code at http://www.data-intuitive.com/2015/01/transposing-a-spark-rdd/ to transpose an RDD in pyspark. I am able to load my RDD correctly and apply the zipWithIndex method to it as follows:
m1.rdd.zipWithIndex().collect()
[(Row(c1_1=1, c1_2=2, c1_3=3), 0),
(Row(c1_1=4, c1_2=5, c1_3=6), 1),
(Row(c1_1=7, c1_2=8, c1_3=9), 2)]
But when I want to apply a flatMap to it with a lambda enumerating that array, either the syntax is invalid:
m1.rdd.zipWithIndex().flatMap(lambda (x,i): [(i,j,e) for (j,e) in enumerate(x)]).take(1)
Or, the positional argument i appears as missing:
m1.rdd.zipWithIndex().flatMap(lambda x,i: [(i,j,e) for (j,e) in enumerate(x)]).take(1)
When I run the lambda in plain Python, it needs the index passed as an extra parameter:
aa = m1.rdd.zipWithIndex().collect()
g = lambda x,i: [(i,j,e) for (j,e) in enumerate(x)]
g(aa,3) #extra parameter
This seems unnecessary to me, as the index has already been calculated.
I'm quite an amateur in Python and Spark, and I would like to know what the issue with the indexes is and why neither Spark nor Python picks them up. Thank you.
First let's take a look at the signature of RDD.flatMap (the preservesPartitioning parameter removed for clarity):
flatMap(self: RDD[T], f: Callable[[T], Iterable[U]]) -> RDD[U]: ...
As you can see, flatMap expects a unary function.
Going back to your code:
lambda x, i: ... is a binary function, so clearly it won't work.
lambda (x, i): ... used to be the syntax for a unary function with tuple argument unpacking. It used structural matching to destructure (unpack, in Python nomenclature) a single input argument (here a Tuple[Any, Any]). This syntax was brittle and has been removed in Python 3. A correct way to achieve the same result in Python 3 is indexing:
lambda xi: [(xi[1], j, e) for j, e in enumerate(xi[0])]
If you prefer structural matching, just use a standard function:
def flatten(xsi):
    xs, i = xsi
    for j, x in enumerate(xs):
        yield i, j, x

rdd.flatMap(flatten)
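As a quick sanity check, this is what it looks like end to end on a tiny RDD (a sketch assuming an active SparkContext named sc; the sample data is illustrative):

rdd = sc.parallelize([("a", "b", "c"), ("d", "e", "f")]).zipWithIndex()
rdd.flatMap(flatten).collect()
## [(0, 0, 'a'), (0, 1, 'b'), (0, 2, 'c'),
##  (1, 0, 'd'), (1, 1, 'e'), (1, 2, 'f')]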
I have two RDDs shaped like this:
num_of_words = (doc_title, num)                  # number of words in a document
lines = (doc_title, word, num_of_occurrences)    # number of occurrences of a specific word in a document
When I called lines.join(num_of_words), I was expecting to get something like:
(doc_title,(word,num_of_occurrences,num))
but I got instead:
(doc_title,(word,num))
and num_of_occurrences was omitted. What did I do wrong here? How am I supposed to join these two RDDs to get the result I'm expecting?
From the Spark API docs for the join method:
join(other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.
So the join method can only be used on pairs (or at least it will only return a result of the described form).
A way to overcome this would be to use tuples of (doc_title, (word, num_occurrences)) instead of (doc_title, word, num_occurrences).
Working example:
num_of_words = sc.parallelize([("harry potter", 4242)])
lines = sc.parallelize([("harry potter", ("wand", 100))])
result = lines.join(num_of_words)
print result.collect()
# [('harry potter', (('wand', 100), 4242))]
(Note that sc.parallelize only turns a local python collection into a Spark RDD, and that collect() does the exact opposite)
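If you then want the flat (doc_title, (word, num_of_occurrences, num)) shape from the question, one option (a sketch, not the only way) is to reshape the joined values with mapValues:

flat = result.mapValues(lambda v: (v[0][0], v[0][1], v[1]))
print flat.collect()
# [('harry potter', ('wand', 100, 4242))]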