I'm new to Spark. Could somebody explain in more detail the claim that "aggregateByKey() lets you return a result of a different type than the input value type, while reduceByKey() returns the same type as the input"?
If I use reduceByKey(), I can also get a different type of value in the output:
>>> rdd = sc.parallelize([(1,3),(2,3),(1,2),(2,5)])
>>> rdd.collect()
[(1, 3), (2, 3), (1, 2), (2, 5)]
>>> rdd.reduceByKey(lambda x,y: str(x)+str(y)).collect()
[(2, '35'), (1, '32')]
As we can see, the input is int and the output is str.
Am I misunderstanding this difference? What's the point?
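(For context, a plain-Python sketch rather than actual Spark API calls, since the distinction is easier to see outside Spark.) reduceByKey uses one function both within and across partitions, so its contract is (V, V) -> V and it can be called on its own output; aggregateByKey separates the two roles so the accumulator type can differ from the value type by design:

```python
from functools import reduce

# Values seen for key 1 in the example above.
vals = [3, 2]

# reduceByKey applies ONE function both to combine values inside a
# partition and to merge partial results across partitions, so it may
# be called on its own output.  Here str() happens to accept both int
# and str, which is the only reason the (int, int) -> str lambda works.
out = reduce(lambda x, y: str(x) + str(y), vals)

# aggregateByKey-style: a zero value plus two functions lets the
# accumulator type U differ from the value type V cleanly.
zero = ''                                  # U = str
seq = lambda acc, v: acc + str(v)          # (U, V) -> U, within a partition
comb = lambda a, b: a + b                  # (U, U) -> U, across partitions
part1 = reduce(seq, [3], zero)             # partition holding value 3
part2 = reduce(seq, [2], zero)             # partition holding value 2
merged = comb(part1, part2)
```

Both produce '32' here, but only the second makes the int-to-str conversion part of the contract.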
I have an RDD1 in this form: ['once','upon','a','time',...,'the','end']. I want to convert it into key/value pairs such that the strings are the values and the keys are in ascending order. The expected RDD2 should be as follows: [(1,'once'),(2,'upon'),(3,'a'),(4,'time'),...,(RDD1.count()-1,'the'),(RDD1.count(),'end')]
Any hints?
Thanks
Use PySpark's own zip() function. This might help:
rdd1 = sc.parallelize(['once','upon','a','time','the','end'])
nums = sc.parallelize(range(rdd1.count())).map(lambda x: x+1)
zippedRdds = nums.zip(rdd1)
rdd2 = zippedRdds.sortByKey()
rdd2.collect()
will give:
[(1, 'once'), (2, 'upon'), (3, 'a'), (4, 'time'), (5, 'the'), (6, 'end')]
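An alternative (my suggestion, not part of the answer above) is rdd1.zipWithIndex().map(lambda vi: (vi[1] + 1, vi[0])), which numbers elements in one pass without the separate count() job. Since that needs a live SparkContext, here is a plain-Python sketch of the pairing it produces:

```python
words = ['once', 'upon', 'a', 'time', 'the', 'end']

# zipWithIndex pairs each element with its 0-based position; mapping
# (value, i) -> (i + 1, value) yields the 1-based key/value pairs.
# enumerate(..., start=1) reproduces that locally.
rdd2_local = [(i, w) for i, w in enumerate(words, start=1)]
```

Because zipWithIndex assigns indices in RDD order, no sortByKey is needed afterwards.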
If I happen to have the following list of lists:
L=[[(1,3)],[(1,3),(2,4)],[(1,3),(1,4)],[(1,2)],[(1,2),(1,3)],[(1,3),(2,4),(1,2)]]
and what I wish to do is to create a relation between lists in the following way:
I wish to say that
[(1,3)] and [(1,3),(1,4)]
are related, because the first is a sublist of the second, but then I would like to add this relation into a list as:
Relations=[([(1,3)],[(1,3),(1,4)])]
but, we can also see that:
[(1,3)] and [(1,3),(2,4)]
are related, because the first is a sublist of the second, so I would want this to also be a relation added into my Relations list:
Relations=[([(1,3)],[(1,3),(1,4)]),([(1,3)],[(1,3),(2,4)])]
The only thing I wish to be careful with is that I consider a list to be a sublist of another only if they differ by ONE element. So, in other words, we cannot have:
([(1,3)],[(1,3),(2,4),(1,2)])
as an element of my Relations list, but we SHOULD have:
([(1,3),(2,4)],[(1,3),(2,4),(1,2)])
as an element in my Relations list.
I hope there is an efficient way to do this, since in the original context I have to deal with a much bigger list of lists.
Any help given is much appreciated.
You really haven't provided enough information, so I can't tell whether you need itertools.combinations() or itertools.permutations(). Your examples work with itertools.combinations(), so I will use that.
If x and y are two elements of the list, then you just want all pairs where set(x).issubset(y) and the size of the set difference is at most one, i.e. len(set(y) - set(x)) <= 1, e.g.:
In []:
import itertools as it
[[x, y] for x, y in it.combinations(L, r=2) if set(x).issubset(y) and len(set(y)-set(x)) <= 1]
Out[]:
[[[(1, 3)], [(1, 3), (2, 4)]],
[[(1, 3)], [(1, 3), (1, 4)]],
[[(1, 3)], [(1, 2), (1, 3)]],
[[(1, 3), (2, 4)], [(1, 3), (2, 4), (1, 2)]],
[[(1, 2)], [(1, 2), (1, 3)]],
[[(1, 2), (1, 3)], [(1, 3), (2, 4), (1, 2)]]]
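One caveat (my addition, not from the answer above): combinations() only yields pairs in list order, so a sublist that happened to appear *after* its superlist in L would be missed. Ordering each candidate pair by length before testing makes the check order-independent; this sketch also uses == 1, taking the question's "differ by ONE element" literally so that duplicate lists are not related to themselves:

```python
import itertools as it

L = [[(1, 3)], [(1, 3), (2, 4)], [(1, 3), (1, 4)], [(1, 2)],
     [(1, 2), (1, 3)], [(1, 3), (2, 4), (1, 2)]]

relations = []
for a, b in it.combinations(L, r=2):
    # Put the shorter list first so the subset test works regardless
    # of where the pair's members sit in L.
    x, y = (a, b) if len(a) <= len(b) else (b, a)
    if set(x).issubset(y) and len(set(y) - set(x)) == 1:
        relations.append((x, y))
```

On this L it finds the same six relations as the output above.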
I need executors to finish processing data at different times.
I think the easiest way is to make the RDD partitions non-uniform in size. How can I do this?
Not sure what you are trying to achieve, but you can partition the RDD any way you like using partitionBy, e.g.:
sc.parallelize(range(10)).zipWithIndex() \
    .partitionBy(2, lambda k: 0 if k < 2 else 1) \
    .glom().collect()
[[(0, 0), (1, 1)], [(2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9)]]
Note that partitionBy works on a (k, v) RDD, and the partitioning function takes only k as a parameter.
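To see why this skews the sizes, here is a plain-Python sketch of the placement logic (my own illustration, since the snippet above needs a live SparkContext): partitionBy sends each key through the supplied function, so a deliberately lopsided function yields lopsided partitions.

```python
data = [(k, k) for k in range(10)]          # the (k, v) pairs above
partitioner = lambda k: 0 if k < 2 else 1   # same function as above

# Route every pair to the partition id its KEY maps to.
parts = {0: [], 1: []}
for k, v in data:
    parts[partitioner(k)].append((k, v))
```

Partition 0 ends up with 2 elements and partition 1 with 8, so the executor handling partition 1 finishes later.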
Currently, I'm working on a string alignment comparison, and I'm confused about how to optimize the DP by pruning.
The DP can be represented as a matrix/table, with start point (0, 0). Suppose, for example, that the element at (3, 4) is pruned and its value is marked as -1 or null. Then, when I compute locations (4, 4), (3, 5), and (4, 5), I still need an if-statement to check whether the value at (3, 4) is invalid (pruned) or valid (not pruned). Can this implementation save time, given that the pruning check itself adds running time?
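For reference, a minimal sketch of the per-cell check being described (my own illustration with hypothetical names, assuming the usual three-predecessor alignment recurrence and a minimizing objective):

```python
PRUNED = None  # marker for a pruned cell (the question's -1/null)

def best_from(dp, i, j):
    """Best reachable score for cell (i, j) from its three
    predecessors, skipping any that were pruned.  The membership
    check here is the extra per-cell cost the question asks about."""
    candidates = [dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]]
    valid = [c for c in candidates if c is not PRUNED]
    return min(valid) if valid else PRUNED  # all-pruned cells propagate
```

The check itself is constant time per cell; whether pruning pays off depends on how much of the table it lets you skip entirely.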