I am given an RDD. Example:
test = sc.parallelize([(1,0), (2,0), (3,0)])
I need to get the Cartesian product and remove resulting tuple pairs that have duplicate entries.
In this toy example these would be ((1, 0), (1, 0)), ((2, 0), (2, 0)), ((3, 0), (3, 0)).
I can get the Cartesian product as follows. NOTE: The collect and print statements are there ONLY for troubleshooting.
def compute_cartesian(rdd):
    result1 = sc.parallelize(sorted(rdd.cartesian(rdd).collect()))
    print(type(result1))
    print(result1.collect())
My type and output at this stage are correct:
<class 'pyspark.rdd.RDD'>
[((1, 0), (1, 0)), ((1, 0), (2, 0)), ((1, 0), (3, 0)), ((2, 0), (1, 0)), ((2, 0), (2, 0)), ((2, 0), (3, 0)), ((3, 0), (1, 0)), ((3, 0), (2, 0)), ((3, 0), (3, 0))]
But now I need to remove the three pairs of tuples with duplicate entries.
Tried so far:
.distinct() This runs but does not produce a correct resulting RDD.
.dropDuplicates() Will not run. I assume this is an incorrect usage of .dropDuplicates().
Manual function:
Without an RDD this task is easy.
# Remove duplicates
for elem in result:
    if elem[0] == elem[1]:
        result.remove(elem)

print(result)
print("After: ", len(result))
This was a function I wrote that removes the duplicate tuple pairs and then prints the resulting length so I could do a sanity check.
I am just not sure how to perform actions directly on the RDD, in this case remove any duplicate tuple pairs resulting from the Cartesian product, and return an RDD.
Yes, I can .collect() it, perform the operation, and then convert the result back into an RDD, but that defeats the purpose. Suppose this was billions of pairs: I need to perform the operations on the RDD and return an RDD.
You can use filter to remove the pairs that you don't want:
rdd.cartesian(rdd).filter(lambda x: x[0] != x[1])
Note that I would not call those pairs "duplicate pairs", but rather "pairs of duplicates" or even better, "diagonal pairs": they correspond to the diagonal if you visualize the Cartesian product geometrically.
This is why distinct and dropDuplicates are not appropriate here: they remove duplicates, which is not what you want. For instance, [1,1,2].distinct() is [1,2].
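For the toy example above, a minimal end-to-end sketch of the filter approach (assuming the same sc and the test RDD from the question) would look like this:
test = sc.parallelize([(1, 0), (2, 0), (3, 0)])
# Build the Cartesian product and drop the "diagonal" pairs, i.e. pairs whose
# two elements are identical, without collecting anything to the driver.
pairs = test.cartesian(test).filter(lambda x: x[0] != x[1])
sorted(pairs.collect())
# [((1, 0), (2, 0)), ((1, 0), (3, 0)), ((2, 0), (1, 0)),
#  ((2, 0), (3, 0)), ((3, 0), (1, 0)), ((3, 0), (2, 0))]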
I have an RDD1 in this form: ['once','upon','a','time',...,'the','end']. I want to convert it into key/value pairs such that the strings are the values and the keys are in ascending order. The expected RDD2 should be as follows: [(1,'once'),(2,'upon'),(3,'a'),(4,'time'),...,(RDD1.count()-1,'the'),(RDD1.count(),'end')]
Any hints?
Thanks
Use PySpark's own zip function. This might help:
rdd1 = sc.parallelize(['once','upon','a','time','the','end'])
nums = sc.parallelize(range(rdd1.count())).map(lambda x: x+1)
zippedRdds = nums.zip(rdd1)
rdd2 = zippedRdds.sortByKey()
rdd2.collect()
will give:
[(1, 'once'), (2, 'upon'), (3, 'a'), (4, 'time'), (5, 'the'), (6, 'end')]
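As a side note, a similar result can be had with RDD.zipWithIndex(), which avoids the separate count() and range() pass; a minimal sketch of that alternative:
rdd1 = sc.parallelize(['once', 'upon', 'a', 'time', 'the', 'end'])
# zipWithIndex() pairs each element with its 0-based position, so shift the
# index by 1 and swap the order to get (key, value) pairs starting at 1.
rdd2 = rdd1.zipWithIndex().map(lambda x: (x[1] + 1, x[0]))
rdd2.collect()
# [(1, 'once'), (2, 'upon'), (3, 'a'), (4, 'time'), (5, 'the'), (6, 'end')]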
I've been racking my brain over this for a while - would really appreciate any suggestions!
Sorry for the long title; I hope the short example I'll construct below will explain it much better.
Let's say we have an RDD of the below form:
data = sc.parallelize([(1,[('k1',4),('k2',3),('k1',2)]),\
(2,[('k3',1),('k3',8),('k1',6)])])
data.collect()
Output:
[(1, [('k1', 4), ('k2', 3), ('k1', 2)]),
(2, [('k3', 1), ('k3', 8), ('k1', 6)])]
I am looking to apply the following to the innermost lists of (key, val) pairs:
.reduceByKey(lambda a, b: a + b)
(i.e. reduce the values of these pairs by key to get the sum per key, while retaining the result mapped to the keys of the initial higher-level RDD, which would produce the following output):
[(1, [('k1', 6), ('k2', 3)]),
(2, [('k3', 9), ('k1', 6)])]
I'm relatively new to PySpark and am probably missing something basic here, but I've tried a lot of different approaches and essentially cannot find a way to access and reduceByKey the (key, val) pairs in a list that is itself a value of another RDD.
Many thanks in advance!
Denys
What you are trying to do is: your value (in the input K, V) is an iterable over which you want to sum by inner key and return the result as =>
(outer_key (e.g. 1, 2) -> List((inner_key (e.g. "K1", "K2"), summed_value)))
As you can see, the sum is calculated on the inner key-value pairs.
We can achieve this by:
First peeling the elements out of each list item
=> making a new key as (outer_key, inner_key)
=> summing on (outer_key, inner_key) -> value
=> changing the data format back to (outer_key -> (inner_key, summed_value))
=> finally grouping again on the outer key
I am not sure about the Python version, but I believe just replacing the Scala collection syntax with Python's would suffice; here is the solution.
SCALA VERSION
scala> val keySeq = Seq((1,List(("K1",4),("K2",3),("K1",2))),
| (2,List(("K3",1),("K3",8),("K1",6))))
keySeq: Seq[(Int, List[(String, Int)])] = List((1,List((K1,4), (K2,3), (K1,2))), (2,List((K3,1), (K3,8), (K1,6))))
scala> val inRdd = sc.parallelize(keySeq)
inRdd: org.apache.spark.rdd.RDD[(Int, List[(String, Int)])] = ParallelCollectionRDD[111] at parallelize at <console>:26
scala> inRdd.take(10)
res64: Array[(Int, List[(String, Int)])] = Array((1,List((K1,4), (K2,3), (K1,2))), (2,List((K3,1), (K3,8), (K1,6))))
// And solution :
scala> inRdd.flatMap { case (i,l) => l.map(l => ((i,l._1),l._2)) }.reduceByKey(_+_).map(x => (x._1._1 ->(x._1._2,x._2))).groupByKey.map(x => (x._1,x._2.toList.sortBy(x =>x))).collect()
// RESULT ::
res65: Array[(Int, List[(String, Int)])] = Array((1,List((K1,6), (K2,3))), (2,List((K1,6), (K3,9))))
UPDATE => Python Solution
>>> data = sc.parallelize([(1,[('k1',4),('k2',3),('k1',2)]),\
... (2,[('k3',1),('k3',8),('k1',6)])])
>>> data.collect()
[(1, [('k1', 4), ('k2', 3), ('k1', 2)]), (2, [('k3', 1), ('k3', 8), ('k1', 6)])]
# Similar operation
>>> data.flatMap(lambda x : [ ((x[0],y[0]),y[1]) for y in x[1]]).reduceByKey(lambda a,b : (a+b)).map(lambda x : [x[0][0],(x[0][1],x[1])]).groupByKey().mapValues(list).collect()
# RESULT
[(1, [('k1', 6), ('k2', 3)]), (2, [('k3', 9), ('k1', 6)])]
You should .map your dataset instead of reducing, because the number of rows in your example is the same as in the source dataset; inside the map you can reduce the values as a plain Python list.
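A minimal sketch of that idea, assuming the data RDD from the question (sum_pairs is just a helper name made up for this sketch):
from collections import Counter

def sum_pairs(pairs):
    # Sum the inner (key, value) pairs of one record with a plain Counter,
    # so no extra Spark shuffle is needed.
    totals = Counter()
    for k, v in pairs:
        totals[k] += v
    return list(totals.items())

data.mapValues(sum_pairs).collect()
# e.g. [(1, [('k1', 6), ('k2', 3)]), (2, [('k3', 9), ('k1', 6)])]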
use mapValues() + itertools.groupby():
from itertools import groupby
data.mapValues(lambda x: [ (k, sum(f[1] for f in g)) for (k,g) in groupby(sorted(x), key=lambda d: d[0]) ]) \
.collect()
#[(1, [('k1', 6), ('k2', 3)]), (2, [('k1', 6), ('k3', 9)])]
With itertools.groupby, we use the first item of each tuple as the group key k and sum the second items of the tuples in each group g.
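To see just the groupby step on its own, here is a small standalone check on one inner list from the example (no Spark needed):
from itertools import groupby

x = [('k1', 4), ('k2', 3), ('k1', 2)]
# sorted(x) puts equal keys next to each other so groupby can merge them.
[(k, sum(f[1] for f in g)) for (k, g) in groupby(sorted(x), key=lambda d: d[0])]
# [('k1', 6), ('k2', 3)]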
Edit: for a large data set, sorting with itertools.groupby is expensive, so just write a function without sorting that does the same:
def merge_tuples(x):
    d = {}
    for (k, v) in x:
        d[k] = d.get(k, 0) + v
    return list(d.items())

data.mapValues(merge_tuples).collect()
#[(1, [('k2', 3), ('k1', 6)]), (2, [('k3', 9), ('k1', 6)])]
If I happen to have the following list of lists:
L=[[(1,3)],[(1,3),(2,4)],[(1,3),(1,4)],[(1,2)],[(1,2),(1,3)],[(1,3),(2,4),(1,2)]]
and what I wish to do is to create a relation between lists in the following way:
I wish to say that
[(1,3)] and [(1,3),(1,4)]
are related, because the first is a sublist of the second, but then I would like to add this relation into a list as:
Relations=[([(1,3)],[(1,3),(1,4)])]
but, we can also see that:
[(1,3)] and [(1,3),(2,4)]
are related, because the first is a sublist of the second, so I would want this to also be a relation added into my Relations list:
Relations=[([(1,3)],[(1,3),(1,4)]),([(1,3)],[(1,3),(2,4)])]
The only thing I wish to be careful with is that I consider a list to be a sublist of another only if they differ by ONE element. So in other words, we cannot have:
([(1,3)],[(1,3),(2,4),(1,2)])
as an element of my Relations list, but we SHOULD have:
([(1,3),(2,4)],[(1,3),(2,4),(1,2)])
as an element in my Relations list.
I hope there is an optimal way to do this, since in the original context I have to deal with a much bigger list of lists.
Any help given is much appreciated.
You really haven't provided enough information, so I can't tell whether you need itertools.combinations() or itertools.permutations(). Your examples work with itertools.combinations, so I will use that.
If x and y are two elements of the list, then you just want all pairs where set(x).issubset(y) holds and the size of the set difference is at most 1, i.e. len(set(y) - set(x)) <= 1, e.g.:
In []:
import itertools as it
[[x, y] for x, y in it.combinations(L, r=2) if set(x).issubset(y) and len(set(y)-set(x)) <= 1]
Out[]:
[[[(1, 3)], [(1, 3), (2, 4)]],
[[(1, 3)], [(1, 3), (1, 4)]],
[[(1, 3)], [(1, 2), (1, 3)]],
[[(1, 3), (2, 4)], [(1, 3), (2, 4), (1, 2)]],
[[(1, 2)], [(1, 2), (1, 3)]],
[[(1, 2), (1, 3)], [(1, 3), (2, 4), (1, 2)]]]
There is an RDD object:
//have some data in RDD[(Int, Int)] object
(1, 2)
(3, 2)
(2, 3)
(5, 4)
(2, 7)
(5, 2)
(5, 7)
I want to get the max key and remove it; the max key is 5, so the result I want is:
//a new RDD object,RDD[(Int, Int)]
(1, 2)
(3, 2)
(2, 3)
(2, 7)
Could you help me? Thank you!
You need to first get the highest key using RDD.max() (or by sorting), and then filter out the pairs whose key equals that highest key, keeping the rest.
or
You can also register this as a DataFrame and execute a simple SQL query to get the results.
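The question uses Scala, but as a rough sketch of the same max-then-filter idea in PySpark (the equivalent RDD methods exist in the Scala API as well):
rdd = sc.parallelize([(1, 2), (3, 2), (2, 3), (5, 4), (2, 7), (5, 2), (5, 7)])
# Find the largest key, then keep only the pairs with a different key.
max_key = rdd.keys().max()                      # 5
result = rdd.filter(lambda kv: kv[0] != max_key)
result.collect()
# [(1, 2), (3, 2), (2, 3), (2, 7)]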