How to make calculations between RDD rows? - apache-spark

I have a Spark RDD like this:
[(1, '02-01-1950', 2.8), (2, '03-01-1950', 3.1), (3, '04-01-1950', 3.2)]
And I want to calculate the increase (by percentage) between sequential rows. For example, from row 1 to row 2 the value rises to 110.7% of its previous value ((3.1/2.8)*100), and so on.
Any suggestions on how to make calculations between rows?

You can join the RDD with a copy of itself whose keys are shifted by 1:
rdd = sc.parallelize([(1, '02-01-1950', 2.8), (2, '03-01-1950', 3.1), (3, '04-01-1950', 3.2)])
rdd2 = rdd.map(lambda x: (x[0], x[2]))
rdd3 = rdd.map(lambda x: (x[0]+1, x[2]))
rdd4 = rdd2.join(rdd3).mapValues(lambda r: r[0]/r[1]*100)
rdd4.collect()
# [(2, 110.71428571428572), (3, 103.2258064516129)]
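If you also want to keep the date of the later row in the result, here is a minimal sketch of the same join idea (the names by_id and shifted are mine, and this still assumes the ids are consecutive integers as in the example):
by_id   = rdd.map(lambda x: (x[0], (x[1], x[2])))   # (id, (date, value))
shifted = rdd.map(lambda x: (x[0] + 1, x[2]))       # (id + 1, previous value)
by_id.join(shifted) \
     .mapValues(lambda r: (r[0][0], r[0][1] / r[1] * 100)) \
     .sortByKey() \
     .collect()
# [(2, ('03-01-1950', 110.71428571428572)), (3, ('04-01-1950', 103.2258064516129))]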

Related

PySpark Combining Strings into a tuple-pair based on second value as key

I am new to pyspark and am still trying to understand how map and reduce work.
I have a dataset read as an RDD; the input file, learn.txt, is shown below.
Based on the second value (the numeric one), I want to see which two letters share the same value, and how many times.
My current code's output:
[(('b', 'c'), 1),
('c', 1),
('d', 1),
('a', 1),
(('a', 'b'), 2),
((('a', 'b'), 'c'), 1)]
What I want as the output:
[(('b','a'),3),
(('a','b'),3),
(('b','c'),2),
(('c','b'),2),
(('a','c'),1),
(('c','a'),1)]
That is, pairs only: every ordered permutation of two letters, counted once for each value they share.
I don't believe my code will be too helpful, but this is what I have:
from pyspark import RDD, SparkContext
from pyspark.sql import DataFrame, SparkSession
sc = SparkContext('local[*]')
spark = SparkSession.builder.getOrCreate()
df = sc.textFile("learn.txt")
mapped = df.map(lambda x: [a for a in x.split(',')])
remapped = mapped.map(lambda x: (x[1], x[0]))
reduced = remapped.reduceByKey(lambda x,y: (x,y))
threemapped = reduced.map(lambda x: (x[1], 1))
output = threemapped.reduceByKey(lambda x, y: x+y)
output.collect()
Where learn.txt:
a,1
a,2
a,3
a,4
b,2
b,3
b,4
b,6
c,2
c,5
c,6
d,7
With .reduceByKey(lambda x, y: (x, y)) you build arbitrarily nested tuples of tuples, which you will not be able to do much with.
Since you are looking for pairs of values that share a key, you can use a join like this:
# same code as you
vals = df\
    .map(lambda x: [a for a in x.split(',')])\
    .map(lambda x: (x[1], x[0]))
# but then you can join vals with itself and use reduceByKey to count occurrences
result = vals.join(vals)\
    .filter(lambda x: x[1][0] != x[1][1])\
    .map(lambda x: ((x[1][1], x[1][0]), 1))\
    .reduceByKey(lambda a, b: a+b)\
    .collect()
which yields:
[(('b', 'a'), 3), (('c', 'a'), 1), (('c', 'b'), 2), (('b', 'c'), 2), (('a', 'b'), 3), (('a', 'c'), 1)]
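A possible alternative that avoids the self-join is to group the letters per numeric value and emit every ordered pair with itertools.permutations; this is a sketch reusing the vals RDD from above (note that groupByKey can use more memory than the join when a value is shared by many letters):
from itertools import permutations
result2 = vals.groupByKey() \
    .flatMap(lambda kv: permutations(kv[1], 2)) \
    .map(lambda pair: (pair, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .collect()
# same counts as above, e.g. (('a', 'b'), 3), (('b', 'a'), 3), ...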

Pyspark- Convert an RDD into a key value pair RDD, with the keys in ascending order

I have an RDD1 in this form: ['once','upon','a','time',...,'the','end']. I want to convert it into key/value pairs such that the strings are values and the keys are in ascending order. The expected RDD2 should be as follows: [(1,'once'),(2,'upon'),(3,'a'),(4,'time'),...,(RDD1.count()-1,'the'),(RDD1.count(),'end')]
Any hints?
Thanks
Use pyspark's own zip function. This might help:
rdd1 = sc.parallelize(['once','upon','a','time','the','end'])
nums = sc.parallelize(range(rdd1.count())).map(lambda x: x+1)
zippedRdds = nums.zip(rdd1)
rdd2 = zippedRdds.sortByKey()
rdd2.collect()
will give:
[(1, 'once'), (2, 'upon'), (3, 'a'), (4, 'time'), (5, 'the'), (6, 'end')]
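Note that zip requires both RDDs to have the same number of partitions and the same number of elements in each partition, which the construction above does not guarantee in general. A sketch of an alternative using zipWithIndex (the index is 0-based, hence the +1):
rdd2 = rdd1.zipWithIndex().map(lambda x: (x[1] + 1, x[0]))
rdd2.collect()
# [(1, 'once'), (2, 'upon'), (3, 'a'), (4, 'time'), (5, 'the'), (6, 'end')]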

How to use combineByKey in pyspark [duplicate]

This question already has answers here:
Who can give a clear explanation for `combineByKey` in Spark?
(3 answers)
Closed 3 years ago.
I got a question from a homework assignment:
We have sample data like this:
data = [ ("B", 2), ("A", 1), ("A", 4), ("B", 2), ("B", 3) ]
and the combineByKey code is like this:
>>> rdd = sc.parallelize( data )
>>> rdd2 = rdd.combineByKey(lambda value: (value, value+2, 1),
... lambda x, value: (x[0] + value, x[1] + value*value, x[2] + 1),
... lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
I got a result like this:
>>> myoutput = rdd2.collect()
>>> myoutput
[('B', (7, 17, 3)), ('A', (5, 9, 2))]
We are supposed to write out the answer by hand instead of just running the code to get the result.
After the first lambda, is it correct that I get ('B', (2,4,1)), ('A', (1,3,1)), ('A', (4,6,1)), ('B', (2,4,1)), ('B', (3,5,1))? But I don't quite understand the "x[1] + value*value" part of the second lambda. How do the middle values 17 and 9 for B and A come about?
Can anyone help explain this to me? Thank you!
As explained in the link by cricket_007:
When using combineByKey, values are merged into one value within each partition, and then the per-partition values are merged into a single value.
Let's first look at the number of partitions and what each partition contains after we parallelize the data.
>>> data = [ ("B", 2), ("A", 1), ("A", 4), ("B", 2), ("B", 3) ]
>>> rdd = sc.parallelize( data )
>>> rdd.collect()
[('B', 2), ('A', 1), ('A', 4), ('B', 2), ('B', 3)]
Number of partitions (by default):
>>> num_partitions = rdd.getNumPartitions()
>>> print(num_partitions)
4
Contents of each partition:
>>> partitions = rdd.glom().collect()
>>> for num,partition in enumerate(partitions):
... print(f'Partitions {num} -> {partition}')
Partitions 0 -> [('B', 2)]
Partitions 1 -> [('A', 1)]
Partitions 2 -> [('A', 4)]
Partitions 3 -> [('B', 2), ('B', 3)]
combineByKey is defined as
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)
The three functions that combineByKey takes as arguments:
createCombiner : lambda value: (value, value+2, 1)
This will be called the first time a key is seen in a partition.
mergeValue : lambda x, value: (x[0] + value, x[1] + value*value, x[2] + 1)
This will be called when the key has already been seen in that partition.
mergeCombiners : lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2])
This will be called to merge the per-partition combiners for the same key.
partitioner : Beyond the scope of this answer. (A generic sketch of this pattern follows right after this list.)
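To make the roles of these three functions concrete, here is a generic sketch of the common per-key (sum, count) -> average pattern with combineByKey; the lambdas and variable names are my own illustration, not the homework's:
sums_counts = rdd.combineByKey(
    lambda v: (v, 1),                          # createCombiner: first value seen for a key in a partition
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue: fold another value into that partition's combiner
    lambda a, b: (a[0] + b[0], a[1] + b[1]))   # mergeCombiners: merge combiners of the same key across partitions
averages = sums_counts.mapValues(lambda t: t[0] / t[1])
averages.collect()
# e.g. [('B', 2.3333333333333335), ('A', 2.5)] (order may vary)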
Now let's work out what happens:
Partition 0: [('B', 2)]
createCombiner
('B', 2) -> Unseen Key -> ('B', (2, 2+2, 1))
-> ('B', (2,4,1))
# Same createCombiner for partition 1,2,3
Partition 1: [('A',1)]
createCombiner
('A',1) -> Unseen Key -> ('A', (1,3,1))
Partition 2: [('A',4)]
createCombiner
('A',4) -> Unseen Key -> ('A', (4,6,1))
Partition 3: [('B',2), ('B',3)]
createCombiner
('B',2) -> Unseen Key -> ('B',(2,4,1))
('B',3) -> Seen Key -> mergeValue ('B',(2,4,1)) with ('B',3)
-> ('B', (2 + 3, 4+(3*3), 1+1))
-> ('B', (5,13,2))
Partition 0 and Partition 3:
mergeCombiners ('B', (2,4,1)) and ('B', (5,13,2))
-> ('B', (2+5, 4+13, 1+2))
-> ('B', (7,17,3))
Partition 1 and 2:
mergeCombiners ('A', (1,3,1)) and ('A', (4,6,1))
-> ('A', (1+4, 3+6, 1+1))
-> ('A', (5,9,2))
So the final answer that we get is:
>>> rdd2 = rdd.combineByKey(lambda value: (value, value+2, 1),
... lambda x, value: (x[0] + value, x[1] + value*value, x[2] + 1),
... lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
>>> rdd2.collect()
[('B', (7, 17, 3)), ('A', (5, 9, 2))]
I hope this explains what's going on.
Additional Clarification as asked in comments:
How does spark set the number of partitions?
From the docs: Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
How does spark partition the data?
A partition (aka split) is a logical chunk of a large distributed data set.
Spark has three different partitioning schemes, namely
hashPartitioner : The default. Keys with the same hash (modulo the number of partitions) end up in the same partition.
customPartitioner : Example below.
rangePartitioner : Elements with keys in the same range end up on the same node.
I quote from Learning Spark by Karau et al. (p. 61): Spark does not give you explicit control over which worker node each key goes to, but it ensures that a set of keys will appear together on some node. If you want all records with the same key to land in the same partition, you can use a custom partitioner like so:
>>> def customPartitioner(key):
...     if key == 'A':
...         return 0
...     if key == 'B':
...         return 1
>>> num_partitions = 2
>>> rdd = sc.parallelize( data ).partitionBy(num_partitions,customPartitioner)
>>> partitions = rdd.glom().collect()
>>> for num, partition in enumerate(partitions):
...     print(f'Partition {num} -> {partition}')
Partition 0 -> [('A', 1), ('A', 4)]
Partition 1 -> [('B', 2), ('B', 2), ('B', 3)]
I encourage you to read the book to learn more.

Reducing values in lists of (key, val) RDD's, given these lists are values in another list of (key, val) RDD's

I've been going around in circles on this for a while - would really appreciate any suggestions!
Sorry for the long title; I hope the short example below explains it much better.
Let's say we have an RDD of the below form:
data = sc.parallelize([(1,[('k1',4),('k2',3),('k1',2)]),\
                       (2,[('k3',1),('k3',8),('k1',6)])])
data.collect()
Output:
[(1, [('k1', 4), ('k2', 3), ('k1', 2)]),
(2, [('k3', 1), ('k3', 8), ('k1', 6)])]
I am looking to apply the following to the innermost lists of (key, val) pairs:
.reduceByKey(lambda a, b: a + b)
(i.e. reduce the values of these lists by key to get the per-key sum, while keeping the result mapped to the key of the outer RDD, which would produce the following output):
[(1, [('k1', 6), ('k2', 3)]),
(2, [('k3', 9), ('k1', 6)])]
I'm relatively new to PySpark and am probably missing something basic here. I've tried a lot of different approaches, but I essentially cannot find a way to access and reduceByKey the (key, val) pairs in a list that is itself a value of another RDD.
Many thanks in advance!
Denys
What you are trying to do is: your value (in the input (K, V)) is an iterable on which you want to sum by inner key and return the result as
(outer_key (e.g. 1, 2) -> List((inner_key (e.g. "K1", "K2"), summed_value)))
Since the sum is calculated on the inner key/value pairs, we can achieve this by:
=> first peeling the elements out of each list item
=> making a new key as (outer_key, inner_key)
=> summing on (outer_key, inner_key) -> value
=> changing the data format back to (outer_key -> (inner_key, summed_value))
=> finally grouping again on the outer key
I am not sure about the Python version, but I believe replacing the Scala collection syntax with Python's will suffice. Here is the solution:
SCALA VERSION
scala> val keySeq = Seq((1,List(("K1",4),("K2",3),("K1",2))),
| (2,List(("K3",1),("K3",8),("K1",6))))
keySeq: Seq[(Int, List[(String, Int)])] = List((1,List((K1,4), (K2,3), (K1,2))), (2,List((K3,1), (K3,8), (K1,6))))
scala> val inRdd = sc.parallelize(keySeq)
inRdd: org.apache.spark.rdd.RDD[(Int, List[(String, Int)])] = ParallelCollectionRDD[111] at parallelize at <console>:26
scala> inRdd.take(10)
res64: Array[(Int, List[(String, Int)])] = Array((1,List((K1,4), (K2,3), (K1,2))), (2,List((K3,1), (K3,8), (K1,6))))
// And solution :
scala> inRdd.flatMap { case (i,l) => l.map(l => ((i,l._1),l._2)) }.reduceByKey(_+_).map(x => (x._1._1 ->(x._1._2,x._2))).groupByKey.map(x => (x._1,x._2.toList.sortBy(x =>x))).collect()
// RESULT ::
res65: Array[(Int, List[(String, Int)])] = Array((1,List((K1,6), (K2,3))), (2,List((K1,6), (K3,9))))
UPDATE => Python Solution
>>> data = sc.parallelize([(1,[('k1',4),('k2',3),('k1',2)]),\
...                        (2,[('k3',1),('k3',8),('k1',6)])])
>>> data.collect()
[(1, [('k1', 4), ('k2', 3), ('k1', 2)]), (2, [('k3', 1), ('k3', 8), ('k1', 6)])]
# Similar operation
>>> data.flatMap(lambda x : [ ((x[0],y[0]),y[1]) for y in x[1]]).reduceByKey(lambda a,b : (a+b)).map(lambda x : [x[0][0],(x[0][1],x[1])]).groupByKey().mapValues(list).collect()
# RESULT
[(1, [('k1', 6), ('k2', 3)]), (2, [('k3', 9), ('k1', 6)])]
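The same pipeline broken into named steps, which may be easier to follow (the variable names are mine; the order of pairs inside each list may vary):
>>> flat    = data.flatMap(lambda kv: [((kv[0], k), v) for (k, v) in kv[1]])  # ((outer, inner), value)
>>> summed  = flat.reduceByKey(lambda a, b: a + b)                            # sum per (outer, inner)
>>> regroup = summed.map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))            # (outer, (inner, sum))
>>> regroup.groupByKey().mapValues(list).collect()
[(1, [('k1', 6), ('k2', 3)]), (2, [('k3', 9), ('k1', 6)])]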
You could also .map your dataset instead of reducing it, because the number of rows in the desired output is the same as in the source dataset; inside the map you can reduce the values as a plain Python list.
use mapValues() + itertools.groupby():
from itertools import groupby
data.mapValues(lambda x: [ (k, sum(f[1] for f in g)) for (k,g) in groupby(sorted(x), key=lambda d: d[0]) ]) \
    .collect()
#[(1, [('k1', 6), ('k2', 3)]), (2, [('k1', 6), ('k3', 9)])]
With itertools.groupby, we use the first item of each tuple as the group key k and sum the second items of the tuples in each group g.
Edit: for a large data set, the sorting required by itertools.groupby is expensive; just write a small function without sorting that does the same:
def merge_tuples(x):
    d = {}
    for (k, v) in x:
        d[k] = d.get(k, 0) + v
    return list(d.items())  # wrap in list() so the result serializes cleanly
data.mapValues(merge_tuples).collect()
#[(1, [('k2', 3), ('k1', 6)]), (2, [('k3', 9), ('k1', 6)])]

Spark filter one RDD by keys in another RDD

Here are two RDDs:
rdd1 = sc.parallelize(("a","b"))
rdd2 = sc.parallelize((("a", 3), ("b", 5), ("c",4)))
I want to filter rdd2 by the keys in rdd1. The result should be
[('a', 3), ('b', 5)]
If the sizes of RDDs are small, I can collect and broadcast rdd1, and then use a filter transformation to get the result.
However, the size of rdd1 is large. So the method I'm now using is the join function:
rdd1.map(lambda x : (x,1)).join(rdd2).mapValues(lambda x : x[1])
Since the cost of the join increases with the size of the data, is there a better way to achieve the result?
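For reference, a minimal sketch of the broadcast-and-filter approach mentioned above, which is only viable while rdd1 is small enough to collect on the driver:
small_keys = sc.broadcast(set(rdd1.collect()))
rdd2.filter(lambda kv: kv[0] in small_keys.value).collect()
# [('a', 3), ('b', 5)]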
