PySpark: use groupByKey/mapValues more than once in one line

I'm using PySpark and I'm looking for a way to apply the groupByKey/mapValues combination more than once on the same RDD.
Given:
rdd = sc.parallelize([(u'04896f3765094732a478ba63dd42c785',
u'2016-01-01',
u'2',
u'1404.0',
u'2016-333',
u'2016-48',
u'2016-11',
'2016-11-28'),
(u'kldmm584753dljkdhggdklkfj32a478ba63dd422574',
u'2016-01-14',
u'6',
u'2000.0',
u'2016-333',
u'2016-48',
u'2016-11',
'2016-11-28')
])
I want to group my rdd by the element at index 4 ('2016-333' here), and get the len, sum, etc.
My code:
(rdd
.map(lambda x : (x[4], x[0]))
.groupByKey()
.mapValues(len)
.collect())
Output : [(u'2016-333', 2)]
(rdd
.map(lambda x : (x[4], float(x[3])))
.groupByKey()
.mapValues(sum)
.collect())
Output : [(u'2016-333', 3404.0)]
(rdd
.map(lambda x : (x[4], int(x[2])))
.groupByKey()
.mapValues(sum)
.collect())
Output : [(u'2016-333', 8)]
My question: is there a way to do this in one go?
The expected output is :
[(u'2016-333', 2, 3404.0, 8)]
Thx !

You could use reduceByKey as in the wordcount example. Here, your values are a 3-part tuple, and your reducer would be a summation of the elements.
rdd.map(lambda x: (x[4], (1, float(x[3]), int(x[2])))) \
   .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2])) \
   .collect()
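Note that this returns the three aggregates as a nested tuple, e.g. [(u'2016-333', (2, 3404.0, 8))]. If you want the flat 4-tuple from your expected output, a final map can unpack it (a minimal sketch of the same chain):
(rdd
 .map(lambda x: (x[4], (1, float(x[3]), int(x[2]))))
 .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
 # flatten (key, (count, sum_x3, sum_x2)) into (key, count, sum_x3, sum_x2)
 .map(lambda kv: (kv[0],) + kv[1])
 .collect())
# [(u'2016-333', 2, 3404.0, 8)]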

The simplest possible:
rdd.map(lambda x: (x[4], float(x[3]), int(x[2]))).toDF(["key", "x3", "x2"]) \
    .groupBy("key").agg({"*": "count", "x3": "sum", "x2": "sum"}).rdd
or
import numpy as np

rdd.map(lambda x: (x[4], np.array([1, float(x[3]), int(x[2])]))) \
    .reduceByKey(lambda x, y: x + y) \
    .mapValues(lambda x: (int(x[0]), x[1], int(x[2])))
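If you need the exact flat 4-tuples from the expected output, a variation of the DataFrame route with named aggregates keeps the final mapping explicit (a sketch; the aliases n, sum_x3 and sum_x2 are made up here):
import pyspark.sql.functions as F

df = rdd.map(lambda x: (x[4], float(x[3]), int(x[2]))).toDF(["key", "x3", "x2"])
(df.groupBy("key")
   .agg(F.count(F.lit(1)).alias("n"),       # row count per key
        F.sum("x3").alias("sum_x3"),
        F.sum("x2").alias("sum_x2"))
   .rdd
   .map(lambda r: (r["key"], r["n"], r["sum_x3"], r["sum_x2"]))
   .collect())
# [(u'2016-333', 2, 3404.0, 8)]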

Related

Filter out non-digit values in pyspark RDD

I have an RDD which looks like this:
[["3331/587","Metro","1235","1000"],
["1234/232","City","8479","2000"],
["5987/215","Metro","1111","Unkown"],
["8794/215","Metro","1112","1000"],
["1254/951","City","6598","XXXX"],
["1584/951","City","1548","Unkown"],
["1833/331","Metro","1009","2000"],
["2213/987","City","1197", ]]
I want to calculate the average and max of the last value in each row (1000, 2000, etc.) for each distinct value in the second entry (City/Metro) separately. I am using the following code to collect the "City" values:
rdd.filter(lambda row: row[1] == 'City').map(lambda x: float(x[3])).collect()
However, I get an error, probably because of the string values (e.g. "Unkown") in the series.
How can I filter out rows with string and null values (i.e. keep only values convertible to digits), and then calculate the max and average?
Try this.
rdd = rdd.map(lambda l: [l[i].replace('"', '') for i in range(0, len(l))])
rdd = rdd.filter(lambda l: len(l) > 3) \
    .filter(lambda l: l[1] in ['City', 'Metro']) \
    .filter(lambda l: l[3].isdigit()) \
    .map(lambda l: (l[1], int(l[3])))
# aggregateByKey keeps a running (sum, count) per key; the combiner merges partial (sum, count) pairs
rdd_avg = rdd.aggregateByKey((0, 0), lambda a, b: (a[0] + b, a[1] + 1),
                             lambda a, b: (a[0] + b[0], a[1] + b[1])).mapValues(lambda x: x[0] / x[1])
rdd_max = rdd.reduceByKey(lambda a, b: a if a > b else b)
print(rdd_avg.collect())
print(rdd_max.collect())
[('Metro', 1333.3333333333333), ('City', 2000.0)]
[('Metro', 2000), ('City', 2000)]
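If you want the average and the max in a single pass over the data, one option (a sketch, not part of the original answer) is a single aggregateByKey that tracks sum, count and max together:
# zero value is (sum, count, max); the seqOp folds one value in, the combOp merges partial triples
stats = rdd.aggregateByKey(
    (0, 0, float('-inf')),
    lambda acc, v: (acc[0] + v, acc[1] + 1, max(acc[2], v)),
    lambda a, b: (a[0] + b[0], a[1] + b[1], max(a[2], b[2])))
stats.mapValues(lambda s: (s[0] / s[1], s[2])).collect()
# roughly [('Metro', (1333.3333333333333, 2000)), ('City', (2000.0, 2000))]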

Efficiently find top-N values from multiple columns independently in Pyspark

I am trying to get the top-n frequency counts for all the columns in a DataFrame:
counts = (df.rdd
.flatMap(lambda x: x.asDict().items())
.map(lambda x: (x, 1))
.reduceByKey(add))
and I get this:
((Column Name/Value), count)
(('name', 'Dominion Range 08357'), 1)
(('id', 52132), 1)
(('nametype', 'Valid'), 10)
(('recclass', 'L6'), 2)
(('mass (g)', 8.9), 1)
(('fall', 'Found'), 10)
(('year', '01/01/2008 12:00:00 AM'), 2)
(('reclat', 0.0), 1)
(('reclong', 0.0), 1)
(('GeoLocation', '(0.000000, 0.000000)'), 1)
(('name', 'Yamato 792863'), 1)
(('id', 28212), 1)
(('recclass', 'H5'), 3)
(('mass (g)', 132.25), 1)
(('year', '01/01/1979 12:00:00 AM'), 1)
(('reclat', -71.5), 1)
(('reclong', 35.66667), 1)
(('GeoLocation', '(-71.500000, 35.666670)'), 1)
After that I am trying to get the top 10 values:
(counts
.groupBy(lambda x: x[0])
.flatMap(lambda g: nlargest(10, g[1], key=lambda x: x[1])))
But I am getting the same results.
Any help?
I just figured it out. I was missing an extra index in the groupBy:
counts = (df.rdd
.flatMap(lambda x: x.asDict().items())
.map(lambda x: (x, 1))
.reduceByKey(add)
.groupBy(lambda x: x[0][0])
.flatMap(lambda g: nlargest(3, g[1], key=lambda x: x[1])))
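For completeness, the snippet relies on two imports that are not shown; assuming the usual ones, add comes from operator and nlargest from heapq:
from operator import add      # add(a, b) == a + b, used as the reduceByKey function
from heapq import nlargest    # picks the N items with the largest key without sorting the whole group
The groupBy(lambda x: x[0][0]) groups by the column name alone (each element is ((column, value), count)), so nlargest then picks the most frequent values within each column.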

How to associate unique id with text in word counting with spark

I have an RDD that is populated as
id txt
1 A B C
2 A B C
1 A B C
The result of my word count (pyspark) should be keyed by the combination of the id and the word associated with it. Example:
[(u'1_A',2), (u'1_B',2), (u'1_C',2),(u'2_A',1),(u'2_B',1),(u'2_C',1)]
I tried using a user-defined function to combine the id with the splits of the text, but it complains that the append function is unavailable in this context.
Appreciate any code samples that will set me in the right direction.
Here is an alternative solution using a PySpark DataFrame. The code mainly uses explode and split to break up the txt column, then groupby and count to count the number of (id, word) pairs.
import pyspark.sql.functions as func
rdd = spark.sparkContext.parallelize([(1,'A B C'), (2, 'A B C'), (1,'A B C')])
df = rdd.toDF(['id', 'txt'])
df_agg = df.select('id', func.explode(func.split('txt', ' '))).\
    groupby(['id', 'col']).\
    count().\
    sort(['id', 'col'], ascending=True)
df_agg.rdd.map(lambda x:(str(x['id']) + '_' + x['col'], x['count'] )).collect()
Output
[('1_A', 2), ('1_B', 2), ('1_C', 2), ('2_A', 1), ('2_B', 1), ('2_C', 1)]
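If you prefer to build the 'id_word' key inside the DataFrame rather than in the final rdd.map, concat_ws can do it (a sketch reusing df_agg from above; the cast is there because id is an integer column):
(df_agg
 .select(func.concat_ws('_', func.col('id').cast('string'), 'col').alias('key'), 'count')
 .rdd.map(lambda r: (r['key'], r['count']))
 .collect())
# [('1_A', 2), ('1_B', 2), ('1_C', 2), ('2_A', 1), ('2_B', 1), ('2_C', 1)]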
The snippet below should work as well:
rdd = sc.parallelize([(1,'A B C'), (2, 'A B C'), (1,'A B C')])
result = rdd \
    .map(lambda x: (x[0], x[1].split(' '))) \
    .flatMap(lambda x: ['%s_%s' % (x[0], y) for y in x[1]]) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: x + y)
result.collect()
Output
[('1_C', 2), ('1_B', 2), ('1_A', 2), ('2_A', 1), ('2_B', 1), ('2_C', 1)]

PySpark flip key/value

I am trying to flip the key and value in a dataset in order to sort it. However, the map function returns an invalid syntax error:
rdd = clean_headers_rdd.rdd\
.filter(lambda x: x['date'].year == 2016)\
.map(lambda x: (x['user_id'], 1)).reduceByKey(lambda x, y: x + y)\
.map(lambda (x, y): (y, x)).sortByKey(ascending = False)
This is PEP 3113 -- Removal of Tuple Parameter Unpacking: lambda (x, y): ... is no longer valid syntax in Python 3.
Method recommended by the transition plan:
rdd.map(lambda x_y: (x_y[1], x_y[0]))
Shortcut with the operator module:
from operator import itemgetter
rdd.map(itemgetter(1, 0))
Slicing:
rdd.map(lambda x: x[::-1])
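Plugging the first fix back into the pipeline from the question (a sketch; clean_headers_rdd is the DataFrame from the question):
rdd = (clean_headers_rdd.rdd
       .filter(lambda x: x['date'].year == 2016)
       .map(lambda x: (x['user_id'], 1))
       .reduceByKey(lambda x, y: x + y)
       .map(lambda x: (x[1], x[0]))          # flip to (count, user_id)
       .sortByKey(ascending=False))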

Reduce ResultIterable objects after groupByKey in PySpark

I'm working on temperature forecasting data using PySpark.
The raw temperature data in the following format:
station;date;time;temperature;quality
102170;2012-11-01;06:00:00;6.8;G
102185;2012-11-02;06:00:00;5.8;G
102170;2013-11-01;18:00:00;2.8;G
102185;2013-11-01;18:00:00;7.8;G
The target result is to get the min/max temperature for each year, together with the station where it was recorded, like the following:
year;station;max_temp
2013;102185;7.8
2012;102170;6.8
My current code as the following:
sc = SparkContext(appName="maxMin")
lines = sc.textFile('data/temperature-readings.csv')
lines = lines.map(lambda a: a.split(";"))
lines = lines.filter(lambda x: int(x[1][0:4]) >= 1950 and int(x[1][0:4]) <= 2014)
temperatures = lines.map(lambda x: (x[1][0:4], (x[0], float(x[3]))))
So far, the result is as follows:
temperatures.take(4)
(2012, (102170,6.8))
(2012, (102185,5.8))
(2013, (102170,2.8))
(2013, (102185,7.8))
After grouping by key, the result becomes the following:
temperatures = temperatures.groupByKey()
temperatures.take(2)
[(u'2012', <pyspark.resultiterable.ResultIterable object at 0x2a0be50>),
(u'2013', <pyspark.resultiterable.ResultIterable object at 0x2a0bc50>)]
So, how can I reduce these ResultIterable objects to get only the element with the min or max temperature?
Just don't. Use reduceByKey instead:
lines.map(lambda x: (x[1][0:4], (x[0], float(x[3])))) \
    .mapValues(lambda x: (x, x)) \
    .reduceByKey(lambda x, y: (
        min(x[0], y[0], key=lambda v: v[1]),
        max(x[1], y[1], key=lambda v: v[1])))
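Each value after the reduce is ((min_station, min_temp), (max_station, max_temp)), so the year;station;max_temp target can be read off with one more map (a sketch; reduced is just a hypothetical name for the RDD returned above):
# keep only the max side of each value: (year, station, max_temp)
reduced.map(lambda kv: (kv[0], kv[1][1][0], kv[1][1][1])).collect()
# e.g. [(u'2012', u'102170', 6.8), (u'2013', u'102185', 7.8)]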
