Efficiently find top-N values from multiple columns independently in PySpark - apache-spark

I am trying to get the top-N frequency counts for all the columns in a DataFrame:
from operator import add

counts = (df.rdd
          .flatMap(lambda x: x.asDict().items())   # one (column, value) pair per cell
          .map(lambda x: (x, 1))
          .reduceByKey(add))
and I get this:
((Column Name/Value), count)
(('name', 'Dominion Range 08357'), 1)
(('id', 52132), 1)
(('nametype', 'Valid'), 10)
(('recclass', 'L6'), 2)
(('mass (g)', 8.9), 1)
(('fall', 'Found'), 10)
(('year', '01/01/2008 12:00:00 AM'), 2)
(('reclat', 0.0), 1)
(('reclong', 0.0), 1)
(('GeoLocation', '(0.000000, 0.000000)'), 1)
(('name', 'Yamato 792863'), 1)
(('id', 28212), 1)
(('recclass', 'H5'), 3)
(('mass (g)', 132.25), 1)
(('year', '01/01/1979 12:00:00 AM'), 1)
(('reclat', -71.5), 1)
(('reclong', 35.66667), 1)
(('GeoLocation', '(-71.500000, 35.666670)'), 1)
After that I am trying to get the top 10 values:
from heapq import nlargest

(counts
 .groupBy(lambda x: x[0])
 .flatMap(lambda g: nlargest(10, g[1], key=lambda x: x[1])))
But I am getting the same results.
Any help?

I just figured it out. I was missing an extra index in the groupBy: the key has to be the column name, x[0][0], not the whole (column, value) pair.
from heapq import nlargest
from operator import add

counts = (df.rdd
          .flatMap(lambda x: x.asDict().items())             # (column, value) pairs
          .map(lambda x: (x, 1))
          .reduceByKey(add)                                   # ((column, value), count)
          .groupBy(lambda x: x[0][0])                         # group by the column name only
          .flatMap(lambda g: nlargest(3, g[1], key=lambda x: x[1])))
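For reference, a minimal end-to-end sketch of the same pattern, assuming an active SparkSession named spark and a small made-up DataFrame (the sample values below are invented, not from the original dataset):
from heapq import nlargest
from operator import add

# Hypothetical sample data, only to illustrate the shape of the result.
df = spark.createDataFrame(
    [("L6", "Found"), ("L6", "Found"), ("H5", "Fell")],
    ["recclass", "fall"])

top_per_column = (df.rdd
                  .flatMap(lambda row: row.asDict().items())    # (column, value)
                  .map(lambda kv: (kv, 1))
                  .reduceByKey(add)                             # ((column, value), count)
                  .groupBy(lambda kv: kv[0][0])                 # key = column name
                  .flatMap(lambda g: nlargest(2, g[1], key=lambda kv: kv[1])))
print(top_per_column.collect())
# e.g. [(('recclass', 'L6'), 2), (('recclass', 'H5'), 1), (('fall', 'Found'), 2), (('fall', 'Fell'), 1)]
Because the groupBy key is the column name, nlargest picks the top counts independently for each column.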

Related

PySpark sort values

I have the following data:
[(u'ab', u'cd'),
(u'ef', u'gh'),
(u'cd', u'ab'),
(u'ab', u'gh'),
(u'ab', u'cd')]
I would like to do a MapReduce on this data to find out how often the same pairs appear.
As a result I get:
[((u'ab', u'cd'), 2),
((u'cd', u'ab'), 1),
((u'ab', u'gh'), 1),
((u'ef', u'gh'), 1)]
As you can see, it is not quite right: (u'ab', u'cd') has to be 3 instead of 2, because (u'cd', u'ab') is the same pair.
My question is: how can I make the program count (u'cd', u'ab') and (u'ab', u'cd') as the same pair? I was thinking about sorting the values in each row, but could not find a solution for this.
You can sort the values, then use reduceByKey to count the pairs:
rdd1 = rdd.map(lambda x: (tuple(sorted(x)), 1)) \
          .reduceByKey(lambda a, b: a + b)
rdd1.collect()
# [(('ab', 'gh'), 1), (('ef', 'gh'), 1), (('ab', 'cd'), 3)]
You can key by the sorted element, and count by key:
result = rdd.keyBy(lambda x: tuple(sorted(x))).countByKey()
print(result)
# defaultdict(<class 'int'>, {('ab', 'cd'): 3, ('ef', 'gh'): 1, ('ab', 'gh'): 1})
To convert the result into a list, you can do:
result2 = sorted(result.items())
print(result2)
# [(('ab', 'cd'), 3), (('ab', 'gh'), 1), (('ef', 'gh'), 1)]
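A variant of the same idea, not from the original answers: canonicalize each pair by keying on a frozenset instead of a sorted tuple. Note that a frozenset collapses pairs whose two elements are equal, so the sorted-tuple approach above is the safer general choice; this is only a sketch.
# Sketch: a frozenset key makes the element order irrelevant.
rdd2 = (rdd.map(lambda x: (frozenset(x), 1))
           .reduceByKey(lambda a, b: a + b))
print(rdd2.collect())
# e.g. [(frozenset({'ab', 'cd'}), 3), (frozenset({'ef', 'gh'}), 1), (frozenset({'ab', 'gh'}), 1)]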

Filter out non-digit values in a PySpark RDD

I have an RDD which looks like this:
[["3331/587","Metro","1235","1000"],
["1234/232","City","8479","2000"],
["5987/215","Metro","1111","Unkown"],
["8794/215","Metro","1112","1000"],
["1254/951","City","6598","XXXX"],
["1584/951","City","1548","Unkown"],
["1833/331","Metro","1009","2000"],
["2213/987","City","1197", ]]
I want to calculate the average and max of the last value of each row (1000, 2000, etc.) for each distinct value in the second entry (City/Metro) separately. I am using the following code to collect the "City" values:
rdd.filter(lambda row: row[1] == 'City').map(lambda x: float(x[3])).collect()
However, I get an error, probably because of the string values (e.g. "Unkown") in the series.
How can I filter out rows with string and null values (i.e. keep only those convertible to digits), and then calculate the max and average?
Try this.
rdd = rdd.map(lambda l: [l[i].replace('"', '') for i in range(0, len(l))])
rdd = rdd.filter(lambda l: len(l) > 3) \
         .filter(lambda l: l[1] in ['City', 'Metro']) \
         .filter(lambda l: l[3].isdigit()) \
         .map(lambda l: (l[1], int(l[3])))
rdd_avg = rdd.aggregateByKey((0, 0),
                             lambda a, b: (a[0] + b, a[1] + 1),            # per-partition (sum, count)
                             lambda a, b: (a[0] + b[0], a[1] + b[1])) \
             .mapValues(lambda x: x[0] / x[1])
rdd_max = rdd.reduceByKey(lambda a, b: a if a > b else b)
print(rdd_avg.collect())
print(rdd_max.collect())
[('Metro', 1333.3333333333333), ('City', 2000.0)]
[('Metro', 2000), ('City', 2000)]
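If you would rather not make two passes over the RDD, both statistics can come out of a single aggregateByKey with a (sum, count, max) accumulator; a sketch building on the filtered rdd above:
stats = rdd.aggregateByKey(
    (0, 0, float('-inf')),
    lambda acc, v: (acc[0] + v, acc[1] + 1, max(acc[2], v)),      # fold one value into (sum, count, max)
    lambda a, b: (a[0] + b[0], a[1] + b[1], max(a[2], b[2])))     # merge per-partition accumulators
print(stats.mapValues(lambda s: (s[0] / s[1], s[2])).collect())   # (average, max) per key
# e.g. [('Metro', (1333.3333333333333, 2000)), ('City', (2000.0, 2000))]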

How to associate a unique id with text in word counting with Spark

I have an RDD that is populated as
id txt
1 A B C
2 A B C
1 A B C
The result of my word count (PySpark) should be keyed by the combination of the id and each string token associated with it. Example:
[(u'1_A',2), (u'1_B',2), (u'1_C',2),(u'2_A',1),(u'2_B',1),(u'2_C',1)]
I tried using a user-defined function to combine the id with the string splits from the text. However, it complains that the append function is unavailable in this context.
Appreciate any code samples that will set me in the right direction.
Here is an alternative solution using a PySpark DataFrame. The code uses explode and split to split the txt column, then groupby and count to count the number of (id, token) pairs.
import pyspark.sql.functions as func

rdd = spark.sparkContext.parallelize([(1, 'A B C'), (2, 'A B C'), (1, 'A B C')])
df = rdd.toDF(['id', 'txt'])
df_agg = (df.select('id', func.explode(func.split('txt', ' ')))
            .groupby(['id', 'col'])
            .count()
            .sort(['id', 'col'], ascending=True))
df_agg.rdd.map(lambda x: (str(x['id']) + '_' + x['col'], x['count'])).collect()
Output
[('1_A', 2), ('1_B', 2), ('1_C', 2), ('2_A', 1), ('2_B', 1), ('2_C', 1)]
The snippet below should work:
rdd = sc.parallelize([(1, 'A B C'), (2, 'A B C'), (1, 'A B C')])
result = (rdd
          .map(lambda x: (x[0], x[1].split(' ')))
          .flatMap(lambda x: ['%s_%s' % (x[0], y) for y in x[1]])
          .map(lambda x: (x, 1))
          .reduceByKey(lambda x, y: x + y))
result.collect()
Output
[('1_C', 2), ('1_B', 2), ('1_A', 2), ('2_A', 1), ('2_B', 1), ('2_C', 1)]

PySpark: use groupByKey/mapValues more than once in one line

I'm using PySpark and I'm looking for a way to use the groupByKey/mapValues methods more than once.
Given :
rdd = sc.parallelize([(u'04896f3765094732a478ba63dd42c785',
u'2016-01-01',
u'2',
u'1404.0',
u'2016-333',
u'2016-48',
u'2016-11',
'2016-11-28'),
(u'kldmm584753dljkdhggdklkfj32a478ba63dd422574',
u'2016-01-14',
u'6',
u'2000.0',
u'2016-333',
u'2016-48',
u'2016-11',
'2016-11-28')
])
I want to group my RDD by the element at index 4 ('2016-333' here), and get the len, sum, etc.
My code:
(rdd
.map(lambda x : (x[4], x[0]))
.groupByKey()
.mapValues(len)
.collect())
Output : [(u'2016-333', 2)]
(rdd
.map(lambda x : (x[4], float(x[3])))
.groupByKey()
.mapValues(sum)
.collect())
Output : [(u'2016-333', 3404.0)]
(rdd
.map(lambda x : (x[4], int(x[2])))
.groupByKey()
.mapValues(sum)
.collect())
Output : [(u'2016-333', 8)]
My question: is there a way to do this in one go?
The expected output is :
[(u'2016-333', 2, 3404.0, 8)]
Thx !
You could use reduceByKey as in the word-count example. Here, your values are a 3-part tuple (count, sum of x[3], sum of x[2]), and your reducer is an element-wise summation of those tuples.
(rdd.map(lambda x: (x[4], (1, float(x[3]), int(x[2]))))
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
    .collect())
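That reduceByKey result is keyed as (key, (count, sum, sum)); if you want the flat 4-tuples from the expected output, one more map flattens it (my addition, not part of the original answer):
flat = (rdd.map(lambda x: (x[4], (1, float(x[3]), int(x[2]))))
           .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2]))
           .map(lambda kv: (kv[0],) + kv[1]))     # (key, count, sum of x[3], sum of x[2])
print(flat.collect())
# e.g. [(u'2016-333', 2, 3404.0, 8)]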
The simplest possible:
rdd.map(lambda x: (x[4], float(x[3]), int(x[2]))).toDF(["key", "x3", "x2"]) \
   .groupBy("key").agg({"*": "count", "x3": "sum", "x2": "sum"}).rdd
or
import numpy as np

rdd.map(lambda x: (x[4], np.array([1, float(x[3]), int(x[2])]))) \
   .reduceByKey(lambda x, y: x + y) \
   .mapValues(lambda x: (int(x[0]), x[1], int(x[2])))   # cast the count and the x[2] sum back to int
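If you alias the aggregations, the DataFrame route also comes back in a predictable column order; a sketch (the cnt/sum_x3/sum_x2 names are my own choice, not from the original answer):
import pyspark.sql.functions as F

df = rdd.map(lambda x: (x[4], float(x[3]), int(x[2]))).toDF(["key", "x3", "x2"])
agg = df.groupBy("key").agg(F.count("*").alias("cnt"),
                            F.sum("x3").alias("sum_x3"),
                            F.sum("x2").alias("sum_x2"))
print(agg.rdd.map(lambda r: (r["key"], r["cnt"], r["sum_x3"], r["sum_x2"])).collect())
# e.g. [(u'2016-333', 2, 3404.0, 8)]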

Reduce ResultIterable objects after groupByKey in PySpark

I'm working on temperature forecasting data using PySpark.
The raw temperature data is in the following format:
station;date;time;temperature;quality
102170;2012-11-01;06:00:00;6.8;G
102185;2012-11-02;06:00:00;5.8;G
102170;2013-11-01;18:00:00;2.8;G
102185;2013-11-01;18:00:00;7.8;G
The target result is the min/max temperature for each year, along with the station where it was recorded, like the following:
year;station;max_temp
2013;102185;7.8
2012;102170;6.8
My current code is as follows:
sc = SparkContext(appName="maxMin")
lines = sc.textFile('data/temperature-readings.csv')
lines = lines.map(lambda a: a.split(";"))
lines = lines.filter(lambda x: int(x[1][0:4]) >= 1950 and int(x[1][0:4]) <= 2014)
temperatures = lines.map(lambda x: (x[1][0:4], (x[0], float(x[3]))))
So far, the result is as follows:
temperatures.take(4)
(2012, (102170,6.8))
(2012, (102185,5.8))
(2013, (102170,2.8))
(2013, (102185,7.8))
After grouping by key, it becomes the following:
temperatures = temperatures.groupByKey()
temperatures.take(2)
[(u'2012', <pyspark.resultiterable.ResultIterable object at 0x2a0be50>),
(u'2013', <pyspark.resultiterable.ResultIterable object at 0x2a0bc50>)]
So, how can I reduce these ResultIterable objects to get only the element with the min or max temperature?
Just don't. Use reduceByKey instead:
lines.map(lambda x: (x[1][0:4], (x[0], float(x[3])))) \
     .mapValues(lambda v: (v, v)) \
     .reduceByKey(lambda x, y: (
         min(x[0], y[0], key=lambda t: t[1]),     # keep the (station, temp) with the lowest temp
         max(x[1], y[1], key=lambda t: t[1])))    # keep the (station, temp) with the highest temp
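To turn the reduced values, which are ((station, min_temp), (station, max_temp)) pairs, into the year;station;max_temp rows from the question, a small follow-up sketch (my own addition) that repeats the pipeline and then pulls out the max side:
min_max = (lines.map(lambda x: (x[1][0:4], (x[0], float(x[3]))))
                .mapValues(lambda v: (v, v))
                .reduceByKey(lambda x, y: (min(x[0], y[0], key=lambda t: t[1]),
                                           max(x[1], y[1], key=lambda t: t[1]))))
max_rows = min_max.map(lambda kv: (kv[0], kv[1][1][0], kv[1][1][1]))  # (year, station, max_temp)
print(max_rows.collect())
# e.g. [('2013', '102185', 7.8), ('2012', '102170', 6.8)]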
