Reduce ResultIterable objects after groupByKey in PySpark - apache-spark

I'm working on temperature forecasting data using PySpark.
The raw temperature data is in the following format:
station;date;time;temperature;quality
102170;2012-11-01;06:00:00;6.8;G
102185;2012-11-02;06:00:00;5.8;G
102170;2013-11-01;18:00:00;2.8;G
102185;2013-11-01;18:00:00;7.8;G
The target result is the min/max temperature for each year, together with the station that recorded it, like the following:
year;station;max_temp
2013;102185;7.8
2012;102170;6.8
My current code is as follows:
sc = SparkContext(appName="maxMin")
lines = sc.textFile('data/temperature-readings.csv')
lines = lines.map(lambda a: a.split(";"))
lines = lines.filter(lambda x: int(x[1][0:4]) >= 1950 and int(x[1][0:4]) <= 2014)
temperatures = lines.map(lambda x: (x[1][0:4], (x[0], float(x[3]))))
So far, the result is as follows:
temperatures.take(4)
[(u'2012', (u'102170', 6.8)),
 (u'2012', (u'102185', 5.8)),
 (u'2013', (u'102170', 2.8)),
 (u'2013', (u'102185', 7.8))]
After grouping by key, the result becomes the following:
temperatures = temperatures.groupByKey()
temperatures.take(2)
[(u'2012', <pyspark.resultiterable.ResultIterable object at 0x2a0be50>),
(u'2013', <pyspark.resultiterable.ResultIterable object at 0x2a0bc50>)]
So, how can I reduce these ResultIterable objects to get only the element with the min or max temperature?

Just don't. Use reduceByKey instead:
lines.map(lambda x: (x[1][0:4], (x[0], float(x[3])))) \
    .mapValues(lambda x: (x, x)) \
    .reduceByKey(lambda x, y: (
        min(x[0], y[0], key=lambda v: v[1]),
        max(x[1], y[1], key=lambda v: v[1])))
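If only the yearly maximum is needed (the year;station;max_temp shape from the question), a minimal sketch reusing lines from the question could look like this:

# Sketch, assuming `lines` is the parsed RDD from the question:
# keep, per year, the (station, temperature) pair with the highest temperature.
max_per_year = lines.map(lambda x: (x[1][0:4], (x[0], float(x[3])))) \
    .reduceByKey(lambda a, b: max(a, b, key=lambda v: v[1]))

# Flatten into year;station;max_temp rows.
for year, (station, temp) in max_per_year.collect():
    print('%s;%s;%s' % (year, station, temp))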

Related

Understanding frequency sort with lambda

I have this code which does a frequency sort with key=lambda x: (count[x])
nums = [1,1,2,2,2,3]
count = collections.Counter(nums)
output = sorted(nums, key=lambda x: (count[x]))
This gives the output:
[3,1,1,2,2,2]
I would like to know why [3,1,2] isn't the output. How are the keys being repeated from the Counter?
Because you are still sorting nums, which contains all the elements (with duplication/repetition). key=lambda x: count[x] only decides the order in which those elements appear.
An equivalent but less efficient (O(n²) instead of O(n)) version is:
sorted(nums, key=lambda x: nums.count(x))
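If the goal were the unique elements ordered by frequency (i.e. [3, 1, 2]), one option, shown here as a sketch rather than part of the original answer, is to sort the Counter's keys instead of nums:

import collections

nums = [1, 1, 2, 2, 2, 3]
count = collections.Counter(nums)

# Sorting the distinct elements (the Counter's keys) by their frequency
# yields each element only once.
print(sorted(count, key=count.get))  # [3, 1, 2]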

Alternate or better approach to aggregateByKey in pyspark RDD

I have a weather data CSV file in which each entry has a station ID and the minimum or maximum value recorded for that day. The second element is a keyword indicating what the value represents. Sample input is below.
stationID feature value
ITE00100554 TMAX -75
ITE00100554 TMIN -148
GM000010962 PRCP 0
EZE00100082 TMAX -86
EZE00100082 TMIN -135
ITE00100554 TMAX -60
ITE00100554 TMIN -125
GM000010962 PRCP 0
EZE00100082 TMAX -44
EZE00100082 TMIN -130
ITE00100554 TMAX -23
I have filtered the entries down to those with TMIN or TMAX. Each entry is recorded for a given date, but I stripped the date while building my RDD as it's not of interest. My goal is to find the min and max value of each station amongst all of its records, i.e.,
ITE00100554, 'TMIN', <global_min_value recorded by that station>
ITE00100554, 'TMAX', <global_max_value>
EZE00100082, 'TMIN', <global_min_value>
EZE00100082, 'TMAX', <global_max_value>
I was able to accomplish this using aggregateByKey, but according to this link https://backtobazics.com/big-data/spark/apache-spark-aggregatebykey-example/ I don't have to use aggregateByKey since the input and output value formats are the same. So I would like to know if there is an alternate or better way to code this without defining so many functions.
stationtemps = entries.filter(lambda x: x[1] in ['TMIN', 'TMAX']).map(lambda x: (x[0], (x[1], x[2]))) # (stationID, (tempkey, value))
max_temp = stationtemps.values().values().max()
min_temp = stationtemps.values().values().min()
def max_seqOp(accumulator, element):
    return (accumulator if accumulator[1] > element[1] else element)
def max_combOp(accu1, accu2):
    return (accu1 if accu1[1] > accu2[1] else accu2)
def min_seqOp(accumulator, element):
    return (accumulator if accumulator[1] < element[1] else element)
def min_combOp(accu1, accu2):
    return (accu1 if accu1[1] < accu2[1] else accu2)
station_max_temps = stationtemps.aggregateByKey(('', min_temp), max_seqOp, max_combOp).sortByKey()
station_min_temps = stationtemps.aggregateByKey(('', max_temp), min_seqOp, min_combOp).sortByKey()
min_max_temps = station_max_temps.zip(station_min_temps).collect()
with open('1800_min_max.csv', 'w') as fd:
    writer = csv.writer(fd)
    writer.writerows(map(lambda x: list(list(x)), min_max_temps))
I am learning PySpark and haven't mastered all the different transformation functions.
Here is simulated input; if the min and max are filled in correctly, then why the need for the TMIN/TMAX indicator? Indeed, there is no need for an accumulator.
rdd = sc.parallelize([ ('s1','tmin',-3), ('s1','tmax', 5), ('s2','tmin',0), ('s2','tmax', 7), ('s0','tmax',14), ('s0','tmin', 3) ])
rddcollect = rdd.collect()
#print(rddcollect)
rdd2 = rdd.map(lambda x: (x[0], x[2]))
#rdd2collect = rdd2.collect()
#print(rdd2collect)
rdd3 = rdd2.groupByKey().sortByKey()
rdd4 = rdd3.map(lambda k_v: ( k_v[0], (sorted(k_v[1]))) )
rdd4.collect()
returns:
Out[27]: [('s0', [3, 14]), ('s1', [-3, 5]), ('s2', [0, 7])]
ALTERNATE ANSWER
After clarification, and assuming that min and max values make sense with my own data (there are other solutions, BTW), here goes:
include = ['tmin','tmax']
rdd0 = sc.parallelize([ ('s1','tmin',-3), ('s1','tmax', 5), ('s2','tmin',0), ('s2','tmin',-12), ('s2','tmax', 7), ('s2','tmax', 17), ('s2','tother', 17), ('s0','tmax',14), ('s0','tmin', 3) ])
rdd1 = rdd0.filter(lambda x: any(e in x for e in include) )
rdd2 = rdd1.map(lambda x: ( (x[0],x[1]), x[2]))
rdd3 = rdd2.groupByKey().sortByKey()
rdd4Min = rdd3.filter(lambda k_v: k_v[0][1] == 'tmin').map(lambda k_v: ( k_v[0][0], min( k_v[1] ) ))
rdd4Max = rdd3.filter(lambda k_v: k_v[0][1] == 'tmax').map(lambda k_v: ( k_v[0][0], max( k_v[1] ) ))
rdd5 = rdd4Min.union(rdd4Max)
rdd6 = rdd5.groupByKey().sortByKey()
res = rdd6.map(lambda k_v: ( k_v[0], (sorted(k_v[1]))))
rescollect = res.collect()
print(rescollect)
returns:
[('s0', [3, 14]), ('s1', [-3, 5]), ('s2', [-12, 17])]
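For completeness, here is a reduceByKey-based sketch (my own variant, not part of the answer above) that avoids both groupByKey and the separate seq/comb functions, assuming a global min/max per station is what's wanted:

rdd0 = sc.parallelize([('s1', 'tmin', -3), ('s1', 'tmax', 5),
                       ('s2', 'tmin', 0), ('s2', 'tmin', -12),
                       ('s2', 'tmax', 7), ('s2', 'tmax', 17),
                       ('s0', 'tmax', 14), ('s0', 'tmin', 3)])

min_max = (rdd0.filter(lambda x: x[1] in ('tmin', 'tmax'))
               .map(lambda x: (x[0], (x[2], x[2])))           # (station, (value, value))
               .reduceByKey(lambda a, b: (min(a[0], b[0]),    # running min
                                          max(a[1], b[1]))))  # running max

print(sorted(min_max.collect()))
# [('s0', (3, 14)), ('s1', (-3, 5)), ('s2', (-12, 17))]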
Following the same logic as @thebluephantom, this was my final code while reading from the CSV:
def get_temp_points(item):
    if item[0][1] == 'TMIN':
        return (item[0], min(item[1]))
    else:
        return (item[0], max(item[1]))
data = lines.filter(lambda x: any(ele for ele in x if ele in ['TMIN', 'TMAX']))
temps = data.map(lambda x: ((x[0], x[2]), float(x[3])))
temp_list = temps.groupByKey().mapValues(list)
## ((stationID, 'TMIN'/'TMAX'), listofvalues)
min_max_temps = temp_list.map(get_temp_points).collect()
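A possible variant of the same idea (a sketch, assuming the temps pair RDD built above) skips the intermediate lists by reducing directly and then picking min or max based on the tag already in the key:

# Track (running_min, running_max) per (stationID, 'TMIN'/'TMAX') key,
# then keep the min for TMIN keys and the max for TMAX keys.
min_max_temps = (temps
                 .mapValues(lambda v: (v, v))
                 .reduceByKey(lambda a, b: (min(a[0], b[0]), max(a[1], b[1])))
                 .map(lambda kv: (kv[0], kv[1][0] if kv[0][1] == 'TMIN' else kv[1][1]))
                 .collect())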

Filter out non digit values in pyspark RDD

I have an RDD which looks like this:
[["3331/587","Metro","1235","1000"],
["1234/232","City","8479","2000"],
["5987/215","Metro","1111","Unkown"],
["8794/215","Metro","1112","1000"],
["1254/951","City","6598","XXXX"],
["1584/951","City","1548","Unkown"],
["1833/331","Metro","1009","2000"],
["2213/987","City","1197", ]]
I want to calculate the average and max of the last value in each row (1000, 2000, etc.) for each distinct value in the second entry (City/Metro) separately. I am using the following code to collect the "City" values:
rdd.filter(lambda row: row[1] == 'City').map(lambda x: float(x[3])).collect()
However, I get an error, probably because of the string values (e.g. "Unkown") in the series.
How can I filter out rows with string and null values (i.e. keep only those convertible to numbers), and then calculate the max and average?
Try this.
rdd = rdd.map(lambda l: [l[i].replace('"', '') for i in range(0, len(l))])
rdd = rdd.filter(lambda l: len(l) > 3) \
    .filter(lambda l: l[1] in ['City', 'Metro']) \
    .filter(lambda l: l[3].isdigit()) \
    .map(lambda l: (l[1], int(l[3])))
rdd_avg = rdd.aggregateByKey((0, 0),
                             lambda a, b: (a[0] + b, a[1] + 1),
                             lambda a, b: (a[0] + b[0], a[1] + b[1])).mapValues(lambda x: x[0] / x[1])
rdd_max = rdd.reduceByKey(lambda a, b: a if a > b else b)
print(rdd_avg.collect())
print(rdd_max.collect())
[('Metro', 1333.3333333333333), ('City', 2000.0)]
[('Metro', 2000), ('City', 2000)]
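If the average and the max are wanted in a single pass, one aggregateByKey that tracks (sum, count, max) can also work; this is a sketch reusing the rdd of (label, value) pairs built above:

stats = rdd.aggregateByKey(
    (0, 0, float('-inf')),                                     # zero value: (sum, count, max)
    lambda acc, v: (acc[0] + v, acc[1] + 1, max(acc[2], v)),   # fold one value into the accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1], max(a[2], b[2])))  # merge accumulators across partitions

print(stats.mapValues(lambda s: (s[0] / s[1], s[2])).collect())
# roughly [('Metro', (1333.33..., 2000)), ('City', (2000.0, 2000))]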

Do Spark RDDs have something similar to a set, allowing rapid lookup?

I have some data consisting of pairs such as
data = [(3,7), (2,4), (7,3), ...]
These correspond to connections in a graph I want to build. I want to keep only the pairs whose reverse pair is contained in the data, and only one copy of each. For instance, in the above data, I want [(3,7)] because the reverse ((7,3)) is also in the data.
In Python, I would do something like this:
pairs = set(data)
edges = [p for p in pairs if p[0] < p[1] and (p[1], p[0]) in pairs]
Can I do something similar on Spark? The closest I can get is creating a new RDD with the pairs reversed, taking the intersection with the original data, and filtering based on pair elements being sorted, but that seems inefficient.
rdd = sc.parallelize([(3, 7), (2, 4), (7, 3)]) \
    .map(lambda x: ((min(x), max(x)), [x])) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda x: len(x[1]) > 1) \
    .map(lambda x: x[0])
This can work as well.
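Another sketch of the same idea (not from the answer above), keying on the pair itself and joining against the reversed pairs, which stays closer to the original set-lookup intuition:

pairs = sc.parallelize([(3, 7), (2, 4), (7, 3)]).map(lambda p: (p, None))
reversed_pairs = pairs.map(lambda kv: ((kv[0][1], kv[0][0]), None))

# A pair survives the join only if its reverse also exists; keep one copy of each.
edges = (pairs.join(reversed_pairs)
              .keys()
              .filter(lambda p: p[0] < p[1]))
print(edges.collect())  # [(3, 7)]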

PySpark flip key/value

I am trying to flip the key and value in a dataset in order to sort it. However, the map function returns an invalid syntax error:
rdd = clean_headers_rdd.rdd \
    .filter(lambda x: x['date'].year == 2016) \
    .map(lambda x: (x['user_id'], 1)).reduceByKey(lambda x, y: x + y) \
    .map(lambda (x, y): (y, x)).sortByKey(ascending=False)
This is due to PEP 3113 -- Removal of Tuple Parameter Unpacking.
Method recommended by the transition plan:
rdd.map(lambda x_y: (x_y[1], x_y[0]))
Shortcut with the operator module:
from operator import itemgetter
rdd.map(itemgetter(1, 0))
Slicing:
rdd.map(lambda x: x[::-1])
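Putting any of these back into the original pipeline, the whole chain might look like this (a sketch, assuming clean_headers_rdd from the question):

rdd = (clean_headers_rdd.rdd
       .filter(lambda x: x['date'].year == 2016)
       .map(lambda x: (x['user_id'], 1))
       .reduceByKey(lambda x, y: x + y)
       .map(lambda kv: (kv[1], kv[0]))      # swap (user_id, count) -> (count, user_id)
       .sortByKey(ascending=False))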
