PySpark flip key/value - apache-spark

I am trying to flip the key and value of a dataset in order to sort it. However, the map function returns an invalid syntax error:
rdd = clean_headers_rdd.rdd\
.filter(lambda x: x['date'].year == 2016)\
.map(lambda x: (x['user_id'], 1)).reduceByKey(lambda x, y: x + y)\
.map(lambda (x, y): (y, x)).sortByKey(ascending = False)

The lambda (x, y): ... form relies on tuple parameter unpacking, which Python 3 removed (PEP 3113 -- Removal of Tuple Parameter Unpacking).
Method recommended by the transition plan:
rdd.map(lambda x_y: (x_y[1], x_y[0]))
Shortcut with the operator module:
from operator import itemgetter
rdd.map(itemgetter(1, 0))
Slicing:
rdd.map(lambda x: x[::-1])
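For reference, here is a minimal sketch of the question's pipeline with the slicing variant applied (clean_headers_rdd, 'date' and 'user_id' are the names taken from the question):
# swap (user_id, count) to (count, user_id) without tuple unpacking, then sort
rdd = clean_headers_rdd.rdd \
.filter(lambda x: x['date'].year == 2016) \
.map(lambda x: (x['user_id'], 1)) \
.reduceByKey(lambda x, y: x + y) \
.map(lambda x: x[::-1]) \
.sortByKey(ascending=False)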

Related

Filter out non digit values in pyspark RDD

I have an RDD which looks like this:
[["3331/587","Metro","1235","1000"],
["1234/232","City","8479","2000"],
["5987/215","Metro","1111","Unkown"],
["8794/215","Metro","1112","1000"],
["1254/951","City","6598","XXXX"],
["1584/951","City","1548","Unkown"],
["1833/331","Metro","1009","2000"],
["2213/987","City","1197", ]]
I want to calculate the average and max of the last value of each row (1000, 2000, etc.) for each distinct value in the second entry (City/Metro) separately. I am using the following code to collect the "City" values:
rdd.filter(lambda row: row[1] == 'City').map(lambda x: float(x[3])).collect()
However, I get an error, probably because of the string values (e.g. "Unkown") in the series.
How can I filter out rows with string and null values (i.e. keep only the rows whose last value is convertible to a number), and then calculate the max and average?
Try this.
# strip any stray quotes from every field
rdd = rdd.map(lambda l: [v.replace('"', '') for v in l])
# keep well-formed rows whose last field is numeric, keyed by City/Metro
rdd = rdd.filter(lambda l: len(l) > 3) \
.filter(lambda l: l[1] in ['City', 'Metro']) \
.filter(lambda l: l[3].isdigit()) \
.map(lambda l: (l[1], int(l[3])))
# average via a (sum, count) accumulator; max via a pairwise reduce
rdd_avg = rdd.aggregateByKey((0, 0), lambda a, b: (a[0] + b, a[1] + 1), lambda a, b: (a[0] + b[0], a[1] + b[1])).mapValues(lambda x: x[0] / x[1])
rdd_max = rdd.reduceByKey(lambda a, b: a if a > b else b)
print(rdd_avg.collect())
print(rdd_max.collect())
[('Metro', 1333.3333333333333), ('City', 2000.0)]
[('Metro', 2000), ('City', 2000)]
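If you want the average and the max in a single pass, one aggregateByKey can carry (sum, count, max) together. A minimal sketch over the same (category, value) pairs built above:
# track (sum, count, max) per key in one pass
stats = rdd.aggregateByKey(
    (0, 0, float('-inf')),
    lambda acc, v: (acc[0] + v, acc[1] + 1, max(acc[2], v)),
    lambda a, b: (a[0] + b[0], a[1] + b[1], max(a[2], b[2])))
print(stats.mapValues(lambda s: (s[0] / s[1], s[2])).collect())
# e.g. [('Metro', (1333.3333333333333, 2000)), ('City', (2000.0, 2000))]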

Do Spark RDDs have something similar to a set allowing rapid lookup?

I have some data consisting of pairs such as
data = [(3,7), (2,4), (7,3), ...]
These correspond to connections in a graph I want to build. I want to keep only the pairs whose reverse pair is contained in the data, and only one copy of each. For instance, in the above data, I want [(3,7)] because the reverse ((7,3)) is also in the data.
In Python, I would do something like this:
pairs = set(data)
edges = [p for p in pairs if p[0] < p[1] and (p[1], p[0]) in pairs]
Can I do something similar on Spark? The closest I can get is creating a new RDD with the pairs reversed, taking the intersection with the original data, and filtering based on pair elements being sorted, but that seems inefficient.
rdd = sc.parallelize([(3, 7), (2, 4), (7, 3)]) \
.map(lambda x: ((min(x), max(x)), [x])) \
.reduceByKey(lambda x, y: x + y) \
.filter(lambda x: len(x[1]) > 1) \
.map(lambda x: x[0])
This can work as well.
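The (min(x), max(x)) key canonicalizes each undirected edge, so both orientations of a pair land in the same reduce bucket. A quick check, reusing the rdd defined above:
print(rdd.collect())  # expected: [(3, 7)]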

Pyspark Use groupByKey/mapValues more than once in one line

I'm using PySpark and I'm looking for a way to use the groupByKey/mapValues methods more than once.
Given :
rdd = sc.parallelize([(u'04896f3765094732a478ba63dd42c785',
u'2016-01-01',
u'2',
u'1404.0',
u'2016-333',
u'2016-48',
u'2016-11',
'2016-11-28'),
(u'kldmm584753dljkdhggdklkfj32a478ba63dd422574',
u'2016-01-14',
u'6',
u'2000.0',
u'2016-333',
u'2016-48',
u'2016-11',
'2016-11-28')
])
I want to group my rdd by the element at index 4 ('2016-333' here), and get the count, sum, etc.
My code:
(rdd
.map(lambda x : (x[4], x[0]))
.groupByKey()
.mapValues(len)
.collect())
Output: [(u'2016-333', 2)]
(rdd
.map(lambda x : (x[4], float(x[3])))
.groupByKey()
.mapValues(sum)
.collect())
Output: [(u'2016-333', 3404.0)]
(rdd
.map(lambda x : (x[4], int(x[2])))
.groupByKey()
.mapValues(sum)
.collect())
Output: [(u'2016-333', 8)]
My question: Is there a way to do this in one go?
The expected output is:
[(u'2016-333', 2, 3404.0, 8)]
Thx!
You could use reduceByKey as in the word-count example. Here, your values are 3-element tuples, and your reducer is an element-wise sum of those tuples.
rdd.map(lambda x: (x[4], (1, float(x[3]), int(x[2])))).reduceByKey(lambda x,y: (x[0] + y[0], x[1] + y[1], x[2] + y[2])).collect()
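Note that this collects as [(u'2016-333', (2, 3404.0, 8))], with the three aggregates nested in a tuple. A sketch of flattening it into the exact shape asked for:
# flatten (key, (count, sum1, sum2)) into (key, count, sum1, sum2)
result = (rdd.map(lambda x: (x[4], (1, float(x[3]), int(x[2]))))
          .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
          .map(lambda kv: (kv[0],) + kv[1]))
print(result.collect())  # [(u'2016-333', 2, 3404.0, 8)]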
The simplest possible approach:
rdd.map(lambda x: (x[4], float(x[3]), int(x[2]))).toDF(["key", "x3", "x2"]) \
.groupBy("key").agg({"*": "count", "x3": "sum", "x2": "sum"}).rdd
or
import numpy as np

rdd.map(lambda x: (x[4], np.array([1, float(x[3]), int(x[2])]))) \
.reduceByKey(lambda x, y: x + y) \
.mapValues(lambda x: (int(x[0]), x[1], int(x[2])))
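As a follow-up to the DataFrame route, spelling the aggregates out explicitly avoids the auto-generated column names of the dict form. A sketch, assuming an active SparkSession and using F as an alias for pyspark.sql.functions:
from pyspark.sql import functions as F

df = rdd.map(lambda x: (x[4], float(x[3]), int(x[2]))).toDF(["key", "x3", "x2"])
df.groupBy("key").agg(
    F.count("*").alias("n"),
    F.sum("x3").alias("sum_x3"),
    F.sum("x2").alias("sum_x2")).show()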

Reduce ResultIterable objects after groupByKey in PySpark

I'm working on temperature forecasting data using PySpark.
The raw temperature data in the following format:
station;date;time;temperature;quality
102170;2012-11-01;06:00:00;6.8;G
102185;2012-11-02;06:00:00;5.8;G
102170;2013-11-01;18:00:00;2.8;G
102185;2013-11-01;18:00:00;7.8;G
The target result is the min/max temperature for each year, together with the station where it was recorded, like the following:
year;station;max_temp
2013;102185;7.8
2012;102170;6.8
My current code is as follows:
sc = SparkContext(appName="maxMin")
lines = sc.textFile('data/temperature-readings.csv')
lines = lines.map(lambda a: a.split(";"))
lines = lines.filter(lambda x: int(x[1][0:4]) >= 1950 and int(x[1][0:4]) <= 2014)
temperatures = lines.map(lambda x: (x[1][0:4], (x[0], float(x[3]))))
So far, the result is as follows:
temperatures.take(4)
(2012, (102170,6.8))
(2012, (102185,5.8))
(2013, (102170,2.8))
(2013, (102185,7.8))
After grouping by key, it becomes the following:
temperatures = temperatures.groupByKey()
temperatures.take(2)
[(u'2012', <pyspark.resultiterable.ResultIterable object at 0x2a0be50>),
(u'2013', <pyspark.resultiterable.ResultIterable object at 0x2a0bc50>)]
So how can I reduce these ResultIterable objects to get only the element with the min or max temperature?
Just don't. Use reduceByKey:
# each value becomes a ((station, temp), (station, temp)) pair tracking the running (min, max)
lines.map(lambda x: (x[1][0:4], (x[0], float(x[3])))) \
.mapValues(lambda x: (x, x)) \
.reduceByKey(lambda x, y: (
    min(x[0], y[0], key=lambda t: t[1]),
    max(x[1], y[1], key=lambda t: t[1])))
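If only the per-year maximum is needed (the year;station;max_temp layout from the question), a simpler sketch keeps just the larger (station, temperature) pair per key:
# keep only the reading with the highest temperature per year
max_per_year = (lines
    .map(lambda x: (x[1][0:4], (x[0], float(x[3]))))
    .reduceByKey(lambda a, b: a if a[1] >= b[1] else b))
print(max_per_year.collect())
# e.g. [('2012', ('102170', 6.8)), ('2013', ('102185', 7.8))]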

Spark: Expansion of RDD(Key, List) to RDD(Key, Value)

So I have an RDD of something like this
RDD[(Int, List[Int])]
Where a single element in the RDD looks like
(1, List(1, 2, 3))
My question is how can I expand the key value pair to something like this
(1,1)
(1,2)
(1,3)
Thank you
rdd.flatMap { case (key, values) => values.map((key, _)) }
And in Python (based on @seanowen's answer):
rdd.flatMap(lambda x: map(lambda e: (x[0], e), x[1]))
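A quick check of the Python version (the tiny RDD here is only illustrative):
rdd = sc.parallelize([(1, [1, 2, 3]), (2, [4])])
print(rdd.flatMap(lambda x: map(lambda e: (x[0], e), x[1])).collect())
# [(1, 1), (1, 2), (1, 3), (2, 4)]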
