Do Spark RDDs have something similar to a set allowing rapid lookup? - apache-spark

I have some data consisting of pairs such as
data = [(3,7), (2,4), (7,3), ...]
These correspond to connections in a graph I want to build. I want to keep only the pairs whose reverse pair is contained in the data, and only one copy of each. For instance, in the above data, I want [(3,7)] because the reverse ((7,3)) is also in the data.
In Python, I would do something like this:
pairs = set(data)
edges = [p for p in pairs if p[0] < p[1] and (p[1], p[0]) in pairs]
Can I do something similar on Spark? The closest I can get is creating a new RDD with the pairs reversed, taking the intersection with the original data, and filtering based on pair elements being sorted, but that seems inefficient.

rdd = sc.parallelize([(3, 7), (2, 4), (7, 3)]) \
    .map(lambda x: ((min(x), max(x)), [x])) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda x: len(x[1]) > 1) \
    .map(lambda x: x[0])
This can work as well.
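One caveat with counting entries per key: if the same directed pair appears twice, e.g. (3, 7) twice with no (7, 3), the count still exceeds 1. Below is a minimal plain-Python sketch of the same grouping idea that tracks distinct directions instead. The function name mutual_edges is made up for illustration, and it runs without Spark; on an RDD, deduplicating with distinct() before the map step should have a similar effect.

```python
from collections import defaultdict

def mutual_edges(pairs):
    """Return one (small, large) copy of each pair whose reverse is also present."""
    directions = defaultdict(set)
    for a, b in pairs:
        # Same key as the map step above: the pair with its elements sorted
        directions[(min(a, b), max(a, b))].add((a, b))
    # Keep only keys that were seen in both directions
    return sorted(key for key, seen in directions.items() if len(seen) == 2)

print(mutual_edges([(3, 7), (2, 4), (7, 3)]))   # [(3, 7)]
print(mutual_edges([(3, 7), (3, 7), (2, 4)]))   # []
```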

Related

Keep duplicate items in list of tuples if only the first index matches between the tuples

Input [(1,3), (3,1), (1,5), (2,3), (2,4), (44,33), (33,22), (44,22), (22,33)]
Expected Output [(1,3), (1,5), (2,3), (2,4), (44,33), (44,22)]
I am trying to figure out the above and have tried lots of stuff. So far my only success has been,
for x in range(len(list1)):
    if list1[0][0] == list1[x][0]:
        print(list1[x])
Output: (1, 3) \n (1, 5)
Any sort of advice or help would be appreciated.
Use a collections.defaultdict(list) keyed by the first value, and keep only the values that are ultimately duplicated:
from collections import defaultdict # At top of file, for collecting values by first element
from itertools import chain # At top of file, for flattening result
dct = defaultdict(list)
inp = [(1,3), (3,1), (1,5), (2,3), (2,4), (44,33), (33,22), (44,22), (22,33)]
# For each tuple
for tup in inp:
    first, _ = tup # Extract first element (and verify it's actually a pair)
    dct[first].append(tup) # Collect with other tuples sharing the same first element
# Extract all lists which have two or more elements (first element duplicated at least once)
# Would be list of lists, with each inner list sharing the same first element
onlydups = [lst for firstelem, lst in dct.items() if len(lst) > 1]
# Flattens to make single list of all results (if desired)
flattened_output = list(chain.from_iterable(onlydups))
Importantly, this doesn't require ordered input, and scales well, doing O(n) work (scaling your solution naively would produce an O(n²) solution, considerably slower for larger inputs).
Another approach is the following :
def sort(L: list):
    K = []
    for i in L:
        if set(i) not in K:
            K.append(set(i))
    output = [tuple(m) for m in K]
    return output
output :
[(1, 3), (1, 5), (2, 3), (2, 4), (33, 44), (33, 22), (44, 22)]

Filter out non digit values in pyspark RDD

I have an RDD which looks like this:
[["3331/587","Metro","1235","1000"],
["1234/232","City","8479","2000"],
["5987/215","Metro","1111","Unkown"],
["8794/215","Metro","1112","1000"],
["1254/951","City","6598","XXXX"],
["1584/951","City","1548","Unkown"],
["1833/331","Metro","1009","2000"],
["2213/987","City","1197", ]]
I want to calculate the average and max of the last values of each row (1000, 2000 etc) for each distinct value in the second entries (City/Metro) separately. I am using the following code to collect "City" values:
rdd.filter(lambda row: row[1] == 'City').map(lambda x: float(x[3])).collect()
However, I get an error, probably because of the string values ("Unkown" e.g.) in the series.
How can I filter out rows with string and null values (i.e. keep only those convertible to numbers), and then calculate the max and average?
Try this.
rdd = rdd.map(lambda l: [l[i].replace('"', '') for i in range(0, len(l))])
rdd = rdd.filter(lambda l: len(l) > 3) \
    .filter(lambda l: l[1] in ['City', 'Metro']) \
    .filter(lambda l: l[3].isdigit()) \
    .map(lambda l: (l[1], int(l[3])))
# Accumulate (sum, count) per key; partial results must be merged element-wise
rdd_avg = rdd.aggregateByKey((0, 0), lambda a, b: (a[0] + b, a[1] + 1), lambda a, b: (a[0] + b[0], a[1] + b[1])).mapValues(lambda x: x[0] / x[1])
rdd_max = rdd.reduceByKey(lambda a, b: a if a > b else b)
print(rdd_avg.collect())
print(rdd_max.collect())
[('Metro', 1333.3333333333333), ('City', 2000.0)]
[('Metro', 2000), ('City', 2000)]
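Note that str.isdigit() only accepts strings made up entirely of digits, so a value such as "1000.5" or "-5" would be dropped. If such values should count, one option is to attempt the conversion and discard the failures. A plain-Python sketch of that idea, runnable without Spark, follows; the helper names to_float and stats_by_kind are hypothetical.

```python
def to_float(text):
    """Return float(text), or None if text cannot be converted."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

def stats_by_kind(rows):
    """Map each second-column value to (average, max) of its numeric last column."""
    totals = {}
    for row in rows:
        if len(row) < 4:
            continue  # the last value is missing
        value = to_float(row[3])
        if value is None:
            continue  # e.g. "Unkown" or "XXXX"
        total, count, biggest = totals.get(row[1], (0.0, 0, value))
        totals[row[1]] = (total + value, count + 1, max(biggest, value))
    return {kind: (total / count, biggest)
            for kind, (total, count, biggest) in totals.items()}

rows = [["3331/587", "Metro", "1235", "1000"],
        ["1234/232", "City", "8479", "2000"],
        ["5987/215", "Metro", "1111", "Unkown"],
        ["8794/215", "Metro", "1112", "1000"],
        ["1254/951", "City", "6598", "XXXX"],
        ["1584/951", "City", "1548", "Unkown"],
        ["1833/331", "Metro", "1009", "2000"],
        ["2213/987", "City", "1197"]]
result = stats_by_kind(rows)
print(result)
```

On an RDD, the same to_float helper could presumably be used inside a filter, e.g. keeping rows where to_float(l[3]) is not None, in place of the isdigit() check.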

How to associate unique id with text in word counting with spark

I have an RDD that is populated as
id txt
1 A B C
2 A B C
1 A B C
The result of my word count (PySpark) should be keyed by the combination of each word and the id associated with it. Example:
[(u'1_A',2), (u'1_B',2), (u'1_C',2),(u'2_A',1),(u'2_B',1),(u'2_C',1)]
I tried using a user defined function to combine id with string splits from text. It, however, complains that the append function is unavailable in this context.
Appreciate any code samples that will set me in the right direction.
Here is an alternative solution using a PySpark DataFrame. The code uses split and explode to turn the txt column into one row per word, then groupby and count to count each (id, word) pair.
import pyspark.sql.functions as func
rdd = spark.sparkContext.parallelize([(1,'A B C'), (2, 'A B C'), (1,'A B C')])
df = rdd.toDF(['id', 'txt'])
df_agg = df.select('id', func.explode(func.split('txt', ' '))).\
    groupby(['id', 'col']).\
    count().\
    sort(['id', 'col'], ascending=True)
df_agg.rdd.map(lambda x:(str(x['id']) + '_' + x['col'], x['count'] )).collect()
Output
[('1_A', 2), ('1_B', 2), ('1_C', 2), ('2_A', 1), ('2_B', 1), ('2_C', 1)]
The snippet below should work:
rdd = sc.parallelize([(1,'A B C'), (2, 'A B C'), (1,'A B C')])
result = rdd \
    .map(lambda x: (x[0], x[1].split(' '))) \
    .flatMap(lambda x: ['%s_%s' % (x[0], y) for y in x[1]]) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: x + y)
result.collect()
Output
[('1_C', 2), ('1_B', 2), ('1_A', 2), ('2_A', 1), ('2_B', 1), ('2_C', 1)]
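To check the expected counts on a small sample without a cluster, the same split-and-count logic can be written with collections.Counter. This is only a local sketch, and the variable names are illustrative.

```python
from collections import Counter

records = [(1, 'A B C'), (2, 'A B C'), (1, 'A B C')]

# Build one "id_word" key per word, then count how often each key occurs
counts = Counter(
    '%s_%s' % (rec_id, word)
    for rec_id, txt in records
    for word in txt.split(' ')
)
result = sorted(counts.items())
print(result)
# [('1_A', 2), ('1_B', 2), ('1_C', 2), ('2_A', 1), ('2_B', 1), ('2_C', 1)]
```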

Reduce ResultIterable objects after groupByKey in PySpark

I'm working on temperature forecasting data using PySpark.
The raw temperature data is in the following format:
station;date;time;temperature;quality
102170;2012-11-01;06:00:00;6.8;G
102185;2012-11-02;06:00:00;5.8;G
102170;2013-11-01;18:00:00;2.8;G
102185;2013-11-01;18:00:00;7.8;G
The target result is the min/max temperature for each year, along with the station where it was recorded, like the following:
year;station;max_temp
2013;102185;7.8
2012;102170;6.8
My current code is as follows:
sc = SparkContext(appName="maxMin")
lines = sc.textFile('data/temperature-readings.csv')
lines = lines.map(lambda a: a.split(";"))
lines = lines.filter(lambda x: int(x[1][0:4]) >= 1950 and int(x[1][0:4]) <= 2014)
temperatures = lines.map(lambda x: (x[1][0:4], (x[0], float(x[3]))))
So far, the result is as follows:
temperatures.take(4)
(2012, (102170,6.8))
(2012, (102185,5.8))
(2013, (102170,2.8))
(2013, (102185,7.8))
After grouping by key, it becomes the following:
temperatures = temperatures.groupByKey()
temperatures.take(2)
[(u'2012', <pyspark.resultiterable.ResultIterable object at 0x2a0be50>),
(u'2013', <pyspark.resultiterable.ResultIterable object at 0x2a0bc50>)]
So, how can I reduce these ResultIterable objects to get only the element with the min or max temperature?
Just don't. Use reduceByKey:
# mapValues keeps the year as the key and turns each reading into a (min, max) pair
lines.map(lambda x: (x[1][0:4], (x[0], float(x[3])))).mapValues(lambda x: (x, x)) \
    .reduceByKey(lambda x, y: (
        min(x[0], y[0], key=lambda x: x[1]),
        max(x[1], y[1], key=lambda x: x[1])))
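The reduction keeps, for each year, a pair of (station, temperature) readings: the coldest seen so far and the warmest seen so far. A plain-Python sketch of that combining step on the sample rows from the question is given below. It runs without Spark, and the names combine and by_year are illustrative only.

```python
readings = [
    ('2012', ('102170', 6.8)),
    ('2012', ('102185', 5.8)),
    ('2013', ('102170', 2.8)),
    ('2013', ('102185', 7.8)),
]

def combine(x, y):
    """Merge two (min_reading, max_reading) pairs, comparing by temperature."""
    return (min(x[0], y[0], key=lambda r: r[1]),
            max(x[1], y[1], key=lambda r: r[1]))

by_year = {}
for year, reading in readings:
    pair = (reading, reading)  # a single reading is both the min and the max
    by_year[year] = combine(by_year[year], pair) if year in by_year else pair

print(by_year['2012'])  # (('102185', 5.8), ('102170', 6.8))
print(by_year['2013'])  # (('102170', 2.8), ('102185', 7.8))
```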

RDD operation in pyspark

I have two RDDs
x = [("XYZ",12),("ABC",15),("PQR",20)]
y = [("XY",100),("AB",200),("PQR",123),("MNO",111)]
I need a final result in which tuples are merged when the key in y is a substring of the key in x; for example, "XY" is a substring of "XYZ", so those two tuples become one. The result for the above example is as follows:
result = [("XYZ",12,100),("ABC",15,200),("PQR",20,123)]
Right now I have achieved this using a for loop.
Is there a better way to solve this?
I think cartesian join resolves your issue:
>>> x.cartesian(y).filter(lambda r: r[1][0] in r[0][0]) \
... .map(lambda r: (r[0][0], r[0][1], r[1][1])).collect()
[('XYZ', 12, 100), ('ABC', 15, 200), ('PQR', 20, 123)]
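The cartesian join compares every element of x with every element of y, so its cost grows with the product of the two sizes. The same filter-and-merge logic can be tried locally with itertools.product, as in this plain-Python sketch on the lists from the question:

```python
from itertools import product

x = [("XYZ", 12), ("ABC", 15), ("PQR", 20)]
y = [("XY", 100), ("AB", 200), ("PQR", 123), ("MNO", 111)]

# Pair every element of x with every element of y, keep substring matches
result = [
    (x_key, x_val, y_val)
    for (x_key, x_val), (y_key, y_val) in product(x, y)
    if y_key in x_key
]
print(result)
# [('XYZ', 12, 100), ('ABC', 15, 200), ('PQR', 20, 123)]
```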
