Reduce key, value pair based on similarity of their value in PySpark


I am a beginner in PySpark.
I want to find the pairs of letters that share the same number as their value, and then find out which pair of letters appears most often.
Here is my data:
data = sc.parallelize([('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4), ('f', 1), ('b', 5), ('d', 5)])
data.collect()
[('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4), ('f', 1), ('b', 5), ('d', 5)]
The result I want would look like this:
1: a, f
4: b, d
4: b, e
4: d, e
10: c
5: b, d
I have tried the following:
data1= data.map(lambda y: (y[1], y[0]))
data1.collect()
[(1, 'a'), (4, 'b'), (10, 'c'), (4, 'd'), (4, 'e'), (1, 'f'), (5, 'b'), (5, 'd')]
data1.groupByKey().mapValues(list).collect()
[(10, ['c']), (4, ['b', 'd', 'e']), (1, ['a', 'f']), (5, ['b', 'd'])]
As I said, I am very new to PySpark. I searched for the right command but was not successful. Could anyone please help me with this?

You can use flatMap with Python's itertools.combinations to get every combination of 2 from the grouped values. Also, prefer reduceByKey over groupByKey:
from itertools import combinations
result = data.map(lambda x: (x[1], [x[0]])) \
    .reduceByKey(lambda a, b: a + b) \
    .flatMap(lambda x: [(x[0], p) for p in combinations(x[1], 2 if (len(x[1]) > 1) else 1)])
result.collect()
#[(1, ('a', 'f')), (10, ('c',)), (4, ('b', 'd')), (4, ('b', 'e')), (4, ('d', 'e')), (5, ('b', 'd'))]
If you would rather pair a lone element with None when a group has only one element, replace the flatMap step with:
.flatMap(lambda x: [(x[0], p) for p in combinations(x[1] if len(x[1]) > 1 else x[1] + [None], 2)])
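The question also asks which pair of letters appears most often, which the snippet above does not cover. Here is a minimal plain-Python sketch of the same grouping logic followed by a count, assuming the data fits in memory; the names groups, pairs and pair_counts are illustrative and not part of the original answer. In Spark, the same count could presumably be obtained by mapping each pair to (pair, 1) and applying reduceByKey.

```python
from collections import Counter, defaultdict
from itertools import combinations

data = [('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4), ('f', 1), ('b', 5), ('d', 5)]

# Group letters by their number (a local stand-in for the reduceByKey step).
groups = defaultdict(list)
for letter, number in data:
    groups[number].append(letter)

# Emit every 2-combination per number; single-letter groups yield no pair here.
# Sorting the letters makes ('b', 'd') and ('d', 'b') count as the same pair.
pairs = [(number, pair)
         for number, letters in groups.items()
         for pair in combinations(sorted(letters), 2)]

# Count how often each pair occurs across all numbers.
pair_counts = Counter(pair for _, pair in pairs)
most_common_pair, count = pair_counts.most_common(1)[0]
print(most_common_pair, count)
```

With the sample data, the pair ('b', 'd') occurs under both 4 and 5, so it is the most frequent one.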

Related

How to pair up elements using permutations in an rdd list of lists

I am trying to get a list of paired elements from each list of lists in rdd.
My data :
[['a','b','c'],['e','f','g','h'],['x','y','z']]
I want :
[('a','b'),('b','c'),('c','a'),('e','f'),('f','g'),('g','h'),('e','g'),('e','h')...... and all possible pairs]

>>> data = [[['a','b','c'],['e','f','g','h'],['x','y','z']]]
>>> df = spark.sparkContext.parallelize(data)
>>> import itertools
>>> df.map(lambda x: [list(itertools.combinations(i,2)) for i in x]).map(lambda x: list(itertools.chain.from_iterable(x))).foreach(lambda y: print(y))
result:
[('a', 'b'), ('a', 'c'), ('b', 'c'), ('e', 'f'), ('e', 'g'), ('e', 'h'), ('f', 'g'), ('f', 'h'), ('g', 'h'), ('x', 'y'), ('x', 'z'), ('y', 'z')]

Compare 2 lists of tuples with the same size: compare and swap

I made two lists of tuples.
I want to use the letter and the counter (the second and third elements) to compare both lists. Each tuple of su belongs at the index of the matching tuple in tu. For example, tuple 0 of tu is (40, 'b', 1), and tuple 4 of su has the same 'b', 1, so
tuple 4 of su should go to index 0, and so on.
su = [(30, 'a', 1), (1, 'b', 0), (4, 'a', 0), (17, 'c', 0), (8, 'b', 1)]
tu = [(40, 'b', 1), (9, 'c', 0), (3, 'b', 0), (11, 'a', 0), (12, 'a', 1)]
for i, (s, t) in enumerate(zip(su, tu)):
    if t[1] == 'H':
        print(f" 'H' {i}")
My desired final list is su_new = [(8, 'b', 1), (17, 'c', 0), (1, 'b', 0), (4, 'a', 0), (30, 'a', 1)]
For comparison, here is the same list without the counter:
[(8, 'b'), (17, 'c'), (1, 'b'), (4, 'a'), (30, 'a')]

This works:
from copy import copy
su = [(30, 'a', 1), (1, 'b', 0), (4, 'a', 0), (17, 'c', 0), (8, 'b', 1)]
tu = [(40, 'b', 1), (9, 'c', 0), (3, 'b', 0), (11, 'a', 0), (12, 'a', 1)]
index_dic = {}
for i, tup in enumerate(tu):
    index_dic[tup[1:]] = i
new_su = copy(su)
for tup in su:
    new_index = index_dic[tup[1:]]
    new_su[new_index] = tup
print(new_su)
#[(8, 'b', 1), (17, 'c', 0), (1, 'b', 0), (4, 'a', 0), (30, 'a', 1)]
Alternatively, the index_dic can be constructed as a dictionary comprehension:
index_dic = {tup[1:]:i for i, tup in enumerate(tu)}
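The same reordering can also be sketched the other way around: index su by its (letter, counter) part and then walk tu in order. This is an alternative under the assumption that every (letter, counter) pair is unique within each list and present in both; the name lookup is illustrative.

```python
su = [(30, 'a', 1), (1, 'b', 0), (4, 'a', 0), (17, 'c', 0), (8, 'b', 1)]
tu = [(40, 'b', 1), (9, 'c', 0), (3, 'b', 0), (11, 'a', 0), (12, 'a', 1)]

# Map each (letter, counter) key to the full tuple from su.
lookup = {tup[1:]: tup for tup in su}

# Walk tu in order and pick the su tuple with the matching key.
# A missing key raises KeyError, which surfaces mismatched inputs early.
new_su = [lookup[tup[1:]] for tup in tu]
print(new_su)
```

This avoids the copy and the in-place index assignment, at the cost of building a new list.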

How to sort and remove tuples with same first element and only keeping the first occurrence

Suppose we have a list of tuples ListeT:
ListeT=[('a', 1), ('x',1) , ('b', 1), ('b', 1), ('a', 2), ('a',3), ('c', 6), ('c', 5),('e', 6), ('d', 7),('b', 2)] and I want to get the following result:
Result = [('a', 1), ('x',1) , ('b', 1), ('c', 5), ('c', 6), ('e', 6), ('d', 7)]
1- I want to order this list according to the second element of the tuples.
2- I want to remove duplicates and keep only the first occurrence of tuples that share the same first element. For instance, if I have ('a', 1) and ('a', 2), I want to keep only ('a', 1).
For the sorting I used res=sorted(ListeT, key=lambda x: x[1], reverse= True) and it worked.
But for the duplicates I couldn't find a good solution:
I can remove duplicate elements by converting the list to a set (set(ListeT)) or by using
numpy.unique(ListeT, axis=0). However, this only removes tuples that are identical as a whole, and I also want to remove tuples that share the same first element, keeping only the first occurrence.
Thank you.

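A dict can take care of the uniqueness if you feed it the tuples in reverse-sorted order: later entries overwrite earlier ones, so the smallest value for each key is the one that survives.
I did not understand whether you need the outer sort; if not, you can replace it with list().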
sorted(dict(sorted(ListeT, key=lambda x: x[1], reverse= True)).items(), key=lambda x: x[1])
[('a', 1), ('b', 1), ('x', 1), ('c', 5), ('e', 6), ('d', 7)]
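If the dict trick feels too implicit, a more explicit sketch is to sort ascending by the second element and keep the first tuple seen for each first element. The names seen and result are illustrative, and this assumes "first occurrence" means first in the sorted order.

```python
ListeT = [('a', 1), ('x', 1), ('b', 1), ('b', 1), ('a', 2), ('a', 3),
          ('c', 6), ('c', 5), ('e', 6), ('d', 7), ('b', 2)]

seen = set()      # first elements already kept
result = []
# sorted() is stable, so ties on the second element keep their original order.
for key, value in sorted(ListeT, key=lambda x: x[1]):
    if key not in seen:
        seen.add(key)
        result.append((key, value))

print(result)
# [('a', 1), ('x', 1), ('b', 1), ('c', 5), ('e', 6), ('d', 7)]
```

Unlike the dict version, ties keep the order they had in the input, so ('x', 1) stays ahead of ('b', 1).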

Pyspark: Applying reduce by key to the values of an rdd

After some transformations I have ended up with an rdd with the following format:
[(0, [('a', 1), ('b', 1), ('b', 1), ('b', 1)]),
 (1, [('c', 1), ('d', 1), ('h', 1), ('h', 1)])]
I can't figure out how to essentially "reduceByKey()" on the values portion of this rdd.
This is what I'd like to achieve:
[(0, [('a', 1), ('b', 3)]),
 (1, [('c', 1), ('d', 1), ('h', 2)])]
I was originally using .values() then applying reduceByKey to the result of that but then I end up losing my original key (in this case 0 or 1).

You lose the original key because .values() returns only the value part of each key-value row. Instead, sum the tuples inside each row with map, so the key stays attached:
from collections import defaultdict
def sum_row(row):
    result = defaultdict(int)
    for key, val in row[1]:
        result[key] += val
    return (row[0],list(result.items()))
data_rdd = data_rdd.map(sum_row)
print(data_rdd.collect())
# [(0, [('a', 1), ('b', 3)]), (1, [('h', 2), ('c', 1), ('d', 1)])]

Although values() gives you an RDD, reduceByKey works across all the values in that RDD, not row by row.
You can also use itertools.groupby to achieve the same result; it only groups consecutive items, so the list must be sorted first:
from itertools import groupby
distdata.map(lambda x: (x[0], [(a, sum(c[1] for c in b)) for a,b in groupby(sorted(x[1]), key=lambda p: p[0]) ])).collect()
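To see what the groupby approach does to a single row, here is a small local sketch that applies the same logic to plain Python lists, without Spark. The function name merge_counts and the rows list are illustrative; in Spark the function would presumably be passed to map on the RDD.

```python
from itertools import groupby

def merge_counts(row):
    """Sum the counts of tuples that share the same first element within one row."""
    key, pairs = row
    # groupby only merges consecutive items, so sort by name first.
    ordered = sorted(pairs, key=lambda p: p[0])
    merged = [(name, sum(count for _, count in group))
              for name, group in groupby(ordered, key=lambda p: p[0])]
    return key, merged

rows = [(0, [('a', 1), ('b', 1), ('b', 1), ('b', 1)]),
        (1, [('c', 1), ('d', 1), ('h', 1), ('h', 1)])]
merged_rows = [merge_counts(row) for row in rows]
print(merged_rows)
# [(0, [('a', 1), ('b', 3)]), (1, [('c', 1), ('d', 1), ('h', 2)])]
```

Because of the sort, the inner lists come back ordered by name, which also makes the output deterministic.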

After sorting a zip object, why is list() of the object empty?

I zip two lists into one, then use the sorted function to order it.
But after that, calling list() on the zip object gives an empty list []:
[(11, 'a'), (1, 'b'), (15, 'c'), (2, 'd'), (3, 'e'), (19, 'f'), (12, 'g'), (23, 'h'), (5, 'i'), (14, 'j'), (21, 'k'), (9, 'l'), (8, 'm'), (22, 'n'), (20, 'o'), (0, 'p'), (6, 'q'), (25, 'r'), (13, 's'), (10, 't'), (18, 'u'), (17, 'v'), (4, 'w'), (24, 'x'), (16, 'y'), (7, 'z')]
[]
import random
eng=[ chr(i) for i in range(ord('a'),ord('a')+26,1)]
enum_eng=zip(list(range(len(eng))),eng)
random.shuffle(eng)
enum_eng=zip(list(range(len(eng))),eng)
print(sorted(enum_eng,key=lambda x : x[1]))
print(list(enum_eng))
I want to compare the zipped list before and after sorting.
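A likely explanation: in Python 3, zip returns a one-shot iterator, so sorted() consumes it and nothing is left for the later list() call. A minimal sketch of one way around this, assuming it is acceptable to materialize the pairs once as a list; the names one_shot and sorted_pairs are illustrative.

```python
import random

eng = [chr(i) for i in range(ord('a'), ord('a') + 26)]
random.shuffle(eng)

# A zip object can only be iterated once.
one_shot = zip(range(len(eng)), eng)
sorted(one_shot, key=lambda x: x[1])   # consumes the iterator
leftover = list(one_shot)              # nothing left, so this is []

# Materialize the pairs as a list so they can be reused.
enum_eng = list(zip(range(len(eng)), eng))
sorted_pairs = sorted(enum_eng, key=lambda x: x[1])

print(enum_eng)       # pairs in shuffled order
print(sorted_pairs)   # same pairs ordered by letter
```

Since sorted() returns a new list and leaves its input alone, enum_eng still holds the pre-sort order for comparison.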
