Iterating through values of a paired RDD (Pyspark) and replacing null values - apache-spark

I am collecting data using the Spark RDD API and have created a paired RDD, as shown below:
spark = SparkSession.builder.master('local').appName('app').getOrCreate()
sc = spark.sparkContext
raw_rdd = sc.textFile("data.csv")
paired_rdd = raw_rdd \
    .map(lambda x: x.split(",")) \
    .map(lambda x: (x[2], [x[1], x[3], x[5]]))
Here is a sample excerpt of the paired RDD:
[('VXIO456XLBB630221', ['I', 'Nissan', '2003']),
('VXIO456XLBB630221', ['A', '', '']),
('VXIO456XLBB630221', ['R', '', '']),
('VXIO456XLBB630221', ['R', '', ''])]
As you can see, the keys in this paired RDD are the same for all elements, but only one element has all of its fields filled in. What we want to accomplish is to replace the empty fields with the values from that complete element, so we would have an expected output like this:
[('VXIO456XLBB630221', ['I', 'Nissan', '2003']),
('VXIO456XLBB630221', ['A', 'Nissan', '2003']),
('VXIO456XLBB630221', ['R', 'Nissan', '2003']),
('VXIO456XLBB630221', ['R', 'Nissan', '2003'])]
I know the first step would be to do a groupByKey, i.e.,
paired_rdd.groupByKey().map(lambda kv: ____)
I am just not sure how to iterate through the values and how this would fit into one lambda function.

The best way would probably be to go with DataFrames and window functions. With RDDs, you could also work something out with an aggregation (reduceByKey) that fills in the blanks while keeping in memory the list of first elements. Then we can re-flatten based on that memory to create the same number of rows as before, but with the values filled in.
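For reference, here is a minimal sketch of the DataFrame route, assuming the raw rows have been loaded into a DataFrame df with columns vin, code, make, and year (illustrative names, not from the question):

from pyspark.sql import Window, functions as F

# group all rows sharing a VIN into one window
w = Window.partitionBy("vin")

# F.when(...) without .otherwise() maps empty strings to null,
# and F.max(...) ignores nulls, so every row picks up the one
# non-empty value present in its partition
df_filled = (df
    .withColumn("make", F.max(F.when(F.col("make") != "", F.col("make"))).over(w))
    .withColumn("year", F.max(F.when(F.col("year") != "", F.col("year"))).over(w)))

The RDD version the question asks about follows below.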
# let's define a function that selects the non-empty value between two strings
def select_value(a, b):
    if a is None or len(a) == 0:
        return b
    else:
        return a

# let's use mapValues to separate the first element of the list from the rest.
# Then we use reduceByKey to aggregate the list of all first elements (first
# element of the tuple). For the other elements, we only keep non-empty values
# (second element of the tuple).
# Finally, we use flatMapValues to recreate the rows based on the memorized
# first elements of the lists.
paired_rdd\
    .mapValues(lambda x: ([x[0]], x[1:]))\
    .reduceByKey(lambda a, b: (
        a[0] + b[0],
        [select_value(a[1][i], b[1][i]) for i in range(len(a[1]))]
    ))\
    .flatMapValues(lambda x: [[k] + x[1] for k in x[0]])\
    .collect()
Which yields:
[('VXIO456XLBB630221', ['I', 'Nissan', '2003']),
('VXIO456XLBB630221', ['A', 'Nissan', '2003']),
('VXIO456XLBB630221', ['R', 'Nissan', '2003']),
('VXIO456XLBB630221', ['R', 'Nissan', '2003'])
]
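For completeness, here is a hedged sketch of the groupByKey route the question started down. It assumes every key has at least one record with all fields filled in; fill_group is an illustrative name:

def fill_group(values):
    vals = list(values)
    # pick the one record whose trailing fields are all non-empty
    complete = next(v for v in vals if all(v[1:]))
    # rebuild every record, keeping its own first field
    return [[v[0]] + complete[1:] for v in vals]

paired_rdd.groupByKey().flatMapValues(fill_group).collect()

Note that groupByKey materializes each group on one executor, so the reduceByKey version above is generally preferable for large groups.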

Related

Apparently empty groups generated with itertools.groupby

I am having some trouble with groupby from itertools:
from itertools import groupby
for k, grp in groupby("aahfffddssssnnb"):
    print(k, list(grp), list(grp))
output is:
a ['a', 'a'] []
h ['h'] []
f ['f', 'f', 'f'] []
d ['d', 'd'] []
s ['s', 's', 's', 's'] []
n ['n', 'n'] []
b ['b'] []
It works as expected.
itertools._grouper objects seem to be readable only once (maybe they are iterators?),
but:
li = [grp for k, grp in groupby("aahfffddssssnnb")]
list(li[0])
[]
list(li[1])
[]
They all seem to be empty ... I don't understand why?
This one works:
["".join(grp) for k, grp in groupby("aahfffddssssnnb")]
['aa', 'h', 'fff', 'dd', 'ssss', 'nn', 'b']
I am using version 3.9.9
I already asked this question on the comp.lang.python newsgroup without getting any answers.
grp is a sub-iterator over the same major iterator given to groupby. A new one is created for every key.
When you skip to the next key, the old grp is no longer available, because you have advanced the main iterator beyond the current group.
It is stated clearly in the Python documentation:
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list:
groups = []
uniquekeys = []
data = sorted(data, key=keyfunc)
for k, g in groupby(data, keyfunc):
    groups.append(list(g))      # Store group iterator as a list
    uniquekeys.append(k)
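Applied to the example above, materializing each group inside the comprehension, before groupby advances past it, gives the expected result:

from itertools import groupby

li = [list(grp) for k, grp in groupby("aahfffddssssnnb")]
print(li[0])  # ['a', 'a']
print(li[1])  # ['h']

This is also why the "".join(grp) version works: it consumes each group while it is still the current one.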

Find cartesian product of the elements in a program generated dynamic "sub-list"

I have a program which produces and modifies a list of "n" elements/members, with n remaining constant throughout a particular run of the program (the value of "n" might change in the next run).
Each member of the list is a "sub-list"! These sub-lists are not only of variable lengths, but are also dynamic and may keep changing while the program runs.
So, eventually, at some given point, my list would look something like (assuming n=3):
[['1', '2'], ['a', 'b', 'c', 'd'], ['x', 'y', 'z']]
I want the output to be like the following:
['1ax', '1ay', '1az', '1bx', '1by', '1bz',
'1cx', '1cy', '1cz', '1dx', '1dy', '1dz',
'2ax', '2ay', '2az', '2bx', '2by', '2bz',
'2cx', '2cy', '2cz', '2dx', '2dy', '2dz']
i.e. a list with exactly (2 * 3 * 4) elements where each element is of length exactly 3 and has exactly 1 member from each of the "sub-lists".
Easiest is itertools.product:
from itertools import product
lst = [['1', '2'], ['a', 'b', 'c', 'd'], ['x', 'y', 'z']]
output = [''.join(p) for p in product(*lst)]
# OR
output = list(map(''.join, product(*lst)))
# ['1ax', '1ay', '1az', '1bx', '1by', '1bz',
# '1cx', '1cy', '1cz', '1dx', '1dy', '1dz',
# '2ax', '2ay', '2az', '2bx', '2by', '2bz',
# '2cx', '2cy', '2cz', '2dx', '2dy', '2dz']
A manual implementation specific to strings could look like this:
def prod(*pools):
    if pools:
        *rest, pool = pools
        for p in prod(*rest):
            for el in pool:
                yield p + el
    else:
        yield ""
list(prod(*lst))
# ['1ax', '1ay', '1az', '1bx', '1by', '1bz',
# '1cx', '1cy', '1cz', '1dx', '1dy', '1dz',
# '2ax', '2ay', '2az', '2bx', '2by', '2bz',
# '2cx', '2cy', '2cz', '2dx', '2dy', '2dz']

Remove redundant sublists within list in python

Hello everyone, I have a list of lists such as:
list_of_values=[['A','B'],['A','B','C'],['D','E'],['A','C'],['I','J','K','L','M'],['J','M']]
and I would like to keep, within that list, only the lists that are not fully contained in a longer list.
For instance, in sublist1 ['A','B'], A and B are also present in sublist2 ['A','B','C'], so I remove sublist1.
The same goes for sublist4.
Sublist6 is also removed because J and M are present in the longer sublist5.
At the end I should get:
list_of_no_redundant_values=[['A','B','C'],['D','E'],['I','J','K','L','M']]
Another example:
list_of_values=[['A','B'],['A','B','C'],['B','E'],['A','C'],['I','J','K','L','M'],['J','M']]
Expected output:
[['A','B','C'],['B','E'],['I','J','K','L','M']]
Does someone have an idea?
mylist = [['A','B'], ['A','C'], ['A','B','C'], ['D','E'], ['I','J','K','L','M'], ['J','M']]

def remove_subsets(lists):
    outlists = lists[:]
    for s1 in lists:
        for s2 in lists:
            if set(s1).issubset(set(s2)) and (s1 is not s2):
                outlists.remove(s1)
                break
    return outlists

print(remove_subsets(mylist))
This should result in [['A', 'B', 'C'], ['D', 'E'], ['I', 'J', 'K', 'L', 'M']]
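As a side note, the same maximal-subset filter can be written as a comprehension, which sidesteps list.remove entirely. This is a sketch, not part of the original answer, and keep_maximal is an illustrative name:

def keep_maximal(lists):
    sets = [set(s) for s in lists]
    # keep a list only if its set is not a proper subset of any other set
    return [lst for lst, s1 in zip(lists, sets)
            if not any(s1 < s2 for s2 in sets)]

print(keep_maximal(mylist))  # same result as above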

Export Python dict of nested lists of varying lengths to csv. If nested list has > 1 entry, expand to column before moving to next key

I have the following dictionary of lists
d = {1: ['1','B1',['C1','C2','C3']], 2: ['2','B2','C15','D12'], 3: ['3','B3'], 4: ['4', 'B4', 'C4', ['D1', 'D2']]}
writing that to a csv using
with open('test.csv', "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(d.values())
gives me a csv that looks like
A B C D
1 B1 ['C1','C2','C3']
2 B2 C15 D12
3 B3
4 B4 C4 ['D1','D2']
If there is a multiple item list in the value (nested list?), I would like that list to be expanded down the column like this
A B C D
1 B1 C1
1 C2
1 C3
2 B2 C15 D12
3 B3
4 B4 C4 D1
4 D2
I'm fairly new to Python and can't seem to figure out a way to do what I need after a few days of sifting through forums and banging my head against the wall. I think I may need to break apart the nested lists, but I need to keep them tied to their respective "A" value. Columns A and B will always have 1 entry; columns C and D can have 1 to X number of entries.
Any help is much appreciated
Seems like it might be easier to make a list of lists, with appropriately-located empty spaces, than what you're doing. Here's something that might do:
import csv
from itertools import zip_longest

def condense(dct):
    # get the maximum number of columns of any list
    num_cols = len(max(dct.values(), key=len)) - 1
    # Ignore the key, it's not really relevant.
    for _, v in dct.items():
        # first, memorize the index of this list,
        # since we need to repeat it no matter what
        idx = v[0]
        # next, use zip_longest to make a correspondence.
        # We deliberately build a 2D list, from which we will
        # later withdraw elements one by one.
        matrix = [([] if elem is None else
                   [elem] if not isinstance(elem, list) else
                   elem[:]  # soft copy to avoid altering the original dict
                   ) for elem, _ in zip_longest(v[1:], range(num_cols), fillvalue=None)]
        # Now, output the top row of the matrix for as long as any column
        # still has contents.
        while any(matrix):
            # If a column of the matrix is empty, emit an empty string.
            # Otherwise, pop the column's first element, progressively
            # emptying the matrix top-to-bottom as rows are output.
            # *-notation is more convenient than concatenating the two lists.
            yield [idx, *((col.pop(0) if col else '') for col in matrix)]
        # e.g. for key 0 and a matrix that looks like this:
        # [['a1', 'a2'],
        #  ['b1'],
        #  ['c1', 'c2', 'c3']]
        # this would yield the following three lists before moving on:
        # ['0', 'a1', 'b1', 'c1']
        # ['0', 'a2', '',   'c2']
        # ['0', '',   '',   'c3']
        # where '' parses into an empty column in the resulting CSV.
The biggest thing to note here is that I use isinstance(elem, list) as a shorthand to check whether the thing is a list (which you need to be able to do, one way or another, to flatten or pad the values as we do here). If you have more complicated or more varied data structures, you'll need to improvise with this check: maybe write a helper function isiterable() that tries to iterate through the object and returns a boolean based on whether doing so produced an error.
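A minimal sketch of such a helper (isiterable is an illustrative name, and strings are deliberately treated as scalars even though they are technically iterable):

def isiterable(obj):
    # strings would satisfy iter(), but we want them kept as single cells
    if isinstance(obj, str):
        return False
    try:
        iter(obj)  # raises TypeError for non-iterable objects
        return True
    except TypeError:
        return False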
That done, we can call condense() on d and have the csv module deal with the output.
headers = ['A', 'B', 'C', 'D']
d = {1: ['1','B1',['C1','C2','C3']], 2: ['2','B2','C15','D12'], 3: ['3','B3'], 4: ['4', 'B4', 'C4', ['D1', 'D2']]}
# condense(d) produces
# [['1', 'B1', 'C1', '' ],
# ['1', '', 'C2', '' ],
# ['1', '', 'C3', '' ],
# ['2', 'B2', 'C15', 'D12'],
# ['3', 'B3', '', '' ],
# ['4', 'B4', 'C4', 'D1' ],
# ['4', '', '', 'D2' ]]
with open('test.csv', "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(condense(d))
Which produces the following file:
A,B,C,D
1,B1,C1,
1,,C2,
1,,C3,
2,B2,C15,D12
3,B3,,
4,B4,C4,D1
4,,,D2
This is equivalent to your expected output. Hopefully the solution is sufficiently extensible for you to apply it to your non-MVCE problem.

PySpark - Convert an RDD into a key value pair RDD, with the values being in a List

I have an RDD with tuples being in the form:
[("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2"), ...
What I want is to transform that into a key-value pair RDD, where the first field will be the first string (key) and the second field a list of strings (value), i.e. I want to turn it to the form:
[("a1",["b1","c1","d1","e1"]), ("a2",["b2","c2","d2","e2"]), ...
>>> rdd = sc.parallelize([("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2")])
>>> result = rdd.map(lambda x: (x[0], list(x[1:])))
>>> print(result.collect())
[('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]
Explanation of lambda x: (x[0], list(x[1:])):
x[0] makes the first element of the input the first element of the output.
x[1:] puts all the elements except the first one into the second element.
list(x[1:]) forces that to be a list, because the default would be a tuple.
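As an aside, the same split can be written with iterable unpacking, which produces a list for the tail natively. This is a sketch; to_pair is an illustrative name:

def to_pair(row):
    key, *rest = row  # extended unpacking: rest is already a list
    return (key, rest)

result = rdd.map(to_pair)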
