Find distinct values for each column in an RDD in PySpark - apache-spark

I have an RDD that is both very long (a few billion rows) and decently wide (a few hundred columns). I want to create sets of the unique values in each column (these sets don't need to be parallelized, as they will contain no more than 500 unique values per column).
Here is what I have so far:
data = sc.parallelize([["a", "one", "x"], ["b", "one", "y"], ["a", "two", "x"], ["c", "two", "x"]])
num_columns = len(data.first())
empty_sets = [set() for index in xrange(num_columns)]
d2 = data.aggregate((empty_sets), (lambda a, b: a.add(b)), (lambda x, y: x.union(y)))
What I am trying to do here is initialize a list of empty sets, one for each column in my RDD. For the first part of the aggregation, I want to iterate row by row through data, adding the value in column n to the nth set in my list of sets. If the value already exists, nothing happens. The second part then takes the union of the sets so that only distinct values are returned across all partitions.
When I try to run this code, I get the following error:
AttributeError: 'list' object has no attribute 'add'
I believe the issue is that I am not accurately making it clear that I am iterating through the list of sets (empty_sets) and that I am iterating through the columns of each row in data. I believe in (lambda a, b: a.add(b)) that a is empty_sets and b is data.first() (the entire row, not a single value). This obviously doesn't work, and isn't my intended aggregation.
How can I iterate through my list of sets, and through each row of my dataframe, to add each value to its corresponding set object?
The desired output would look like:
[set(['a', 'b', 'c']), set(['one', 'two']), set(['x', 'y'])]
P.S I've looked at this example here, which is extremely similar to my use case (it's where I got the idea to use aggregate in the first place). However, I find the code very difficult to convert into PySpark, and I'm very unclear what the case and zip code is doing.

There are two problems. One, both of your functions assume they are operating on a single set, but the accumulator is a list of sets. Two, add doesn't return anything (try a = set(); b = a.add('1'); print b), so even with a single set your first function would return None instead of the updated accumulator. To fix this, make your first function non-anonymous and have both of them loop over the lists of sets:
def set_plus_row(sets, row):
    for i in range(len(sets)):
        sets[i].add(row[i])
    return sets

unique_values_per_column = data.aggregate(
    empty_sets,
    set_plus_row,  # can't be lambda b/c add doesn't return anything
    lambda x, y: [a.union(b) for a, b in zip(x, y)]
)
I'm not sure what zip does in Scala, but in Python, it takes two lists and puts each corresponding element together into tuples (try x = [1, 2, 3]; y = ['a', 'b', 'c']; print zip(x, y);) so you can loop over two lists simultaneously.
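To see how the two functions cooperate without a Spark cluster, here is a minimal sketch in plain Python. The helper local_aggregate is a hypothetical stand-in for RDD.aggregate, not part of PySpark, and the two hard-coded partitions are an assumption about how the sample rows might be split. It hands each partition its own copy of the zero value, folds the rows with the first function, then merges the partial results with the second.

```python
import copy
from functools import reduce


def set_plus_row(sets, row):
    """First function: add each value of the row to the set for its column."""
    for column_set, value in zip(sets, row):
        column_set.add(value)
    return sets


def merge_sets(left, right):
    """Second function: union the per-column sets from two partial results."""
    return [a | b for a, b in zip(left, right)]


def local_aggregate(partitions, zero_value, seq_op, comb_op):
    """Hypothetical stand-in for RDD.aggregate, run on plain Python lists."""
    # Each partition starts from its own copy of the zero value.
    partials = [
        reduce(seq_op, partition, copy.deepcopy(zero_value))
        for partition in partitions
    ]
    # The partial results are then merged pairwise.
    return reduce(comb_op, partials, copy.deepcopy(zero_value))


# Assumed split of the sample rows into two partitions.
partitions = [
    [["a", "one", "x"], ["b", "one", "y"]],
    [["a", "two", "x"], ["c", "two", "x"]],
]
empty_sets = [set() for _ in range(3)]

unique_values_per_column = local_aggregate(
    partitions, empty_sets, set_plus_row, merge_sets
)
print(unique_values_per_column)
```

Because the sketch copies the zero value for each partition, empty_sets itself is left untouched.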

Related

How to loop through a list of tuples using a for loop?

I have a list that contains lists of tuples. Here is a sample of the data.
[ 144.91, 145.03, 145.1] [ 12964.0, 12818.0, 13441.0] [123.23, 152.45, 132.75] [12523.51, 12425.32, 12225.1] [122.22, 123.42, 120.21] [12444.43, 12232.22, 12111.12]
The list is structured as x and y pairs. For example, [ 144.91, 145.03, 145.1] is x1 and [ 12964.0, 12818.0, 13441.0] is y1. Then the sequence repeats, [123.23, 152.45, 132.75] is x2 and [12523.51, 12425.32, 12225.1] is y2.....
What I am trying to do is pass the x and y pairs into a for loop that inserts each x tuple and y tuple into its own column within a common row in SQL. However, this is where my problem occurs.
If I use the loop below, I can insert the entire x tuple and y tuple into a single row, but it repeats the same x tuple and y tuple in every row without iterating through them.
for i in range(len(hex_strings_list)):
    insert_query(x_tuple_values,y_tuple_values)
If I attempt to iterate through each list of tuples with this code, the for loop inserts each number within each tuple into its own row.
for i in range(len(hex_strings_list)):
    print(x_tuple_values[i],y_tuple_values[i])
I know that I want to iterate through the tuples rather than the individual numbers that make up each tuple, but I'm at a loss as to how to accomplish this. Total mental block!
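One possible reading of this question, sketched below, assumes the data is a single flat list in which x and y sequences alternate (x1, y1, x2, y2, ...). Under that assumption, slicing with a step of 2 separates the x sequences from the y sequences, and zip pairs them up again so that each loop iteration sees one whole x tuple and its y tuple. The insert_query defined here is a hypothetical stand-in that only records what it receives; the real function would run the SQL insert.

```python
# Sample data from the question, assumed to alternate x and y sequences.
data = [
    [144.91, 145.03, 145.1], [12964.0, 12818.0, 13441.0],
    [123.23, 152.45, 132.75], [12523.51, 12425.32, 12225.1],
    [122.22, 123.42, 120.21], [12444.43, 12232.22, 12111.12],
]

rows = []


def insert_query(x_values, y_values):
    """Hypothetical stand-in for the real SQL insert: records one row."""
    rows.append((tuple(x_values), tuple(y_values)))


# data[0::2] holds x1, x2, ...; data[1::2] holds y1, y2, ...
for x_values, y_values in zip(data[0::2], data[1::2]):
    insert_query(x_values, y_values)

print(len(rows))
```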

How to find match between two 2D lists in Python?

Let's say I have two 2D lists like this:
list1 = [ ['A', 5], ['X', 7], ['P', 3]]
list2 = [ ['B', 9], ['C', 5], ['A', 3]]
I want to compare these two lists and find where the 2nd item matches between them, e.g., here we can see that the numbers 5 and 3 appear in both lists. The first item is not relevant to the comparison.
How do I compare the lists and copy those values that appear in 2nd column of both lists? Using 'x in list' does not work since these are 2D lists. Do I create another copy of the lists with just the 2nd column copied across?
It is possible that this can be done using list comprehension but I am not sure about it so far.
There might be a duplicate for this but I have not found it yet.
The pursuit of one-liners is a futile exercise. They aren't always more efficient than the regular loopy way, and almost always less readable when you're writing anything more complicated than one or two nested loops. So let's get a multi-line solution first. Once we have a working solution, we can try to convert it to a one-liner.
Now the solution you shared in the comments works, but it doesn't handle duplicate elements and is O(n^2) because it contains a nested loop (see https://wiki.python.org/moin/TimeComplexity):
list_common = [x[1] for x in list1 for y in list2 if x[1] == y[1]]
A few key things to remember:
A single loop O(n) is better than a nested loop O(n^2).
Membership lookup in a set O(1) is much quicker than lookup in a list O(n).
Sets also get rid of duplicates for you.
Python includes set operations like union, intersection, etc.
Let's code something using these points:
# Create a set containing all numbers from list1
set1 = set(x[1] for x in list1)
# Create a set containing all numbers from list2
set2 = set(x[1] for x in list2)
# Intersection contains numbers in both sets
intersection = set1.intersection(set2)
# If you want, convert this to a list
list_common = list(intersection)
Now, to convert this to a one-liner:
list_common = list(set(x[1] for x in list1).intersection(x[1] for x in list2))
We don't need to explicitly convert x[1] for x in list2 to a set because set.intersection() accepts any iterable, including a generator expression, and handles the conversion internally.
This gives you the result in O(n) time, and also gets rid of duplicates in the process.
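A quick way to see the difference in duplicate handling is to run both versions on input that repeats a number. In the sketch below the ['Q', 5] and ['D', 5] entries are extra rows added purely for illustration; they are not part of the original lists. The result of the set version is sorted only to make the output order predictable.

```python
# The question's lists, plus one hypothetical duplicate entry in each.
list1 = [['A', 5], ['X', 7], ['P', 3], ['Q', 5]]
list2 = [['B', 9], ['C', 5], ['A', 3], ['D', 5]]

# Nested comprehension: one output item per matching (x, y) pair.
nested = [x[1] for x in list1 for y in list2 if x[1] == y[1]]

# Set intersection: each common number appears once.
via_sets = sorted(set(x[1] for x in list1).intersection(x[1] for x in list2))

print(nested)    # [5, 5, 3, 5, 5]
print(via_sets)  # [3, 5]
```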

How can I append a different element for each list in a column in pandas?

I have a dataframe, df, with lists in a specific column, col_a. For example,
df = pd.DataFrame()
df['col_a'] = [[1,2,3], [3,4], [5,6,7]]
I want to use conditions on these lists and apply specific modifications, including appends. For example, imagine that if the length of the list is > 2, I want to append another element, which is the sum of the last two elements of the current list. So, considering the first list above, I have [1, 2, 3] and I want to have [1, 2, 3, 5].
What I tried to do was:
df.loc[:, col_a] = df[col_a].apply(
    lambda value: value.append(value[-2]+value[-1])
    if len(value) > 1 else value)
But the result in that column is None for all the elements of the column.
Can someone help me, please?
Thank you very much in advance.
The issue is that append is an in-place method and returns None, so the lambda returns None as well. You need to add two lists together instead. A working example with a dummy variable would be:
df = pd.DataFrame({'cola':[[1,2],[2,3,4]], 'dum':[1,2]})
df['cola']=df.cola.apply(lambda x: (x+[sum(x[-2:])] if len(x)>2 else x))
If you prefer a named function over a lambda, try this:
def my_logic_for_list(values):
    if len(values) > 2:
        return values + [values[-2]+values[-1]]
    return values

df['new_a'] = df['col_a'].apply(my_logic_for_list)
Calling append directly in the lambda does not work here, because append returns None and that None becomes the lambda's result.
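The difference between the two approaches shows up even without pandas. In this sketch a plain list of lists stands in for the column (an assumption made so the example runs with no dependencies), and extend_with_sum is a hypothetical name for the same logic as above.

```python
# Plain-Python stand-in for the values of df['col_a'].
column = [[1, 2, 3], [3, 4], [5, 6, 7]]


def extend_with_sum(values):
    """Return a new list with the sum of the last two items appended."""
    if len(values) > 2:
        return values + [values[-2] + values[-1]]
    return values


# append mutates the (copied) list and returns None, so None is collected.
with_append = [list(values).append(0) for values in column]

# Concatenation builds and returns a new list, so the result is kept.
with_concat = [extend_with_sum(values) for values in column]

print(with_append)  # [None, None, None]
print(with_concat)  # [[1, 2, 3, 5], [3, 4], [5, 6, 7, 13]]
```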

setting an array element with a list

I'd like to create a numpy array with 3 columns (not really), the last of which will be a list of variable lengths (really).
N = 2
A = numpy.empty((N, 3))
for i in range(N):
    a = random.uniform(0, 1/2)
    b = random.uniform(1/2, 1)
    c = []
    A[i,] = [a, b, c]
Over the course of execution I will then append or remove items from the lists. I used numpy.empty to initialize the array since this is supposed to give an object type; even so, I'm getting the 'setting an array element with a sequence' error. I know I am; that's what I want to do.
Previous questions on this topic seem to be about avoiding the error; I need to circumvent the error. The real array has 1M+ rows, otherwise I'd consider a dictionary. Ideas?
Initialize A with
A = numpy.empty((N, 3), dtype=object)
per numpy.empty docs. This is more logical than A = numpy.empty((N, 3)).astype(object) which first creates an array of floats (default data type) and only then casts it to object type.

Two dictionary nested inside

I have nested dictionary like this:
dic = {'dic1': {'a': ..., 'b': ...}, 'dic2': {'a': ..., 'b': ...}, 'dic3': {'a': ..., 'b': ...}}
Each inner dictionary has many rows of data.
There are two problems:
1. I want to compare the values of 'a' in each nested dictionary against an HDF5 file containing two datasets, dataset1 and dataset2: if a value of 'a' exists in dataset1, I want to access the corresponding dataset2 values.
2. How do I access the 'b' information that corresponds to the 'a' data?
For the first part I'm using the following procedure, which never finishes, and for the second question I don't know how to access the b in the same tuple as a!
Does anybody have any clue how I can solve this?
for key, value in dict.items():
    for k,v in value.items():
        if 'a' in k:
            for t in entry[key][k]:
                if t in file['/dataset1']:
                    joint = file['/dataset2'][file['/dataset1'] == t]
You probably don't need the second loop, if your 'a' and 'b' keys are always present and known in advance (if not, you could add a test if 'a' in inner_dict and 'b' in inner_dict). Your test 'a' in k probably doesn't do what you expect (it's doing a substring match on an inner key string, which might give false positives if not all the keys are single characters).
Try something like this:
for outer_key, inner_dict in dic.items():
    for t in inner_dict['a']:
        if t in file['/dataset1']:
            joint = file['/dataset2'][file['/dataset1'] == t] # not sure this makes sense
            b_value = inner_dict['b']
            # I think you want to do something with b_value here, but I'm not sure what
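If the loop never seems to finish, one likely cause is that every t in file['/dataset1'] test rescans the dataset. A rough sketch of an alternative follows, under several assumptions that the question does not confirm: dataset1 and dataset2 are the same length and aligned element by element, the values in dataset1 are unique, and each inner 'a' and 'b' are parallel sequences. Plain lists stand in for the HDF5 datasets here, and all sample values are made up.

```python
# Hypothetical stand-ins for file['/dataset1'] and file['/dataset2'].
dataset1 = [10, 20, 30, 40]
dataset2 = ['w', 'x', 'y', 'z']

# Made-up nested dictionary; 'a' and 'b' are assumed to be parallel lists.
dic = {
    'dic1': {'a': [10, 99], 'b': ['b10', 'b99']},
    'dic2': {'a': [30], 'b': ['b30']},
}

# Build the lookup once: value in dataset1 -> aligned value in dataset2.
lookup = dict(zip(dataset1, dataset2))

matches = []
for outer_key, inner_dict in dic.items():
    # zip keeps each 'a' value together with its corresponding 'b' value.
    for a_value, b_value in zip(inner_dict['a'], inner_dict['b']):
        if a_value in lookup:  # dictionary membership, no rescan
            matches.append((outer_key, a_value, b_value, lookup[a_value]))

print(matches)
```

With real HDF5 data the two datasets would first have to be read into memory to build the lookup, which may or may not be practical depending on their size.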
