Remove duplicate tuple pairs from PySpark RDD


I am given an RDD. Example:
test = sc.parallelize([(1,0), (2,0), (3,0)])
I need to get the Cartesian product and remove resulting tuple pairs that have duplicate entries.
In this toy example these would be ((1, 0), (1, 0)), ((2, 0), (2, 0)), ((3, 0), (3, 0)).
I can get the Cartesian product as follows. NOTE: the collect and print statements are there ONLY for troubleshooting.
def compute_cartesian(rdd):
    result1 = sc.parallelize(sorted(rdd.cartesian(rdd).collect()))
    print(type(result1))
    print(result1.collect())
My type and output at this stage are correct:
<class 'pyspark.rdd.RDD'>
[((1, 0), (1, 0)), ((1, 0), (2, 0)), ((1, 0), (3, 0)), ((2, 0), (1, 0)), ((2, 0), (2, 0)), ((2, 0), (3, 0)), ((3, 0), (1, 0)), ((3, 0), (2, 0)), ((3, 0), (3, 0))]
But now I need to remove the three pairs of tuples with duplicate entries.
Tried so far:
.distinct() This runs but does not produce a correct resulting rdd.
.dropDuplicates() Will not run. I assume this is an incorrect usage of .dropDuplicates().
Manual function:
Without an RDD this task is easy.
# Remove duplicates
# Iterate over a copy so that removing items does not skip elements
for elem in list(result):
    if elem[0] == elem[1]:
        result.remove(elem)
print(result)
print("After: ", len(result))
This was a function I wrote that removes duplicate tuple pairs and then spits out the resulting len so I could do a sanity check.
I am just not sure how to directly perform actions on the RDD, in this case remove any duplicate tuple pairs resulting from the Cartesian product, and return an RDD.
Yes, I can .collect() it, perform the operation, and then re-type it as an RDD, but that defeats the purpose. Suppose this was billions of pairs. I need to perform the operations on the rdd and return an rdd.

You can use filter to remove the pairs that you don't want:
rdd.cartesian(rdd).filter(lambda x: x[0] != x[1])
Note that I would not call those pairs "duplicate pairs", but rather "pairs of duplicates" or even better, "diagonal pairs": they correspond to the diagonal if you visualize the Cartesian product geometrically.
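This is why distinct and dropDuplicates are not appropriate here: they remove repeated elements, which is not what you want. For instance, applying distinct to an RDD holding [1, 1, 2] gives [1, 2]. (dropDuplicates is a DataFrame method, which is why it does not run on an RDD at all.)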
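To make the difference concrete without a Spark cluster, here is a minimal local sketch. It assumes plain Python lists as a stand-in for the RDD: itertools.product plays the role of rdd.cartesian(rdd), and a list comprehension plays the role of filter. The names points, pairs, off_diagonal and deduplicated are made up for the illustration.

```python
from itertools import product

points = [(1, 0), (2, 0), (3, 0)]

# Local stand-in for rdd.cartesian(rdd): every ordered pair of points.
pairs = list(product(points, points))

# Same predicate as the filter above: keep pairs whose two entries differ.
off_diagonal = [pair for pair in pairs if pair[0] != pair[1]]

# What a distinct-style operation does instead: drop repeated elements.
deduplicated = list(dict.fromkeys([1, 1, 2]))

print(len(pairs), len(off_diagonal), deduplicated)
```

All nine pairs are already distinct from one another, so a distinct-style operation leaves them untouched, while the predicate removes the three diagonal pairs.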

Related

Condense list of nested tuples

I have an assignment that I have successfully solved using defaultdict(list).
In a nutshell, take two pairs of points (Ax, Ay) and (Bx, By) and compute the slope.
Then combine all points that have the same slope together.
Using defaultdict(list) I did this:
dic = defaultdict(list)
for elem in result:
    x1 = elem[0][0]
    y1 = elem[0][1]
    x2 = elem[1][0]
    y2 = elem[1][1]
    si = slope_intercept(x1, y1, x2, y2)
    temp = defaultdict(list)
    temp[si].append(elem)
    FullMergeDict(dic, temp)
    temp.clear()
Works perfectly. (Yes, there's a lot more to the whole program not shown.)
However, I am being told to discard defaultdict(list) and that I must use a nested tuple based structure.
I have a list of tuples where the structure looks like: (((1, 2), 3), (2, 5))
(1, 2) is the first coordinate point
3 is the computed slope
(2, 5) is the second coordinate point
NOTE: These are just made up values to illustrate structure. The points almost certainly will not
generate the shown slopes.
If I start with this:
start = [(((1, 2), 3), (2, 5)), (((4, 5), 2), (3, 7)), (((2, 4), 1), (8, 9)), (((1, 2), 3), (4, 8))]
I need to end up with this:
end = [((1, 2), (2, 5), (1, 2), (4, 8)), ((4, 5), (3, 7)), ((2, 4), (8, 9))]
For every unique slope, I need a tuple of all the coordinates that share that same slope.
In the above example, the first and last tuples shared the same slope, 3, so all pairs of coordinates
with slope 3 are combined into one tuple. Yes I realize that (1, 2) is represented twice in my example. If there was another set of coordinates with slope 3, then the first tuple would contain
those additional coordinates, including duplicates. Note the embedded slope from 'start' is discarded.
defaultdict(list) made this quite straightforward. I made the key the slope and then merged the values (coordinates).
I can't seem to work through how to transform 'start' into 'end' using this required structure.

I'm not sure what you mean by "I must use the structure detailed above". You have start, you want end, so at some point there is a change to the structure. Do you mean that you are not allowed to use a dictionary or a list at all? How does your instructor expect that you go from start to end without using anything else? Here's an approach that uses only tuples (and the start and end lists).
end will be a list of tuples. We'll keep track of the slopes in a separate list, lookup, where lookup[i] is the slope of the tuple stored at end[i]. Expect lookup and end to look like so:
lookup = [slope_1, slope_2, ...]
end = [((p1_x, p1_y), (p2_x, p2_y), ...), ((p10_x, p10_y), (p11_x, p11_y)), ...]
start = [(((1, 2), 3), (2, 5)), (((4, 5), 2), (3, 7)), (((2, 4), 1), (8, 9)), (((1, 2), 3), (4, 8))]
end = []
lookup = []

def find_tuple_index_with_slope(needle_slope):
    for index, item in enumerate(lookup):
        if item == needle_slope:
            return index
    return None

for item in start:
    p1 = item[0][0]
    slope = item[0][1]
    p2 = item[1]
    # Check if end already contains this slope
    slope_index = find_tuple_index_with_slope(slope)
    if slope_index is None:
        # If it doesn't exist, add an item to end
        end.append((p1, p2))
        # And add the slope to lookup
        lookup.append(slope)
    else:
        # If it exists, append the new points to the existing value and
        # reassign it to the correct index of end
        end[slope_index] = (*end[slope_index], p1, p2)
Now, we have end looking like so:
[((1, 2), (2, 5), (1, 2), (4, 8)), ((4, 5), (3, 7)), ((2, 4), (8, 9))]
The reason this approach isn't great is that find_tuple_index_with_slope() needs to iterate over the elements of lookup to find the index of the tuple to append to. This linear scan on every item increases the time complexity of the code, whereas a dictionary lookup would be much faster, especially if you have lots of points and lots of distinct values of slope.
A better way: replace the lookup function with a new dictionary, where the keys are the values of slope, and the values are the indices in end where the corresponding tuple is stored.
lookup = dict()
end = []
for item in start:
    p1 = item[0][0]
    slope = item[0][1]
    p2 = item[1]
    # Find the index of the tuple for `slope` using the lookup
    slope_index = lookup.get(slope, None)
    if slope_index is None:
        # If it doesn't exist, add an item to end
        end.append((p1, p2))
        # And add that index to lookup
        lookup[slope] = len(end) - 1
    else:
        end[slope_index] = (*end[slope_index], p1, p2)
The code looks almost the same as before, but looking up using a dictionary instead of a list is what saves you time.
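A possible variant, not part of the answer above: if a dictionary is allowed at all, it can map each slope straight to its growing tuple, so no index bookkeeping is needed. This sketch assumes Python 3.7+ (dictionaries keep insertion order), and condense_by_slope is a hypothetical helper name.

```python
def condense_by_slope(start):
    """Group the points of `start` into one tuple per distinct slope."""
    grouped = {}
    for (p1, slope), p2 in start:
        # Extend the tuple stored for this slope (empty tuple if unseen).
        grouped[slope] = grouped.get(slope, ()) + (p1, p2)
    # Slopes come back in order of first appearance; the slope is dropped.
    return list(grouped.values())


start = [(((1, 2), 3), (2, 5)), (((4, 5), 2), (3, 7)),
         (((2, 4), 1), (8, 9)), (((1, 2), 3), (4, 8))]
end = condense_by_slope(start)
print(end)
```

Each tuple concatenation copies the existing tuple, so for very large groups collecting into lists and converting to tuples at the end would likely be cheaper.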

How to pass list of tuples through a object method in python

I'm having a frustrating issue: I want to pass the tuples in the following list
through a method on another list of instances of a class that I have created.
list_1=[(0, 20), (10, 1), (0, 1), (0, 10), (5, 5), (10, 50)]
instances=[instance[0], instance[1],...instance[n]]  # placeholder for my list of instances
results=[]
pos_list=[]
for i in range(len(list_1)):
    a,b=list_1[i]
    result=sum(instance.method(a,b) for instance in instances)
    results.append(result)
    if result>=0:
        pos_list.append((a,b))
print(results)
print(pos_list)
The issue is that all instances are taking the same tuple, whereas I want the method on the first instance to take the first tuple, and so on.
I ultimately want to append to the new list (pos_list) if the sum is >0.
Does anyone know how I can iterate this properly?
EDIT
It will make it clearer if I print the result of the sum also.
Basically I want the sum to perform as follows:
result = instance[0].method(0,20), instance[1].method(10,1), instance[2].method(0,1), instance[3].method(0,10), instance[4].method(5,5), instance[5].method(10,50)
For info the method is just the +/- product of the two values depending on the attributes of the instance.
So results for above would be:
result = [0*20 - 10*1 - 0*1 + 0*10 - 5*5 + 10*50] = [465]
pos_list=[(0, 20), (10, 1), (0, 1), (0, 10), (5, 5), (10, 50)]
except what it is actually doing is using the same tuple for all instances, like this:
result = instance[0].method(0,20), instance[1].method(0,20), instance[2].method(0,20), instance[3].method(0,20), instance[4].method(0,20), instance[5].method(0,20)
result = [0*20 - 0*20 - 0*20 + 0*20 - 0*20 + 0*20] = [0]
pos_list=[]
and so on for (10,1) etc.
How do I make it work like the first example?

You can compute your sum by using zip to pair each instance with its corresponding tuple (payout below stands in for your method, and List_1 for your list_1):
result=sum(instance.payout(*t) for instance, t in zip(instances, List_1))
The zip will stop as soon as it reaches the end of the shortest of the two iterators. So if you have 10 instances and 100 tuples, zip will produce only 10 pairs, using the first 10 elements of both lists.
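As a rough check of the pairing, here is a small sketch using the numbers from the question. SignedProduct is a hypothetical stand-in for the asker's class, assuming its method returns the product of the two values with a per-instance sign; the signs are chosen to match the expected sum in the question.

```python
class SignedProduct:
    """Hypothetical stand-in: returns +a*b or -a*b depending on the instance."""

    def __init__(self, sign):
        self.sign = sign

    def method(self, a, b):
        return self.sign * a * b


list_1 = [(0, 20), (10, 1), (0, 1), (0, 10), (5, 5), (10, 50)]
instances = [SignedProduct(sign) for sign in (1, -1, -1, 1, -1, 1)]

# zip hands the first tuple to the first instance, the second to the second, ...
total = sum(inst.method(a, b) for inst, (a, b) in zip(instances, list_1))
print(total)  # 465

# With fewer instances than tuples, zip stops at the shorter input.
short_pairs = list(zip(instances[:2], list_1))
print(len(short_pairs))  # 2
```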
The problem I see in your code is that you compute this sum once for each element of List_1. If payout always produces the same result for the same inputs (e.g., it has no memory or randomness), result will be the same at every iteration. In the end, results will hold the same value repeated len(List_1) times, while pos_list will contain either all of the input tuples (if the sum is at least 0) or none of them (if the sum is negative).
Instead, it would make sense if items of List_1 were lists or tuples themselves:
List_1 = [
    [(0, 1), (2, 3), (4, 5)],
    [(6, 7), (8, 9), (10, 11)],
    [(12, 13), (14, 15), (16, 17)],
]
So, in this case, supposing that your class for instances is something like this:
class Goofy:
    def __init__(self, positive_sum=True):
        self.positive_sum = positive_sum

    def payout(self, *args):
        if self.positive_sum:
            return sum(args)
        else:
            return -1 * sum(args)

instances = [Goofy(i) for i in [True, True, False]]
you can rewrite your code in this way:
results=[]
pos_list=[]
for el in List_1:
    result = sum(g.payout(*t) for g, t in zip(instances, el))
    results.append(result)
    if result >= 0:
        pos_list.append(el)
Running the previous code, results will be:
[-3, 9, 21]
while pos_list will be:
[[(6, 7), (8, 9), (10, 11)], [(12, 13), (14, 15), (16, 17)]]
If you are interested only in pos_list, you can compact your code into a single line:
pos_list = list(filter(lambda el: sum(g.payout(*t) for g, t in zip(instances, el)) >= 0, List_1))

many thanks for the above! I have it working now.
Wasn't able to use args given my method had a bit more to it but the use of zip is what made it click
import random
rand=random.choices(list_1, k=len(instances))
results=[]
pos_list=[]
for r in rand:
    x,y=r
    result=sum(instance.method(x,y) for instance,(x,y) in zip(instances, rand))
    results.append(result)
    if result>=0:
        pos_list.append(rand)
print(results)
print(pos_list)
for list of e.g.
rand=[(20, 5), (0, 2), (0, 100), (2, 50), (5, 10), (50, 100)]
this returns the following
results=[147]
pos_list=[(20, 5), (0, 2), (0, 100), (2, 50), (5, 10), (50, 100)]
so exactly what I wanted. Thanks again!

Sort list of tuples based on multiple criteria

Given a list of tuples, [(x_1, y_1, z_1), ..., (x_n, y_n, z_n)], where x and y are nonnegative numbers and z is either 0 or 1, I want to sort the list based on the following criteria:
if x_i != x_j, sort in ascending order of x (tuple[0])
if x_i == x_j and z_i != z_j, sort in ascending order of z (tuple[2])
if x_i == x_j and z_i == z_j and z_i == 0, sort in descending order of y (tuple[1])
if x_i == x_j and z_i == z_j and z_i == 1, sort in ascending order of y (tuple[1])
Input: [(1, 1, 0), (2, 1, 1), (1, 2, 0), (2, 2, 1), (1, 3, 0), (2, 3, 1)]
output:[(1, 3, 0), (1, 2, 0), (1, 1, 0), (2, 1, 1), (2, 2, 1), (2, 3, 1)]
Since Python 3's sort does not accept a custom comparator function the way Java's does, as far as I know, I do not know how to incorporate the above criteria in the sort method.
I can sort based on two of the above-mentioned criteria (either 1,2 or 1,3). Adding the third criterion makes one of 2 or 3 invalid. I am adding my code here:
points.sort(key=lambda p: p[2])
points.sort(key=lambda p: p[1], reverse=True)
points.sort(key=lambda p: p[0])
OUTPUT: [(1, 3, 0), (1, 2, 0), (1, 1, 0), (2, 3, 1), (2, 2, 1), (2, 1, 1)] (criteria 3 not satisfied)
Can anybody suggest, what should be the value of key argument in this situation? Thanks

Just encoding your criteria...
points.sort(key=lambda p: (p[0], p[2], p[1] if p[2] else -p[1]))
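A quick check of that key against the input and output given in the question. The trick is that negating y reverses its order, so y is negated only when z is 0; this relies on y being numeric, as the question states.

```python
points = [(1, 1, 0), (2, 1, 1), (1, 2, 0), (2, 2, 1), (1, 3, 0), (2, 3, 1)]

# Sort by x, then z, then y ascending when z == 1 and descending when z == 0.
points.sort(key=lambda p: (p[0], p[2], p[1] if p[2] else -p[1]))

print(points)
```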

If you have truly ridiculously complicated sorting rules, you can just write a comparator function, then use functools.cmp_to_key to make it into a valid key argument. So write your insane comparator function, add from functools import cmp_to_key to the top of your file, then do:
points.sort(key=cmp_to_key(my_insane_comparator))
and it will work as expected. All cmp_to_key really does is wrap each element in a custom class whose comparison operators (such as __lt__, the less-than operator) call your comparator on each comparison.
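For illustration, here is one way the comparator could look for the criteria in this question. It is only a sketch: compare_points is a hypothetical name, and it follows the usual convention of returning a negative number, zero, or a positive number.

```python
from functools import cmp_to_key


def compare_points(p, q):
    """Return -1, 0 or 1 according to the ordering rules in the question."""
    if p[0] != q[0]:                      # ascending x
        return -1 if p[0] < q[0] else 1
    if p[2] != q[2]:                      # then ascending z
        return -1 if p[2] < q[2] else 1
    if p[1] == q[1]:
        return 0
    ascending = -1 if p[1] < q[1] else 1
    # y ascending when z == 1, descending when z == 0
    return ascending if p[2] == 1 else -ascending


points = [(1, 1, 0), (2, 1, 1), (1, 2, 0), (2, 2, 1), (1, 3, 0), (2, 3, 1)]
points.sort(key=cmp_to_key(compare_points))
print(points)
```

The tuple key from the other answer is shorter and usually faster; a comparator mainly pays off when the rules cannot be expressed as a key.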

Efficient way to loop through orthodiagonal indices in order

I wanted to find a better way to loop through orthodiagonal indices in order, I am currently using numpy but I think I'm making an unnecessary number of function calls.
import numpy as np
len_x, len_y = 50, 50  # they don't have to be equal
index_arr = np.add.outer(np.arange(len_x), np.arange(len_y))
Currently, I am looping through like this:
for i in range(np.max(index_arr) + 1):  # + 1 so the last diagonal is included
    orthodiag_indices = zip(*np.where(index_arr == i))
    for index in orthodiag_indices:
        pass  # DO FUNCTION OF index #
I have an arbitrary function of the index tuple, index and other parameters outside of this loop. It feels like I don't need the second for loop, and I should be able to do the whole thing in one loop. On top of this, I'm making a lot of function calls from zip(*np.where(index_arr == i)) for every i. What's the most efficient way to do this?
Edit: should mention that it's important that the function applies to index_arr == i in order, i.e., it does 0 first, then 1, then 2 etc. (the order of the second loop doesn't matter).
Edit 2: I guess what I want is a way to get the indices [(0,0), (0,1), (1,0), (2,0), (1,1), (0,2), ...] efficiently. I don't think I can apply a vectorized function because I am populating an np.zeros((len_x, len_y)) array, and going back to the first edit, the order matters.

You could use tril/triu_indices. Since the order of the (former) inner loop doesn't matter, the dimensions can be swapped as needed, so I'll assume L >= S:
L,S = 4,3
a0,a1 = np.tril_indices(L,0,S)
b0,b1 = np.triu_indices(S,1)
C0 = np.concatenate([a0-a1,b0+L-b1])
C1 = np.concatenate([a1,b1])
*zip(C0,C1),
# ((0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2), (3, 0), (2, 1), (1, 2), (3, 1), (2, 2), (3, 2))
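If NumPy is not required for generating the indices, a plain generator is another option. This is a sketch rather than a benchmarked solution: antidiagonal_indices is a hypothetical name, and it walks the diagonals in increasing order of i + j without building index_arr or calling np.where.

```python
def antidiagonal_indices(len_x, len_y):
    """Yield (i, j) with 0 <= i < len_x and 0 <= j < len_y, ordered by i + j."""
    for total in range(len_x + len_y - 1):
        # Clamp i so that both i and j = total - i stay inside the array.
        first = max(0, total - len_y + 1)
        last = min(total, len_x - 1)
        for i in range(first, last + 1):
            yield (i, total - i)


indices = list(antidiagonal_indices(2, 3))
print(indices)
```

Each index is produced exactly once, so a single `for index in antidiagonal_indices(len_x, len_y):` loop can replace the two nested loops from the question.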

I think itertools.product() will be of use here
import itertools as it
import numpy as np
x,y = 2,3
a=list(it.product(range(x),range(y)))
which gives a as
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
If you need them in order then,
b=np.argsort(np.sum(a,1))
np.array(a)[b]
which gives,
array([[0, 0],
       [0, 1],
       [1, 0],
       [0, 2],
       [1, 1],
       [1, 2]])
Hope that helps!

Selecting sublists of a list of lists to define a relation

If I happen to have the following list of lists:
L=[[(1,3)],[(1,3),(2,4)],[(1,3),(1,4)],[(1,2)],[(1,2),(1,3)],[(1,3),(2,4),(1,2)]]
and what I wish to do, is to create a relation between lists in the following way:
I wish to say that
[(1,3)] and [(1,3),(1,4)]
are related, because the first is a sublist of the second, but then I would like to add this relation into a list as:
Relations=[([(1,3)],[(1,3),(1,4)])]
but, we can also see that:
[(1,3)] and [(1,3),(2,4)]
are related, because the first is a sublist of the second, so I would want this to also be a relation added into my Relations list:
Relations=[([(1,3)],[(1,3),(1,4)]),([(1,3)],[(1,3),(2,4)])]
The only thing I wish to be careful with, is that I am considering for a list to be a sublist of another if they only differ by ONE element. So in other words, we cannot have:
([(1,3)],[(1,3),(2,4),(1,2)])
as an element of my Relations list, but we SHOULD have:
([(1,3),(2,4)],[(1,3),(2,4),(1,2)])
as an element in my Relations list.
I hope there is an optimal way to do this, since in the original context I have to deal with a much bigger list of lists.
Any help given is much appreciated.

You really haven't provided enough information, so I can't tell whether you need itertools.combinations() or itertools.permutations(). Your examples work with itertools.combinations, so I will use that (with itertools imported as it).
If x and y are two elements of the list, then you just want all occurrences where set(x).issubset(y) holds and the size of the set difference is at most 1, i.e. len(set(y) - set(x)) <= 1, e.g.:
In []:
[[x, y] for x, y in it.combinations(L, r=2) if set(x).issubset(y) and len(set(y)-set(x)) <= 1]
Out[]:
[[[(1, 3)], [(1, 3), (2, 4)]],
[[(1, 3)], [(1, 3), (1, 4)]],
[[(1, 3)], [(1, 2), (1, 3)]],
[[(1, 3), (2, 4)], [(1, 3), (2, 4), (1, 2)]],
[[(1, 2)], [(1, 2), (1, 3)]],
[[(1, 2), (1, 3)], [(1, 3), (2, 4), (1, 2)]]]
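A self-contained variant of the same idea, with three assumptions that go beyond the answer above: it uses permutations so that a sublist is found even when it appears after its superlist in L, it requires the difference to be exactly one element (as the question describes), and it returns tuples in the shape of the question's Relations list. one_step_relations is a hypothetical name, and the check assumes no tuple is repeated within a list.

```python
from itertools import permutations


def one_step_relations(lists):
    """Pair each list with every list that contains it plus exactly one more element."""
    relations = []
    for smaller, larger in permutations(lists, 2):
        small_set, large_set = set(smaller), set(larger)
        # Proper subset, and the two differ by exactly one element.
        if small_set < large_set and len(large_set - small_set) == 1:
            relations.append((smaller, larger))
    return relations


L = [[(1, 3)], [(1, 3), (2, 4)], [(1, 3), (1, 4)], [(1, 2)],
     [(1, 2), (1, 3)], [(1, 3), (2, 4), (1, 2)]]
Relations = one_step_relations(L)
print(Relations)
```

This still compares every pair of lists, so it is quadratic in len(L); for a much bigger list, grouping the lists by length first and comparing only lengths n and n + 1 would likely reduce the work.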
