Spark / PySpark: Group by any item of nested list

I'm still new to Spark / PySpark and have the following question. I have a nested list with IDs in it:
result = [[411, 44, 61], [42, 33], [1, 100], [44, 42]]
What I'm trying to achieve is that if any item of a sublist matches an item in another sublist, the two should be merged. The result should look like this:
merged_result = [[411, 44, 61, 42, 33, 44, 42], [1,100]]
The first list in "result" matches the fourth list. The fourth list matches the second, so all three should be merged into one list. The third list doesn't match any other list, so it stays the same.
I could achieve this by writing loops in plain Python:
result_after_matching = []
for i in result:
    new_list = i
    for s in result:
        if any(x in i for x in s):
            new_list = new_list + s
    result_after_matching.append(set(new_list))
# merged_result = [[411, 44, 61, 42], [42,33,44], [1, 100], [44,42,33,411,61]]
As this is not the desired output, I would need to repeat the loop and apply another set() over "merged_result":
set([[411,44,61,42,33], [42,33,44,411,61],[1,100], [44,42,33,411,61]])
-> [[411, 44, 61, 42, 33], [1,100]]
As the list of lists and its sublists get bigger and bigger over time as new data comes in, this will not be the approach to use.
Can anyone tell me if there is a function in Spark / PySpark to match / merge / group / reduce these nested lists more easily and faster?
Thanks a lot in advance!
MG

Most RDD- or DataFrame-based solutions will probably be fairly inefficient. This is because the nature of your problem requires every element of your data set to be compared to every other element, potentially multiple times, which makes distributing the work across a cluster inefficient at best.
Perhaps a different way to do this would be to reformulate it as a graph problem. If you treat each item in a list as a node on a graph, and each list as a subgraph, then the connected components of a parent graph constructed from the subgraphs will be the desired result. Here is an example using the networkx package in Python:
import networkx as nx

result = [[411, 44, 61], [42, 33], [1, 100], [44, 42]]

g = nx.Graph()
for sublist in result:
    nx.add_path(g, sublist)  # link the items of each sublist into a path

output = []
for component in nx.connected_components(g):  # each component is a set of node IDs
    output.append(list(component))
print(output)
# e.g. [[33, 42, 411, 44, 61], [1, 100]]  (node order within a component may vary)
This should be fairly efficient, but if your data is very large it will make sense to use a more scalable graph analysis tool. Spark does have a graph processing library called GraphX:
https://spark.apache.org/docs/latest/graphx-programming-guide.html
Unfortunately the PySpark implementation lags behind a bit, so if you intend to use something like this, you might be stuck with Scala Spark or a different framework entirely for now.
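If you do want to stay in PySpark, the separately distributed GraphFrames package exposes connected components to Python. Below is a rough sketch, not a drop-in solution: it assumes an existing SparkSession named spark, that the graphframes package has been installed (e.g. via --packages), and an arbitrary checkpoint directory, which connectedComponents() requires.
from graphframes import GraphFrame
from pyspark.sql import functions as F

result = [[411, 44, 61], [42, 33], [1, 100], [44, 42]]

# Every ID becomes a vertex; consecutive items of each sublist become an edge,
# mirroring the add_path idea above.
vertices = spark.createDataFrame([(v,) for sub in result for v in sub], ["id"]).distinct()
edges = spark.createDataFrame(
    [(a, b) for sub in result for a, b in zip(sub, sub[1:])], ["src", "dst"]
)

spark.sparkContext.setCheckpointDir("/tmp/graphframes-cc")  # required by connectedComponents()
g = GraphFrame(vertices, edges)

components = g.connectedComponents()  # DataFrame with columns: id, component
components.groupBy("component").agg(F.collect_list("id").alias("ids")).show(truncate=False)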

I think you can use the aggregate action on RDDs. Below I'm putting an example implementation in Scala. Please note that I've used recursion to make it more readable, but to improve performance it's a good idea to reimplement those functions iteratively.
def overlap(s1: Seq[Int], s2: Seq[Int]): Boolean =
  s1.exists(e => s2.contains(e))

def mergeSeq(s1: Seq[Int], s2: Seq[Int]): Seq[Int] =
  s1.union(s2).distinct

def mergeSeqWithSeqSeq(s: Seq[Int], ss: Seq[Seq[Int]]): Seq[Seq[Int]] = ss match {
  case Nil => Seq(s)
  case h +: tail =>
    if (overlap(h, s)) mergeSeqWithSeqSeq(mergeSeq(h, s), tail)
    else h +: mergeSeqWithSeqSeq(s, tail)
}

def mergeSeqSeqWithSeqSeq(s1: Seq[Seq[Int]], s2: Seq[Seq[Int]]): Seq[Seq[Int]] = s1 match {
  case Nil => s2
  case h +: tail => mergeSeqWithSeqSeq(h, mergeSeqSeqWithSeqSeq(tail, s2))
}

val result = rdd
  .aggregate(Seq.empty[Seq[Int]]) (
    {case (ss, s) => mergeSeqWithSeqSeq(s, ss)},
    {case (s1, s2) => mergeSeqSeqWithSeqSeq(s1, s2)}
  )
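For completeness, here is a rough PySpark translation of the same aggregate idea (a sketch only; it assumes an existing SparkContext named sc and replaces the recursion with plain loops):
def overlap(s1, s2):
    return any(e in s2 for e in s1)

def merge_seq(s1, s2):
    return list(dict.fromkeys(list(s1) + list(s2)))  # union, keeping order, dropping duplicates

def merge_seq_with_seq_seq(s, ss):
    # Fold the single group s into the already-merged groups ss.
    out = []
    for group in ss:
        if overlap(group, s):
            s = merge_seq(group, s)  # absorb the overlapping group into s
        else:
            out.append(group)
    out.append(s)
    return out

def merge_seq_seq_with_seq_seq(ss1, ss2):
    # Combine two partial results by folding every group of ss1 into ss2.
    for s in ss1:
        ss2 = merge_seq_with_seq_seq(s, ss2)
    return ss2

rdd = sc.parallelize([[411, 44, 61], [42, 33], [1, 100], [44, 42]])
merged = rdd.aggregate([], lambda ss, s: merge_seq_with_seq_seq(s, ss),
                       merge_seq_seq_with_seq_seq)
print(merged)  # e.g. [[1, 100], [411, 44, 61, 42, 33]]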

Related

How can I use broadcasting to code my program in one line?

I have code that works fine; however, the exercise is to write it in one line using broadcasting, and I've found it very complicated to do. This is the code:
import numpy as np

v1 = np.array([10, 20, 30, 40, 50])
v2 = np.array([0, 1, 2, 3])

matrix = []
for i in v1:
    matrix.append(i**v2)
matrixx = np.array(matrix).reshape([5, 4])
print(matrixx)
Please some help!
You don't need to do anything special for broadcasting (it occurs automatically) in this case; you just need to give v1 a trailing dimension of size 1 so it broadcasts against v2.
You can get the same output without a loop/comprehension:
print(v1.reshape(5,1)**v2)
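Equivalently, you can add the new axis with None / np.newaxis instead of reshape; a small self-contained example:
import numpy as np

v1 = np.array([10, 20, 30, 40, 50])
v2 = np.array([0, 1, 2, 3])

# A (5, 1) array raised to a (4,) exponent broadcasts to a (5, 4) result, same as reshape(5, 1).
print(v1[:, None] ** v2)
print(v1[:, np.newaxis] ** v2)  # np.newaxis is just an alias for None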

Why is taking a slice of a list which is assigned to another list not changing the original?

I have a class that is a representation of a mathematical tensor. The tensor in the class is stored as a single flat list, not as lists nested inside another list. That means [[1, 2, 3], [4, 5, 6]] would be stored as [1, 2, 3, 4, 5, 6].
I've made a __setitem__() function and a function to handle taking slices of this tensor while it's in single-list format. For example, slice(1, None, None) would become slice(3, None, None) for the list mentioned above. However, when I assign a new value to this slice, the original tensor isn't updated.
Here is what the simplified code looks like:
class Tensor:
    def __init__(self, tensor):
        self.tensor = tensor  # Here I would flatten it, but for now imagine it's already flattened.

    def __setitem__(self, slices, value):
        slices = [slices]
        temp_tensor = self.tensor  # any changes to temp_tensor should also change self.tensor.
        for s in slices:  # Here I would call self.slices_to_index(), but this is to keep the code simple.
            temp_tensor = temp_tensor[s]
        temp_tensor = value  # In my mind, this should have also changed self.tensor, but it hasn't.
Maybe I'm just missing why this isn't working. Maybe my actual question isn't just 'why doesn't this work?' but also 'is there a better way to do this?'. Thanks for any help you can give me.
NOTES:
Each 'dimension' of the list must have the same shape, so [[1, 2, 3], [4, 5]] isn't allowed.
This code is massively simplified as there are many other helper functions and stuff like that.
In __init__() I would flatten the list, but as I just said, to keep things simple I left that out, along with self.slices_to_index().
You should not think of Python variables as you would in C++ or Java. Think of them as labels you attach to values. Check this example:
>>> l = []
>>> l.append
<built-in method append of list object at 0x7fbb0d40cf88>
>>> l.append(10)
>>> l
[10]
>>> ll = l
>>> ll.append(10)
>>> l
[10, 10]
>>> ll
[10, 10]
>>> ll = ["foo"]
>>> l
[10, 10]
As you can see, the ll variable first points to the same list as l, but later we make it point to another list. Modifying the later ll won't modify the original list pointed to by l.
So, in your case, if you want self.tensor to point to a new value, just do it:
class Tensor:
    def __init__(self, tensor):
        self.tensor = tensor  # Here I would flatten it, but for now imagine it's already flattened.

    def __setitem__(self, slices, value):
        slices = [slices]
        temp_tensor = self.tensor  # any changes to the list pointed to by temp_tensor will be reflected in self.tensor, since it is the same list
        for s in slices:
            temp_tensor = temp_tensor[s]
        self.tensor = value
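If instead you want the update to happen in place (keeping the same underlying list), assign into a slice of the list rather than rebinding a name. A minimal sketch with hypothetical names (slices_to_flat stands in for your slices_to_index() translation step):
class Tensor:
    def __init__(self, tensor):
        self.tensor = tensor  # already flattened

    def slices_to_flat(self, row):
        # Hypothetical stand-in for slices_to_index(): map a row index to a slice over the flat list.
        row_len = 3  # hard-coded just for the sketch
        return slice(row * row_len, (row + 1) * row_len)

    def __setitem__(self, row, value):
        s = self.slices_to_flat(row)
        self.tensor[s] = value  # slice assignment mutates the list in place

t = Tensor([1, 2, 3, 4, 5, 6])
t[1] = [7, 8, 9]
print(t.tensor)  # [1, 2, 3, 7, 8, 9]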

I want to do the same transformation in Python as I did in Scala

I'm new to Python.
Scala Code:
rdd1 is in string format
val rdd1 = sc.parallelize(Seq("[Canada,47;97;33;94;6]", "[Canada,59;98;24;83;3]", "[Canada,77;63;93;86;62]"))

val resultRDD = rdd1.map { r =>
  val Array(country, values) = r.replaceAll("\\[|\\]", "").split(",")
  country -> values
}.reduceByKey((a, b) => a.split(";").zip(b.split(";")).map {
  case (i1, i2) => i1.toInt + i2.toInt
}.mkString(";"))
Output:
Country,Values // column names added here just to show that the output has two columns
Canada,183;258;150;263;71
Edit: The OP wants to use map instead of flatMap, so I adjusted flatMap to map, which just means taking the first item out of the list comprehension, i.e. map(lambda x: [...][0]).
Side note: the above change is only valid for this particular case, where the list comprehension returns a list with a single item. For more general cases, you might need two map()s to replace what flatMap() does.
One way with RDDs is to use a list comprehension to strip, split and convert the string into a key-value pair, with Country as the key and a tuple of numbers as the value. Since we use a list comprehension, we take flatMap on the RDD element (adjusted to map as described above), then use reduceByKey to do the calculation and mapValues to convert the resulting tuple back into a string:
rdd1.map(lambda x: [ (e[0], tuple(map(int, e[1].split(';')))) for e in [x.strip('][').split(',')] ][0]) \
.reduceByKey(lambda x,y: tuple([ x[i]+y[i] for i in range(len(x))]) ) \
.mapValues(lambda x: ';'.join(map(str,x))) \
.collect()
output after the map/flatMap step:
[('Canada', (47, 97, 33, 94, 6)),
('Canada', (59, 98, 24, 83, 3)),
('Canada', (77, 63, 93, 86, 62))]
output after reduceByKey:
[('Canada', (183, 258, 150, 263, 71))]
output after mapValues:
[('Canada', '183;258;150;263;71')]
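For reference, the original flatMap version mentioned in the side note looks like this (a sketch; it assumes an existing SparkContext named sc and rebuilds rdd1 from the same raw strings):
rdd1 = sc.parallelize(["[Canada,47;97;33;94;6]",
                       "[Canada,59;98;24;83;3]",
                       "[Canada,77;63;93;86;62]"])

rdd1.flatMap(lambda x: [(e[0], tuple(map(int, e[1].split(';'))))
                        for e in [x.strip('][').split(',')]]) \
    .reduceByKey(lambda x, y: tuple(x[i] + y[i] for i in range(len(x)))) \
    .mapValues(lambda x: ';'.join(map(str, x))) \
    .collect()
# [('Canada', '183;258;150;263;71')]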
You can do something like this:
import pyspark.sql.functions as f
from pyspark.sql.functions import col
myRDD = sc.parallelize([('Canada', '47;97;33;94;6'), ('Canada', '59;98;24;83;3'),('Canada', '77;63;93;86;62')])
df = myRDD.toDF()
>>> df.show(10)
+------+--------------+
| _1| _2|
+------+--------------+
|Canada| 47;97;33;94;6|
|Canada| 59;98;24;83;3|
|Canada|77;63;93;86;62|
+------+--------------+
df.select(
col("_1").alias("country"),
f.split("_2", ";").alias("values"),
f.posexplode(f.split("_2", ";")).alias("pos", "val")
)\
.drop("val")\
.select(
"country",
f.concat(f.lit("position"),f.col("pos").cast("string")).alias("name"),
f.expr("values[pos]").alias("val")
)\
.groupBy("country").pivot("name").agg(f.sum("val"))\
.show()
+-------+---------+---------+---------+---------+---------+
|country|position0|position1|position2|position3|position4|
+-------+---------+---------+---------+---------+---------+
| Canada| 183.0| 258.0| 150.0| 263.0| 71.0|
+-------+---------+---------+---------+---------+---------+
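If you need the result back in the original Country,Values shape rather than one column per position, a possible follow-up (a sketch; it simply assigns the pivot chain above to a variable and re-joins the columns with concat_ws) would be:
pivoted = df.select(
    col("_1").alias("country"),
    f.split("_2", ";").alias("values"),
    f.posexplode(f.split("_2", ";")).alias("pos", "val")
).drop("val").select(
    "country",
    f.concat(f.lit("position"), f.col("pos").cast("string")).alias("name"),
    f.expr("values[pos]").alias("val")
).groupBy("country").pivot("name").agg(f.sum("val"))

value_cols = [c for c in pivoted.columns if c != "country"]
pivoted.select(
    "country",
    f.concat_ws(";", *[f.col(c).cast("int").cast("string") for c in value_cols]).alias("values")
).show(truncate=False)
# +-------+------------------+
# |country|values            |
# +-------+------------------+
# |Canada |183;258;150;263;71|
# +-------+------------------+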

Tensorflow map_fn Out of Memory Issues

I am having issues with my code running out of memory on large data sets. I attempted to chunk the data to feed it into the calculation graph but I eventually get an out of memory error. Would setting it up to use the feed_dict functionality get around this problem?
My code is set up like the following, with a nested map_fn call applied to the result of the tf_itertools_product_2D_nest function.
The tf_itertools_product_2D_nest function is from Cartesian Product in Tensorflow.
I also tried a variation where I made a list of tensor-lists which was significantly slower than doing it purely in tensorflow so I'd prefer to avoid that method.
import tensorflow as tf
import numpy as np

config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.9

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

sess = tf.Session()
sess.run(tf.global_variables_initializer())

tensorboard_log_dir = "../log/"

def tf_itertools_product_2D_nest(a, b):  # does not work on nested tensors
    a, b = a[None, :, None], b[:, None, None]
    # print(sess.run(tf.shape(a)))
    # print(sess.run(tf.shape(b)))
    n_feat_dimension_in_common = tf.shape(a)[-1]
    c = tf.concat([a + tf.zeros_like(b), tf.zeros_like(a) + b], axis=2)
    return c

def do_calc(arr_pair):
    arr_1 = arr_pair[0]
    arr_binary = arr_pair[1]
    return tf.reduce_max(tf.cumsum(arr_1 * arr_binary))

def calc_row_wrapper(row):
    return tf.map_fn(do_calc, row)

for i in range(0, 10):
    a = tf.constant(np.random.random((7, 10)) * 10, tf.float64)
    b = tf.constant(np.random.randint(2, size=(3, 10)), tf.float64)
    a_b_itertools_product = tf_itertools_product_2D_nest(a, b)
    '''Creates an array like this:
    [ [[arr_a0,arr_b0], [arr_a1,arr_b0],...],
      [[arr_a0,arr_b1], [arr_a1,arr_b1],...],
      [[arr_a0,arr_b2], [arr_a1,arr_b2],...],
      ...]
    '''
    with tf.summary.FileWriter(tensorboard_log_dir, sess.graph) as writer:
        result_array = sess.run(tf.map_fn(calc_row_wrapper, a_b_itertools_product),
                                options=run_options, run_metadata=run_metadata)
        writer.add_run_metadata(run_metadata, "iteration {}".format(i))
        print(result_array.shape)
        print(result_array)
        print("")
# result_array should be an array with 3 rows (1 for each binary vector in b) and 7 columns (1 for each row in a)
I can imagine that this is unnecessarily consuming memory due to the extra dimension added. Is there a way to mimic the outcome of the standard itertools.product() function, i.e. to output one long list of every possible combination of items in the two input iterables? Like the result of:
itertools.product([[1,2],[3,4]],[[5,6],[7,8]])
# [([1, 2], [5, 6]), ([1, 2], [7, 8]), ([3, 4], [5, 6]), ([3, 4], [7, 8])]
That would eliminate the need to call map_fn twice.
When map_fn is called within a loop, as my code shows, will it keep spawning graphs for every iteration? There appears to be a big "map_" node for every iteration cycle in this code's Tensorboard graph.
[Screenshot: Tensorboard default view]
When I select a particular iteration based on the tag in Tensorboard, only the map node corresponding to that iteration is highlighted, with all the others grayed out. Does that mean that for that cycle only that cycle's map node is present (and the others, if from a previous cycle, no longer exist in memory)?
[Screenshot: Tensorboard single-iteration view]

AttributeError: Filter attribute has no attribute append python 3.x [duplicate]

filter, map, and reduce work perfectly in Python 2. Here is an example:
>>> def f(x):
...     return x % 2 != 0 and x % 3 != 0
...
>>> filter(f, range(2, 25))
[5, 7, 11, 13, 17, 19, 23]
>>> def cube(x):
...     return x*x*x
...
>>> map(cube, range(1, 11))
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]
>>> def add(x,y):
...     return x+y
...
>>> reduce(add, range(1, 11))
55
But in Python 3, I receive the following outputs:
>>> filter(f, range(2, 25))
<filter object at 0x0000000002C14908>
>>> map(cube, range(1, 11))
<map object at 0x0000000002C82B70>
>>> reduce(add, range(1, 11))
Traceback (most recent call last):
File "<pyshell#8>", line 1, in <module>
reduce(add, range(1, 11))
NameError: name 'reduce' is not defined
I would appreciate it if someone could explain to me why this is.
You can read about the changes in What's New In Python 3.0. You should read it thoroughly when you move from 2.x to 3.x, since a lot has changed.
The rest of this answer consists of quotes from the documentation.
Views And Iterators Instead Of Lists
Some well-known APIs no longer return lists:
[...]
map() and filter() return iterators. If you really need a list, a quick fix is e.g. list(map(...)), but a better fix is often to use a list comprehension (especially when the original code uses lambda), or rewriting the code so it doesn’t need a list at all. Particularly tricky is map() invoked for the side effects of the function; the correct transformation is to use a regular for loop (since creating a list would just be wasteful).
[...]
Builtins
[...]
Removed reduce(). Use functools.reduce() if you really need it; however, 99 percent of the time an explicit for loop is more readable.
[...]
The functionality of map and filter was intentionally changed to return iterators, and reduce was removed from being a built-in and placed in functools.reduce.
So, for filter and map, you can wrap them with list() to see the results like you did before.
>>> def f(x): return x % 2 != 0 and x % 3 != 0
...
>>> list(filter(f, range(2, 25)))
[5, 7, 11, 13, 17, 19, 23]
>>> def cube(x): return x*x*x
...
>>> list(map(cube, range(1, 11)))
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]
>>> import functools
>>> def add(x,y): return x+y
...
>>> functools.reduce(add, range(1, 11))
55
>>>
The recommendation now is that you replace your usage of map and filter with generator expressions or list comprehensions. Example:
>>> def f(x): return x % 2 != 0 and x % 3 != 0
...
>>> [i for i in range(2, 25) if f(i)]
[5, 7, 11, 13, 17, 19, 23]
>>> def cube(x): return x*x*x
...
>>> [cube(i) for i in range(1, 11)]
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]
>>>
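The generator-expression form works the same way but stays lazy, like the Python 3 iterator objects themselves (f and cube are the same small helpers as above):
def f(x):
    return x % 2 != 0 and x % 3 != 0

def cube(x):
    return x * x * x

odds = (i for i in range(2, 25) if f(i))  # generator expression, nothing evaluated yet
cubes = (cube(i) for i in range(1, 11))

print(list(odds))   # [5, 7, 11, 13, 17, 19, 23]
print(list(cubes))  # [1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]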
They say that for loops are 99 percent of the time easier to read than reduce, but I'd just stick with functools.reduce.
Edit: The 99 percent figure is pulled directly from the What’s New In Python 3.0 page authored by Guido van Rossum.
As an addendum to the other answers, this sounds like a fine use-case for a context manager that will re-map the names of these functions to ones which return a list and introduce reduce in the global namespace.
A quick implementation might look like this:
from contextlib import contextmanager

@contextmanager
def noiters(*funcs):
    if not funcs:
        funcs = [map, filter, zip]  # etc
    from functools import reduce
    globals()[reduce.__name__] = reduce
    for func in funcs:
        globals()[func.__name__] = lambda *ar, func=func, **kwar: list(func(*ar, **kwar))
    try:
        yield
    finally:
        del globals()[reduce.__name__]
        for func in funcs:
            globals()[func.__name__] = func
With a usage that looks like this:
with noiters(map):
    from operator import add
    print(reduce(add, range(1, 20)))
    print(map(int, ['1', '2']))
Which prints:
190
[1, 2]
Just my 2 cents :-)
Since reduce has been removed from the built-in functions in Python 3, don't forget to import it from functools in your code. Please look at the code snippet below.
import functools
my_list = [10,15,20,25,35]
sum_numbers = functools.reduce(lambda x ,y : x+y , my_list)
print(sum_numbers)
One of the advantages of map, filter and reduce is how legible they become when you "chain" them together to do something complex. However, the built-in syntax isn't legible and is all "backwards". So, I suggest using the PyFunctional package (https://pypi.org/project/PyFunctional/).
Here's a comparison of the two:
flight_destinations_dict = {'NY': {'London', 'Rome'}, 'Berlin': {'NY'}}
PyFunctional version
Very legible syntax. You can say:
"I have a sequence of flight destinations. Out of which I want to get
the dict key if city is in the dict values. Finally, filter out the
empty lists I created in the process."
from functional import seq  # PyFunctional package to allow easier syntax

def find_return_flights_PYFUNCTIONAL_SYNTAX(city, flight_destinations_dict):
    return seq(flight_destinations_dict.items()) \
        .map(lambda x: x[0] if city in x[1] else []) \
        .filter(lambda x: x != [])
Default Python version
It's all backwards. You need to say:
"OK, so, there's a list. I want to filter empty lists out of it. Why?
Because I first got the dict key if the city was in the dict values.
Oh, the list I'm doing this to is flight_destinations_dict."
def find_return_flights_DEFAULT_SYNTAX(city, flight_destinations_dict):
    return list(
        filter(lambda x: x != [],
               map(lambda x: x[0] if city in x[1] else [], flight_destinations_dict.items())
        )
    )
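For comparison, calling the two versions above on the sample dict gives the same answer (PyFunctional's result is a lazy sequence, so it is wrapped in list() here for display):
flight_destinations_dict = {'NY': {'London', 'Rome'}, 'Berlin': {'NY'}}

print(find_return_flights_DEFAULT_SYNTAX('Rome', flight_destinations_dict))
# ['NY']
print(list(find_return_flights_PYFUNCTIONAL_SYNTAX('Rome', flight_destinations_dict)))
# ['NY']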
Here are examples of the filter, map and reduce functions.
numbers = [10,11,12,22,34,43,54,34,67,87,88,98,99,87,44,66]

# Filter
oddNumbers = list(filter(lambda x: x%2 != 0, numbers))
print(oddNumbers)

# Map
multiplyOf2 = list(map(lambda x: x*2, numbers))
print(multiplyOf2)

# Reduce
The reduce function, since it is not commonly used, was removed from the built-in functions in Python 3. It is still available in the functools module, so you can do:
from functools import reduce
sumOfNumbers = reduce(lambda x,y: x+y, numbers)
print(sumOfNumbers)
from functools import reduce

def f(x):
    return x % 2 != 0 and x % 3 != 0
print(*filter(f, range(2, 25)))
# 5 7 11 13 17 19 23

def cube(x):
    return x**3
print(*map(cube, range(1, 11)))
# 1 8 27 64 125 216 343 512 729 1000

def add(x, y):
    return x + y
print(reduce(add, range(1, 11)))
# 55
It works as-is. To print the output of map or filter, unpack it with * or wrap it in list().
