How to divide the content of an RDD - apache-spark

I have an RDD whose contents I would like to divide, returning a list of tuples.
rdd_to_divide = [('Nate', (1.2, 1.2)), ('Mike', (5, 10)), ('Ben', (3, 7)), ('Chad', (12, 20))]
result_rdd = [('Nate', 1.2/1.2), ('Mike', 5/10), ('Ben', 3/7), ('Chad', 12/20)]
Thanks in advance

Use a lambda function to map the RDD as below:
>>> rdd_to_divide = sc.parallelize([('Nate', (1.2, 1.2)), ('Mike', (5, 10)), ('Ben', (3, 7)), ('Chad', (12, 20))])
>>> result_rdd = rdd_to_divide.map(lambda x: (x[0], x[1][0]/x[1][1]))
>>> result_rdd.take(5)
[('Nate', 1.0), ('Mike', 0.5), ('Ben', 0.42857142857142855), ('Chad', 0.6)]
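If a denominator can be zero, the lambda above would raise ZeroDivisionError inside the Spark task. Below is a minimal sketch of a guarded helper. The name `safe_ratio` and the extra 'Zed' row are hypothetical, and a plain list stands in for the RDD so the snippet runs without a Spark context.

```python
def safe_ratio(pair):
    """Divide the first element of a 2-tuple by the second; return None if the divisor is zero."""
    numerator, denominator = pair
    if denominator == 0:
        return None
    return numerator / denominator


# Same records as in the question, plus one hypothetical row with a zero divisor.
records = [('Nate', (1.2, 1.2)), ('Mike', (5, 10)), ('Ben', (3, 7)),
           ('Chad', (12, 20)), ('Zed', (4, 0))]

# Local stand-in for the RDD transformation.
result = [(name, safe_ratio(values)) for name, values in records]
print(result)
```

With an RDD, the same function should be usable as rdd_to_divide.mapValues(safe_ratio), since mapValues passes only the value part of each pair to the function.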

Related

Reduce key, value pair based on similarity of their value in PySpark

I am a beginner in PySpark.
I want to find the pairs of letters that share the same number as their value, and then find out which pair of letters appears most often.
Here is my data
data = sc.parallelize([('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4), ('f', 1), ('b', 5), ('d', 5)])
data.collect()
[('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4), ('f', 1), ('b', 5), ('d', 5)]
The result I want would look like this:
1: a,f
4: b, d
4: b, e
4: d, e
10: c
5: b, d
I have tried the following:
data1= data.map(lambda y: (y[1], y[0]))
data1.collect()
[(1, 'a'), (4, 'b'), (10, 'c'), (4, 'd'), (4, 'e'), (1, 'f'), (5, 'b'), (5, 'd')]
data1.groupByKey().mapValues(list).collect()
[(10, ['c']), (4, ['b', 'd', 'e']), (1, ['a', 'f']), (5, ['b', 'd'])]
As I said I am very new to PySpark and tried to search the command for that but was not successful. Could anyone please help me with this?
You can use flatMap with python itertools.combinations to get combinations of 2 from the grouped values. Also, prefer using reduceByKey rather than groupByKey:
from itertools import combinations
result = data.map(lambda x: (x[1], [x[0]])) \
.reduceByKey(lambda a, b: a + b) \
.flatMap(lambda x: [(x[0], p) for p in combinations(x[1], 2 if (len(x[1]) > 1) else 1)])
result.collect()
#[(1, ('a', 'f')), (10, ('c',)), (4, ('b', 'd')), (4, ('b', 'e')), (4, ('d', 'e')), (5, ('b', 'd'))]
If you want to get None when tuple has only one element, you can use this:
.flatMap(lambda x: [(x[0], p) for p in combinations(x[1] if len(x[1]) > 1 else x[1] + [None], 2)])
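The second half of the question (which pair appears most often) is not covered by the snippet above. As a rough local sketch, assuming the data fits in memory, the same grouping and pairing can be done in plain Python and counted with collections.Counter. The names `groups` and `pair_counts` are illustrative only.

```python
from collections import Counter, defaultdict
from itertools import combinations

data = [('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4), ('f', 1), ('b', 5), ('d', 5)]

# Group letters by their number (the local equivalent of the reduceByKey step).
groups = defaultdict(list)
for letter, number in data:
    groups[number].append(letter)

# Count every 2-letter combination across all groups.
# Sorting the letters first makes ('b', 'd') and ('d', 'b') count as the same pair.
pair_counts = Counter(
    pair
    for letters in groups.values()
    for pair in combinations(sorted(letters), 2)
)

print(pair_counts.most_common(1))  # [(('b', 'd'), 2)]
```

In Spark, a similar count could probably be obtained by mapping each pair in result to (pair, 1) and combining the counts with reduceByKey.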

How to avoid in-place modification of a NetworkX graph when removing edges

I have a networkx graph, and I'm trying to remove edges from it using remove_edge.
I want to remove each edge in the original graph and post-process H to get further stats like edges connected to the edge that has been removed.
import networkx as nx
import matplotlib.pyplot as plt
# fig 1
n=10
G = nx.gnm_random_graph(n=10, m=10, seed=1)
nx.draw(G, with_labels=True)
plt.show()
for e in [[5, 0], [3, 6]]:
    H = G.remove_edge(e[0], e[1])
    nx.draw(G, with_labels=True)
    plt.show()
In the above, the edge is removed in place in G, so by the second iteration the original graph is no longer available. How can this be avoided? I want to retain the original graph for every iteration and instead store the graph that results after edge removal in another copy, H.
Any suggestions will be highly appreciated.
EDIT: Based on what's suggested below
n=10
G = nx.gnm_random_graph(n=10, m=10, seed=1)
nx.draw(G, with_labels=True)
plt.show()
G_copy = G.copy()
for e in [[5, 0], [3, 6]]:
    print(G_copy.edges())
    H = G_copy.remove_edge(e[0], e[1])
    nx.draw(G_copy, with_labels=True)
    plt.show()
print(G_copy.edges())
Obtained output:
[(0, 6), (0, 7), (0, 5), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
[(0, 6), (0, 7), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
Expected:
[(0, 6), (0, 7), (0, 5), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
[(0, 6), (0, 7), (0, 5), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
Make a copy of the original graph and modify the copy:
H = G.copy()
...
H.remove_edge(e[0], e[1])
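A fuller sketch of that idea, under a couple of assumptions: the copy is taken inside the loop so every iteration starts from the untouched G, and the edge list is copied from the question's printed output rather than generated randomly. Note that remove_edge mutates the graph and returns None, which is why assigning its result to H does not give a graph. The `results` dictionary is a hypothetical container for the per-edge copies.

```python
import networkx as nx

# Edge list copied from the question's printed output, so no random generator is needed.
G = nx.Graph([(0, 6), (0, 7), (0, 5), (1, 4), (1, 7), (1, 9),
              (2, 9), (3, 6), (3, 4), (6, 9)])

results = {}  # hypothetical container: removed edge -> graph without that edge
for u, v in [(5, 0), (3, 6)]:
    H = G.copy()          # fresh copy each iteration, so G is never modified
    H.remove_edge(u, v)   # mutates H in place and returns None
    results[(u, v)] = H
    print(G.number_of_edges(), H.number_of_edges())  # prints "10 9" each time
```

Each H can then be post-processed independently, while G.edges() prints the same full edge list on every iteration.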

Matplotlib not showing point in PyCharm

Using Python 3 in PyCharm on Windows 10
I have a list of tuples that I need to plot, but matplotlib shows an empty graph:
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.use('TkAgg')
input = [(1, 6), (4, 15), (7, 7), (10, 13), (11, 6),
         (11, 18), (11, 21), (12, 10), (15, 18),
         (16, 6), (18, 3), (18, 12), (19, 15), (22, 19)]
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
input_points = []
for array_x, array_y in input:
    input_points.append(Point(array_x, array_y))
plt.plot(array_x, array_y)
plt.show()
How to get the points to show up and plotted in the graph?
To draw lines, plt.plot needs a list (or numpy array) of x-positions and a list of y-positions. The documentation lists the different options to draw markers and/or lines.
List comprehension is a handy way to extract x or y positions from a list of xy-coordinates.
import matplotlib.pyplot as plt
input_points = [(1, 6), (4, 15), (7, 7), (10, 13), (11, 6),
(11, 18), (11, 21), (12, 10), (15, 18),
(16, 6), (18, 3), (18, 12), (19, 15), (22, 19)]
array_x = [x for x, y in input_points]
array_y = [y for x, y in input_points]
plt.plot(array_x, array_y, marker='o', color='crimson', linestyle='-')
plt.show()
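As an alternative to the two list comprehensions, zip with argument unpacking splits the coordinate pairs in one step. This is only a sketch of the extraction part, with the plotting call left as a comment so it runs without a display:

```python
input_points = [(1, 6), (4, 15), (7, 7), (10, 13), (11, 6),
                (11, 18), (11, 21), (12, 10), (15, 18),
                (16, 6), (18, 3), (18, 12), (19, 15), (22, 19)]

# zip(*pairs) transposes the list of (x, y) pairs into one tuple of x's and one of y's.
array_x, array_y = zip(*input_points)

print(array_x[:3], array_y[:3])  # (1, 4, 7) (6, 15, 7)

# The tuples can be passed to matplotlib exactly like the lists above:
# plt.plot(array_x, array_y, marker='o', color='crimson', linestyle='-')
```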

Convert tuple of tuples of floats to ints

Convert
((2.0,3.1),(7.0,4.2),(8.9,1.0),(-8.9,7))
to
((2,3),(7,4),(8,1),(-8,7))
It works to convert the tuple to a numpy array, and then apply .astype(int), but is there a more direct way? Also my 'solution' seems too special.
It works to use numpy
import numpy
data = ((2.0,3.1),(7.0,4.2),(8.9,1.0),(-8.9,7))
data1 = numpy.array(data)
data2 = data1.astype(int)
data3 = tuple(tuple(row) for row in data2)
data3 # ((2, 3), (7, 4), (8, 1), (-8, 7))
((2, 3), (7, 4), (8, 1), (-8, 7))
as expected and desired
In [16]: t = ((2.0,3.1),(7.0,4.2),(8.9,1.0),(-8.9,7))
In [17]: tuple(tuple(map(int, tup)) for tup in t)
Out[17]: ((2, 3), (7, 4), (8, 1), (-8, 7))
Using a simple list comprehension:
result = [(int(element[0]), int(element[1])) for element in t]
If you need it back as a tuple, just convert the list into a tuple:
result = tuple([(int(element[0]), int(element[1])) for element in t])
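One detail worth checking: int() truncates toward zero, so -8.9 becomes -8, as in the desired output. If flooring or rounding were wanted instead, the conversion function can be swapped. A small sketch, where `convert` is a hypothetical helper name:

```python
import math

t = ((2.0, 3.1), (7.0, 4.2), (8.9, 1.0), (-8.9, 7))


def convert(nested, func=int):
    """Apply func to every number in a tuple of tuples."""
    return tuple(tuple(func(value) for value in row) for row in nested)


truncated = convert(t)            # int() drops the fractional part (toward zero)
floored = convert(t, math.floor)  # floor rounds toward negative infinity
rounded = convert(t, round)       # round() goes to the nearest integer

print(truncated)  # ((2, 3), (7, 4), (8, 1), (-8, 7))
print(floored)    # ((2, 3), (7, 4), (8, 1), (-9, 7))
print(rounded)    # ((2, 3), (7, 4), (9, 1), (-9, 7))
```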

Python Spark - How to remove the duplicate element in set without the different ordering?

By using .filter(func), I got the output below.
My output:
[((2, 1), (4, 2), (6, 3)), ((2, 1), (4, 2), (6, 3)), ((2, 1), (4, 2), (6, 3))]
The output I need is only the 3 coordinates.
My desired output:
((2, 1), (4, 2), (6, 3))
Any idea how to remove the duplicate sets? I tested '.distinct()', but it does not work because the ordering of the elements in each set is not the same.
Thanks.
Assign your output as a list:
x= [((2, 1), (4, 2), (6, 3)), ((2, 1), (4, 2), (6, 3)), ((2, 1), (4, 2), (6, 3))]
y = list(set(x))
print(y[0])
Then the output is:
((2, 1), (4, 2), (6, 3))
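Note that set() only collapses tuples whose elements appear in the same order. If the inner coordinates can be ordered differently, as the question suggests, each tuple would need to be normalized first. A small local sketch, using a made-up input with three different orderings:

```python
# Hypothetical input: the same three coordinates in three different orders.
x = [((2, 1), (4, 2), (6, 3)),
     ((2, 1), (6, 3), (4, 2)),
     ((4, 2), (2, 1), (6, 3))]

# Without normalization every ordering counts as a distinct element.
print(len(set(x)))  # 3

# Sorting each tuple first gives all three the same canonical form.
unique = {tuple(sorted(item)) for item in x}
print(unique)  # {((2, 1), (4, 2), (6, 3))}
```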
You can sort each element first and then use the distinct function:
>>> rdd = sc.parallelize([((2, 1), (4, 2), (6, 3)), ((2, 1), (6, 3), (4, 2)), ((2, 1), (4, 2), (6, 3))])
>>> for i in rdd.collect(): print(i)
...
((2, 1), (4, 2), (6, 3))
((2, 1), (6, 3), (4, 2))
((2, 1), (4, 2), (6, 3))
>>> rdd.map(lambda x: tuple(sorted(x))).distinct().collect()
[((2, 1), (4, 2), (6, 3))]
distinct seems to work. What am I missing? What do you mean by the ordering "is not the same"?
df = spark.createDataFrame([((2, 1), (4, 2), (6, 3)), ((2, 1), (4, 2), (6, 3)), ((2, 1), (4, 2), (6, 3))], ['tuple1', 'tuple2', 'tuple3'])
df.distinct().show()
+------+------+------+
|tuple1|tuple2|tuple3|
+------+------+------+
|[2, 1]|[4, 2]|[6, 3]|
+------+------+------+
If you mean that the order of the elements of tuples of tuples can be different then you can sort them as in the other answer. I don't know a convenient way to create an array literal in PySpark so we'll convert the above DataFrame into a single column of array.
from pyspark.sql import functions as F
mergedDf = df.select(F.array(df.tuple1, df.tuple2, df.tuple3).alias("merged"))
mergedDf.show()
+------------------------+
|merged                  |
+------------------------+
|[[2, 1], [4, 2], [6, 3]]|
|[[2, 1], [6, 3], [4, 2]]|
|[[4, 2], [2, 1], [6, 3]]|
+------------------------+
Now we can sort the arrays and apply distinct:
mergedDf.select(F.sort_array(mergedDf.merged).alias("sorted")).distinct().show(truncate=False)
+------------------------+
|sorted                  |
+------------------------+
|[[2, 1], [4, 2], [6, 3]]|
+------------------------+
