I'm working on a skewed data problem, such that my smallest partitions are below 64MB and my largest partitions can be greater than 1GB. I've been contemplating a strategy to map a few small partitions to the same partition key, thus creating a partition comprised of partitions. This is all in the hope of reducing variance in task size as well as number of files stored on disk.
At one point in my Spark application, I need to operate on the (non-grouped) original partitions and to do so, will need to repartition by the original key. This brings me to my question:
Suppose I have two data sets as seen below. Each row is a tuple of the form (partition_key, (original_key, data)). In data0, you can see that original_key = 0 is on its own node, whereas original_key = 4 and original_key = 5 are together on the node containing partition_key = 3. In data1, things are not as organized.
If data0 is partitioned by partition_key, and then partitioned by original_key, will a shuffle occur? In other words, does it matter during the second partitionBy call that data0 is more organized than data1?
data0 = [
(0, (0, 'a')),
(0, (0, 'b')),
(0, (0, 'c')),
(1, (1, 'd')),
(1, (1, 'e')),
(1, (2, 'f')),
(1, (2, 'g')),
(2, (3, 'h')),
(2, (3, 'i')),
(2, (3, 'j')),
(3, (4, 'k')),
(3, (4, 'l')),
(3, (5, 'm')),
(3, (5, 'n')),
(3, (5, 'o')),
]
data1 = [
(0, (0, 'a')),
(1, (0, 'b')),
(0, (0, 'c')),
(1, (1, 'd')),
(2, (1, 'e')),
(1, (2, 'f')),
(3, (2, 'g')),
(2, (3, 'h')),
(0, (3, 'i')),
(3, (3, 'j')),
(3, (4, 'k')),
(3, (4, 'l')),
(1, (5, 'm')),
(2, (5, 'n')),
(3, (5, 'o')),
]
rdd0 = sc.parallelize(data0, 3).cache()
partitioned0 = rdd0.partitionBy(4)
partitioned0.map(lambda row: (row[1][0], row[1])).partitionBy(6).collect()
rdd1 = sc.parallelize(data1, 3).cache()
partitioned1 = rdd1.partitionBy(4)
partitioned1.map(lambda row: (row[1][0], row[1])).partitionBy(6).collect()
Whenever you repartition, a shuffle kicks in.
How much of the data gets shuffled depends on how the original RDD is laid out.
As a side note: when you call sc.parallelize(data0, 3), the 3 is merely a guideline. If the default number of partitions is <= 3, then rdd0 will have 3 partitions. If data0 spans more HDFS blocks than that, providing the partition number has no effect.
For a MILP project planning problem I would like to formulate the constraint that activity i must be finished before activity j starts. The activities are to be ordered by duration p, and of the three modes m available per activity, the one that takes the shortest time is to be used.
So I created a dictionary with the minimum durations of activity i in mode m.
p_im_min = {i: np.min([p[i,m] for m in M_i[i]]) for i in V}
Then I sorted the durations by size:
p_sort = (sorted(p_im_min.items(), key = lambda kv: kv[1]))
Which gives (i,p) in the right order:
p_sort = [(0, 0),
(3, 1),
(4, 1),
(5, 1),
(7, 1),
(13, 1),
(14, 1),
(15, 1),
(19, 1),
(1, 2),
(2, 2),
(8, 2),
(16, 2),
(17, 2),
(18, 2),
(20, 2),
(6, 3),
(10, 3),
(9, 4),
(12, 4),
(11, 5)]
But now I want a list with (i,j), where i must always be finished before j starts. Since I could not find the function, I created this list manually, thus
order_act = [(0,3),
(3,4),
(4,5),
(5,7), etc.
And finally (after formulating the parameters, variables and sets) added the following constraint:
mdl.addConstrs(y[i,j] == 1
for (i,j) in order_act)
My question:
Is there any way to use a formula/command in Python to create the list (i,j)? Because as it is now, it is not ideal and the solution is not satisfactory.
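One possible way to build the list, sketched under the assumption that each activity should simply precede the next one in the sorted order (which is what the manually written pairs suggest): zip p_sort with itself shifted by one position. The shortened p_sort below is only there to keep the example small; order_act is the name used in the question.

```python
# First few (activity, duration) entries of the sorted list shown above.
p_sort = [(0, 0), (3, 1), (4, 1), (5, 1), (7, 1), (13, 1)]

# Pair every activity with its successor in the sorted order.
order_act = [(i, j) for (i, _), (j, _) in zip(p_sort, p_sort[1:])]

print(order_act)
# [(0, 3), (3, 4), (4, 5), (5, 7), (7, 13)]
```

The resulting list can be passed to the addConstrs call unchanged. Note that ties in duration are ordered by whatever order sorted() leaves them in, which may or may not be what the model needs.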
I have a networkx graph and I'm trying to remove edges from it using remove_edge.
I want to remove each edge in the original graph and post-process H to get further stats like edges connected to the edge that has been removed.
import networkx as nx
import matplotlib.pyplot as plt
# fig 1
n=10
G = nx.gnm_random_graph(n=10, m=10, seed=1)
nx.draw(G, with_labels=True)
plt.show()
for e in [[5, 0], [3, 6]]:
    H = G.remove_edge(e[0], e[1])
    nx.draw(G, with_labels=True)
    plt.show()
In the above, the edge is removed in place in G. So for the second iteration, the original graph is no longer present. How can this be avoided? I want to retain the original graph for every iteration and instead store the graph that results after edge removal in another copy, H.
Any suggestions will be highly appreciated.
EDIT: Based on what's suggested below
n=10
G = nx.gnm_random_graph(n=10, m=10, seed=1)
nx.draw(G, with_labels=True)
plt.show()
G_copy = G.copy()
for e in [[5, 0], [3, 6]]:
    print(G_copy.edges())
    H = G_copy.remove_edge(e[0], e[1])
    nx.draw(G_copy, with_labels=True)
    plt.show()
    print(G_copy.edges())
Obtained output:
[(0, 6), (0, 7), (0, 5), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
[(0, 6), (0, 7), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
Expected:
[(0, 6), (0, 7), (0, 5), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
[(0, 6), (0, 7), (0, 5), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
Make a copy of the original graph and modify the copy:
H = G.copy()
...
H.remove_edge(e[0], e[1])
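A fuller sketch of that idea, in case it helps. Two things matter: remove_edge modifies the graph in place and returns None (so H = G.remove_edge(...) leaves H as None), and the copy has to be taken inside the loop so every iteration starts from the untouched original. The edge list below is copied from the output printed in the question rather than regenerated, and the results dict and neighbours list are just illustrative names.

```python
import networkx as nx

# Edge list copied from the printed output in the question, so the example
# does not depend on the random graph generator.
edges = [(0, 6), (0, 7), (0, 5), (1, 4), (1, 7), (1, 9),
         (2, 9), (3, 6), (3, 4), (6, 9)]
G = nx.Graph(edges)

results = {}
for u, v in [(5, 0), (3, 6)]:
    H = G.copy()          # fresh copy of the original on every iteration
    H.remove_edge(u, v)   # mutates H in place and returns None
    results[(u, v)] = H

    # Edges of the original graph that share an endpoint with the removed edge.
    neighbours = [e for e in G.edges([u, v]) if set(e) != {u, v}]
    print((u, v), H.number_of_edges(), neighbours)

print(G.number_of_edges())  # still 10: the original graph is unchanged
```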
This is the code that I have written. It describes flight connectivity where the source and the destination have exactly one city in common (a one-stop connection). It seems right for most of the test cases but doesn't satisfy this particular one.
def onehop(lis):
    hop=[]
    for (i,j) in lis:
        for (k,l) in lis:
            if i==k and j!=l:
                return sorted(lis)
            if (i!=k and j!=l)and(i==l or j==k) and (((i,j) not in hop) and ((k,l) not in hop)):
                m=lis.pop(lis.index((i,j)))
                n=lis.pop(lis.index((k,l)))
                hop.extend([m,n])
    for i in range(len(hop)):
        if hop[i][0]>hop[i][1]:
            hop[i]=(hop[i][1],hop[i][0])
    ans=sorted(hop,key=lambda item: (item[0],item[1]))
    return ans
onehop([(2,3),(1,2),(3,1),(1,3),(3,2),(2,4),(4,1)])
Output I expected:
[(1, 2), (1, 3), (1, 4), (2, 1), (3, 2), (3, 4), (4, 2), (4, 3)]
Output I obtained:
[(1, 2), (1, 3), (2, 3), (2, 4), (3, 1), (3, 2), (4, 1)]
def onehop(lis):
    hop=[]
    for (i,j) in lis:
        for (k,l) in lis:
            if j==k and i!=l :
                hop.append([i,l])
    unique = [list(x) for x in set(tuple(x) for x in hop)]
    ans=sorted(unique,key=lambda item: (item[0],item[1]))
    ans1 = [tuple(l) for l in ans]
    return(ans1)
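The same logic can be written more compactly with a set comprehension, which removes duplicates as it goes. This is only a restatement of the function above, not a different algorithm; the parameter name flights is my own.

```python
def onehop(flights):
    """Return the sorted (source, destination) pairs reachable with one stop."""
    return sorted({
        (i, l)
        for (i, j) in flights
        for (k, l) in flights
        if j == k and i != l  # chain i -> j -> l, skipping round trips
    })

print(onehop([(2, 3), (1, 2), (3, 1), (1, 3), (3, 2), (2, 4), (4, 1)]))
# [(1, 2), (1, 3), (1, 4), (2, 1), (3, 2), (3, 4), (4, 2), (4, 3)]
```

Unlike the version in the question, nothing is popped from the list while it is being iterated, so the input is left untouched.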
I need executors to finish processing data at different times.
I think the easiest way is to give the RDD partitions non-uniform sizes. How can I do this?
Not sure what you are trying to achieve, but you can partition the RDD any way you like using partitionBy, e.g.:
(sc.parallelize(range(10)).zipWithIndex()
    .partitionBy(2, lambda x: 0 if x < 2 else 1)
    .glom().collect())
[[(0, 0), (1, 1)], [(2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9)]]
Note that it works on a (k, v) RDD and that the partitioning function takes only k as a parameter.
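To see why the split comes out that way without starting Spark, here is a pure-Python model of the placement rule. It assumes the rule is partition_func(key) % num_partitions, and simulate_partition_by is a made-up helper for illustration, not a Spark API.

```python
def simulate_partition_by(pairs, num_partitions, partition_func):
    """Model of key-based placement: index = partition_func(key) % num_partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # Only the key is passed to the partitioning function.
        buckets[partition_func(key) % num_partitions].append((key, value))
    return buckets

pairs = [(x, x) for x in range(10)]  # stands in for zipWithIndex()
skewed = simulate_partition_by(pairs, 2, lambda k: 0 if k < 2 else 1)
print(skewed)
```

Any function of the key works here, so the sizes of the partitions can be made as uneven as needed.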
I'm working through these two concepts right now and would like some clarity. From working through the command line, I've been trying to identify the differences and when a developer would use repartition vs partitionBy.
Here is some sample code:
rdd = sc.parallelize([('a', 1), ('a', 2), ('b', 1), ('b', 3), ('c',1), ('ef',5)])
rdd1 = rdd.repartition(4)
rdd2 = rdd.partitionBy(4)
rdd1.glom().collect()
[[('b', 1), ('ef', 5)], [], [], [('a', 1), ('a', 2), ('b', 3), ('c', 1)]]
rdd2.glom().collect()
[[('a', 1), ('a', 2)], [], [('c', 1)], [('b', 1), ('b', 3), ('ef', 5)]]
I took a look at the implementation of both, and the only difference I've noticed for the most part is that partitionBy can take a partitioning function, or uses portable_hash by default. So in partitionBy, all the same keys should be in the same partition. In repartition, I would expect the values to be distributed more evenly over the partitions, but this isn't the case.
Given this, why would anyone ever use repartition? I suppose the only time I could see it being used is if I'm not working with PairRDD, or I have large data skew?
Is there something that I'm missing, or could someone shed light from a different angle for me?
repartition() is used for specifying the number of partitions considering the number of cores and the amount of data you have.
partitionBy() is used for making shuffling functions more efficient, such as reduceByKey(), join(), cogroup() etc. It is only beneficial in cases where an RDD is used multiple times, so it is usually followed by persist().
Differences between the two in action:
pairs = sc.parallelize([1, 2, 3, 4, 2, 4, 1, 5, 6, 7, 7, 5, 5, 6, 4]).map(lambda x: (x, x))
pairs.partitionBy(3).glom().collect()
[[(3, 3), (6, 6), (6, 6)],
[(1, 1), (4, 4), (4, 4), (1, 1), (7, 7), (7, 7), (4, 4)],
[(2, 2), (2, 2), (5, 5), (5, 5), (5, 5)]]
pairs.repartition(3).glom().collect()
[[(4, 4), (2, 2), (6, 6), (7, 7), (5, 5), (5, 5)],
[(1, 1), (4, 4), (6, 6), (4, 4)],
[(2, 2), (3, 3), (1, 1), (5, 5), (7, 7)]]
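The partitionBy(3) result above can be reproduced by hand, assuming the default partitioner boils down to hash(key) % numPartitions for these small integer keys (in CPython the hash of a small non-negative int is the int itself). hash_buckets is a hypothetical helper, not part of Spark.

```python
keys = [1, 2, 3, 4, 2, 4, 1, 5, 6, 7, 7, 5, 5, 6, 4]

def hash_buckets(keys, num_partitions):
    """Place each (k, k) pair in bucket hash(k) % num_partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for k in keys:
        buckets[hash(k) % num_partitions].append((k, k))
    return buckets

buckets = hash_buckets(keys, 3)
for bucket in buckets:
    print(bucket)
```

Every occurrence of a key lands in the same bucket, which is the property that key-based operations rely on. The repartition(3) output, by contrast, spreads equal keys across partitions.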
repartition already exists in RDDs, and does not handle partitioning by key (or by any other criterion except Ordering). PairRDDs add the notion of keys and, with it, another method that allows partitioning by that key.
So yes, if your data is keyed, you should absolutely partition by that key, which in many cases is the point of using a PairRDD in the first place (for joins, reduceByKey, and so on).