Pyspark: repartition vs partitionBy - apache-spark

I'm working through these two concepts right now and would like some clarity. From working through the command line, I've been trying to identify the differences and when a developer would use repartition vs partitionBy.
Here is some sample code:
rdd = sc.parallelize([('a', 1), ('a', 2), ('b', 1), ('b', 3), ('c',1), ('ef',5)])
rdd1 = rdd.repartition(4)
rdd2 = rdd.partitionBy(4)
rdd1.glom().collect()
[[('b', 1), ('ef', 5)], [], [], [('a', 1), ('a', 2), ('b', 3), ('c', 1)]]
rdd2.glom().collect()
[[('a', 1), ('a', 2)], [], [('c', 1)], [('b', 1), ('b', 3), ('ef', 5)]]
I took a look at the implementation of both, and the main difference I've noticed is that partitionBy can take a partitioning function, or uses portable_hash by default. So with partitionBy, all records with the same key should end up in the same partition. With repartition, I would expect the values to be distributed more evenly over the partitions, but this isn't the case.
Given this, why would anyone ever use repartition? I suppose the only time I could see it being used is if I'm not working with a PairRDD, or if I have large data skew?
Is there something that I'm missing, or could someone shed light from a different angle for me?

repartition() is used to set the number of partitions, taking into account the number of cores and the amount of data you have.
partitionBy() is used to make shuffling operations such as reduceByKey(), join(), and cogroup() more efficient. It is only beneficial when an RDD is used multiple times, so it is usually followed by persist().
Differences between the two in action:
pairs = sc.parallelize([1, 2, 3, 4, 2, 4, 1, 5, 6, 7, 7, 5, 5, 6, 4]).map(lambda x: (x, x))
pairs.partitionBy(3).glom().collect()
[[(3, 3), (6, 6), (6, 6)],
[(1, 1), (4, 4), (4, 4), (1, 1), (7, 7), (7, 7), (4, 4)],
[(2, 2), (2, 2), (5, 5), (5, 5), (5, 5)]]
pairs.repartition(3).glom().collect()
[[(4, 4), (2, 2), (6, 6), (7, 7), (5, 5), (5, 5)],
[(1, 1), (4, 4), (6, 6), (4, 4)],
[(2, 2), (3, 3), (1, 1), (5, 5), (7, 7)]]
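The partitionBy layout above can be reproduced without a cluster. The following is a minimal pure-Python sketch, not Spark's implementation: it assumes the partition index is hash(key) % numPartitions, which is what the default hashing amounts to for small integer keys, and hash_partition is a hypothetical helper name.

```python
def hash_partition(pairs, num_partitions):
    """Toy model of key-based partitioning: bucket index = hash(key) % n."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # Every record with the same key lands in the same bucket.
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

pairs = [(x, x) for x in [1, 2, 3, 4, 2, 4, 1, 5, 6, 7, 7, 5, 5, 6, 4]]
buckets = hash_partition(pairs, 3)
for bucket in buckets:
    print(bucket)
```

Because the bucket depends only on the key, a later key-based operation finds all records for a key in one place. repartition makes no such promise, which is why identical keys are spread over several partitions in its output.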

repartition exists on plain RDDs and does not partition by key (or by any other criterion except ordering). PairRDDs add the notion of keys, and with it another method, partitionBy, that lets you partition by that key.
So yes, if your data is keyed, you should absolutely partition by that key, which in many cases is the point of using a PairRDD in the first place (for joins, reduceByKey, and so on).

Related

MILP: Formulating a sorted list for a constraint

For a MILP project planning problem I would like to formulate the constraint that activity i must be finished before activity j starts. The activities are to be ordered by duration p, and for each activity the one of its three modes m that takes the shortest time is to be used.
So I created a dictionary with the minimum durations of activity i in mode m.
p_im_min = {i: np.min([p[i,m] for m in M_i[i]]) for i in V}
Then I sorted the durations by size:
p_sort = (sorted(p_im_min.items(), key = lambda kv: kv[1]))
Which gives (i,p) in the right order:
p_sort = [(0, 0),
(3, 1),
(4, 1),
(5, 1),
(7, 1),
(13, 1),
(14, 1),
(15, 1),
(19, 1),
(1, 2),
(2, 2),
(8, 2),
(16, 2),
(17, 2),
(18, 2),
(20, 2),
(6, 3),
(10, 3),
(9, 4),
(12, 4),
(11, 5)]
But now I want a list of pairs (i,j), where i must always be finished before j starts. Since I could not find a function for this, I created the list manually:
order_act = [(0,3),
(3,4),
(4,5),
(5,7), etc.
And finally (after formulating the parameters, variables and sets) added the following constraint:
mdl.addConstrs(y[i,j] == 1
for (i,j) in order_act)
My question:
Is there any way to use a formula/command in Python to create the list (i,j)? Because as it is now, it is not ideal and the solution is not satisfactory.
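One possible way, sketched under the assumption that the required order is simply the order of p_sort and that each activity only needs to precede its direct successor in that list: pair every entry with the next one using zip. The p_sort values are copied from the question. Everything else is illustrative.

```python
# (activity, minimum duration) pairs, already sorted by duration
p_sort = [(0, 0), (3, 1), (4, 1), (5, 1), (7, 1), (13, 1), (14, 1),
          (15, 1), (19, 1), (1, 2), (2, 2), (8, 2), (16, 2), (17, 2),
          (18, 2), (20, 2), (6, 3), (10, 3), (9, 4), (12, 4), (11, 5)]

# Keep only the activity ids, then pair each one with its successor.
activities = [i for i, _ in p_sort]
order_act = list(zip(activities, activities[1:]))

print(order_act[:4])
```

The first four pairs match the manually written list, and the result can be passed to addConstrs exactly as before.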

Python Spark - How to remove the duplicate element in set without the different ordering?

By using .filter(func), I got the output below.
My output:
[((2, 1), (4, 2), (6, 3)), ((2, 1), (4, 2), (6, 3)), ((2, 1), (4, 2), (6, 3))]
The output I need is only the 3 coordinates.
My desired output:
((2, 1), (4, 2), (6, 3))
Any idea how to remove the duplicate sets? I tested distinct(), but it does not work because the ordering of the elements in the set is not the same.
Thanks.
Assign your output as a list:
x= [((2, 1), (4, 2), (6, 3)), ((2, 1), (4, 2), (6, 3)), ((2, 1), (4, 2), (6, 3))]
y = list(set(x))
print(y[0])
The output is:
((2, 1), (4, 2), (6, 3))
You can sort each tuple first and then use the distinct function:
>>> rdd = sc.parallelize([((2, 1), (4, 2), (6, 3)), ((2, 1), (6, 3), (4, 2)), ((2, 1), (4, 2), (6, 3))])
>>> for i in rdd.collect(): print(i)
...
((2, 1), (4, 2), (6, 3))
((2, 1), (6, 3), (4, 2))
((2, 1), (4, 2), (6, 3))
>>> rdd.map(lambda x: tuple(sorted(x))).distinct().collect()
[((2, 1), (4, 2), (6, 3))]
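The same normalise-then-deduplicate idea can be checked in plain Python, without Spark. This is only a sketch of the technique: it assumes the coordinates inside each row are comparable, so sorting gives every ordering of the same coordinates one canonical form.

```python
rows = [((2, 1), (4, 2), (6, 3)),
        ((2, 1), (6, 3), (4, 2)),   # same coordinates, different order
        ((2, 1), (4, 2), (6, 3))]

# Sort each row so that order no longer matters, then let a set drop duplicates.
unique = sorted({tuple(sorted(row)) for row in rows})
print(unique)
```

A plain set(rows) would keep two entries here, because the second row is a different tuple even though it holds the same coordinates.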
distinct seems to work. What am I missing? What about the ordering "is not the same"?
df = spark.createDataFrame([((2, 1), (4, 2), (6, 3)), ((2, 1), (4, 2), (6, 3)), ((2, 1), (4, 2), (6, 3))], ['tuple1', 'tuple2', 'tuple3'])
df.distinct().show()
+------+------+------+
|tuple1|tuple2|tuple3|
+------+------+------+
|[2, 1]|[4, 2]|[6, 3]|
+------+------+------+
If you mean that the order of the elements of tuples of tuples can be different then you can sort them as in the other answer. I don't know a convenient way to create an array literal in PySpark so we'll convert the above DataFrame into a single column of array.
from pyspark.sql import functions as F
mergedDf = df.select(F.array(df.tuple1, df.tuple2, df.tuple3).alias("merged"))
mergedDf.show()
+------------------------+
|merged |
+------------------------+
|[[2, 1], [4, 2], [6, 3]]|
|[[2, 1], [6, 3], [4, 2]]|
|[[4, 2], [2, 1], [6, 3]]|
+------------------------+
Now we can sort the array and then apply distinct:
mergedDf.select(F.sort_array(mergedDf.merged).alias("sorted")).distinct().show(truncate=False)
+------------------------+
|sorted |
+------------------------+
|[[2, 1], [4, 2], [6, 3]]|
+------------------------+

Python: Split list into list of lists

Suppose that I have a list:
list = [(4, 7), (3, 7), (5, 7), (4, 6), (4, 8), (2, 7), (3, 6), (3, 8), (6, 7)]
I want to divide the list into sublists of lengths [2, 3, 4] (these lengths can vary).
To produce: sublist_list = [[(4, 7), (3, 7)],[(5, 7), (4, 6), (4, 8)], [(2, 7), (3, 6), (3, 8), (6, 7)]]
What's the quickest way that I can do this? Thanks in advance.
myList = [(4, 7), (3, 7), (5, 7), (4, 6), (4, 8), (2, 7), (3, 6), (3, 8), (6, 7)]
listOfLengths = [2, 3, 4]
def getSublists(listOfLengths, myList):
    listOfSublists = []
    start = 0  # index where the next sublist begins
    for length in listOfLengths:
        listOfSublists.append(myList[start:start + length])
        start += length  # advance past the elements just consumed
    return listOfSublists
Then if you call getSublists on your myList (original list input) and listOfLengths (a list containing the lengths of your sublists), you get
#In: getSublists(listOfLengths,myList)
#Out: [[(4, 7), (3, 7)], [(5, 7), (4, 6), (4, 8)], [(2, 7), (3, 6), (3, 8), (6, 7)]]
You can use the list[i:j] slicing feature in Python, which returns a new list containing the elements list[i] to list[j-1] of the original list.
base = 0
lengths = [2, 3, 4]  # list of lengths
sub_list = []
for num in lengths:
    # myList is the input list
    sub_list.append(myList[base:base + num])
    base += num  # jump to the start of the next sublist
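What about simply iterating over the list and appending to the new lists?
lengths = [2, 3, 4]
sublistlist = [[]]
for item in myList:  # myList is the input list
    # start a new sublist once the current one has reached its target length
    if len(sublistlist[-1]) == lengths[len(sublistlist) - 1]:
        sublistlist.append([])
    sublistlist[-1].append(item)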
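The running-offset idea used in these answers can also be written with itertools.accumulate, which computes the end index of every sublist up front. This is a sketch, and split_by_lengths is a hypothetical name. It assumes the lengths sum to at most len(items), and any leftover elements are simply ignored.

```python
from itertools import accumulate

def split_by_lengths(items, lengths):
    """Split items into consecutive slices with the given lengths."""
    ends = list(accumulate(lengths))   # [2, 3, 4] -> [2, 5, 9]
    starts = [0] + ends[:-1]           # [0, 2, 5]
    return [items[s:e] for s, e in zip(starts, ends)]

my_list = [(4, 7), (3, 7), (5, 7), (4, 6), (4, 8), (2, 7), (3, 6), (3, 8), (6, 7)]
sublist_list = split_by_lengths(my_list, [2, 3, 4])
print(sublist_list)
```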

Repartitioning partitioned data

I'm working on a skewed data problem, such that my smallest partitions are below 64MB and my largest partitions can be greater than 1GB. I've been contemplating a strategy to map a few small partitions to the same partition key, thus creating a partition comprised of partitions. This is all in the hope of reducing variance in task size as well as number of files stored on disk.
At one point in my Spark application, I need to operate on the (non-grouped) original partitions and to do so, will need to repartition by the original key. This brings me to my question:
Suppose I have two data sets as seen below. Each row is a tuple of the form (partition_key, (original_key, data)). In data0, you can see that original_key = 0 is on its own node, whereas original_key = 4 and original_key = 5 are together on the node containing partition_key = 3. In data1, things are not as organized.
If data0 is partitioned by partition_key, and then partitioned by original_key, will a shuffle occur? In other words, does it matter during the second partitionBy call that data0 is more organized than data1?
data0 = [
(0, (0, 'a')),
(0, (0, 'b')),
(0, (0, 'c')),
(1, (1, 'd')),
(1, (1, 'e')),
(1, (2, 'f')),
(1, (2, 'g')),
(2, (3, 'h')),
(2, (3, 'i')),
(2, (3, 'j')),
(3, (4, 'k')),
(3, (4, 'l')),
(3, (5, 'm')),
(3, (5, 'n')),
(3, (5, 'o')),
]
data1 = [
(0, (0, 'a')),
(1, (0, 'b')),
(0, (0, 'c')),
(1, (1, 'd')),
(2, (1, 'e')),
(1, (2, 'f')),
(3, (2, 'g')),
(2, (3, 'h')),
(0, (3, 'i')),
(3, (3, 'j')),
(3, (4, 'k')),
(3, (4, 'l')),
(1, (5, 'm')),
(2, (5, 'n')),
(3, (5, 'o')),
]
rdd0 = sc.parallelize(data0, 3).cache()
partitioned0 = rdd0.partitionBy(4)
partitioned0.map(lambda row: (row[1][0], row[1])).partitionBy(6).collect()
rdd1 = sc.parallelize(data1, 3).cache()
partitioned1 = rdd1.partitionBy(4)
partitioned1.map(lambda row: (row[1][0], row[1])).partitionBy(6).collect()
Whenever you repartition, a shuffle kicks in.
How much data actually gets moved depends on how the original RDD is laid out.
As a side note: with sc.parallelize(data0, 3) the 3 is the number of slices the local collection is split into, so rdd0 will have 3 partitions. The "mere guideline" behaviour applies to file-based input such as sc.textFile, where the argument is only a minimum: if the file spans more HDFS blocks than the number you pass, you get more partitions.
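To see how the two layouts differ before the second partitionBy, here is a small pure-Python model. It is not Spark code and does not measure real network traffic: it only records, for each original_key, which of the 4 first-stage partitions hold its rows, assuming the partition index is hash(partition_key) % 4. The helper name placement is hypothetical, and the data is a compact rebuild of data0 and data1 from the question.

```python
original_keys = [0, 0, 0, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5]
partition_keys0 = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]
partition_keys1 = [0, 1, 0, 1, 2, 1, 3, 2, 0, 3, 3, 3, 1, 2, 3]
letters = "abcdefghijklmno"

data0 = [(p, (o, c)) for p, o, c in zip(partition_keys0, original_keys, letters)]
data1 = [(p, (o, c)) for p, o, c in zip(partition_keys1, original_keys, letters)]

def placement(rows, num_partitions):
    """Map each original_key to the set of partition indexes holding its rows."""
    where = {}
    for partition_key, (original_key, _) in rows:
        where.setdefault(original_key, set()).add(hash(partition_key) % num_partitions)
    return where

print(placement(data0, 4))  # every original_key sits in exactly one partition
print(placement(data1, 4))  # most original_keys are spread over several partitions
```

In both cases the second partitionBy still runs a shuffle, because the map() in between discards the partitioner. The difference is only in where each key's rows start out.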

Spark RDD: Vary the size of each partition

I need executors to finish processing data at different times.
I think the easiest way is to make RDD partitions have not uniform sizes. How can I do this?
Not sure what you are trying to achieve, but you can partition the RDD any way you like using partitionBy, e.g.:
(sc.parallelize(range(10)).zipWithIndex()
    .partitionBy(2, lambda x: 0 if x < 2 else 1)
    .glom().collect())
[[(0, 0), (1, 1)], [(2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9)]]
Note that this works on a (k, v) RDD, and the partitioning function takes only k as a parameter.
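If specific partition sizes are wanted, the partitioning function can be derived from a list of target sizes. The following is a plain-Python sketch of such a function, assuming the key is a running index 0, 1, 2, ... and that the sizes sum to the number of elements. make_size_partitioner is a hypothetical name, and the loop only simulates what partitionBy would do with it.

```python
from bisect import bisect_right
from itertools import accumulate

def make_size_partitioner(sizes):
    """Return f(index) -> partition number so that partition p gets sizes[p] indexes."""
    bounds = list(accumulate(sizes))  # e.g. [1, 3, 6] -> [1, 4, 10]
    return lambda index: bisect_right(bounds, index)

sizes = [1, 3, 6]
partition_of = make_size_partitioner(sizes)

# Simulate the distribution of indexes 0..9 over the three partitions.
buckets = [[] for _ in sizes]
for index in range(sum(sizes)):
    buckets[partition_of(index)].append(index)

print([len(b) for b in buckets])  # [1, 3, 6]
```

With an RDD keyed by such an index, a function like partition_of could be passed as the second argument of partitionBy, in the same position as the lambda above.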
