Spark RDD: Vary the size of each partition

I need my executors to finish processing their data at different times.
I think the easiest way is to make the RDD partitions have non-uniform sizes. How can I do this?

Not sure what you are trying to achieve, but you can partition the RDD any way you like using partitionBy, e.g.:
(sc.parallelize(xrange(10)).zipWithIndex()
   .partitionBy(2, lambda x: 0 if x < 2 else 1)
   .glom().collect())
[[(0, 0), (1, 1)], [(2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9)]]
Note that this works on a (k, v) RDD and that the partitioning function takes only the key as a parameter.
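As a follow-up, a minimal sketch of deliberately skewed partitions built with the same partitionBy trick (assuming an active SparkContext `sc`; the 90/10 split is an arbitrary illustration, not from the original answer):
# Route keys below 90 to partition 0 and the rest to partition 1,
# producing two partitions of very different sizes.
rdd = sc.parallelize(range(100)).map(lambda x: (x, x))      # make it a (k, v) RDD
skewed = rdd.partitionBy(2, lambda k: 0 if k < 90 else 1)   # custom partition function
print([len(p) for p in skewed.glom().collect()])            # [90, 10]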

Related

MILP: Formulating a sorted list for a constraint

For a MILP project-planning problem I would like to formulate the constraint that activity i must be finished before activity j starts. The activities are to be ordered by duration p, and for each activity the one of its three modes m with the shortest duration is to be used.
So I created a dictionary with the minimum durations of activity i in mode m.
p_im_min = {i: np.min([p[i,m] for m in M_i[i]]) for i in V}
Then I sorted the durations by size:
p_sort = (sorted(p_im_min.items(), key = lambda kv: kv[1]))
Which gives (i,p) in the right order:
p_sort = [(0, 0),
(3, 1),
(4, 1),
(5, 1),
(7, 1),
(13, 1),
(14, 1),
(15, 1),
(19, 1),
(1, 2),
(2, 2),
(8, 2),
(16, 2),
(17, 2),
(18, 2),
(20, 2),
(6, 3),
(10, 3),
(9, 4),
(12, 4),
(11, 5)]
But now I want a list of pairs (i, j), where i must always be finished before j starts. Since I could not find a function for this, I created the list manually:
order_act = [(0,3),
(3,4),
(4,5),
(5,7), etc.
And finally (after formulating the parameters, variables and sets) added the following constraint:
mdl.addConstrs(y[i,j] == 1
for (i,j) in order_act)
My question:
Is there a formula/command in Python to create the (i, j) list automatically? As it is now, the manual approach is not ideal and the solution is not satisfactory.
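One way to build the (i, j) list programmatically is to pair each entry of the sorted list with its successor, e.g. with zip (a minimal sketch based on the p_sort and mdl objects defined above):
# Pair every activity with the next one in the duration-sorted order;
# this reproduces the manually written order_act list.
order_act = [(i, j) for (i, _), (j, _) in zip(p_sort, p_sort[1:])]
# e.g. [(0, 3), (3, 4), (4, 5), (5, 7), ...]
mdl.addConstrs(y[i, j] == 1 for (i, j) in order_act)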

algorithms and run-time analysis

A file (two examples are included) contains a list of banned number intervals. A line that contains, for example, 12-18 means that all numbers from 12 to 18 (inclusive) are prohibited. The intervals may overlap.
We want to know what the smallest allowed number is.
Use the following variables for the run-time analysis (you will not necessarily need all of them):
• N: the maximum possible (not the maximum permissible) number, so the numbers are between 0 and N
• K: the number of intervals in the file
• M: the width of the widest interval
A. There is an obvious way to solve this problem: check the numbers in increasing order until we run into the smallest allowed one (see the sketch right after this list).
• How fast is such an algorithm?
B. You can probably imagine another simple algorithm that uses N bytes (or bits) of memory.
(Hint: crossing numbers out.)
• Describe it in words. For example, you can make up your own input (say, a few intervals with numbers between 0 and 20) and show the algorithm on it. However, also give a general description.
• How fast is this algorithm? In your analysis, use N, K, and M (if you need them).
C. Devise an algorithm that does not consume additional memory (more precisely: its memory consumption should be independent of N, K, and M) but is faster than the algorithm from point A.
• Describe it.
• How fast is it? Is it faster than algorithm B?
D. Now we are interested in how many numbers are allowed (between 0 and N). How would you adjust the above algorithms to answer this question? What happens to their running times?
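A rough sketch of the obvious algorithm from part A (the helper name is hypothetical, not from the original post): try 0, 1, 2, ... and test each candidate against every interval, so the worst case is on the order of N * K checks.
def smallest_allowed_bruteforce(intervals):
    # Walk upwards from 0 until a number is not covered by any interval.
    candidate = 0
    while any(lo <= candidate <= hi for (lo, hi) in intervals):
        candidate += 1
    return candidate

print(smallest_allowed_bruteforce([(12, 18), (2, 5), (3, 8), (0, 4),
                                   (15, 19), (6, 9), (13, 17), (4, 8)]))  # prints 10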
file = "0-19.txt"
intervals = [tuple(map(int, v.split("-"))) for v in open(file)]
# example: intervals = [(12, 18), (2, 5), (3, 8), (0, 4), (15, 19), (6, 9), (13, 17), (4, 8)]
My current code just gets the job done; I have yet to figure out better algorithms and still need to do a lot of work to understand this. I would appreciate a quick solution code/algorithm for parts A, B, and C, and maybe D; then I can study the time analysis myself. Appreciate the help!
def generator_intervala(start, stop, step):
    forbidden_numbers = set()
    while start <= stop:
        forbidden_numbers.add(start)
        start += step
    return forbidden_numbers

mnozica = set()
for interval in intervals:
    a, b = interval
    values = generator_intervala(a, b, 1)
    for i in values:
        mnozica.add(i)

allowed_numbers = set()
N = max(mnozica)
for i in range(N):
    if i not in mnozica:
        allowed_numbers.add(i)

print(intervals)
print(mnozica)
print(min(allowed_numbers))
print(max(mnozica))
Output:
[(12, 18), (2, 5), (3, 8), (0, 4), (15, 19), (6, 9), (13, 17), (4, 8)]
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16, 17, 18, 19}
10
19
Your set approach is needlessly complex:
N = 100
ranges = [(12, 18), (2, 5), (3, 8), (0, 4), (15, 19), (6, 9), (13, 17), (4, 8)]
do_not_use = set()
for (a, b) in ranges:
    do_not_use.update(range(a, b + 1))
print(do_not_use)
print(min(a for a in range(N + 1) if a not in do_not_use))
That is about all that is needed. Output:
set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16, 17, 18, 19])
10
This is independent of N; it only depends on how many numbers are in the ranges.
Storing only the forbidden numbers in a set makes each membership check O(1), and the min() builtin over a range then yields the smallest allowed number.
You can make it faster if you sort your tuples first and then iterate over them until you find the first gap, making it Θ(K log K) for the sort of the K intervals, followed by a Θ(K) scan:
def findme():
    ranges = [(12, 18), (2, 5), (3, 8), (0, 4), (15, 19), (6, 9), (13, 17), (4, 8)]
    ranges.sort()  # in-place sort, no additional space requirements
    if ranges[0][0] > 0:
        return 0
    covered_up_to = ranges[0][1]  # largest number covered by the intervals seen so far
    for (b_min, b_max) in ranges[1:]:
        if covered_up_to < b_min - 1:  # gap between the covered prefix and the next interval
            return covered_up_to + 1
        covered_up_to = max(covered_up_to, b_max)  # intervals may be nested, so keep a running max
    return covered_up_to + 1  # might give you N+1 if no solution in 0-N exists
timeit of yours vs. mine:
Your code uses two sets, multiple loops, incremental additions to your set, and extra function calls, which makes it slower:
N = 100

def findme():
    ranges = [(12, 18), (2, 5), (3, 8), (0, 4), (15, 19), (6, 9), (13, 17), (4, 8)]
    ranges.sort()
    if ranges[0][0] > 0:
        return 0
    covered_up_to = ranges[0][1]
    for (b_min, b_max) in ranges[1:]:
        if covered_up_to < b_min - 1:
            return covered_up_to + 1
        covered_up_to = max(covered_up_to, b_max)
    return covered_up_to + 1

def mine():
    ranges = [(12, 18), (2, 5), (3, 8), (0, 4), (15, 19), (6, 9), (13, 17), (4, 8)]
    N = 100
    do_not_use = set()
    for (a, b) in ranges:
        do_not_use.update(range(a, b + 1))
    return min(a for a in range(N + 1) if a not in do_not_use)

def yours():
    ranges = [(12, 18), (2, 5), (3, 8), (0, 4), (15, 19), (6, 9), (13, 17), (4, 8)]

    def generator_intervala(start, stop, step):
        forbidden_numbers = set()
        while start <= stop:
            forbidden_numbers.add(start)
            start += step
        return forbidden_numbers

    mnozica = set()
    for interval in ranges:
        a, b = interval
        values = generator_intervala(a, b, 1)
        for i in values:
            mnozica.add(i)

    allowed_numbers = set()
    N = max(mnozica)
    for i in range(N):
        if i not in mnozica:
            allowed_numbers.add(i)
    return min(allowed_numbers)

import timeit
print("yours", timeit.timeit(yours, number=100000))
print("mine", timeit.timeit(mine, number=100000))
print("findme", timeit.timeit(findme, number=100000))
Output:
yours 1.3931225209998956
mine 1.263602267999886
findme 0.1711935210005322

ReKeying an RDD

I have a key-value RDD with one key and multiple values. How can I create a new RDD in which one of the values becomes the key and the key becomes a value?
For example, the existing RDD is [(16, (1002, 'US')), (9, (1001, 'MX')), (1, (1004, 'MX')), (17, (1004, 'MX'))], and the desired new RDD is
[(1002, (16, 'US')), (1001, (9, 'MX')), (1004, (1, 'MX')), (1004, (17, 'MX'))]
rdd.map(lambda x: (x[1][0], (x[0], x[1][1])))
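A minimal end-to-end sketch of the same map on the sample data (assuming an active SparkContext `sc`):
rdd = sc.parallelize([(16, (1002, 'US')), (9, (1001, 'MX')),
                      (1, (1004, 'MX')), (17, (1004, 'MX'))])
# Swap the outer key with the first element of the value tuple.
rekeyed = rdd.map(lambda x: (x[1][0], (x[0], x[1][1])))
print(rekeyed.collect())
# [(1002, (16, 'US')), (1001, (9, 'MX')), (1004, (1, 'MX')), (1004, (17, 'MX'))]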

Repartitioning partitioned data

I'm working on a skewed-data problem, such that my smallest partitions are below 64 MB and my largest partitions can be greater than 1 GB. I've been contemplating a strategy of mapping a few small partitions to the same partition key, thus creating a partition made up of several original partitions. This is all in the hope of reducing the variance in task size as well as the number of files stored on disk.
At one point in my Spark application, I need to operate on the (non-grouped) original partitions and to do so, will need to repartition by the original key. This brings me to my question:
Suppose I have two data sets as seen below. Each row is a tuple of the form (partition_key, (original_key, data)). In data0, you can see that original_key = 0 is on its own node, whereas original_key = 4 and original_key = 5 are together on the node containing partition_key = 3. In data1, things are not as organized.
If data0 is partitioned by partition_key, and then partitioned by original_key, will a shuffle occur? In other words, does it matter during the second partitionBy call that data0 is more organized than data1?
data0 = [
(0, (0, 'a')),
(0, (0, 'b')),
(0, (0, 'c')),
(1, (1, 'd')),
(1, (1, 'e')),
(1, (2, 'f')),
(1, (2, 'g')),
(2, (3, 'h')),
(2, (3, 'i')),
(2, (3, 'j')),
(3, (4, 'k')),
(3, (4, 'l')),
(3, (5, 'm')),
(3, (5, 'n')),
(3, (5, 'o')),
]
data1 = [
(0, (0, 'a')),
(1, (0, 'b')),
(0, (0, 'c')),
(1, (1, 'd')),
(2, (1, 'e')),
(1, (2, 'f')),
(3, (2, 'g')),
(2, (3, 'h')),
(0, (3, 'i')),
(3, (3, 'j')),
(3, (4, 'k')),
(3, (4, 'l')),
(1, (5, 'm')),
(2, (5, 'n')),
(3, (5, 'o')),
]
rdd0 = sc.parallelize(data0, 3).cache()
partitioned0 = rdd0.partitionBy(4)
partitioned0.map(lambda row: (row[1][0], row[1])).partitionBy(6).collect()
rdd1 = sc.parallelize(data1, 3).cache()
partitioned1 = rdd1.partitionBy(4)
partitioned1.map(lambda row: (row[1][0], row[1])).partitionBy(6).collect()
When you call repartition (or partitionBy), a shuffle kicks in.
How much of the data gets shuffled depends on the original RDD.
As a side note: when you do sc.parallelize(data0, 3), the 3 is a mere guideline. If the default parallelism is <= 3, your rdd0 will have 3 partitions. If your data0 is spread over more HDFS blocks, providing the partition number has no effect.
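A quick way to see why the second partitionBy always shuffles here is to check the RDD's partitioner attribute (a sketch, assuming an active SparkContext `sc` and the data0 list above):
rdd0 = sc.parallelize(data0, 3)
partitioned0 = rdd0.partitionBy(4)
print(partitioned0.partitioner)   # a Partitioner object: Spark knows how the data is laid out

remapped = partitioned0.map(lambda row: (row[1][0], row[1]))
print(remapped.partitioner)       # None: map() can change keys, so the partitioner is dropped
                                  # and the following partitionBy(6) has to shuffle again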

Pyspark: repartition vs partitionBy

I'm working through these two concepts right now and would like some clarity. From working through the command line, I've been trying to identify the differences and when a developer would use repartition vs partitionBy.
Here is some sample code:
rdd = sc.parallelize([('a', 1), ('a', 2), ('b', 1), ('b', 3), ('c',1), ('ef',5)])
rdd1 = rdd.repartition(4)
rdd2 = rdd.partitionBy(4)
rdd1.glom().collect()
[[('b', 1), ('ef', 5)], [], [], [('a', 1), ('a', 2), ('b', 3), ('c', 1)]]
rdd2.glom().collect()
[[('a', 1), ('a', 2)], [], [('c', 1)], [('b', 1), ('b', 3), ('ef', 5)]]
I took a look at the implementation of both, and the only difference I've noticed for the most part is that partitionBy can take a partitioning function (it uses portable_hash by default). So with partitionBy, all records with the same key should end up in the same partition. With repartition, I would expect the values to be distributed more evenly over the partitions, but this isn't the case.
Given this, why would anyone ever use repartition? I suppose the only time I could see it being used is if I'm not working with a PairRDD, or if I have large data skew?
Is there something that I'm missing, or could someone shed light from a different angle for me?
repartition() is used for specifying the number of partitions, taking into account the number of cores and the amount of data you have.
partitionBy() is used for making shuffle-based operations such as reduceByKey(), join(), cogroup(), etc. more efficient. It is only beneficial when an RDD is used multiple times, so it is usually followed by persist().
Differences between the two in action:
pairs = sc.parallelize([1, 2, 3, 4, 2, 4, 1, 5, 6, 7, 7, 5, 5, 6, 4]).map(lambda x: (x, x))
pairs.partitionBy(3).glom().collect()
[[(3, 3), (6, 6), (6, 6)],
[(1, 1), (4, 4), (4, 4), (1, 1), (7, 7), (7, 7), (4, 4)],
[(2, 2), (2, 2), (5, 5), (5, 5), (5, 5)]]
pairs.repartition(3).glom().collect()
[[(4, 4), (2, 2), (6, 6), (7, 7), (5, 5), (5, 5)],
[(1, 1), (4, 4), (6, 6), (4, 4)],
[(2, 2), (3, 3), (1, 1), (5, 5), (7, 7)]]
repartition already exists on plain RDDs and does not handle partitioning by key (or by any other criterion except Ordering). PairRDDs add the notion of keys and, subsequently, another method that allows partitioning by that key.
So yes, if your data is keyed, you should absolutely partition by that key, which in many cases is the point of using a PairRDD in the first place (for joins, reduceByKey, and so on).
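A short sketch of that pattern, i.e. partitioning by key once, persisting, and then reusing the keyed layout (assumes an active SparkContext `sc`; the sample data is arbitrary):
pairs = sc.parallelize([('a', 1), ('a', 2), ('b', 1), ('b', 3), ('c', 1), ('ef', 5)])
partitioned = pairs.partitionBy(4).persist()                     # pay the shuffle once, keep the layout
print(partitioned.reduceByKey(lambda a, b: a + b, 4).collect())  # same partitioning (4, default hash),
                                                                 # so equal keys are already co-located
print(partitioned.countByKey())                                  # a second keyed use benefits from persist()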
