ReKeying an RDD - python-3.x

I have a key-value RDD with one key and multiple values. How can I create a new RDD so that one of the values becomes the key and the key becomes a value?
For example, the existing RDD is [(16, (1002, 'US')), (9, (1001, 'MX')), (1, (1004, 'MX')), (17, (1004, 'MX'))], and the new RDD I want is
(1002, (16, 'US')), (1001, (9, 'MX')), (1004, (1, 'MX')), (1004, (17, 'MX'))

rdd.map(lambda x: (x[1][0],(x[0],x[1][1])))
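As a quick sanity check, here is a minimal sketch (assuming a live SparkContext named sc) that applies the map above to the sample data from the question:
rdd = sc.parallelize([(16, (1002, 'US')), (9, (1001, 'MX')), (1, (1004, 'MX')), (17, (1004, 'MX'))])
rekeyed = rdd.map(lambda x: (x[1][0], (x[0], x[1][1])))  # value[0] becomes the key, old key moves into the value
print(rekeyed.collect())
# [(1002, (16, 'US')), (1001, (9, 'MX')), (1004, (1, 'MX')), (1004, (17, 'MX'))]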

Related

How to change a two-dimensional label to a one-dimensional label?

I have a two-dimensional (10, 2) list of coordinates, where each coordinate indicates a point's label, like
coord_list = [(19, 17), (19, 17), (5, 26), (19, 17), (5, 26), (5, 26), (15, 17), (19, 5), (18, 6), (5, 26)]
I want to change it to a label list that has only one dimension (10, 1) (assign a "label" to every unique item and replace each item by its label), like
label_list = [1,1,0....2,3]
I just want to classify points that have the same coordinates under the same label. Is there a simpler way to achieve this?
I tried to use this code,
label_list = []
for idx, coord in enumerate(coord_list):
    if coord == (19, 17):
        label = 1
        label_list.append(label)
    if ...
But the problem is that I don't know how many different coordinates are in my coord_list, so I cannot write all of the if statements in my code.
Here's what I think you're after. I convert the list to a set, which eliminates duplicates. Then back to a list, and I sort it. Then I map each element of the original list to its index in that sorted list. There are only 5 unique points here, so the indexes will be from 0 to 4:
coord_list = [(19, 17), (19, 17), (5, 26), (19, 17), (5, 26), (5, 26), (15, 17), (19, 5), (18, 6), (5, 26)]
a = sorted(list(set(coord_list)))
print(a)
b = [a.index(i) for i in coord_list]
print(b)
Output:
[(5, 26), (15, 17), (18, 6), (19, 5), (19, 17)]
[4, 4, 0, 4, 0, 0, 1, 3, 2, 0]
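If the list is large, a dictionary lookup avoids the repeated O(n) a.index() scans. A minimal sketch of that variant (same data, same output):
# Map each unique coordinate to its position in the sorted unique list, then look labels up in O(1).
labels = {coord: idx for idx, coord in enumerate(sorted(set(coord_list)))}
label_list = [labels[coord] for coord in coord_list]
print(label_list)  # [4, 4, 0, 4, 0, 0, 1, 3, 2, 0]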

How do I count list items by every 10th element

I've got a list of tuples, and I need to return a list of the frequency of elements per 10-second interval, depending on a variable, i.e.
data_list =[(0, 84), (1, 84), (2, 84), (3, 84), (4, 84), (5, 84), (6, 84), (7, 84), (8, 84), (9, 84), (10, 84), (11, 84), (12, 84), (13, 84), (14, 84), (15, 84), (16, 84), (17, 84), (18, 84), (19, 84), (20, 84)]
and size = 3
should return
[[0, 10], [1, 10], [2, 1]]
as there are 10 elements in the range 0-9, 10 elements in the range 10-19, and 1 element in the range 20-29 (in each output pair, the first item is the interval and the second is the count)
I was thinking about creating a for loop that creates x many lists depending on the variable size, but I'm not sure that would work at all. Then I tried using a Counter, but I'm not sure how I would group the elements into groups of 10 by index.
Any ideas would be much appreciated.
from collections import Counter
def get_frequency(tuple_list):
    # Count how many first elements fall into each 10-wide interval (0-9 -> 0, 10-19 -> 1, ...)
    counts = Counter(elem[0] // 10 for elem in tuple_list)
    return [[interval, count] for interval, count in sorted(counts.items())]
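Applied to the sample data_list from the question, the function should produce the expected interval counts:
print(get_frequency(data_list))  # [[0, 10], [1, 10], [2, 1]]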

Repartitioning partitioned data

I'm working on a skewed data problem, such that my smallest partitions are below 64MB and my largest partitions can be greater than 1GB. I've been contemplating a strategy to map a few small partitions to the same partition key, thus creating a partition comprised of partitions. This is all in the hope of reducing variance in task size as well as number of files stored on disk.
At one point in my Spark application, I need to operate on the (non-grouped) original partitions and to do so, will need to repartition by the original key. This brings me to my question:
Suppose I have two data sets as seen below. Each row is a tuple of the form (partition_key, (original_key, data)). In data0, you can see that original_key = 0 is on its own node, whereas original_key = 4 and original_key = 5 are together on the node containing partition_key = 3. In data1, things are not as organized.
If data0 is partitioned by partition_key, and then partitioned by original_key, will a shuffle occur? In other words, does it matter during the second partitionBy call that data0 is more organized than data1?
data0 = [
    (0, (0, 'a')),
    (0, (0, 'b')),
    (0, (0, 'c')),
    (1, (1, 'd')),
    (1, (1, 'e')),
    (1, (2, 'f')),
    (1, (2, 'g')),
    (2, (3, 'h')),
    (2, (3, 'i')),
    (2, (3, 'j')),
    (3, (4, 'k')),
    (3, (4, 'l')),
    (3, (5, 'm')),
    (3, (5, 'n')),
    (3, (5, 'o')),
]
data1 = [
    (0, (0, 'a')),
    (1, (0, 'b')),
    (0, (0, 'c')),
    (1, (1, 'd')),
    (2, (1, 'e')),
    (1, (2, 'f')),
    (3, (2, 'g')),
    (2, (3, 'h')),
    (0, (3, 'i')),
    (3, (3, 'j')),
    (3, (4, 'k')),
    (3, (4, 'l')),
    (1, (5, 'm')),
    (2, (5, 'n')),
    (3, (5, 'o')),
]
rdd0 = sc.parallelize(data0, 3).cache()
partitioned0 = rdd0.partitionBy(4)
partitioned0.map(lambda row: (row[1][0], row[1])).partitionBy(6).collect()
rdd1 = sc.parallelize(data1, 3).cache()
partitioned1 = rdd1.partitionBy(4)
partitioned1.map(lambda row: (row[1][0], row[1])).partitionBy(6).collect()
When you call re-partition, a shuffle kicks in.
How much of the data gets shuffled depends on the original RDD.
As a side note: when you do sc.parallelize(data0, 3), the 3 is merely a guideline. If the default number of partitions is <= 3, then your rdd0 will have 3 partitions. If your data0 is on more HDFS blocks, providing the partition number has no effect.
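One way to see why the second partitionBy in the question still shuffles, regardless of how tidy data0 is, is to inspect the RDD's partitioner. A minimal sketch, reusing partitioned0 from the question's code:
# map() does not preserve the partitioner (the keys change), so Spark can no longer
# assume the data is co-located by the new key.
print(partitioned0.partitioner)   # a Partitioner object set by partitionBy(4)
rekeyed0 = partitioned0.map(lambda row: (row[1][0], row[1]))
print(rekeyed0.partitioner)       # None, so the following partitionBy(6) has to shuffle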

Spark RDD: Vary the size of each partition

I need executors to finish processing data at different times.
I think the easiest way is to make the RDD partitions have non-uniform sizes. How can I do this?
Not sure what you are trying to achieve, but you can partition the RDD any way you like using partitionBy, e.g.:
(sc.parallelize(range(10)).zipWithIndex()
   .partitionBy(2, lambda x: 0 if x < 2 else 1)
   .glom().collect())
[[(0, 0), (1, 1)], [(2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9)]]
Note that it works on a (k,v) RDD and the partitioning function takes only k as a param
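Building on that, here is a minimal sketch (assuming a live SparkContext named sc) of deliberately skewing partition sizes, which is what the question is ultimately after:
# Route most keys to partition 0 and the rest to partition 1 to get uneven partitions.
rdd = sc.parallelize(range(100)).map(lambda x: (x, x))    # partitionBy needs a (k, v) RDD
skewed = rdd.partitionBy(2, lambda k: 0 if k < 80 else 1)
print([len(p) for p in skewed.glom().collect()])          # [80, 20]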

Pyspark: repartition vs partitionBy

I'm working through these two concepts right now and would like some clarity. From working through the command line, I've been trying to identify the differences and when a developer would use repartition vs partitionBy.
Here is some sample code:
rdd = sc.parallelize([('a', 1), ('a', 2), ('b', 1), ('b', 3), ('c',1), ('ef',5)])
rdd1 = rdd.repartition(4)
rdd2 = rdd.partitionBy(4)
rdd1.glom().collect()
[[('b', 1), ('ef', 5)], [], [], [('a', 1), ('a', 2), ('b', 3), ('c', 1)]]
rdd2.glom().collect()
[[('a', 1), ('a', 2)], [], [('c', 1)], [('b', 1), ('b', 3), ('ef', 5)]]
I took a look at the implementation of both, and the only difference I've noticed for the most part is that partitionBy can take a partitioning function, or uses portable_hash by default. So in partitionBy, all the same keys should be in the same partition. In repartition, I would expect the values to be distributed more evenly over the partitions, but this isn't the case.
Given this, why would anyone ever use repartition? I suppose the only time I could see it being used is if I'm not working with a PairRDD, or I have large data skew?
Is there something that I'm missing, or could someone shed light from a different angle for me?
repartition() is used for specifying the number of partitions considering the number of cores and the amount of data you have.
partitionBy() is used for making shuffle-based functions such as reduceByKey(), join(), cogroup(), etc. more efficient. It is only beneficial in cases where an RDD is used multiple times, so it is usually followed by persist().
Differences between the two in action:
pairs = sc.parallelize([1, 2, 3, 4, 2, 4, 1, 5, 6, 7, 7, 5, 5, 6, 4]).map(lambda x: (x, x))
pairs.partitionBy(3).glom().collect()
[[(3, 3), (6, 6), (6, 6)],
[(1, 1), (4, 4), (4, 4), (1, 1), (7, 7), (7, 7), (4, 4)],
[(2, 2), (2, 2), (5, 5), (5, 5), (5, 5)]]
pairs.repartition(3).glom().collect()
[[(4, 4), (2, 2), (6, 6), (7, 7), (5, 5), (5, 5)],
[(1, 1), (4, 4), (6, 6), (4, 4)],
[(2, 2), (3, 3), (1, 1), (5, 5), (7, 7)]]
repartition already exists on RDDs, and does not handle partitioning by key (or by any other criterion except Ordering). PairRDDs add the notion of keys and subsequently add another method that allows you to partition by that key.
So yes, if your data is keyed, you should absolutely partition by that key, which in many cases is the point of using a PairRDD in the first place (for joins, reduceByKey, and so on).
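To illustrate the "partitionBy followed by persist()" advice above, a minimal sketch (assuming a live SparkContext named sc) could look like this:
# Pre-partition by key and cache the result; a later reduceByKey with the same
# number of partitions and the default hash should not need another full shuffle.
pairs = sc.parallelize([('a', 1), ('a', 2), ('b', 1), ('b', 3), ('c', 1)])
by_key = pairs.partitionBy(4).persist()
totals = by_key.reduceByKey(lambda x, y: x + y, 4)
print(totals.collect())   # e.g. [('a', 3), ('b', 4), ('c', 1)] (order may vary)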
