zip RDDs constructed from different input files - apache-spark

I have two files in HDFS with the same number of lines. Lines in the two files correspond to each other by line number.
lines1=sc.textFile('1.txt')
lines2=sc.textFile('2.txt')
My question is: how do I correctly zip RDD lines1 with lines2?
zipped=lines1.zip(lines2)
zip requires the RDDs to have the same size and the same partitioning (as I understand it, not only the same partition count but also an equal number of elements in each partition). The first requirement is already satisfied.
How to ensure the second one?
Thanks!
Sergey.

In general, neither condition will be satisfied, and zip is not a good tool for an operation like this. Both the number of partitions and the number of elements per partition depend not only on the number of lines but also on the total data size, the sizes of the individual files, and the configuration.
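You can verify this directly with glom(), which exposes the per-partition layout; even when the total line counts match, the per-partition counts usually differ (just a diagnostic sketch):
print(lines1.glom().map(len).collect())
print(lines2.glom().map(len).collect())
# if these lists differ in any position, zip will fail at runtime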
zip is useful when you combine RDDs which share a common ancestor and are not separated by a shuffle, for example:
parent = sc.parallelize(range(100))
child1 = parent.map(some_func)
child2 = parent.map(other_func)
child1.zip(child2)
To merge RDDs by line you can do something like this:
def index_and_sort(rdd):
    def swap(xy):
        x, y = xy
        return y, x

    return rdd.zipWithIndex().map(swap).sortByKey()
index_and_sort(lines1).join(index_and_sort(lines2)).values()
It should be safe to zip after indexing and sorting:
from pyspark import RDD
RDD.zip(*(index_and_sort(rdd).values() for rdd in [lines1, lines2]))
but why even bother?
Scala equivalent:
import org.apache.spark.rdd.RDD
def indexAndSort(rdd: RDD[String]) = rdd.zipWithIndex.map(_.swap).sortByKey()
indexAndSort(lines1).join(indexAndSort(lines2)).values

Related

Why is UDF not running in parallel on available executors?

I have a tiny Spark DataFrame that essentially pushes a string into a UDF. Because of .repartition(3), whose argument matches the length of targets, I expect the processing inside run_sequential to be applied in parallel across available executors, i.e. on 3 different executors.
The issue is that only 1 executor is used. How can I parallelise this processing to force my PySpark script to assign each element of targets to a different executor?
import pandas as pd
import pyspark.sql.functions as F

def run_training_parallel(config):
    def run_sequential(target):
        # process with target variable
        pass
    return F.udf(run_sequential)

targets = ["target_1", "target_2", "target_3"]
config = {}

pdf = spark.createDataFrame(pd.DataFrame({"targets": targets})).repartition(3)
pdf.withColumn(
    "apply_udf", run_training_parallel(config)("targets")
).collect()
The issue here is that repartitioning a DataFrame does not guarantee that all the created partitions will be of the same size. With such a small number of records there is a pretty high chance that some of them will map into the same partition. Spark is not meant to process such small datasets, and its algorithms are tailored to work efficiently with large amounts of data: if your dataset has 3 million records and you split it into 3 partitions of approximately 1 million records each, a difference of several records per partition will be insignificant in most cases. This is obviously not the case when repartitioning 3 records.
You can use df.rdd.glom().map(len).collect() to examine the size of the partitions before and after repartitioning to see how the distribution changes.
$ pyspark --master "local[3]"
...
>>> pdf = spark.createDataFrame([("target_1",), ("target_2",), ("target_3",)]).toDF("targets")
>>> pdf.rdd.glom().map(len).collect()
[1, 1, 1]
>>> pdf.repartition(3).rdd.glom().map(len).collect()
[0, 2, 1]
As you can see, the resulting partitioning is uneven and the first partition in my case is actually empty. The irony here is that the original dataframe already has the desired property, and it gets destroyed by repartition().
While your particular case is not what Spark typically targets, it is still possible to forcefully distribute three records across three partitions. All you need to do is provide an explicit partition key. RDDs have the zipWithIndex() method that extends each record with its ID. The ID is a perfect partition key since its values start at 0 and increase by 1.
>>> new_df = (pdf
...     .coalesce(1)                  # not part of the solution - see below
...     .rdd                          # convert to RDD
...     .zipWithIndex()               # append ID to each record
...     .map(lambda x: (x[1], x[0]))  # make record ID come first
...     .partitionBy(3)               # repartition
...     .map(lambda x: x[1])          # remove record ID
...     .toDF())                      # turn back into a dataframe
>>> new_df.rdd.glom().map(len).collect()
[1, 1, 1]
In the above code, coalesce(1) is added only to demonstrate that the final partitioning is not influenced by the fact that pdf initially has one record in each partition.
A DataFrame-only solution is to first coalesce pdf to a single partition and then use repartition(3). With no partitioning column(s) provided, DataFrame.repartition() uses the round-robin partitioner and hence the desired partitioning will be achieved. You cannot simply do pdf.coalesce(1).repartition(3) since Catalyst (the Spark query optimisation engine) optimises out the coalesce operation, so a partitioning-dependent operation must be inserted in between. Adding a column containing F.monotonically_increasing_id() is a good candidate for such an operation.
>>> new_df = (pdf
...     .coalesce(1)
...     .withColumn("id", F.monotonically_increasing_id())
...     .repartition(3))
>>> new_df.rdd.glom().map(len).collect()
[1, 1, 1]
Note that, unlike in the RDD-based solution, coalesce(1) is required as part of the solution.

A quick way to get the mean of each position in a large RDD

I have a large RDD (more than 1,000,000 lines), where each line is a tuple of four elements (A, B, C, D). A head scan of the RDD looks like this:
[(492,3440,4215,794),
(6507,6163,2196,1332),
(7561,124,8558,3975),
(423,1190,2619,9823)]
Now I want to find the mean of each position in this RDD. For example, for the data above I need an output list with the values:
(492+6507+7561+423)/4
(3440+6163+124+1190)/4
(4215+2196+8558+2619)/4
(794+1332+3975+9823)/4
which is:
[(3745.75,2729.25,4397.0,3981.0)]
Since the RDD is very large, it is not convenient to calculate the sum of each position and then divide by the length of the RDD. Is there any quick way to get the output? Thank you very much.
I don't think there is anything faster than calculating the mean (or sum) for each column.
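For reference, a minimal RDD-only sketch of that idea (assuming, as above, that the elements are numeric 4-tuples): one pass for the column sums, one for the count.
sums = rdd.reduce(lambda a, b: tuple(x + y for x, y in zip(a, b)))
n = rdd.count()
means = tuple(s / n for s in sums)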
If you are using the DataFrame API you can simply aggregate multiple columns:
import os

from pyspark.sql import functions as f
from pyspark.sql import SparkSession

# start local spark session
spark = SparkSession.builder.getOrCreate()

def localpath(path):
    return 'file://' + os.path.join(os.path.abspath(os.path.curdir), path)

# load as rdd; textFile yields strings, so parse each line
# (this assumes lines look like "(492,3440,4215,794)")
rdd = spark.sparkContext.textFile(localpath('myPosts/'))
parsed = rdd.map(lambda line: [int(v) for v in line.strip('()').split(',')])

# create data frame from rdd; alias keeps the column names as dict keys
df = spark.createDataFrame(parsed)
means_df = df.agg(*[f.avg(c).alias(c) for c in df.columns])
means_dict = means_df.first().asDict()
print(means_dict)
Note that the dictionary keys will be the default Spark column names ('_1', '_2', ...). If you want more meaningful column names, you can pass them as the schema argument to the createDataFrame command, as below.
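For example (the names A, B, C, D here are just placeholders for this dataset):
df = spark.createDataFrame(parsed, ['A', 'B', 'C', 'D'])
print(df.agg(*[f.avg(c).alias(c) for c in df.columns]).first().asDict())
# e.g. {'A': 3745.75, 'B': 2729.25, 'C': 4397.0, 'D': 3981.0} for the four sample rows above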

Avoid repartition costs when filtering and then coalescing

I am implementing a range query on an RDD of (x,y) points in pyspark. I partitioned the xy space into a 16*16 grid (256 cells) and assigned each point in my RDD to one of these cells.
The gridMappedRDD is a PairRDD: (cell_id, Point object)
I partitioned this RDD to 256 partitions, using:
gridMappedRDD.partitionBy(256)
The range query is a rectangular box. I have a method for my Grid object which can return the list of cell ids which overlap with the query range. So, I used this as a filter to prune the unrelated cells:
filteredRDD = gridMappedRDD.filter(lambda x: x[0] in candidateCells)
But the problem is that when running the query and then collecting the results, all 256 partitions are evaluated; a task is created for each partition.
To avoid this problem, I tried coalescing the filteredRDD to the length of the candidateCells list, hoping this would solve the problem.
filteredRDD.coalesce(len(candidateCells))
In fact the resulting RDD has len(candidateCells) partitions, but they are not the same as the partitions of gridMappedRDD.
As stated in the coalesce documentation, the shuffle parameter is False and no shuffle should be performed among partitions, but I can see (with the help of glom()) that this is not the case.
For example after a coalesce(4) with candidateCells=[62, 63, 78, 79] the partitions are like this:
[[(62, P), (62, P) .... , (63, P)],
[(78, P), (78, P) .... , (79, P)],
[], []
]
Actually, by coalescing, I get a shuffle read equal to the size of my whole dataset for every task, which takes significant time. What I need is an RDD with only the partitions related to the cells in candidateCells, without any shuffles.
So, my question is: is it possible to filter only some partitions, without reshuffling? For the above example, my filteredRDD would have 4 partitions holding exactly the same data as the original RDD's partitions 62, 63, 78 and 79. Doing so, the query could be directed to the affected partitions only.
You made a few incorrect assumptions here:
The shuffle is not related to coalesce (nor is coalesce useful here). It is caused by partitionBy. Partitioning by definition requires a shuffle.
Partitioning cannot be used to optimize filter. Spark knows nothing about the function you use (it is a black box).
Partitioning doesn't uniquely map keys to partitions. Multiple keys can be placed in the same partition (see How does HashPartitioner work? and the sketch below).
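To see the last point concretely, here is a minimal sketch using portable_hash, the default partition function applied by partitionBy in PySpark (the partition count of 8 is chosen just for illustration):
from pyspark.rdd import portable_hash

num_partitions = 8
for key in [62, 63, 78, 79]:
    print(key, portable_hash(key) % num_partitions)
# 62 -> 6, 63 -> 7, 78 -> 6, 79 -> 7: keys 62 and 78 share a partition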
What can you do:
If the resulting subset is small, partition the RDD and apply lookup for each key:
from itertools import chain

partitionedRDD = gridMappedRDD.partitionBy(256)

result = chain.from_iterable(
    ((c, x) for x in partitionedRDD.lookup(c))
    for c in candidateCells
)
If the data is large, you can try to skip scanning partitions (the number of tasks won't change, but some tasks can be short-circuited):
candidatePartitions = set(
    # map each cell id to its partition index (hash modulo partition count)
    partitionedRDD.partitioner.partitionFunc(c) % partitionedRDD.getNumPartitions()
    for c in candidateCells
)

partitionedRDD.mapPartitionsWithIndex(
    lambda i, xs: (x for x in xs if x[0] in candidateCells)
    if i in candidatePartitions else []
)
These two methods make sense only if you perform multiple "lookups". If it is a one-off operation, it is better to perform a linear filter (sketched after this list):
It is cheaper than a shuffle and repartitioning.
If the initial data is uniformly distributed, downstream processing will be able to better utilize available resources.
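For completeness, the one-off linear filter is simply the original filter applied without partitionBy:
filteredRDD = gridMappedRDD.filter(lambda kv: kv[0] in candidateCells)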

Avoiding a shuffle in Spark by pre-partitioning files (PySpark)

I have a dataset dataset which is partitioned on values 00-99 and want to create an RDD first_rdd to read in the data.
I then want to count how many times the word "foo" occurs in the second element of each partition and store the records of each partition in a list. My output would be final_rdd where each record is of the form (partition_key, (count, record_list)).
def to_list(a):
    return [a]

def append(a, b):
    a.append(b)
    return a

def extend(a, b):
    a.extend(b)
    return a
first_rdd = sqlContext.sql("select * from dataset").rdd
kv_rdd = first_rdd.map(lambda x: (x[4], x)) # x[4] is the partition value
# Group each partition to (partition_key, [list_of_records])
grouped_rdd = kv_rdd.combineByKey(to_list, append, extend)
def count_foo(x):
    count = 0
    for record in x:
        if record[1] == "foo":
            count = count + 1
    return (count, x)

final_rdd = grouped_rdd.mapValues(count_foo)
print("Counted 'foo' for %s partitions" % (final_rdd.count))
Since each partition of the dataset is totally independent from one another computationally, Spark shouldn't need to shuffle, yet when I look at the SparkUI, I notice that the combineByKey is resulting in a very large shuffle.
I have the correct number of initial partitions, and have also tried reading from the partitioned data in HDFS. Each way I try it, I still get a shuffle. What am I doing wrong?
I've solved my problem by using the mapPartitions function and passing it my own reduce function, so that it "reduces" locally on each node and never performs a shuffle.
In the scenario where data are isolated between partitions, it works perfectly. When the same key exists in more than one partition, a shuffle would be necessary, but this case needs to be detected and handled separately.
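The answer doesn't show the code; a minimal sketch of that approach, assuming as in the question that x[4] holds the partition key and x[1] holds the word, might look like this:
def process_partition(records):
    # build (partition_key, (count, record_list)) locally within each partition;
    # mapPartitions never triggers a shuffle
    groups = {}
    for r in records:
        entry = groups.setdefault(r[4], [0, []])  # [foo_count, record_list]
        entry[1].append(r)
        if r[1] == "foo":
            entry[0] += 1
    return ((key, (count, recs)) for key, (count, recs) in groups.items())

final_rdd = first_rdd.mapPartitions(process_partition)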

Is there a method in Spark to obtain an RDD which is a random subset, of a given exact size, of another RDD?

I know RDDs have the 'sample' method, which returns a new RDD containing a given fraction of the original RDD, randomly selected. However, as each element is selected randomly, the size of the new RDD is not deterministic.
There's also the 'takeSample' method, which returns a given number of elements from an RDD. However, this returns a list rather than a new RDD.
Is there a method that returns an RDD with a specified exact number of randomly selected elements? Of course one can use takeSample and create a new RDD from that, but this means sending a lot of data back and forth between driver and executors.
It will be expensive, but you can sort by a random number:
import os
import binascii
import random

rdd = spark.sparkContext.range(100)

def with_rand(xs):
    # seed each partition's generator independently
    random_ = random.Random(int(binascii.hexlify(os.urandom(4)), 16))
    for x in xs:
        yield random_.random(), x

rdd_sorted = rdd.mapPartitions(with_rand).sortByKey()
Then remove the random number, add an index, and filter:
n = 42
result = rdd_sorted.values().zipWithIndex().filter(lambda x: x[1] < n).keys()
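A quick sanity check (the source RDD above has 100 elements and n = 42):
print(result.count())  # exactly 42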
