the usage of aggregate(0, lambda,lambda) in pyspark - apache-spark

There is a pyspark code segment
seqOp = (lambda x,y: x+y)
sum_temp = df.rdd.map(lambda x: len(x.timestamp)).aggregate(0, seqOp, seqOp)
The output of sum_temp is a numerical value. But I am not clear how does the aggregate(0, seqOp, seqOp) work. It seems to me that normally, the aggregate just use a single function form like "avg"
Moreover, df.rdd.map(lambda x: len(x.timestamp)) is of type pyspark.rdd.PipelinedRDD. How can we get its contents?

According to the docs, the aggregation process:
Starts from the first argument as the zero-value (0),
Then each partition of the RDD is aggregated using the second argument, and
Finally the aggregated partitions are combined into the final result using the third argument. Here, you sum up each partition, and then you sum up the sums from each partition into the final result.
You might have confused this aggregate with the aggregate method of dataframes. RDDs are lower-level objects and you cannot use dataframe aggregation methods here, such as avg/mean/etc.
To get the contents of the RDD, you can do rdd.take(1) to check a random element, or use rdd.collect() to check the whole RDD (mind that this will collect all data onto the driver and could cause memory errors if the RDD is huge).

Related

Why is UDF not running in parallel on available executors?

I have a tiny spark Dataframe that essentially pushes a string into a UDF. I'm expecting, because of .repartition(3), which is the same length as targets, for the processing inside run_sequential to be applied on available executors - i.e. applied to 3 different executors.
The issue is that only 1 executor is used. How can I parallelise this processing to force my pyspark script to assign each element of target to a different executor?
import pandas as pd
import pyspark.sql.functions as F
def run_parallel(config):
def run_sequential(target):
#process with target variable
pass
return F.udf(run_sequential)
targets = ["target_1", "target_2", "target_3"]
config = {}
pdf = spark.createDataFrame(pd.DataFrame({"targets": targets})).repartition(3)
pdf.withColumn(
"apply_udf", run_training_parallel(config)("targets")
).collect()
The issue here is that repartitioning a DataFrame does not guarantee that all the created partitions will be of the same size. With such a small number of records there is a pretty high chance that some of them will map into the same partition. Spark is not meant to process such small datasets and its algorithms are tailored to work efficiently with large amounts of data - if your dataset has 3 million records and you split it in 3 partitions of approximately 1 million records each, a difference of several records per partition will be insignificant in most cases. This is obviously not the case when repartitioning 3 records.
You can use df.rdd.glom().map(len).collect() to examine the size of the partitions before and after repartitioning to see how the distribution changes.
$ pyspark --master "local[3]"
...
>>> pdf = spark.createDataFrame([("target_1",), ("target_2",), ("target_3",)]).toDF("targets")
>>> pdf.rdd.glom().map(len).collect()
[1, 1, 1]
>>> pdf.repartition(3).rdd.glom().map(len).collect()
[0, 2, 1]
As you can see, the resulting partitioning is uneven and the first partition in my case is actually empty. The irony here is that the original dataframe has the desired property and that one is getting destroyed by repartition().
While your particular case is not what Spark typically targets, it is still possible to forcefully distribute three records in three partitions. All you need to do is to provide an explicit partition key. RDDs have the zipWithIndex() method that extends each record with its ID. The ID is the perfect partition key since its value starts with 0 and increases by 1.
>>> new_df = (pdf
.coalesce(1) # not part of the solution - see below
.rdd # Convert to RDD
.zipWithIndex() # Append ID to each record
.map(lambda x: (x[1], x[0])) # Make record ID come first
.partitionBy(3) # Repartition
.map(lambda x: x[1]) # Remove record ID
.toDF()) # Turn back into a dataframe
>>> new_df.rdd.glom().map(len).collect()
[1, 1, 1]
In the above code, coalesce(1) is added only to demonstrate that the final partitioning is not influenced by the fact that pdf initially has one record in each partition.
A DataFrame-only solution is to first coalesce pdf to a single partition and then use repartition(3). With no partitioning column(s) provided, DataFrame.repartition() uses the round-robin partitioner and hence the desired partitioning will be achieved. You cannot simply do pdf.coalesce(1).repartition(3) since Catalyst (the Spark query optimisation engine) optimises out the coalesce operation, so a partitioning-dependent operation must be inserted in between. Adding a column containing F.monotonically_increasing_id() is a good candidate for such an operation.
>>> new_df = (pdf
.coalesce(1)
.withColumn("id", F.monotonically_increasing_id())
.repartition(3))
>>> new_df.rdd.glom().map(len).collect()
[1, 1, 1]
Note that, unlike in the RDD-based solution, coalesce(1) is required as part of the solution.

Use Spark groupByKey to dedup RDD which causes a lot of shuffle overhead

I have a key-value pair RDD. The RDD contains some elements with duplicate keys, and I want to split original RDD into two RDDs: One stores elements with unique keys, and another stores the rest elements. For example,
Input RDD (6 elements in total):
<k1,v1>, <k1,v2>, <k1,v3>, <k2,v4>, <k2,v5>, <k3,v6>
Result:
Unique keys RDD (store elements with unique keys; For the multiple elements with the same key, any element is accepted):
<k1,v1>, <k2, v4>, <k3,v6>
Duplicated keys RDD (store the rest elements with duplicated keys):
<k1,v2>, <k1,v3>, <k2,v5>
In the above example, unique RDD has 3 elements, and the duplicated RDD has 3 elements too.
I tried groupByKey() to group elements with the same key together. For each key, there is a sequence of elements. However, the performance of groupByKey() is not good because the data size of element value is very big which causes very large data size of shuffle write.
So I was wondering if there is any better solution. Or is there a way to reduce the amount of data being shuffled when using groupByKey()?
EDIT: given the new information in the edit, I would first create the unique rdd, and than the the duplicate rdd using the unique and the original one:
val inputRdd: RDD[(K,V)] = ...
val uniqueRdd: RDD[(K,V)] = inputRdd.reduceByKey((x,y) => x) //keep just a single value for each key
val duplicateRdd = inputRdd
.join(uniqueRdd)
.filter {case(k, (v1,v2)) => v1 != v2}
.map {case(k,(v1,v2)) => (k, v1)} //v2 came from unique rdd
there is some room for optimization also.
In the solution above there will be 2 shuffles (reduceByKey and join).
If we repartition the inputRdd by the key from the start, we won't need any additional shuffles
using this code should produce much better performance:
val inputRdd2 = inputRdd.partitionBy(new HashPartitioner(partitions=200) )
Original Solution:
you can try the following approach:
first count the number of occurrences of each pair, and then split into the 2 rdds
val inputRdd: RDD[(K,V)] = ...
val countRdd: RDD[((K,V), Int)] = inputRDD
.map((_, 1))
.reduceByKey(_ + _)
.cache
val uniqueRdd = countRdd.map(_._1)
val duplicateRdd = countRdd
.filter(_._2>1)
.flatMap { case(kv, count) =>
(1 to count-1).map(_ => kv)
}
Please use combineByKey resulting in use of combiner on the Map Task and hence reduce shuffling data.
The combiner logic depends on your business logic.
http://bytepadding.com/big-data/spark/groupby-vs-reducebykey/
There are multiple ways to reduce shuffle data.
1. Write less from Map task by use of combiner.
2. Send Aggregated serialized objects from Map to reduce.
3. Use combineInputFormts to enhance efficiency of combiners.

First element of each dataframe partition Spark 2.0

I need to retrieve the first element of each dataframe partition.
I know that I need to use mapPartitions but it is not clear for me how to use it.
Note: I am using Spark2.0, the dataframe is sorted.
I believe it should look something like following:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
...
implicit val encoder = RowEncoder(df.schema)
val newDf = df.mapPartitions(iterator => iterator.take(1))
This will take 1 element from each partition in DataFrame. Then you can collect all the data to your driver i.e.:
nedDf.collect()
This will return you an array with a number of elements equal to number of your partitions.
UPD updated in order to support Spark 2.0

How does Spark keep track of the splits in randomSplit?

This question explains how Spark's random split works, How does Sparks RDD.randomSplit actually split the RDD, but I don't understand how spark keeps track of what values went to one split so that those same values don't go to the second split.
If we look at the implementation of randomSplit:
def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] = {
// It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its
// constituent partitions each time a split is materialized which could result in
// overlapping splits. To prevent this, we explicitly sort each input partition to make the
// ordering deterministic.
val sorted = Sort(logicalPlan.output.map(SortOrder(_, Ascending)), global = false, logicalPlan)
val sum = weights.sum
val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
normalizedCumWeights.sliding(2).map { x =>
new DataFrame(sqlContext, Sample(x(0), x(1), withReplacement = false, seed, sorted))
}.toArray
}
we can see that it creates two DataFrames that share the same sqlContext and with two different Sample(rs).
How are these two DataFrame(s) communicating with each other so that a value that fell in the first one is not included in the second one?
And is the data being fetched twice? (Assume the sqlContext is selecting from a DB, is the select being executed twice?).
It's exactly the same as sampling an RDD.
Assuming you have the weight array (0.6, 0.2, 0.2), Spark will generate one DataFrame for each range (0.0, 0.6), (0.6, 0.8), (0.8, 1.0).
When it's time to read the result DataFrame, Spark will just go over the parent DataFrame. For each item, generate a random number, if that number fall in the the specified range, then emit the item. All child DataFrame share the same random number generator (technically, different generators with the same seed), so the sequence of random number is deterministic.
For your last question, if you did not cache the parent DataFrame, then the data for the input DataFrame will be re-fetch each time an output DataFrame is computed.

How Can I Obtain an Element Position in Spark's RDD?

I am new to Apache Spark, and I know that the core data structure is RDD. Now I am writing some apps which require element positional information. For example, after converting an ArrayList into a (Java)RDD, for each integer in RDD, I need to know its (global) array subscript. Is it possible to do it?
As I know, there is a take(int) function for RDD, so I believe the positional information is still maintained in RDD.
I believe in most cases, zipWithIndex() will do the trick, and it will preserve the order. Read the comments again. My understanding is that it exactly means keep the order in the RDD.
scala> val r1 = sc.parallelize(List("a", "b", "c", "d", "e", "f", "g"), 3)
scala> val r2 = r1.zipWithIndex
scala> r2.foreach(println)
(c,2)
(d,3)
(e,4)
(f,5)
(g,6)
(a,0)
(b,1)
Above example confirm it. The red has 3 partitions, and a with index 0, b with index 1, etc.
Essentially, RDD's zipWithIndex() method seems to do this, but it won't preserve the original ordering of the data the RDD was created from. At least you'll get a stable ordering.
val orig: RDD[String] = ...
val indexed: RDD[(String, Long)] = orig.zipWithIndex()
The reason you're unlikely to find something that preserves the order in the original data is buried in the API doc for zipWithIndex():
"Zips this RDD with its element indices. The ordering is first based
on the partition index and then the ordering of items within each
partition. So the first item in the first partition gets index 0, and
the last item in the last partition receives the largest index. This
is similar to Scala's zipWithIndex but it uses Long instead of Int as
the index type. This method needs to trigger a spark job when this RDD
contains more than one partitions."
So it looks like the original order is discarded. If preserving the original order is important to you, it looks like you need to add the index before you create the RDD.

Resources