How to effectively cache/persist Spark RDD join result - apache-spark

What am I trying to do:
read a large terabyte-size RDD
filter it using a broadcast variable, which reduces it to a few gigabytes
join the filtered RDD with another RDD which is a few gigabytes too
persist the join result and reuse it multiple times
Expectation:
join executed once
join result persisted
join result reused several times w/o recomputation
IRL:
join recomputed several times.
half of the entire job runtime is spent re-computing the same thing several times.
My pseudo-code:
val nonPartitioned = sparkContext.readData("path")
val terabyteSizeRDD = nonPartitioned
  .keyBy(_.joinKey)
  .partitionBy(new HashPartitioner(nonPartitioned.getNumPartitions))

// filters down to a few gigabytes
val filteredTerabyteSizeRDD = terabyteSizeRDD.mapPartitions(filterAndMapPartitionFunc, preservesPartitioning = true)

val (joined, count) = {
  val result = filteredTerabyteSizeRDD
    .leftOuterJoin(anotherFewGbRDD, filteredTerabyteSizeRDD.partitioner.get)
    .map(mapJoinRecordFunc)
  result.persist()
  result -> result.count()
}
DAG says that join is executed several times
first time
another time for .count(); I don't know another way to trigger the persist
three more times, since the code uses joined three times to create other RDDs.
How can I align expectation and reality?

You can cache or persist data in Spark with df.cache() or df.persist(). If you use persist you have more options than with cache; if you use persist without an argument it is just like a simple cache(), see here. Why don't you cache your filteredTerabyteSizeRDD? It should fit in memory if it's just a few GB. If it doesn't fit in memory, you could try filteredTerabyteSizeRDD.persist(StorageLevel.MEMORY_AND_DISK).
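For illustration, here is a minimal Scala sketch of that idea, reusing the placeholder names from the question (filterAndMapPartitionFunc, anotherFewGbRDD and mapJoinRecordFunc are the question's own placeholders, not real APIs):

import org.apache.spark.storage.StorageLevel

// Persist the filtered input so the expensive read + filter happens only once.
val filteredTerabyteSizeRDD = terabyteSizeRDD
  .mapPartitions(filterAndMapPartitionFunc, preservesPartitioning = true)
  .persist(StorageLevel.MEMORY_AND_DISK)   // spills to disk if it doesn't fit in memory

// Persist the join result as well, so downstream RDDs built from `joined`
// reuse the cached partitions instead of recomputing the join.
val joined = filteredTerabyteSizeRDD
  .leftOuterJoin(anotherFewGbRDD, filteredTerabyteSizeRDD.partitioner.get)
  .map(mapJoinRecordFunc)
  .persist(StorageLevel.MEMORY_AND_DISK)

val count = joined.count()   // materializes the cache; later actions reuse it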
Hope I could answer your question.

Related

Kafka streaming behaviour for more than one partition

I am consuming from a Kafka topic. This topic has 3 partitions.
I am using foreachRDD to process each batch RDD (using a processData method to process each RDD, and ultimately creating a Dataset from that).
Now, you can see that I have a count variable, and I am incrementing this count variable in the "processData" method to check how many actual records I have processed. (I understand each RDD is a collection of Kafka topic records, and the number depends on the batch interval size.)
Now, the output is something like this:
1 1 1 2 3 2 4 3 5 ....
This makes me think that it's because I might have 3 consumers (as I have 3 partitions), and each of these will call the "foreachRDD" method separately, so the same count is being printed more than once, as each consumer might have cached its copy of count.
But the final output Dataset that I get has all the records.
So, does Spark internally union all the data? How does it make out what to union?
I am trying to understand the behaviour, so that I can form my logic.
int count = 0;
messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<K, V>>>() {
    public void call(JavaRDD<ConsumerRecord<K, V>> rdd) {
        System.out.println("NUmber of elements in RDD : " + rdd.count());
        List<Row> rows = rdd.map(record -> processData(record))
                            .reduce((rows1, rows2) -> {
                                rows1.addAll(rows2);
                                return rows1;
                            });
        StructType schema = DataTypes.createStructType(fields);
        Dataset<Row> ds = ss.createDataFrame(rows, schema);
        ds.createOrReplaceTempView("trades");
        ds.show();
    }
});
The assumptions are not completely accurate.
foreachRDD is one of the so-called output operations in Spark Streaming. The function of output operations is to schedule the provided closure at the interval dictated by the batch interval. The code in that closure executes once every batch interval on the Spark driver; it is not distributed across the cluster.
In particular, foreachRDD is a general purpose output operation that provides access to the underlying RDD within the DStream. Operations applied on that RDD will execute on the Spark cluster.
So, coming back to the code of the original question, code in the foreachRDD closure such as System.out.println("NUmber of elements in RDD : "+ rdd.count()); executes on the driver. That's also the reason why we can see the output in the console. Note that the rdd.count() in this print will trigger a count of the RDD on the cluster, so count is a distributed operation that returns a value to the driver; then, on the driver, the print operation takes place.
Now comes a transformation of the RDD:
rdd.map(record -> processData(record))
As we mentioned, operations applied to the RDD will execute on the cluster. And that execution will take place following the Spark execution model; that is, transformations are assembled into stages and applied to each partition of the underlying dataset. Given that we are dealing with 3 Kafka partitions, we will have 3 corresponding partitions in Spark. Hence, processData will be applied once to each partition.
So, does Spark internally union all the data? How does it make out what to union?
The same way we have output operations for Spark Streaming, we have actions for Spark. Actions will potentially apply an operation to the data and bring the results to the driver. The simplest one is collect, which brings the complete dataset to the driver, with the risk that it might not fit in memory. Another common action, count, summarizes the number of records in the dataset and returns a single number to the driver.
In the code above, we are using reduce, which is also an action that applies the provided function and brings the resulting data to the driver. It's the use of that action that "internally unions all the data", as expressed in the question. In the reduce expression, we are actually collecting all the data that was distributed into a single local collection. It would be equivalent to doing: rdd.map(record -> processData(record)).collect()
If the intention is to create a Dataset, we should avoid "moving" all the data to the driver first.
A better approach would be:
// processData returns a List<Row> per record in the question, so flatMap it
// into a JavaRDD<Row> and build the DataFrame from the distributed RDD directly.
JavaRDD<Row> rows = rdd.flatMap(record -> processData(record).iterator());
Dataset<Row> df = ss.createDataFrame(rows, schema);
...
In this case, the data of each partition will remain local to the executor where it is located.
Note that moving data to the driver should be avoided. It is slow and in cases of large datasets will probably crash the job as the driver cannot typically hold all data available in a cluster.
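For completeness, here is a hedged Scala sketch of the same pattern inside foreachRDD (messages, ss, processData and fields mirror the question's names; processData is assumed here to return a Seq[Row] per record):

import org.apache.spark.sql.types.StructType

messages.foreachRDD { rdd =>
  // This closure runs on the driver once per batch; rdd.count() itself is distributed.
  println(s"Number of elements in RDD: ${rdd.count()}")

  // Transformation: executed on the executors, one task per partition.
  val rows = rdd.flatMap(record => processData(record))

  // Build the DataFrame from the distributed RDD; nothing is collected or reduced to the driver.
  val schema = StructType(fields)
  val ds = ss.createDataFrame(rows, schema)
  ds.createOrReplaceTempView("trades")
  ds.show()
}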

Spark Streaming appends to S3 as Parquet format, too many small partitions

I am building an app that uses Spark Streaming to receive data from Kinesis streams on AWS EMR. One of the goals is to persist the data into S3 (EMRFS), and for this I am using a 2-minute non-overlapping window.
My approach:
Kinesis Stream -> Spark Streaming with a batch duration of about 60 seconds, using a non-overlapping window of 120s, saving the streamed data into S3 as:
val rdd1 = kinesisStream.map( rdd => /* decode the data */)
rdd1.window(Seconds(120), Seconds(120)).foreachRDD { rdd =>
  val spark = SparkSession...
  import spark.implicits._
  // convert rdd to df
  val df = rdd.toDF(columnNames: _*)
  df.write.parquet("s3://bucket/20161211.parquet")
}
After a while, s3://bucket/20161211.parquet ends up containing lots of fragmented small partition files (which is horrendous for read performance). The question is: is there any way to control the number of small partitions as I stream data into this S3 parquet file?
Thanks
What I am thinking of doing is, each day, something like this:
val df = spark.read.parquet("s3://bucket/20161211.parquet")
df.coalesce(4).write.parquet("s3://bucket/20161211_4parition.parquet")
where I kind of repartition the dataframe to 4 partitions and save them back....
It works, but I feel that doing this every day is not an elegant solution...
That's actually pretty close to what you want to do; each partition will get written out as an individual file in Spark. However, coalesce is a bit confusing since it can (effectively) apply upstream of where the coalesce is called. The warning from the Scaladoc is:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
this may result in your computation taking place on fewer nodes than
you like (e.g. one node in the case of numPartitions = 1). To avoid this,
you can pass shuffle = true. This will add a shuffle step, but means the
current upstream partitions will be executed in parallel (per whatever
the current partitioning is).
With Datasets it's a bit easier to persist and count to force wide evaluation, since the default coalesce function doesn't take repartition as a flag for input (although you could construct an instance of Repartition manually).
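As an illustration only (a sketch building on the question's code, not the answer's own snippet), one way to bound the number of files per micro-batch is to repartition the DataFrame before writing; repartition always shuffles, so the upstream work keeps its parallelism. The target of 4 partitions and the append mode are assumptions:

rdd1.window(Seconds(120), Seconds(120)).foreachRDD { rdd =>
  val spark = SparkSession.builder.getOrCreate()
  import spark.implicits._
  val df = rdd.toDF(columnNames: _*)
  df.repartition(4)        // shuffle into a fixed number of partitions,
    .write                 // so each micro-batch writes only 4 files
    .mode("append")
    .parquet("s3://bucket/20161211.parquet")
}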
Another option is to have a second periodic batch job (or even a second streaming job) that cleans up/merges the results, but this can be a bit complicated as it introduces a second moving part to keep track of.

Spark RDD - avoiding shuffle - Does partitioning help to process huge files?

I have an application with around 10 flat files, each with 200MM+ records in them. The business logic involves joining all of them sequentially.
My environment:
1 master, 3 slaves (for testing I have assigned 1 GB of memory to each node)
Most of the code just does the below for each join
RDD1 = sc.textFile(file1).mapToPair(..)
RDD2 = sc.textFile(file2).mapToPair(..)
join = RDD1.join(RDD2).map(peopleObject)
Any suggestions for tuning, like repartitioning or parallelism? If so, any best practices for coming up with a good number of partitions?
With the current config the job takes more than an hour, and I see that the shuffle write for almost every file is > 3 GB.
In practice, with large datasets (5, 100G+ each), I have seen that the join works best when you co-partition all the RDDs involved in a series of joins before you start joining them.
RDD1 = sc.textFile(file1).mapToPair(..).partitionBy(new HashPartitioner(2048))
RDD2 = sc.textFile(file2).mapToPair(..).partitionBy(new HashPartitioner(2048))
.
.
.
RDDN = sc.textFile(fileN).mapToPair(..).partitionBy(new HashPartitioner(2048))
//start joins
RDD1.join(RDD2)...join(RDDN)
Side note:
I refer to this kind of join as a tree join (each RDD used once). The rationale was illustrated by a DAG screenshot taken from the Spark UI (not reproduced here).
If we are always joining one RDD (say rdd1) with all the others, the idea is to partition that RDD and then persist it.
Here is a pseudo-Scala implementation (it can easily be converted to Python or Java):
val rdd1 = sc.textFile(file1).mapToPair(..).partitionBy(new HashPartitioner(200)).cache()
Up to here, rdd1 is hash-partitioned into 200 partitions. The first time it gets evaluated, it will be persisted (cached).
Now let's read two more RDDs and join each of them with rdd1.
val rdd2 = sc.textFile(file2).mapToPair(..)
val join1 = rdd1.join(rdd2).map(peopleObject)
val rdd3 = sc.textFile(file3).mapToPair(..)
val join2 = rdd1.join(rdd3).map(peopleObject)
Note that we neither partition nor cache the remaining RDDs.
Spark will see that rdd1 is already hash-partitioned and will use the same partitioning for all remaining joins. So rdd2 and rdd3 will shuffle their keys to the locations where the keys of rdd1 are located.
To make it clearer, let's assume that we don't partition and instead use the same code shown in the question; each time we do a join, both RDDs will be shuffled. This means that if we have N joins with rdd1, the non-partitioned version will shuffle rdd1 N times, while the partitioned approach shuffles rdd1 just once.
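A sketch of how the same pattern might be generalized to all ~10 input files from the question; parse, peopleObject and the partition count of 200 are placeholders/assumptions rather than prescriptions:

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(200)

// Partition and cache the RDD that participates in every join, exactly once.
val rdd1 = sc.textFile(file1)
  .map(parse)               // placeholder parser producing (key, value) pairs
  .partitionBy(partitioner)
  .cache()

// Join every other file against the cached, pre-partitioned rdd1.
// Only the right-hand side of each join is shuffled; rdd1 stays in place.
val joined = Seq(file2, file3 /* , ... */).map { path =>
  val other = sc.textFile(path).map(parse)
  rdd1.join(other).map(peopleObject)
}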

Why does a Spark DataFrame run out of memory when the same process on an RDD completes fine?

I'm working with a fairly large amount of data (a few TBs). When I use a subset of the data, I find that Spark DataFrames are great to work with. However, when I try calculations on my full dataset the same code gives me the dreaded "java.lang.OutOfMemoryError: GC overhead limit exceeded". What surprised me is that the process completes fine doing the same thing with an RDD. I thought DataFrames were supposed to have better optimization. Is this a mistake in my approach or a limitation of DataFrames?
For example, here is a simple task using dataframes that completes fine for a subset of my data and chokes on the full sample:
val records = sqlContext.read.avro(datafile)
val uniqueIDs = records.select("device_id").dropDuplicates(Array("device_id"))
val uniqueIDsCount = uniqueIDs.count().toDouble
val sampleIDs = uniqueIDs.sample(withReplacement = false, 100000/uniqueIDsCount)
sampleIDs.write.format("com.databricks.spark.csv").option("delimiter", "|").save(outputfile)
In this case it even chokes on the count.
However, when I try the same thing using RDDs in the following way it calculates fine (and pretty quickly at that).
val rawinput = sc.hadoopFile[AvroWrapper[Observation], NullWritable,
  AvroInputFormat[Observation]](rawinputfile).map(x => x._1.datum)
val tfdistinct = rawinput.map(x => x.getDeviceId).distinct
val distinctCount = tfdistinct.count().toDouble
tfdistinct.sample(false, 100000/distinctCount.toDouble).saveAsTextFile(outputfile)
I'd love to keep using dataframes in the future, am I approaching this wrong?

Does updateStateByKey in Spark shuffle the data across the cluster

I'm a newbie in Spark and I would like to understand whether I need to aggregate the DStream data by key before calling updateStateByKey.
My application basically counts the number of words every second using Spark Streaming, where I perform a couple of map operations before doing a stateful update, as follows:
val words = inputDstream.flatMap(x => x.split(" "))
val wordDstream = words.map(x => (x, 1))
val stateDstream = wordDstream.updateStateByKey(UpdateFunc _)
stateDstream.print()
Say, after the second map operation, the same keys (words) might be present across worker nodes due to the various partitions. So I assume that the updateStateByKey method internally shuffles and aggregates the key values as Seq[Int] and calls the updateFunc. Is my assumption correct?
Correct: as you can see in the method signature, it takes an optional numPartitions/Partitioner argument, which denotes the number of reducers, i.e. state updaters. This leads to a shuffle.
Also, I suggest explicitly putting a number there; otherwise Spark may significantly decrease your job's parallelism while trying to run tasks locally with respect to the location of the blocks of the HDFS checkpoint files.
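For example, a minimal sketch building on the word-count code above (the value 16 is just an illustrative partition count, not a recommendation):

// Explicitly choose the number of partitions used for the shuffled state update.
val stateDstream = wordDstream.updateStateByKey(UpdateFunc _, 16)
stateDstream.print()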
updateStateByKey() does not shuffle the state; rather, the new data is brought to the nodes containing the state for the same key.
Link to Tathagata's answer to a similar question: https://www.mail-archive.com/user#spark.apache.org/msg43512.html

Resources