Spark Streaming Dynamic Allocation ExecutorAllocationManager

We have a Spark 2.1 streaming application that uses mapWithState, with spark.streaming.dynamicAllocation.enabled=true. The pipeline is as follows:
var rdd_out = ssc.textFileStream()
  .map(convertToEvent(_))
  .combineByKey(...., new HashPartitioner(partitions))
  .mapWithState(stateSpec)
  .map(s => sessionAnalysis(s))
  .foreachRDD(rdd => rdd.toDF().....save(output));
The streaming app starts with 2 executors and, as the load increases, allocates new executors as expected. The problem is that the load is not shared by the new executors.
The number of partitions is large enough to spill over to the new executors and the keys are evenly distributed: I set it up with 40+ partitions, yet I can see only 8 partitions (2 executors x 4 cores each) in the mapWithState storage. I expected that when new executors are allocated, those 8 partitions would be split and reassigned to the new ones, but this never happens.
Please advise.
Thanks,

Apparently the answer was staring me in the face all along :). As per the documentation below, RDDs should inherit the upstream partitioning.
* Otherwise, we use a default HashPartitioner. For the number of partitions, if
* spark.default.parallelism is set, then we'll use the value from SparkContext
* defaultParallelism, otherwise we'll use the max number of upstream partitions.
The state inside a mapWithState, however, does not have an upstream RDD, so it is set to the default parallelism unless you specify the partitions directly in the StateSpec, as in the example below.
val stateSpec = StateSpec.function(crediting.formSession _)
  .timeout(timeout)
  .numPartitions(partitions) // <----------
var rdd_out = ssc.textFileStream()
  .map(convertToEvent(_))
  .combineByKey(...., new HashPartitioner(partitions))
  .mapWithState(stateSpec)
  .map(s => sessionAnalysis(s))
  .foreachRDD(rdd => rdd.toDF().....save(output));
I still need to figure out how to make the number of partitions dynamic, since with dynamic allocation it should change at runtime.

Related

Spark configuration based on my data size

I know there's a way to configure a Spark application based on your cluster resources ("executor memory", "number of executors" and "executor cores"). I'm wondering if there is a way to do it considering the input data size.
What would happen if data input size does not fit into all partitions?
Example:
Data input size = 200GB
Number of partitions in cluster = 100
Size of partitions = 128MB
Total size that partitions could handle = 100 * 128MB = 12.8GB
What about the rest of the data (~187GB)?
I guess Spark will wait for the resources to be free, since it is designed to process data in batches. Is this a correct assumption?
Thanks in advance
For best performance, I recommend not setting spark.executor.cores; you want one executor per worker. Also, give roughly 70% of the available memory to spark.executor.memory. Finally, if you want real-time application statistics to influence the number of partitions, use Spark 3, since it comes with Adaptive Query Execution (AQE). With AQE, Spark will dynamically coalesce shuffle partitions, so you can set an arbitrarily large number of partitions, such as:
spark.sql.shuffle.partitions=<number of cores * 50>
Then just let AQE do its thing. You can read more about it here:
https://www.databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
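For illustration, a minimal sketch (the property names are from the Spark 3 docs; the app name and the shuffle-partition count are arbitrary examples) of enabling AQE and partition coalescing on a SparkSession:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-example")
  .config("spark.sql.adaptive.enabled", "true")                    // turn on AQE
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true") // let AQE merge small shuffle partitions
  .config("spark.sql.shuffle.partitions", "800")                   // e.g. 16 cores * 50, deliberately large
  .getOrCreate()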
There are two aspects to your question: the first is regarding storage of this data, and the second is regarding execution.
With regards to storage, when you say size of partitions = 128MB, I assume you use HDFS to store this data and 128MB is your default block size. HDFS itself internally decides how to split this 200GB file and store it in chunks not exceeding 128MB, and your HDFS cluster should have more than 200GB * replication factor of combined storage to persist this data.
Coming to the Spark execution part of the question, once you define spark.default.parallelism=100, Spark will use this value as the default level of parallelism when performing certain operations (like joins, etc.). Please note that the amount of data processed by each task is not affected by the block size (128MB) in any way, which means each task will work on roughly 200GB / 100 = 2GB of data (provided the executor memory is sufficient for the operation being performed). If there isn't enough capacity in the Spark cluster to run all of these tasks in parallel, it will launch as many as it can, in batches, as and when resources become available.
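As a small sketch (the HDFS path and app name are placeholders), this makes the distinction visible: the input partition count comes from the block size, while spark.default.parallelism only supplies the default for shuffles and joins:
import org.apache.spark.sql.SparkSession

object ParallelismCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parallelism-check")
      .config("spark.default.parallelism", "100") // default used by RDD joins, reduces, etc.
      .getOrCreate()

    // Placeholder path for the 200GB dataset; ~200GB / 128MB blocks -> ~1600 input partitions.
    val rdd = spark.sparkContext.textFile("hdfs:///data/input-200g")
    println(s"input partitions (from block size): ${rdd.getNumPartitions}")
    println(s"default parallelism:                ${spark.sparkContext.defaultParallelism}")

    spark.stop()
  }
}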

How is a Spark Dataframe partitioned by default?

I know that an RDD is partitioned based on the key values using the HashPartitioner. But how is a Spark Dataframe partitioned by default, since it does not have the concept of key/value?
A Dataframe is partitioned depending on the number of tasks that run to create it.
There is no "default" partitioning logic applied. Here are some examples how partitions are set:
A Dataframe created through val df = Seq(1 to 500000: _*).toDF() will have only a single partition.
A Dataframe created through val df = spark.range(0,100).toDF() has as many partitions as the number of available cores (e.g. 4 when your master is set to local[4]). Also, see remark below on the "default parallelism" that comes into effect for operations like parallelize with no parent RDD.
A Dataframe derived from an RDD (spark.createDataFrame(rdd, schema)) will have the same amount of partitions as the underlying RDD. In my case, as I have locally 6 cores, the RDD got created with 6 partitions.
A Dataframe consuming from a Kafka topic will have as many partitions as the topic, because Spark can use as many cores/slots as the topic has partitions to consume it.
A Dataframe created by reading a file, e.g. from HDFS, will have the amount of partitions matching that of the file, unless individual files have to be split into multiple partitions based on spark.sql.files.maxPartitionBytes, which defaults to 128MB.
A Dataframe derived from a transformation requiring a shuffle will have the configurable amount of partitions set by spark.sql.shuffle.partitions (200 by default).
...
One of the major distinctions between the RDD and Structured APIs is that you do not have as much control over the partitions as you have with RDDs, where you can even define a custom partitioner. This is not possible with Dataframes.
Default Parallelism
The documentation of the Execution Behavior configuration spark.default.parallelism explains:
For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger
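As a short sketch (assuming a local[4] master and the pre-AQE shuffle behaviour), you can verify several of the counts listed above with df.rdd.getNumPartitions:
import org.apache.spark.sql.SparkSession

object DefaultPartitioning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("default-partitioning")
      .getOrCreate()
    import spark.implicits._

    val fromSeq = Seq(1 to 500000: _*).toDF()
    println(s"toDF:          ${fromSeq.rdd.getNumPartitions}")   // 1

    val fromRange = spark.range(0, 100).toDF()
    println(s"range:         ${fromRange.rdd.getNumPartitions}") // 4 = number of cores

    val shuffled = fromRange.groupBy($"id" % 10).count()
    println(s"after shuffle: ${shuffled.rdd.getNumPartitions}")  // spark.sql.shuffle.partitions (200);
                                                                 // fewer if AQE coalescing is enabled
    spark.stop()
  }
}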

why are there so many partitions in a single-element RDD

The following code returns 16 partitions. How is it possible to have 16 partitions for an array containing a single element?
rdd = sc.parallelize([""])
rdd.getNumPartitions()
The number of partitions in an RDD created by sc.parallelize depends on the scheduler implementation used.
The SchedulerBackend trait has this method:
def defaultParallelism(): Int
The CoarseGrainedSchedulerBackend (which is used by YARN) has this implementation:
override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}
LocalSchedulerBackend has the following implementation:
override def defaultParallelism(): Int =
  scheduler.conf.getInt("spark.default.parallelism", totalCores)
That's why your RDD has 16 partitions.
In the case of the parallelize API, it depends on the cluster manager:
In local mode it is the total number of cores on your machine.
In Mesos fine-grained mode it is 8.
On YARN it is the total number of cores on all executor nodes or 2, whichever is higher.
These are the defaults used if you don't provide the number of partitions explicitly.
Yes, your RDD will have 16 partitions, but 15 of them will be empty. You can check this e.g. with rdd.mapPartitions (see Apache Spark: Get number of records per partition). The number 16 comes from spark.default.parallelism in your case and depends on your environment, but not on the size of your data.
In general, empty partitions do not hurt; they finish very quickly. You could also repartition or coalesce to 1 partition if you don't like empty partitions (see e.g. Dropping empty DataFrame partitions in Apache Spark), but I would not recommend that.
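For example, a minimal sketch (written in Scala here, while the question used PySpark) that counts the records per partition and shows most of the 16 are empty:
import org.apache.spark.sql.SparkSession

object EmptyPartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("empty-partitions").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(Seq(""))
    println(s"partitions: ${rdd.getNumPartitions}") // spark.default.parallelism, e.g. 16

    rdd.mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()
      .foreach { case (idx, n) => println(s"partition $idx -> $n record(s)") }

    // If the empty partitions bother you, squash them (or pass numSlices to parallelize):
    println(s"after coalesce(1): ${rdd.coalesce(1).getNumPartitions}")

    spark.stop()
  }
}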

Repartitioning of dataframe in spark does not work

I have a Cassandra database with a large number of records, ~4 million. I have 3 slave machines and one driver. I want to load this data into Spark memory and process it. When I do the following it reads all the data onto one slave machine (300 MB out of 6 GB) and the memory of all the other slave machines is unused. I repartitioned the dataframe into 3, but the data is still on one machine. Because of this it takes a lot of time to process the data, since every job is executed on one machine. This is what I am doing:
val tabledf = _sqlContext.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "events", "keyspace" -> "sams")).load
tabledf.registerTempTable("tempdf");
_sqlContext.cacheTable("tempdf");
val rdd = _sqlContext.sql(query);
val partitionedRdd = rdd.repartition(3)
val count = partitionedRdd.count.toInt
When I do some operations on partitionedRdd, they are executed on only one machine, since all the data is present on one machine only.
UPDATE
I am using --conf spark.cassandra.input.split.size_in_mb=32 in the configuration, but all my data is still loaded into one executor.
Update
I am using Spark version 1.4 and the released Spark Cassandra Connector version 1.4.
If "Query" only accesses a single C* partition key you will only get a single task because we don't have a way (yet) of automatically getting a single cassandra partition in parallel. If you are accessing multiple C* partitions then try futher shrinking the input split_size in mb.

Spark Streaming not distributing task to nodes on cluster

I have a two-node standalone cluster for Spark stream processing. Below is sample code which demonstrates the process I am executing:
sparkConf.setMaster("spark://rsplws224:7077")
val ssc=new StreamingContext()
println(ssc.sparkContext.master)
val inDStream = ssc.receiverStream //batch of 500 ms as i would like to have 1 sec latency
val filteredDStream = inDStream.filter // filtering unwanted tuples
val keyDStream = filteredDStream.map // converting to pair dstream
val stateStream = keyDStream .updateStateByKey //updating state for history
stateStream.checkpoint(Milliseconds(2500)) // to remove the long lineage and materialize the state stream
stateStream.count()
val withHistory = keyDStream.join(stateStream) // joining state with the input stream for further processing
val alertStream = withHistory.filter // decision to be taken by comparing history state and current tuple data
alertStream.foreach // notification to other system
My problem is that Spark is not distributing this state RDD to multiple nodes and is not distributing tasks to the other node, which causes high response latency; my input load is around 100,000 tuples per second.
I have tried the following, but nothing works:
1) setting spark.locality.wait to 1 sec
2) reducing the memory allocated to the executor process, to check whether Spark distributes the RDD or tasks; it does not, even when usage goes beyond the memory limit of the first node (m1), where the driver is also running
3) increasing spark.streaming.concurrentJobs from 1 (default) to 3
4) checking in the Streaming UI's Storage tab that there are around 20 partitions for the state DStream RDD, all located on the local node m1
If I run SparkPi 100000, Spark is able to utilize the other node after a few seconds (30-40), so I am sure my cluster configuration is fine.
Edit
One thing I have noticed: even if I set the storage level of my RDD to MEMORY_AND_DISK_SER_2, the app UI's Storage tab still shows Memory Serialized 1x Replicated.
Spark will not distribute stream data across the cluster automatically, because it tends to make full use of data locality (launching a task where its data lies is preferable; this is the default configuration). But you can use repartition to distribute the stream data and improve the parallelism, as sketched below. See http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#performance-tuning for more information.
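A minimal, hedged sketch of that repartition advice, using a socket stream as a stand-in for the custom receiver in the question (master URL taken from the question; host, port and checkpoint path are placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

object RepartitionStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("repartition-stream").setMaster("spark://rsplws224:7077")
    val ssc = new StreamingContext(conf, Milliseconds(500))
    ssc.checkpoint("/tmp/stream-checkpoint") // needed for the stateful operations

    // A single receiver lands all records on one executor; repartition spreads them out
    // before the stateful/joining stages.
    val inDStream = ssc.socketTextStream("localhost", 9999)
    val distributed = inDStream.repartition(ssc.sparkContext.defaultParallelism)

    distributed.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}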
If you're not hitting the cluster and your jobs only run locally, it most likely means the Spark master in your SparkConf is set to the local URI rather than the master URI.
By default, the value of the spark.default.parallelism property is tied to the deploy mode (in local mode it is the number of cores on the local machine), so all the tasks will be executed on the node that is receiving the data.
Change this property in the spark-defaults.conf file in order to increase the parallelism level.
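For example (the number 16 is purely illustrative), adding the line spark.default.parallelism 16 to conf/spark-defaults.conf, or passing --conf spark.default.parallelism=16 to spark-submit, raises the default number of partitions used by operations that don't specify one.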
