How to Dynamically Increase Active Tasks in Spark running on Yarn - apache-spark

I am running a Spark Streaming process where I get a batch of 6000 events, but when I look at the executors only one active task is running. I tried dynamic allocation as well as setting the number of executors, etc. Even with 15 executors, only one active task runs at a time. Can anyone please point out what I am doing wrong here?

It looks like you have only one partition in your DStream. You should try to explicitly repartition your input stream:
val input: DStream[...] = ...
val partitionedInput = input.repartition(numPartitions = 16)
This way you would have 16 partitions in your input DStream, and each of those partitions can be processed in a separate task (and each of those tasks can run on a separate executor).
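If you want to verify that the repartitioning took effect, a minimal sketch (using the partitionedInput stream from above) is to print the partition count of each micro-batch:
// Illustrative sanity check: log how many partitions each micro-batch RDD has
partitionedInput.foreachRDD { rdd =>
  println(s"partitions in this batch: ${rdd.partitions.size}")
}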

Related

why does a single core of a Spark worker complete each task faster than the rest of the cores in the other workers?

I have three nodes in a cluster, each with a single active core. I mean, I have 3 cores in the cluster.
Assuming that all partitions have almost the same number of records, why does a single core of a worker complete each task faster than the rest of the cores in the other workers?
Please observe this screenshot. The timeline shows that the latency of the worker core (x.x.x.230) is notably shorter than that of the other two worker cores (x.x.x.210 and x.x.x.220).
This means that the workers x.x.x.210 and x.x.x.220 are doing the same job in a longer time compared to the worker x.x.x.230. This also happens when all the available cores in the cluster are used, but then the delay is not as critical.
I submitted this application again. Look at this new screenshot. Now the fastest worker is the x.x.x.210. Observe that tasks 0, 1 and 2 process partitions with almost the same number of records. This execution time discrepancy is not good, is it?
I don't understand!!!
What I'm really doing is creating a DataFrame and doing a mapping operation to get a new DataFrame, saving the result in a Parquet file.
val input: DataFrame = spark.read.parquet(...)        // read the source Parquet file
val result = input.map(row => /* ...operations... */) // per-row transformation (map on a DataFrame returns a Dataset and needs an implicit Encoder)
result.write.parquet(...)                             // write the transformed data to a new Parquet file
Any idea why this happens? Is that how Spark operates normally?
Thanks in advance.

spark web UI notations

I am running a sample job on my end, and the Spark job UI says the total uptime is 26 sec, but when I add up the Duration column for the jobs it comes to only around 17-18 sec. Which one should I rely on to determine the total time taken by the execution logic of my job? I am not concerned about the time it takes to start and stop the cluster. Does the 26 sec include that time? If so, how do I exclude the cluster start/stop time and get the final execution time for my logic?
Spark job UI
Also my spark configuration looks like this :
val conf = new SparkConf()
  .setAppName("Metrics")
  .setMaster("spark://master:7077")
  .set("spark.executor.memory", "5g")
  .set("spark.cores.max", "4")
  .set("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
I have a machine with 2 physical cores and 2 virtual cores, i.e. 4 logical cores. I am trying to use all of them by setting 4 cores in the configuration, but for some reason only 1 executor is used to run the job. Can somebody explain why only 1 executor is spawned, and what the relation is between a core and an executor in the Spark world? I am new to Spark, so any help would be great.
Executor for the job here
A single executor can use multiple threads, as in your case: you have one executor with 4 cores.
Each executor thread can process a single partition at a time, so your cluster can process four partitions concurrently.
In a small setup like this there is no reason to start multiple executor JVMs, but if you want to, you can use spark.executor.cores to configure how many cores a single executor can use.
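For example, a minimal sketch of splitting the same 4 cores across two executors (the property names are standard Spark settings; the memory value is just illustrative):
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values: with spark.cores.max = 4 and spark.executor.cores = 2,
// the standalone master can launch 2 executors of 2 cores each for this application.
val conf = new SparkConf()
  .setAppName("Metrics")
  .setMaster("spark://master:7077")
  .set("spark.cores.max", "4")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)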

How does Apache Spark assign partition-ids to its executors

I have a long-running Spark Streaming job which uses 16 executors with only one core each.
I use the default partitioner (HashPartitioner) to equally distribute data to 16 partitions. Inside the updateStateByKey function, I checked the partition id from TaskContext.getPartitionId() for multiple batches and found that the partition id handled by an executor is quite consistent, but still changes to another id after a long run.
I'm planning to do some optimization on the Spark updateStateByKey API, but it can't be achieved if the partition id keeps changing between batches.
So when does Spark change the partition id handled by an executor?
Most probably the task failed and was restarted, so the TaskContext changed, and so did the partitionId.
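For reference, a minimal sketch of the kind of check described in the question (assuming a hypothetical DStream[(String, Int)] called keyDStream whose state is a running sum per key):
import org.apache.spark.TaskContext

// Illustrative only: log which partition the update function runs in on every batch
val stateStream = keyDStream.updateStateByKey[Int] { (values: Seq[Int], state: Option[Int]) =>
  println(s"update running in partition ${TaskContext.getPartitionId()}")
  Some(state.getOrElse(0) + values.sum)
}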

what factors affect how many spark jobs run concurrently

We recently set up the Spark Job Server, to which our Spark jobs are submitted. But we found that our 20-node (8 cores / 128 GB memory per node) Spark cluster can only run 10 Spark jobs concurrently.
Can someone share some detailed info about what factors would actually affect how many spark jobs can be run concurrently? How can we tune the conf so that we can take full advantage of the cluster?
The question is missing some context, but first: it seems like Spark Job Server limits the number of concurrent jobs (unlike Spark itself, which puts a limit on the number of tasks, not jobs):
From application.conf
# Number of jobs that can be run simultaneously per context
# If not set, defaults to number of cores on machine where jobserver is running
max-jobs-per-context = 8
If that's not the issue (you set the limit higher, or are using more than one context), then the total number of cores in the cluster (8 * 20 = 160) is the maximum number of concurrent tasks. If each of your jobs creates 16 tasks, then 10 jobs saturate all 160 cores, and Spark will queue the next incoming job until CPUs become available.
Spark creates a task per partition of the input data, and the number of partitions is decided by the partitioning of the input on disk, or by calling repartition or coalesce on the RDD/DataFrame to change the partitioning manually. Some other operations that combine more than one RDD (e.g. union) may also change the number of partitions.
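As a quick, illustrative check (the input path is hypothetical), you can inspect the partitioning and widen it so a stage can use more of the 160 cores:
val rdd = sc.textFile("hdfs:///path/to/input") // hypothetical input
println(rdd.partitions.size)                   // number of tasks the first stage will get
val widened = rdd.repartition(160)             // e.g. match the 160 cores in the cluster
println(widened.partitions.size)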
Some things that could limit the parallelism that you're seeing:
If your job consists only of map operations (or other shuffle-less operations), it will be limited to the number of partitions of data you have. So even if you have 20 executors, if you have 10 partitions of data it will only spawn 10 tasks (unless the data is splittable, as with Parquet, LZO-indexed text, etc.).
If you're performing a take() operation (without a shuffle), it performs an exponential take: it starts with a single task and then grows the number of partitions scanned until it collects enough data to satisfy the take operation. (Another question similar to this)
Can you share more about your workflow? That would help us diagnose it.

Spark Streaming not distributing task to nodes on cluster

I have a two-node standalone cluster for Spark stream processing. Below is my sample code, which demonstrates the process I am executing.
sparkConf.setMaster("spark://rsplws224:7077")
val ssc = new StreamingContext(sparkConf, Milliseconds(500)) // batch of 500 ms as I would like to have 1 sec latency
println(ssc.sparkContext.master)
val inDStream = ssc.receiverStream(...)            // receiver producing the input tuples
val filteredDStream = inDStream.filter(...)        // filtering unwanted tuples
val keyDStream = filteredDStream.map(...)          // converting to a pair DStream
val stateStream = keyDStream.updateStateByKey(...) // updating state for history
stateStream.checkpoint(Milliseconds(2500))         // to cut the long lineage and materialize the state stream
stateStream.count()
val withHistory = keyDStream.join(stateStream)     // joining state with input stream for further processing
val alertStream = withHistory.filter(...)          // decision taken by comparing history state and current tuple data
alertStream.foreach(...)                           // notification to other systems
My problem is that Spark is not distributing this state RDD to multiple nodes, or not distributing tasks to the other node, which causes high latency in the response; my input load is around 100,000 tuples per second.
I have tried the things below but nothing is working:
1) set spark.locality.wait to 1 sec
2) reduced the memory allocated to the executor process to check whether Spark distributes the RDD or tasks, but it does not, even when memory usage goes beyond the limit of the first node (m1), where the driver is also running
3) increased spark.streaming.concurrentJobs from 1 (default) to 3
4) checked in the Streaming UI storage tab that there are around 20 partitions for the state DStream RDD, all located on the local node m1
If I run SparkPi 100000, Spark is able to utilize another node after a few seconds (30-40), so I am sure that my cluster configuration is fine.
Edit
One thing I have noticed: even if I set the storage level of my RDD to MEMORY_AND_DISK_SER_2, the app UI storage tab still shows "Memory Serialized 1x Replicated".
Spark will not distribute stream data across the cluster automatically, because it tends to make full use of data locality (launching a task where its data lies is better; this is the default behaviour). But you can use repartition to distribute the stream data and improve the parallelism. You can turn to http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#performance-tuning for more information.
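A minimal sketch of that, applied to the receiver stream from the question (the partition count of 8 is illustrative; a common rule of thumb is 2-3 partitions per core in the cluster):
// Spread the received data across both nodes right after the receiver,
// before the filter/map/updateStateByKey pipeline shown above
val distributedDStream = inDStream.repartition(8)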
If you're not hitting the cluster and your jobs only run locally, it most likely means the Spark master in your SparkConf is set to the local URI instead of the master URI.
By default the spark.default.parallelism property is set as if running in local mode, so all the tasks are executed on the node that is receiving the data.
Change this property in the spark-defaults.conf file in order to increase the parallelism level.
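For example, a possible spark-defaults.conf entry (the value is illustrative; a common rule of thumb is 2-3 tasks per CPU core in the cluster):
# spark-defaults.conf
spark.default.parallelism    8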
