Spark on Kubernetes, pods crashing abruptly - apache-spark

Below is the scenario being tested.
Job:
A Spark SQL job written in Scala, run against 1 TB of TPC-DS benchmark data stored as Snappy-compressed Parquet, with Hive tables created on top of it.
Cluster manager:
Kubernetes
Spark SQL configuration:
Set 1:
spark.executor.heartbeatInterval 20s
spark.executor.cores 4
spark.driver.cores 4
spark.driver.memory 15g
spark.executor.memory 15g
spark.cores.max 220
spark.rpc.numRetries 5
spark.rpc.retry.wait 5
spark.network.timeout 1800
spark.sql.broadcastTimeout 1200
spark.sql.crossJoin.enabled true
spark.sql.starJoinOptimization true
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
spark.sql.codegen true
spark.kubernetes.allocation.batch.size 30
Set 2:
spark.executor.heartbeatInterval 20s
spark.executor.cores 4
spark.driver.cores 4
spark.driver.memory 11g
spark.driver.memoryOverhead 4g
spark.executor.memory 11g
spark.executor.memoryOverhead 4g
spark.cores.max 220
spark.rpc.numRetries 5
spark.rpc.retry.wait 5
spark.network.timeout 1800
spark.sql.broadcastTimeout 1200
spark.sql.crossJoin.enabled true
spark.sql.starJoinOptimization true
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
spark.sql.codegen true
spark.kubernetes.allocation.batch.size 30
The Kryo serializer is being used, with spark.kryoserializer.buffer.mb set to 64 MB.
50 executors are spawned via the spark.executor.instances=50 submit argument.
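For reference, here is roughly how the serializer settings above are applied when the SparkSession is built. This is only a sketch: the application name is made up, and spark.kryoserializer.buffer is used as the current equivalent of the older spark.kryoserializer.buffer.mb property.

import org.apache.spark.sql.SparkSession

// Sketch only: serializer settings as listed above, applied at session build time.
val spark = SparkSession.builder()
  .appName("tpcds-sql-benchmark")  // hypothetical application name
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryoserializer.buffer", "64m")  // same intent as spark.kryoserializer.buffer.mb=64
  .enableHiveSupport()  // the TPC-DS Hive tables are read through the metastore
  .getOrCreate()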
Issues observed:
The Spark SQL job terminates abruptly, with the driver and executor pods being killed seemingly at random; once the pods are killed, the job fails.
A few different stack traces have been seen across runs:
Stack Trace 1:
"2018-05-10 06:31:28 ERROR ContextCleaner:91 - Error cleaning broadcast 136
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)"
File attached: StackTrace1.txt
Stack Trace 2:
"org.apache.spark.shuffle.FetchFailedException: Failed to connect to /192.178.1.105:38039^M
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)^M
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)"
File attached: StackTrace2.txt
Stack Trace 3:
"18/05/10 11:21:17 WARN KubernetesTaskSetManager: Lost task 3.0 in stage 48.0 (TID 16486, 192.178.1.35, executor 41): FetchFailed(null, shuffleId=29, mapId=-1, reduceId=3, message=^M
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 29^M
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)^M
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)"
File attached: StackTrace3.txt
Stack Trace 4:
"ERROR KubernetesTaskSchedulerImpl: Lost executor 11 on 192.178.1.123: Executor lost for unknown reasons."
This repeats constantly until the executors are all dead, without any further stack traces.
We also see: 18/05/11 07:23:23 INFO DAGScheduler: failed: Set()
What does this mean? Is something wrong, or does an empty failed set simply mean there were no failures?
Observations and changes tried:
- Memory and CPU utilization were monitored across executors, and none of them hit their limits.
- Based on various readings and suggestions, spark.network.timeout was increased from 600 to 1800, but that did not help.
- In Set 1 the driver and executor memory overhead was left at the default, i.e. 0.1 * 15g = 1.5 GB. In Set 2 this was explicitly increased to 4 GB and the driver and executor memory was reduced from 15 GB to 11 GB (see the memory arithmetic sketch below). This did not yield any improvement; the same failures are still observed.
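To spell out the pod-memory arithmetic behind the two sets, assuming the usual rule that the executor pod request is spark.executor.memory plus the memory overhead, which defaults to max(0.10 * executor memory, 384 MB):

// Sketch: executor pod memory request = executor memory + memory overhead,
// where the overhead defaults to max(0.10 * executor memory, 384 MB).
def podRequestGb(executorMemGb: Double, explicitOverheadGb: Option[Double]): Double = {
  val overheadGb = explicitOverheadGb.getOrElse(math.max(0.10 * executorMemGb, 0.384))
  executorMemGb + overheadGb
}

val set1PodGb = podRequestGb(15.0, None)       // 15 + 1.5 = 16.5 GB per executor pod
val set2PodGb = podRequestGb(11.0, Some(4.0))  // 11 + 4.0 = 15.0 GB per executor pod

So Set 1 asks Kubernetes for roughly 16.5 GB per executor pod, while Set 2 asks for about 15 GB.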
Spark SQL is used to run the queries. Sample code lines:
val qresult = spark.sql(q)
qresult.show()
No manual repartitioning is done in the code.
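For context, a minimal self-contained sketch of the same pattern, assuming spark is the Hive-enabled session and the TPC-DS tables (e.g. store_sales) are registered in the metastore; the query text here is a stand-in, not the actual benchmark query set:

// Sketch only: run each query string via Spark SQL, as in the sample lines above.
val queries: Map[String, String] = Map(
  "count_store_sales" -> "SELECT count(*) FROM store_sales"  // stand-in query
)
for ((name, q) <- queries) {
  println(s"Running query $name")
  val qresult = spark.sql(q)
  qresult.show()  // action that triggers execution
}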

Related

Running more than one Spark application in a cluster: not all applications run optimally, as some complete sooner than others

I am running 20 Spark applications on an EMR cluster of 2 worker nodes and 1 master node, all c5.24xlarge instances, so I have 192 cores and 1024 GB RAM in total.
Each application processes around 1.5 GB of data.
Dynamic allocation is enabled, and the rest of the Spark configuration is as follows:
spark.executor.memory = 9000M,
spark.executor.memoryOverhead = 1000M,
spark.executor.cores = 5,
spark.sql.shuffle.partitions = 40,
spark.dynamicAllocation.initialExecutors = 2,
spark.dynamicAllocation.maxExecutors = 10,
spark.dynamicAllocation.minExecutors = 10,
spark.dynamicAllocation.executorIdleTimeout = 60s,
spark.dynamicAllocation.schedulerBacklogTimeout = 120s,
spark.driver.memory = 9000M,
spark.driver.memoryOverhead = 1000M,
spark.driver.cores = 9
And spark.default.parallelism = 384 (this is decided by EMR; I am not sure how it is determined).
These are set at the cluster level, which means these properties apply to all 20 applications.
With these settings I can see that only a few applications complete in around 20 minutes, while around 10 applications keep running for more than 2 hours, some with only one task running and some with around 10.
Questions:
Why are the other applications not completing like the ones that did?
I set the shuffle partition count to 40, so why are around 350 tasks being created for each application? Is that because of the parallelism setting? (A sketch of this follows below.)
The data per task shows as about 1.4 GB for task 0, 1.3 GB for task 1, and so on. Is that simply how data is shown in the task pane, or is the data not being divided properly? (Although in the event metrics the partitions seem to be of the same size, execution-time-wise.)
The input shown in the executor summary tab is more than 1.4 GB, which is the sum of the input of all tasks, yet only 1.5 GB is being processed (according to the Spark SQL tab). So is that just how data is shown in the executor pane?
Thanks!
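On the task-count question above: as far as I understand it, spark.sql.shuffle.partitions only sets the number of post-shuffle partitions, while the task count of the initial scan stage follows the input splits (and spark.default.parallelism for plain RDD operations). A small sketch to check both, with the input path and column name made up:

// Sketch: spark.sql.shuffle.partitions controls post-shuffle partitions only;
// the scan stage's task count follows the input splits of the source data.
spark.conf.set("spark.sql.shuffle.partitions", "40")

val df = spark.read.parquet("s3://my-bucket/input/")  // hypothetical input path
println(df.rdd.getNumPartitions)                      // driven by input splits, can be ~350

val agg = df.groupBy("some_key").count()              // hypothetical grouping column
println(agg.rdd.getNumPartitions)                     // 40, from spark.sql.shuffle.partitions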

Spark Streaming receiver is processing only one record

I have 16 receivers in a Spark Streaming 2.2.1 job. After a while, some of the receivers process fewer and fewer records, eventually handling only one record per second. The behaviour can be observed in the attached screenshot.
While I understand the root cause can be difficult to find and may not be obvious, is there a way I could debug this problem further? Currently I have no idea where to start digging. Could it be related to back-pressure?
Spark streaming properties:
spark.app.id application_1599135282140_1222
spark.cores.max 64
spark.driver.cores 4
spark.driver.extraJavaOptions -XX:+PrintFlagsFinal -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dump/ -Dlog4j.configuration=file:///tmp/4f892127ad794245aef295c97ccbc5c9/driver_log4j.properties
spark.driver.maxResultSize 3840m
spark.driver.memory 4g
spark.driver.port 36201
spark.dynamicAllocation.enabled false
spark.dynamicAllocation.maxExecutors 10000
spark.dynamicAllocation.minExecutors 1
spark.eventLog.enabled false
spark.executor.cores 4
spark.executor.extraJavaOptions -XX:+PrintFlagsFinal -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dump/
spark.executor.id driver
spark.executor.instances 16
spark.executor.memory 4g
spark.jars file:/tmp/4f892127ad794245aef295c97ccbc5c9/main-e41d1cc.jar
spark.master yarn
spark.rpc.message.maxSize 512
spark.scheduler.maxRegisteredResourcesWaitingTime 300s
spark.scheduler.minRegisteredResourcesRatio 1.0
spark.scheduler.mode FAIR
spark.shuffle.service.enabled true
spark.sql.cbo.enabled true
spark.streaming.backpressure.enabled true
spark.streaming.backpressure.initialRate 25
spark.streaming.backpressure.pid.minRate 1
spark.streaming.concurrentJobs 1
spark.streaming.receiver.maxRate 100
spark.submit.deployMode client
It seems the problem started manifesting after running for about 30 minutes. I think back-pressure could be a reason. According to this article:
With activated backpressure, the driver monitors the current batch scheduling delays and processing times and dynamically adjusts the maximum rate of the receivers. The communication of new rate limits can be verified in the receiver log:
2016-12-06 08:27:02,572 INFO org.apache.spark.streaming.receiver.ReceiverSupervisorImpl Received a new rate limit: 51.
Here is what I would recommend you try:
- Check the receiver log to see whether backpressure is being triggered.
- Check your stream sink for any errors.
- Check the YARN ResourceManager for resource utilization.
- Tune the Spark backpressure parameters to see if that makes a difference (a sketch follows below).
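For the last point, a minimal sketch of the backpressure knobs already present in your property list, assuming those are the parameters worth varying; the values are illustrative, not recommendations:

import org.apache.spark.SparkConf

// Sketch only: backpressure-related knobs from the property list above,
// with purely illustrative values to experiment with.
val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.initialRate", "100")  // currently 25
  .set("spark.streaming.backpressure.pid.minRate", "10")   // currently 1; raises the rate floor
  .set("spark.streaming.receiver.maxRate", "500")          // currently 100; raises the ceiling

Any change should show up in the "Received a new rate limit" messages from ReceiverSupervisorImpl quoted above.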

Hadoop: spark job fails to process small dataset

Our trajectory data mining code finishes quickly with 2M data, but it fails with larger data such as 20M due to many failed tasks. We tried increasing the memory, but it still fails. We have a 3-machine cluster with 4 cores and 32 GB RAM.
And our configuration is:
spark.executor.memory 26g
spark.executor.cores 2
spark.driver.memory 6g
The errors that appear while we try to solve the problem include "Missing an output location for shuffle" and "max number of executor failures (3) reached".
It doesn't seem to be a memory issue. Did you enable dynamic resource allocation (spark.dynamicAllocation.enabled)? That will dynamically increase your executor count until the physical limits are reached. Also, make sure you're submitting the job in cluster mode.
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
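A minimal sketch of what enabling it could look like, assuming a YARN cluster with the external shuffle service running on each NodeManager (which dynamic allocation requires); the executor cap is illustrative:

import org.apache.spark.SparkConf

// Sketch only: enable dynamic resource allocation on YARN.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")      // external shuffle service on the NodeManagers
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "6")  // illustrative cap for a 3-node cluster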

Losing Spark Executors with many tasks outstanding

Whether I use dynamic allocation or explicitly specify executors (16) and executor cores (8), I have been losing executors even though the tasks outstanding are well beyond the current number of executors.
For example, I have a job (Spark SQL) running with over 27,000 tasks and 14,000 of them were complete, but executors "decayed" from 128 down to as few as 16 with thousands of tasks still outstanding. The log doesn't note any errors/exceptions precipitating these lost executors.
It is a Cloudera CDH 5.10 cluster running on AWS EC2 instances with 136 CPU cores and Spark 2.1.0 (from Cloudera).
17/05/23 18:54:17 INFO yarn.YarnAllocator: Driver requested a total number of 91 executor(s).
17/05/23 18:54:17 INFO yarn.YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 91 executors.
It's a slow decay where every minute or so more executors are removed.
Some potentially relevant configuration options:
spark.dynamicAllocation.maxExecutors = 136
spark.dynamicAllocation.minExecutors = 1
spark.dynamicAllocation.initialExecutors = 1
yarn.nodemanager.resource.cpu-vcores = 8
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.increment-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 8
Why are the executors decaying away and how can I prevent it?
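Not a confirmed diagnosis, but one setting worth sketching here is the dynamic allocation idle timeout: executors the scheduler considers idle are released after 60 seconds by default, which would at least match the slow, roughly per-minute decay described above. The property names are standard; the values are only illustrative:

import org.apache.spark.SparkConf

// Sketch only: keep executors around longer before dynamic allocation releases them.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.executorIdleTimeout", "300s")        // default 60s
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "600s")  // default: never released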

Executor without H2O instance discovered, killing the cloud

I'm running a Tweedie GLM using Sparkling Water on data of different sizes, i.e. 20 MB, 400 MB, 2 GB, and 25 GB. The code works fine with 10 sampling iterations, but I have to test a larger sampling scenario:
Sampling iterations: 500
In this case the code still works for the 20 MB and 400 MB data, but it starts throwing errors when the data is 2 GB or larger.
After searching I found one suggested solution, disabling the topology change listener, but that did not work for the larger data:
--conf "spark.scheduler.minRegisteredResourcesRatio=1" --conf "spark.ext.h2o.topology.change.listener.enabled=false"
Here is my spark-submit configuration:
spark-submit \
--packages ai.h2o:sparkling-water-core_2.10:1.6.1,log4j:log4j:1.2.17 \
--driver-memory 8g \
--executor-memory 10g \
--num-executors 10 \
--executor-cores 5 \
--conf "spark.scheduler.minRegisteredResourcesRatio=1" \
--conf "spark.ext.h2o.topology.change.listener.enabled=false" \
--class TweedieGLM target/SparklingWaterGLM.jar \
$1 \
$2
This is the error I got:
16/07/08 20:39:55 ERROR YarnScheduler: Lost executor 2 on cfclbv0152.us2.oraclecloud.com: Executor heartbeat timed out after 175455 ms
16/07/08 20:40:00 ERROR YarnScheduler: Lost executor 2 on cfclbv0152.us2.oraclecloud.com: remote Rpc client disassociated
16/07/08 20:40:00 ERROR LiveListenerBus: Listener anon1 threw an exception
java.lang.IllegalArgumentException: Executor without H2O instance discovered, killing the cloud!
at org.apache.spark.h2o.H2OContext$$anon$1.onExecutorAdded(H2OContext.scala:203)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:58)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
After carefully reading the issue posted on GitHub, https://github.com/h2oai/sparkling-water/issues/32, I tried a couple of options. Here is what I tried:
Added:
--conf "spark.scheduler.minRegisteredResourcesRatio=1" --conf "spark.ext.h2o.topology.change.listener.enabled=false" --conf "spark.locality.wait=3000" --conf "spark.ext.h2o.network.mask=10.196.64.0/24"
Changed:
- executors from 10 to 3, 6, and 9
- executor-memory from 4 to 12 and from 12 to 24 GB
- driver-memory from 4 to 12 and from 12 to 24 GB
This is what I learned: GLM is a memory-intensive job, so we have to provide sufficient memory to execute it.
I would troubleshoot this problem using the Sparkling Water shell, executing one step at a time:
- Start the shell
- Start H2O
- Monitor the state of the cluster
Then:
- Read the input data and cache it
- Read the YARN logs to find out why the tasks are getting killed; YARN preemption often kills executors
- Increase the Spark wait time for starting the H2O process
- Decrease the number of executors to just 3 / increase cores to 3 / increase executor memory to 6 GB
- Monitor the Spark UI and H2O Flow UI to see what is going on with memory at each stage
As a general rule, the memory of the H2O cluster should be about 5 times the size of the input data (a worked example follows below). Are you crossing that limit with each iteration? 2 GB seems very small; we process huge volumes every day using Sparkling Water and Spark.
There are some suggestions on the H2O website:
https://github.com/h2oai/sparkling-water/blob/master/doc/configuration/internal_backend_tuning.rst
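To make that rule of thumb concrete, a small worked sketch using the data sizes from the question and the executor count from the spark-submit above (this is only the heuristic, not a guarantee):

// Sketch: the 5x rule of thumb applied to the data sizes from the question.
def recommendedClusterGb(inputGb: Double): Double = 5.0 * inputGb

val for2Gb  = recommendedClusterGb(2.0)   // ~10 GB of executor memory across the cluster
val for25Gb = recommendedClusterGb(25.0)  // ~125 GB; with 10 executors that is ~12.5 GB each,
                                          // more than the 10 GB per executor used above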

Resources