The Spark Streaming job, which is supposed to run continuously, exited abruptly with the following error (found in the executor logs):
2017-07-28 00:19:38,807 [SIGTERM handler] ERROR org.apache.spark.util.SignalUtils$$anonfun$registerLogger$1$$anonfun$apply$1 (SignalUtils.scala:43) - RECEIVED SIGNAL TERM
The Spark Streaming job ran for ~62 hours before receiving this signal.
I couldn't find any other ERROR/WARN entries in the executor logs. Unfortunately I haven't set up driver logging yet, so I can't dig deeper into this specific issue.
I am using a Spark cluster in standalone mode.
Is there any reason why the driver might send this signal after the streaming job had been running fine for more than 60 hours?
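For reference, while digging into the cause I am considering enabling graceful shutdown so that a TERM signal at least drains in-flight batches instead of killing work mid-batch. A minimal sketch of what I mean, assuming the DStream-based StreamingContext API; the app name and the socket source are placeholders, not our actual pipeline:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("my-streaming-job") // placeholder app name
      // ask Spark to finish in-flight batches on shutdown instead of stopping abruptly
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    val ssc = new StreamingContext(conf, Seconds(10))

    // placeholder source/sink just to keep the sketch self-contained
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()

    ssc.start()
    ssc.awaitTermination()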
Related
We are migrating two Spark Structured Streaming jobs from on-prem to GCP.
One of them streams messages from Kafka and saves them to GCS; the other streams from GCS and saves to BigQuery.
Sometimes these jobs fail because of some problem, for example OutOfMemoryError, "Connection reset by peer", or "Java heap space".
When we get an exception in the on-prem environment, YARN marks the job as FAILED and we have a scheduler flow that resubmits the job.
In GCP we built the same flow to resubmit the job when it fails, but when we get an exception in Dataproc, YARN marks the job as SUCCEEDED and Dataproc keeps the status RUNNING.
You can see in this image the log with the StreamingQueryException while the status of the job is still Running ("Em execução" is Portuguese for "running").
Dataproc job
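One pattern we are experimenting with (our own sketch, not an official Dataproc recommendation; the object name, broker, topic, and bucket paths are placeholders) is to catch the StreamingQueryException around awaitTermination and exit with a non-zero code, so that the driver process itself fails and YARN/Dataproc can report the failure:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQueryException

    object KafkaToGcsJob { // placeholder job name
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-to-gcs").getOrCreate()

        // placeholder broker, topic, and bucket names
        val df = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()

        val query = df.writeStream
          .format("parquet")
          .option("path", "gs://my-bucket/events/")
          .option("checkpointLocation", "gs://my-bucket/checkpoints/events/")
          .start()

        try {
          query.awaitTermination()
        } catch {
          case e: StreamingQueryException =>
            // make the driver process exit non-zero so YARN / Dataproc can mark the job as failed
            e.printStackTrace()
            sys.exit(1)
        }
      }
    }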
We have a Structured Streaming job running on AWS EMR. Apparently everything is OK, but after some time, approximately 24 hours, the streaming job stops processing messages coming from Kafka; after I restart the streaming job it processes them again.
Is there some configuration for this issue?
The only error log we get is:
Application application_1575912755011_0035 failed 5 times due to ApplicationMaster for attempt appattempt_1575912755011_0035_000005 timed out. Failing the application.
What are the causes of this?
What is the resolution?
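To see whether the query is still producing micro-batches when it appears to stall, I am thinking of attaching a StreamingQueryListener along these lines (my own sketch; the class name is made up):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

    // Logs every micro-batch, so a query that silently stops pulling from Kafka
    // shows up in the driver log as a gap in progress events.
    class ProgressLogger extends StreamingQueryListener { // made-up class name
      override def onQueryStarted(event: QueryStartedEvent): Unit =
        println(s"query started: ${event.id}")
      override def onQueryProgress(event: QueryProgressEvent): Unit =
        println(s"batchId=${event.progress.batchId} numInputRows=${event.progress.numInputRows} at ${event.progress.timestamp}")
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
        println(s"query terminated: ${event.id}, exception=${event.exception}")
    }

    val spark = SparkSession.builder().getOrCreate()
    spark.streams.addListener(new ProgressLogger)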
Check if this is related: https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-troubleshoot-application-stops
If that doesn't help, you might try running your streaming jobs in Qubole using Quest (Spark Streaming with ease of use built in) or by running Spark Streaming as is: https://docs.qubole.com/en/latest/user-guide/data-engineering/quest/quest-intro.html
best,
minesh
We are encountering a problem with a Spark 1.6 job (on YARN) that never ends when several jobs are launched simultaneously.
We found that when we launch the Spark job in yarn-client mode we do not have this problem, unlike when launching it in yarn-cluster mode.
That could be a lead for finding the cause.
We changed the code to add a sparkContext.stop().
Indeed, the SparkContext was created (val sparkContext = createSparkContext) but never stopped. This change reduced the number of jobs that remain blocked, but we still have some blocked jobs.
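For reference, the shape of the change was roughly the following (a simplified sketch, not our exact code; createSparkContext is our own helper, stood in for here by a direct SparkContext construction):

    import org.apache.spark.{SparkConf, SparkContext}

    val sparkContext = new SparkContext(new SparkConf().setAppName("my-job")) // stands in for createSparkContext
    try {
      // ... run the actual job here ...
    } finally {
      // always stop the context, even when the job throws,
      // so the YARN application can reach a terminal state
      sparkContext.stop()
    }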
By analyzing the logs, we found the following messages repeating endlessly:
17/09/29 11:04:37 DEBUG SparkEventPublisher: Enqueue SparkListenerExecutorMetricsUpdate(1,WrappedArray())
17/09/29 11:04:41 DEBUG ApplicationMaster: Sending progress
17/09/29 11:04:41 DEBUG ApplicationMaster: Number of pending allocations is 0. Sleeping for 5000.
It seems that the job blocks when we call newAPIHadoopRDD to get data from HBase; that may be the issue (a sketch of the call is below).
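The call in question looks roughly like this (a simplified sketch using the standard TableInputFormat pattern rather than our exact production code; the table name is a placeholder and sparkContext is the context created above):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name

    // the scan of the HBase table as an RDD, which appears to hang
    val rdd = sparkContext.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(rdd.count()) // action that forces the scan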
Does anyone have an idea about this issue?
Thank you in advance
I am using the Spark core API (not Streaming, SQL, etc.).
I often see this kind of error in the Spark logs:
Spark environment: 1.3.1 yarn-client
ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
Who triggers the SIGTERM: YARN, Spark, or me?
Will this signal terminate the Spark executor? If not, how will it affect the Spark program?
I did press Ctrl+C, but that would be SIGINT. If YARN killed the executor, that would be SIGKILL.
You will likely find the reason in the YARN logs. If you have log aggregation enabled, you can run
yarn logs -applicationId [app_id]
and look for exceptions.