I have a Spark Structured Streaming application where every batch gets processed in a few seconds. Right now, the current batch is stuck with all tasks in RUNNING status for more than an hour.
How can I specify a task-level timeout so that Spark retries a task if it does not complete within a defined time?
Related
I am running a spark job which partitions data on two columns as an EMR step. The spark job has spark.sql.sources.partitionOverwriteMode set to dynamic and SaveMode as overwrite.
I can see the Spark job has finished execution by looking at the Spark UI but the EMR step continues to be in RUNNING state for more than an hour. I can also see _SUCCESS file in the root directory with timestamp in line with the spark job completion.
Any idea why the EMR step isn't completing or best practices to speed up the process?
I have a Structured Streaming job which reads from Kafka, performs aggregations and writes to HDFS. The job runs in cluster mode on YARN. I am using Spark 2.4.
Every 2-3 days this job gets stuck. It doesn't fail, but it hangs at some microbatch, and the microbatch doesn't even start. The driver keeps printing the following log line repeatedly for hours.
Got an error when resolving hostNames. Falling back to /default-rack for all.
When I kill the streaming job and start it again, it runs fine again.
How can I fix this?
See this issue https://issues.apache.org/jira/browse/SPARK-28005
This is fixed in Spark 3.0. It seems to happen when there are no active executors.
In our Hadoop cluster we have Spark batch jobs and Spark Streaming jobs.
We would like to schedule and manage them both on the same platform.
We came across Airflow, which fits our need for a "platform to author, schedule, and monitor workflows".
I just want to be able to stop and start a Spark Streaming job. Using Airflow's graphs and profiling is less of an issue.
My question is:
Besides losing some functionality (graphs, profiling), why shouldn't I use Airflow to run Spark Streaming jobs?
I came across this question :
Can airflow be used to run a never ending task?
which says it's possible, but not why you shouldn't.
@mMorozonv's answer looks good. You could have one DAG start the stream if it does not exist, then a second DAG as a health checker to track its progress. If the health check fails, you could trigger the first DAG again.
Alternatively you can run the stream with a trigger interval of once[1].
# Load your Streaming DataFrame
sdf = spark.readStream.load(path="data/", format="json", schema=my_schema)
# Perform transformations and then write…
sdf.writeStream.trigger(once=True).start(path="/out/path", format="parquet")
This gives you the same benefits as Spark streaming, with the flexibility of batch processing.
You can simply point the stream at your data, and the job will detect all the new files since the last iteration (using checkpointing), run a streaming batch, then terminate. You could set your Airflow DAG's schedule to suit whatever lag you'd like to process data at (every minute, hour, etc.).
I wouldn't recommend this for low-latency requirements, but it's very suitable to be run every minute.
[1] https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
Using Airflow's branching functionality, we can have one DAG that both schedules and monitors our streaming job. The DAG checks the status of the application and, if it is not running, submits the streaming job. Otherwise the DAG run can simply finish, or you can add a sensor that checks the streaming job status after some time, with alerts and anything else you need.
There are two main problems:
1. Submitting the streaming application without waiting for it to finish. Otherwise our operator will run until it reaches execution_timeout.
That problem can be solved by submitting our streaming job in cluster mode with the spark.yarn.submit.waitAppCompletion configuration parameter set to false.
2. Checking the status of our streaming application.
We can check the streaming application status using YARN. For example, we can use the command yarn application -list -appStates RUNNING. If our application is in the list of running applications, we should not trigger our streaming job. The only requirement is that the streaming job name be unique.
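The two steps above can be sketched in Python roughly as follows. This is only an illustration, not the answerer's actual code: the function names, the app name and the script path are hypothetical, and the parser assumes that `yarn application -list` prints tab-separated rows with the application id in the first column and the application name in the second.

```python
import subprocess


def build_submit_command(app_name, script_path):
    """spark-submit command that returns as soon as YARN accepts the app."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--name", app_name,  # must be unique so the status check is reliable
        "--conf", "spark.yarn.submit.waitAppCompletion=false",
        script_path,
    ]


def is_app_running(yarn_list_output, app_name):
    """Return True if app_name appears in the name column of the listing."""
    for line in yarn_list_output.splitlines():
        # Assumption: columns are tab-separated, possibly padded with spaces.
        columns = [c.strip() for c in line.split("\t")]
        # Data rows start with an application id; header lines do not.
        if len(columns) > 1 and columns[0].startswith("application_"):
            if columns[1] == app_name:
                return True
    return False


def ensure_streaming_job(app_name, script_path):
    """Submit the job only when YARN does not already list it as RUNNING."""
    listing = subprocess.run(
        ["yarn", "application", "-list", "-appStates", "RUNNING"],
        capture_output=True, text=True, check=True,
    ).stdout
    if is_app_running(listing, app_name):
        return "already_running"
    subprocess.run(build_submit_command(app_name, script_path), check=True)
    return "submitted"
```

A function like ensure_streaming_job could be called from the Airflow task that decides which branch to take; the returned string is just a convenient label for that decision.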
There are no strict reasons why you shouldn't use Airflow to run a Spark Streaming job. In fact, you can monitor your process by periodically logging some metrics with
LOG.info(query.lastProgress)
LOG.info(query.status)
and see them in the task log.
I have a pyspark job that I submit to a standalone spark cluster - this is an auto scaling cluster on ec2 boxes so when jobs are submitted and not enough nodes are available, after a few minutes a few more boxes spin up and become available.
We have a @timeout decorator on the main part of the Spark job to time out and raise an error when it exceeds a certain time threshold (put in place because some jobs were hanging). The issue is that sometimes a job has not actually started because it's waiting on resources, yet the @timeout function is still evaluated and the job errors out as a result.
So I'm wondering if there's any way to tell from within the application itself, with code, if the job is waiting for resources?
To know the status of the application, you need to access the Spark History Server, from which you can get the current status of the job.
You can solve your problem as follows:
Get Application ID of your job through sc.applicationId.
Then use this application Id with Spark History Server REST APIs to get the status of the submitted job.
You can find the Spark History Server Rest APIs at link.
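A minimal sketch of those two steps, using only the Python standard library. The helper names are hypothetical, and the base URL in the usage note is an assumption (18080 is the default History Server port; your deployment may differ). It relies on the /api/v1/applications/[app-id] endpoint of Spark's monitoring REST API, whose response contains an "attempts" list with a "completed" flag.

```python
import json
from urllib.request import urlopen


def application_url(base_url, app_id):
    """Build the REST endpoint for one application."""
    return f"{base_url.rstrip('/')}/api/v1/applications/{app_id}"


def is_completed(app_info):
    """True when every attempt in the REST response is marked completed."""
    attempts = app_info.get("attempts", [])
    return bool(attempts) and all(a.get("completed", False) for a in attempts)


def fetch_application(base_url, app_id, timeout=10):
    """Fetch and decode the application's JSON description."""
    with urlopen(application_url(base_url, app_id), timeout=timeout) as response:
        return json.load(response)
```

Inside the job this might be used as is_completed(fetch_application("http://history-host:18080", sc.applicationId)), where history-host is a placeholder for your own server.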
We are currently working on a system using Kafka, Spark Streaming, and Cassandra as the DB. We are using checkpointing based on the content here [http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing]. Inside the function used to create the StreamingContext, we use createDirectStream to create our DStream, and from this point we execute several transformations and actions that end in saveToCassandra calls on different RDDs.
We are running different tests to establish how the application should recover when a failure occurs. Some key points about our scenario are:
We are testing with a fixed number of records in Kafka (between 10 million and 20 million), which means we consume from Kafka once and the application brings in all the records.
We are executing the application with --deploy-mode 'client' inside one of the workers, which means that we stop and start the driver manually.
We are not sure how to handle exceptions after the DStreams are created. For example, if all Cassandra nodes are dead while writing, we get an exception that aborts the job, but after re-submitting the application that job is not re-scheduled, and the application keeps consuming from Kafka, making multiple 'isEmpty' calls.
We ran a couple of tests using 'cache' on the repartitioned RDD (which didn't work after any failure other than just stopping and starting the driver), and changing the parameters "query.retry.count", "query.retry.delay" and "spark.task.maxFailures", without success: the job is still aborted after x failures.
At this point we are confused about how we should use the checkpoint to re-schedule jobs after a failure.