Structured Streaming is failing but Dataproc continues with Running status

We are migrating two Spark jobs that use Structured Streaming from on-prem to GCP.
One of them streams messages from Kafka and saves them to GCS. The other streams from GCS and saves to BigQuery.
Sometimes these jobs fail for various reasons, for example OutOfMemoryError, Connection reset by peer, or Java heap space.
When we get an exception in the on-prem environment, YARN marks the job as FAILED and our scheduler flow restarts the job.
In GCP we developed the same flow, which restarts the job when it fails. But when we get an exception in Dataproc, YARN marks the job as SUCCEEDED and Dataproc keeps the status RUNNING.
In this image you can see the log with a StreamingQueryException while the job status is Running ("Em execução" means "running" in Portuguese).
Dataproc job

Related

Is AWS EMR suited for an HA spark direct streaming application

I am trying to run an Apache Spark direct streaming application on AWS EMR.
The application receives data from and sends data to AWS Kinesis and needs to be running the whole time.
Of course, if a core node is killed, it stops. But it should self-heal when the core node is replaced.
Now I noticed: when I kill one of the core nodes (simulating a problem), it is replaced by AWS EMR. But the application stops working (no output is sent to Kinesis anymore) and it does not resume unless I restart it.
What I get in the logs is:
ERROR YarnClusterScheduler: Lost executor 1 on ip-10-1-10-100.eu-central-1.compute.internal: Slave lost
Which is expected. But then I get:
20/11/02 13:15:32 WARN TaskSetManager: Lost task 193.1 in stage 373.0 (TID 37583, ip-10-1-10-225.eu-central-1.compute.internal, executor 2): FetchFailed(null, shuffleId=186, mapIndex=-1, mapId=-1, reduceId=193, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 186
These are only warnings, yet the application does not produce any output anymore.
So I stop the application and start it again. Now it produces output again.
My question: Is AWS EMR suited for a self-healing application, like the one I need? Or am I using the wrong tool?
If yes, how do I get my spark application to continue after a core node is replaced?

It's recommended to use On-Demand instances for CORE nodes and, at the same time, to use TASK nodes to leverage Spot instances.
Have a look at:
Amazon EMR - What is the need of Task nodes when we have Core nodes?
AWS documentation on Master, Core, and Task Nodes

How to debug why pending stage is in the unknown state?

I have a Spark batch job which reads some JSON files, writes them to Hive, then queries some other Hive tables, does some computation, and writes the output in ORC format back to Hive.
What I experience is that the job gets stuck with one stage in the pending state.
The DAG looks as follows:
I'm using Hadoop 2.7.3.2.6.5.0-292 and Spark is running on YARN.
I looked at the YARN logs and the Spark event logs, but do not see an issue.
Simply rerunning the job results in the same behavior.
The question is: what does the unknown state of a stage mean, and how do I debug why the job is in it?

Using airflow to run spark streaming jobs?

In our Hadoop cluster we have Spark batch jobs and Spark streaming jobs.
We would like to schedule and manage them both on the same platform.
We came across Airflow, which fits our need for a "platform to author, schedule, and monitor workflows".
I just want to be able to stop and start a Spark streaming job. Using Airflow graphs and profiling is less of an issue.
My question is:
Besides losing some functionality (graphs, profiling), why shouldn't I use Airflow to run Spark streaming jobs?
I came across this question:
Can airflow be used to run a never ending task?
which says it's possible, but not why you shouldn't.

@mMorozonv's answer looks good. You could have one DAG start the stream if it does not exist, then a second DAG act as a health checker to track its progress. If the health check fails, you could trigger the first DAG again.
Alternatively, you can run the stream with a "once" trigger [1].
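# Load your streaming DataFrame
sdf = spark.readStream.load(path="data/", format="json", schema=my_schema)
# Perform transformations and then write…
# The file sink needs a checkpoint location; "/out/checkpoint" is a placeholder path.
sdf.writeStream.trigger(once=True).option("checkpointLocation", "/out/checkpoint").start(path="/out/path", format="parquet")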
This gives you the same benefits as Spark streaming, with the flexibility of batch processing.
You can simply point the stream at your data, and this job will detect all the new files since the last iteration (using checkpointing), run a streaming batch, then terminate. You could set your Airflow DAG's schedule to suit whatever lag you'd like to process data at (every minute, hour, etc.).
I wouldn't recommend this for low-latency requirements, but it's very suitable to be run every minute.
[1] https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html

Using Airflow's branching functionality, we can have a single DAG that both schedules and monitors our streaming job. The DAG checks the status of the application and, if the application is not running, submits the streaming job. Otherwise the DAG run can simply finish, or you can add a sensor that checks the streaming job status after some time, with alerts and whatever else you need.
There are two main problems:
1. Submitting the streaming application without waiting until it finishes. Otherwise our operator will run until it reaches execution_timeout.
That problem can be solved by submitting the streaming job in cluster mode with the spark.yarn.submit.waitAppCompletion configuration parameter set to false.
2. Checking the status of our streaming application.
We can check the streaming application status using YARN. For example, we can use the command yarn application -list -appStates RUNNING. If our application is in the list of running applications, we should not trigger our streaming job. The only requirement is to make the streaming job name unique.
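The two steps above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a tested Airflow operator: the helper names (build_submit_command, running_app_names, is_app_running), the application name, and the script path are hypothetical, and the parser assumes the tab-separated table that yarn application -list usually prints, which may differ between Hadoop versions.

```python
import subprocess


def build_submit_command(app_name, script_path):
    """Build a spark-submit command that returns right after submission."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--name", app_name,
        # Do not block until the (never-ending) streaming application finishes.
        "--conf", "spark.yarn.submit.waitAppCompletion=false",
        script_path,
    ]


def running_app_names(yarn_list_output):
    """Extract application names from `yarn application -list` output.

    Assumes tab-separated rows whose first column is the application ID
    and whose second column is the application name.
    """
    names = set()
    for line in yarn_list_output.splitlines():
        stripped = line.strip()
        if not stripped.startswith("application_"):
            continue  # skip the summary line and the column header
        fields = [field.strip() for field in stripped.split("\t")]
        if len(fields) > 1:
            names.add(fields[1])
    return names


def is_app_running(app_name):
    """Ask YARN whether an application with this (unique) name is RUNNING."""
    result = subprocess.run(
        ["yarn", "application", "-list", "-appStates", "RUNNING"],
        capture_output=True, text=True, check=True,
    )
    return app_name in running_app_names(result.stdout)
```

A branching task could call is_app_running and follow the "submit" branch only when it returns False, which is why the application name has to be unique.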

There are no strict reasons why you shouldn't use Airflow to run a Spark Streaming job. In fact, you can monitor your process by periodically logging some metrics with
LOG.info(query.lastProgress)
LOG.info(query.status)
and see them in the task log.
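As a rough illustration of that periodic logging, the loop below polls a running query and writes both values to the log. It is a sketch, assuming query is a PySpark StreamingQuery (it only relies on its isActive, lastProgress, status, and exception() members); the function name monitor, the logger name, and the polling interval are hypothetical. Raising at the end is an addition beyond the two log lines above, so that a failed query also fails the calling task.

```python
import logging
import time

LOG = logging.getLogger("stream_monitor")


def monitor(query, interval_seconds=60, sleep=time.sleep):
    """Log the progress of a streaming query until it stops."""
    while query.isActive:
        LOG.info("lastProgress: %s", query.lastProgress)
        LOG.info("status: %s", query.status)
        sleep(interval_seconds)
    # exception() returns the failure if the query died, otherwise None.
    error = query.exception()
    if error is not None:
        raise RuntimeError(f"streaming query failed: {error}")
```

It would typically be called right after start(), for example monitor(sdf.writeStream.start(...)), in place of a bare awaitTermination().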

What is 'Active Jobs' in Spark History Server Spark UI Jobs section

I'm trying to understand the Spark History Server components.
I know that the History Server shows completed Spark applications.
Nonetheless, I see 'Active Jobs' set to 1 for a completed Spark application. I'm trying to understand what 'Active Jobs' means in the Jobs section.
Also, the application completed within 30 minutes, but when I opened the History Server after 8 hours, 'Duration' showed 8.0h.
Please see the screenshot.
Could you please help me understand the 'Active Jobs', 'Duration' and 'Stages: Succeeded/Total' items in the above image?

Finally, after some research, I found the answer to my question.
A Spark application consists of a driver and one or more executors. The driver program instantiates the SparkContext, which coordinates the executors to run the Spark application. This information is displayed in the 'Active Jobs' section of the Spark History Server web UI.
The executors run tasks assigned by the driver.
When a Spark application runs on YARN, it has its own implementation of the YARN client and the YARN application master.
A YARN application has a YARN client, a YARN application master, and a list of containers running on the node managers.
In my case YARN is running in standalone mode, so the driver program runs as a thread of the YARN application master. The YARN client pulls the status from the application master, and the application master coordinates the containers to run the tasks.
While it is running, the job can be monitored on the YARN applications page in the Cloudera Manager Admin Console.
If the application succeeds, the History Server shows the list of 'Completed Jobs' and the 'Active Jobs' section is removed.
If the application fails at the container level and YARN communicates this to the driver, the History Server shows the list of 'Failed Jobs' and the 'Active Jobs' section is removed.
However, if the application fails at the container level and YARN cannot communicate that to the driver, the job instantiated by the driver is left in limbo. The driver thinks the job is still running and keeps waiting to hear the job status from the YARN application master. Hence the History Server still shows it under 'Active Jobs' as running.
So my takeaway from this is:
To check the status of a running job, go to the YARN applications page in the Cloudera Manager Admin Console or use the YARN CLI.
After the job completes or fails, open the Spark History Server to get more details on resource usage, the DAG, and the execution timeline.
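For the YARN CLI route, a small wrapper along these lines could read the state that YARN itself reports instead of relying on the History Server. This is a sketch: the helper names are hypothetical, and the parser assumes the "Key : Value" report layout that yarn application -status typically prints, so the field names should be checked against the output of your Hadoop version.

```python
import subprocess


def parse_status_report(report):
    """Return (state, final_state) from `yarn application -status` output."""
    fields = {}
    for line in report.splitlines():
        # Split on the first colon only, so URLs in values stay intact.
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return fields.get("State"), fields.get("Final-State")


def yarn_status(application_id):
    """Run the YARN CLI for one application and parse its report."""
    result = subprocess.run(
        ["yarn", "application", "-status", application_id],
        capture_output=True, text=True, check=True,
    )
    return parse_status_report(result.stdout)
```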

Invoking an action (count is the action in your case) inside a Spark application triggers the launch of a job to fulfill it. Spark examines the dataset on which that action depends and formulates an execution plan. The execution plan assembles the dataset transformations into stages.
A stage is a physical unit of the execution plan. In short, a stage is a set of parallel tasks, i.e. one task per partition. Each job is divided into smaller sets of tasks, and each such set is a stage. The stages depend on each other, and they are somewhat similar to the map and reduce stages in MapReduce.
Each type of Spark stage in detail:
a. ShuffleMapStage in Spark
A ShuffleMapStage is an intermediate Spark stage in the physical execution of the DAG.
It produces data for one or more other stages.
In other words, a ShuffleMapStage serves as input for the following Spark stages in the DAG of stages.
A ShuffleMapStage can contain any number of pipelined operations, such as map and filter, before the shuffle operation. Furthermore, a single ShuffleMapStage can be shared among different jobs.
b. ResultStage in Spark
A ResultStage is the stage that executes a Spark action in a user program by running a function on an RDD. It is the final stage of a job: it applies a function to one or more partitions of the target RDD in order to compute the result of the action.
Coming back to the question of active jobs in the History Server, there are some notes listed in the official docs for the History Server. There is also a JIRA issue, [SPARK-7889], about the same thing.
For more details, follow the link:
source-1

How to tell if your spark job is waiting for resources

I have a PySpark job that I submit to a standalone Spark cluster. This is an auto-scaling cluster on EC2 boxes, so when jobs are submitted and not enough nodes are available, a few more boxes spin up and become available after a few minutes.
We have a @timeout decorator on the main part of the Spark job to time out and raise an error when it has exceeded a certain time threshold (put in place because some jobs were hanging). The issue is that sometimes a job has not actually started because it is waiting on resources, yet the @timeout function is evaluated and the job errors out as a result.
So I'm wondering if there is any way to tell from within the application itself, with code, whether the job is waiting for resources?

To know the status of the application, you need to access the Spark History Server, from which you can get the current status of the job.
You can solve your problem as follows:
Get the application ID of your job through sc.applicationId.
Then use this application ID with the Spark History Server REST API to get the status of the submitted job.
The Spark History Server REST API is described in the Spark monitoring documentation.
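Those steps might look roughly like this in Python. It is a sketch under a few assumptions: the History Server base URL (port 18080 is the usual default) is a placeholder for your own, the helper names are hypothetical, and the check relies on the attempts list with a completed flag that /api/v1/applications/[app-id] returns.

```python
import json
from urllib.request import urlopen


def application_url(base_url, app_id):
    """Build the REST endpoint for a single application."""
    return f"{base_url.rstrip('/')}/api/v1/applications/{app_id}"


def is_completed(app_info):
    """True when every attempt listed for the application is marked completed."""
    attempts = app_info.get("attempts", [])
    return bool(attempts) and all(a.get("completed", False) for a in attempts)


def fetch_application(base_url, app_id, timeout=10):
    """Fetch the application's JSON description from the REST API."""
    with urlopen(application_url(base_url, app_id), timeout=timeout) as response:
        return json.load(response)
```

For example, is_completed(fetch_application("http://<history-server>:18080", sc.applicationId)) would report whether the application has finished; the same /api/v1 endpoints are also served by the UI of a running application.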
