Spark Structured Streaming with EFS is causing delay in job - apache-spark

I'm using Spark Structured Streaming 2.4.4 with Spark on Kubernetes. When I enable checkpointing to a local /tmp/ folder, jobs finish in 7-8 s. When EFS is mounted and the checkpoint location is placed on it instead, jobs take more than 5 minutes and are quite unstable.
Please find the screenshot from the Spark SQL tab.
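For reference, a minimal sketch of how a checkpoint location is set in Structured Streaming; the rate/console source and sink and the /mnt/efs/checkpoints path are assumptions, not taken from the original job. Every micro-batch commits offsets and state to this directory, so checkpoint I/O latency sits on the critical path of each batch.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("efs-checkpoint-demo").getOrCreate()

# Placeholder source; only the checkpointLocation option matters here.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    stream.writeStream
    .format("console")
    # Switching this path between a local /tmp/checkpoints directory and an
    # EFS mount such as /mnt/efs/checkpoints is what changes the batch latency.
    .option("checkpointLocation", "/mnt/efs/checkpoints")
    .start()
)
query.awaitTermination()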

Related

AWS EMR step in RUNNING state even when job has been completed

I am running a Spark job that partitions data on two columns as an EMR step. The Spark job has spark.sql.sources.partitionOverwriteMode set to dynamic and SaveMode set to overwrite.
I can see that the Spark job has finished execution by looking at the Spark UI, but the EMR step stays in RUNNING state for more than an hour. I can also see the _SUCCESS file in the root directory, with a timestamp in line with the Spark job's completion.
Any idea why the EMR step isn't completing, or best practices to speed up the process?
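For reference, a minimal sketch of the kind of write described above; the input/output paths and partition column names are assumptions for illustration:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-partition-overwrite")
    # Overwrite only the partitions touched by this write, not the whole table.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

df = spark.read.parquet("s3://my-bucket/input/")  # hypothetical input path

(
    df.write
    .mode("overwrite")                  # SaveMode.Overwrite
    .partitionBy("col_a", "col_b")      # hypothetical partition columns
    .parquet("s3://my-bucket/output/")  # hypothetical output path
)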

Spark RDD S3 saveAsTextFile taking long time

I have a Spark Streaming job on EMR which runs in 30-minute batches, processes the data, and finally writes the output to several different files in S3. The output step to S3 is now taking too long (about 30 minutes) to write the files. On investigating further, I found that the majority of the time is spent after all tasks have written the data to a temporary folder (which happens within 20 s); the rest of the time is taken by the master node moving the S3 files from the _temporary folder to the destination folder and renaming them, etc. (Similar to: Spark: long delay between jobs)
Some other details on the job configuration, file format, etc. are as below:
EMR version: emr-5.22.0
Hadoop version: Amazon 2.8.5
Applications: Hive 2.3.4, Spark 2.4.0, Ganglia 3.7.2
S3 files: written using the RDD saveAsTextFile API with an S3A URL; the S3 file format is text
Although the EMRFS output committer is enabled by default for the job, it does not take effect, since we are using RDDs and the text file format, which are only supported from EMR 6.4.0 onwards. One way I can think of to optimize the time taken by the S3 save is to upgrade the EMR version, convert the RDDs to DataFrames/Datasets, and use their APIs instead of saveAsTextFile. Is there any other, simpler solution to optimize the time taken by the job?
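If upgrading and switching to the DataFrame API is the chosen route, the code change itself is small; a hedged sketch, with hypothetical paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-df-text-write").getOrCreate()

# Previously: rdd.saveAsTextFile("s3a://my-bucket/output/")
rdd = spark.sparkContext.textFile("s3a://my-bucket/input/")  # hypothetical input

# Convert to a single string column so the DataFrameWriter text source
# (and whatever committer it is configured with) handles the write.
df = rdd.map(lambda line: (line,)).toDF(["value"])

df.write.mode("overwrite").text("s3a://my-bucket/output/")  # hypothetical output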
Unless you use an S3-specific committer, your jobs will not only be slow, they will be incorrect in the presence of failures. As this may matter to you, it is good that the slow job commits are providing an early warning of problems even before worker failures result in invalid output.
Options:
Upgrade. The committers were added for a reason.
Use a real cluster filesystem (e.g. HDFS) as the output, then upload afterwards.
The S3A zero-rename committers do work with saveAsTextFile, but they aren't supported by AWS, and the ASF developers don't test on EMR since it is Amazon's own fork. You might be able to get whatever S3A connector Amazon ships to work, but you'd be on your own if it didn't.
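For the open-source route on a cluster with Hadoop 3.1+ and the spark-hadoop-cloud module on the classpath, the S3A committers are typically wired up with settings along these lines; this is a hedged sketch of commonly used keys, not an EMR-specific recipe:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-committer-sketch")
    # Pick one of the zero-rename committers: "directory", "partitioned" or "magic".
    .config("spark.hadoop.fs.s3a.committer.name", "directory")
    # Route Hadoop output committers for s3a:// paths through the S3A committer
    # factory; this is the path RDD APIs such as saveAsTextFile go through.
    .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
            "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
    # For DataFrame/Dataset writes, bind Spark's commit protocol to the Hadoop
    # committer (classes from the spark-hadoop-cloud module).
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)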

Spark checkpointing has a lot of tmp.crc files

I am using Spark Structured Streaming, where I read a stream from Kafka and, after some transformations, write the resulting stream back to Kafka.
I see a lot of hidden ..*tmp.crc files in my checkpoint directory. These files are never cleaned up and keep growing in number.
Am I missing some configuration?
I am not running Spark on Hadoop; I am using an EBS-based volume for checkpointing.
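For context, a minimal sketch of the Kafka-to-Kafka pipeline described above; the broker address, topic names, transformation, and checkpoint path are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import upper, col

spark = SparkSession.builder.appName("kafka-to-kafka").getOrCreate()

source = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical brokers
    .option("subscribe", "input-topic")                # hypothetical topic
    .load()
)

# Example transformation; the Kafka sink needs a string or binary "value" column.
transformed = source.select(upper(col("value").cast("string")).alias("value"))

query = (
    transformed.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "output-topic")  # hypothetical topic
    # The checkpoint directory (here on the EBS volume) is where the offset,
    # commit, and state files -- and their .crc side files -- accumulate.
    .option("checkpointLocation", "/mnt/ebs/checkpoints/kafka-to-kafka")
    .start()
)
query.awaitTermination()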

Use GCS staging directory for Spark jobs (on Dataproc)

I'm trying to change the Spark staging directory to prevent the loss of data on worker decommissioning (on Google Dataproc with Spark 2.4).
I want to switch from HDFS staging to Google Cloud Storage staging.
When I run this command:
spark-submit --conf "spark.yarn.stagingDir=gs://my-bucket/my-staging/" gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py
I get this error:
org.apache.spark.SparkException: Application application_1560413919313_0056 failed 2 times due to AM Container for appattempt_1560413919313_0056_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: [2019-06-20 07:58:04.462]File not found : gs:/my-staging/.sparkStaging/application_1560413919313_0056/pyspark.zip
java.io.FileNotFoundException: File not found : gs:/my-staging/.sparkStaging/application_1560413919313_0056/pyspark.zip
The Spark job fails, but the .sparkStaging/ directory is created on GCS.
Any idea on this issue?
Thanks.
First, it's important to realize that the staging directory is primarily used for staging artifacts for executors (primarily jars and other archives) rather than for storing intermediate data as a job executes. If you want to preserve intermediate job data (primarily shuffle data) following worker decommissioning (e.g., after machine preemption or scale down), then Dataproc Enhanced Flexibility Mode (currently in alpha) may help you.
Your command works for me on both Dataproc image versions 1.3 and 1.4. Make sure that your target staging bucket exists and that the Dataproc cluster (i.e., the service account that the cluster runs as) has read and write access to the bucket. Note that the GCS connector will not create buckets for you.

Failure recovery in spark running on HDinsight

I was trying to get Apache Spark running on Azure HDInsight by following the steps from http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-spark-install/
I was wondering whether I have to manage master/slave failure recovery myself, or whether HDInsight takes care of it.
I'm also working on Spark Streaming applications on Azure HDInsight. Inside a Spark job, Spark and YARN provide some fault tolerance for the master and the slaves.
But sometimes the driver and workers can also crash due to user-code errors, Spark internal issues, or Azure HDInsight issues. So we need to run our own monitoring/daemon process and handle the recovery ourselves.
For streaming scenarios it's even harder: since a Spark Streaming job needs to keep running 24/7, the concern is how to make the job recover from machine reboots and reimages.
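One commonly used building block for this kind of recovery (a hedged sketch, not specific to HDInsight) is driver checkpointing with StreamingContext.getOrCreate, combined with letting YARN restart the application master (e.g. via spark.yarn.maxAppAttempts); the checkpoint path, source, and batch interval below are assumptions:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "wasb:///checkpoints/my-streaming-app"  # hypothetical path

def create_context():
    # Runs only when no usable checkpoint exists; otherwise the job's state
    # and DStream lineage are restored from CHECKPOINT_DIR instead.
    conf = SparkConf().setAppName("recoverable-streaming-app")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, batchDuration=30)  # assumed 30 s batches
    lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
    lines.count().pprint()
    ssc.checkpoint(CHECKPOINT_DIR)
    return ssc

# Restore the driver from the checkpoint after a restart, or build a fresh
# context on the first run.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()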
