I have a Spark Streaming job on EMR that runs in 30-minute batches, processes the data, and finally writes the output to several different files in S3. The output step to S3 is now taking too long (about 30 minutes). On investigating further, I found that the tasks themselves finish writing the data to a temporary folder within about 20 seconds, and the rest of the time is spent by the master node moving the S3 files from the _temporary folder to the destination folder and renaming them, etc. (Similar to: Spark: long delay between jobs)
Some other details on the job configurations, file format etc are as below:
EMR version: emr-5.22.0
Hadoop version: Amazon 2.8.5
Applications: Hive 2.3.4, Spark 2.4.0, Ganglia 3.7.2
S3 files: written using the RDD saveAsTextFile API with an S3A URL; the S3 file format is plain text
Now, although the EMRFS output committer is enabled by default for the job, it is not taking effect because we are using RDDs and the text file format, which are supported only from EMR 6.4.0 onwards. One way I can think of to optimize the time taken by the S3 save is to upgrade the EMR version, convert the RDDs to DataFrames/Datasets, and use their write APIs instead of saveAsTextFile. Is there any other, simpler solution to optimize the time taken by the job?
Unless you use an S3-specific committer, your jobs will not only be slow, they will be incorrect in the presence of failures. As this may matter to you, it is good that the slow job commits are providing an early warning of problems even before worker failures result in invalid output.
Options:
Upgrade. The committers were added for a reason.
Use a real cluster filesystem (e.g. HDFS) as the output, then upload afterwards.
The S3A zero-rename committers do work with saveAsTextFile, but they aren't supported by AWS, and the ASF developers don't test on EMR, as its S3 connector is Amazon's own fork. You might be able to get the S3A connector Amazon ships to work with them, but you'd be on your own if it didn't.
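For reference, on a home-rolled Hadoop 3.1+/Spark stack (not stock EMR), enabling the S3A "magic" committer involves settings along these lines. This is a hedged sketch: exact property names and the required spark-hadoop-cloud classes can vary by Hadoop/Spark release, so check your version's docs.

```
# spark-defaults.conf style fragment (illustrative)
spark.hadoop.fs.s3a.committer.name                magic
spark.hadoop.fs.s3a.committer.magic.enabled       true
spark.sql.sources.commitProtocolClass             org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class          org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```

The key point is that the magic committer uploads task output as multipart uploads that are only completed at job commit, so no S3 "rename" (copy + delete) ever happens.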
Related
I'm currently playing around with Spark on an EMR cluster. I noticed that if I perform the reads/writes into/out of my cluster in the Spark script itself, there is an absurd wait time for my output data to show up in the S3 console, even for relatively lightweight files. Would this write be expedited by writing to HDFS in my Spark script and then adding an additional step to transfer the output from HDFS -> S3 using s3-dist-cp?
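Writing to HDFS and then copying with s3-dist-cp is a common pattern, since the rename in the commit phase is cheap on HDFS. A minimal sketch of building the add-steps JSON for such a copy step; the bucket and paths are placeholders, and `build_s3distcp_step` is a name of my own invention (command-runner.jar is the standard EMR wrapper for running commands as cluster steps):

```python
import json

def build_s3distcp_step(src_hdfs, dest_s3, name="Copy output to S3"):
    """Build one EMR step definition that runs s3-dist-cp from HDFS to S3."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src_hdfs, "--dest", dest_s3],
        },
    }

# Write a one-step list, then submit with:
#   aws emr add-steps --cluster-id <id> --steps file://steps.json
step = build_s3distcp_step("hdfs:///output/batch-42", "s3://my-bucket/output/batch-42")
print(json.dumps([step], indent=2))
```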
I have a use case where a Spark application runs on one Spark version and publishes its event data to S3, and the history server is then started from the same S3 path but with a different Spark version. Will this cause any problems?
No, it will not cause any problem as long as you can read from the S3 bucket using that specific format. Spark versions are mostly compatible; as long as you can figure out how to work in the specific version, you're good.
EDIT:
Spark will write to the S3 bucket in the data format that you specify. For example, on a PC, if you create a txt file, any computer can open that file. Similarly, on S3, once you've created a Parquet file, any Spark version can open it; just the API may be different.
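One way to sanity-check this at the file level: a Parquet file always starts and ends with the 4-byte magic `PAR1`, regardless of which Spark version wrote it, which is why any Parquet reader can pick it up. A minimal sketch (the helper name is my own; this only verifies the container format, not the schema):

```python
def looks_like_parquet(path):
    """Check for the PAR1 magic bytes at both ends of the file."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # seek to 4 bytes before end of file
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```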
I have several Spark jobs that write data to and read data from S3. Occasionally (about once per week for approximately 3 hours), the Spark jobs will fail with the following exception:
org.apache.spark.sql.AnalysisException: Path does not exist.
I've uncovered that this is likely due to the consistency model in S3, where list operations are eventually consistent. S3 Guard claims to solve this issue, but I'm in a Spark environment that doesn't support that utility.
Has anyone else run into this issue and figured out a reasonable approach for dealing with it?
If you are using AWS EMR, they offer consistent EMRFS (the EMRFS consistent view).
If you are using Databricks, they offer a consistency mechanism in their transactional I/O.
Both HDP and CDH ship with S3Guard.
If you are running your own home-rolled Spark stack, move to Hadoop 2.9+ to get S3Guard; even better, Hadoop 3.1 for the zero-rename S3A committer.
Otherwise: don't use S3 as your direct destination of work.
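If none of those options are available, a stopgap some teams use is retrying the read with backoff until the path becomes visible; this papers over the eventual consistency rather than fixing it, and does nothing for the failure-correctness issues above. A minimal sketch under assumptions: `load_path` stands in for whatever call does the read, and the exception type (here `FileNotFoundError`) would really be whatever your stack raises for a missing path:

```python
import time

def read_with_retry(load_path, path, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry an eventually-consistent read with exponential backoff.

    load_path: callable that reads `path` and raises FileNotFoundError
    (a stand-in for e.g. AnalysisException: Path does not exist) when
    the listing hasn't converged yet.
    """
    for attempt in range(attempts):
        try:
            return load_path(path)
        except FileNotFoundError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```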
I have a single file located on S3 that I want to process with Spark using multiple nodes. How does Spark implement that under the hood? Does each worker node read a portion of the data from S3 (using byte-range requests)? I'm trying to understand the differences between running Spark on HDFS and on S3 in terms of parallel processing. Does it matter when I use EMR?
How does Spark implement that under the hood?
There are many public articles, like this, explaining how Spark works.
I'm trying to understand the differences between running Spark on HDFS and on S3 in terms of parallel processing. Does it matter when I use EMR?
It depends on your use case. In general, it boils down to:
You would choose S3 over HDFS as a persistent storage option that can hold your data beyond your EMR cluster's lifetime.
Unlimited (theoretically) storage.
High SLA and durability.
Cost: HDFS on EMR is ephemeral, so with S3 you do not need to keep clusters running to have data available.
etc.
Vs
HDFS is faster for I/O operations and intermediate/temporary data, since S3 communication involves API calls over the internet.
Can S3DistCp merge multiple files stored as .snappy.parquet output by a Spark app into one file and have the resulting file be readable by Hive?
I was also trying to merge smaller snappy parquet files into larger snappy parquet files.
Used
aws emr add-steps --cluster-id {clusterID} --steps file://filename.json
and
aws emr wait step-complete --cluster-id {clusterID} --step-id {stepID}
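For context, a filename.json for that kind of merge step might look roughly like the following; s3-dist-cp's --groupBy regex and --targetSize (in MiB) control how small files are concatenated into larger ones. All paths are placeholders and the regex is illustrative only:

```
[
  {
    "Name": "Merge small snappy parquet files",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
      "Jar": "command-runner.jar",
      "Args": [
        "s3-dist-cp",
        "--src", "s3://my-bucket/input/",
        "--dest", "s3://my-bucket/merged/",
        "--groupBy", ".*(part-).*\\.snappy\\.parquet",
        "--targetSize", "1024"
      ]
    }
  }
]
```

Note that --groupBy merges files by byte-level concatenation, which is valid for plain text but not for Parquet, whose single footer must describe the whole file; that mismatch is consistent with the EOFException described below.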
The command runs fine, but when I try to read the merged file back using parquet-tools, the read fails with java.io.EOFException.
I reached out to the AWS support team. They said they have a known issue when using s3-dist-cp on Parquet files and are working on a fix, but they don't have an ETA for it.