We have a Spark cluster on AWS EC2 with 60 i3.4xlarge instances.
The Spark job running on that cluster reads from an S3 bucket and writes back to the same bucket.
The bucket and the EC2 instances are in the same region.
As part of our efforts to reduce the runtime of our Spark jobs, we found there is serious latency when reading from S3.
When the job:
reads the Parquet files from S3 and also writes to S3, it takes 22 min
reads the Parquet files from S3 and writes to its local HDFS, it takes about the same amount of time (±22 min)
reads the Parquet files from local HDFS (they were copied there from S3 beforehand) and writes to its local HDFS, it takes 7 min
The Spark job has the following S3-related configuration:
spark.hadoop.fs.s3a.connection.establish.timeout=5000
spark.hadoop.fs.s3a.connection.maximum=200
When reading from S3 we tried increasing the spark.hadoop.fs.s3a.connection.maximum config param from 200 to 400 and then 900, but it did not reduce the S3 read latency.
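For reference, a minimal sketch of how the job could be launched with a few additional S3A read-side settings; the fadvise and readahead values (and the input path) are assumptions to experiment with, not settings from our current job:

import org.apache.spark.sql.SparkSession

// Sketch only: tuning values and the bucket path are placeholders.
val spark = SparkSession.builder()
  .appName("s3a-read-tuning-sketch")
  .config("spark.hadoop.fs.s3a.connection.maximum", "200")
  .config("spark.hadoop.fs.s3a.connection.establish.timeout", "5000")
  // Parquet reads do many seeks; "random" fadvise avoids re-opening the HTTP stream on every seek.
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  // Readahead buffer for each open S3A input stream.
  .config("spark.hadoop.fs.s3a.readahead.range", "1M")
  .getOrCreate()

// Hypothetical input path for illustration.
val df = spark.read.parquet("s3a://my-bucket/input/")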
Do you have any idea what might be causing the read latency from S3?
I saw this post about improving the transfer speed; is anything in it relevant?
Related
I'm trying to read a TSV dataset with 37K files (totaling 5-8 TB) from S3 in Spark on Amazon EMR. I'm using a cluster of 91 r5d.4xlarge hosts with executor memory set to 36 GB and 15 executor cores. It currently takes me up to an hour to download this dataset. The AWS documentation for r5d.4xlarge promises network speeds of up to 10 Gbps (i.e. 1.25 GB/s) per host (see the link below for reference). However, when I go to Ganglia and select bytes_in, I see each node downloading the data at only 20 MB/s. Does anyone know how I can increase my download speeds (or any other way to read the dataset quickly)? I'm wondering whether the bottleneck is in how Spark reads the data from S3, in S3 itself, or whether AWS over-promised.
Thanks
https://aws.amazon.com/ec2/instance-types/r5/
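A minimal sketch of what such a read might look like; the bucket prefix is a placeholder, and the connection/thread values are illustrative knobs to try (making sure at least as many S3 connections are available as concurrent tasks per host), not verified tunings:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tsv-read-sketch")
  // Allow at least one S3 connection per concurrent task on the host (illustrative values).
  .config("spark.hadoop.fs.s3a.connection.maximum", "100")
  .config("spark.hadoop.fs.s3a.threads.max", "64")
  .getOrCreate()

// Hypothetical location of the 37K TSV files.
val tsv = spark.read
  .option("sep", "\t")
  .option("header", "false")
  .csv("s3a://my-bucket/tsv-data/")

// Check how many read tasks Spark actually schedules.
println(s"partitions: ${tsv.rdd.getNumPartitions}")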
I am designing a solution that reads a DynamoDB stream and writes to an S3 data lake using Lambda.
I will be reading a max batch size of 10,000 changes in a batch window of about 2 minutes. Depending on the velocity of change in the DB, my design might also create small S3 files. I have seen customers having issues with Spark not working well with small S3 files, and I am wondering how common this issue is.
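One common mitigation for the small-file problem is a periodic compaction job on the Spark side; a minimal sketch, where the paths, input format, and target file count are all assumptions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("small-file-compaction-sketch").getOrCreate()

// Hypothetical layout: raw small files written by the Lambda, compacted copies for Spark consumers.
val raw = spark.read.json("s3a://my-datalake/raw/ddb-changes/dt=2024-01-01/")

// Rewrite into a handful of larger files; 8 is illustrative, aim for roughly 128-512 MB per file.
raw.coalesce(8)
  .write
  .mode("overwrite")
  .parquet("s3a://my-datalake/compacted/ddb-changes/dt=2024-01-01/")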
I'm currently playing around with Spark on an EMR cluster. I noticed that if I perform the reads/writes into/out of my cluster in the Spark script itself, there is an absurd wait for my output data to show up in the S3 console, even for relatively lightweight files. Would this write be expedited by writing to HDFS in my Spark script and then adding an additional step to transfer the output from HDFS to S3 using s3-dist-cp?
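A minimal sketch of the HDFS-first pattern being asked about; the paths and the transformation are placeholders, and the copy to S3 would be a separate EMR step running s3-dist-cp rather than Spark code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hdfs-first-write-sketch").getOrCreate()

// Hypothetical input and transformation.
val result = spark.read.parquet("s3a://my-bucket/input/")
  .filter("value IS NOT NULL")

// Write to the cluster's HDFS, where job commit is a cheap rename.
result.write.mode("overwrite").parquet("hdfs:///user/hadoop/output/")

// Then, as a separate EMR step (shell, not Spark):
//   s3-dist-cp --src hdfs:///user/hadoop/output/ --dest s3://my-bucket/output/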
I have a Spark Streaming job on EMR which runs in 30-minute batches, processes the data, and finally writes the output to several different files in S3. The output step to S3 is now taking too long (about 30 minutes) to write the files. On investigating further, I found that the majority of the time is spent after all tasks have written the data to the temporary folder (which happens within 20 s); the rest is spent while the master node moves the S3 files from the _temporary folder to the destination folder and renames them, etc. (similar to: Spark: long delay between jobs).
Some other details on the job configuration, file format, etc. are below:
EMR version: emr-5.22.0
Hadoop version: Amazon 2.8.5
Applications: Hive 2.3.4, Spark 2.4.0, Ganglia 3.7.2
S3 files: written using the RDD saveAsTextFile API with an S3A URL; the S3 file format is text
Now, although the EMRFS output committer is enabled by default for the job, it is not being used, since we are using RDDs and the text file format, which is only supported from EMR 6.4.0 onwards. One way I can think of to optimize the time taken by the S3 save is to upgrade the EMR version, convert the RDDs to DataFrames/Datasets, and use their APIs instead of saveAsTextFile. Is there any other, simpler solution to optimize the time taken by the job?
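For reference, a minimal sketch of the DataFrame-based rewrite mentioned above; the sample data and the output path are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("text-write-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the batch's output RDD of strings.
val outputRdd = spark.sparkContext.parallelize(Seq("line1", "line2"))

// Convert to a Dataset and use the DataFrameWriter text sink instead of saveAsTextFile,
// so a committer wired into the Spark SQL write path can be used on newer EMR releases.
outputRdd.toDS()
  .write
  .mode("overwrite")
  .text("s3a://my-bucket/streaming-output/")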
Unless you use an S3-specific committer, your jobs will not only be slow, they will be incorrect in the presence of failures. As this may matter to you, it is good that the slow job commits are providing an early warning of problems even before worker failures result in invalid output.
Options:
Upgrade. The committers were added for a reason.
Use a real cluster FS (e.g. HDFS) as the output, then upload afterwards.
The S3A zero-rename committers do work with saveAsTextFile, but they aren't supported by AWS, and the ASF developers don't test on EMR, as it is Amazon's own fork. You might be able to get whatever S3A connector Amazon ships to work, but you'd be on your own if it didn't.
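For context, a rough sketch of the configuration the ASF S3A committers expect; this assumes a self-managed Spark with the hadoop-aws and spark-hadoop-cloud modules on the classpath, and the class and key names come from the ASF docs rather than EMR, so treat them as assumptions to verify for your distribution:

import org.apache.spark.sql.SparkSession

// Sketch: switch from the rename-based FileOutputCommitter to the S3A "directory" staging committer.
val spark = SparkSession.builder()
  .appName("s3a-committer-sketch")
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
          "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  // Needed for DataFrame/Dataset writes (classes live in the spark-hadoop-cloud module).
  .config("spark.sql.sources.commitProtocolClass",
          "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
          "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()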
I'd like to stream the contents of a Couchbase bucket to S3 as Parquet files using a Spark job. I'm currently leveraging Couchbase's Spark streaming integration with the Couchbase connector to generate a DStream, but within the DStream there are multiple RDDs that only contain around 10 records each. I could create a file for each RDD and upload them individually to S3, but considering that I have 12 million records to import, I would be left with around a million small files in S3, which is not ideal. What would be the best way to load the contents of a Couchbase bucket into S3 using a Spark job? I'd ultimately like to have a single Parquet file with all the contents of the Couchbase bucket, if possible.
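One way around the tiny-RDD problem is to buffer the stream into larger windows and coalesce before writing, so each write produces a few large Parquet files instead of one per micro-batch. A minimal sketch, where the window length, record shape, and output path are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Assume `docs` is a DStream[(String, String)] of (id, json) pairs produced by the Couchbase connector.
def writeInLargeBatches(spark: SparkSession, docs: DStream[(String, String)]): Unit = {
  import spark.implicits._

  // Group many small micro-batches into one larger window (10 minutes here, purely illustrative;
  // it must be a multiple of the streaming batch interval).
  docs.window(Seconds(600), Seconds(600)).foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      rdd.toDF("id", "json")
        .coalesce(4)                                   // a handful of large files per window
        .write
        .mode("append")
        .parquet("s3a://my-bucket/couchbase-export/")  // hypothetical destination
    }
  }
}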