I'm trying to read a TSV dataset with 37K files (totaling 5-8 TB) from S3 in Spark on Apache EMR. I'm using a cluster of 91 r5d.4xlarge hosts with executor memory as 36GB and executor cores as 15. It currently takes me up to an hour to download this dataset. AWS documentation for r5d.4xlarge promises network speeds of up to 10 Gbps (ie 1.25GBps) for each host (See link below for reference). However, when I go to ganglia and select bytes_in, I see each node downloading the data at only 20 MBps. Does anyone know how I can increase my download speeds (or any other way to read the dataset quickly)? Wondering if the bottleneck is in how spark is reading the data from S3, S3 itself or that AWS over-promised me?
Thanks
https://aws.amazon.com/ec2/instance-types/r5/
Related
I have been running spark statefull structured streaming in EMR production, before statestore was running on HDFS backend and accumulating Few GB;s like (2.5 gb) in hdfs directories, Later when I moved to rocksdb backend with 3.2.1, It accumulates close to 45GB in last 10days.
Also I using default rocksDB spark configurations. Im sure, It was clearing the state after the processing time out, Even now I see only 25MB and 6.5 million records in spark UI in overall.
How can we pruge rocks db statestore beyond some storage limit ?
I have a Spark Streaming job on EMR which runs on batches of 30 mins, processes the data and finally writes the output to several different files in S3. Now the output step to S3 is taking too long (about 30mins) to write the files to S3. On investigating further, I found that the majority time taken is after all tasks have written the data in temporary folder (happens within 20s) and rest of the time taken is due to the fact that the master node is moving the S3 files from _temporary folder to destination folder and renaming them etc. (Similar to: Spark: long delay between jobs)
Some other details on the job configurations, file format etc are as below:
EMR version: emr-5.22.0
Hadoop version:Amazon 2.8.5
Applications:Hive 2.3.4, Spark 2.4.0, Ganglia 3.7.2
S3 files: Done using RDD saveAsTextFile API with S3A URL, S3 file format is text
Now although the EMRFS output committer is enabled by default in the job but it is not working since we are using RDDs and text file format which is supported post EMR 6.40 version only. One way that I can think of for optimizing the time taken in S3 save is by upgrading the EMR version, converting RDDs to DataFrames/Datasets and using their APIs instead of saveAsTextFile. Is there any other simpler solution possible to optimize the time taken for the job?
Is there any other simpler solution possible to optimize the time taken for the job?
unless you use an s3-specific committer, your jobs will not only be slow, they will be incorrect in the presence of failures. As this may matter to you,it is good that the slow job commits are providing an early warning of problems even before worker failures result in invalid output
options
upgrade. the committers were added for a reason.
use a real cluster fs (e.g HDFS) as the output then upload afterwards.
The s3a zero rename committers do work in saveAsTextFile, but they aren't supported by AWS and the ASF developers don't test on EMR as it is amazon's own fork. you might be able to get any s3a connector amazon ship to work, but you'd be on your own if it didn't.
we have a spark cluster on aws ec2 that has 60 X i3.4xlarge.
the spark job running on that cluster reads from an S3 bucket and writes to that bucket.
the bucket and the ec2 run in the same region.
As part of our efforts to reduce the runtime of our spark jobs we found there's serious latency when reading from S3.
When the job:
reads the parquet files from S3 and also writes to S3, it takes 22 min
reads the parquet files from S3 and writes to its local hdfs, it takes the same amount of time (±22 min)
reads the parquet files from S3 (they were copied into the hdfs before) and writes to its local hdfs, the job took 7 min
the spark job has the following S3-related configuration:
spark.hadoop.fs.s3a.connection.establish.timeout=5000
spark.hadoop.fs.s3a.connection.maximum=200
when reading from S3 we tried to increase the spark.hadoop.fs.s3a.connection.maximum config param from 200 to 400 or 900 but it didn't reduce the S3 latency.
Do you have any idea for the cause of the read latency from S3?
I saw this post to improve the transfer speed, is something here relevant?
I am trying to understand if spark is an alternative to the vanilla MapReduce approach for analysis of BigData. Since spark saves the operations on the data in the memory so while using the HDFS as storage system for spark , does it take the advantage of distributed storage of the HDFS? For instance suppose i have 100GB CSV file stored in HDFS, now i want to do analysis on it. If i load that from HDFS to spark , will spark load the complete data in-memory to do the transformations or it will use the distributed environment for doing its jobs that HDFS provides for Storage which is leveraged by the MapReduce programs written in hadoop. If not then what is the advantage of using spark over HDFS ?
PS: I know spark spills on the disks if there is RAM overflow but does this spill occur for data per node(suppose 5 GB per node) of the cluster or for the complete data(100GB)?
Spark jobs can be configured to spill to local executor disk, if there is not enough memory to read your files. Or you can enable HDFS snapshots and caching between Spark stages.
You mention CSV, which is just a bad format to have in Hadoop in general. If you have 100GB of CSV, you could just as easily have less than half that if written in Parquet or ORC...
At the end of the day, you need some processing engine, and some storage layer. For example, Spark on Mesos or Kubernetes might work just as well as on YARN, but those are separate systems, and are not bundled and tied together as nicely as HDFS and YARN. Plus, like MapReduce, when using YARN, you are moving the execution to the NodeManagers on the datanodes, rather than pulling over data over the network, which you would be doing with other Spark execution modes. The NameNode and ResourceManagers coordinate this communication for where data is stored and processed
If you are convinced that MapReduceV2 can be better than Spark, I would encourage looking at Tez instead
I have around 100 GB of data in CSV format on which I intend to do some transformation like aggregation, data splitting and after that do some clustering using ML package of Apache Spark.
I have tried it by uploading data on MYSQ trying to automate the process on python but it's taking too much time to build any solution.
What is the configuration I need to setup and how I should start with the spark?
I am new in spark. I am planning to use cloud services.
I'm going to recommend you learn to use spark locally with a small subset of the data; you can run it standalone with a few tens moving to hundreds of MB. Its limited, but you can learn the tooling without paying. Your first spark dataframe query could be sampling the source data and saving it into a more efficient query format.
CSV isn't a great format for big data; Spark likes Parquet and for 2.3+ ORC). Embrace them for better perf.
Play with "notebooks"; Apache Zeppelin is one you can install and run locally.
Like I say, learn to play with small amounts. Spark is very interactive & working with small datasets is an easy way to learn fast.
There are many ways to do that but it depends on your case. As far as I know, HDFS with default configuration(without any specific tuning) works fine. Majority of Hadoop tuning guides are focused on YARN side. So, let me make a plan like below:
Generally speaking, you can put your (raw) data in HDFS and load them in Apache Spark and save them in Parquet/ORC like below:
from pyspark.sql.types import StructType,StructField,StringType
myschema = StructType([StructField("FirstName",StringType(),True),StructField("LastName",StringType(),True)])
mydf = spark.read.format("com.databricks.spark.csv").option("header","true").schema(myschema).option("delimiter",",").load("hdfs://hadoopmaster:9000/user/hduser/mydata.csv")
mydf.count()
mydf.repartition(6).write.format("parquet").save("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")
newdf = spark.read.parquet("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")
newdf.count()
Finally, compare mydf.count() with newdf.count(). That will run faster than raw format. In addition, your data size will decrease from 100GB to ~24GB.
If you are new to hadoop, spark and interested to setup hadoop environment in cloud. I would suggest you to go with Elastic Map Reduce(EMR) powered by AWS. You can create On demand spark cluster with the user defined configuration to process a wide range of data sets.
https://aws.amazon.com/emr/
https://aws.amazon.com/emr/details/spark/
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-launch.html
Or
You can setup a hadoop cluster on top of EC2 instance or in any cloud platform with the required number of nodes with sufficient RAM and CPU. Storage optimized instances is preferred over here to analyze a large data set.
We do not need to bother about storage cost, For storage optimized instances, AWS offers free ephemeral storage data disk with size 1 - 2TB depends on instance size.
Note: Data in the ephemeral storage will be lost when the VM is rebooted. We can persist the processed data in S3 at the cheapest cost.
When it comes to cluster configuration, the list of things to be checked.
Spark on YARN is preferred
Set minimum and maximum core and memory in yarn node manager container settings for your spark executors.
Enable dynamic memory allocation in spark
Set container size to the maximum and spark memory fraction to maximum to avoid shuffling multiple times and frequent spilling and cached data eviction.
Use kryo serialization to get high performance.
Enable compression for map outputs before shuffling.
Enable spark web UI to track your application tasks and its stages.
Apache Spark Config Reference: https://spark.apache.org/docs/2.1.0/configuration.html