Apache Spark performance on AWS S3 vs EC2 HDFS

What is the performance difference when Spark reads a file from S3 versus HDFS on EC2? Also, please explain how it works in both cases.

Reading from S3 is a matter of performing authenticated HTTPS GET requests with a Range header pointing to the start of the read (0, or the location you've just seeked to) and the end (historically the end of the file; requesting all the way to the end is now optional and should be avoided for the seek-heavy ORC and Parquet inputs).
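For illustration only (bucket, object, and offsets are hypothetical), a ranged read issued by the S3A client looks roughly like this on the wire:
GET /data/part-00000.snappy.parquet HTTP/1.1
Host: bucket-name.s3.amazonaws.com
Authorization: AWS4-HMAC-SHA256 ...
Range: bytes=4194304-8388607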
Key performance points:
Read: you don't get locality of access; network bandwidth is limited by the VMs you rent.
S3 is way slower on seeks, which is partly addressed in the forthcoming Hadoop 2.8.
S3 is way, way slower on metadata operations (list, getFileStatus()). This hurts job setup.
Write: not so bad, except that pre-Hadoop-2.8 the client waits until the close() call to do the upload, which may add delays.
rename(): really a COPY; as rename() is used for committing tasks and jobs, this hurts performance when using S3 as a destination of work. As S3 is eventually consistent, you could lose data anyway. Write to hdfs:// then copy to s3a:// (a sketch of this follows below).
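A minimal sketch of that last point, assuming an existing SparkSession named spark, a DataFrame df, and hypothetical paths: write the output to HDFS, then upload the finished directory to S3 in one pass (DistCp / s3-dist-cp does the same at larger scale).
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Commit the job output on HDFS, where rename() is a cheap metadata operation.
df.write.parquet("hdfs:///tmp/job-output")

// Then copy the completed files up to S3; no rename happens on the S3 side.
val conf = spark.sparkContext.hadoopConfiguration
val hdfs = FileSystem.get(new URI("hdfs:///"), conf)
val s3   = FileSystem.get(new URI("s3a://bucket-name/"), conf)
FileUtil.copy(hdfs, new Path("hdfs:///tmp/job-output"),
  s3, new Path("s3a://bucket-name/final-output"), false /* deleteSource */, conf)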
How is this implemented? Look in the Apache Hadoop source tree for the implementations of the abstract org.apache.hadoop.fs.FileSystem class; HDFS and S3A are both examples. Here's the S3A one. The input stream, with the Hadoop 2.8 lazy seek and the fadvise=random option for faster random IO, is S3AInputStream.
Looking at the article the other answer covers: it's a three-year-old article talking about S3 when it was limited to 5 GB, and it misses some key points on both sides of the argument.
I think the author had some bias towards S3 in the first place ("S3 supports compression!"), as well as some ignorance of aspects of both. (Hint: while both Parquet and ORC need seek(), the s3n and s3a clients do this by way of ranged HTTP GET requests.)
S3 is, on non-EMR systems, a dangerous place to store intermediate data and, performance-wise, an inefficient destination of work. This is because its eventual consistency means newly created data may not be picked up by the next stage in the workflow, and because committing work with rename() doesn't work with big datasets. It all seems to work well in development, but production is where the scale problems hit.
Looking at the example code,
You'll need the version of the Amazon S3 SDK JAR that matches your Hadoop version; for Hadoop 2.7 that's 1.7.4. This has proven to be very brittle.
Best to put the s3a secrets into spark-defaults.conf, or leave them as AWS_ environment variables and let spark-submit propagate them automatically. Putting them on the command line makes them visible in a ps listing, and you don't want that.
S3A will actually use IAM authentication: if you are submitting to an EC2 VM, you should not need to provide any secrets, as it will pick up the credentials given to the VM at launch time.
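For example, a minimal spark-defaults.conf along these lines keeps the secrets out of the process listing (the key values are placeholders; on an EC2 VM with an IAM role you can omit both lines entirely):
spark.hadoop.fs.s3a.access.key    AKIAXXXXXXXXEXAMPLE
spark.hadoop.fs.s3a.secret.key    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx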

If you are planning to use Spark SQL, then you might want to consider the points below:
When your external tables point to S3, Spark SQL performance regresses considerably. You might even encounter memory issues such as org.apache.spark.shuffle.FetchFailedException: Too large frame, or java.lang.OutOfMemoryError.
Another observation: if a shuffle block is over 2 GB, the shuffle fails. This issue occurs when external tables point to S3 (one common mitigation is sketched below).
Spark SQL performance on HDFS was 50% faster than on S3 for a 50MM-row / 10 GB dataset.
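A minimal sketch of that mitigation, assuming a SparkSession named spark and an illustrative partition count, is to raise the shuffle parallelism so each shuffle block stays well under the 2 GB frame limit:
// More shuffle partitions mean smaller shuffle blocks; 2000 is only an example, tune it for your data volume.
spark.conf.set("spark.sql.shuffle.partitions", "2000")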

Here is a good article on this topic that is worth going through.
storing-apache-hadoop-data-cloud-hdfs-vs-s3
To conclude: with better scalability, built-in persistence, and lower prices, S3 is the winner! Nonetheless, for better performance and no limits on file sizes or storage formats, HDFS is the way to go.
When accessing files on S3, using the s3a URI scheme gives better performance than s3n, and with s3a there is no 5 GB file size limit.
val data = sc.textFile("s3a://bucket-name/key")
You can submit the Scala JAR file for Spark like this, for example:
spark-submit \
--master local[2] \
--packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11,org.apache.hadoop:hadoop-aws:2.7.3 \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.access.key=xxxx \
--conf spark.hadoop.fs.s3a.secret.key=xxxxxxx \
--class org.etl.jobs.sprint.SprintBatchEtl \
target/scala-2.11/test-ingestion-assembly-0.0.1-SNAPSHOT.jar

Old topic, but not much information can be found on the internet.
Best reference I have is:
https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
It states that S3 is way cheaper but about 5 times slower... and some use cases need the best-performing throughput to ingest data.
Most of the time, Spark configurations use a hybrid: HDFS for temporary work plus S3 for final writes, without users being aware of it.

Related

Writing many files to parquet from Spark - Missing some parquet files

We developed a job that processes and writes a huge number of files in Parquet to Amazon S3 (s3a) using Spark 2.3. Every source file should create a different partition in S3. The code was tested (with fewer files) and worked as expected.
However, after running it with the real data we noticed that some files (a small fraction of the total) were not written to Parquet. There was no error or anything weird in the logs. We tested the code again for the files that were missing, and it worked. We want to use the code in a production environment, but we need to find out what the problem is here. We are writing to Parquet like this:
dataframe_with_data_to_write.repartition($"field1", $"field2").write.option("compression", "snappy").option("basePath", path_out).partitionBy("field1", "field2", "year", "month", "day").mode(SaveMode.Append).parquet(path_out)
We used the recommended parameters:
spark.sparkContext.hadoopConfiguration.set("mapreduce.output.fileoutputformat.compress", "true")
spark.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
spark.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.cleanup-failures.ignored", "true")
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
Is there any known issue or bug when using these parameters? Maybe something with S3 eventual consistency? Any suggestions?
Any help will be appreciated.
Yes, it is a known issue. Work is committed by listing the output in the attempt's working directory and renaming it into the destination directory. If that listing under-reports files, output is missing; if it lists files which aren't there, the commit fails.
Fixes in the ASF Hadoop releases:
Hadoop 2.7-2.8 connectors: write to HDFS, then copy the files to S3.
Hadoop 2.9-3.0: turn on S3Guard for consistent S3 listings (it uses DynamoDB for this).
Hadoop 3.1: switch to the S3A committers, which are designed with the consistency and performance issues in mind. The "staging" one from Netflix is the simplest to use here (a configuration sketch follows the updates below).
Further reading: A zero-rename committer.
Update 11-01-2019: Amazon has its own closed-source implementation of the ASF zero-rename committer. Ask the EMR team for their own proofs of correctness, as the rest of us cannot verify this.
Update 11-Dec-2020: Amazon S3 is now fully consistent, so listings will be up to date and correct; no more update inconsistency and 404 caching.
The v1 commit algorithm is still unsafe as directory rename is non-atomic
The v2 commit algorithm is always broken as it renames files one-by-one
Renames are slow O(data) copy operations on S3, so the window of failure during task commit is bigger.
You aren't at risk of data loss any more, but as well as the performance being awful, failures during task commit aren't handled properly.
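For reference, a hedged sketch of wiring up the Netflix "staging" (directory) committer on Hadoop 3.1+ with Spark's spark-hadoop-cloud module on the classpath; the property and class names here are taken from the Hadoop S3A committer documentation, so verify them against your own versions:
spark-submit \
--conf spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a=org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory \
--conf spark.hadoop.fs.s3a.committer.name=directory \
--conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol \
--conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter \
--class ... your-application.jar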

Distribute file copy to executors

I have a bunch of data (on S3) that I am copying to a local HDFS (on Amazon EMR). Right now I'm doing that using org.apache.hadoop.fs.FileUtil.copy, but it's not clear whether this distributes the file copy to the executors. There's certainly nothing showing up in the Spark History Server.
Hadoop DistCp seems like the right tool (note I'm on S3, so it's actually supposed to be s3-dist-cp, which is built on top of DistCp), except that it's a command-line tool. I'm looking for a way to invoke it from a Scala script (i.e., Java).
Any ideas / leads?
cloudcp is an example of using Spark to do the copy; the list of files is turned into an RDD, each row == a copy. That design is optimised for upload from HDFS, as it tries to schedule the upload close to the files in HDFS.
For download, you want to:
use listFiles(path, recursive) for maximum performance in listing an object store.
randomise the list of source files so that you don't get throttled by AWS
randomise the placement across the HDFS cluster so that the blocks end up scattered evenly round the cluster
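A rough sketch of those points, assuming a SparkSession named spark and hypothetical bucket/paths: list the objects once on the driver, shuffle the list, then let each Spark task pull its share down with FileUtil.copy.
import java.net.URI
import scala.collection.mutable.ArrayBuffer
import scala.util.Random
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hconf = spark.sparkContext.hadoopConfiguration
val s3 = FileSystem.get(new URI("s3a://bucket-name/"), hconf)

// listFiles(path, recursive = true) uses bulk LIST calls rather than walking the tree directory by directory.
val files = ArrayBuffer[String]()
val it = s3.listFiles(new Path("s3a://bucket-name/input/"), true)
while (it.hasNext) files += it.next().getPath.toString

// Randomise the source order so neighbouring tasks don't hammer the same S3 key range and get throttled.
val shuffled = Random.shuffle(files.toList)

spark.sparkContext.parallelize(shuffled, 64).foreach { src =>
  // Each executor task copies one object into HDFS; writing from many nodes spreads the blocks around the cluster.
  val c = new Configuration()
  FileUtil.copy(FileSystem.get(new URI(src), c), new Path(src),
    FileSystem.get(new URI("hdfs:///"), c), new Path("hdfs:///data/" + new Path(src).getName),
    false, c)
}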

PySpark: How to speed up sqlContext.read.json?

I am using the PySpark code below to read thousands of JSON files from an S3 bucket:
sc = SparkContext()
sqlContext = SQLContext(sc)
sqlContext.read.json("s3://bucket_name/*/*/*.json")
This takes a lot of time (~16 minutes) to read and parse the JSON files. How can I parallelize or speed up the process?
The short answer is: it depends on the underlying infrastructure and on the distribution of the data (the skew, which only matters when you're doing anything that causes a shuffle).
If the code you posted is being run on, say, AWS EMR or MapR, it's best to tune the number of executors on each cluster node so that each executor has three to five cores. This number is important for reading from and writing to S3.
Another possible reason for the slowness can be the dreaded corporate proxy. If all your requests to the S3 service are being routed via a corporate proxy, then the latter is going to be a huge bottleneck. It's best to bypass the proxy for the S3 service via the NO_PROXY setting on the EMR cluster.
This talk from Cloudera, alongside their excellent blogs one and two, is an excellent introduction to tuning the cluster. Since we're using sql.read.json, the underlying DataFrame will be split into the number of partitions given by the spark.sql.shuffle.partitions parameter described here. It's best to set it to 2 * number of executors * cores per executor. That will definitely speed up reading on a cluster whose calculated value exceeds 200.
Also, as mentioned in the other answer, if you know the schema of the JSON, supplying it explicitly (so Spark can skip schema inference) may speed things up.
I would also implore you to look at the Spark UI and dig into the DAG for slow jobs. It's an invaluable tool for performance tuning on Spark.
I am planning to consolidate as many of these infrastructure optimizations for AWS EMR as possible into a blog post. I will update the answer with the link once it's done.
There are at least two ways to speed up this process:
Avoid wildcards in the path if you can. If it is possible, provide a full list of paths to be loaded instead.
Provide the schema argument to avoid schema inference.
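For example (shown in Scala; the PySpark DataFrameReader API is analogous), a hypothetical three-field schema lets Spark skip the extra pass over every file that schema inference would otherwise need:
import org.apache.spark.sql.types._

// Hypothetical schema; adjust the field names and types to match your JSON documents.
val jsonSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("timestamp", TimestampType),
  StructField("value", DoubleType)))

// With an explicit schema, Spark reads the files once instead of scanning them first to infer the schema.
val df = spark.read.schema(jsonSchema).json("s3a://bucket-name/*/*/*.json")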

Support for Parquet as an input / output format when working with S3

I've seen a number of questions describing problems when working with S3 in Spark:
Spark jobs finishes but application takes time to close
spark-1.4.1 saveAsTextFile to S3 is very slow on emr-4.0.0
Writing Spark checkpoints to S3 is too slow
many specifically describing issues with Parquet files:
Slow or incomplete saveAsParquetFile from EMR Spark to S3
Does Spark support Partition Pruning with Parquet Files
is Parquet predicate pushdown works on S3 using Spark non EMR?
Huge delays translating the DAG to tasks
Fast Parquet row count in Spark
as well as some external sources referring to other issues with Spark - S3 - Parquet combinations. It makes me think that either S3 with Spark, or this complete combination, may not be the best choice.
Am I onto something here? Can anyone provide an authoritative answer explaining:
Current state of the Parquet support with focus on S3.
Can Spark (SQL) fully take advantage of Parquet features like partition pruning, predicate pushdown (including deeply nested schemas) and Parquet metadata? Do all of these features work as expected on S3 (or compatible storage solutions)?
Ongoing developments and opened JIRA tickets.
Are there any configuration options which I should be aware of when using these three together?
A lot of the issues aren't Parquet-specific; rather, S3 is not a filesystem, despite the APIs trying to make it look like one. Many nominally low-cost operations take multiple HTTPS requests, with the consequent delays.
Regarding JIRAs:
HADOOP-11694: S3A phase II, everything you will get in Hadoop 2.8. Much of this is already in HDP 2.5, and yes, it has significant benefits.
HADOOP-13204: the todo list to follow.
Regarding Spark (and Hive), the use of rename() to commit work is a killer. It's used at the end of tasks and jobs, and in checkpointing. The more output you generate, the longer things take to complete. The S3Guard work will include a zero-rename committer, but it will take care and time to move things to it.
Parquet? Pushdown works, but there are a few other options to speed things up. I list them and others in:
http://www.slideshare.net/steve_l/apache-spark-and-object-stores
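As a sketch, two Spark SQL settings often mentioned in this context, assuming a SparkSession named spark (defaults vary by release, so check your version):
// Keep Parquet filter pushdown enabled (it is on by default in recent Spark releases).
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
// Skip schema merging across files unless you actually need it; it is expensive on an object store.
spark.conf.set("spark.sql.parquet.mergeSchema", "false")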

Does Spark support true column scans over parquet files in S3?

One of the great benefits of the Parquet data storage format is that it's columnar. If I've got a 'wide' dataset with hundreds of columns, but my query only touches a few of those, then it's possible to read only the data that stores those few columns, and skip the rest.
Presumably this feature works by reading a bit of metadata at the head of a parquet file that indicates the locations on the filesystem for each column. The reader can then seek on disk to read in only the necessary columns.
Does anyone know whether spark's default parquet reader correctly implements this kind of selective seeking on S3? I think it's supported by S3, but there's a big difference between theoretical support and an implementation that properly exploits that support.
This needs to be broken down:
Does the Parquet code get the predicates from Spark? (Yes.)
Does Parquet then attempt to selectively read only those columns, using the Hadoop FileSystem seek() + read() or readFully(position, buffer, length) calls? Yes.
Does the S3 connector translate these file operations into efficient HTTP GET requests? On Amazon EMR: yes. On Apache Hadoop, you need Hadoop 2.8 on the classpath and must set the property spark.hadoop.fs.s3a.experimental.fadvise=random to trigger random access.
Hadoop 2.7 and earlier handle aggressive seek()s around the file badly, because they always initiate a GET from the offset to the end of the file, get surprised by the next seek, have to abort that connection, and reopen a new TCP/HTTPS 1.1 connection (slow, CPU-heavy), again and again. The random IO mode hurts bulk loading of things like .csv.gz, but is critical to getting ORC/Parquet performance.
You don't get the speedup with Hadoop 2.7's hadoop-aws JAR. If you need it, you have to update hadoop*.jar and its dependencies, or build Spark from scratch against Hadoop 2.8.
Note that Hadoop 2.8+ also has a nice little feature: if you call toString() on an S3A filesystem client in a log statement, it prints out all the filesystem IO stats, including how much data was discarded in seeks, aborted TCP connections &c. Helps you work out what's going on.
Warning (2018-04-13): do not try to drop the Hadoop 2.8+ hadoop-aws JAR onto the classpath along with the rest of the Hadoop 2.7 JAR set and expect to see any speedup. All you will see are stack traces. You need to update all the Hadoop JARs and their transitive dependencies.
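A hedged sketch of the fadvise setting and the statistics trick above (the bucket name is hypothetical; the property name is as quoted in the text, though some Hadoop releases document it as fs.s3a.experimental.input.fadvise, so check the S3A documentation for your version):
import java.net.URI
import org.apache.hadoop.fs.FileSystem
import org.apache.spark.sql.SparkSession

// Switch the S3A input stream into random-IO mode for ORC/Parquet workloads (Hadoop 2.8+).
val spark = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  .getOrCreate()

// Hadoop 2.8+: toString() on the S3A client dumps its IO statistics (bytes read,
// bytes discarded in seeks, aborted connections, ...), handy for working out what's going on.
val fs = FileSystem.get(new URI("s3a://bucket-name/"), spark.sparkContext.hadoopConfiguration)
println(fs.toString)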
DISCLAIMER: I don't have a definitive answer and don't want to act as an authoritative source either, but have spent some time on parquet support in Spark 2.2+ and am hoping that my answer can help us all to get closer to the right answer.
Does Parquet on S3 avoid pulling the data for unused columns from S3 and only retrieve the file chunks it needs, or does it pull the whole file?
I use Spark 2.3.0-SNAPSHOT, which I built today right from master.
The parquet data source format is handled by ParquetFileFormat, which is a FileFormat.
If I'm correct, the reading part is handled by the buildReaderWithPartitionValues method (which overrides FileFormat's).
buildReaderWithPartitionValues is used exclusively when the FileSourceScanExec physical operator is requested for the so-called input RDDs (which are actually a single RDD) that generate internal rows when WholeStageCodegenExec is executed.
With that said, I think that reviewing what buildReaderWithPartitionValues does may get us closer to the final answer.
When you look at the following line, you can be assured that we're on the right track.
// Try to push down filters when filter push-down is enabled.
That code path depends on the spark.sql.parquet.filterPushdown Spark property, which is turned on by default.
spark.sql.parquet.filterPushdown: enables Parquet filter push-down optimization when set to true.
That leads us to parquet-hadoop's ParquetInputFormat.setFilterPredicate, iff the filters are defined.
if (pushed.isDefined) {
ParquetInputFormat.setFilterPredicate(hadoopAttemptContext.getConfiguration, pushed.get)
}
The code gets more interesting a bit later, when the filters are used and the code falls back to parquet-mr (rather than using the so-called vectorized Parquet decoding reader). That's the part I don't really understand (beyond what I can see in the code).
Please note that the vectorized Parquet decoding reader is controlled by the spark.sql.parquet.enableVectorizedReader Spark property, which is turned on by default.
TIP: To know which branch of the if expression is used, enable the DEBUG logging level for the org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat logger.
In order to see all the pushed-down filters, you could turn on the INFO logging level for the org.apache.spark.sql.execution.FileSourceScanExec logger. You should see the following in the logs:
INFO Pushed Filters: [pushedDownFilters]
I do hope that, even if it's not close to being a definitive answer, it has helped a little, and that someone picks it up where I left off to make it one soon. Hope dies last :)
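A quick way to check this on your own query, as a sketch with a hypothetical path and column names and assuming a SparkSession named spark, is to print the physical plan; the Parquet scan node lists the pushed filters:
import spark.implicits._

val df = spark.read.parquet("s3a://bucket-name/table/")
// explain(true) prints the plans; the FileScan parquet node in the physical plan includes a PushedFilters: [...] entry.
df.filter($"year" === 2017 && $"amount" > 100).select("id", "amount").explain(true)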
Spark's Parquet reader is just like any other InputFormat.
None of the InputFormats have anything special for S3. The input formats can read from LocalFileSystem, HDFS, and S3; no special optimization is done for any of them.
The Parquet InputFormat, depending on the columns you ask for, will selectively read those columns for you.
If you want to be dead sure (although predicate pushdown works in the latest Spark versions), manually select the columns and write the transformations and actions yourself, instead of depending on SQL (a sketch follows below).
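For instance (path and column names hypothetical, assuming a SparkSession named spark), selecting the columns explicitly guarantees that only those columns are requested from the Parquet reader:
// Only field1 and field3 are read from the Parquet files; the other columns are never requested.
val narrow = spark.read.parquet("s3a://bucket-name/wide-table/").select("field1", "field3")
narrow.filter(narrow("field3") > 0).show()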
No, predicate pushdown is not fully supported. This, of course, depends on:
Specific use case
Spark version
S3 connector type and version
In order to check your specific use case, you can enable the DEBUG log level in Spark and run your query. Then you can see whether there are "seeks" during the S3 (HTTP) requests, as well as how many requests were actually sent. Something like this:
17/06/13 05:46:50 DEBUG wire: http-outgoing-1 >> "GET /test/part-00000-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet HTTP/1.1[\r][\n]"
....
17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Range: bytes 0-7472093/7472094[\r][\n]"
....
17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Length: 7472094[\r][\n]"
Here's an example of an issue report that was opened recently due to the inability of Spark 2.1 to calculate COUNT(*) of all the rows in a dataset based on the metadata stored in a Parquet file: https://issues.apache.org/jira/browse/SPARK-21074
