Writing many files to parquet from Spark - Missing some parquet files

We developed a job that processes and writes a huge number of files in parquet to Amazon S3 (s3a) using Spark 2.3. Every source file should create a different partition in S3. The code was tested (with fewer files) and worked as expected.
However, after running it on the real data we noticed that some files (a small fraction of the total) were not written to parquet. There was no error or anything odd in the logs. We tested the code again for the files that were missing and it worked(?). We want to use the code in a production environment, but we need to work out what the problem is here. We are writing to parquet like this:
dataframe_with_data_to_write
  .repartition($"field1", $"field2")
  .write
  .option("compression", "snappy")
  .option("basePath", path_out)
  .partitionBy("field1", "field2", "year", "month", "day")
  .mode(SaveMode.Append)
  .parquet(path_out)
We used the recommended parameters:
spark.sparkContext.hadoopConfiguration.set("mapreduce.output.fileoutputformat.compress", "true")
spark.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
spark.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.cleanup-failures.ignored", "true")
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
Is there any known issue or bug when using these parameters? Maybe something with S3 eventual consistency? Any suggestions?
Any help will be appreciated.

Yes, it is a known issue. Work is committed by listing the output in the attempt working directory and renaming it into the destination directory. If that listing under-reports files, output is missing. If that listing lists files which aren't there, the commit fails.
Fixes in the ASF Hadoop releases:
Hadoop 2.7-2.8 connectors: write to HDFS, then copy the files to S3.
Hadoop 2.9-3.0: turn on S3Guard for consistent S3 listings (it uses DynamoDB for this).
Hadoop 3.1: switch to the S3A committers, which are designed with the consistency and performance issues in mind. The "staging" one from Netflix is the simplest to use here; see the sketch below.
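For reference, on Hadoop 3.1+ selecting one of the staging committers is mostly a matter of a couple of S3A options. A minimal sketch in the same style as the hadoopConfiguration settings in the question (the values are illustrative; check the S3A committer documentation for your Hadoop version):
// Pick a committer; "directory" is one of the staging committers.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.committer.name", "directory")
// How the staging committer resolves conflicts with existing data in destination partitions.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.committer.staging.conflict-mode", "append")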
Further reading: A zero-rename committer.
Update 11-Jan-2019: Amazon has its own closed-source implementation of the ASF zero-rename committer. Ask the EMR team for their own proofs of correctness, as the rest of us cannot verify this.
Update 11-Dec-2020: Amazon S3 is now fully consistent, so listings will be up to date and correct; update inconsistency and 404 caching are no more.
The v1 commit algorithm is still unsafe as directory rename is non-atomic
The v2 commit algorithm is always broken as it renames files one-by-one
Renames are slow O(data) copy operations on S3, so the window of failure during task commit is bigger.
You aren't at risk of data loss any more, but as well as performance being awful, failures during task commit aren't handled properly.

Related

Spark 2.3.3 outputting parquet to S3

A while back I had the problem that writing parquet files directly to S3 isn't really feasible, and I needed a caching layer before I finally copy the parquet files to S3; see this post.
I know that HADOOP-13786 should fix this problem, and it seems to be implemented in HDFS 3.1.0 and later.
Now the question is: how do I use it in Spark 2.3.3? As far as I understand it, Spark 2.3.3 comes with HDFS 2.8.5. I usually use flintrock to orchestrate my cluster on AWS. Is it just a matter of setting HDFS to 3.1.1 in the flintrock config, after which I get all the goodies? Or do I still, for example, have to set something in code like I did before? For example like this:
from pyspark import SparkConf

conf = SparkConf().setAppName(appname)\
.setMaster(master)\
.set('spark.executor.memory','13g')\
.set('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version','2')\
.set('fs.s3a.fast.upload','true')\
.set('fs.s3a.fast.upload.buffer','disk')\
.set('fs.s3a.buffer.dir','/tmp/s3a')
(I know this is the old code and probably no longer relevant)
You'll need Hadoop 3.1, and a build of Spark 2.4 which has this PR applied: https://github.com/apache/spark/pull/24970
Some downstream products with their own Spark builds do this (HDP-3.1), but it's not (yet) in the Apache builds.
With that, you then need to configure Parquet to use the new bridging committer (Parquet only allows subclasses of the Parquet committer), and select which of the three S3A committers (long story) to use. The Staging committer is the one I'd recommend as it's (a) based on the one Netflix uses and (b) the one I've tested the most. A rough config sketch follows below.
There's no fundamental reason why the same PR can't be applied to Spark 2.3, just that nobody has tried.
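As a sketch of that wiring (the class and property names are those used by Spark's cloud-integration bindings from the PR above; treat it as illustrative rather than a drop-in recipe):
// Route Spark's file commit protocol through the path-output committer bridge.
spark.conf.set("spark.sql.sources.commitProtocolClass",
  "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
// Parquet only accepts ParquetOutputCommitter subclasses, so bind via the bridging committer.
spark.conf.set("spark.sql.parquet.output.committer.class",
  "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
// Select the S3A committer family; "directory" is one of the staging committers.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.committer.name", "directory")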

Apache Spark performance on AWS S3 vs EC2 HDFS

What is the performance difference when Spark reads a file from S3 versus EC2 HDFS? Also, please explain how it works in both cases.
Reading from S3 is a matter of performing authenticated HTTPS requests with a Range header set to point to the start of the read (0 or the location you've just done a seek to) and the end (historically the end of the file; this is now optional and should be avoided for the seek-heavy ORC and Parquet inputs).
Key performance points:
Read: you don't get the locality of access; network bandwidth limited by the VMs you rent.
S3 is way slower on seeks, partly addressed in the forthcoming Hadoop 2.8
S3 is way, way slower on metadata operations (list, getFileStatus()). This hurts job setup.
Write: not so bad, except that pre-Hadoop 2.8 the client waits until the close() call to do the upload, which may add delays.
rename(): really a COPY; as rename() is used for committing tasks and jobs, this hurts performance when using S3 as a destination of work. As S3 is eventually consistent, you could lose data anyway. Write to hdfs:// then copy to s3a://.
How is this implemented? Look in the Apache Hadoop source tree for the implementations of the abstract org.apache.hadoop.fs.FileSystem class; HDFS and S3A are both examples. Here's the S3A one. The input stream, with the Hadoop 2.8 lazy seek and fadvise=random option for faster random IO, is S3AInputStream.
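To illustrate the read path, a minimal sketch (the bucket and path are made up, and the fadvise property assumes Hadoop 2.8+ with a matching hadoop-aws JAR on the classpath):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-parquet-read")
  // Hadoop 2.8+ random IO mode: lazy seek and ranged GETs instead of read-to-EOF.
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  .getOrCreate()

// Hypothetical bucket and path, for illustration only.
val df = spark.read.parquet("s3a://example-bucket/tables/events/")
df.select("field1", "field2").show(10)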
Looking at the article the other answer covers, it's a three-year-old article talking about S3 when it was limited to 5 GB, and it misses some key points on both sides of the argument.
I think the author had some bias towards S3 in the first place ("S3 supports compression!"), as well as some ignorance of aspects of both. (Hint: while both Parquet and ORC need seek(), we do this in the s3n and s3a S3 clients by way of the Range HTTP header.)
S3 is, on non-EMR systems, a dangerous place to store intermediate data, and performance-wise, an inefficient destination of work. This is due to its eventual consistency, meaning newly created data may not be picked up by the next stage in the workflow, and because committing work with rename() doesn't work with big datasets. It all seems to work well in development, but production is where the scale problems hit.
Looking at the example code,
You'll need the version of the Amazon S3 SDK JAR to match your Hadoop version; for Hadoop 2.7 that's 1.7.4. That pairing has proven to be very brittle.
Best to put the s3a secrets into spark-defaults.conf, or leave them as AWS_ environment variables and let spark-submit automatically propagate them. Putting them on the command line makes them visible in a ps command, and you don't want that.
S3a will actually use IAM authentication: if you are submitting to an EC2 VM, you should not need to provide any secrets, as it will pick up the credentials given to the VM at launch time.
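A small sketch of the environment-variable route (my own illustration, not from the answer above): read the AWS_ variables in the driver and push them into the Hadoop configuration, so nothing appears on the command line:
val hc = spark.sparkContext.hadoopConfiguration
// Only set the keys if the environment variables are actually present.
sys.env.get("AWS_ACCESS_KEY_ID").foreach(v => hc.set("fs.s3a.access.key", v))
sys.env.get("AWS_SECRET_ACCESS_KEY").foreach(v => hc.set("fs.s3a.secret.key", v))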
If you are planning to use Spark SQL, then you might want to consider the points below:
When your external tables point to S3, Spark SQL performance regresses considerably. You might even encounter memory issues like org.apache.spark.shuffle.FetchFailedException: Too large frame, or java.lang.OutOfMemoryError.
Another observation: if a shuffle block is over 2 GB, the shuffle fails. This issue occurs when external tables point to S3.
Spark SQL on HDFS was 50% faster than on S3 for a 50-million-row / 10 GB dataset.
Here is a good article on this topic that is worth going through:
storing-apache-hadoop-data-cloud-hdfs-vs-s3
To conclude: with better scalability, built-in persistence, and lower prices, S3 is the winner! Nonetheless, for better performance and no limitations on file sizes or storage formats, HDFS is the way to go.
When accessing files in S3, using the s3a URI scheme gives better performance than s3n, and with s3a there is no 5 GB file size limit.
val data = sc.textFile("s3a://bucket-name/key")
You can submit the Scala JAR file to Spark like this, for example:
spark-submit \
--master local[2] \
--packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11,org.apache.hadoop:hadoop-aws:2.7.3 \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.access.key=xxxx \
--conf spark.hadoop.fs.s3a.secret.key=xxxxxxx \
--class org.etl.jobs.sprint.SprintBatchEtl \
target/scala-2.11/test-ingestion-assembly-0.0.1-SNAPSHOT.jar
Old topic, but not much information can be found on the internet.
Best reference I have is:
https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
It states that S3 is far cheaper but about 5 times slower... and some use cases need the best possible throughput to ingest data.
Most of the time, Spark setups use a hybrid approach: HDFS for temporary work plus S3 for the final writes, without users being aware of it.

Support for Parquet as an input / output format when working with S3

I've seen a number of questions describing problems when working with S3 in Spark:
Spark jobs finishes but application takes time to close
spark-1.4.1 saveAsTextFile to S3 is very slow on emr-4.0.0
Writing Spark checkpoints to S3 is too slow
many specifically describing issues with Parquet files:
Slow or incomplete saveAsParquetFile from EMR Spark to S3
Does Spark support Partition Pruning with Parquet Files
is Parquet predicate pushdown works on S3 using Spark non EMR?
Huge delays translating the DAG to tasks
Fast Parquet row count in Spark
as well as some external sources referring to other issues with Spark - S3 - Parquet combinations. It makes me think that either S3 with Spark or this complete combination may not be the best choice.
Am I onto something here? Can anyone provide an authoritative answer explaining:
Current state of the Parquet support with focus on S3.
Can Spark (SQL) fully take advantage of Parquet features like partition pruning, predicate pushdown (including deeply nested schemas) and Parquet metadata? Do all of these features work as expected on S3 (or compatible storage solutions)?
Ongoing developments and open JIRA tickets.
Are there any configuration options one should be aware of when using these three together?
A lot of the issues aren't Parquet-specific; the problem is that S3 is not a filesystem, despite the APIs trying to make it look like one. Many nominally low-cost operations take multiple HTTPS requests, with the consequent delays.
Regarding JIRAs
HADOOP-11694; S3A phase II: everything you will get in Hadoop 2.8. Much of this is already in HDP 2.5, and yes, it has significant benefits.
HADOOP-13204: the todo list to follow.
Regarding Spark (and Hive), the use of rename() to commit work is a killer. It's used at the end of tasks and jobs, and in checkpointing. The more output you generate, the longer things take to complete. The S3Guard work will include a zero-rename committer, but it will take care and time to move things to it.
Parquet? Pushdown works, but there are a few other options to speed things up. I list them and others in:
http://www.slideshare.net/steve_l/apache-spark-and-object-stores

Does Spark support true column scans over parquet files in S3?

One of the great benefits of the Parquet data storage format is that it's columnar. If I've got a 'wide' dataset with hundreds of columns, but my query only touches a few of those, then it's possible to read only the data that stores those few columns and skip the rest.
Presumably this feature works by reading a bit of metadata at the head of a parquet file that indicates the locations on the filesystem for each column. The reader can then seek on disk to read in only the necessary columns.
Does anyone know whether spark's default parquet reader correctly implements this kind of selective seeking on S3? I think it's supported by S3, but there's a big difference between theoretical support and an implementation that properly exploits that support.
This needs to be broken down
Does the Parquet code get the predicates from spark (yes)
Does parquet then attempt to selectively read only those columns, using the Hadoop FileSystem seek() + read() or readFully(position, buffer, length) calls? Yes
Does the S3 connector translate these file operations into efficient HTTP GET requests? In Amazon EMR: yes. In Apache Hadoop, you need Hadoop 2.8 on the classpath and need to set spark.hadoop.fs.s3a.experimental.input.fadvise=random to trigger random access.
Hadoop 2.7 and earlier handle the aggressive seek() around the file badly, because they always initiate a GET from the offset to the end of the file, get surprised by the next seek, have to abort that connection, reopen a new TCP/HTTPS 1.1 connection (slow, CPU heavy), and do it again, repeatedly. The random IO mode hurts bulk loading of things like .csv.gz, but is critical to getting ORC/Parquet performance.
You don't get the speedup on Hadoop 2.7's hadoop-aws JAR. If you need it, you need to update all the hadoop*.jar files and their dependencies, or build Spark from scratch against Hadoop 2.8.
Note that Hadoop 2.8+ also has a nice little feature: if you call toString() on an S3A filesystem client in a log statement, it prints out all the filesystem IO stats, including how much data was discarded in seeks, aborted TCP connections &c. Helps you work out what's going on.
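For example, a quick way to dump those stats from a job (a sketch only; it assumes Hadoop 2.8+ on the classpath, an existing SparkSession named spark, and a made-up bucket):
import java.net.URI
import org.apache.hadoop.fs.FileSystem

// Grab the S3A client for the bucket and print its statistics; toString() on the
// Hadoop 2.8+ S3A filesystem includes read/seek/aborted-connection counters.
val fs = FileSystem.get(new URI("s3a://example-bucket/"), spark.sparkContext.hadoopConfiguration)
println(fs.toString)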
2018-04-13 warning: do not try to drop the Hadoop 2.8+ hadoop-aws JAR onto the classpath along with the rest of the Hadoop 2.7 JAR set and expect to see any speedup. All you will see are stack traces. You need to update all the Hadoop JARs and their transitive dependencies.
DISCLAIMER: I don't have a definitive answer and don't want to act as an authoritative source either, but have spent some time on parquet support in Spark 2.2+ and am hoping that my answer can help us all to get closer to the right answer.
Does Parquet on S3 avoid pulling the data for unused columns from S3 and only retrieve the file chunks it needs, or does it pull the whole file?
I use Spark 2.3.0-SNAPSHOT that I built today right from the master.
The parquet data source format is handled by ParquetFileFormat, which is a FileFormat.
If I'm correct, the reading part is handled by the buildReaderWithPartitionValues method (which overrides FileFormat's).
buildReaderWithPartitionValues is used exclusively when the FileSourceScanExec physical operator is requested for the so-called input RDDs (actually a single RDD) that generate internal rows when WholeStageCodegenExec is executed.
With that said, I think that reviewing what buildReaderWithPartitionValues does may get us closer to the final answer.
When you look at the following line, you can be assured that we're on the right track.
// Try to push down filters when filter push-down is enabled.
That code path depends on spark.sql.parquet.filterPushdown Spark property that is turned on by default.
spark.sql.parquet.filterPushdown Enables Parquet filter push-down optimization when set to true.
That leads us to parquet-hadoop's ParquetInputFormat.setFilterPredicate iff the filters are defined.
if (pushed.isDefined) {
ParquetInputFormat.setFilterPredicate(hadoopAttemptContext.getConfiguration, pushed.get)
}
The code gets more interesting a bit later, when the filters are used as the code falls back to parquet-mr (rather than using the so-called vectorized parquet decoding reader). That's the part I don't really understand (except what I can see in the code).
Please note that the vectorized parquet decoding reader is controlled by the spark.sql.parquet.enableVectorizedReader Spark property, which is turned on by default.
TIP: To know what part of the if expression is used, enable DEBUG logging level for org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat logger.
In order to see all the pushed-down filters you could turn INFO logging level of org.apache.spark.sql.execution.FileSourceScanExec logger on. You should see the following in the logs:
INFO Pushed Filters: [pushedDownFilters]
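As a complementary check (my own sketch, not from the answer above), the physical plan printed by explain() also lists the pushed filters for the Parquet scan; the path and column names here are made up:
import org.apache.spark.sql.functions.col

spark.conf.set("spark.sql.parquet.filterPushdown", "true")         // default: true
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true") // default: true

// Look for "PushedFilters: [...]" in the printed physical plan.
val df = spark.read.parquet("s3a://example-bucket/tables/events/")
df.filter(col("year") === 2018 && col("month") === 4).explain(true)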
I do hope that if it's not close to be a definitive answer it has helped a little and someone picks it up where I left off to make it one soon. Hope dies last :)
Spark's parquet reader is just like any other InputFormat.
None of the InputFormats have anything special for S3. They can read from LocalFileSystem, HDFS and S3; no special optimization is done for any of them.
The Parquet InputFormat will selectively read only the columns you ask for.
If you want to be dead sure (although predicate pushdown works in the latest Spark versions), manually select the columns and write the transformations and actions yourself, instead of depending on SQL.
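For instance, the explicit column selection looks like this (a sketch with made-up paths and column names):
// Only the selected column chunks need to be fetched from the Parquet files.
val narrow = spark.read
  .parquet("s3a://example-bucket/wide-table/")
  .select("id", "timestamp", "value")
narrow.count()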
No, predicate pushdown is not fully supported. This, of course, depends on:
Specific use case
Spark version
S3 connector type and version
In order to check your specific use case, you can enable the DEBUG log level in Spark and run your query. Then you can see whether there are "seeks" during the S3 (HTTP) requests, as well as how many requests were actually sent. Something like this (a sketch for enabling this logging follows the excerpt below):
17/06/13 05:46:50 DEBUG wire: http-outgoing-1 >> "GET /test/part-00000-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet HTTP/1.1[\r][\n]"
....
17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Range: bytes 0-7472093/7472094[\r][\n]"
....
17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Length: 7472094[\r][\n]"
Here's an example of an issue report that was opened recently due to the inability of Spark 2.1 to calculate COUNT(*) of all the rows in a dataset based on the metadata stored in a Parquet file: https://issues.apache.org/jira/browse/SPARK-21074

Pyspark write to External Hive table in S3 is not parallel

I have an external Hive table defined with a location in S3:
LOCATION 's3n://bucket/path/'
When writing to this table at the end of a pyspark job that aggregates a bunch of data, the write to Hive is extremely slow because only 1 executor/container is used for the write. When writing to an HDFS-backed table, the write happens in parallel and is significantly faster.
I've tried defining the table using the s3a path but my job fails due to some vague errors.
This is on Amazon EMR 5.0 (hadoop 2.7), pyspark 2.0 but I have experienced the same issue with previous versions of EMR/spark.
Is there a configuration or alternative library that I can use to make this write more efficient?
I guess you're using parquet. The DirectParquetOutputCommitter has been removed to avoid a potential data loss issue. The change was made in April 2016.
It means the data you write to S3 will first be saved to a _temporary folder and then "moved" to its final location. Unfortunately "moving" == "copying & deleting" in S3, and it is rather slow. To make it worse, this final "moving" is done only by the driver.
If you don't want to fight to add that class back, you will have to write to local HDFS and then copy the data over (I do recommend this). In HDFS, "moving" ~ "renaming", so it takes almost no time.
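A Scala sketch of that write-to-HDFS-then-copy pattern (the paths and the dataframe name are made up; a real job would normally use distcp or s3-dist-cp for a parallel copy):
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = spark.sparkContext.hadoopConfiguration
val staging = new Path("hdfs:///tmp/job-output/parquet")        // cheap, fast renames
val finalDst = new Path("s3a://example-bucket/final/parquet")

// 1. Write in parallel to HDFS, where task/job commit renames are quick.
dataframe.write.mode("overwrite").parquet(staging.toString)

// 2. Copy the committed output to S3 (single-threaded here; use distcp at scale).
val hdfs = staging.getFileSystem(conf)
val s3 = FileSystem.get(new URI("s3a://example-bucket/"), conf)
FileUtil.copy(hdfs, staging, s3, finalDst, false, conf)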
