Spark 2.3.3 outputting parquet to S3

A while back I found that writing Parquet files directly to S3 isn't really feasible, so I needed a caching layer before finally copying the Parquet files to S3 (see this post).
I know that HADOOP-13786 should fix this problem, and it seems to be implemented in Hadoop (HDFS) 3.1.0 and later.
Now the question is how to use it in Spark 2.3.3. As far as I understand it, Spark 2.3.3 comes with HDFS 2.8.5. I usually use flintrock to orchestrate my cluster on AWS. Is it just a matter of setting HDFS to 3.1.1 in the flintrock config, and then I get all the goodies? Or do I still have to set something in code, as I did before? For example, like this:
conf = SparkConf().setAppName(appname)\
    .setMaster(master)\
    .set('spark.executor.memory','13g')\
    .set('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version','2')\
    .set('fs.s3a.fast.upload','true')\
    .set('fs.s3a.fast.upload.buffer','disk')\
    .set('fs.s3a.buffer.dir','/tmp/s3a')
(I know this is the old code and probably no longer relevant)

You'll need Hadoop 3.1, and a build of Spark 2.4 which has this PR applied: https://github.com/apache/spark/pull/24970
Some downstream products with their own Spark builds do this (HDP-3.1), but it's not (yet) in the Apache builds.
With that, you then need to configure Parquet to use the new bridging committer (Parquet only allows subclasses of the Parquet committer), and select which of the three S3A committers (long story) to use. The Staging committer is the one I'd recommend, as it's (a) based on the one Netflix use and (b) the one I've tested the most.
There's no fundamental reason why the same PR can't be applied to Spark 2.3; it's just that nobody has tried.
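A minimal sketch of what that configuration might look like from PySpark, assuming a Spark build that already contains the cloud-integration classes from the PR above. The property and class names are the ones I recall from Spark's cloud integration guide and the Hadoop S3A committer docs; treat them as assumptions and check them against your build. The helper names s3a_committer_conf and apply_to_builder are hypothetical.

```python
def s3a_committer_conf(committer="directory", conflict_mode="append"):
    """Return Spark settings that route Parquet output through an S3A committer.

    All keys and class names are assumptions to verify against your Spark/Hadoop build.
    """
    return {
        # Which S3A committer to use: "directory" or "partitioned" (staging), or "magic".
        "spark.hadoop.fs.s3a.committer.name": committer,
        # What the staging committers do when the destination already holds data.
        "spark.hadoop.fs.s3a.committer.staging.conflict-mode": conflict_mode,
        # Spark-side commit protocol that delegates to the Hadoop committer factory.
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        # Bridging committer: a ParquetOutputCommitter subclass, as Parquet requires.
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    }


def apply_to_builder(builder, settings):
    """Apply each setting to a SparkSession.builder-like object and return it."""
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, this would be used as apply_to_builder(SparkSession.builder.appName(appname), s3a_committer_conf()).getOrCreate().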

Related

Writing many files to parquet from Spark - Missing some parquet files

We developed a job that processes and writes a huge number of files as Parquet in Amazon S3 (s3a) using Spark 2.3. Every source file should create a different partition in S3. The code was tested (with fewer files) and worked as expected.
However, after running it on the real data we noticed that some files (a small fraction of the total) were not written to Parquet. There was no error or anything weird in the logs. We ran the code again for the missing files and it worked. We want to use the code in a production environment, but we need to find out what the problem is. We are writing to Parquet like this:
dataframe_with_data_to_write
  .repartition($"field1", $"field2")
  .write
  .option("compression", "snappy")
  .option("basePath", path_out)
  .partitionBy("field1", "field2", "year", "month", "day")
  .mode(SaveMode.Append)
  .parquet(path_out)
We used the recommended parameters:
spark.sparkContext.hadoopConfiguration.set("mapreduce.output.fileoutputformat.compress", "true")
spark.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
spark.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.cleanup-failures.ignored", "true")
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
Is there any known issue or bug with these parameters? Maybe something to do with S3 eventual consistency? Any suggestions?
Any help will be appreciated.

Yes, it is a known issue. Work is committed by listing the output in the attempt working directory and renaming it into the destination directory. If that listing underreports files, output is missing. If that listing includes files which aren't there, the commit fails.
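The failure mode can be sketched with a toy model in plain Python. This is not the Hadoop committer code: the store is just a dict, and the function name and paths are hypothetical, chosen only to show why a stale listing silently drops output.

```python
def commit_by_rename(store, listing, attempt_dir, dest_dir):
    """Rename every *listed* file from the attempt directory into the destination.

    store   : dict of path -> bytes, the objects that really exist
    listing : paths reported by a (possibly inconsistent) directory listing
    """
    for src in listing:
        if src not in store:
            # The listing reported a file which isn't there: the commit fails.
            raise FileNotFoundError(src)
        dst = dest_dir + src[len(attempt_dir):]
        store[dst] = store.pop(src)
    # Whatever the listing missed is never renamed, and nothing reports it.
    return sorted(p for p in store if p.startswith(dest_dir + "/"))


if __name__ == "__main__":
    store = {"tmp/attempt_0/part-0": b"a", "tmp/attempt_0/part-1": b"b"}
    stale_listing = ["tmp/attempt_0/part-0"]  # part-1 not visible yet
    print(commit_by_rename(store, stale_listing, "tmp/attempt_0", "out"))
    print(sorted(store))  # part-1 is still sitting in the attempt directory
```

The job reports success in the first case because nothing in the rename loop can know that part-1 ever existed.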
Fixes on the ASF Hadoop releases:
Hadoop 2.7-2.8 connectors: write to HDFS, then copy the files.
Hadoop 2.9-3.0: turn on S3Guard for a consistent S3 listing (it uses DynamoDB for this).
Hadoop 3.1: switch to the S3A committers, which are designed with the consistency and performance issues in mind. The "staging" one from Netflix is the simplest to use here.
Further reading: A zero-rename committer.
Update 11-01-2019: Amazon has its own closed source implementation of the ASF zero rename committer. Ask the EMR team for their own proofs of correctness, as the rest of us cannot verify this.
Update 11-dec-2020: Amazon S3 is now fully consistent, so listings will be up to date and correct; update inconsistency and 404 caching are no more.
The v1 commit algorithm is still unsafe, as directory rename is non-atomic.
The v2 commit algorithm is always broken, as it renames files one by one.
Renames are slow O(data) copy operations on S3, so the window of failure during task commit is bigger.
You aren't at risk of data loss any more, but as well as performance being awful, failures during task commit aren't handled properly.

Why does persist(StorageLevel.MEMORY_AND_DISK) give different results than cache() with HBase?

I might sound naive asking this question, but this is a problem I recently faced in my project, and I need a better understanding of it.
df.persist(StorageLevel.MEMORY_AND_DISK)
Whenever we use persist like this on an HBase read, the same data is returned again and again for subsequent batches of the streaming job, even though HBase is updated on every batch run.
HBase Read Code:
val df = sqlContext.read.options(Map(HBaseTableCatalog.tableCatalog -> schema)).format(dbSetup.dbClass).load().persist(StorageLevel.MEMORY_AND_DISK)
I replaced persist(StorageLevel.MEMORY_AND_DISK) with cache() and it was returning updated records from HBase table as expected.
The reason we tried to use persist(StorageLevel.MEMORY_AND_DISK) is to ensure that the in-memory storage does not get full and we do not end up doing all transformations all over again during the execution of a particular stream.
Spark Version - 1.6.3
HBase Version - 1.1.2.2.6.4.42-1
Could someone explain this to me and help me get a better understanding?

Since you asked for the reason "why", I'm answering in general terms; otherwise this question will remain unanswered, as there's no rational reason these days to run Spark 1.6.3 just to see what happens with that specific HBase version.
Internally, Spark calls persist() when you use cache(), and it behaves differently on RDDs than on Datasets (or DataFrames).
On RDDs it uses MEMORY_ONLY, and on Datasets, MEMORY_AND_DISK. I can't see everything you've coded, but in general you shouldn't see any difference between the two ways of caching on a DataFrame, so your issue is most likely a version incompatibility between Spark and HBase, or simply a bug that was never fixed by Apache.
There are several places to check to see what's wrong.
In the release notes at https://spark.apache.org/releases/spark-release-1-6-3.html you can see that maintenance of the code is happening in branch-1.6, so this is the place to find the code: https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/CacheManager.scala
Hope it helped.
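Whatever the cause in this particular version pairing, a persisted DataFrame is meant to keep serving the rows it first materialised. If each streaming batch needs fresh HBase rows, one hedged pattern is to drop the previous batch's cached frame and persist a newly loaded one. The sketch below is in Python rather than the question's Scala. It relies only on the DataFrame methods cache(), persist(level) and unpersist(); the helper name refresh_cached and the load callback are hypothetical.

```python
def refresh_cached(load, previous=None, storage_level=None):
    """Release the previous batch's cached frame, then cache a freshly loaded one.

    load          : zero-argument callable that builds a new DataFrame (e.g. the HBase read)
    previous      : the DataFrame persisted for the previous batch, if any
    storage_level : e.g. StorageLevel.MEMORY_AND_DISK; None means plain cache()
    """
    if previous is not None:
        previous.unpersist()  # free the stale snapshot
    fresh = load()
    if storage_level is None:
        return fresh.cache()
    return fresh.persist(storage_level)
```

Inside the per-batch function this would be called as df = refresh_cached(read_hbase, previous=df, storage_level=StorageLevel.MEMORY_AND_DISK), where read_hbase stands for the read shown in the question.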

Which version of hadoop-aws should I use

I'm running spark jobs on Yarn on EMR 5.14 (hadoop 2.8.3).
Can I use a newer version of hadoop-aws (e.g. 2.9 or 3.1) to benefit from recent optimizations in s3a?

You need to stick with whatever EMR give you. Their s3:// connector is the one which AWS develop, and it is probably your safest option.
FWIW, S3A input performance hasn't changed much between 2.8.3 and later versions, except that in 3.1, if you leave fs.s3a.experimental.fadvise at normal, it automatically switches from optimising for sequential IO to random IO (columnar data) on the first backward seek. It's still best to set that property to random from the outset if you know all your data is stored as Parquet/ORC in a seekable compression format (i.e. not gzip). There's no speedup in writes either. You get a consistency layer equivalent to "consistent EMR" in Hadoop 2.9+, and a high-performance output committer in Hadoop 3.1. But you cannot use those features by dropping in the later JARs: it will only give you stack traces.
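One way to spot the mixed-JAR situation before it turns into stack traces is to group the hadoop-* JAR file names on the classpath by version. A small sketch, assuming the usual hadoop-<module>-<x.y.z>[-suffix].jar naming; the function names and the example file names in the comments are hypothetical.

```python
import re
from collections import defaultdict

# Matches e.g. hadoop-aws-2.8.3.jar or hadoop-common-2.8.3-amzn-1.jar
_HADOOP_JAR = re.compile(r"^(hadoop-[a-z0-9-]+?)-(\d+\.\d+\.\d+)[^/]*\.jar$")


def hadoop_jar_versions(jar_names):
    """Map each Hadoop version found to the set of hadoop-* modules at that version."""
    versions = defaultdict(set)
    for name in jar_names:
        match = _HADOOP_JAR.match(name)
        if match:
            versions[match.group(2)].add(match.group(1))
    return dict(versions)


def is_consistent(jar_names):
    """True when all hadoop-* JARs share a single version (non-Hadoop JARs are ignored)."""
    return len(hadoop_jar_versions(jar_names)) <= 1
```

Feeding it the base names from a directory listing of the Spark jars folder is enough; more than one key in the result means hadoop-aws and the rest of the Hadoop JAR set are out of step.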

What is the difference between sqlContext.read.load and sqlContext.read.text?

I am only trying to read a textfile into a pyspark RDD, and I am noticing huge differences between sqlContext.read.load and sqlContext.read.text.
s3_single_file_inpath='s3a://bucket-name/file_name'
indata = sqlContext.read.load(s3_single_file_inpath, format='com.databricks.spark.csv', header='true', inferSchema='false',sep=',')
indata = sqlContext.read.text(s3_single_file_inpath)
The sqlContext.read.load command above fails with
Py4JJavaError: An error occurred while calling o227.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
But the second one succeeds?
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
It is not clear to me when to use which of these. Is there a clear distinction between them?

What is the difference between sqlContext.read.load and sqlContext.read.text?
sqlContext.read.load assumes parquet as the data source format while sqlContext.read.text assumes text format.
With sqlContext.read.load you can define the data source format using format parameter.
Depending on the Spark version (1.6 vs 2.x), you may or may not need to load an external Spark package to get support for the csv format.
As of Spark 2.0 you no longer have to load spark-csv Spark package since (quoting the official documentation):
NOTE: This functionality has been inlined in Apache Spark 2.x. This package is in maintenance mode and we only accept critical bug fixes.
That would explain why you got confused as you may have been using Spark 1.6.x and have not loaded the Spark package to have csv support.
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
https://spark.apache.org/docs/1.6.1/sql-programming-guide.html is for Spark 1.6.1, when the spark-csv package was not yet part of Spark. That happened in Spark 2.0.
It is not clear to me when to use which of these. Is there a clear distinction between them?
Actually, there's none if you use Spark 2.x.
If however you use Spark 1.6.x, spark-csv has to be loaded separately using --packages option (as described in Using with Spark shell):
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell
As a matter of fact, you can still use com.databricks.spark.csv format explicitly in Spark 2.x as it's recognized internally.

The difference is:
text is a built-in input format in Spark 1.6
com.databricks.spark.csv is a third party package in Spark 1.6
To use the third-party Spark CSV package (no longer needed in Spark 2.0) you have to follow the instructions on the spark-csv site, for example by providing the
--packages com.databricks:spark-csv_2.10:1.5.0
argument to the spark-submit / pyspark commands.
Beyond that, sqlContext.read.formatName(...) is syntactic sugar for sqlContext.read.format("formatName").load(...) and sqlContext.read.load(..., format=formatName).
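The version-dependent choice described in both answers can be written down as a small helper. This is only a sketch: the function name csv_read_plan is hypothetical, and the package coordinate is the Scala 2.10 one quoted above, which would need adjusting for a different Scala build.

```python
def csv_read_plan(spark_version):
    """Pick the CSV data source name and any extra --packages value for a Spark version."""
    major = int(spark_version.split(".")[0])
    if major >= 2:
        # CSV support is built into Spark 2.x: no external package needed.
        return {"format": "csv", "packages": None}
    # Spark 1.x: the third-party spark-csv package must be on the classpath.
    return {
        "format": "com.databricks.spark.csv",
        "packages": "com.databricks:spark-csv_2.10:1.5.0",
    }


# Usage inside a pyspark session (not executed here):
#   plan = csv_read_plan(sc.version)
#   indata = sqlContext.read.load(s3_single_file_inpath,
#                                 format=plan["format"], header="true")
```

On Spark 1.6 the session itself still has to be started with --packages set to plan["packages"]; the helper only makes the dependency explicit.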

Does Spark support true column scans over parquet files in S3?

One of the great benefits of the Parquet data storage format is that it's columnar. If I've got a 'wide' dataset with hundreds of columns, but my query only touches a few of those, then it's possible to read only the data that stores those few columns, and skip the rest.
Presumably this feature works by reading a bit of metadata at the head of a parquet file that indicates the locations on the filesystem for each column. The reader can then seek on disk to read in only the necessary columns.
Does anyone know whether spark's default parquet reader correctly implements this kind of selective seeking on S3? I think it's supported by S3, but there's a big difference between theoretical support and an implementation that properly exploits that support.

This needs to be broken down:
Does the Parquet code get the predicates from Spark? Yes.
Does Parquet then attempt to selectively read only those columns, using the Hadoop FileSystem seek() + read() or readFully(position, buffer, length) calls? Yes.
Does the S3 connector translate these file operations into efficient HTTP GET requests? In Amazon EMR: yes. In Apache Hadoop, you need Hadoop 2.8 on the classpath and must set the property spark.hadoop.fs.s3a.experimental.fadvise=random to trigger random access.
Hadoop 2.7 and earlier handle the aggressive seek() round the file badly, because they always initiate a GET from the offset to the end of the file, get surprised by the next seek, have to abort that connection, reopen a new TCP/HTTPS 1.1 connection (slow, CPU heavy), and do it again, repeatedly. The random IO mode hurts bulk loading of things like .csv.gz, but is critical to getting ORC/Parquet performance.
You don't get the speedup on Hadoop 2.7's hadoop-aws JAR. If you need it you need to update hadoop*.jar and dependencies, or build Spark up from scratch against Hadoop 2.8
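For reference, the random-access setting mentioned above can be turned into spark-submit arguments with a couple of lines of Python. A sketch only: the helper names are hypothetical, and the property has no useful effect unless the Hadoop 2.8+ JAR set is actually on the classpath.

```python
def s3a_read_conf(columnar=True):
    """Choose the S3A input policy: random for Parquet/ORC, sequential for bulk reads."""
    policy = "random" if columnar else "sequential"
    return {"spark.hadoop.fs.s3a.experimental.fadvise": policy}


def as_submit_args(conf):
    """Flatten a settings dict into ['--conf', 'key=value', ...] for spark-submit."""
    args = []
    for key in sorted(conf):
        args += ["--conf", "{}={}".format(key, conf[key])]
    return args
```

For example, as_submit_args(s3a_read_conf()) yields the two arguments --conf and spark.hadoop.fs.s3a.experimental.fadvise=random.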
Note that Hadoop 2.8+ also has a nice little feature: if you call toString() on an S3A filesystem client in a log statement, it prints out all the filesystem IO stats, including how much data was discarded in seeks, aborted TCP connections &c. Helps you work out what's going on.
2018-04-13 warning: do not try to drop the Hadoop 2.8+ hadoop-aws JAR on the classpath along with the rest of the hadoop-2.7 JAR set and expect to see any speedup. All you will see are stack traces. You need to update all the Hadoop JARs and their transitive dependencies.

DISCLAIMER: I don't have a definitive answer and don't want to act as an authoritative source either, but have spent some time on parquet support in Spark 2.2+ and am hoping that my answer can help us all to get closer to the right answer.
Does Parquet on S3 avoid pulling the data for unused columns from S3 and only retrieve the file chunks it needs, or does it pull the whole file?
I use Spark 2.3.0-SNAPSHOT that I built today right from the master.
parquet data source format is handled by ParquetFileFormat which is a FileFormat.
If I'm correct, the reading part is handled by buildReaderWithPartitionValues method (that overrides the FileFormat's).
buildReaderWithPartitionValues is used exclusively when FileSourceScanExec physical operator is requested for so-called input RDDs that are actually a single RDD to generate internal rows when WholeStageCodegenExec is executed.
With that said, I think that reviewing what buildReaderWithPartitionValues does may get us closer to the final answer.
When you look at the following line, you can be assured that we're on the right track.
// Try to push down filters when filter push-down is enabled.
That code path depends on spark.sql.parquet.filterPushdown Spark property that is turned on by default.
spark.sql.parquet.filterPushdown Enables Parquet filter push-down optimization when set to true.
That leads us to parquet-hadoop's ParquetInputFormat.setFilterPredicate iff the filters are defined.
if (pushed.isDefined) {
  ParquetInputFormat.setFilterPredicate(hadoopAttemptContext.getConfiguration, pushed.get)
}
The code gets more interesting a bit later, where the filters are used when the code falls back to parquet-mr (rather than using the so-called vectorized parquet decoding reader). That's the part I don't really understand (except what I can see in the code).
Please note that the vectorized parquet decoding reader is controlled by spark.sql.parquet.enableVectorizedReader Spark property that is turned on by default.
TIP: To know what part of the if expression is used, enable DEBUG logging level for org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat logger.
In order to see all the pushed-down filters you could turn INFO logging level of org.apache.spark.sql.execution.FileSourceScanExec logger on. You should see the following in the logs:
INFO Pushed Filters: [pushedDownFilters]
I do hope that, even if this isn't close to being a definitive answer, it has helped a little, and that someone picks it up where I left off to make it one soon. Hope dies last :)
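The two logger settings from the tip above can be generated as log4j properties lines. This sketch assumes the log4j 1.x properties syntax that Spark 2.x reads from conf/log4j.properties; the helper name and the PUSHDOWN_LOGGERS constant are hypothetical.

```python
# Loggers and levels named in the answer above.
PUSHDOWN_LOGGERS = {
    "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat": "DEBUG",
    "org.apache.spark.sql.execution.FileSourceScanExec": "INFO",
}


def log4j_lines(levels):
    """Render {logger name: level} as log4j 1.x 'log4j.logger.<name>=<LEVEL>' lines."""
    return [
        "log4j.logger.{}={}".format(name, level)
        for name, level in sorted(levels.items())
    ]


if __name__ == "__main__":
    # Append these lines to conf/log4j.properties before starting the job.
    print("\n".join(log4j_lines(PUSHDOWN_LOGGERS)))
```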

The Parquet reader in Spark is just like any other InputFormat.
None of the InputFormats has anything special for S3. The input formats can read from LocalFileSystem, HDFS and S3; no special optimization is done for that.
The Parquet InputFormat will selectively read columns for you, depending on which columns you ask for.
If you want to be dead sure (although predicate push-down works in the latest Spark version), manually select the columns and write the transformations and actions, instead of depending on SQL.

No, predicate pushdown is not fully supported. This, of course, depends on:
Specific use case
Spark version
S3 connector type and version
In order to check your specific use case, you can enable the DEBUG log level in Spark and run your query. Then you can see whether there are "seeks" during S3 (HTTP) requests, as well as how many requests were actually sent. Something like this:
17/06/13 05:46:50 DEBUG wire: http-outgoing-1 >> "GET /test/part-00000-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet HTTP/1.1[\r][\n]"
....
17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Range: bytes 0-7472093/7472094[\r][\n]"
....
17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Length: 7472094[\r][\n]"
Here's an example of an issue report that was opened recently due to the inability of Spark 2.1 to calculate COUNT(*) of all the rows in a dataset based on the metadata stored in the Parquet file: https://issues.apache.org/jira/browse/SPARK-21074
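Reading such wire logs by eye gets tedious, so here is a small sketch that pulls the Content-Range headers out of DEBUG output and reports what fraction of each object was fetched. It assumes the log format shown above; the function name is hypothetical.

```python
import re

# Matches the header as it appears in the wire log, e.g. bytes 0-7472093/7472094
_CONTENT_RANGE = re.compile(r"Content-Range: bytes (\d+)-(\d+)/(\d+)")


def fetched_fractions(log_lines):
    """Return, per Content-Range header found, the fraction of the object requested."""
    fractions = []
    for line in log_lines:
        match = _CONTENT_RANGE.search(line)
        if match:
            start, end, total = (int(g) for g in match.groups())
            # Byte ranges are inclusive, hence the +1.
            fractions.append((end - start + 1) / total)
    return fractions
```

A value of 1.0, as the log excerpt above produces, means the whole file was pulled in a single GET; values well below 1.0 spread over several requests would suggest that only selected column chunks are being read.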
