Can S3DistCp combine .snappy.parquet files? - apache-spark

Can S3DistCp merge multiple files stored as .snappy.parquet output by a Spark app into one file and have the resulting file be readable by Hive?

I was also trying to merge smaller snappy parquet files into larger snappy parquet files.
I used
aws emr add-steps --cluster-id {clusterID} --steps file://filename.json
and
aws emr wait step-complete --cluster-id {clusterID} --step-id {stepID}
The command runs fine, but when I try to read the merged file back using parquet-tools, the read fails with a java.io.EOFException.
I reached out to the AWS support team. They said there is a known issue when using S3DistCp on Parquet files; they are working on a fix but don't have an ETA for it.
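Because a Parquet file cannot simply be concatenated byte-for-byte (the footer and row-group metadata would no longer match), one workaround while that issue stands is to do the merge in Spark itself. A minimal sketch, assuming an existing SparkSession named spark and placeholder S3 paths:

val small = spark.read.parquet("s3://my-bucket/small-snappy-parquet/")   // placeholder input path
small
  .coalesce(1)                               // or repartition(n) for n larger output files
  .write
  .mode("overwrite")
  .option("compression", "snappy")           // keep snappy compression on the merged output
  .parquet("s3://my-bucket/merged-parquet/") // placeholder output path

The result is ordinary snappy Parquet, so Hive can read it as long as the table schema matches.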

Related

Spark RDD S3 saveAsTextFile taking long time

I have a Spark Streaming job on EMR which runs in 30-minute batches, processes the data, and finally writes the output to several different files in S3. The output step to S3 is now taking too long (about 30 minutes). On investigating further, I found that the majority of the time is spent after all tasks have written their data to the _temporary folder (which happens within 20s); the rest is spent by the master node moving the files from the _temporary folder to the destination folder and renaming them, etc. (similar to: Spark: long delay between jobs).
Some other details on the job configuration, file format, etc. are below:
EMR version: emr-5.22.0
Hadoop version: Amazon 2.8.5
Applications: Hive 2.3.4, Spark 2.4.0, Ganglia 3.7.2
S3 files: written using the RDD saveAsTextFile API with an S3A URL; the S3 file format is text
Although the EMRFS output committer is enabled by default for the job, it is not taking effect because we are using RDDs and the text file format, which are only supported from EMR 6.4.0 onwards. One way I can think of to optimize the time taken by the S3 save is to upgrade the EMR version, convert the RDDs to DataFrames/Datasets, and use their APIs instead of saveAsTextFile (a sketch of that conversion is below). Is there any other, simpler way to reduce the time taken by the job?
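As a rough illustration of that conversion, a minimal sketch assuming the job currently ends with an RDD[String] called outputRdd (a hypothetical name) and using a placeholder S3 path:

import spark.implicits._   // spark is the existing SparkSession

// outputRdd: RDD[String] is assumed; writing through the DataFrame/Dataset API
// lets the optimized committers apply instead of the plain saveAsTextFile path
outputRdd.toDF("value")
  .write
  .mode("overwrite")
  .text("s3a://my-bucket/output/")   // placeholder path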
Unless you use an S3-specific committer, your jobs will not only be slow, they will also be incorrect in the presence of failures. As this may matter to you, it is good that the slow job commits are providing an early warning of problems even before worker failures result in invalid output.
Options:
Upgrade. The committers were added for a reason.
Use a real cluster filesystem (e.g. HDFS) as the output, then upload afterwards.
The S3A zero-rename committers do work with saveAsTextFile, but they aren't supported by AWS, and the ASF developers don't test on EMR since it is Amazon's own fork. You might be able to get whatever S3A connector Amazon ships to work, but you'd be on your own if it didn't. A sketch of what enabling them typically involves is below.
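A minimal sketch of turning on the S3A "magic" committer; the property names come from the Hadoop S3A committer documentation and assume Hadoop 3.1+ with the S3A committer classes on the classpath, so verify them against the Hadoop version you actually run:

import org.apache.spark.sql.SparkSession

// Sketch only: assumes Hadoop 3.1+ S3A; check these keys against your Hadoop docs
val spark = SparkSession.builder()
  .appName("s3a-magic-committer-sketch")
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
    "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  .getOrCreate()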

History Server running with different Spark versions

I have a use case where a Spark application runs on one Spark version and publishes its event data to S3, and the history server is then started from the same S3 path but with a different Spark version. Will this cause any problems?
No, it will not cause any problems as long as you can read from the S3 bucket using that specific format. Spark versions are mostly compatible. As long as you can figure out how to work with a specific version, you're good.
EDIT:
Spark will write to the S3 bucket in the data format that you specify. For example, on a PC, if you create a .txt file, any computer can open that file. Similarly, on S3, once you've created a Parquet file, any Spark version can open it; just the API may be different. The settings involved are sketched below.
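For concreteness, a minimal sketch of those settings in spark-defaults.conf style; the bucket path is a placeholder:

# Application side: publish event logs to S3
spark.eventLog.enabled            true
spark.eventLog.dir                s3a://my-bucket/spark-events
# History server side: read the same location (can be a different Spark version)
spark.history.fs.logDirectory     s3a://my-bucket/spark-events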

Why would spark insertInto not write parquet files with a .parquet extension?

I have a tool written in Scala that uses Spark's Dataframes API to write data to HDFS. This is the line that writes the data:
temp_table.write.mode(SaveMode.Overwrite).insertInto(tableName)
One of our internal teams is using the tool on their Hadoop/Spark cluster, and when it writes files to HDFS it does so without a .parquet extension on the files, which (for reasons I won't go into) creates downstream problems for them.
Here is a screenshot provided by that team which shows those files that don't have the .parquet extension:
Note that we have verified that they ARE parquet files (i.e. they can be read using spark.read.parquet(filename))
I have been unable to reproduce this problem in my test environment, when I run the same code there the files get written with a .parquet extension.
Does anyone know what might cause parquet files to not be written with a .parquet extension?

While writing to S3, why do I get a FileNotFoundException?

I'm using Spark-SQL 2.3.1, Kafka, and Java 8 in my project, and would like to use AWS S3 as storage.
I am writing/storing the data consumed from a Kafka topic into an S3 bucket as below:
ds.writeStream()
  .format("parquet")
  .option("path", parquetFileName)
  .option("mergeSchema", true)
  .outputMode("append")
  .partitionBy("company_id")
  .option("checkpointLocation", checkPtLocation)
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start();
But while writing I am getting a FileNotFoundException:
Caused by: java.io.FileNotFoundException: No such file or directory: s3a://company_id=216231245/part-00055-f4f87dc9-a620-41bd-9380-de4ba7e70efb.c000.snappy.parquet
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1931)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:1822)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1763)
I wonder why I'm getting a FileNotFoundException while writing; I am not reading from S3, right?
So what is happening here, and how do I fix it?
This is because S3 is not a file system but an object store, and it does not support the rename semantics that HDFS provides. Spark first writes the output files to a temporary folder and then renames them, and there is no atomic way of doing this in S3. That's why you will sometimes see these errors.
Now, to fix this: if your environment allows it, you could use HDFS as intermediate storage and move the files to S3 for later processing (see the sketch after these options).
If you are on Hadoop 3.1, you could use the S3A committers shipped with it. More details on how to configure this can be found here.
If you are on an older version of Hadoop, you could use an S3 output committer for Spark, which basically uses S3's multipart upload to mimic the rename. One such committer I am aware of is this. It looks like it hasn't been updated recently, though. There may be other options too.
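A minimal sketch of the HDFS-as-intermediate approach, written Scala-style against the same streaming query; the HDFS paths are placeholders and ds is the Dataset from the question:

import org.apache.spark.sql.streaming.Trigger

// Stage the streaming output (and checkpoint) on HDFS, where rename is atomic
ds.writeStream
  .format("parquet")
  .option("path", "hdfs:///staging/parquet-output")            // placeholder HDFS path
  .option("checkpointLocation", "hdfs:///staging/checkpoints") // placeholder HDFS path
  .outputMode("append")
  .partitionBy("company_id")
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start()
// Completed files can then be copied to S3 out of band, e.g. with s3-dist-cp or DistCp.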

How to export the output of an Apache Spark program to a CSV or text file when using Spark on Amazon EMR

I would like to know how we can write the output of running an SVM algorithm to a CSV file. I am hosting my Spark cluster on AWS EMR, so any files I access have to be saved to and read from S3 only. When I use the saveAsTextFile command and specify an S3 path, I don't see the output file(s) being stored in S3. Any suggestions in this regard?
You can use Spark's saveAsTextFile action to write the results to a file. An example is available here; a short sketch is also given below.
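A minimal sketch, assuming the SVM evaluation produced an RDD[(Double, Double)] called predictionAndLabel (a hypothetical name) and using a placeholder bucket:

// Format each (prediction, label) pair as a CSV line and save to S3.
// On EMR, the s3:// scheme goes through EMRFS; saveAsTextFile writes a
// directory of part-* files, so coalesce(1) first if a single file is needed.
val lines = predictionAndLabel.map { case (prediction, label) => s"$prediction,$label" }
lines.coalesce(1).saveAsTextFile("s3://my-bucket/svm-output/")   // placeholder path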
