While writing to S3, why I get FileNotFoundException - apache-spark

I'm using Spark-SQL-2.3.1, Kafka, Java 8 in my project, and would like to use AWS-S3 as savage storage.
I am writing/storing the consumed data from Kafka topic into S3 bucket as below:
ds.writeStream()
.format("parquet")
.option("path", parquetFileName)
.option("mergeSchema", true)
.outputMode("append")
.partitionBy("company_id")
.option("checkpointLocation", checkPtLocation)
.trigger(Trigger.ProcessingTime("25 seconds"))
.start();
But while writing I am getting a FileNotFoundException
Caused by: java.io.FileNotFoundException: No such file or directory: s3a://company_id=216231245/part-00055-f4f87dc9-a620-41bd-9380-de4ba7e70efb.c000.snappy.parquet
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1931)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:1822)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1763)
I wounder why I'm getting FileNotFoundException when writing? i am not reading from S3 right?
So what is happening here and how fix this?

This is because S3 is not a file system, but an object store. It does not support the semantics required for rename like HDFS. Spark first writes the output files to temporary folder and then rename them. There is no atomic way of doing this in S3. That's why at times, you will see these errors.
Now, to fix this, if your environment allows, you could use HDFS as an intermediate storage and move the files to S3 for later processing.
If you are on hadoop 3.1, you could use s3a committers shipped with it. More details on how to configure this can be found here
If you are on older version of hadoop, you could use an S3 output committer for Spark, which basically uses S3's multi-part upload to mimic this rename. One such committer I am aware of is this. Looks like this is not updated recently though. There may be other options too.

Related

Spark RDD S3 saveAsTextFile taking long time

I have a Spark Streaming job on EMR which runs on batches of 30 mins, processes the data and finally writes the output to several different files in S3. Now the output step to S3 is taking too long (about 30mins) to write the files to S3. On investigating further, I found that the majority time taken is after all tasks have written the data in temporary folder (happens within 20s) and rest of the time taken is due to the fact that the master node is moving the S3 files from _temporary folder to destination folder and renaming them etc. (Similar to: Spark: long delay between jobs)
Some other details on the job configurations, file format etc are as below:
EMR version: emr-5.22.0
Hadoop version:Amazon 2.8.5
Applications:Hive 2.3.4, Spark 2.4.0, Ganglia 3.7.2
S3 files: Done using RDD saveAsTextFile API with S3A URL, S3 file format is text
Now although the EMRFS output committer is enabled by default in the job but it is not working since we are using RDDs and text file format which is supported post EMR 6.40 version only. One way that I can think of for optimizing the time taken in S3 save is by upgrading the EMR version, converting RDDs to DataFrames/Datasets and using their APIs instead of saveAsTextFile. Is there any other simpler solution possible to optimize the time taken for the job?
Is there any other simpler solution possible to optimize the time taken for the job?
unless you use an s3-specific committer, your jobs will not only be slow, they will be incorrect in the presence of failures. As this may matter to you,it is good that the slow job commits are providing an early warning of problems even before worker failures result in invalid output
options
upgrade. the committers were added for a reason.
use a real cluster fs (e.g HDFS) as the output then upload afterwards.
The s3a zero rename committers do work in saveAsTextFile, but they aren't supported by AWS and the ASF developers don't test on EMR as it is amazon's own fork. you might be able to get any s3a connector amazon ship to work, but you'd be on your own if it didn't.

Does S3 Strong Consistency mean it is safe to use S3 as a checkpointing location for Spark Structured Streaming applications?

In the past, the general consensus was such that you should not use S3 as checkpointing location for Spark Structured Streaming applications.
However, now that S3 offers strong read after write consistency, is it safe to use S3 as a checkpointing location? If it is not safe, why?
In my experiments, I continue to see checkpointing related exceptions in my Spark Structured streaming applications, but I am uncertain where the problem actually lies.
not really. you get consistency of list and updates, but rename is still mocked with copy and delete...and I think the standard checkpoint algorithm depends on it.
hadoop 3.3.1 added a new API, Abortable to aid with a custom S3 stream checkpoint committer -the idea os that the checkpointer woudl write sstraight to the destination, but abort the write when aborting the checkpoint. a normal close() would finish the write and manifest the file. see https://issues.apache.org/jira/browse/HADOOP-16906
AFAIK nobody has done the actual committer. opportunity for you to contribute there...
You really answer your own question. You do not state if on Databricks or EMR so I am going to assume EC2.
Use HDFS as checkpoint location on local EC2 disk.
Where I am now we have HDFS using HDP and IBM S3, HDFS is used still for checkpointing.

History Server running with different Spark versions

I have a use case where spark application is running in one spark version, the event data is published to s3, and start history server from the same s3 path, but with different spark version. Will this cause any problems?
No, it will not cause any problem as long as you can read from S3 bucket using that specific format. Spark versions are mostly compatible. As long as you can figure out how to work in specific version, you're good.
EDIT:
Spark will write to S3 bucket in the data format that you specify. For example, on PC if you create txt file any computer can open that file. Similarly on S3, once you've created Parquet file any Spark version can open it, jus the API may be different.

How to use new Hadoop parquet magic commiter to custom S3 server with Spark

I have spark 2.4.0 and Hadoop 3.1.1. According to Hadoop Documentation, to use the new Magic committer that allow write of parquet files to S3 consistently, I've setup those values in conf/spark-default.conf:
spark.sql.sources.commitProtocolClass com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.name magic
spark.hadoop.fs.s3a.committer.magic.enabled true
When using this configuration I end up with the exception:
java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
My question is double, first do I properly understand that Hadoop 3.1.1 allow write of parquet file to S3 consistently ?
Second, if I did understand well, how to use the new committer properly from Spark ?
Edit:
OK, I've two intances of server one being a bit old now, I've attempted to use last version of minio with those parameters:
sc.hadoopConfiguration.set("hadoop.fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("hadoop.fs.s3a.fast.upload","true")
sc.hadoopConfiguration.set("hadoop.fs.s3a.fast.upload.buffer","bytebuffer")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.multipart.size","128M")
sc.hadoopConfiguration.set("fs.s3a.fast.upload.active.blocks","4")
sc.hadoopConfiguration.set("fs.s3a.committer.name","partitioned")
I'm able to write so far without trouble.
However my swift server which is a bit older with this config:
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
seems to not support properly the partionner.
Regarding "Hadoop S3guard":
It is not possible currently, Hadoop S3guard that keep metadata of the S3 files must be enable in Hadoop. The S3guard though rely on DynamoDB a proprietary Amazon service.
There's no alternative now like a sqlite file or other DB system to store the metadata.
So if you're using S3 with minio or any other S3 implementation, you're missing DynamoDB.
This article explains nicely how works S3guard
Kiwy: that's my code: I can help you with this. Some of the classes haven't got into the ASF spark releases, but you'll find the in the Hadoop JARs, and I could have a go at building the ASF release with the relevant dependencies in (I could put them in downstream; they used to be there)
You do not need S3Guard turned on to use the "staging committer"; it's only the "magic" variant which needs consistent object store listings during the commit phase.
All the new committers config documentation I've read up to date, is missing one fundamental fact:
spark 2.x.x does not have needed support classes to make new S3a committers to function.
They promise those cloud integration libs will be bundled with spark 3.0.0, but for now you have to add libraries yourself.
Under the cloud integration maven repos there are multiple distributions supporting the committers, I found one working with directory committer but not the magic.
In general the directory committer is the recommended over magic as it has been well tested and tried. It requires shared filesystem (magic committer does not require one, but needs s3guard) such as HDFS or NFS (we use AWS EFS) to coordinate spark worker writes to S3.

Can S3DistCp combine .snappy.paruqet files?

Can S3DistCp merge multiple files stored as .snappy.parquet output by a Spark app into one file and have the resulting file be readable by Hive?
I was also trying to merge smaller snappy parquet files into larger snappy parquet files.
Used
aws emr add-steps --cluster-id {clusterID} --steps file://filename.json
and
aws emr wait step-complete --cluster-id {clusterID} --step-id {stepID}
Command runs fine but when I try to read the merged file back using parquet-tools, read fails with java.io.EOFException.
Reached out to AWS support team. They said they have a known issue when using s3distcp on parquet files and they are working on a fix but don't have an ETA for the fix.

Resources