Spark S3 Eventual Consistency Issues - apache-spark

I have several Spark jobs that write data to and read data from S3. Occasionally (about once per week for approximately 3 hours), the Spark jobs will fail with the following exception:
org.apache.spark.sql.AnalysisException: Path does not exist.
I've determined that this is likely due to S3's consistency model, where list operations are eventually consistent. S3Guard claims to solve this issue, but I'm in a Spark environment that doesn't support that utility.
Has anyone else run into this issue and figured out a reasonable approach for dealing with it?

If you are using AWS EMR, they offer consistent EMR (the EMRFS consistent view).
If you are using Databricks, they offer a consistency mechanism in their transactional I/O.
Both HDP and CDH ship with S3Guard.
If you are running your own home-rolled Spark stack, move to Hadoop 2.9+ to get S3Guard; even better, Hadoop 3.1 for the zero-rename S3A committers (see the config sketch after this list).
Otherwise: don't use S3 as the direct destination of your work.
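For the home-rolled option, here is a minimal sketch of wiring Spark to the S3A "magic" committer on Hadoop 3.1+, assuming the spark-hadoop-cloud module is on the classpath; the bucket path is a placeholder, and the property names should be checked against the Hadoop docs for the version you actually run:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3a-committer-sketch")
      // route Spark's output commit protocol through the S3A committers
      .config("spark.sql.sources.commitProtocolClass",
              "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
              "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      // pick the zero-rename "magic" committer (alternatives: "directory", "partitioned")
      .config("spark.hadoop.fs.s3a.committer.name", "magic")
      .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
      .getOrCreate()

    // job commit now completes multipart uploads instead of renaming directories
    spark.range(1000).write.mode("overwrite").parquet("s3a://my-bucket/output/")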

Related

Duplicate records as result of Spark Task/Executor Failure

I have a scenario regarding Spark job and want to understand the behavior.
Scenario:
I am reading data using JDBC connector and writing it to HDFS.
As per my understanding, there won't be any data duplication or loss in HDFS even if an executor/task fails, since the same SQL query will be re-executed on the RDBMS. Please correct me if I am wrong.
How will an eventually consistent target like S3 behave? Is there any concern with using it as a target?
How will a strongly consistent target like a GCS bucket behave?
Thanks in advance
S3 is fully consistent now, but awfully slow on rename, and rename doesn't reliably fail if the destination exists. You need to use a custom S3 committer (the S3A committers, the EMR-optimized committer) for Spark/MapReduce. Consult the Hadoop and EMR docs for details.
Google GCS is consistent, but it doesn't do atomic renames, so the v1 commit protocol, which relies on atomic directory rename, isn't safe. v2 isn't safe anywhere.
The next version of Apache Hadoop adds a new "intermediate manifest committer" for GCS and ABFS performance and correctness.
Finally, Iceberg is fast and safe everywhere (see the sketch below).
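As an illustration of that last point, here is a minimal sketch of writing to an Iceberg table on S3, where a commit is an atomic metadata swap rather than a directory rename; the catalog name "demo", the warehouse path, and the table name are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("iceberg-on-s3-sketch")
      .config("spark.sql.extensions",
              "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.demo.type", "hadoop")
      .config("spark.sql.catalog.demo.warehouse", "s3a://my-bucket/warehouse")
      .getOrCreate()

    // the write commits as a single metadata operation, so failed or retried
    // tasks cannot leave partial output visible to readers
    spark.range(100).toDF("id").writeTo("demo.db.events").createOrReplace()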

Spark RDD S3 saveAsTextFile taking long time

I have a Spark Streaming job on EMR which runs in 30-minute batches, processes the data, and finally writes the output to several different files in S3. The output step to S3 is now taking too long (about 30 minutes). On investigating further, I found that the majority of the time is spent after all tasks have written the data to the temporary folder (which happens within 20s); the rest of the time is due to the master node moving the files from the _temporary folder to the destination folder and renaming them, etc. (Similar to: Spark: long delay between jobs)
Some other details on the job configurations, file format etc are as below:
EMR version: emr-5.22.0
Hadoop version: Amazon 2.8.5
Applications: Hive 2.3.4, Spark 2.4.0, Ganglia 3.7.2
S3 files: Done using RDD saveAsTextFile API with S3A URL, S3 file format is text
Now, although the EMRFS output committer is enabled by default in the job, it is not being used, since we are using RDDs and the text file format, which is only supported from EMR version 6.4.0 onwards. One way I can think of to optimize the time taken by the S3 save is to upgrade the EMR version, convert the RDDs to DataFrames/Datasets, and use their APIs instead of saveAsTextFile. Is there any other simpler solution possible to optimize the time taken for the job?
Is there any other simpler solution possible to optimize the time taken for the job?
Unless you use an S3-specific committer, your jobs will not only be slow, they will be incorrect in the presence of failures. As this may matter to you, it is good that the slow job commits are providing an early warning of problems even before worker failures result in invalid output.
Options:
Upgrade. The committers were added for a reason.
Use a real cluster filesystem (e.g. HDFS) as the output, then upload afterwards (see the sketch after this answer).
The S3A zero-rename committers do work with saveAsTextFile, but they aren't supported by AWS, and the ASF developers don't test on EMR as it is Amazon's own fork. You might be able to get whatever S3A connector Amazon ships to work, but you'd be on your own if it didn't.
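A rough sketch of the second option, with placeholder paths, a toy RDD standing in for whatever the 30-minute batch actually produces, and an illustrative s3-dist-cp step run afterwards on EMR:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hdfs-then-upload-sketch").getOrCreate()
    val sc = spark.sparkContext

    // placeholder for the real batch output
    val batchOutput = sc.parallelize(1 to 1000).map(n => s"record-$n")

    // land the output on the cluster's HDFS, where rename is atomic and cheap
    batchOutput.saveAsTextFile("hdfs:///tmp/batch-output")

    // then copy to S3 outside the job commit path, e.g. as a separate EMR step:
    //   s3-dist-cp --src hdfs:///tmp/batch-output --dest s3://my-bucket/batch-output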

Does S3 Strong Consistency mean it is safe to use S3 as a checkpointing location for Spark Structured Streaming applications?

In the past, the general consensus was that you should not use S3 as a checkpointing location for Spark Structured Streaming applications.
However, now that S3 offers strong read after write consistency, is it safe to use S3 as a checkpointing location? If it is not safe, why?
In my experiments, I continue to see checkpointing related exceptions in my Spark Structured streaming applications, but I am uncertain where the problem actually lies.
Not really. You get consistency of listings and updates, but rename is still mimicked with copy and delete, and I think the standard checkpoint algorithm depends on it.
Hadoop 3.3.1 added a new API, Abortable, to aid with a custom S3 stream checkpoint committer. The idea is that the checkpointer would write straight to the destination, but abort the write when aborting the checkpoint; a normal close() would finish the write and manifest the file. See https://issues.apache.org/jira/browse/HADOOP-16906
AFAIK nobody has written the actual committer yet. Opportunity for you to contribute there...
You really answer your own question. You do not state whether you are on Databricks or EMR, so I am going to assume EC2.
Use HDFS as the checkpoint location, on local EC2 disk (see the sketch below).
Where I am now, we have HDFS (using HDP) alongside IBM S3; HDFS is still used for checkpointing.
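A minimal sketch of that setup in Structured Streaming, with a placeholder source and paths; the data itself can still land in S3 while the checkpoint metadata stays on HDFS:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("checkpoint-on-hdfs-sketch").getOrCreate()

    // placeholder source; substitute your real stream (Kafka, files, ...)
    val events = spark.readStream.format("rate").load()

    val query = events.writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/events/")                   // data goes to S3
      .option("checkpointLocation", "hdfs:///checkpoints/events")  // checkpoint stays on HDFS
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()

    query.awaitTermination()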

Spark as Data Ingestion/Onboarding to HDFS

While exploring various tools like [NiFi, Gobblin etc.], I have observed that Databricks is now promoting the use of Spark for data ingestion/on-boarding.
We have a Spark [Scala] based application running on YARN. So far we have been working on a Hadoop and Spark cluster where we manually place the required data files in HDFS first and then run our Spark jobs later.
Now that we are planning to make our application available to clients, we expect any type and number of files [mainly CSV, JSON, XML etc.] from any data source [FTP, SFTP, any relational or NoSQL database] of huge size [ranging from GBs to PBs].
Keeping this in mind, we are looking for options that could be used for data on-boarding and data sanity checks before pushing the data into HDFS.
Options we are looking at, in order of priority:
1) Spark for data ingestion and sanity checks: as our application is already written for and running on a Spark cluster, we are planning to use it for the data ingestion and sanity tasks as well.
We are a bit worried about Spark's support for many data sources/file types/etc. Also, if we try to copy data from, let's say, an FTP/SFTP source, will all workers write data to HDFS in parallel? Is there any limitation in using it this way? Is any audit trail maintained by Spark during this data copy?
2) NiFi in clustered mode: how good would NiFi be for this purpose? Can it be used for any data source and for any size of file? Will it maintain an audit trail? Would NiFi be able to handle such large files? How large a cluster would be required if we try to copy GBs to PBs of data and perform certain sanity checks on top of that data before pushing it into HDFS?
3) Gobblin in clustered mode: we would like to hear similar answers as for NiFi.
4) Is there any other good option available for this purpose with less infrastructure/cost involved and better performance?
Any guidance/pointers/comparisons for the above-mentioned tools and technologies would be appreciated.
Best Regards,
Bhupesh
After doing some R&D, and considering the fact that using NiFi or Gobblin would demand more infrastructure cost, I have started testing Spark for data on-boarding.
So far I have tried using a Spark job for importing data [present at a remote staging area/node] into my HDFS, and I am able to do that by mounting that remote location on all my Spark cluster worker nodes. Doing this made that location local to those workers, hence the Spark job ran properly and the data was on-boarded to my HDFS (see the sketch below).
Since my whole project is going to be on Spark, keeping the data on-boarding part on Spark does not cost me anything extra. So far it is going well. Hence I would suggest to others as well: if you already have a Spark cluster and a Hadoop cluster up and running, then instead of adding extra cost [where cost could be a major constraint], go for a Spark job for data on-boarding.
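A minimal sketch of the approach described above, assuming the remote staging area is mounted at the same local path (here /mnt/staging, a placeholder) on every worker node; the "id" column used for the sanity check is illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("onboarding-sketch").getOrCreate()

    // the mount makes the staging files look local to every worker
    val raw = spark.read
      .option("header", "true")
      .csv("file:///mnt/staging/incoming/*.csv")

    // simple sanity check before landing the data: drop rows missing the key column
    val clean = raw.filter(raw("id").isNotNull)

    // all workers write their partitions to HDFS in parallel
    clean.write.mode("append").parquet("hdfs:///data/onboarded/")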

Spark need of HDFS

Hi, can anyone explain to me: does Apache Spark Standalone need HDFS?
If it is required, how does Spark use the HDFS block size during Spark application execution?
I mean, I am trying to understand what HDFS's role would be during Spark application execution.
Spark documentation says that the processing parallelism is controlled through RDD partitions and the executors/cores.
Can anyone please help me understand?
Spark can work without any issues without using HDFS, and it is most certainly not required for core execution.
Some distributed storage (not necessarily HDFS) is required for checkpointing and is useful for saving results (see the sketch below).
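A minimal sketch of a standalone job that never touches HDFS, with the checkpoint directory on a shared mount reachable from every node; the master URL and paths are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("spark://master-host:7077")   // standalone cluster manager, no Hadoop required
      .appName("no-hdfs-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // any storage visible to all nodes works; HDFS is one option, not a requirement
    sc.setCheckpointDir("file:///shared/nfs/checkpoints")

    // parallelism comes from the number of partitions and the executors/cores
    val squared = sc.parallelize(1 to 1000, numSlices = 8).map(x => x * x)
    squared.checkpoint()                    // lineage truncation needs the checkpoint dir above
    println(squared.sum())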
