Spark S3Guard - Skip listing S3 - apache-spark

I'm using Spark (2.4) to process data stored on S3.
I'm trying to understand whether there's a way to avoid listing the objects I'm reading as my batch job's input (I'm talking about ~1M objects).
I know about S3Guard that stores the objects metadata, and thought that I can use it for skipping the S3 listing.
I've read this Cloudera blog post:
Note that it is possible to skip querying S3 in some cases, just
serving results from the Metadata Store. S3Guard has mechanisms for
this but it is not yet supported in production.
I know it's quite old; is it available in production yet?

As of July 2019 it is still tagged as experimental; HADOOP-14936 lists the remaining tasks.
The recent work has mostly been on corner cases you aren't going to encounter on a daily basis, but which we know exist and can't ignore.
The specific feature you are talking about, "auth mode", relies on all clients using S3Guard and updating the tables, and on us being happy that we can handle the failure conditions for consistency.
For a managed table, I'm going to say Hadoop 3.3 will be ready to use this. For Hadoop 3.2, it's close. Really, more testing is needed.
In the meantime, if you can't reduce the number of files in S3, can you at least make sure you don't have a deep directory tree? It's the recursive treewalk that really suffers here.
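For context, this is the kind of configuration involved. A rough sketch, assuming a Hadoop 3.x S3A client with the S3Guard DynamoDB metadata store available; the property names are the standard fs.s3a.* ones, while the app name and bucket are made up:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3guard-auth-sketch")
  // keep object metadata in DynamoDB via S3Guard
  .config("spark.hadoop.fs.s3a.metadatastore.impl",
    "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
  // "auth mode": treat the metadata store as authoritative and skip S3 LIST calls where possible
  .config("spark.hadoop.fs.s3a.metadatastore.authoritative", "true")
  .getOrCreate()

val inputs = spark.read.parquet("s3a://my-guarded-bucket/inputs/")

Whether the listing is actually skipped still depends on every writer to that bucket going through S3Guard, which is exactly the "all clients" caveat above.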

Related

Incremental Data Storage

I have time series daily data which I run a model on. The model runs in Spark.
I only want to run the model daily, and append the results to the historic results. It is important to have a 'merged single data source' containing historical data for the model to run successfully.
I have to use an AWS service to store the results. If I store them in S3, I will end up with the backfill plus one file per day (too many files). If I store them in Redshift, it doesn't support merge/upsert, so that gets complicated. The customer-facing data is in Redshift, so dropping the table and reloading daily is not an option.
I am not sure how to cleverly (defined as minimal cost and subsequent processing) store the incremental data without re-processing everything daily to get a single file.
S3 is still your best shot. Since your data doesn't seem to need to be accessed in real time, it's more of a rolling data set.
If you are worried about the number of files it generates, there are at least two things you can do:
S3 object lifecycle management
You can define your objects to be removed, or transitioned to another (cheaper) storage class, after x days.
More examples: https://docs.aws.amazon.com/AmazonS3/latest/dev/lifecycle-configuration-examples.html
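If you'd rather set the rule from code than from the console, here is a rough sketch using the AWS SDK for Java (v1); the bucket name, rule id, prefix and 90-day retention are all made-up values:

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration

val s3 = AmazonS3ClientBuilder.defaultClient()
// expire objects under the given prefix after 90 days
val rule = new BucketLifecycleConfiguration.Rule()
  .withId("expire-old-model-output")
  .withPrefix("model-output/")
  .withExpirationInDays(90)
  .withStatus(BucketLifecycleConfiguration.ENABLED)
s3.setBucketLifecycleConfiguration("my-results-bucket",
  new BucketLifecycleConfiguration().withRules(rule))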
S3 notification
Basically you can set up a listener on your S3 bucket, 'listening for' all the objects that match your specified prefix and suffix, to trigger other AWS services. One easy thing you can do is trigger a Lambda, do your processing, and then do whatever else you'd like.
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/enable-event-notifications.html
Use S3 as your database whenever it's possible. It's damn cheap and it's AWS's backbone.
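As a concrete illustration of the rolling data set idea, here is a minimal sketch (the bucket, paths and run_date column are made up) that appends each day's model output as its own partition, so historic data is never rewritten and the model can still read the whole prefix as one merged source:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("daily-append-sketch").getOrCreate()
val dailyResults = spark.read.parquet("s3a://my-bucket/staging/2019-07-01/") // today's model output
dailyResults
  .withColumn("run_date", lit("2019-07-01")) // partition column for the day
  .write
  .mode("append")                            // adds new objects only, never rewrites history
  .partitionBy("run_date")
  .parquet("s3a://my-bucket/model-output/")  // the single merged data source the model reads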
You can also switch to an ETL tool. A very efficient one, which is open source, specialized in big data, fully automatable and easy to use, is the Pentaho Data Integrator.
It comes equipped with ready-made plugins for S3, Redshift (and others), and there is a single step to compare with previous values. From my experience it runs pretty fast. Plus it works for you during the night and sends you a morning mail saying everything went OK (or not).
Note to the moderators: this is an agnostic point of view; I could have recommended many others, but this one seems the best suited to the OP's need.

Lack of Data Node locality when working with S3 for HDFS - eventual consistency

Two points here after reading https://wiki.apache.org/hadoop/AmazonS3
Not sure what to make of this below.
...
eventual consistency: changes made by one application (creation, updates and deletions) will not be visible until some undefined time.
...
Some undefined time? What does that mean for writing Spark applications then? If I have n jobs, does that mean something may not yet be visible?
How does Spark's default partitioning apply then for S3 data?
That Hadoop doc is a bit out of date; I'd google for "spark and object stores" to get more up-to-date material.
The Spark documentation has some Spark-specific details.
Some undefined time? What does that mean for writing Spark applications then?
Good question. AWS never give hard data here; the best empirical study is Benchmarking Eventual Consistency: Lessons Learned from Long-Term Experimental Studies.
That showed that consistency delays depend on total AWS load, and had some other patterns. Because it's so variable, nobody dares give a good value for "undefined time".
My general expectations are
Normally list inconsistency can take a few seconds, but under load it can get worse.
If nothing has actually gone wrong with S3, then a few minutes is enough for listing inconsistencies to be resolved.
All the S3 connectors mimic rename by listing all files under a path, then copying and deleting them, so renaming a directory immediately after processes have written to it may miss data.
Because the means by which Spark jobs commit their output to a filesystem depends on rename(), it is not safe to use the default committers to commit the output of tasks straight to S3.
If I have n jobs, does that mean something may not yet be visible?
It's worse than that. You can't rely on the rename operations within a single job to get it right.
That's why Amazon offer a "consistent EMRFS" option using DynamoDB for listing consistency, and Hadoop 2.9+ has a feature, S3Guard, which uses DynamoDB for that same operation. Neither deals with update inconsistency though, which is why Hadoop 3.1's "S3A committers" default to generating unique filenames for new files.
If you are using the Apache S3A connector to commit work to S3 using the normal filesystem FileOutputCommitter then, without S3Guard, you are at risk of losing data.
Don't worry about chaining work; worry about that.
BTW: I don't know what Databricks do here. Ask them for the details.
How does Spark's default partitioning apply then for S3 data?
The partitioning is based on whatever block size the object store connector makes up. For example, for the s3a connector, it's the value of fs.s3a.blocksize.
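For illustration, a rough sketch of how that connector-reported block size feeds into the number of partitions; fs.s3a.blocksize is the standard S3A property, while the bucket and the 128 MB figure are just examples:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-blocksize-sketch")
  // the "block size" the s3a connector reports for its objects; input splits are sized from it
  .config("spark.hadoop.fs.s3a.blocksize", "134217728") // 128 MB
  .getOrCreate()

val logs = spark.sparkContext.textFile("s3a://mybucket/logs/")
println(logs.getNumPartitions) // roughly total object bytes / 128 MB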

Is it possible to retrieve the list of files when a DataFrame is written, or have Spark store it somewhere?

With a call like
df.write.csv("s3a://mybucket/mytable")
I obviously know where files/objects are written, but because of S3's eventual consistency guarantees, I can't be 100% sure that getting a listing from that location will return all (or even any) of the files that were just written. If I could get the list of files/objects Spark just wrote, then I could prepare a manifest file for a Redshift COPY command without worrying about eventual consistency. Is this possible, and if so, how?
The spark-redshift library can take care of this for you. If you want to do it yourself you can have a look at how they do it here: https://github.com/databricks/spark-redshift/blob/1092c7cd03bb751ba4e93b92cd7e04cffff10eb0/src/main/scala/com/databricks/spark/redshift/RedshiftWriter.scala#L299
EDIT: I avoid further worry about consistency by using df.coalesce(fileCount) to output a known number of file parts (for Redshift you want a multiple of the slices in your cluster). You can then check how many files are listed by the Spark code and how many files were loaded according to Redshift's stl_load_commits.
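For what it's worth, a rough sketch of that coalesce approach, reusing the df and SparkSession from the question; fileCount and the output path are made up, and the listing here is only a sanity check, not a consistency guarantee:

import org.apache.hadoop.fs.Path

val fileCount = 8 // e.g. a multiple of the slices in your Redshift cluster
val out = "s3a://mybucket/mytable"
df.coalesce(fileCount).write.csv(out) // write a known number of part files
val fs = new Path(out).getFileSystem(spark.sparkContext.hadoopConfiguration)
val parts = fs.listStatus(new Path(out)).count(_.getPath.getName.startsWith("part-"))
require(parts == fileCount, s"expected $fileCount part files, saw $parts")
// then compare against what Redshift reports in stl_load_commits after the COPY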
It's good to be aware of the consistency risks; you can hit them in listings, with delayed visibility of newly created objects and deleted objects still being found.
AFAIK, you can't get a list of the files created, as tasks can generate whatever they want in their task output dir, which is then marshalled (via listing and copy) into the final output dir.
In the absence of a consistency layer atop S3 (S3mper, S3Guard, etc.), you can read and spin for "a bit" to allow the shards to catch up. I have no good idea of what a good value of "a bit" is.
However, if you are calling df.write.csv(), you may already have been caught by listing inconsistencies within the committer used to propagate task output to the job dir, as that's done in S3A via list + copy.

How to read a large number of large files on NFS and dump them to HDFS

I am working with some legacy systems in the investment banking domain, which are very unfriendly in the sense that the only way to extract data from them is through a file export/import. Lots of trading takes place, and a large number of transactions are stored on these systems.
The question is how to read a large number of large files on NFS and dump them onto a system on which analytics can be done by something like Spark or Samza.
Back to the issue: due to the nature of the legacy systems, we are extracting the data and dumping it into files. Each file is hundreds of gigabytes in size.
I feel the next step is to read these and dump them to Kafka or HDFS, or maybe even Cassandra or HBase, the reason being that I need to run some financial analytics on this data. I have two questions:
How to efficiently read a large number of large files which are located on one or several machines
Apparently you've discovered already that mainframes are good at writing large numbers of large files. They're good at reading them too. But that aside...
IBM has been pushing hard on Spark on z/OS recently. It's available for free, although if you want support, you have to pay for that. See: https://www-03.ibm.com/systems/z/os/zos/apache-spark.html My understanding is that z/OS can be a peer with other machines in a Spark cluster.
The z/OS Spark implementation comes with a piece that can read data directly from all sorts of mainframe sources: sequential, VSAM, DB2, etc. It might allow you to bypass the whole dump process and read the data directly from the source.
Apparently Hadoop is written in Java, so one would expect that it should be able to run on z/OS with little problem. However, watch out for ASCII vs. EBCDIC issues.
On the topic of using Hadoop with z/OS, there are a number of references out there, including a Redpaper: http://www.redbooks.ibm.com/redpapers/pdfs/redp5142.pdf
You'll note that in there they make mention of using the CO:z toolkit, which I believe is available for free.
However you mention "unfriendly". I'm not sure if that means "I don't understand this environment as it doesn't look like anything I've used before" or it means "the people I'm working with don't want to help me". I'll assume something like the latter since the former is simply a learning opportunity. Unfortunately, you're probably going to have a tough time getting the unfriendly people to get anything new up and running on z/OS.
But in the end, it may be best to try to make friends with those unfriendly z/OS admins as they likely can make your life easier.
Finally, I'm not sure what analytics you're planning on doing with the data. But in some cases it may be easier/better to move the analytics process to the data instead of moving the data to the analytics.
The simplest way to do this is zconnector, an IBM product for data ingestion from the mainframe to a Hadoop cluster.
I managed to find an answer. The biggest bottleneck is that reading files is essentially a serial operation; that is the most efficient way to read from a disk. So for one file I am stuck with a single thread reading it from NFS and sending it to HDFS or Kafka via their APIs.
So it appears the best way is to make sure that the source the data comes from dumps files into multiple NFS folders. From that point onward I can run multiple processes to load data to HDFS or Kafka, since those are highly parallelized.
How to load? One good way is to mount the NFS into the Hadoop infrastructure and use distcp. There are other possibilities too, which open up once we make sure files are available from a large number of NFS mounts. Otherwise remember: reading a file is a serial operation. Thanks.

Spark save: does it collect and save, or save from each node?

We have a Spark cluster with 10 nodes. I have a process that joins a couple of dataframes and then saves the result to an S3 location. We are running in cluster mode. When I call save on the dataframe, does it save from the nodes, or does it collect all the results to the driver and write them from the driver to S3? Is there a way to verify this?
RDD.save() triggers an evaluation of the entire query.
The work is partitioned by source data (i.e. files) and whatever splitting can be done; individual tasks are pushed out to the available executors, each of which writes its own results, and the output is finally committed to the destination directory using the cross-node protocol defined in implementations of FileCommitProtocol, generally HadoopMapReduceCommitProtocol, which then works with Hadoop's FileOutputCommitter to choreograph the commit.
Essentially:
Tasks write to their task-specific subdir under _temporary/$job-attempt/$task-attempt
Tasks say they are ready to commit; the Spark driver tells them to commit vs abort
In speculative execution or failure conditions, tasks can abort, in which case they delete their temp dir
On a task commit, the task lists the files in its dir and renames them to the job attempt dir (v1 protocol), or directly to the destination (v2 protocol)
On a job commit, the driver either lists and renames the files in the job attempt dir (v1 protocol), or is a no-op (v2).
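As a rough sketch of where that v1/v2 distinction surfaces in configuration (mapreduce.fileoutputcommitter.algorithm.version is the standard Hadoop property; the app name, dataframe and output path are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("commit-protocol-sketch").getOrCreate()
// v1: tasks rename into the job attempt dir, the driver renames again at job commit
// v2: task commit renames directly into the final destination
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "1")

val df = spark.range(1000).toDF("id")
df.write.mode("overwrite").parquet("hdfs:///tmp/commit-protocol-demo")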
On the question of "writing to S3": if you are using Apache Spark (and not Amazon EMR), then be aware that this list + rename commit mechanism is (a) slow, as renames are really copies, and (b) dangerous, as the eventual consistency of S3, especially list inconsistency, means that files saved by tasks may not be listed and hence not committed.
At the time of writing (May 2017), the sole committer known to safely commit using s3a or s3n clients is the netflix committer.
There's work underway to pull this into Hadoop and hence spark, but again, in May 2017 it's still a work in progress: demo state only. I speak as the engineer doing this.
To close then: if you want reliable data output when writing to S3, consult whoever is hosting your code on EC2. If you are using out-the-box Apache Spark without any supplier-specific code, do not write direct to S3. It may work in tests, but as well as seeing intermittent failures, you may lose data and not even notice. Statistics are your enemy here: the more work you do, the bigger the datasets and the more tasks that are executed, so eventually something will go wrong.
