Using S3 data in a Spark Application - apache-spark

I am new to Spark and have some fundamental doubts. I am working on a PySpark application that is supposed to process 500K items. The current implementation is not efficient and takes forever to complete.
I will briefly explain the tasks.
The application processes an S3 directory: it is supposed to process all the files present under s3://some-bucket/input-data/. The S3 directory structure looks like this:
s3://some-bucket/input-data/item/Q12/sales.csv
s3://some-bucket/input-data/item/Q13/sales.csv
s3://some-bucket/input-data/item/Q14/sales.csv
The CSV files don't have an item identifier column. The name of the directory is the item identifier, like Q11, Q12, etc.
The application has a UDF defined which downloads the data using boto3, processes it, and then dumps the result back to S3 in a directory structure like this:
s3://some-bucket/output-data/item/Q12/profit.csv
s3://some-bucket/output-data/item/Q13/profit.csv
s3://some-bucket/output-data/item/Q14/profit.csv
Making 500K API calls to S3 for the data doesn't seem right to me. I am running the Spark application on EMR; should I download all the data as a bootstrap step?
Can S3DistCp (s3-dist-cp) solve the issue by copying the whole dataset to HDFS so that the workers/nodes can access it from there? Suggestions on how to use s3-dist-cp would be very helpful.
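One way to avoid 500K per-item downloads (a rough sketch, not the original implementation) is to let Spark read every sales.csv in a single pass and recover the item identifier from the file path with input_file_name(). The header option, the column names, and the regex below are assumptions, and partitionBy writes directories named item_id=Q12 rather than exactly Q12/profit.csv:

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract

spark = SparkSession.builder.appName("sales-to-profit").getOrCreate()

# Read every sales.csv under the input prefix in one pass; the S3 connector
# lists and fetches the objects in parallel, so no per-item boto3 calls are needed.
sales = (
    spark.read
    .option("header", "true")  # assumes the CSVs have a header row
    .csv("s3://some-bucket/input-data/item/*/sales.csv")
    # Recover the item identifier (Q12, Q13, ...) from the file path,
    # since the CSVs themselves carry no identifier column.
    .withColumn("item_id", regexp_extract(input_file_name(), "/item/([^/]+)/", 1))
)

# Placeholder for the real profit calculation; the logic from the existing UDF
# would be expressed here as DataFrame operations instead.
profit = sales  # e.g. sales.groupBy("item_id").agg(...)

# Write one output directory per item under output-data/item/.
profit.write.partitionBy("item_id").mode("overwrite").csv("s3://some-bucket/output-data/item/")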

Related

How to read each file's last modified/arrival time while reading input data from aws s3 using spark batch application

I want to read each file's last modified/arrival time while reading input data from AWS S3 using a Spark batch application.
[Image: S3 listing showing each file's last modified time]
There are two options in my mind:
The first option is to get the name of the last modified file using the AWS SDK (https://medium.com/faun/identifying-the-modified-or-newly-added-files-in-s3-11b577774729) before starting your job.
The second option is to use Structured Streaming. Unfortunately, Structured Streaming can only process new files, not modified files. A simple workaround is to add a new file instead of modifying an existing one (but that may not be possible in your use case).
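A minimal sketch of the first option, assuming boto3 is available on the driver; the bucket name and prefix are placeholders:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Collect (key, last_modified) pairs for every object under the prefix.
objects = []
for page in paginator.paginate(Bucket="some-bucket", Prefix="input-data/"):
    for obj in page.get("Contents", []):
        objects.append((obj["Key"], obj["LastModified"]))

# Sort on the timestamp; the most recently modified object comes last.
objects.sort(key=lambda kv: kv[1])
latest_key, latest_modified = objects[-1]
print(latest_key, latest_modified)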

Read edge DB files from HDFS or S3 in Spark

I have a list of DB files stored in a local folder. When I run the Spark job in local mode I can provide the local path to read those files, but when running in client or cluster mode the path is not accessible; it seems they need to be kept on HDFS or accessed directly from S3.
I am doing the following:
java.io.File directory = new File(dbPath);
All the DB files are present at dbPath. Is there any simple way to access that folder of files from HDFS or from S3, since I am running this Spark job on AWS?
To my knowledge, there isn't a standard way to do this currently. But it seems you could reverse-engineer a dump-reading protocol through a close examination of how the dump is generated.
According to edgedb-cli/dump.rs, it looks like you can open the file with a binary stream reader and ignore the first 15 bytes of a given dump file.
output.write_all(
    b"\xFF\xD8\x00\x00\xD8EDGEDB\x00DUMP\x00\
    \x00\x00\x00\x00\x00\x00\x00\x01"
).await?;
But then it appears the remaining dump gets written to a mutable async future result via:
header_buf.truncate(0);
header_buf.push(b'H');
header_buf.extend(
    &sha1::Sha1::from(&packet.data).digest().bytes()[..]);
header_buf.extend(
    &(packet.data.len() as u32).to_be_bytes()[..]);
output.write_all(&header_buf).await?;
output.write_all(&packet.data).await?;
with a SHA-1 digest of the packet data.
Unfortunately, we're in the dark at this point because we don't know what the byte sequences of the header_buf actually say. You'll need to investigate how the undigested contents look in comparison to any protocols used by asyncpg and Postgres to verify what your dump resembles.
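For experimentation only, here is a hypothetical Python reader that mirrors nothing more than the write calls shown above: skip the magic/version prefix, then walk frames consisting of a one-byte marker, a 20-byte SHA-1 digest, a 4-byte big-endian length, and the packet data. The real format almost certainly contains other block types, so treat this as a starting point for that investigation, not a verified parser:

import hashlib
import struct

# Magic/version prefix copied verbatim from the write_all() call above.
HEADER = (b"\xFF\xD8\x00\x00\xD8EDGEDB\x00DUMP\x00"
          b"\x00\x00\x00\x00\x00\x00\x00\x01")

def read_packets(path):
    # Yield raw packet payloads, assuming the framing implied by the Rust snippet:
    # b'H' + sha1(data).digest() + len(data) as big-endian u32 + data.
    with open(path, "rb") as f:
        if f.read(len(HEADER)) != HEADER:
            raise ValueError("unexpected dump prefix")
        while True:
            marker = f.read(1)
            if not marker:
                break  # end of file
            if marker != b"H":
                raise ValueError("unknown block marker: %r" % marker)
            digest = f.read(20)                         # SHA-1 of the payload
            (length,) = struct.unpack(">I", f.read(4))  # payload length
            data = f.read(length)
            if hashlib.sha1(data).digest() != digest:
                raise ValueError("checksum mismatch")
            yield data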
Alternatively, you could prepare a shim to the restore.rs with some pre-existing data loaders.

What is the best practice writing massive amount of files to s3 using Spark

I'm trying to write about 30k-60k parquet files to s3 using Spark and it's taking a massive amount of time (40+ minutes) due to the s3 rate limit.
I wonder if there is a best practice for doing such a thing. I heard that writing the data to HDFS and then copying it using s3-dist-cp may be faster, but I can't understand why. Won't the copy from HDFS take the same amount of time because of the S3 rate limit?
Thanks for your help
There is nothing wrong with this approach, and it works absolutely fine in most use cases, but there might be some challenges due to the way files are written to S3.
Two Important Concepts to Understand
S3 (Object Store) != POSIX File System: the Rename Operation
A file rename in a POSIX-based file system is a metadata-only operation: only the pointer changes and the file remains as-is on disk. For example, if I have a file abc.txt and rename it to xyz.txt, the operation is instantaneous and atomic, and xyz.txt's last modified timestamp remains the same as abc.txt's.
In AWS S3 (an object store), by contrast, a rename under the hood is a copy followed by a delete: the source file is first copied to the destination and then the source file is deleted. So "aws s3 mv" changes the last modified timestamp of the destination file, unlike a POSIX file system. S3 is essentially a key-value store where the key is the file path and the value is the file's content, and there is no operation to change a key in place. The cost of a rename therefore depends on the size of the file, and for a directory rename (there is no real directory in S3; for simplicity, think of a recursive set of files under a prefix) it depends on the number of files inside the directory along with the size of each file. In a nutshell, a rename is a very expensive operation in S3 compared to a normal file system.
S3 Consistency Model
S3 comes with two kinds of consistency: (a) read-after-write and (b) eventual consistency, which in some cases results in file-not-found exceptions: files that were added but are not yet listed, or files that were deleted but are not yet removed from the listing.
Deep explanation:
Spark leverages Hadoop's FileOutputCommitter implementations to write data. Writing data involves multiple steps: at a high level, staging the output files and then committing them, i.e. writing the final files. The rename step I mentioned earlier happens here, when moving from the staging to the final location. As you know, a Spark job is divided into multiple stages and sets of tasks, and due to the nature of distributed computing tasks are prone to failure, so there is also a provision to re-launch the same task after a system failure or to speculatively execute slow-running tasks; this leads to the concepts of task commit and job commit functions. There are two readily available algorithms for how task and job commits are done, and neither is strictly better than the other; it depends on where we are committing the data.
mapreduce.fileoutputcommitter.algorithm.version=1
commitTask renames the data generated by a task from the task temporary directory to the job temporary directory.
When all the tasks are complete, commitJob renames all the data from the job temporary directory to the final destination and at the end creates the _SUCCESS file.
Here the driver does the work of commitJob at the end, so for object stores like S3 this may take longer because a lot of task temporary files are queued up for rename operations (it's not serial, though) and write performance is not optimized. It can work pretty well for HDFS, where a rename is cheap and just a metadata change. For AWS S3, each rename during commitJob opens up a huge number of API calls to S3 and may cause unexpected API failures if the number of files is high, or it may not; I have seen both cases for the same job running at two different times.
mapreduce.fileoutputcommitter.algorithm.version=2
commitTask moves the data generated by a task from the task temporary directory directly to the final destination as soon as the task is complete.
commitJob basically writes the _SUCCESS file and doesn't do much.
From a high level this looks optimized, but it comes with limitations: speculative task execution is not safe, and if any task fails due to corrupt data we may end up with residual data in the final destination that needs a clean-up. So this algorithm doesn't give 100% data correctness, and it doesn't work for use cases where we need to append data to existing files. Even though it gives optimized results, it comes with risk. The reason for the better performance is basically the smaller number of rename operations compared to algorithm 1 (there are still renames). We may also encounter file-not-found exceptions here, because commitTask writes files to a temporary path and immediately renames them, so there is a slight chance of eventual-consistency issues.
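For reference, here is roughly how the committer algorithm is selected from a PySpark session (the property name is the standard Hadoop one; whether version 2 is appropriate depends on the trade-offs above, and the output path is a placeholder):

from pyspark.sql import SparkSession

# Sketch: selecting the FileOutputCommitter algorithm when building the session.
# Algorithm 2 renames task output directly to the final destination at task commit;
# see the caveats above before enabling it.
spark = (
    SparkSession.builder
    .appName("committer-example")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

df = spark.range(1000)
# Placeholder output path; the write goes through the configured committer.
df.write.mode("overwrite").parquet("s3://some-bucket/tmp/committer-demo/")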
Best Practices
Here are a few I think we can use while writing Spark data processing applications:
If you have an HDFS cluster available, write data from Spark to HDFS and then copy it to S3 to persist it. s3-dist-cp can be used to copy data from HDFS to S3 optimally, and this way we avoid all the rename operations (a sample invocation is sketched after these notes). With AWS EMR clusters that run only for the duration of the compute and are terminated afterwards, this approach to persisting results looks preferable.
Try to avoid writing files and reading them again and again unless there are consumers for the files. Spark is well known for in-memory processing, and careful persistence/caching of data in memory will help optimize the application's run time.
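As a rough sketch (paths are placeholders, and the available options vary by EMR release, so check the EMR documentation), the HDFS-to-S3 copy step could be run as an EMR step or from the master node like this:

s3-dist-cp \
  --src hdfs:///user/hadoop/output/ \
  --dest s3://some-bucket/output-data/

The --groupBy and --targetSize options can additionally consolidate many small text files during the copy, but they work by concatenating files and so are not suitable for Parquet output.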

Apache Spark: How to read millions (5+ million) small files (10kb each) from S3

A high level overview of my goal: I need to find the file(s) (they are in JSON format) that contain a particular ID. Basically need to return a DF (or a list) of the ID and the file name that contains it.
// Read in the data from S3
val dfLogs = spark.read.json("s3://some/path/to/data")
  .withColumn("fileSourceName", input_file_name())

// Filter for the ID, then select the id and fileSourceName
val results = dfLogs.filter($"id" === "some-unique-id")
  .select($"id", $"fileSourceName")

// Return the results
results.show(false)
Sounds simple enough, right? However, the challenge I'm facing is that the S3 directory I'm reading from contains millions (approximately 5+ million) of files averaging about 10 KB in size: the small file problem! To do this I've been spinning up a 5-node cluster (m4.xlarge) on EMR and using Zeppelin to run the above code interactively.
However, I keep getting the following error when running the first Spark statement (the read):
org.apache.thrift.transport.TTransportException at
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
I'm having a hard time finding out more about the above error, but I suspect it has to do with the requests my Spark job is making to S3.
Does anyone have any suggestions on how to handle so many small files? Should I do an s3-dist-cp from S3 to HDFS on the EMR cluster and then run the query above, but reading from HDFS? Or some other option? This is a one-time activity... is it worth creating a super large cluster? Would that improve the performance or solve my error? I've thought about trying to group the files together into bigger ones... but I need the unique file names that contain the ID.
I would love to change the way in which these files are being aggregated in S3...but there is nothing I can do about it.
Note: I've seen a few posts around here, but they're quite old. Another link, but I do not think that one pertains to my situation.

Spark: How to read & write temporary files?

I need to write a Spark app that uses temporary files.
I need to download many many large files, read them with some legacy code, do some processing, delete the file, and write the results to a database.
The files are on S3 and take a long time to download. However, I can do many at once, so I want to download a large number in parallel. The legacy code reads from the file system.
I think I can not avoid creating temporary files. What are the rules about Spark code reading and writing local files?
This must be a common issue, but I haven't found any threads or docs that talk about it. Can someone give me a pointer?
Many thanks
P
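For illustration (this is not from the thread, and every name below is a placeholder): one common pattern for this workflow is to push the download-process-delete loop into mapPartitions, so each executor works through its own slice of S3 keys using local temporary files, and the results are collected or written to the database afterwards:

import os
import tempfile

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("temp-file-processing").getOrCreate()

def legacy_process(path):
    # Placeholder for the existing file-based processing logic.
    return (path, os.path.getsize(path))

def process_partition(keys):
    # Each executor creates its own boto3 client and its own scratch files.
    s3 = boto3.client("s3")
    results = []
    for key in keys:
        # NamedTemporaryFile gives a path on the executor's local disk that the
        # legacy, file-based code can read; delete=False so cleanup is explicit.
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            local_path = tmp.name
        try:
            s3.download_file("some-bucket", key, local_path)
            results.append(legacy_process(local_path))
        finally:
            os.remove(local_path)  # delete the temporary file when done
    return results

# 'keys' would be the list of S3 object keys to process; parallelize spreads
# the downloads across executors so many transfers run at once.
keys = ["input/a.bin", "input/b.bin"]  # placeholders
processed = spark.sparkContext.parallelize(keys, numSlices=64).mapPartitions(process_partition)
# processed.toDF(...).write.jdbc(...)  # finally, write the results to the database
print(processed.collect())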
