Spark: too many small S3 files with a DDB stream - apache-spark

I am designing a solution that reads a DynamoDB (DDB) stream and writes to an S3 data lake, using Lambda for the S3 writes.
I will be reading a maximum batch size of 10,000 changes in a batch window of roughly 2 minutes. Depending on the velocity of change in the database, my design might also create small S3 files. I have seen customers having issues with Spark not working properly with small S3 files, and I am wondering how common this issue is.
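For context, the usual mitigation is a periodic compaction job that rewrites the accumulated small files into larger ones. A minimal PySpark sketch, assuming hypothetical bucket names, paths, and JSON-formatted change records:

```python
# Hypothetical compaction job: paths, format, and sizing are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-small-file-compaction").getOrCreate()

# Read the many small JSON files the Lambda writer produced for one day.
df = spark.read.json("s3://my-datalake/raw/ddb-changes/2024/01/15/")

# Rewrite into a handful of larger files; choose the partition count so
# each output file lands in the 128 MB to 1 GB range Spark handles well.
df.repartition(8).write.mode("overwrite").parquet(
    "s3://my-datalake/compacted/ddb-changes/2024/01/15/")
```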

Related

Is s3-dist-cp faster for transferring data between S3 and EMR than using Spark and S3 directly?

I'm currently playing around with Spark on an EMR cluster. I noticed that if I perform the reads/writes into/out of my cluster in the Spark script itself, there is an absurd wait time for my output data to show up in the S3 console, even for relatively lightweight files. Would this write be expedited by writing to HDFS in my Spark script and then adding an additional step to transfer the output from HDFS to S3 using s3-dist-cp?
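To illustrate the pattern being asked about, here is a hedged sketch of writing to HDFS first and then copying to S3 with an s3-dist-cp step; the cluster id, paths, and bucket names are placeholders:

```python
# Sketch of the HDFS-first pattern; cluster id, paths, and names are placeholders.
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-first-write").getOrCreate()
df = spark.read.parquet("s3://my-bucket/input/")

# 1. Write the output to HDFS first (fast, rename-friendly storage).
df.write.mode("overwrite").parquet("hdfs:///tmp/job-output/")

# 2. Add an s3-dist-cp step to copy HDFS -> S3 after the Spark job finishes.
emr = boto3.client("emr")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # your EMR cluster id
    Steps=[{
        "Name": "copy-output-to-s3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp",
                     "--src=hdfs:///tmp/job-output/",
                     "--dest=s3://my-bucket/output/"],
        },
    }],
)
```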

Spark RDD S3 saveAsTextFile taking long time

I have a Spark Streaming job on EMR which runs in batches of 30 minutes, processes the data, and finally writes the output to several different files in S3. The output step to S3 is now taking too long (about 30 minutes). On investigating further, I found that the majority of the time is spent after all tasks have written the data to the temporary folder (which happens within 20 s); the rest is due to the master node moving the S3 files from the _temporary folder to the destination folder, renaming them, etc. (Similar to: Spark: long delay between jobs.)
Some other details on the job configuration, file format, etc. are below:
EMR version: emr-5.22.0
Hadoop version: Amazon 2.8.5
Applications: Hive 2.3.4, Spark 2.4.0, Ganglia 3.7.2
S3 files: written using the RDD saveAsTextFile API with an s3a:// URL; the S3 file format is text
Now, although the EMRFS output committer is enabled by default in the job, it is not taking effect, since we are using RDDs and the text file format, which is only supported from EMR 6.4.0 onwards. One way I can think of to optimize the time taken by the S3 save is to upgrade the EMR version, convert the RDDs to DataFrames/Datasets, and use their APIs instead of saveAsTextFile. Is there any other, simpler solution possible to optimize the time taken by the job?
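To make the DataFrame route mentioned above concrete, a minimal sketch of replacing saveAsTextFile with the DataFrame text writer (bucket names and the transform are placeholders):

```python
# Sketch: replace saveAsTextFile with the DataFrame writer (names assumed).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-dataframe-write").getOrCreate()

lines_rdd = spark.sparkContext.textFile("s3a://my-bucket/input/")
processed = lines_rdd.map(lambda line: line.upper())  # stand-in transform

# The text writer needs a single string column; wrap each line in a tuple.
df = processed.map(lambda s: (s,)).toDF(["value"])

# The DataFrame writer goes through the committer machinery, avoiding
# the slow serial rename out of _temporary on the driver.
df.write.mode("overwrite").text("s3a://my-bucket/output/")
```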
Unless you use an S3-specific committer, your jobs will not only be slow, they will be incorrect in the presence of failures. As this may matter to you, it is good that the slow job commits are providing an early warning of problems even before worker failures result in invalid output.
Options:
Upgrade; the committers were added for a reason.
Use a real cluster filesystem (e.g. HDFS) as the output, then upload afterwards.
The S3A zero-rename committers do work with saveAsTextFile, but they aren't supported by AWS, and the ASF developers don't test on EMR as it is Amazon's own fork. You might be able to get whatever S3A connector Amazon ships to work, but you'd be on your own if it didn't.
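For reference, a hedged sketch of enabling the S3A magic committer, assuming a Hadoop 3.x S3A connector and the spark-hadoop-cloud bindings on the classpath:

```python
# Sketch: enabling the S3A "magic" committer. Assumes Hadoop 3.x S3A jars
# and the spark-hadoop-cloud module are on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-magic-committer")
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
    # Route Spark SQL/DataFrame commits through the cloud committer bindings.
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)
```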

Streaming Couchbase data to S3

I'd like to stream the contents of a Couchbase bucket to S3 as Parquet using a Spark job. I'm currently leveraging Couchbase's Spark streaming integration with the Couchbase connector to generate a DStream, but within each DStream there are multiple RDDs that only contain around 10 records each. I could create a file for each RDD and upload them individually to S3, but considering that I have 12 million records to import, I would be left with around a million small files in S3, which is not ideal. What would be the best way to load the contents of a Couchbase bucket into S3 using a Spark job? I'd ultimately like to have a single Parquet file with all the contents of the Couchbase bucket, if possible.
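One possible approach (paths are placeholders, and this sidesteps the connector API entirely): land the micro-batches in a staging prefix, then compact them into a single Parquet file in a second pass:

```python
# Hedged two-pass sketch; staging and final paths are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("couchbase-compaction").getOrCreate()

# ... the streaming job has already written many small files to staging ...
staged = spark.read.parquet("s3a://my-bucket/staging/couchbase/")

# coalesce(1) funnels everything through a single task: fine as a one-off
# compaction step, but it serializes the write, so only use it once the
# total volume is known to fit comfortably in one task.
staged.coalesce(1).write.mode("overwrite").parquet(
    "s3a://my-bucket/final/couchbase/")
```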

How is a large dataset uploaded to a cloud file system (S3, HDFS) if there isn't enough space on local disk?

I have a project that deals with processing data with Spark on EMR.
From what I've read, people usually store their input data on some file system (HDFS, S3, or locally), and then operate on that. If the data is very large, we don't want to store that locally.
My question is, if I generate a bunch of data, how do you even store that data remotely on S3 or whichever cloud file system there is in the first place? Don't I need to have the data stored locally before I can store it on the cloud?
I ask this because currently, I'm using a service that has a method that returns a Spark Dataset object to me. I'm not quite sure how the workflow goes between calling that method and processing it via Spark on EMR.
The object store connectors tend to write data in blocks; for each partition, the worker creates a file through the Hadoop FS APIs, with a path like s3://bucket/dest/_temporary/0/task_0001/part-0001.csv, and gets back an output stream into which it writes. That's it.
I don't know about the closed-source EMR S3 connector, but the ASF S3A one is up there for you to examine.
Data is buffered up to the value of fs.s3a.blocksize; default = 32M, i.e. 32 MB.
Buffering is to disk (the default), heap (arrays), or off-heap byte buffers; see S3ADataBlocks.
When you write data, once the buffer threshold is reached, that block is uploaded (in a separate thread) and a new block buffer is created; see S3ABlockOutputStream.write.
When the stream's close() method is called, any outstanding data is PUT to S3, and the thread then blocks until it is all uploaded; see S3ABlockOutputStream.close.
The uploads happen in a separate thread, so even if the network is slow you can generate data slightly faster, with any blocking happening at the end. The amount of disk/RAM you need is as much as all outstanding blocks from all workers uploading data. The thread pool for the upload is shared and of limited size, so you can tune the parameters to limit these values, though that's normally only needed if you try to buffer in memory.
When the queue fills up, the worker threads writing to the S3 output stream block, via the SemaphoredDelegatingExecutor.
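A hedged sketch of tuning those knobs from Spark (the key names are standard S3A options; the values are purely illustrative):

```python
# Illustrative S3A upload tuning; the values here are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-upload-tuning")
    # Where blocks are buffered: "disk" (default), "array", or "bytebuffer".
    .config("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")
    # Threads performing the block uploads.
    .config("spark.hadoop.fs.s3a.threads.max", "16")
    # Cap on queued + active uploads; writers block once this fills
    # (via the SemaphoredDelegatingExecutor mentioned above).
    .config("spark.hadoop.fs.s3a.max.total.tasks", "32")
    .getOrCreate()
)
```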
The amount of local storage you need then depends on:
number of spark worker threads
rate of data they generate
number of threads/http connections you have to upload the data
bandwidth from VM to S3 (the ultimate limit)
any throttling S3 does with many clients writing to same bit of a bucket
That's with the S3A connector; the EMR S3 one will be different, but again, upload bandwidth will be the bottleneck. I assume it, too, has something to block workers which create more data than the network can handle.
Anyway: for Spark and the hadoop code it uses underneath, all the source is there for you to explore. Don't be afraid to do so!
When dealing with Spark and any distributed storage, keep in mind that a Spark cluster is made up of a number of nodes.
While Dataset transformations are orchestrated from a single node of the cluster, the driver, common practice is that the processed data is never collected on any single node. Each node in the executor role operates on a fraction of the whole data during its ingestion into Spark, its processing, and its storage back to some kind of storage.
With this approach, the limits of a single node do not limit the volume of data that the cluster can process.
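A minimal sketch of that flow (paths and the column name are assumptions): each executor writes its own partitions straight to S3, and nothing is collected on the driver:

```python
# Each stage below runs in parallel across executors; placeholder names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distributed-write").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/input/")  # read in parallel
result = df.filter(df["value"] > 0)                # transform in parallel

# Avoid result.collect() here; that is what would pull everything onto
# one node. The write below streams from the executors directly to S3.
result.write.mode("overwrite").parquet("s3a://my-bucket/output/")
```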

AWS Data Lake Ingest

Do you need to ingest Excel and other proprietary formats using Glue, or can you let Glue crawl your S3 bucket in order to use these data formats within your data lake?
I have gone through the "Data Lake Foundation on the AWS Cloud" document and am left scratching my head about getting data into the lake. I have a data provider with a large set of data stored on their system as Excel and Access files.
Based on the process flow, they would upload the data into the submission S3 bucket, which would set off a series of actions, but there is no ETL of the data into a format that would work with the other tools.
Would using these files require running Glue on the data submitted to the bucket, or is there another way to make this data available to other tools such as Athena and Redshift Spectrum?
Thank you for any light you can shed on this topic.
-Guido
I'm not seeing anything that can take Excel data directly into the Data Lake. You might need to convert it into CSV/TSV/JSON or another supported format before loading it into the Data Lake.
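As an illustration of that conversion step, a small sketch with pandas (file names are placeholders; reading .xlsx needs the openpyxl package):

```python
# Hedged conversion sketch; file names are assumptions.
import pandas as pd

# Read the first sheet of the provider's workbook and write it out as CSV.
df = pd.read_excel("provider_data.xlsx", sheet_name=0)
df.to_csv("provider_data.csv", index=False)
```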
Formats Supported by Redshift Spectrum:
http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html -- again, I don't see Excel there as of now.
Athena Supported File Formats:
http://docs.aws.amazon.com/athena/latest/ug/supported-formats.html -- Excel is not supported here either.
You need to upload the files to S3 whether you use Athena, Redshift Spectrum, or even Redshift storage itself.
Uploading Files to S3:
If you have bigger files, you need to use S3 multipart upload to upload them quicker. If you want more speed, use S3 Transfer Acceleration to upload your files.
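For example, boto3's managed transfer switches to multipart automatically once a file crosses the configured threshold; a hedged sketch with placeholder names and sizes:

```python
# Multipart upload via boto3's managed transfer; names/sizes are assumptions.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=16 * 1024 * 1024,  # 16 MB parts
    max_concurrency=8,                     # parallel part uploads
)

s3.upload_file("provider_data.csv", "my-submission-bucket",
               "incoming/provider_data.csv", Config=config)
```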
Querying Big Data with Athena:
You can create external tables with Athena from S3 locations. Once you create the external tables, use the Athena SQL reference to query your data.
http://docs.aws.amazon.com/athena/latest/ug/language-reference.html
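A hedged sketch of defining such an external table over CSV in S3 through the Athena API (database, table, and bucket names are placeholders):

```python
# Create an external table over CSV in S3; all names are assumptions.
import boto3

athena = boto3.client("athena")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.submissions (
    id string,
    amount double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-submission-bucket/incoming/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```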
Querying Big Data with Redshift Spectrum:
Similar to Athena, you can create external tables with Redshift, then start querying those tables and get the results in Redshift.
Redshift has a lot of client tools; I use SQL Workbench. It is free, open source, and rock solid, and AWS documents how to connect to Redshift with it.
SQL WorkBench: http://www.sql-workbench.net/
Connecting your WorkBench to Redshift: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html
Copying data to Redshift:
Also, if you want to move the data into Redshift storage, you can use the COPY command to pull the data from S3 and load it into Redshift.
Copy Command Examples:
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
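A hedged sketch of issuing such a COPY from Python (psycopg2 is one common driver choice; the connection details, table, bucket, and IAM role are all placeholders):

```python
# Run a Redshift COPY from S3; every identifier below is an assumption.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="admin", password="...",
)

copy_sql = """
COPY submissions
FROM 's3://my-submission-bucket/incoming/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;
"""

# The connection context manager commits the transaction on success.
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
```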
Redshift Cluster Size and Number of Nodes:
Before creating a Redshift cluster, check the required size and number of nodes. More nodes let queries run in parallel. One more important factor is how well your data is distributed (distribution key and sort keys).
I have had a very good experience with Redshift; getting up to speed might take some time.
Hope it helps.
