With the 6 MB payload limit for Lambda functions consuming DynamoDB Streams, the resulting S3 Parquet file will, in our case, always be well under 6 MB. With a lot of data this produces a large number of small S3 files, and I keep seeing articles about Spark performing badly when there are too many small S3 files.
Isn't this a limitation? What does AWS suggest to solve this? Does anyone know? Thank you for your help.
Related
I am doing some tests with Spark and Parquet. I have an identical Parquet file locally and on S3. I read the file with Spark using a simple query:
val data = sparkSession.read.parquet(path)
.filter(col("mycol").startsWith("C")) // Partition filter
data.collectAsList() // just to force execution
When I point the path at S3, the reported input size is much higher (roughly 2x) than when I read from my local filesystem. The number of records is the same.
Can anybody explain this?
At first I thought it had to do with the fact that S3 is block storage and that it reads an entire block. My partition file was smaller than 32 MB, so it would be read twice (once for the schema and once for the values). But I see the same behavior with a Parquet file that has one large partition (190 MB)... I also tried different queries and get this result with every query I try.
Thank you for any help provided!
I am new to PySpark.
I just want to understand how large the files I write to S3 should be so that Spark can read and process them.
I have around 400 to 500 GB of data in total, which I first need to upload to S3 using some tool.
I'm trying to understand how big each file in S3 should be so that Spark can read and process it efficiently.
And how will Spark distribute the data in those S3 files across multiple executors?
Any good reading links?
Thanks
Try 64-128 MB, though it depends on the format.
Spark treats S3 data as location-independent, so it doesn't use locality in its placement decisions; tasks just go to whichever workers have capacity for extra work.
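As a rough sketch of how you might aim for that file size when writing (the data volume, partition count, and bucket path below are placeholders, not anything taken from the question):

# Sketch: target roughly 128 MB per output file when writing to S3.
# df, the measured input size, and the destination path are assumptions.
total_bytes = 450 * 1024**3                      # ~450 GB of data, per the question
target_file_bytes = 128 * 1024**2                # aim for ~128 MB files
num_files = max(1, total_bytes // target_file_bytes)

(df.repartition(int(num_files))                  # roughly one output file per partition
   .write
   .mode("overwrite")
   .parquet("s3://my-bucket/curated/"))          # placeholder destination

On the read side, Spark splits splittable files into input partitions of up to spark.sql.files.maxPartitionBytes (128 MB by default) and schedules those tasks on whichever executors have free slots.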
I have a folder (path = mnt/data/*.json) in S3 with millions of JSON files (each file is less than 10 KB). I run the following code:
df = (spark.read
.option("multiline", True)
.option("inferSchema", False)
.json(path))
display(df)
The problem is that it is very slow. Spark creates a job for this with one task. The task appears to have no executors running it anymore, which usually signifies the completion of a job (right?), but for some reason the command cell in Databricks is still running. It's been stuck like this for 10 minutes. I feel something as simple as this should take no more than 5 minutes.
Notes to consider:
Since there are millions of JSON files, I can't say with confidence that they will all have the exact same structure (there could be some discrepancies)
The JSON files were web-scraped from the same REST API
I read somewhere that inferSchema = False can help reduce runtime, which is why I used it
The AWS S3 bucket is already mounted, so there is absolutely no need to use boto3
Apache Spark is very good at handling large files, but when you have tens of thousands of small files (millions in your case) in one directory or spread across several directories, it has a severe impact on processing time (potentially tens of minutes to hours), since Spark has to open and read each of these tiny files.
An ideal file size on disk is between 128 MB and 1 GB; anything less than 128 MB (because of spark.sql.files.maxPartitionBytes) causes this tiny-files problem and becomes the bottleneck.
You can rewrite the data in Parquet format at an intermediate location, either as one large file using coalesce or as multiple even-sized files using repartition, as sketched below.
You can then read the data from this intermediate location for further processing, which should prevent the bottlenecks that come with the tiny-files problem.
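A minimal sketch of that compaction step, reusing the path from the question; the intermediate locations and partition count are placeholders:

# One slow pass over the tiny files...
df = (spark.read
      .option("multiline", True)
      .json("mnt/data/*.json"))

# ...then either one large file using coalesce...
df.coalesce(1).write.mode("overwrite").parquet("mnt/intermediate/compacted_single/")

# ...or several even-sized files using repartition (it shuffles, so sizes balance out)
df.repartition(64).write.mode("overwrite").parquet("mnt/intermediate/compacted_even/")

# Later jobs read the compacted copy instead of the millions of small JSON files
df_fast = spark.read.parquet("mnt/intermediate/compacted_even/")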
My approach ended up being very simple, thanks to Anand pointing out the "small file problem." My problem was that I could not read the ~2 million JSON files, each ~10 KB in size, so there was no way to load them and store them in Parquet format as an intermediate step. I was given an S3 bucket with raw JSON files scraped from the web.
At any rate, Python's zipfile module came in handy. I used it to bundle multiple JSON files together so that each archive was at least 128 MB and at most 1 GB. Worked pretty well!
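As a rough illustration of that batching idea (the directory name, archive naming scheme, and size threshold below are assumptions, not the original script):

import os
import zipfile

SRC_DIR = "raw_json"                 # assumed local folder holding the small JSON files
TARGET_BYTES = 128 * 1024 * 1024     # roll over to a new archive past ~128 MB

archive_index = 0
current_size = 0
zf = zipfile.ZipFile(f"batch_{archive_index}.zip", "w", zipfile.ZIP_DEFLATED)

for name in sorted(os.listdir(SRC_DIR)):
    if not name.endswith(".json"):
        continue
    path = os.path.join(SRC_DIR, name)
    zf.write(path, arcname=name)              # add the file to the open archive
    current_size += os.path.getsize(path)     # track uncompressed bytes added so far
    if current_size >= TARGET_BYTES:          # start a new archive once this one is "full"
        zf.close()
        archive_index += 1
        current_size = 0
        zf = zipfile.ZipFile(f"batch_{archive_index}.zip", "w", zipfile.ZIP_DEFLATED)

zf.close()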
There is also another way to do this using AWS Glue, although that requires IAM role authorization and can be expensive; the advantage is that you can convert those files into Parquet directly.
zipfile solution: https://docs.python.org/3/library/zipfile.html
AWS Glue solution: https://medium.com/searce/convert-csv-json-files-to-apache-parquet-using-aws-glue-a760d177b45f
Really good blog posts explaining the small file problem:
https://mungingdata.com/apache-spark/compacting-files/
https://garrens.com/blog/2017/11/04/big-data-spark-and-its-small-files-problem/?unapproved=252&moderation-hash=5a657350c6169448d65209caa52d5d2c#comment-252
S3 and GCS are not block storage, unlike HDFS, so it is not clear to me how Spark creates partitions when reading from these sources.
I am currently reading from GCS, but I get 2 partitions both for small files (10 bytes) and for medium-sized files (100 MB).
Has anyone an explanation?
Generally it's a configuration option, "how big to lie about partition size": object stores don't have real blocks, so the connector reports a configured block size and Spark splits files into partitions based on that reported size.
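As a hedged illustration of the splitting knobs on the Spark side (the bucket path is a placeholder, and these are standard Spark SQL options rather than anything quoted from the answer), you can tweak the split settings and check the resulting partition count:

# Sketch: observe how Spark bins object-store files into read partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)  # max bytes per read partition
spark.conf.set("spark.sql.files.openCostInBytes", 4 * 1024 * 1024)      # per-file padding when packing files

df = spark.read.parquet("gs://my-bucket/some/path/")    # placeholder path
print(df.rdd.getNumPartitions())                        # count reflects file sizes plus the settings above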
I have a 10GB gzip compressed file in S3 that I need to process in EMR Spark.
I need to load it, do a full outer join and write it back to S3.
The data that I do the full outer join with is the target dataset, which I planned to save as Parquet.
I can't have the input file split beforehand (since it comes from a third party); at most I can change the compression to bz2.
Any suggestion how to make the process of using the input file most efficient?
Currently, when I just use spark.read.csv, it takes a very long time and runs only one task, so the work can't be distributed.
Make step 1 a single-worker operation: read in the file and write it back as snappy-compressed Parquet, before doing the join. Once it's written like that, you've got a format that can be split up for the join.
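A minimal PySpark sketch of that two-step approach (the paths, header handling, partition count, and join key are assumptions):

# Step 1: single-task read of the unsplittable gzip CSV, rewritten as splittable Parquet.
raw = (spark.read
       .option("header", True)                        # assumption: the CSV has a header row
       .csv("s3://my-bucket/input/bigfile.csv.gz"))   # gzip => one task reads the whole file

(raw.repartition(200)                                 # spread the rows before writing
    .write
    .mode("overwrite")
    .option("compression", "snappy")                  # snappy-compressed Parquet is splittable
    .parquet("s3://my-bucket/staging/bigfile/"))

# Step 2: the full outer join now runs against a splittable, columnar copy.
staged = spark.read.parquet("s3://my-bucket/staging/bigfile/")
result = staged.join(target_df, on="id", how="full_outer")   # target_df and "id" are placeholders
result.write.mode("overwrite").parquet("s3://my-bucket/output/")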
I'd recommend launching an EC2 instance in the same region as the bucket, downloading the 10GB file, unzipping it, and uploading it back to S3. Using the aws-cli this should only take about 15 minutes in total.
For example:
aws s3 cp s3://bucket_name/file.txt.gz .    # cp handles single objects (sync is for prefixes/directories)
gunzip file.txt.gz
aws s3 cp file.txt s3://bucket_name/