I am a newbie to PySpark.
I want to understand how large the files I write to S3 should be so that Spark can read and process them.
I have around 400 to 500 GB of data in total, which I first need to upload to S3 using some tool.
I am trying to understand how big each file should be in S3 so that Spark can read and process it efficiently.
And how will Spark distribute the data in the S3 files across multiple executors?
Any good reading links?
Thanks
Try 64-128MB, though it depends on the format.
Spark treats S3 data as independent of location, so it doesn't use locality in its placement decisions; the work simply goes to whichever workers have capacity for it.
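To put rough numbers on the 64-128MB suggestion, here is a minimal sketch of the arithmetic for a 500 GB data set. The helper name estimate_file_count is made up for illustration, and the sizes are the ones mentioned in the question and answer, not a recommendation for every format.

```python
def estimate_file_count(total_bytes: int, target_file_bytes: int) -> int:
    """Number of files needed so that no file exceeds target_file_bytes."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bytes // target_file_bytes)


MB = 1024 ** 2
GB = 1024 ** 3

for target_mb in (64, 128):
    count = estimate_file_count(500 * GB, target_mb * MB)
    print(f"{target_mb} MB per file -> about {count} files")
```

So for this data volume you would be looking at a few thousand files rather than a handful of huge ones or millions of tiny ones.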
Related
I have a folder (path = mnt/data/*.json) in s3 with millions of json files (each file is less than 10 KB). I run the following code:
df = (spark.read
      .option("multiline", True)
      .option("inferSchema", False)
      .json(path))
display(df)
The problem is that it is very slow. Spark creates a job for this with one task. The task appears to have no executors running it any more, which usually signifies the completion of a job (right?), but for some reason the command cell in Databricks is still running. It has been stuck like this for 10 minutes. I feel something as simple as this should take no more than 5 minutes.
Notes to consider:
Since there are millions of json files, I can't say with confidence that they will have the same exact structure (there could be some discrepancies)
The json files were web-scraped from the same REST API
I read somewhere that inferSchema = False can help reduce runtime, which is why I used it
AWS s3 bucket is already mounted so there is absolutely no need to use boto3
Apache Spark is very good at handling large files, but when you have tens of thousands of small files (millions in your case), whether in one directory or spread over several, processing time suffers severely (potentially tens of minutes to hours) because Spark has to read each of these tiny files.
An ideal file size on disk is between 128 MB and 1 GB. Files smaller than 128 MB (the default value of spark.sql.files.maxPartitionBytes) cause this tiny files problem and become the bottleneck.
You can rewrite the data in Parquet format at an intermediate location as either
one large file using coalesce, or multiple evenly sized files using repartition.
You can then read the data from this intermediate location for further processing, which should prevent the bottlenecks that come with the tiny files problem.
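A minimal sketch of that compaction step, assuming you already have a SparkSession to pass in. The function name, the paths and the number of output files are placeholders, not something from the question.

```python
def compact_json_to_parquet(spark, src_path, dst_path, num_files):
    """Read many small JSON files and rewrite them as Parquet.

    `spark` is an existing SparkSession; the paths and num_files are
    placeholders chosen by the caller.
    """
    df = spark.read.option("multiLine", True).json(src_path)
    # repartition(n) shuffles the data into n evenly sized partitions, which
    # typically become n output files; coalesce(1) would give one large file.
    df.repartition(num_files).write.mode("overwrite").parquet(dst_path)
    return dst_path
```

Later jobs would then read from dst_path with spark.read.parquet instead of touching the raw JSON files again.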
My approach was very simple, thanks to Anand pointing out the "small file problem." My problem was that I could not extract ~2 million JSON files, each ~10 KB in size, so there was no way I could read them and then store them in Parquet format as an intermediate step. I was given an S3 bucket with raw JSON files scraped from the web.
At any rate, Python's zipfile module came in handy. I used it to combine multiple JSON files so that each combined file was at least 128 MB and at most 1 GB. Worked pretty well!
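For anyone wanting to try the same batching idea, here is a rough sketch. Note that it is not the zipfile code used above: it swaps in plain JSON Lines output, on the assumption that you want merged files that spark.read.json can read directly. The function name, the batch file names and the default size limit are all made up for illustration.

```python
import json
from pathlib import Path


def merge_json_files(src_dir, dst_dir, max_bytes=128 * 1024 * 1024):
    """Merge small JSON files into JSON Lines batches of at most max_bytes."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    outputs = []
    out, written = None, 0
    for path in sorted(Path(src_dir).glob("*.json")):
        # Re-serialise each document so that it occupies exactly one line.
        document = json.loads(path.read_text(encoding="utf-8"))
        data = (json.dumps(document) + "\n").encode("utf-8")
        # Start a new batch when the current one would exceed the limit.
        if out is None or written + len(data) > max_bytes:
            if out is not None:
                out.close()
            target = dst / f"batch-{len(outputs):05d}.jsonl"
            out = target.open("wb")
            outputs.append(target)
            written = 0
        out.write(data)
        written += len(data)
    if out is not None:
        out.close()
    return outputs
```

Because every record ends up on its own line, the merged files can be read without the multiline option.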
There is also another way to do this using AWS Glue. That requires IAM role authorization and can be expensive, but the advantage is that you can convert those files into Parquet directly.
zipfile solution: https://docs.python.org/3/library/zipfile.html
AWS Glue solution: https://medium.com/searce/convert-csv-json-files-to-apache-parquet-using-aws-glue-a760d177b45f
Really good blog posts explaining the small file problem:
https://mungingdata.com/apache-spark/compacting-files/
https://garrens.com/blog/2017/11/04/big-data-spark-and-its-small-files-problem/?unapproved=252&moderation-hash=5a657350c6169448d65209caa52d5d2c#comment-252
You can use S3 Select with Spark on Amazon EMR and with Databricks, but only for CSV and JSON files. I am guessing that S3 Select isn't offered for columnar file formats because it wouldn't help that much.
Let's say we have a data lake of people with first_name, last_name and country columns.
If the data is stored as CSV files and you run a query like peopleDF.select("first_name").distinct().count(), then S3 will transfer all the data for all the columns to the ec2 cluster to run the computation. This is really inefficient because we don't need all the last_name and country data to run this query.
If the data is stored as CSV files and you run the query with S3 select, then S3 will only transfer the data in the first_name column to run the query.
spark
  .read
  .format("s3select")
  .schema(...)
  .options(...)
  .load("s3://bucket/filename")
  .select("first_name")
  .distinct()
  .count()
If the data is stored in a Parquet data lake and peopleDF.select("first_name").distinct().count() is run, then S3 will only transfer the data in the first_name column to the ec2 cluster. Parquet is a columnar file format and this is one of the main advantages.
So based on my understanding, S3 Select wouldn't help speed up an analysis on a Parquet data lake because columnar file formats offer the S3 Select optimization out of the box.
I am not sure because a coworker is certain I am wrong and because S3 Select supports the Parquet file format. Can you please confirm that columnar file formats provide the main optimization offered by S3 Select?
This is an interesting question. I don't have any real numbers, though I have done the S3 Select binding code in the hadoop-aws module. Amazon EMR has some figures, as does Databricks.
For CSV IO: yes, S3 Select will speed things up given aggressive filtering of the source data, e.g. many GB of data scanned but not much returned. Why? Although the read is slower, you save on the limited bandwidth to your VM.
For Parquet, though, the workers split up a large file into parts and schedule the work across them (assuming a splittable compression format like snappy is used), so more than one worker can work on the same file. They also read only a fraction of the data (so the bandwidth benefit is smaller), but they do seek around in that file (so you need to optimise the seek policy, or else pay the cost of aborting and reopening HTTP connections).
I'm not convinced that Parquet reads in the S3 cluster can beat a Spark cluster if there's enough capacity in the cluster and you've also tuned your S3 client settings for performance (for s3a this means seek policy, thread pool size and HTTP pool size).
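As a sketch of what that s3a tuning might look like: the three option names below are the Hadoop S3A settings for seek policy, thread pool size and HTTP connection pool size, but the values are illustrative guesses rather than measured recommendations, and the helper functions are made up for this example.

```python
# Illustrative values only; tune them for your own workload and Hadoop version.
S3A_SETTINGS = {
    "fs.s3a.experimental.input.fadvise": "random",  # seek policy for columnar reads
    "fs.s3a.threads.max": "64",                     # thread pool size
    "fs.s3a.connection.maximum": "96",              # HTTP connection pool size
}


def to_spark_conf(hadoop_settings):
    """Prefix Hadoop options with 'spark.hadoop.' so Spark forwards them."""
    return {f"spark.hadoop.{key}": value for key, value in hadoop_settings.items()}


def to_submit_args(conf):
    """Render a conf dict as spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(to_spark_conf(S3A_SETTINGS))))
```

The same key/value pairs can also be set on the SparkSession builder with .config(key, value) instead of on the command line.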
Like I said though: I'm not sure. Numbers are welcome.
Came across this spark package for s3 select on parquet [1]
[1] https://github.com/minio/spark-select
I have a pipe delimited text file that is 360GB, compressed (gzip). The file is in an S3 bucket.
This is my first time using Spark. I understand that you can partition a file in order to allow multiple worker nodes to operate on the data which results in huge performance gains. However, I'm trying to find an efficient way to turn my one 360GB file into a partitioned file. Is there a way to use multiple spark worker nodes to work on my one, compressed file in order to partition it? Unfortunately, I have no control over the fact that I'm just getting one huge file. I could uncompress the file myself and break it into many files (say 360 1GB files), but I'll just be using one machine to do that and it will be pretty slow. I need to run some expensive transformations on the data using Spark so I think partitioning the file is necessary. I'm using Spark inside of Amazon Glue so I know that it can scale to a large number of machines. Also, I'm using python (pyspark).
Thanks.
If I'm not mistaken, Spark uses Hadoop's TextInputFormat if you read a file using SparkContext.textFile. If a compression codec is set, TextInputFormat determines whether the file is splittable by checking if the codec is an instance of SplittableCompressionCodec.
I believe GZIP is not splittable, so Spark can only generate one partition to read the entire file.
What you could do is:
1. Add a repartition after SparkContext.textFile so that, after the initial single-partition read, your transformations at least run in parallel on parts of the data.
2. Ask for multiple files instead of just a single GZIP file
3. Write an application that decompresses and splits the files into multiple output files before running your Spark application on it.
4. Write your own compression codec for GZIP (this is a little more complex).
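Option 3 could look roughly like the sketch below. It streams the gzip file line by line and writes smaller gzip files, so each output becomes its own Spark partition. It still runs on a single machine, so it only moves the slow step out of the Spark job. The function name, output naming and lines_per_file default are placeholders for illustration.

```python
import gzip
from pathlib import Path


def split_gzip(src_path, dst_dir, lines_per_file=1_000_000):
    """Stream a gzip text file and rewrite it as several smaller gzip files."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    outputs = []
    out = None
    with gzip.open(src_path, "rt", encoding="utf-8") as src:
        for index, line in enumerate(src):
            # Roll over to a new output file every lines_per_file lines.
            if index % lines_per_file == 0:
                if out is not None:
                    out.close()
                target = dst / f"part-{len(outputs):05d}.txt.gz"
                out = gzip.open(target, "wt", encoding="utf-8")
                outputs.append(target)
            out.write(line)
    if out is not None:
        out.close()
    return outputs
```

Only one line is held in memory at a time, so the 360GB input never needs to fit in RAM or be fully decompressed to disk.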
Have a look at these links:
TextInputFormat
source code for TextInputFormat
GzipCodec
source code for GZIPCodec
These are in Java, but I'm sure there are equivalent Python/Scala versions of them.
First, I suggest using the ORC format with zlib compression, which gives almost 70% compression; as per my research, ORC is the most suitable file format for fast data processing. So load your file and simply write it out in ORC format with a repartition:
df.repartition(500).write.format("orc").option("compression","zlib").mode("overwrite").save("testoutput.orc")
One potential solution could be to use Amazon's S3DistCp as a step on your EMR cluster to copy the 360GB file into the HDFS file system available on the cluster (this requires Hadoop to be deployed on the EMR cluster).
A nice thing about S3DistCp is that you can change the codec of the output file, and transform the original gzip file into a format which will allow Spark to spawn multiple partitions for its RDD.
However, I am not sure how long it will take S3DistCp to perform the operation, which is a Hadoop Map/Reduce job over S3. It benefits from optimised S3 libraries when run from EMR, but I am concerned that Hadoop will face the same limitations as Spark when generating the Map tasks.
I currently have a spark cluster set up with 4 worker nodes and 2 head nodes. I have a 1.5 GB CSV file in blob storage that I can access from one of the head nodes. I find that it takes quite a while to load this data and cache it using PySpark. Is there a way to load the data faster?
One thought I had was loading the data, then partitioning the data into k (number of nodes) different segments and saving them back to blob as parquet files. This way, I can load in different parts of the data set in parallel then union... However, I am unsure if all the data is just loaded on the head node, then when computation occurs, it distributes to the other machines. If the latter is true, then the partitioning would be useless.
Help would be much appreciated. Thank you.
Generally, you will want smaller files on blob storage so that data can be transferred from blob storage to compute in parallel, giving you faster transfer rates. A good rule of thumb is a file size between 64 MB and 256 MB; a good reference is Vida Ha's Data Storage Tips for Optimal Spark Performance.
Your suggestion to read the file and then save it back as Parquet (with the default snappy compression codec) is a good idea. Parquet is natively supported by Spark and is often faster to query against. The only tweak would be to partition by file size rather than by the number of worker nodes. The data is loaded onto the worker nodes, but partitioning is helpful because more tasks are created to read more files.
I'm trying to save a Spark DataFrame (of more than 20 GB) to a single JSON file in Amazon S3. My code to save the DataFrame looks like this:
dataframe.repartition(1).save("s3n://mybucket/testfile","json")
But I'm getting an error from S3, "Your proposed upload exceeds the maximum allowed size". I know that the maximum file size allowed by Amazon is 5 GB.
Is it possible to use S3 multipart upload with Spark, or is there another way to solve this?
By the way, I need the data in a single file because another user is going to download it afterwards.
*I'm using Apache Spark 1.3.1 in a 3-node cluster created with the spark-ec2 script.
Thanks a lot
JG
I would try separating the large dataframe into a series of smaller dataframes that you then append into the same file in the target.
df.write.mode('append').json(yourtargetpath)
Try this
dataframe.write.format("org.apache.spark.sql.json").mode(SaveMode.Append).save("hdfs://localhost:9000/sampletext.txt");
I don't think s3a is production-ready in Spark.
I would say the design is not sound. repartition(1) is going to be terrible (you are telling Spark to merge all partitions into a single one).
I would suggest convincing the downstream user to download the contents from a folder rather than a single file.
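If the downstream user really does need one file, a small script on their side can stitch the downloaded folder back together. This is only a sketch: it assumes the folder holds Spark's usual part-* output files with one JSON record per line, and the function name is made up for illustration.

```python
import shutil
from pathlib import Path


def concat_part_files(folder, output_path):
    """Concatenate Spark part-* files from a downloaded folder into one file."""
    # Marker files such as _SUCCESS do not match the part-* pattern.
    parts = sorted(p for p in Path(folder).glob("part-*") if p.is_file())
    with open(output_path, "wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)
    return len(parts)
```

This keeps the Spark job writing in parallel and pushes the single-file requirement to the machine that actually needs it.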