I am (for the first time) trying to repartition the data my team is working with to enhance our querying performance. Our data is currently stored in partitioned .parquet files compressed with gzip. I have been reading that using snappy instead would significantly increase throughput (we query this data daily for our analysis), but I still wanted to benchmark the two codecs to see the performance gap with my own eyes. I wrote a simple (Py)Spark 2.1.1 application to carry out some tests. I persisted 50 million records in memory (deserialized) in a single partition, wrote them into a single parquet file (to HDFS) using each codec, and then imported the files again to assess the difference. My problem is that I can't see any significant difference for either read or write.
Here is how I wrote my records to HDFS (same thing for the gzip file, just replace 'snappy' with 'gzip'):
persisted_records.write\
.option('compression', 'snappy')\
.mode("overwrite")\
.partitionBy(*partition_cols)\
.parquet('path_to_dir/test_file_snappy')
And here is how I read my single .parquet file back (same thing for the gzip file, just replace 'snappy' with 'gzip'):
df_read_snappy = spark.read\
.option('basePath', 'path_to_dir/test_file_snappy')\
.option('compression', 'snappy')\
.parquet('path_to_dir/test_file_snappy')\
.cache()
df_read_snappy.count()
I looked at the durations in the Spark UI. For reference, the persisted (deserialized) 50 million rows amount to 317.4 MB. Once written into a single parquet file, the file weighs 60.5 MB with gzip and 105.1 MB with snappy (this is expected, as gzip is supposed to have a better compression ratio). Spark spends 1.7 min (gzip) and 1.5 min (snappy) writing the file (a single partition, so a single core has to carry out all the work). Reading times amount to 2.7 min (gzip) and 2.9 min (snappy) on a single core (since we have a single file / HDFS block). This is what I do not understand: where is snappy's higher performance?
Have I done something wrong? Is my "benchmarking protocol" flawed? Is the performance gain there, but am I not looking at the right metrics?
I must add that I am using Spark's default configuration; I did not change anything aside from specifying the number of executors, etc.
Many thanks for your help!
Notice: Spark parquet jar version is 1.8.1
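For what it's worth, the benchmark loop above can be made explicit with a small timing helper. This is only a sketch of the protocol described in the question; the `timed` helper is mine, and `persisted_records`, `spark`, and the paths are the question's placeholders:

```python
import time

def timed(label, fn):
    """Run fn(), print the wall-clock duration, and return (seconds, result)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f}s")
    return elapsed, result

# Hypothetical usage against the DataFrame from the question. The lambdas
# bind the current codec via a default argument so each iteration uses
# the right one:
# for codec in ("gzip", "snappy"):
#     timed(f"write-{codec}", lambda c=codec: persisted_records.write
#           .option("compression", c)
#           .mode("overwrite")
#           .parquet(f"path_to_dir/test_file_{c}"))
#     timed(f"read-{codec}", lambda c=codec: spark.read
#           .parquet(f"path_to_dir/test_file_{c}")
#           .count())
```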
Related
I need to read data (originating from a RedShift table with 5 columns, total size of the table is on the order of 500gb - 1tb) from S3 into Spark via PySpark for a daily batch job.
Are there any best practices around:
Preferred File Formats for how I store my data in S3? (does the format even matter?)
Optimal file size?
Any resources/links that can point me in the right direction would also work.
Thanks!
This blog post has some great info on the subject:
https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/
Look at the section titled: Use the Best Data Store for Your Use Case
From personal experience, I prefer using parquet in most scenarios, because I’m usually writing the data out once, and then reading it many times (for analytics).
In terms of numbers of files, I like to have between 200 and 1,000. This allows clusters of all sizes to read and write in parallel, and allows my reading of the data to be efficient because with parquet I can zoom in on just the file I’m interested in. If you have too many files, there is a ton of overhead in spark remembering all the file names and locations, and if you have too few files, it can’t parallelize your reads and writes effectively.
File size I have found to be less important than number of files, when using parquet.
EDIT:
Here’s a good section from that blog post that describes why I like to use parquet:
Apache Parquet gives the fastest read performance with Spark. Parquet arranges data in columns, putting related values in close proximity to each other to optimize query performance, minimize I/O, and facilitate compression. Parquet detects and encodes the same or similar data, using a technique that conserves resources. Parquet also stores column metadata and statistics, which can be pushed down to filter columns (discussed below). Spark 2.x has a vectorized Parquet reader that does decompression and decoding in column batches, providing ~ 10x faster read performance.
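To make the "200 to 1,000 files" heuristic above concrete, here is a small sketch. The `target_file_count` helper and the 256 MB per-file target are my own assumptions, not from the blog post; `df` and the S3 path are placeholders:

```python
import math

def target_file_count(total_bytes, target_file_mb=256, lo=200, hi=1000):
    """Pick an output file count in the 200-1,000 range discussed above,
    based on total data size and a rough per-file size target."""
    n = math.ceil(total_bytes / (target_file_mb * 1024 * 1024))
    return max(lo, min(hi, n))

# Sketch of the write (df and the path are placeholders):
# n = target_file_count(750 * 1024**3)   # ~750 GB of RedShift data
# df.repartition(n).write.mode("overwrite").parquet("s3://my-bucket/analytics/")
```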
I have a requirement in my project to process multiple .txt message files using PySpark. The files are moved from a local directory to an HDFS path (hdfs://messageDir/..) in batches, and in every batch I can see a few thousand .txt files with a total size of around 100 GB. Almost all of the files are less than 1 MB.
May I know how HDFS stores these files and performs splits? Because every file is less than 1 MB (less than the HDFS block size of 64/128 MB), I don't think any split would happen, but the files will be replicated and stored on 3 different data nodes.
When I use Spark to read all the files inside the HDFS directory (hdfs://messageDir/..) with wildcard matching like *.txt, as below:
rdd = sc.textFile('hdfs://messageDir/*.txt')
How does Spark read the files and perform partitioning, given that HDFS doesn't have any partitions for these small files?
What if the data volume grows over time and I get 1 TB of small files in every batch? Can someone tell me how this can be handled?
I think you are mixing things up a little.
You have files sitting in HDFS. Here, block size is the important factor. Depending on your configuration, a block is normally 64 MB or 128 MB. Each of your 1 MB files still occupies a block of its own: it only uses 1 MB on disk, but every block costs the NameNode a metadata entry. Can you concatenate these TXT files together? Otherwise you will exhaust the NameNode really quickly; HDFS is not made to store a large number of small files.
Spark can read files from HDFS, the local filesystem, MySQL, etc.; it cannot control the storage principles used there. Spark uses RDDs, which are partitioned so that parts of the data reach the workers. The number of partitions can be checked and controlled (using repartition). For HDFS reads, this number is defined by the number of files and blocks.
Here is a nice explanation of how SparkContext.textFile() handles partitioning and splits on HDFS: How does Spark partition(ing) work on files in HDFS?
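As a rough illustration of the "files and blocks" rule above, here is a sketch of how the split count comes out for the scenario in the question. The `hdfs_input_splits` helper is my own simplification of what Hadoop's TextInputFormat does, not an actual Spark API:

```python
import math

def hdfs_input_splits(file_sizes, block_size=128 * 1024 * 1024):
    """Rough split count as Hadoop's TextInputFormat computes it: one
    split per block, and a file smaller than a block still gets its own
    split, so thousands of tiny files yield thousands of partitions."""
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)

# 5,000 files of ~1 MB each -> ~5,000 partitions for sc.textFile(...)
# Spark-side check / fix (sketch, requires a SparkContext):
# rdd = sc.textFile('hdfs://messageDir/*.txt')
# rdd.getNumPartitions()        # roughly matches the estimate above
# rdd = rdd.repartition(200)    # bring it down to something manageable
```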
You can read the files from Spark even if they are small. The problem is HDFS: the HDFS block size is usually large (64 MB, 128 MB, or bigger), so many small files create NameNode overhead.
If you want to produce bigger files, you need to control the reducers: the number of output files is determined by how many reducers write. You can use the coalesce or repartition methods to control it.
Another way is to add an extra step that merges the files. I wrote a Spark application that does this with coalesce: I set a target record count per file, the application gets the total number of records, and from that it estimates the coalesce count.
You can also use Hive, or other tools.
We have a HIVE target stored as Parquet.
Informatica BDM jobs are configured to use Spark as the execution engine to load data into the HIVE target.
We noticed that around 2,000 part files were generated within a partition in HDFS. This behaviour impacts HIVE performance.
Is there any alternative?
Input File Size is just 12MB
Block size is 128MB
Regards,
Sridar Venkatesan
The root cause was spark.sql.shuffle.partitions.
You need to set spark.sql.shuffle.partitions=1.
This way the output will not be split into multiple part files.
This works with huge files as well.
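For reference, a minimal sketch of where the setting goes when building the session yourself (the app name and the commented-out write are placeholders; in an Informatica BDM setup the same property would be passed through the engine's Spark configuration):

```python
from pyspark.sql import SparkSession

# Build (or reuse) a session with a single shuffle partition, so any
# shuffle that precedes the Hive write produces one part file per
# Hive partition.
spark = (SparkSession.builder
         .appName("informatica-load")  # placeholder name
         .config("spark.sql.shuffle.partitions", "1")
         .getOrCreate())

# result.write.mode("overwrite").insertInto("hive_parquet_target")
```

Note that this disables shuffle parallelism for the whole job, which is fine for a 12 MB input but worth reconsidering for larger loads.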
I have a job that reads CSV files, converts them into data frames, and writes them out as Parquet. I am using append mode when writing, so each write generates a separate Parquet file. My questions are:
1) If a new file gets appended every time I write data to the Parquet dataset, will it impact read performance (as the data is now distributed across Parquet files of varying size)?
2) Is there a way to generate the Parquet partitions purely based on the size of the data?
3) Do we need to think about a custom partitioning strategy to implement point 2?
I am using Spark 2.3
1) It will affect read performance if spark.sql.parquet.mergeSchema=true. In that case, Spark needs to visit each file and grab the schema from it. In other cases, I believe it does not affect read performance much.
2) There is no way to generate files purely based on data size. You may use repartition or coalesce; the latter creates uneven output files, but is more performant. You also have the config spark.sql.files.maxRecordsPerFile (or the write option maxRecordsPerFile) to cap file sizes, but usually that is not an issue.
3) Yes, I think so; Spark has no built-in API to distribute output evenly by data size. Column statistics and SizeEstimator may help with this.
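Point 2 can be approximated in practice: derive a maxRecordsPerFile value from a desired file size and an average row size. The `records_per_file` helper and both size estimates below are my own illustration, not a Spark API; `df` and the path are placeholders:

```python
def records_per_file(target_file_bytes, avg_row_bytes):
    """Rough maxRecordsPerFile value for a desired output file size,
    given an average serialized row size (both numbers are estimates)."""
    return max(1, target_file_bytes // avg_row_bytes)

# e.g. aim for ~128 MB files, assuming rows serialize to ~200 bytes:
# n = records_per_file(128 * 1024 * 1024, 200)
# df.write.option("maxRecordsPerFile", n).mode("append").parquet("hdfs:///out")
```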
I have a pipe delimited text file that is 360GB, compressed (gzip). The file is in an S3 bucket.
This is my first time using Spark. I understand that you can partition a file to allow multiple worker nodes to operate on the data, which results in huge performance gains. However, I'm trying to find an efficient way to turn my one 360 GB file into a partitioned dataset. Is there a way to use multiple Spark worker nodes to work on my one compressed file in order to partition it? Unfortunately, I have no control over the fact that I'm just getting one huge file. I could uncompress the file myself and break it into many files (say 360 1 GB files), but I'd be using a single machine to do that and it would be pretty slow. I need to run some expensive transformations on the data using Spark, so I think partitioning the file is necessary. I'm using Spark inside Amazon Glue, so I know that it can scale to a large number of machines. Also, I'm using Python (PySpark).
Thanks.
If I'm not mistaken, Spark uses Hadoop's TextInputFormat when you read a file using SparkContext.textFile. If a compression codec is set, the TextInputFormat determines whether the file is splittable by checking if the codec is an instance of SplittableCompressionCodec.
GZIP is not splittable, so Spark can only generate one partition to read the entire file.
What you could do is:
1. Add a repartition after SparkContext.textFile, so that at least the transformations after it process parts of the data in parallel.
2. Ask for multiple files instead of just a single GZIP file
3. Write an application that decompresses and splits the file into multiple output files before running your Spark application on it.
4. Write your own compression codec for GZIP (this is a little more complex).
Have a look at these links:
TextInputFormat
source code for TextInputFormat
GzipCodec
source code for GZIPCodec
These are in Java, but I'm sure there are equivalent Python/Scala versions of them.
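Option 1 above, plus a quick extension-based sanity check, might look like this. The `SPLITTABLE` table and `is_splittable` helper are my own simplification (the authoritative check is the SplittableCompressionCodec test described above; LZO, for instance, only splits once an index is built, so it is listed as False here):

```python
import os

# Simplified splittability lookup by file extension; unknown extensions
# (plain text, CSV, etc.) are treated as splittable.
SPLITTABLE = {".bz2": True, ".gz": False, ".snappy": False, ".lzo": False}

def is_splittable(path):
    _, ext = os.path.splitext(path)
    return SPLITTABLE.get(ext, True)

# Sketch of option 1 (requires a SparkContext; paths are placeholders):
# rdd = sc.textFile("s3://bucket/big_file.gz")  # 1 partition: gzip won't split
# rdd = rdd.repartition(400)                    # parallelize downstream work
```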
First, I suggest using the ORC format with zlib compression: you get almost 70% compression, and based on my research ORC is the most suitable file format for fast data processing. So load your file and simply write it back out in ORC format with a repartition:
df.repartition(500).write.format("orc").option("compression", "zlib").mode("overwrite").save("testoutput.orc")
One potential solution could be to use Amazon's S3DistCp as a step on your EMR cluster to copy the 360 GB file into the HDFS file system available on the cluster (this requires Hadoop to be deployed on EMR).
A nice thing about S3DistCp is that you can change the codec of the output file, and transform the original gzip file into a format which will allow Spark to spawn multiple partitions for its RDD.
However, I am not sure how long S3DistCp will take to perform the operation (it is a Hadoop Map/Reduce job over S3; it benefits from optimised S3 libraries when run from EMR, but I am concerned that Hadoop will face the same limitations as Spark when generating the map tasks).
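For reference, the S3DistCp step described above might look roughly like this as an EMR step (the bucket, paths, and file name are placeholders; `--outputCodec=none` re-encodes the gzip input as uncompressed text so Spark can split it):

```shell
# Sketch of an S3DistCp invocation on an EMR cluster node.
s3-dist-cp \
  --src s3://my-bucket/input/big_file.gz \
  --dest hdfs:///data/input/ \
  --outputCodec=none
```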