SPARK | Generating too many part files - apache-spark

We have a Hive target stored as Parquet.
Informatica BDM jobs are configured to use Spark as the execution engine to load data into the Hive target.
We noticed that around 2000 part files were generated within a single partition in HDFS. This will hurt Hive performance.
Is there a way to avoid this?
The input file size is just 12 MB.
The block size is 128 MB.
Sridar Venkatesan

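The root cause was spark.sql.shuffle.partitions: each shuffle partition writes its own part file into every Hive partition it holds rows for.

Set spark.sql.shuffle.partitions=1.
With a single shuffle partition, the output is no longer split into multiple part files per partition.
This also works with large files, although a single task then does all the post-shuffle work, so a somewhat larger value may suit big inputs better.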
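To make the effect of the setting concrete, here is a rough back-of-the-envelope sketch in plain Python (no Spark needed). It assumes the worst case, where every shuffle partition holds rows for the Hive partition being written, and it uses the 12 MB input size as a stand-in for the output size, which will differ once the data is encoded as Parquet. The function names are made up for illustration.

```python
MB = 1024 * 1024


def max_part_files(shuffle_partitions: int, table_partitions: int = 1) -> int:
    """Upper bound on part files: each shuffle partition may write one file
    into every table partition it holds rows for."""
    return shuffle_partitions * table_partitions


def average_part_file_bytes(total_bytes: int, part_files: int) -> float:
    """Average size of a part file if the data is spread evenly."""
    return total_bytes / part_files


# Figures from the question: 12 MB of input ending up in about 2000 part files.
print(max_part_files(2000))                           # 2000
print(round(average_part_file_bytes(12 * MB, 2000)))  # 6291 (about 6 KB per file)
# With a single shuffle partition there is at most one file per table partition.
print(max_part_files(1))                              # 1
```

Files of a few kilobytes are far below the 128 MB block size, which is why reducing the number of shuffle partitions helps here.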


Does Apache Spark load the entire partition into memory?

Does Apache Spark load the entire partition into memory, or does it load it gradually? Is there any reference (preferably official) about that?
If I have a large partition, do I need enough memory available to hold the whole partition?
Does the way data is loaded from a partition into memory depend on the type of transformation?
That depends on your file type. For CSV/text files, Spark usually loads the data gradually, even if you have multiple partitions, and how it does so depends on the size of the files. CSV is row-oriented text, so Spark cannot skip straight to the data you need: to pick out the rows and columns you want, it has to scan the whole file.
Parquet and ORC files are naturally splittable. If you put conditions on the read, such as where to filter rows and select to choose columns, Spark does not load the full files. That is why a file size of around 1 GB is recommended to optimise Spark processing time.
So if you are using Parquet, each Spark partition should be able to fit in memory while the job is running. Spark will try to keep as many partitions as it can in the cluster's memory during your transformations; whatever does not fit is spilled to disk, which slows execution down but ensures the job finishes.
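As a loose analogy only (this is plain Python, not Spark's implementation), reading "gradually" can be pictured as a chain of iterators in which each record flows through every step before the next record is read. The sample rows and function names below are invented for illustration.

```python
import io
from typing import Iterable, Iterator


def read_records(source: Iterable[str]) -> Iterator[str]:
    """Yield one line at a time instead of materialising the whole input."""
    for line in source:
        yield line.rstrip("\n")


def keep_city(records: Iterable[str], city: str) -> Iterator[str]:
    """Filter step: pass through only the rows for the requested city."""
    for record in records:
        if record.split(",")[0] == city:
            yield record


def amounts(records: Iterable[str]) -> Iterator[int]:
    """Map step: extract the numeric second column."""
    for record in records:
        yield int(record.split(",")[1])


# Hypothetical CSV content standing in for one partition of a text file.
data = io.StringIO("Barcelona,10\nMadrid,20\nBarcelona,5\n")

# Nothing is read until sum() pulls records through the chain one by one.
total = sum(amounts(keep_city(read_records(data), "Barcelona")))
print(total)  # 15
```

Steps that need all of the data at once, such as sorting or caching, cannot be streamed this way, which is where memory pressure and spilling come in.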

Understanding file distribution and partitioning in HDFS when using Hive

On the one hand, the HDFS docs say:
HDFS is designed to support very large files. Applications that are
compatible with HDFS are those that deal with large data sets. These
applications write their data only once but they read it one or more
times and require these reads to be satisfied at streaming speeds.
HDFS supports write-once-read-many semantics on files. A typical block
size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64
MB chunks, and if possible, each chunk will reside on a different DataNode.
Meaning every file will be split across nodes.
On the other hand, when I use Hive or Spark SQL, I manage the partitions in such a way that there is a folder for each partition, and all the files inside belong to this partition. For example:
Let's say that each file's size is 1GB and the HDFS block size is 128 MB.
So I am confused. I don't understand whether city=Barcelona/2019-08-28.parquet is saved on only one node as a whole (even together with city=Barcelona/2019-08-27.parquet), or whether each file is distributed across 8 nodes.
If each file is distributed, then what is the benefit of the partitions?
If the data is distributed according to the partitions I define, how does HDFS know to do that? Does it look for folders with a name in the form of key=value and make sure they will be saved intact?
You are confused between "how HDFS stores the files that we dump into it" and "how Hive/Spark creates different directories in case of partitioning".
Let me try to provide you a perspective.
HDFS works as you have mentioned.
HDFS breaks each file into a number of blocks that depends on the block size and the size of the file being stored. The metadata (directories, permissions, etc.) is an abstraction, in the sense that the file (2019-08-27.parquet) that you see as one is in fact distributed among nodes. The NameNode maintains the metadata.
However, when we partition, a separate directory is created on HDFS for each partition value. This helps when you query the data with conditions on the partition column: only the relevant directories are scanned for the requested data. If you query your partitioned data and call explain to look at the plan, you will see the PartitionFilters in the FileScan node.
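The block arithmetic can be sketched in a few lines of plain Python, using the 1 GB file and 128 MB block size from the question. The function name is made up for illustration, and replication is ignored.

```python
from typing import List

MB = 1024 ** 2
GB = 1024 ** 3


def hdfs_block_sizes(file_bytes: int, block_bytes: int) -> List[int]:
    """Return the sizes of the blocks a file is chopped into.
    Every block is full except possibly the last one."""
    full_blocks, remainder = divmod(file_bytes, block_bytes)
    return [block_bytes] * full_blocks + ([remainder] if remainder else [])


blocks = hdfs_block_sizes(1 * GB, 128 * MB)
print(len(blocks))  # 8
```

Each of those 8 blocks can be placed on a different DataNode, regardless of which directory the file sits in.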
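The directory pruning can be illustrated without a cluster. The sketch below is plain Python that mimics the idea of matching key=value directory names against a filter; it is not how Hive or Spark implement it. The two Barcelona paths come from the question, and the Madrid path is a hypothetical addition.

```python
from typing import Dict, List


def partition_values(path: str) -> Dict[str, str]:
    """Parse the key=value directory components of a path."""
    values = {}
    for component in path.split("/"):
        if "=" in component:
            key, value = component.split("=", 1)
            values[key] = value
    return values


def prune(paths: List[str], **filters: str) -> List[str]:
    """Keep only the files whose partition directories match every filter."""
    return [
        path for path in paths
        if all(partition_values(path).get(key) == value
               for key, value in filters.items())
    ]


files = [
    "city=Barcelona/2019-08-27.parquet",
    "city=Barcelona/2019-08-28.parquet",
    "city=Madrid/2019-08-27.parquet",  # hypothetical extra partition
]

# Only the Barcelona directory needs to be scanned for this filter.
print(prune(files, city="Barcelona"))
```

The files that survive pruning are still ordinary HDFS files, each split into blocks and spread across DataNodes.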
The partitioned data is still stored on HDFS in the same way that you mentioned.
Hope this helps!

Spark 2.x - gzip vs snappy compression for parquet files

I am (for the first time) trying to repartition the data my team is working with to improve our query performance. Our data is currently stored in partitioned .parquet files compressed with gzip. I have been reading that using snappy instead would significantly increase throughput (we query this data daily for our analysis). I still wanted to benchmark the two codecs to see the performance gap with my own eyes. I wrote a simple (Py)Spark 2.1.1 application to carry out some tests. I persisted 50 million records in memory (deserialized) in a single partition, wrote them into a single parquet file (to HDFS) using the different codecs, and then read the files back to assess the difference. My problem is that I can't see any significant difference for either read or write.
Here is how I wrote my records to HDFS (same thing for the gzip file, just replace 'snappy' with 'gzip'):
.option('compression', 'snappy')\
And here is how I read my single .parquet file (same thing for the gzip file, just replace 'snappy' with 'gzip'):
df_read_snappy =\
.option('basePath', 'path_to_dir/test_file_snappy')\
.option('compression', 'snappy')\
I looked at the durations in the Spark UI. For information, the persisted (deserialized) 50 million rows amount to 317.4 MB. Once written into a single parquet file, the file weighs 60.5 MB with gzip and 105.1 MB with snappy (this is expected, as gzip is supposed to have a better compression ratio). Spark spends 1.7 min (gzip) and 1.5 min (snappy) writing the file (single partition, so a single core has to carry out all the work). Reading times amount to 2.7 min (gzip) and 2.9 min (snappy) on a single core (since we have a single file / HDFS block). This is what I do not understand: where is snappy's higher performance?
Have I done something wrong? Is my "benchmarking protocol" flawed? Is the performance gain there, but I am not looking at the right metrics?
I must add that I am using Spark default conf. I did not change anything aside from specifying the number of executors, etc.
Many thanks for your help!
Note: the Spark parquet jar version is 1.8.1
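To put the figures quoted above side by side, here is a small plain-Python sketch that expresses each snappy measurement relative to gzip. It only restates the numbers from the question, taking the sizes to be in MB, and it measures nothing itself. The dictionary layout and function name are made up for illustration.

```python
# Measurements as reported in the question (sizes in MB, durations in minutes).
measurements = {
    "gzip": {"file_mb": 60.5, "write_min": 1.7, "read_min": 2.7},
    "snappy": {"file_mb": 105.1, "write_min": 1.5, "read_min": 2.9},
}


def snappy_relative_to_gzip(metric: str) -> float:
    """Ratio of the snappy value to the gzip value for one metric."""
    return measurements["snappy"][metric] / measurements["gzip"][metric]


for metric in ("file_mb", "write_min", "read_min"):
    print(metric, round(snappy_relative_to_gzip(metric), 2))
# file_mb 1.74
# write_min 0.88
# read_min 1.07
```

On these numbers the snappy file is about 74% larger, is written about 12% faster, and is read about 7% slower than the gzip file.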

Process multiple small files of total size 100GB in HDFS

I have a requirement in my project to process multiple .txt message files using PySpark. The files are moved from a local directory to an HDFS path (hdfs://messageDir/..) in batches. Each batch contains a few thousand .txt files, with a total size of around 100GB. Almost all of the files are less than 1 MB.
How does HDFS store these files and perform splits? Because every file is less than 1 MB (less than the HDFS block size of 64/128 MB), I don't think any split would happen, but the files will be replicated and stored on 3 different data nodes.
When I use Spark to read all the files inside the HDFS directory (hdfs://messageDir/..) using wildcard matching like *.txt, as below:
rdd = sc.textFile('hdfs://messageDir/*.txt')
How does Spark read the files and partition them, given that HDFS doesn't have any partitioning for these small files?
What if the volume grows over time and each batch reaches 1 TB of small files? Can someone tell me how this can be handled?
I think you are mixing things up a little.
You have files sitting in HDFS. Here, the block size is the important factor. Depending on your configuration, a block is normally 64 MB or 128 MB. Each of your 1 MB files therefore occupies a block of its own. A small file does not use a full 64 MB of disk (a block only takes up as much space as the data it holds), but the NameNode has to track every file and block in memory, so thousands of tiny files add a lot of overhead. Can you concatenate these TXT files? Otherwise the NameNode will come under pressure very quickly. HDFS is not made to store a large number of small files.
Spark can read files from HDFS, the local file system, MySQL, and so on. It cannot control the storage principles used there. As Spark uses RDDs, the data is partitioned so that each worker gets part of it. The number of partitions can be checked and controlled (using repartition). When reading from HDFS, this number is determined by the number of files and blocks.
Here is a nice explanation on how SparkContext.textFile() handles Partitioning and Splits on HDFS: How does Spark partition(ing) work on files in HDFS?
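As a simplified illustration of "number of files and blocks", the sketch below estimates how many input partitions a set of files would produce if every file contributes at least one split and larger files contribute one split per block. This is plain Python under that stated assumption, not the actual Hadoop input-format logic, which also takes a minimum partition count and other settings into account. The batch of 3000 files is a hypothetical example.

```python
import math
from typing import Iterable

MB = 1024 ** 2
GB = 1024 ** 3


def estimate_input_partitions(file_sizes: Iterable[int], split_bytes: int) -> int:
    """Every file yields at least one split; big files yield one per block."""
    return sum(max(1, math.ceil(size / split_bytes)) for size in file_sizes)


# Hypothetical batch: 3000 files of 1 MB each, read with a 128 MB split size.
small_files = [1 * MB] * 3000
print(estimate_input_partitions(small_files, 128 * MB))  # 3000

# The same volume in large files needs far fewer partitions.
print(estimate_input_partitions([1 * GB] * 3, 128 * MB))  # 24
```

Small files are never combined into one split under this model, so thousands of small files mean thousands of tiny tasks.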
You can read the files from Spark even if they are small. The problem is HDFS. The HDFS block size is usually large (64 MB, 128 MB, or more), so many small files create NameNode overhead.
If you want to produce bigger files, you need to tune the reducers: the number of files written is determined by how many reducers write output. You can use the coalesce or repartition method to control it.
Another way is to add one more step that merges the files. I wrote a Spark application that coalesces them: I set a target number of records per file, the application gets the total number of records, and from that the number of partitions to coalesce to can be estimated.
You can use Hive or another tool for this as well.
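The estimate behind that merge step can be sketched as follows. This is a minimal plain-Python version of the arithmetic only. The function name and the record counts are hypothetical; in a real job the total would come from counting the input, and the result would be passed to coalesce or repartition before writing.

```python
import math


def coalesce_target(total_records: int, records_per_file: int) -> int:
    """Number of output partitions so each file holds about records_per_file rows."""
    if records_per_file <= 0:
        raise ValueError("records_per_file must be positive")
    return max(1, math.ceil(total_records / records_per_file))


# Hypothetical batch: 25 million records, aiming for 2 million records per file.
print(coalesce_target(25_000_000, 2_000_000))  # 13
```

Since coalesce only merges existing partitions, the estimate only has an effect when it is smaller than the current partition count.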

Faster reading from blob storage via spark

I currently have a spark cluster set up with 4 worker nodes and 2 head nodes. I have a 1.5 GB CSV file in blob storage that I can access from one of the head nodes. I find that it takes quite a while to load this data and cache it using PySpark. Is there a way to load the data faster?
One thought I had was loading the data, then partitioning the data into k (number of nodes) different segments and saving them back to blob as parquet files. This way, I can load in different parts of the data set in parallel then union... However, I am unsure if all the data is just loaded on the head node, then when computation occurs, it distributes to the other machines. If the latter is true, then the partitioning would be useless.
Help would be much appreciated. Thank you.
Generally, you will want smaller files on blob storage so that data can be transferred from blob storage to compute in parallel, which gives faster transfer rates. A good rule of thumb is a file size between 64 MB and 256 MB; a good reference is Vida Ha's Data Storage Tips for Optimal Spark Performance.
Your idea of reading the file and then saving it back as Parquet (with the default snappy codec compression) is a good one. Parquet is natively supported by Spark and is often faster to query against. The only tweak would be to choose the number of partitions based on file size rather than on the number of worker nodes. The data is loaded onto the worker nodes, so partitioning is helpful: more tasks are created to read more files.
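To turn the 64 MB to 256 MB rule of thumb into a partition count for the 1.5 GB file in the question, a rough plain-Python sketch follows. It assumes the output is about the same size as the input, which will not hold exactly once the data is written as compressed Parquet. The function name is made up for illustration.

```python
import math
from typing import Tuple

MB = 1024 ** 2


def file_count_range(total_bytes: int, min_file_bytes: int,
                     max_file_bytes: int) -> Tuple[int, int]:
    """Fewest and most files that keep each file within the size bounds."""
    fewest = max(1, math.ceil(total_bytes / max_file_bytes))
    most = max(fewest, total_bytes // min_file_bytes)
    return fewest, most


# 1.5 GB is 1536 MB; the bounds are the 64 MB and 256 MB from the rule of thumb.
print(file_count_range(1536 * MB, 64 * MB, 256 * MB))  # (6, 24)
```

Any count in that range gives more files than the 4 worker nodes, so each worker can read several files in parallel tasks.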
