Understanding file distribution and partitioning in HDFS when using Hive - apache-spark

On the one hand, in HDFS docs they say:
HDFS is designed to support very large files. Applications that are
compatible with HDFS are those that deal with large data sets. These
applications write their data only once but they read it one or more
times and require these reads to be satisfied at streaming speeds.
HDFS supports write-once-read-many semantics on files. A typical block
size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64
MB chunks, and if possible, each chunk will reside on a different
DataNode.
Meaning every file will be split between nodes.
On the other hand, when I use Hive or Spark SQL, I manage the partitions in such a way that there is a folder for each partition, and all the files inside belong to this partition. For example:
/Sales
    /country=Spain
        /city=Barcelona
            /2019-08-28.parquet
            /2019-08-27.parquet
        /city=Madrid
            /2019-08-28.parquet
            /2019-08-27.parquet
Let's say that each file's size is 1GB and the HDFS block size is 128 MB.
So I am confused. I don't understand whether city=Barcelona/2019-08-28.parquet is saved on only one node as a whole (even together with city=Barcelona/2019-08-27.parquet), or whether each file is distributed across 8 nodes.
If each file is distributed, then what is the benefit of the partitions?
If the data is distributed according to the partitions I define, how does HDFS know to do that? Does it look for folders with a name in the form of key=value and make sure they will be saved intact?

You are mixing up two things: "how HDFS stores the files that we dump into it" and "how Hive/Spark creates different directories when partitioning".
Let me try to provide you a perspective.
HDFS works as you have mentioned.
HDFS breaks a file up into blocks; the number of blocks depends on the block size and the size of the file to be stored. The metadata (directories, permissions, etc.) is an abstraction, in the sense that the file you see as one (2019-08-27.parquet) is in fact distributed among nodes. The NameNode maintains this metadata.
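As a rough illustration of the block arithmetic, here is a minimal sketch in plain Python. It does not talk to HDFS; the function name split_into_blocks is made up for this example, and the sizes are the ones from the question (1GB files, 128 MB blocks).

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size):
    """Return the sizes (in bytes) of the blocks a file would be cut into.

    Hypothetical helper for illustration only: every block is full
    except possibly the last one, which holds the remainder.
    """
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes

# One 1 GB parquet file with a 128 MB block size.
blocks = split_into_blocks(1024 * MB, 128 * MB)
print(len(blocks))  # 8
```

Each of those blocks can end up on a different DataNode, regardless of which directory the file lives in.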
However, partitioning creates separate directories on HDFS. This helps when you query the data with conditions on the partition column: only the relevant directories are scanned for the requested data. If you query your partitioned data and run explain to look at the query plan, you will see the PartitionFilters in the FileScan step.
The partitioned data is still stored on HDFS in the same way that you mentioned.
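To make the directory-pruning idea concrete, here is a small sketch in plain Python. It only imitates the idea on a list of paths and is not how Hive or Spark implement it; parse_partition_values and prune are hypothetical names, and the paths are the ones from the question.

```python
def parse_partition_values(path):
    """Extract key=value directory components from a path into a dict."""
    values = {}
    for part in path.strip("/").split("/"):
        if "=" in part:
            key, value = part.split("=", 1)
            values[key] = value
    return values

def prune(paths, **filters):
    """Keep only the files whose partition directories match every filter."""
    return [
        p for p in paths
        if all(parse_partition_values(p).get(k) == v for k, v in filters.items())
    ]

files = [
    "/Sales/country=Spain/city=Barcelona/2019-08-28.parquet",
    "/Sales/country=Spain/city=Barcelona/2019-08-27.parquet",
    "/Sales/country=Spain/city=Madrid/2019-08-28.parquet",
    "/Sales/country=Spain/city=Madrid/2019-08-27.parquet",
]

# A query with "WHERE city = 'Madrid'" only needs the Madrid directory.
selected = prune(files, city="Madrid")
print(selected)
```

The benefit of partitioning is that the Barcelona files are never opened at all; how the blocks of the remaining files are spread over DataNodes is a separate matter.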
Hope this helps!

Related

Is it allowed to merge small files (which will be large when merged) in HDFS by using coalesce or repartition?

I'm using an hdfs-sink-connector to consume Kafka's data into HDFS.
The Kafka connector writes data every 10 minutes, and sometimes the written file's size is really small; it varies from 2MB to 100MB. So, the written files actually waste my HDFS storage since each block size is 256MB.
A directory is created per date, so I thought it would be great to merge the many small files into one big file in a daily batch. (I expect that HDFS will automatically divide one large file into blocks as a result.)
I know there are many answers which say we could use spark's coalesce(1) or repartition(1), but I worried about OOM error if I read the whole directory and use those functions; it might be more than 90GB~100GB if I read every file.
Will 90~100GB in HDFS be allowed? Do I not need to worry about it?
Could anyone let me know if there is a best practice for merging small HDFS files? Thanks!
So, the written files actually waste my HDFS storage since each block size is 256MB.
HDFS doesn't "fill out" the unused part of a block. So a 2MB file only uses 2MB on disk (well, 6MB if you account for 3x replication). The main concern with small files on HDFS is the NameNode: it keeps metadata for every file and block in memory, so billions of small files can cause problems.
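The arithmetic can be sketched in plain Python. This is only an illustration under the numbers mentioned here (2MB files, 256MB blocks, 3x replication); raw_disk_usage and block_count are hypothetical helper names, not HDFS APIs.

```python
MB = 1024 * 1024

def raw_disk_usage(file_size, replication=3):
    """Bytes actually consumed on disk: the file size times the replication factor."""
    return file_size * replication

def block_count(file_size, block_size):
    """Number of blocks the NameNode has to track for one file (ceiling division)."""
    return -(-file_size // block_size)

block_size = 256 * MB
small_files = [2 * MB] * 100          # one hundred 2MB files
merged_file = sum(small_files)        # the same data as a single 200MB file

# Disk usage is identical either way; only the number of tracked objects differs.
disk_separate = sum(raw_disk_usage(s) for s in small_files)
disk_merged = raw_disk_usage(merged_file)
blocks_separate = sum(block_count(s, block_size) for s in small_files)
blocks_merged = block_count(merged_file, block_size)

print(raw_disk_usage(2 * MB) // MB)    # 6
print(blocks_separate, blocks_merged)  # 100 1
```

So merging does not save disk space; what it reduces is the number of files and blocks the NameNode has to keep track of.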
I worried about OOM error if I read the whole directory and use those functions
Spark may be an in-memory processing framework, but it still works if the data doesn't fit into memory. In such situations processing spills over onto disk and will be a bit slower.
Will 90~100GB in HDFS be allowed?
That is absolutely fine - this is big data after all. As you noted, the actual file will be split into smaller blocks in the background (but you won't see this unless you use hadoop fsck).

Process multiple small files of total size 100GB in HDFS

I have a requirement in my project to process multiple .txt message files using PySpark. The files are moved from a local dir to an HDFS path (hdfs://messageDir/..) in batches, and for every batch I see a few thousand .txt files with a total size of around 100GB. Almost all of the files are less than 1 MB.
May I know how HDFS stores these files and performs splits? Because every file is less than 1 MB (less than the HDFS block size of 64/128MB), I don't think any split would happen, but the files will be replicated and stored on 3 different data nodes.
When I use Spark to read all the files inside the HDFS directory (hdfs://messageDir/..) using wildcard matching like *.txt, as below:
rdd = sc.textFile('hdfs://messageDir/*.txt')
How does Spark read the files and partition them, given that HDFS doesn't split these small files?
What if the volume increases over time and I get 1TB of small files in every batch? Can someone tell me how this can be handled?
I think you are mixing things up a little.
You have files sitting in HDFS. Here, the block size is the important factor. Depending on your configuration, a block is normally 64MB or 128MB. A 1MB file does not take up a full block on disk; it occupies only its actual size (times the replication factor). However, each file still needs at least one block of its own and its own metadata entry in the NameNode, so a few thousand tiny files mean a few thousand blocks to track. Can you concatenate these TXT files? Otherwise you will put a lot of pressure on the NameNode really quickly. HDFS is not made to store a large number of small files.
Spark can read files from HDFS, the local file system, MySQL and other sources. It cannot control the storage principles used there. Spark RDDs are partitioned so that each worker gets part of the data. The number of partitions can be checked and controlled (using repartition). When reading from HDFS, this number is determined by the number of files and blocks.
Here is a nice explanation on how SparkContext.textFile() handles Partitioning and Splits on HDFS: How does Spark partition(ing) work on files in HDFS?
You can read the files from Spark even if they are small. The problem is HDFS. The HDFS block size is usually large (64MB, 128MB, or more), so many small files create NameNode overhead.
If you want bigger files, you need to tune the writing stage: the number of files written is determined by how many reducers (tasks) write output. You can use the coalesce or repartition method to control it.
Another way is to add one more step that merges the files. I wrote a Spark application that coalesces them: I set a target number of records per file, the application gets the total number of records, and from these two values it estimates the number of partitions to coalesce to.
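The estimation step could look roughly like the sketch below. This is not the application mentioned above, just a plain-Python guess at the arithmetic; estimate_output_files and the example numbers are hypothetical.

```python
def estimate_output_files(total_records, target_records_per_file):
    """Estimate how many output files are needed for a target records-per-file.

    Uses ceiling division and always returns at least 1, so the result can
    be passed to coalesce(n) or repartition(n).
    """
    if target_records_per_file <= 0:
        raise ValueError("target_records_per_file must be positive")
    return max(1, -(-total_records // target_records_per_file))

# Hypothetical numbers: 10.5 million records, about 1 million per file.
n = estimate_output_files(10_500_000, 1_000_000)
print(n)  # 11

# In a PySpark job the total would come from df.count(), and the result
# would be used as df.coalesce(n) or df.repartition(n) before writing.
```

Note that counting the records is itself a pass over the data, so this adds some cost to the merge job.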
You can also use Hive or another tool for this.

Partitioning strategy in Parquet and Spark

I have a job that reads CSV files, converts them into data frames and writes them out as Parquet. I am using append mode while writing the data, so each write generates a separate Parquet file. My questions are:
1) If a new file gets appended every time I write data, will it impact read performance (as the data is now distributed across Parquet files of varying size)?
2) Is there a way to generate the Parquet partitions purely based on the size of the data?
3) Do we need to think about a custom partitioning strategy to implement point 2?
I am using Spark 2.3
1) It will affect read performance if spark.sql.parquet.mergeSchema=true. In this case, Spark needs to visit each file and read the schema from it. In other cases, I believe it does not affect read performance much.
2) There is no way to generate files purely based on data size. You may use repartition or coalesce. The latter will create uneven output files, but it is much faster because it avoids a full shuffle. Also, there is the config spark.sql.files.maxRecordsPerFile, or the write option maxRecordsPerFile, to prevent overly large files, but usually that is not an issue.
3) Yes, I think Spark has no built-in API to distribute data evenly by size. Column Statistics and Size Estimator may help with this.
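To see how the number of output files follows from the partitioning and from maxRecordsPerFile, here is a simplified model in plain Python. It assumes that each non-empty write task produces one file, split further once it exceeds the record limit; files_written is a hypothetical name and the record counts are made up.

```python
def files_written(records_per_task, max_records_per_file=0):
    """Simplified model of how many files a write produces.

    records_per_task: number of records in each write task (partition).
    max_records_per_file: 0 means no limit, otherwise a task's output is
    cut into a new file every max_records_per_file records.
    Empty tasks are ignored in this sketch.
    """
    total = 0
    for n in records_per_task:
        if n == 0:
            continue
        if max_records_per_file > 0:
            total += -(-n // max_records_per_file)  # ceiling division
        else:
            total += 1
    return total

even = [250_000] * 4                   # e.g. after repartition(4)
uneven = [700_000, 200_000, 100_000]   # e.g. after coalesce(3)

print(files_written(even))             # 4
print(files_written(uneven))           # 3
print(files_written(uneven, 300_000))  # 5
```

The limit is on records, not bytes, so it caps file size only indirectly, through the average record size.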

Spark mechanism for accessing files larger (or smaller) than the HDFS block size

This is mostly a theoretical question, but it is directly linked to how I should create my files in HDFS. So, please bear with me for a bit.
I recently got stuck creating DataFrames for a set of data stored in parquet (snappy) files sitting on HDFS. Each parquet file is approximately 250+ MB in size, and there are around 6k files in total. I see this as the reason why around 10K tasks are created when building the DF, which obviously runs longer than expected.
I went through some posts which explain that the optimal parquet file size is 1G minimum (https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html),
(Is it better to have one large parquet file or lots of smaller parquet files?).
I want to understand how Spark's processing is affected by the size of the files it is reading. More specifically, do the HDFS block size, and whether a file is larger or smaller than that block size, actually affect how Spark partitions get created? If yes, then how? I need to understand the details at a granular level. If anyone has specific and precise links on this topic, they would be a great help.

How the input data is split in Spark?

I'm coming from a Hadoop background. In Hadoop, if we have an input directory that contains lots of small files, each mapper task picks one file at a time and operates on that single file (we can change this behaviour and have each mapper pick more than one file, but that's not the default behaviour). I would like to know how that works in Spark. Does each Spark task pick files one by one, or..?
Spark behaves the same way as Hadoop when working with HDFS, as in fact Spark uses the same Hadoop InputFormats to read the data from HDFS.
But your statement is wrong. Hadoop will take files one by one only if each of your files is smaller than the block size, or if all the files are text and compressed with a non-splittable compression (like gzip-compressed CSV files).
So Spark would do the same: for each of the small input files it would create a separate "partition", and the first stage executed over your data would have the same number of tasks as the number of input files. This is why, for small files, it is recommended to use the wholeTextFiles function, as it would create far fewer partitions.
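A very simplified model of that behaviour can be written in plain Python. It assumes splittable files, a split size equal to the block size, and an input large enough that the minimum-partitions hint does not matter; it ignores details of the real InputFormat logic, and estimate_input_partitions is a hypothetical name.

```python
MB = 1024 * 1024

def estimate_input_partitions(file_sizes, block_size):
    """Rough estimate of input partitions for a set of splittable files.

    Simplification: every file yields at least one partition, and files
    larger than a block yield one partition per block (ceiling division).
    """
    return sum(max(1, -(-size // block_size)) for size in file_sizes)

block_size = 128 * MB

many_small = [1 * MB] * 3000   # a few thousand 1 MB files
one_large = [1024 * MB]        # a single 1 GB file

print(estimate_input_partitions(many_small, block_size))  # 3000
print(estimate_input_partitions(one_large, block_size))   # 8
```

Under this model, roughly 3 GB of tiny files produce far more tasks than a single 1 GB file, which is the overhead the answer is warning about.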
