Process multiple small files of total size 100GB in HDFS - apache-spark

I have a requirement in my project to process multiple .txt message files using PySpark. The files are moved from a local directory to an HDFS path (hdfs://messageDir/..) in batches, and for every batch I can see a few thousand .txt files with a total size of around 100GB. Almost all of the files are less than 1 MB.
May I know how HDFS stores these files and performs splits? Because every file is less than 1 MB (less than the HDFS block size of 64/128MB), I don't think any splitting would happen, but each file will be replicated and stored on 3 different data nodes.
When I use Spark to read all the files inside the HDFS directory (hdfs://messageDir/..) using wildcard matching like *.txt as below:
rdd = sc.textFile('hdfs://messageDir/*.txt')
How does Spark read these files and create partitions, given that HDFS has no splits for such small files?
What if my file volume grows over time and I end up with 1TB of small files in every batch? Can someone tell me how this can be handled?

I think you are mixing things up a little.
You have files sitting in HDFS. Here, block size is the important factor. Depending on your configuration, a block is normally 64MB or 128MB. A 1MB file does not physically occupy a whole block on disk, but every file and block adds an entry to the NameNode's in-memory metadata, so huge numbers of tiny files will exhaust the NameNode long before you run out of disk. Can you concatenate these TXT files together? HDFS is not made to store a large number of small files.
Spark can read files from HDFS, local storage, MySQL and other sources, but it cannot control how they are stored there. Spark works with RDDs, which are partitioned so that parts of the data can be shipped to the workers. The number of partitions can be checked and controlled (using repartition). When reading from HDFS, this number is determined by the number of files and their blocks.
Here is a nice explanation of how SparkContext.textFile() handles partitioning and splits on HDFS: How does Spark partition(ing) work on files in HDFS?
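For example, a minimal sketch (the path and target partition count are illustrative) that inspects and then reduces the partitioning textFile() produces:

    from pyspark import SparkContext

    sc = SparkContext(appName="small-files-partitions")

    # Each small .txt file becomes at least one partition when read from HDFS.
    rdd = sc.textFile('hdfs://messageDir/*.txt')
    print(rdd.getNumPartitions())    # roughly one partition per file here

    # Collapse thousands of tiny partitions into a manageable number before
    # doing heavy work; coalesce merges partitions without a full shuffle.
    compacted = rdd.coalesce(200)
    print(compacted.getNumPartitions())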

You can read small files from Spark just fine; the problem is HDFS. The HDFS block size is usually large (64MB, 128MB or bigger), so many small files create NameNode overhead.
If you want to produce bigger files, you need to control the write side: the number of output files is determined by the number of tasks doing the writing, which you can control with coalesce or repartition.
Another way is to add a separate merge step. I wrote a Spark application that does this with coalesce: given a target number of records per output file, it counts the total number of records and estimates how many partitions to coalesce to.
You could also do this with Hive or other tools.
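A minimal sketch of that merge step, assuming a hypothetical input path and a target of 1,000,000 records per output file:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-small-files").getOrCreate()

    TARGET_RECORDS_PER_FILE = 1000000        # assumption: tune to your data

    df = spark.read.text("hdfs://messageDir/*.txt")   # hypothetical input path

    # Estimate how many output files are needed, then write fewer, larger files.
    total_records = df.count()
    num_files = max(1, total_records // TARGET_RECORDS_PER_FILE)

    df.coalesce(num_files).write.mode("overwrite").text("hdfs://messageDirMerged/")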

Related

Is it allowed to merge small files (which will be large when merged) in HDFS by using coalesce or repartition?

I'm using an hdfs-sink-connector to consume Kafka's data into HDFS.
The Kafka connector writes data every 10 minutes, and sometimes the written files are really small; they vary from 2MB to 100MB. So the written files actually waste my HDFS storage, since the block size is 256MB.
The directory is created per date, so I thought it would be good to merge the many small files into one big file in a daily batch. (I expect HDFS to automatically divide a large file into block-sized chunks as a result.)
I know there are many answers saying we could use Spark's coalesce(1) or repartition(1), but I'm worried about OOM errors if I read the whole directory and use those functions; it might be 90GB~100GB if I read every file.
Will 90~100GB in HDFS be allowed? Or do I not need to worry about it?
Could anyone let me know if there is a best practice for merging small HDFS files? Thanks!
So, the written files actually waste my HDFS storage since each block size is 256MB.
HDFS doesn't "fill out" the unused parts of the block. So a 2MB file only uses 2MB on disk (well, 6MB if you account for 3x replication). The main concern with small files on HDFS is that very large numbers of them overload the NameNode, which keeps metadata for every file and block in memory.
I worried about OOM error if I read the whole directory and use those functions
Spark may be an in-memory processing framework, but it still works if the data doesn't fit into memory. In such situations processing spills over onto disk and will be a bit slower.
Will 90~100GB in HDFS be allowed?
That is absolutely fine - this is big data after all. As you noted, the actual file will be split into smaller blocks in the background (but you won't see this unless you use hadoop fsck).
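For the daily merge itself, here is a minimal sketch (assuming the connector writes Parquet; the paths are placeholders) that sizes the output from the numbers in the question, roughly 100GB per day with a 256MB block:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily-merge").getOrCreate()

    # ~100GB of input per day and a 256MB block size => ~400 output files.
    TOTAL_BYTES = 100 * 1024**3
    BLOCK_SIZE = 256 * 1024**2
    num_files = max(1, TOTAL_BYTES // BLOCK_SIZE)

    # Hypothetical daily directory written by the HDFS sink connector.
    df = spark.read.parquet("hdfs:///topics/my-topic/2019-08-28/")

    # repartition(n), rather than coalesce(1), spreads the shuffle and the write
    # across executors, so no single task has to hold the whole day in memory.
    df.repartition(num_files).write.mode("overwrite").parquet(
        "hdfs:///topics/my-topic-merged/2019-08-28/")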

Understanding file distribution and partitioning in HDFS when using Hive

On the one hand, in HDFS docs they say:
HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.
Meaning every file will be split across nodes.
On the other hand, when I use Hive or Spark SQL, I manage the partitions in such a way that there is a folder for each partition, and all the files inside belong to this partition. For example:
/Sales
/country=Spain
/city=Barcelona
/2019-08-28.parquet
/2019-08-27.parquet
/city=Madrid
/2019-08-28.parquet
/2019-08-27.parquet
Let's say that each file's size is 1GB and the HDFS block size is 128 MB.
So I am confused. I don't understand whether city=Barcelona/2019-08-28.parquet is saved on only one node as a whole (possibly together with city=Barcelona/2019-08-27.parquet), or whether each file is distributed across 8 nodes.
If each file is distributed, then what is the benefit of the partitions?
If the data is distributed according to the partitions I define, how does HDFS know to do that? Does it look for folders with a name in the form of key=value and make sure they will be saved intact?
You are confused between "how HDFS stores the files that we dump into it" and "how Hive/Spark creates different directories in case of partitioning".
Let me try to provide you a perspective.
HDFS works as you have mentioned.
HDFS breaks files up into a number of blocks depending on the block size and the size of the file being stored. The metadata (directories, permissions, etc.) is an abstraction in the sense that the file you see as a single object (2019-08-27.parquet) is in fact distributed among nodes as blocks. The NameNode maintains the metadata.
However, when we partition, Hive/Spark creates different directories on HDFS. This ultimately helps when you query the data with conditions on the partitioned column: only the relevant directories are searched for the requested data. If you run explain on a query against your partitioned data, you will notice the Partition Filters in the FileScan phase of the plan.
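For example, a minimal sketch against the layout shown above (the path is a placeholder, and the exact plan text varies by Spark version):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-pruning").getOrCreate()

    sales = spark.read.parquet("hdfs:///Sales")

    # Only the country=Spain/city=Barcelona directories are scanned; explain()
    # shows this as a PartitionFilters entry in the FileScan node, e.g.
    # PartitionFilters: [isnotnull(city#..), (city#.. = Barcelona)]
    sales.filter("country = 'Spain' AND city = 'Barcelona'").explain()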
The partitioned data is still stored on HDFS in the same way that you mentioned.
Hope this helps!

Does size of part files play a role for Spark SQL performance

I am trying to query HDFS, which has a lot of part files (Avro). Recently we made a change to reduce parallelism, and thus the size of the part files has increased; each of these part files is in the range of 750MB to 2GB (we use Spark Streaming to write data to HDFS in 10-minute intervals, so the size of these files depends on the amount of data we process from the upstream). The number of part files is around 500. I was wondering whether the size of these part files / the number of part files plays any role in Spark SQL performance?
I can provide more information if required.
HDFS, MapReduce and Spark prefer files that are larger in size, as opposed to many small files. S3 also has issues; I am not sure if you mean HDFS or S3 here.
Repartitioning smaller files into a smaller number of larger files will - without getting into all the details - allow Spark or MR to process fewer but bigger blocks of data. This improves job speed by decreasing the number of map tasks needed to read the data in, and it reduces storage cost through less wastage and fewer NameNode contention issues.
All in all, this is the small files problem, about which there is much to read, e.g. https://www.infoworld.com/article/3004460/application-development/5-things-we-hate-about-spark.html. Just to be clear, I am a Spark fan.
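A related knob when the reads go through Spark SQL: spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes control how file bytes are packed into read partitions, which is what ties part-file size to the number of tasks. A minimal sketch with illustrative values and a placeholder path:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("read-partition-sizing")
             # Upper bound on bytes packed into one read partition (default 128MB).
             .config("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)
             # Estimated cost of opening a file; raising it packs more small
             # files into a single partition (default 4MB).
             .config("spark.sql.files.openCostInBytes", 8 * 1024 * 1024)
             .getOrCreate())

    # Reading Avro needs the spark-avro package on the classpath (assumption).
    df = spark.read.format("avro").load("hdfs:///data/part-*.avro")
    print(df.rdd.getNumPartitions())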
Generally, fewer, larger files are better.
One issue is whether the file can be split, and how.
Files compressed with .gz cannot be split: you have to read them from start to finish, so at most one worker at a time gets assigned to a single file (except near the end of a query, when speculation can trigger a second attempt). Use a splittable compression like Snappy and all is well.
Very small files are inefficient, as startup/commit overhead dominates.
On HDFS, small files also put load on the NameNode, so the ops team may be unhappy too.
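A quick way to see the splittability point, assuming a hypothetical ~1GB text file stored both plain and gzipped on HDFS with a 128MB block size:

    from pyspark import SparkContext

    sc = SparkContext(appName="splittability-check")

    # A ~1GB plain-text file splits into roughly one partition per HDFS block.
    plain = sc.textFile("hdfs:///data/big.txt")
    print(plain.getNumPartitions())     # ~8 with a 128MB block size

    # The gzipped copy cannot be split, so it becomes a single partition
    # that one task must read end to end.
    gzipped = sc.textFile("hdfs:///data/big.txt.gz")
    print(gzipped.getNumPartitions())   # 1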

spark mechanism of accessing files larger than (or lesser) than HDFS block size

This is mostly a theoretical query per se, but it is directly linked to how I should create my files in HDFS. So please bear with me for a bit.
I'm currently stuck creating DataFrames for a set of data stored in Parquet (Snappy) files sitting on HDFS. Each Parquet file is approximately 250+ MB in size, but there are around 6K files in total, which I see as the reason around 10K tasks are created while building the DataFrame, and it obviously runs longer than expected.
I went through some posts where the explanation of the optimal parquet file size to be 1G minimum has been given (https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html),
(Is it better to have one large parquet file or lots of smaller parquet files?).
I wanted to understand how Spark's processing is affected by the size of the files it is reading. More specifically, do the HDFS block size, and a file size greater or smaller than the HDFS block size, literally affect how Spark partitions get created? If yes, then how? I need to understand the granular-level details. If anyone has specific and precise links on this topic, it would be a great help.

Spark 2.0.0: read many .gz files

I have more than 150,000 .csv.gz files, organised in several folders (on s3) that have the same prefix. The size of each file is approximately 550KB. My goal is to read all these files into one DataFrame, the total size is about 80GB.
I am using EMR 5.0.0 with a decent cluster: 3 instances of c4.8xlarge
(36 vCPU, 60 GiB memory, EBS Storage:100 GiB).
I am reading the files using a wildcard character in the path:
sc.textFile("s3://bucket/directory/prefix*/*.csv.gz")
Then I do some map operations and transform the RDD into a DataFrame by calling toDF("col1_name", "col2_name", "col3_name"). I then make a few calls to UDFs to create new columns.
When I call df.show(), the operation takes a long time and never finishes.
I wonder why the process takes so long.
Is reading that large a number of .csv.gz files the problem?
.gz files are not splittable and will result in 150K partitions. Spark will not like that: it struggles with even several tens of thousands of partitions.
You might want to look into aws distcp or S3DistCp to copy the files to HDFS first, and then bundle them using an appropriate Hadoop InputFormat such as CombineFileInputFormat, which gloms many files into one. Here is an older blog with more ideas: http://inquidia.com/news-and-info/working-small-files-hadoop-part-3
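If copying and bundling the files first is not an option, a hedged PySpark workaround is to collapse the partitions right after the read so the rest of the job does not run with ~150K tasks (the target count below is illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="many-gz-files")

    # Each unsplittable .csv.gz file becomes its own partition (~150K of them).
    raw = sc.textFile("s3://bucket/directory/prefix*/*.csv.gz")

    # coalesce (no shuffle) makes each task read a batch of files, cutting the
    # task count to a few multiples of the ~108 cores in the cluster.
    combined = raw.coalesce(512)
    # ...map operations, toDF(...), UDFs, df.show() as before...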
