HDFS Date partition directory loop - apache-spark

I have an HDFS directory as below:
/user/staging/app_name/2022_05_06
Under this directory I have around 1,000 part files.
I want to loop over each of the part files and load them into Cassandra; the entire directory holds around 50 billion records.
That is far too much to process in a single shot, hence the idea is to read the individual part files and load them one by one in append mode.
Can anyone help with the approach?
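One way to do this is sketched below in PySpark, under a few assumptions: the spark-cassandra-connector is on the classpath, the part files are Parquet, and the host, keyspace, and table names are placeholders to replace with your own.

```python
# Minimal PySpark sketch: list the part files of one date partition via the
# Hadoop FileSystem API, then load them into Cassandra one by one in append mode.
# Assumes spark-cassandra-connector is available; keyspace/table/host are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition-to-cassandra")
         .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder host
         .getOrCreate())

partition_dir = "/user/staging/app_name/2022_05_06"

# Enumerate the part files through the JVM Hadoop FileSystem API (via py4j).
hadoop = spark._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
statuses = fs.listStatus(hadoop.fs.Path(partition_dir))
part_files = [s.getPath().toString() for s in statuses
              if s.getPath().getName().startswith("part-")]

for path in part_files:
    df = spark.read.parquet(path)  # adjust to the actual file format (csv/json/orc/...)
    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace="my_keyspace", table="my_table")  # placeholders
       .mode("append")
       .save())
```

Note that looping file by file serializes the job into ~1,000 small writes; reading the whole partition at once and letting Spark handle the partitioning is often simpler, but the loop above matches the append-one-at-a-time approach described in the question.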

Related

Disk read performance - Does splitting 100k+ files into subdirectories help read them faster?

I have 100k+ small JSON data files in one directory (not by choice). When accessing each of them, does a flat vs. pyramid directory structure make any difference? Does it help Node.js/Nginx/the filesystem retrieve them faster if the files are grouped, e.g. by first letter, into corresponding directories?
In other words, is it faster to get baaaa.json from /json/b/ (containing only b*.json) than from /json/ (containing all files), if it is safe to assume that each subdirectory holds 33 times fewer files? Does it make finding each file 33x faster? Or is there any disk read difference at all?
EDIT (in response to jfriend00's comment): I am not sure what the underlying filesystem will be yet, but let's assume an S3 bucket.
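For concreteness, a tiny sketch of the prefix scheme the question describes; the root path and one-letter split are only illustrations, and on S3 the "directory" becomes a key prefix rather than a real directory.

```python
# Illustrative sketch of the layout being asked about: map each file name to a
# one-letter subdirectory (on S3 this would simply be a key prefix).
import os

def sharded_path(root: str, filename: str) -> str:
    """e.g. ("/json", "baaaa.json") -> "/json/b/baaaa.json" """
    return os.path.join(root, filename[0].lower(), filename)

print(sharded_path("/json", "baaaa.json"))   # /json/b/baaaa.json
```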

Fastest way to sort very large files, preferably with progress

I have a 200GB flat file (one word per line) and I want to sort the file, then remove the duplicates and create one clean final TXT file out of it.
I tried sort with --parallel, but it ran for 3 days and I got frustrated and killed the process, as I didn't see any changes to the chunk files it created in /tmp.
I need to see the progress somehow and make sure it's not stuck and is actually working. What's the best way to do so? Are there any Linux tools or open source projects dedicated to something like this?
I don't use Linux, but if this is GNU sort, you should be able to see the temporary files it creates from another window to monitor progress. The parallel feature only helps during the initial pass that sorts and creates the initial set of temporary files. After that, the default is a 16-way merge.
Say, for example, the first pass creates temp files around 1GB in size. In that case, GNU sort will end up creating 200 of these 1GB temp files before starting the merge phase. The 16-way merge means that 16 of those temp files are merged at a time, creating temp files of about 16GB, and so on.
So one way to monitor progress is to monitor the creation of those temporary files.
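As a rough way to do that watching, here is a small Python sketch that polls the temp directory; the /tmp location and the sort* file-name pattern are assumptions, so point it at whatever directory you pass to sort -T and adjust the pattern if the names differ on your system.

```python
# Rough progress monitor: poll the temp directory and report how many temporary
# files exist and their total size. The /tmp location and "sort*" name pattern
# are assumptions; adjust both to match the directory passed to sort -T.
import glob
import os
import time

TMP_DIR = "/tmp"                          # or the directory passed to sort -T
PATTERN = os.path.join(TMP_DIR, "sort*")

def safe_size(path):
    try:
        return os.path.getsize(path)
    except OSError:                       # file may vanish between glob and stat
        return 0

while True:
    files = glob.glob(PATTERN)
    total_gb = sum(safe_size(f) for f in files) / 1e9
    print(f"{len(files)} temp files, {total_gb:.1f} GB total")
    time.sleep(30)
```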

How to properly load millions of files into an RDD

I have a very large set of JSON files (>1 million files) that I would like to work on with Spark.
But, I've never tried loading this much data into an RDD before, so I actually don't know if it can be done, or rather if it even should be done.
What is the correct pattern for dealing with this amount of data within RDD(s) in Spark?
The easiest way would be to create a directory, copy all the files into it, and pass that directory as the path when reading the data.
If you try to use glob patterns in the directory path, Spark might run into out-of-memory issues.
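A minimal PySpark sketch of that pattern, with a placeholder path and an illustrative repartition count:

```python
# Minimal PySpark sketch: point the reader at the directory itself rather than
# enumerating the million files; the path and repartition count are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bulk-json-load").getOrCreate()

df = spark.read.json("/data/all_json_files/")   # one directory holding every file

# Repartitioning early keeps the partition count manageable when the input
# consists of very many tiny files.
df = df.repartition(2000)
df.printSchema()
```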

Quickly write files made by MultipleTextOutputFormat to cloud store (Azure, S3, etc)

I have a Spark job that takes data from a Hive table, transforms it, and eventually gives me an RDD containing keys of filenames and values of that file's content. I then pass that onto a custom OutputFormat that creates individual files based on those keys. The end result is about 20 million files, each file being about 1-10MB in size.
My issue is now efficiently writing those files to my end destination. I can't put them in HDFS because 20 million small files will quickly grind HDFS to a halt. If I attempt to write directly to my cloud store it goes very slowly as it appears that each task will upload each file it gets sequentially. I'm interested in hearing about any technique I can use to speed up this process such that files will be uploaded with as much parallelization as possible.
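One technique worth sketching, under a few assumptions (the destination is S3 via boto3, the RDD holds (filename, content) string pairs, and the bucket, prefix, and thread count are placeholders to tune): upload concurrently from within each partition, so a single task is no longer limited to sequential PUTs.

```python
# Sketch of per-partition parallel uploads for an RDD of (filename, content) pairs.
# Assumes an S3 destination via boto3; the bucket, prefix, and worker count are
# placeholders, and `rdd` stands for the (filename, file_content) RDD built upstream.
from concurrent.futures import ThreadPoolExecutor

BUCKET = "my-output-bucket"   # placeholder
PREFIX = "exports/"           # placeholder

def upload_partition(pairs):
    import boto3                          # create the client on the executor
    s3 = boto3.client("s3")

    def put(pair):
        name, content = pair
        s3.put_object(Bucket=BUCKET, Key=PREFIX + name,
                      Body=content.encode("utf-8"))

    with ThreadPoolExecutor(max_workers=32) as pool:
        list(pool.map(put, pairs))        # drain the iterator to force the uploads

rdd.foreachPartition(upload_partition)
```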

Optimal directory structure for saving large number of files

A piece of software we developed generates more and more files, currently about 70,000 per day, 3-5 MB each. We store these files on a Linux server with an ext3 filesystem. The software creates a new directory every day and writes the files generated that day into this directory. Writing and reading such a large number of files is getting slower and slower (I mean, per file), so one of my colleagues suggested creating a subdirectory for every hour. We will test whether this makes the system faster, but the problem can be generalized:
Has anyone measured the speed of writing and reading files, as a function of the number of files in the target directory? Is there an optimal file count above which it's faster to put the files into subdirectories? What are the important parameters which may influence the optimum?
Thank you in advance.
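In case it helps anyone measure this on their own hardware, a rough benchmark sketch follows; the test directory, batch sizes, and payload are placeholders (shrink them for a quick dry run), and it should be run on the same ext3 filesystem to give meaningful numbers.

```python
# Rough benchmark sketch: time file creation as the target directory fills up.
# The directory, batch sizes, and payload are placeholders; reduce PAYLOAD or the
# counts for a dry run, and point TARGET at the production filesystem for real numbers.
import os
import time

TARGET = "/data/bench_flat"              # placeholder test directory
BATCH = 10_000
BATCHES = 10
PAYLOAD = b"x" * (3 * 1024 * 1024)       # ~3 MB, similar to the real files

os.makedirs(TARGET, exist_ok=True)
for b in range(BATCHES):
    start = time.time()
    for i in range(BATCH):
        with open(os.path.join(TARGET, f"file_{b}_{i}.bin"), "wb") as f:
            f.write(PAYLOAD)
    elapsed = time.time() - start
    print(f"files {b * BATCH}..{(b + 1) * BATCH}: {elapsed:.1f}s "
          f"({BATCH / elapsed:.0f} files/s)")
```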
