ext performance handling millions of files - linux

I have a filesystem with 40 million files in a 10-level tree structure (around 500 GB in total). The problem I have is the backup. An incremental backup (Bacula) takes 9 hours (for around 10 GB), with very low throughput. Some directories have 50k files, others 10k files. The HDs are in HW RAID, and I have the default Ubuntu LV on top. I think the bottleneck here is the number of files (the huge number of inodes). I'm trying to improve the performance (a full backup of the same FS takes 4+ days, at 200k/s read speed).
- Do you think that partitioning the FS into several smaller FS would help? I can have 1000 smaller FS...
- Do you think that moving from HD to SSD would help?
- Any advice?
Thanks!

Moving to SSD will improve the speed of the backup. The SSD will wear out sooner, though, and then you will need the backup...
Can't you organise things so that you know where to look for changed/new files?
That way you only need to incrementally back up those folders.
Do all your files need to be online? Could you keep tar files of old trees, 3 levels deep?
I guess a find -mtime -1 will take hours as well.
I hope that the backup is not using the same partition as the tree structure
(putting everything under /tmp is a very bad plan); any temporary files the backup makes should be on a different partition.
Where are the new files coming from? If all files are changed by a process you control, that process can write a logfile listing the files it changed.
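The logfile idea could look something like the Python sketch below. It is only an illustration under assumptions: the function names and the one-path-per-line log format are made up here, and paths containing newlines are not handled. The writing process calls record_change for every file it touches; the backup job later reads the log instead of walking 40 million inodes.

```python
import os


def record_change(log_path, changed_file):
    """Append one changed path to the change log (one absolute path per line)."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(os.path.abspath(changed_file) + "\n")


def files_to_back_up(log_path):
    """Return the unique logged paths that still exist, in first-seen order."""
    seen = {}
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            path = line.rstrip("\n")
            if path and os.path.exists(path):
                seen.setdefault(path, None)  # dict keeps insertion order
    return list(seen)
```

The resulting list can then be handed to whatever backup tool is in use, and the log truncated once the backup has succeeded.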

Related

Why is Spark much faster at reading a directory compared to a list of filepaths?

I have a directory in S3 containing millions of small files. They are small (<10MB) and GZ, and I know it's inefficient for Spark. I am running a simple batch job to convert these files to parquet format. I've tried two different ways:
spark.read.csv("s3://input_bucket_name/data/")
as well as
spark.read.csv("file1", "file2"..."file8million")
where each file given in the list is located in the same bucket and subfolder.
I notice that when I feed in a whole directory, there isn't as much delay at the beginning for the driver indexing files (looks like around 20 minutes before the batch starts). In the UI for 1 directory, there is 1 task after this 20 minutes which looks like the conversion itself.
However, with individual filenames, this time for indexing increases to 2+ hours, and my job to do the conversion in the UI doesn't show up until this time. For the list of files, there are 2 tasks: (1) First one is listing leafs for 8mil files, and then (2) job that looks like the conversion itself.
I'm trying to understand why this is the case. Is there anything different about the underlying read API that would lead to this behaviour?
Spark assumes every path passed in is a directory,
so when given a list of paths, it has to do a list call on each,
which for S3 means 8M LIST calls against the S3 servers,
which are rate limited to about 3k/second, ignoring details like thread count on the client, HTTP connections etc.,
and with LIST billed at $0.005 per 1000 calls, 8M requests comes to $40.
Oh, and as the LIST returns nothing, the client falls back to a HEAD, which adds another S3 API call per path, doubling execution time and adding another $3.20 or so to the query cost, assuming HEAD is billed at about $0.0004 per 1000 requests.
In contrast,
listing a dir with 8M entries kicks off a single LIST request for the first 1K entries
and 7999 follow-ups.
s3a releases do an async prefetch of the next page of results (faster, especially if the incremental list iterators are used): one thread to fetch, one to process, and it will cost you 4c.
The big directory listing is the more efficient and cost-effective strategy, even ignoring EC2 server costs.
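To make the comparison concrete, here is the arithmetic as a small Python sketch. The constants are the figures quoted in this answer (LIST price, 1K entries per page, roughly 3k requests/second); they are assumptions for illustration rather than current AWS pricing, and the function names are made up.

```python
import math

LIST_PRICE_PER_1000 = 0.005   # USD per 1000 LIST calls, as quoted above
PAGE_SIZE = 1000              # entries returned per LIST page, as quoted above
LIST_RATE_PER_SECOND = 3000   # rough rate limit quoted above


def per_path_listing(num_paths):
    """One LIST call per path: (calls, cost in USD, minimum seconds at the rate limit)."""
    calls = num_paths
    cost = calls / 1000 * LIST_PRICE_PER_1000
    return calls, cost, calls / LIST_RATE_PER_SECOND


def directory_listing(num_entries):
    """Paged listing of a single directory: (calls, cost in USD)."""
    calls = math.ceil(num_entries / PAGE_SIZE)
    return calls, calls / 1000 * LIST_PRICE_PER_1000


if __name__ == "__main__":
    print(per_path_listing(8_000_000))
    print(directory_listing(8_000_000))
```

The per-path figure ignores the extra HEAD fallback and any client-side parallelism, so treat it as a lower bound on the number of calls rather than a timing prediction.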

Why are operations on a large folder so slow?

I have half a million small files in a folder, whose total size is only 70G. I could barely run any command (e.g. ls) without waiting for several minutes, so I decided to group the files 700 to a folder, giving 700 folders. After the grouping, commands run in several seconds.
However, when I later tried to copy the entire folder using rsync, the speed was surprisingly slow (about 8GB/30min, while my hard disk throughput is 150 MB/s, or 1GB/7s).
What is making rsync (or potentially any other commands) here so slow, and is there a solution?

Multithreaded binary diff tool?

There are a lot of binary diff tools out there:
xdelta
rdiff
vbdiff
rsync
and so on. They are great, but single-threaded. Is it possible to split large files into chunks, find the diff between chunks simultaneously, and then merge the results into the final delta? Are there any other tools or libraries for finding the delta between very large files (hundreds of GB) in a reasonable amount of time and RAM? Maybe I could implement the algorithm myself, but I cannot find any papers about it.
ECMerge is multi threaded and able to compare huge files.
"libraries to find delta between very large files (hundreds Gb) in a reasonable amount of time and RAM?"
Try HDiffPatch; it has been used on a 50GB game (not tested at 100GB): https://github.com/sisong/HDiffPatch
It runs fast on large files, but it is not a multi-threaded differ.
Creating a patch: hdiffz -s-1k -c-zlib old_path new_path out_delta_file
Applying a patch: hpatchz old_path delta_file out_new_path
Diffing 100GB input files with -s-1k requires ~100GB*16/1k < 2GB of memory; with -s-128k it needs less time and less memory.
bsdiff can be changed into a multi-threaded differ:
the suffix array sort algorithm can be replaced by msufsort, a multi-threaded suffix array construction algorithm;
the match function can be changed to a multi-threaded version, splitting the new file by thread count;
the bzip2 compressor can be changed to a multi-threaded version, such as pbzip2 or lzma2 ...
But this approach needs a very large amount of memory! (not suitable for large files)
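The memory rule of thumb quoted above for hdiffz (file size * 16 / block size) can be written as a tiny Python helper. This is only the arithmetic from this answer, not a measurement and not part of HDiffPatch; the function name and the 16-bytes-per-block default are taken from the estimate above and are otherwise hypothetical.

```python
GIB = 1024 ** 3


def block_diff_memory_bytes(file_size_bytes, block_size_bytes, bytes_per_block=16):
    """Estimated memory for a block-based diff: one small record per block of the old file."""
    return file_size_bytes * bytes_per_block // block_size_bytes


if __name__ == "__main__":
    for block in (1024, 128 * 1024):  # the -s-1k and -s-128k block sizes
        estimate = block_diff_memory_bytes(100 * GIB, block)
        print(block, estimate / GIB)
```

Larger blocks shrink the estimate proportionally, which is the reason the answer suggests -s-128k when memory is tight.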

Why does Spark job fail with "too many open files"?

I get "too many open files" during the shuffle phase of my Spark job. Why is my job opening so many files? What steps can I take to try to make my job succeed?
This has been answered on the spark user list:
The best way is definitely just to increase the ulimit if possible,
this is sort of an assumption we make in Spark that clusters will be
able to move it around.
You might be able to hack around this by decreasing the number of
reducers [or cores used by each node] but this could have some performance implications for your
job.
In general if a node in your cluster has C assigned cores and you run
a job with X reducers then Spark will open C*X files in parallel and
start writing. Shuffle consolidation will help decrease the total
number of files created but the number of file handles open at any
time doesn't change so it won't help the ulimit problem.
-Patrick Wendell
The default ulimit is 1024, which is ridiculously low for large-scale applications. HBase recommends up to 64K; modern Linux systems don't seem to have trouble with this many open files.
use
ulimit -a
to see your current maximum number of open files
ulimit -n
can temporarily change the number of open files; you need to update the system configuration files and per-user limits to make this permanent. On CentOS and RedHat systems, that can be found in
/etc/sysctl.conf
/etc/security/limits.conf
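As a rough check of the C*X rule quoted above against the current limit, here is a small Python sketch. It is Unix-only, since it relies on the standard resource module, and the function names are made up for illustration; it says nothing about what Spark itself does internally.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def shuffle_files_per_node(cores, reducers):
    """Files a node opens in parallel during shuffle: C * X, per the quote above."""
    return cores * reducers


def fits_within_limit(cores, reducers, soft_limit):
    """True if C * X stays at or below the soft limit (or the limit is unlimited)."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return shuffle_files_per_node(cores, reducers) <= soft_limit


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("soft:", soft, "hard:", hard)
    print("8 cores x 200 reducers fits:", fits_within_limit(8, 200, soft))
```

The 8 cores and 200 reducers in the usage lines are example values only. The process also needs descriptors for sockets, jars and logs, so some headroom above C*X is sensible.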
Another solution for this error is reducing the number of partitions.
Check to see if you've got a lot of partitions with:
someBigSDF.rdd.getNumPartitions()
Out[]: 200
# count() below comes from pyspark.sql.functions
from pyspark.sql.functions import count
# if you need to persist the repartition, do it like this
someBigSDF = someBigSDF.repartition(20)
# if you just need it for one transformation/action,
# you can do the repartition inline like this
someBigSDF.repartition(20).groupBy("SomeDt").agg(count("SomeQty")).orderBy("SomeDt").show()

Fastest way to shuffle lines in a file in Linux

I want to shuffle a large file with millions of lines of strings in Linux. I tried 'sort -R', but it is very slow (it takes about 50 mins for a 16M file). Is there a faster utility that I can use instead?
Use shuf instead of sort -R (man page).
The slowness of sort -R is probably due to it hashing every line. shuf just does a random permutation so it doesn't have that problem.
(This was suggested in a comment but for some reason not written as an answer by anyone)
The 50 minutes is not caused by the actual mechanics of sorting, based on your description. The time is likely spent waiting on /dev/random to generate enough entropy.
One approach is to use an external source of random data (http://random.org, for example) along with a variation on a Schwartzian Transform. The Schwartzian Transform turns the data to be sorted into "enriched" data with the sort key embedded. The data is sorted using the key and then the key is discarded.
To apply this to your problem:
Generate a text file with random numbers, 1 per line, with the same number of lines as the file to be sorted. This can be done at any time, run in the background, run on a different server, downloaded from random.org, etc. The point is that this randomness is not generated while you are trying to sort.
Create an enriched version of the file using paste:
paste random_number_file.txt string_data.txt > tmp_string_data.txt
Sort this file:
sort tmp_string_data.txt > sorted_tmp_string_data.txt
Remove the random data:
cut -f2- sorted_tmp_string_data.txt > random_string_data.txt
This is the basic idea. I tried it and it does work, but I don't have 16 million lines of text or 16 million lines of random numbers. You may want to pipeline some of those steps instead of saving it all to disk.
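The same decorate, sort, undecorate idea can be sketched in Python for data that fits in memory. One assumption to flag: this version draws the keys from Python's own pseudo-random generator rather than from a pre-generated file of random numbers, and shuffle_lines is just an illustrative name.

```python
import random


def shuffle_lines(lines, rng=None):
    """Shuffle lines by attaching a random key, sorting on it, then dropping it."""
    rng = rng or random.Random()
    decorated = [(rng.random(), line) for line in lines]  # like paste
    decorated.sort(key=lambda pair: pair[0])              # like sort
    return [line for _, line in decorated]                # like cut -f2-


if __name__ == "__main__":
    print(shuffle_lines(["alpha", "beta", "gamma", "delta"]))
```

For in-memory data random.shuffle does the same job directly; the sketch only mirrors the three shell steps above.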
You may try my tool: HugeFileProcessor. It's capable of shuffling files of hundreds of GBs in a reasonable time.
Here are the details of the shuffling implementation. It requires specifying batchSize, the number of lines to keep in RAM when writing to output. The larger the better (unless you run out of RAM), because total shuffling time is (number of lines in sourceFile) / batchSize * (time to fully read sourceFile). Please note that the program shuffles the whole file, not on a per-batch basis.
The algorithm is as follows.
Count the lines in sourceFile. This is done simply by reading the whole file line by line. (See some comparisons here.) This also gives a measurement of how long it takes to read the whole file once, so we can estimate how long a complete shuffle will take, because it requires Ceil(linesCount / batchSize) complete file reads.
As we now know the total linesCount, we can create an index array of linesCount size and shuffle it using Fisher–Yates (called orderArray in the code). This gives us the order in which we want the lines to appear in the shuffled file. Note that this is a global order over the whole file, not per batch or chunk.
Now the actual code. We need to get all lines from sourceFile in the order we just computed, but we can't read the whole file into memory. So we just split the task.
We go through sourceFile reading all lines and storing in memory only those lines that fall in the first batchSize entries of orderArray. Once we have all these lines, we write them into outFile in the required order, and batchSize/linesCount of the work is done.
Next we repeat the whole process again and again, taking the next parts of orderArray and reading sourceFile from start to end for each part. Eventually the whole orderArray is processed and we are done.
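The steps above can be sketched in Python as follows. This is not HugeFileProcessor's actual code, just a minimal illustration of the described algorithm under assumptions: the names are made up, the file is treated as UTF-8 text, and the order array is a plain Python list, which would itself be large for tens of millions of lines.

```python
import random


def shuffle_file(source_path, out_path, batch_size, rng=None):
    """Shuffle lines of source_path into out_path, keeping at most batch_size lines in RAM."""
    rng = rng or random.Random()

    # Step 1: count lines with one sequential read.
    with open(source_path, encoding="utf-8") as src:
        lines_count = sum(1 for _ in src)

    # Step 2: order_array[k] is the source line number written at output position k.
    order_array = list(range(lines_count))
    rng.shuffle(order_array)  # random.shuffle is a Fisher-Yates shuffle

    with open(out_path, "w", encoding="utf-8") as out:
        # Steps 3-4: one full sequential read of the source per batch.
        for start in range(0, lines_count, batch_size):
            batch = order_array[start:start + batch_size]
            wanted = {line_no: pos for pos, line_no in enumerate(batch)}
            kept = [None] * len(batch)
            with open(source_path, encoding="utf-8") as src:
                for line_no, line in enumerate(src):
                    pos = wanted.get(line_no)
                    if pos is not None:
                        # Make sure a final line without a newline still ends one.
                        kept[pos] = line if line.endswith("\n") else line + "\n"
            out.writelines(kept)
```

A call such as shuffle_file("in.txt", "out.txt", batch_size=3_500_000) would then make Ceil(linesCount / batchSize) passes over in.txt, plus the initial counting pass; the file names here are placeholders.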
Why does it work?
Because all we do is read the source file from start to end. There are no seeks forward or backward, and that's what HDDs like. The file gets read in chunks according to internal HDD buffers, FS blocks, CPU cache, etc., and everything is read sequentially.
Some numbers
On my machine (Core i5, 16GB RAM, Win8.1, HDD Toshiba DT01ACA200 2TB, NTFS) I was able to shuffle a file of 132 GB (84 000 000 lines) in around 5 hours using batchSize of 3 500 000. With batchSize of 2 000 000 it took around 8 hours. Reading speed was around 118000 lines per second.
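As a sanity check on these numbers, the time formula given earlier (number of passes times the time for one full read) can be evaluated with a short Python sketch. The function name is made up, and the estimate ignores the initial line-counting pass and write time.

```python
def estimated_shuffle_hours(lines_count, batch_size, lines_per_second):
    """Ceil(lines_count / batch_size) full reads, each taking lines_count / lines_per_second."""
    passes = (lines_count + batch_size - 1) // batch_size  # integer ceiling
    seconds_per_pass = lines_count / lines_per_second
    return passes * seconds_per_pass / 3600


if __name__ == "__main__":
    for batch in (3_500_000, 2_000_000):
        print(batch, round(estimated_shuffle_hours(84_000_000, batch, 118_000), 1))
```

With the figures quoted above this gives roughly 4.7 hours for a batchSize of 3 500 000 and roughly 8.3 hours for 2 000 000, in line with the reported 5 and 8 hours.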

Resources