There are a lot of binary diff tools out there:
xdelta
rdiff
vbdiff
rsync
and so on. They are great, but single-threaded. Is it possible to split large files into chunks, compute the diffs of the chunks in parallel, and then merge them into the final delta? Are there any other tools or libraries that can find the delta between very large files (hundreds of GB) in a reasonable amount of time and RAM? Maybe I could implement the algorithm myself, but I cannot find any papers about it.
ECMerge is multi-threaded and able to compare huge files.
libraries to find delta between very large files (hundreds Gb) in a reasonable amount of time and RAM?
Try HDiffPatch; it has been used on a 50GB game (not tested at 100GB): https://github.com/sisong/HDiffPatch
It runs fast on large files, but the differ is not multi-threaded.
Creating a patch: hdiffz -s-1k -c-zlib old_path new_path out_delta_file
Applying a patch: hpatchz old_path delta_file out_new_path
Diffing with -s-1k and ~100GB input files requires roughly 100GB*16/1k < 2GB of memory; diffing with -s-128k instead takes less time and less memory.
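As a quick back-of-the-envelope check of that bound (a sketch; the 16-bytes-per-1KB-block figure is the one implied by the formula above):

```python
# Rough memory estimate for `hdiffz -s-1k` on ~100 GB of input.
file_size   = 100 * 10**9   # ~100 GB
block_size  = 1024          # -s-1k: 1 KB match blocks
index_bytes = 16            # approximate index cost per block (from the formula above)
print(file_size / block_size * index_bytes / 2**30)  # ~1.5 GiB, i.e. under 2 GB
```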
bsdiff can be changed into a multi-threaded differ:
- the suffix array sort algorithm can be replaced by msufsort, a multi-threaded suffix array construction algorithm;
- the match function can be changed to a multi-threaded version, clipping the new file by the number of threads (see the sketch below);
- the bzip2 compressor can be changed to a multi-threaded version, such as pbzip2 or lzma2 ...
But this way needs a very large amount of memory! (not suitable for large files)
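A minimal sketch of the "clip the new file by thread count" idea, assuming the bsdiff4 Python bindings are available; the chunk size, the custom framing of the per-chunk patches, and the matching applier are illustrative choices, not part of bsdiff itself. As noted above, every worker holds the whole old file in RAM, so memory use is large:

```python
import os
import struct
from multiprocessing import Pool

import bsdiff4  # assumed available: pip install bsdiff4

CHUNK = 64 * 1024 * 1024  # illustrative slice size for the *new* file

_old = None  # per-worker copy of the old file


def _init(old_path):
    global _old
    with open(old_path, "rb") as f:
        _old = f.read()  # each worker holds the whole old file -> large memory use


def _diff_chunk(new_chunk):
    # Diff one slice of the new file against the whole old file.
    return bsdiff4.diff(_old, new_chunk)


def parallel_diff(old_path, new_path, delta_path, workers=os.cpu_count()):
    with open(new_path, "rb") as f:
        chunks = list(iter(lambda: f.read(CHUNK), b""))  # whole new file in memory too
    with Pool(workers, initializer=_init, initargs=(old_path,)) as pool:
        patches = pool.map(_diff_chunk, chunks)
    with open(delta_path, "wb") as out:
        out.write(struct.pack("<I", len(patches)))       # simple custom framing
        for p in patches:
            out.write(struct.pack("<I", len(p)))
            out.write(p)


def parallel_patch(old_path, delta_path, new_path):
    with open(old_path, "rb") as f:
        old = f.read()
    with open(delta_path, "rb") as f, open(new_path, "wb") as out:
        (count,) = struct.unpack("<I", f.read(4))
        for _ in range(count):
            (size,) = struct.unpack("<I", f.read(4))
            out.write(bsdiff4.patch(old, f.read(size)))
```

The resulting delta is a concatenation of independent per-chunk patches, so it will usually be somewhat larger than a single whole-file bsdiff patch; that is the price of parallelising the match step this way.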
Related
I have to unzip a file that is being downloaded from a server. I have gone through the zip file structure. What I want to understand is how many bytes of data the compressed stream is built from at a time. Does the compression algorithm run over the whole file and generate the output, or does it run on, say, 256 bytes, output the result, and then take the next 256 bytes?
Similarly, do I need to download the whole file before running the decompression algorithm, or can I download 256 bytes (for example) and run the algorithm on that?
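For what it's worth, DEFLATE (the usual compression method inside a zip) is a streaming format, so it can be decompressed in arbitrarily small pieces. A sketch with Python's zlib, assuming you have already located the raw compressed payload of one zip entry via its headers; stream_chunks and handle are hypothetical stand-ins for your download and output code:

```python
import zlib

def decompress_streaming(stream_chunks, handle):
    # Zip entries store raw DEFLATE data, hence the negative window-bits value.
    d = zlib.decompressobj(-zlib.MAX_WBITS)
    for chunk in stream_chunks:       # e.g. 256-byte pieces as they arrive
        handle(d.decompress(chunk))   # output is produced incrementally
    handle(d.flush())                 # whatever the decompressor still buffers
```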
I am about to publish a machine learning dataset. The dataset contains about 170,000 files (png images of 32px x 32px). I first wanted to share them as a zip archive (57.2MB). However, extracting those files takes extremely long (more than 15 minutes - I'm not sure when I started).
Is there a better format to share those files?
Try .tar.xz - better compression ratio but a little slower to extract than .tar.gz
I just did some Benchmarks:
Experiments / Benchmarks
I used dtrx to extract each archive and 'time dtrx filename' to measure the time.
Format       File size   Time to extract
.7z          27.7 MB     > 1h
.tar.bz2     29.1 MB     7.18s
.tar.lzma    29.3 MB     6.43s
.xz          29.3 MB     6.56s
.tar.gz      33.3 MB     6.56s
.zip         57.2 MB     > 30min
.jar         70.8 MB     5.64s
.tar         177.9 MB    5.40s
Interesting. The extracted content is 47 MB big. Why is .tar more than 3 times the size of its content?
Anyway. I think tar.bz2 might be a good choice.
Just use tar.gz at the lowest compression level (just to get rid of the tar zeros between files). png files are already compressed, so there is no point in trying to compress them further. (Though you can use various tools to try to minimize the size of each png file before putting them into the distribution.)
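For example (a sketch using Python's tarfile; the archive and directory names are just placeholders):

```python
import tarfile

# Lowest gzip level: enough to squeeze out tar's zero padding between files,
# without wasting time recompressing already-compressed PNG data.
with tarfile.open("dataset.tar.gz", "w:gz", compresslevel=1) as tar:
    tar.add("png_images")  # hypothetical directory holding the 170,000 PNGs
```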
I have a filesystem with 40 million files in a 10-level tree structure (around 500 GB in total). The problem I have is the backup. An incremental backup (Bacula) takes 9 hours (around 10 GB) with very low performance. Some directories have 50k files, others 10k files. The HDs are HW RAID, and I have the default Ubuntu LV on top. I think the bottleneck here is the number of files (the huge number of inodes). I'm trying to improve the performance (a full backup on the same FS takes 4+ days, at a 200k/s read speed).
- Do you think that partitioning the FS into several smaller FS would help? I can have 1000 smaller FS...
- Do you think that moving from HD to SSD would help?
- Any advice?
Thanks!
Moving to SSD will improve the speed of the backup. But an SSD will wear out sooner, and then you will really need that backup...
Can't you organise things so that you know where to look for changed/new files?
That way you only need to incrementally back up those folders.
Is it necessary for all your files to be online? Could you keep tar files of old trees 3 levels deep?
I guess a find -mtime -1 will take hours as well.
I hope that the backup is not using the same partition as the tree structure (everything under /tmp is a very bad plan); any temporary files the backup might make should be on a different partition.
Where are the new files coming from? If all files are changed by a process you control, that process can write a logfile with a list of the changed files, so the backup only has to touch those (see the sketch below for a find -mtime style alternative).
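A sketch of that kind of change scan (the marker-file path is illustrative; the printed list can be fed to tar or Bacula as the incremental file set):

```python
import os
import sys

def changed_since(root, since_ts):
    # Walk the tree once and yield files modified after the given timestamp.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime > since_ts:
                    yield path
            except OSError:
                pass  # file vanished between listing and stat

if __name__ == "__main__":
    since = os.stat("/var/backups/last_backup.stamp").st_mtime  # illustrative marker file
    for path in changed_since(sys.argv[1], since):
        print(path)
```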
I have a 500GB text file with about 10 billion rows that needs to be sorted in alphabetical order. What is the best algorithm to use? Can my implementation and setup be improved?
For now, I am using the coreutils sort command:
LANG=C
sort -k2,2 --field-separator=',' --buffer-size=(80% RAM) --temporary-directory=/volatile BigFile
I am running this in AWS EC2 on a 120GB RAM & 16 cores virtual machine. It takes the most part of the day.
/volatile is a 10TB RAID0 array.
The 'LANG=C' trick delivers a 2x speed gain (thanks to 1)
By default 'sort' uses 50% of the available RAM. Going up to 80-90% gives some improvement.
My understanding is that GNU 'sort' is a variant of the merge-sort algorithm with O(n log n) complexity, which is the fastest: see 2 & 3. Would moving to quicksort help (I'm happy with an unstable sort)?
One thing I have noticed is that only 8 cores are used. This is related to default_max_threads being set to 8 in the Linux coreutils sort.c (see 4). Would it help to recompile sort.c with 16?
Thanks!
FOLLOW-UP :
#dariusz
I used Chris's and your suggestions below.
As the data was already generated in batches, I sorted each bucket separately (on several separate machines) and then used 'sort --merge' to combine them. It works like a charm and is much faster: each line needs only about O(log(N/K)) comparisons during the bucket sorts instead of O(log N), and the final merge is close to a single linear pass.
I also rethought the project from scratch: some data post-processing is now performed while the data is generated, so that some unneeded data (noise) can be discarded before sorting takes place.
Altogether, the data size reduction and the sort/merge approach led to a massive reduction in the computing resources needed to achieve my objective.
Thanks for all your helpful comments.
The benefit of quicksort over mergesort is no additional memory overhead. The benefit of mergesort is the guaranteed O(n log n) run time, whereas quicksort can be much worse in the event of poor pivot point sampling. If you have no reason to be concerned about the memory use, don't change. If you do, just ensure you pick a quicksort implementation that does solid pivot sampling.
I don't think it would help spectacularly to recompile sort.c. It might be, on a micro-optimization scale. But your bottleneck here is going to be memory/disk speed, not amount of processor available. My intuition would be that 8 threads is going to be maxing out your I/O throughput already, and you would see no performance improvement, but this would certainly be dependent on your specific setup.
Also, you can gain significant performance increases by taking advantage of the distribution of your data. For example, evenly distributed data can be sorted very quickly by a single bucket sort pass, and then using mergesort to sort the buckets. This also has the added benefit of decreasing the total memory overhead of mergesort. If the memory complexity of mergesort is O(N), and you can separate your data into K buckets, your new memory overhead is O(N/K).
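A minimal sketch of that bucket-then-sort idea, assuming the keys are evenly distributed enough that partitioning on the first character gives buckets that each fit in RAM (the partitioning key and the temporary directory are illustrative; for the asker's comma-separated second field you would partition on that field instead):

```python
import os
import tempfile

def bucket_then_sort(in_path, out_path):
    """One bucket-sort pass on the first character, then sort each bucket in memory."""
    tmpdir = tempfile.mkdtemp()
    buckets = {}
    with open(in_path) as src:
        for line in src:
            first = line[:1]                           # range partition by first character
            if first not in buckets:
                buckets[first] = open(os.path.join(tmpdir, "bucket_%04x" % ord(first)), "w+")
            buckets[first].write(line)
    with open(out_path, "w") as out:
        for first in sorted(buckets):                  # buckets are disjoint key ranges,
            bucket = buckets[first]                    # so visiting them in order suffices
            bucket.seek(0)
            out.writelines(sorted(bucket))             # Python's sort is itself merge-based
            bucket.close()
```

With K roughly even buckets, each in-memory sort only has to hold about N/K lines at a time, which is the reduced memory overhead described above.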
Just an idea:
I assume the file contents are generated over quite a long period of time. Write an application (a script?) that periodically moves the data generated so far to a different location, appends its contents to another file, performs a sort on that other file, and repeats until all the data is gathered.
That way your system would spend more time sorting, but the results would be available sooner, since sorting partially-sorted data will be faster than sorting the unsorted data.
I think you need to perform that sort in 2 stages:
Split into trie-like buckets that fit into memory.
Iterate over the buckets in alphabetical order, fetch each one, sort it, and append it to the output file.
Here is an example.
Imagine you have a bucket limit of only 2 lines, and your input file is:
infile:
0000
0001
0002
0003
5
53
52
7000
On the 1st iteration, you read your input file (the "super-bucket" with an empty prefix) and split it according to the 1st letter.
There would be 3 output files:
0:
000
001
002
003
5:
(empty)
3
2
7:
000
As you can see, the bucket with filename/prefix 7 contains only one record, 000, which is "7000" split into 7 (the filename) and 000 (the tail of the string). Since this is just one record, we do not need to split this file any more. But the files "0" and "5" contain 4 and 3 records, which is more than the limit of 2, so we need to split them again.
After splitting:
00:
01
02
03
5:
(empty)
52:
(empty)
53:
(empty)
7:
000
As you can see, the files with prefixes "5" and "7" are already split, so we just need to split file "00".
As you can see, after splitting you will have a set of relatively small files.
Then run the 2nd stage:
Sort the filenames, and process the filenames in sorted order.
Sort each file and append the result to the output file, prepending the file name (prefix) to each output string.
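A rough sketch of those two stages (recursive splitting on the next character, then an in-order walk that sorts each small bucket and re-attaches its prefix). The 2-line limit, the buckets directory, and loading each bucket whole before splitting are simplifications for illustration, not a production external sort:

```python
import os

BUCKET_LIMIT = 2  # illustrative; in practice: as many lines as comfortably fit in RAM

def split_bucket(path, prefix, bucket_dir):
    # Stage 1: split one bucket file on its next character until it is small enough.
    with open(path) as f:
        lines = [l.rstrip("\n") for l in f]
    if len(lines) <= BUCKET_LIMIT:
        return [(prefix, sorted(lines))]               # small enough: sort in memory
    children = {}
    for line in lines:
        head, tail = (line[0], line[1:]) if line else ("", "")
        children.setdefault(head, []).append(tail)
    buckets = []
    for head in sorted(children):                      # recurse in alphabetical order
        child_path = os.path.join(bucket_dir, prefix + head)
        with open(child_path, "w") as out:
            out.write("\n".join(children[head]) + "\n")
        buckets.extend(split_bucket(child_path, prefix + head, bucket_dir))
    return buckets

def trie_sort(in_path, out_path, bucket_dir="buckets"):
    os.makedirs(bucket_dir, exist_ok=True)
    # Stage 2: buckets come back in prefix (alphabetical) order already, so just
    # re-attach each prefix and append each sorted bucket to the output file.
    with open(out_path, "w") as out:
        for prefix, tails in split_bucket(in_path, "", bucket_dir):
            for tail in tails:
                out.write(prefix + tail + "\n")
```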
I want to shuffle a large file with millions of lines of strings in Linux. I tried 'sort -R', but it is very slow (it takes about 50 minutes for a 16M file). Is there a faster utility that I can use in its place?
Use shuf instead of sort -R (man page).
The slowness of sort -R is probably due to it hashing every line. shuf just does a random permutation so it doesn't have that problem.
(This was suggested in a comment but for some reason not written as an answer by anyone)
The 50 minutes is not caused by the actual mechanics of sorting, based on your description. The time is likely spent waiting on /dev/random to generate enough entropy.
One approach is to use an external source of random data (http://random.org, for example) along with a variation on a Schwartzian Transform. The Schwartzian Transform turns the data to be sorted into "enriched" data with the sort key embedded. The data is sorted using the key and then the key is discarded.
To apply this to your problem:
generate a text file with random numbers, 1 per line, with the same number of lines as the file to be sorted. This can be done at any time, run in the background, run on a different server, downloaded from random.org, etc. The point is that this randomness is not generated while you are trying to sort.
create an enriched version of the file using paste:
paste random_number_file.txt string_data.txt > tmp_string_data.txt
sort this file:
sort tmp_string_data.txt > sorted_tmp_string_data.txt
remove the random data:
cut -f2- sorted_tmp_string_data.txt > random_string_data.txt
This is the basic idea. I tried it and it does work, but I don't have 16 million lines of text or 16 million lines of random numbers. You may want to pipeline some of those steps instead of saving it all to disk.
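A sketch of step 1 done locally (the file names match the paste/sort/cut commands above; Python's PRNG is non-blocking, so nothing waits on /dev/random):

```python
import random

# One zero-padded random key per input line, so a plain lexical sort works.
with open("string_data.txt") as src, open("random_number_file.txt", "w") as keys:
    for _ in src:
        keys.write("%010d\n" % random.randrange(10**10))
```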
You may try my tool: HugeFileProcessor. It's capable of shuffling files of hundreds of GBs in a reasonable time.
Here are the details of the shuffling implementation. It requires specifying batchSize, the number of lines to keep in RAM when writing to the output. The more the better (unless you are out of RAM), because the total shuffling time would be (number of lines in sourceFile) / batchSize * (time to fully read sourceFile). Please note that the program shuffles the whole file, not on a per-batch basis.
The algorithm is as follows.
Count the lines in sourceFile. This is done simply by reading the whole file line by line. (See some comparisons here.) This also measures how much time it takes to read the whole file once, so we can estimate how long a complete shuffle will take, because it requires Ceil(linesCount / batchSize) complete file reads.
As we now know the total linesCount, we can create an index array of linesCount size and shuffle it using Fisher–Yates (called orderArray in the code). This gives us the order in which we want the lines to appear in the shuffled file. Note that this is a global order over the whole file, not per batch or chunk.
Now the actual code. We need to get all the lines from sourceFile in the order we just computed, but we can't read the whole file into memory, so we split the task.
We go through sourceFile, reading all lines but keeping in memory only those whose positions fall within the first batchSize entries of orderArray. Once we have all of them, we write them to outFile in the required order, and batchSize/linesCount of the work is done.
Next we repeat the whole process again and again, taking the next part of orderArray and reading sourceFile from start to end for each part. Eventually the whole orderArray is processed and we are done.
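A compact sketch of that algorithm (HugeFileProcessor itself is a C# tool; the Python below just re-implements the batching scheme described, with names mirroring the description):

```python
import random

def shuffle_file(source_path, out_path, batch_size):
    # Count lines; this also times one full sequential read of the file.
    with open(source_path) as f:
        lines_count = sum(1 for _ in f)

    # Global permutation of all line indices (the "orderArray"), Fisher-Yates shuffled.
    order = list(range(lines_count))
    random.shuffle(order)

    with open(out_path, "w") as out:
        # One full sequential read of sourceFile per batch_size slice of orderArray.
        for start in range(0, lines_count, batch_size):
            batch = order[start:start + batch_size]
            wanted = set(batch)
            kept = {}                       # at most batch_size lines held in RAM
            with open(source_path) as f:
                for idx, line in enumerate(f):
                    if idx in wanted:
                        kept[idx] = line
            for idx in batch:               # emit them in the shuffled order
                out.write(kept[idx])
```

For example, shuffle_file("sourceFile", "outFile", 3500000) corresponds to the batchSize used in the timings below.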
Why does it work?
Because all we do is read the source file from start to end. There is no seeking forward or backward, and that's what HDDs like. The file gets read in chunks according to internal HDD buffers, FS blocks, CPU cache, etc., and everything is read sequentially.
Some numbers
On my machine (Core i5, 16GB RAM, Win8.1, HDD Toshiba DT01ACA200 2TB, NTFS) I was able to shuffle a file of 132 GB (84 000 000 lines) in around 5 hours using batchSize of 3 500 000. With batchSize of 2 000 000 it took around 8 hours. Reading speed was around 118000 lines per second.