compare large txt files - c#-4.0

I'm trying to compare two huge text files (100 MB to 500 MB each) in order to extract the lines that differ between them and write those differing lines to another text file.
I found An O(ND) Difference Algorithm for C# on the Net, but when I implemented it the result was an OutOfMemoryException.
Do you know a way out of this dead end?
Thank you very much.
Antonio
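
A minimal C# sketch of one way around the memory problem, assuming "lines that differ" can be treated as lines present in one file but not in the other (i.e. no positional diff is needed): stream the first file line by line against a set built from the second, so only one file's distinct lines are held in memory rather than both files plus the diff trace. File names here are placeholders.

    using System.Collections.Generic;
    using System.IO;

    class LineDiff
    {
        // Writes every line of fileA that does not occur anywhere in fileB.
        // Only fileB's distinct lines are kept in memory; fileA is streamed.
        static void WriteLinesOnlyInA(string fileA, string fileB, string output)
        {
            var linesInB = new HashSet<string>(File.ReadLines(fileB));

            using (var writer = new StreamWriter(output))
            {
                foreach (var line in File.ReadLines(fileA))
                {
                    if (!linesInB.Contains(line))
                        writer.WriteLine(line);
                }
            }
        }
    }

Run it in both directions to get the lines unique to each file. For 500 MB inputs the set can still be large; storing per-line hashes instead of the strings would shrink it further, at the cost of a small risk of collisions.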

Related

How to better merge .gz files?

I want to merge two files ending with .gz. I have tried two approaches, among others. For the first, I directly concatenated the files using cat; for the other, I first decompressed each file with gunzip, then concatenated the decompressed files before compressing again with gzip. Interestingly, I found that the resulting files differ in size. Could anyone explain this puzzle?
Thank you in advance!
If your question is which is better, then concatenating is faster, but recompressing will give you better compression. So it depends on how you define "better".
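
For illustration, a small C# sketch of the two approaches (file names are placeholders): byte-level concatenation produces a multi-member gzip stream, while decompressing and recompressing lets the compressor see all the data as one member, which is why it usually comes out smaller.

    using System.IO;
    using System.IO.Compression;

    class GzMerge
    {
        // Approach 1: byte-level concatenation, the equivalent of `cat a.gz b.gz > merged.gz`.
        // The result is a valid multi-member gzip file.
        static void Concatenate(string output, params string[] inputs)
        {
            using (var dst = File.Create(output))
            {
                foreach (var path in inputs)
                    using (var src = File.OpenRead(path))
                        src.CopyTo(dst);
            }
        }

        // Approach 2: decompress each input and recompress everything as a single member.
        // Slower, but the compressor sees all the data at once, so the result is usually smaller.
        static void Recompress(string output, params string[] inputs)
        {
            using (var dst = new GZipStream(File.Create(output), CompressionMode.Compress))
            {
                foreach (var path in inputs)
                    using (var src = new GZipStream(File.OpenRead(path), CompressionMode.Decompress))
                        src.CopyTo(dst);
            }
        }
    }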

How can I combine many files into single file without compression, keeping the same behavior across platforms?

I have a folder which includes a lot of subfolders and files. I want to combine all those files into one single large file. That file should be able to be expanded back into the original folder and files.
Another requirement is that the method should produce exactly the same output (the single large file) across different platforms (Node.js, Android, iOS). I've tried the ZIP utility's store mode; it does produce one file combining all input files without compressing them, which is good. However, when I try it on Node.js and with the Windows 7-Zip software (ZIP format, Store mode), I find that the outputs are not exactly the same. The two large files' sizes differ slightly and of course they have different MD5s. Though they can both be expanded back into identical files, the single files don't meet my requirement.
Another option I tried is the Tar file format. Node.js and 7-Zip render different outputs there as well.
Is there anything I'm missing with ZIP store mode or the Tar format, e.g. using specific versions or a customized ZIP utility?
Or could you suggest another method to accomplish this?
I need a method to combine files that follows exactly the same protocol across the Node.js, Android, and iOS platforms.
Thank you.
The problem is your requirement. You should only require that the files and directory structure be exactly reconstructed after extraction, not that the archive itself be exactly the same. Instead of running your MD5 on the archive, run it on the reconstructed files.
There is no way to assure the same zip result using different compressors, or different versions of the same compressor, or the same version of the same code with different settings. If you do not have complete control of the code creating and compressing the data, e.g., by virtue of having written it yourself and assuring portability across platforms, then you cannot guarantee that the archive files will be the same.
More importantly, there is no need to have that guarantee. If you want to assure the integrity of the transfer, check the result of extraction, not the intermediate archive file. Then your check is even better than checking the archive, since you are then also verifying that there were no bugs in the construction and extraction processes.
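
As a sketch of that check (the question's platforms are Node.js, Android and iOS, so treat the C# below as pseudocode for the approach): hash the extracted tree in a platform-independent order instead of hashing the archive.

    using System;
    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;
    using System.Text;

    static class TreeHash
    {
        // One MD5 over the reconstructed tree: walk the files in a fixed order and
        // feed each relative path plus its contents into the same hash.
        public static string HashDirectory(string root)
        {
            using (var md5 = MD5.Create())
            {
                var files = Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
                                     .OrderBy(p => p, StringComparer.Ordinal);

                foreach (var path in files)
                {
                    // Normalise separators so Windows and Unix walks agree.
                    var relative = path.Substring(root.Length).TrimStart('\\', '/').Replace('\\', '/');
                    var nameBytes = Encoding.UTF8.GetBytes(relative);
                    md5.TransformBlock(nameBytes, 0, nameBytes.Length, null, 0);

                    var content = File.ReadAllBytes(path);
                    md5.TransformBlock(content, 0, content.Length, null, 0);
                }

                md5.TransformFinalBlock(new byte[0], 0, 0);
                return BitConverter.ToString(md5.Hash).Replace("-", "");
            }
        }
    }

Two archives produced by different tools will then verify as equal as long as they extract to the same files.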

Re-order lines in textfile for better compression ratio

I have lots of huge text-files which need to be compressed with the highest ratio possible. Compression speed may be slow, as long as decompression is reasonably fast.
Each line in these files contains one dataset, and they can be stored in any order.
A similar problem to this one:
Sorting a file to optimize for compression efficiency
But for me compression speed is not an issue. Are there ready-to-use tools to group similar lines together? Or maybe just an algorithm I can implement?
Sorting alone gave some improvement, but I suspect a lot more is possible.
Each file is about 600 million lines long, ~40 bytes per line, 24 GB in total; compressed to ~10 GB with xz.
Here's a fairly naïve algorithm:
Choose an initial line at random and write to the compression stream.
While remaining lines > 0:
    Save the state of the compression stream
    For each remaining line in the text file:
        write the line to the compression stream and record the resulting compressed length
        roll back to the saved state of the compression stream
    Write the line that resulted in the lowest compressed length to the compression stream
    Free the saved state
This is a greedy algorithm and won't be globally optimal, but it should be pretty good at putting together lines that compress well when they follow one another. It's O(n²), but you said compression speed wasn't an issue. The main advantage is that it's empirical: it doesn't rely on assumptions about which line order will compress well but actually measures it.
If you use zlib, it provides a function deflateCopy that duplicates the state of the compression stream, although it's apparently pretty expensive.
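
A rough C# sketch of the greedy idea (not the exact algorithm above, since .NET's DeflateStream has no equivalent of deflateCopy): instead of saving and rolling back the stream state, it scores each candidate by the compressed length of the previously chosen line plus the candidate.

    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;
    using System.Text;

    static class GreedyOrder
    {
        // Compressed size of a small piece of text; used as the score for how well
        // two lines sit next to each other.
        static int CompressedLength(string text)
        {
            using (var buffer = new MemoryStream())
            {
                using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, true))
                {
                    var bytes = Encoding.UTF8.GetBytes(text);
                    deflate.Write(bytes, 0, bytes.Length);
                }
                return (int)buffer.Length;
            }
        }

        // Greedy ordering: repeatedly append the remaining line that compresses best
        // when placed immediately after the line chosen last.
        public static List<string> Order(List<string> lines)
        {
            var result = new List<string>();
            if (lines.Count == 0) return result;

            var remaining = new List<string>(lines);
            var current = remaining[0];              // "choose an initial line"
            remaining.RemoveAt(0);
            result.Add(current);

            while (remaining.Count > 0)
            {
                int bestIndex = 0, bestLength = int.MaxValue;
                for (int i = 0; i < remaining.Count; i++)
                {
                    int length = CompressedLength(current + "\n" + remaining[i]);
                    if (length < bestLength)
                    {
                        bestLength = length;
                        bestIndex = i;
                    }
                }
                current = remaining[bestIndex];
                remaining.RemoveAt(bestIndex);
                result.Add(current);
            }
            return result;
        }
    }

For 600 million lines the full O(n²) scan is impractical as written, but the same scoring can be restricted to smaller buckets of candidate lines (for example lines sharing a common prefix) to keep it tractable.
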
Edit: if you approach this problem as outputting all lines in a sequence while trying to minimize the total edit distance between consecutive lines in the sequence, then it reduces to the Travelling Salesman Problem, with the edit distance as your "distance" and all your lines as the nodes you have to visit. So you could look into the various approaches to that problem and apply them here. Even then, the optimal TSP solution in terms of edit distance isn't necessarily going to be the file that compresses the smallest.
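
If you go the TSP route, the per-edge cost is just the edit distance between two lines; a standard two-row Levenshtein such as the sketch below could replace the compressed-length score in a nearest-neighbour heuristic.

    using System;

    static class Lines
    {
        // Levenshtein edit distance between two lines, computed with two rolling rows.
        public static int EditDistance(string a, string b)
        {
            var previous = new int[b.Length + 1];
            var current = new int[b.Length + 1];

            for (int j = 0; j <= b.Length; j++)
                previous[j] = j;

            for (int i = 1; i <= a.Length; i++)
            {
                current[0] = i;
                for (int j = 1; j <= b.Length; j++)
                {
                    int substitution = a[i - 1] == b[j - 1] ? 0 : 1;
                    current[j] = Math.Min(Math.Min(current[j - 1] + 1,      // insertion
                                                   previous[j] + 1),        // deletion
                                          previous[j - 1] + substitution);  // substitution
                }
                var swap = previous; previous = current; current = swap;
            }
            return previous[b.Length];
        }
    }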

Using a hashset to break up a .txt file

I am trying to write a simple plagiarism program by taking one file and comparing it to other files: I split the file up every six words and compare those chunks to the other files, which are split up the same way. I was reading up on hashsets and figured I might try to split them up with hashsets, but I have no idea how. Any advice would be appreciated.
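
One way the hashset idea could look in C# (a sketch; the file paths and the choice of overlapping chunks are assumptions): split each file into six-word chunks, put them in a HashSet<string>, and intersect the sets to find phrases shared between two documents.

    using System;
    using System.Collections.Generic;
    using System.IO;

    static class Shingles
    {
        // Break a file into six-word chunks ("shingles") and store them in a hash set.
        // This uses overlapping chunks; change i++ to i += 6 for non-overlapping ones.
        public static HashSet<string> SixWordChunks(string path)
        {
            var words = File.ReadAllText(path)
                            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

            var chunks = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
            for (int i = 0; i + 6 <= words.Length; i++)
                chunks.Add(string.Join(" ", words, i, 6));
            return chunks;
        }

        // Chunks that appear in both documents; candidate copied phrases.
        public static HashSet<string> Common(HashSet<string> a, HashSet<string> b)
        {
            var shared = new HashSet<string>(a, StringComparer.OrdinalIgnoreCase);
            shared.IntersectWith(b);
            return shared;
        }
    }

Common(chunksA, chunksB).Count divided by chunksA.Count then gives a rough similarity score for a pair of files.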

Hashing raw audio data

I'm looking for a solution to this task: I want to open any audio file (MP3, FLAC, WAV), decode it to its extracted (raw) form, and hash that data. The thing is, I don't know how to get this extracted audio data. DirectX could do the job, right?
Also, I suppose that if I have, for example, two MP3 files, both 320 kbps, where only the ID3 tags differ and one of the files has garbage mixed in with the audio data (the MP3 format allows garbage inside), then after extracting both files I should get exactly the same audio data, right? The data would only differ if, say, one file were 128 kbps and the other 320.
So the question is: is there a way to use DirectX to get this extracted audio data? I imagine it would be some function returning a byte array or something. It would also be handy to extract the whole file without playback. I want to process hundreds of files, so 3-10 minutes per file (if files have to be played at natural speed for decoding) is far worse than about one second per file (extraction only).
I hope my question is understandable.
Thanks a lot for answers,
Aaron
Use http://sox.sourceforge.net/ (multiplatform). It's faster than realtime, as you'd like, and it's designed for batch use much more than DirectX is. For example: sox -r 48k -b 16 -L -c 1 in.mp3 out.raw. Loop that over your hundreds of files using whatever scripting language you like (bash, python, .bat, ...).
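
A possible C# wrapper around that, looping over a folder and hashing the decoded output (the sox arguments are the ones from the command above; the paths and the use of MD5 are assumptions):

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Security.Cryptography;

    class AudioHash
    {
        // Decode one file to raw PCM with sox, hash the raw data, then delete the temp file.
        static string HashDecoded(string inputPath)
        {
            var rawPath = inputPath + ".raw";

            var sox = new ProcessStartInfo("sox", string.Format(
                "-r 48k -b 16 -L -c 1 \"{0}\" \"{1}\"", inputPath, rawPath))
            {
                UseShellExecute = false
            };
            using (var process = Process.Start(sox))
                process.WaitForExit();

            byte[] hash;
            using (var md5 = MD5.Create())
            using (var stream = File.OpenRead(rawPath))
                hash = md5.ComputeHash(stream);

            File.Delete(rawPath);
            return BitConverter.ToString(hash).Replace("-", "");
        }

        static void Main(string[] args)
        {
            // args[0] is the folder containing the audio files.
            foreach (var file in Directory.EnumerateFiles(args[0], "*.mp3"))
                Console.WriteLine(HashDecoded(file) + "  " + file);
        }
    }

Files that differ only in ID3 tags or embedded junk should then produce the same hash, since only the decoded audio is hashed.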
