Efficiently Compare Whether Two Files Are Identical in Node.js - node.js

I'm looking for an efficient way to check in Node.js whether two files are identical.
All I need to know is whether the files are equal or not; a simple true/false output is enough.
Building checksums of the files is a bit slow, whereas the Linux diff command is quite fast at comparing even large files. So I'm curious whether there is an equivalent of the efficient Linux diff command, or a module for this, in Node.js.
As suggested in the comments, we can try the stream-equal module for this.
I just tried comparing the same 1.3 GB files with both stream-equal and diff.
See the timings below:
9.352s - stream-equal
0.008s - diff
Looks like diff is insanely fast.
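Since the question asks for a diff equivalent in Node.js, one option (not mentioned in the original post) is simply to shell out to diff itself. A minimal sketch, assuming the diff binary is available and using only the built-in child_process module; the file names and the filesEqual helper are illustrative:

const { execFile } = require('child_process');

// Delegate the comparison to the system diff binary.
// -q / --brief: report only whether the files differ, don't print a diff.
function filesEqual(pathA, pathB) {
  return new Promise((resolve, reject) => {
    execFile('diff', ['-q', pathA, pathB], (error) => {
      if (!error) return resolve(true);             // exit code 0: identical
      if (error.code === 1) return resolve(false);  // exit code 1: files differ
      reject(error);                                // exit code 2 or spawn failure
    });
  });
}

filesEqual('a.bin', 'b.bin').then(console.log).catch(console.error);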
One way I'm thinking of speeding things up for large files of the same size is to read the first 10 bytes and the last 10 bytes of each file and compare those. If the first and last bytes are equal, there is a fairly good chance the files are identical.
But I'm not quite sure yet what the correct way to implement this is.
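A minimal sketch of how that size-plus-head/tail check could look, assuming Node's built-in fs.promises API; the probablyEqual name and the 10-byte sample size are illustrative. Note that this is only a heuristic: a mismatch proves the files differ, but a match does not guarantee they are identical.

const fs = require('fs');

// Heuristic check: equal sizes plus equal first/last sampleSize bytes.
async function probablyEqual(pathA, pathB, sampleSize = 10) {
  const [statA, statB] = await Promise.all([
    fs.promises.stat(pathA),
    fs.promises.stat(pathB),
  ]);
  if (statA.size !== statB.size) return false; // different sizes: definitely not equal

  // Read `length` bytes starting at `position` from a file.
  const readChunk = async (path, position, length) => {
    const handle = await fs.promises.open(path, 'r');
    try {
      const buffer = Buffer.alloc(length);
      await handle.read(buffer, 0, length, position);
      return buffer;
    } finally {
      await handle.close();
    }
  };

  const size = statA.size;
  const len = Math.min(sampleSize, size);
  const headA = await readChunk(pathA, 0, len);
  const headB = await readChunk(pathB, 0, len);
  if (!headA.equals(headB)) return false;

  const tailStart = Math.max(0, size - sampleSize);
  const tailA = await readChunk(pathA, tailStart, len);
  const tailB = await readChunk(pathB, tailStart, len);
  return tailA.equals(tailB); // "probably equal", not a guarantee
}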

Related

How to extract interval/range of rows from compressed file?

How do I return an interval of rows from a 100-million-row *.gz file?
Let's say I need 5 million rows, starting from row 15 million up to row 20 million.
Is this the best-performing option?
zcat myfile.gz|head -20000000|tail -500
real 0m43.106s
user 0m43.154s
sys 0m9.259s
That's a perfectly reasonable option; since you don't know how long a line will be, you basically have to decompress and iterate the lines to figure out where the line separators are. All three tools are fairly heavily optimized, so I/O and decompression time will likely dominate regardless.
In theory, rolling your own solution that combines all three tools in a single executable might save a little (by reducing the costs of IPC a bit), but the savings would likely be negligible.
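To illustrate that single-process idea (this sketch is not from the original answer), here is a rough Node.js version that decompresses once and prints only the requested line range, using the built-in zlib and readline modules; the range boundaries are illustrative:

const fs = require('fs');
const zlib = require('zlib');
const readline = require('readline');

// Decompress once, iterate lines, and print only the requested range.
async function printLineRange(gzPath, startLine, endLine) {
  const input = fs.createReadStream(gzPath).pipe(zlib.createGunzip());
  const rl = readline.createInterface({ input, crlfDelay: Infinity });

  let lineNo = 0;
  for await (const line of rl) {
    lineNo += 1;
    if (lineNo < startLine) continue;   // still before the requested range
    if (lineNo > endLine) break;        // past the range: stop reading
    process.stdout.write(line + '\n');
  }
  rl.close();
  input.destroy(); // release the underlying file and decompression streams
}

printLineRange('myfile.gz', 15000001, 20000000).catch(console.error);

As noted above, this is unlikely to beat zcat | head | tail by much, since decompression dominates the cost either way.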

How to convert fixed size dimension to unlimited in a netcdf file

I'm downloading daily 600MB netcdf-4 files that have this structure:
netcdf myfile {
dimensions:
    time_counter = 18 ;
    depth = 50 ;
    latitude = 361 ;
    longitude = 601 ;
variables:
    salinity
    temp, etc
I'm looking for a better way to convert the time_counter dimension from a fixed size (18) to an unlimited dimension.
I found a way of doing it with the NetCDF command-line utilities (ncdump and ncgen) and sed, like this:
ncdump myfile.nc | sed -e "s#^.time_counter = 18 ;#time_counter = UNLIMITED ; // (currently 18)#" | ncgen -o myfileunlimited.nc
which worked for me for small files, but when dumping a 600 MB netcdf file it takes too much memory and time.
Does somebody know another method for accomplishing this?
Your answers are very insightful. I'm not really looking for a way to improve this ncdump-sed-ncgen method; I know that dumping a 600 MB netcdf file uses almost 5 times more space as a text file (CDL representation), and modifying some header text and then generating the netcdf file again doesn't feel very efficient.
I read the latest NCO documentation and found an option specific to ncks, "--mk_rec_dmn". ncks mainly extracts and writes or appends data to a new netcdf file, so this seems the better approach: extract all the data of myfile.nc and write it with a new record dimension (unlimited dimension), which is what "--mk_rec_dmn" does, then replace the old file.
ncks --mk_rec_dmn time_counter myfile.nc -o myfileunlimited.nc ; mv myfileunlimited.nc myfile.nc
The opposite operation (record dimension back to fixed size) would be:
ncks --fix_rec_dmn time_counter myfile.nc -o myfilefixedsize.nc ; mv myfilefixedsize.nc myfile.nc
The shell pipeline can only be marginally improved by making the sed step only modify the beginning of the file and pass everything else through, but the expression you have is very cheap to process and will not make a dent in the time spent.
The core problem is likely that you're spending a lot of time in ncdump formatting the file information into textual data, and in ncgen parsing textual data into a NetCDF file format again.
Since the route through dump+gen is about as slow as shown, that leaves using NetCDF functionality to do the conversion of your data files.
If you're lucky, there may be tools that operate directly on your data files to do changes or conversions. If not, you may have to write them yourself with the NetCDF libraries.
If you're extremely unlucky, you'll have to drop a level lower: NetCDF-4 files are HDF5 files with some extra metadata. In particular, the length of the dimensions is stored in the _netcdf_dim_info dataset in group _netCDF (or so the documentation tells me).
It may be possible to modify the information there to turn the current length of the time_counter dimension into the value for UNLIMITED (which is the number 0), but if you do this, you really need to verify the integrity of the resulting file, as the documentation neatly puts it:
"Note that modifying these files with HDF5 will almost certainly make them unreadable to netCDF-4."
As a side note, if this process is important to your group, it may be worth looking into what hardware could do the task faster. On my Bulldozer system, the process of converting a 78 megabyte file takes 20 seconds, using around 500 MB memory for ncgen working set (1 GB virtual) and 12 MB memory for ncdump working set (111 MB virtual), each task taking up the better part of a core.
Any decent disk should read/sink your files in 10 seconds or so, memory doesn't matter as long as you don't swap, so CPU is probably your primary concern if you take the dump+gen route.
If concurrent memory use is a big concern, you can trade memory for disk space by saving the intermediate result from sed onto disk, which will likely take up to 1.5 gigabytes or so.
You can use the xarray Python package's to_netcdf() method, and then optimise memory usage by using Dask.
You just need to pass the names of the dimensions to make unlimited to the unlimited_dims argument and use chunks to split the data. For instance:
import xarray as xr
# Open lazily with Dask chunks so the whole file is not loaded into memory at once
ds = xr.open_dataset('myfile.nc', chunks={'time_counter': 18})
# Write out with time_counter as an unlimited (record) dimension
ds.to_netcdf('myfileunlimited.nc', unlimited_dims={'time_counter': True})
There is a nice summary of combining Dask and xarray linked here.

Comparing a big file on two servers

I have two servers and I want to move a 50 GB backup tar.bz file from one to the other.
I used axel to download the file from the source server, but now when I try to extract it, I get an "unexpected EOF" error. The sizes are the same, so it seems there is a problem with the content.
I want to know if there is a program/app/script that can compare these two files and correct only the damaged parts, or do I need to split the file by hand and compare each part's hash?
The problem is that the source server has limited bandwidth and a low transfer speed, so I can't transfer the whole file again from scratch.
You can use a checksum utility, such as md5 or sha, to see whether the files are the same on both ends, e.g.:
$ md5 somefile
MD5 (somefile) = d41d8cd98f00b204e9800998ecf8427e
By running such a command on both ends and comparing the results, you can get some certainty as to whether the files are the same.
As for only downloading the erroneous portion of a file, this would require checksums on both sides for "pieces" of the data, such as with the bittorrent protocol.
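As a rough illustration of that per-piece checksum idea (this sketch is not part of the original answer), here is a Node.js version using the built-in crypto and fs modules; the chunk size and the chunkHashes name are arbitrary:

const fs = require('fs');
const crypto = require('crypto');

// Compute one MD5 per fixed-size chunk so the two servers can compare the
// resulting lists and re-transfer only the chunks that differ.
async function chunkHashes(path, chunkSize = 64 * 1024 * 1024) {
  const handle = await fs.promises.open(path, 'r');
  const hashes = [];
  try {
    const buffer = Buffer.alloc(chunkSize);
    let position = 0;
    while (true) {
      const { bytesRead } = await handle.read(buffer, 0, chunkSize, position);
      if (bytesRead === 0) break; // end of file
      hashes.push(crypto.createHash('md5').update(buffer.subarray(0, bytesRead)).digest('hex'));
      position += bytesRead;
    }
  } finally {
    await handle.close();
  }
  return hashes;
}

Running this on both servers and comparing the two lists index by index identifies which chunks need to be re-transferred.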
OK, I found "rdiff" to be the best way to solve this problem. Just do the following.
On the destination server:
rdiff signature destFile.tar.bz destFile.sig
Then transfer destFile.sig to the source server and run rdiff there:
rdiff delta destFile.sig srcFile.tar.bz delta.rdiff
Then transfer delta.rdiff back to the destination server and run rdiff once more on the destination server:
rdiff patch destFile.tar.bz delta.rdiff fixedFile.tar.bz
This process really doesn't need a separate program; you can do it with a couple of simple commands. Split the file into pieces and compute an md5sum for each piece on both servers; if any of the md5sums don't match, copy over the mismatched piece(s) and concatenate the pieces back together. To make comparing the md5sums easier, just run diff between the two lists of sums (or md5sum the lists themselves to see whether there is any difference at all, without having to copy a list across).
split -b 1000000000 -d bigfile bigfile.
for i in bigfile.*
do
    md5sum "$i"
done

Fast binary diff for 10 MB files

I have two 10 MB files, and I'd like to find the longest common subsequence with offsets, e.g. the result should look like:
42 bytes at offset 5 of the first file and offset 8 of the second file
85 bytes at offset 100 of the first file and offset 55 of the second file
...
This is a one-off task; I have to run it only on a single pair of files.
I don't care about the programming language, but it must run on Linux.
I have tried the command-line tools bsdiff and xdelta, but their output diff file format is too complicated to understand and lacks any documentation, so I would have to read complicated and undocumented C source code to get those results. That would take several hours, and I don't have that much time for this, so I'm giving up on that path.
I have tried the Perl module String::LCSS_XS, but it's too slow (it has been running for an hour now); Algorithm::Diff::XS, but it needs too much memory; and Algorithm::LCSS, but it's too slow (implemented in pure Perl). I couldn't find anything useful in Python (the built-in difflib is too slow).
Is there a tool that runs quickly (i.e. in less than a few hours) on 10 MB files, and whose output I can convert to the format I want in less than an hour of work?

How can I compare two compressed (.tar, .gz, .Z) files in Unix

I have two gz files. I want to compare those files without extracting them. For example:
the first file is number.txt.gz, whose contents are:
1111,589,3698,
2222,598,4589,
3333,478,2695,
4444,258,3694,
the second file is xxx.txt.gz:
1111,589,3698,
2222,598,4589,
I want to be able to compare any column between those files. If column 1 of the first file is equal to column 1 of the second file, I want output like this:
1111,589,3698,
2222,598,4589,
You can't do this.
You can compare the entire contents of an archive by comparing the archives themselves, but not a part of the data inside the compressed files.
You can also compare selected files in an archive without unpacking it, because the archive's metadata includes a CRC32 checksum; comparing those checksums tells you whether the contents match without unpacking them.
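As an aside that goes beyond the original answer: for a plain single-member .gz file, the CRC-32 of the uncompressed data and its size modulo 2^32 are stored in the last 8 bytes of the file (little-endian, per RFC 1952), so a whole-file equality check can read just the trailer. A minimal Node.js sketch; the gzipTrailer name is illustrative:

const fs = require('fs');

// Read the 8-byte gzip trailer: CRC-32 of the uncompressed data followed by
// its size modulo 2^32, both little-endian (RFC 1952). Single-member .gz only.
async function gzipTrailer(path) {
  const handle = await fs.promises.open(path, 'r');
  try {
    const { size } = await handle.stat();
    const buffer = Buffer.alloc(8);
    await handle.read(buffer, 0, 8, size - 8);
    return {
      crc32: buffer.readUInt32LE(0),
      uncompressedSize: buffer.readUInt32LE(4),
    };
  } finally {
    await handle.close();
  }
}

Equal trailers strongly suggest the uncompressed contents are identical, but this still tells you nothing about individual columns.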
If you need to check and compare your data after it's written to those huge files, and you have time and space constraints preventing you from doing this, then you're using the wrong storage format. If your data storage format doesn't support your process then that's what you need to change.
My suggestion would be to throw your data into a database rather than writing it to compressed files. With sensible keys, comparison of subsets of that data can be accomplished with a simple query, and deleting no longer needed data becomes similarly simple.
Transactionality and strict SQL compliance are probably not priorities here, so I'd go with MySQL (with the MyISAM storage engine) as a simple, fast DB.
EDIT: Alternatively, Blorgbeard's suggestion is perfectly reasonable and feasible. In any programming language that has access to (de)compression libraries, you can read your way sequentially through the compressed file without writing the expanded text to disk; and if you do this side-by-side for two input files, you can implement your comparison with no space problem at all.
As for the time problem, you will find that reading and uncompressing the file (but not writing it to disk) is much faster than writing to disk. I recently wrote a similar program that takes a .ZIPped file as input and creates a .ZIPped file as output without ever writing uncompressed data to file; and it runs much more quickly than an earlier version that unpacked, processed and re-packed the data.
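To make that concrete (this sketch is not from the original answer), here is a rough Node.js version of the side-by-side idea using the built-in zlib and readline modules: it loads the first column of the smaller file into a Set, then streams the larger file and prints matching rows, without ever writing uncompressed data to disk. The file names come from the question; the comma delimiter and column index are illustrative.

const fs = require('fs');
const zlib = require('zlib');
const readline = require('readline');

// Stream the lines of a .gz file without writing the expanded text to disk.
function gzLines(path) {
  const input = fs.createReadStream(path).pipe(zlib.createGunzip());
  return readline.createInterface({ input, crlfDelay: Infinity });
}

async function matchOnFirstColumn(bigFile, smallFile) {
  // Collect column 1 of the smaller file...
  const keys = new Set();
  for await (const line of gzLines(smallFile)) {
    keys.add(line.split(',')[0]);
  }
  // ...then stream the bigger file and print rows whose column 1 matches.
  for await (const line of gzLines(bigFile)) {
    if (keys.has(line.split(',')[0])) console.log(line);
  }
}

matchOnFirstColumn('number.txt.gz', 'xxx.txt.gz').catch(console.error);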
You cannot compare the files while they remain compressed if they were compressed using different techniques.
You must first decompress the files, and then find the difference between the results.
Decompression can be done with gunzip, tar, and uncompress (or zcat).
Finding the difference can be done with the diff command.
I'm not 100% sure whether you mean matching columns/fields or entire rows, but for entire rows, something along these lines should work:
comm -12 <(zcat number.txt.gz) <(zcat xxx.txt.gz)
or if the shell doesn't support that, perhaps:
zcat number.txt.gz | { zcat xxx.txt.gz | comm -12 /dev/fd/3 - ; } 3<&0
The exact answer I want is this:
nawk -F"," 'NR==FNR {a[$1];next} ($3 in a)' <(gzcat file1.txt.gz) <(gzcat file2.txt.gz)
Instead of awk, nawk works perfectly, and since these are gzip files, use gzcat.
