Finding non-similar lines on Solaris (or Linux) in two files

I'm trying to compare 2 files on a Solaris box and only see the lines that are not similar. I know that I can use the command given below to find lines that are not exact matches, but that isn't good enough for what I'm trying to do.
comm -3 <(sort FILE1.txt | uniq) <(sort FILE2.txt | uniq) > diff.txt
For the purposes of this question I would define similar as having the same characters ~80% of the time, while completely ignoring the locations that differ (since the sections that differ may also differ in length). The locations that differ can be assumed to occur at roughly the same point in the line. In other words, once we find a location that differs, we have to figure out where to start comparing again.
I know this is a hard problem to solve and will appreciate any help/ideas.
EDIT:
Example input 1:
Abend for SP EAOJH with account s03284fjw and client pewaj39023eipofja,level.error
Exception Invalid account type requested: 134029830198,level.fatal
Only in file 1
Example input 2:
Exception Invalid account type requested: 1307230,level.fatal
Abend for SP EREOIWS with account 32192038409aoewj and client eowaji30948209,level.error
Example output:
Only in file 1
I am also realizing that it would be ideal if the files were not read into memory all at once, since they can be nearly 100 GB. Perhaps Perl would be better than bash because of this need.
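One pragmatic approach, given the example data above, is to sidestep a true 80% character-similarity measure and instead normalize away the variable tokens (account numbers, client IDs and the like) before comparing. The sketch below is an assumption about where the lines differ, not a general similarity test; it streams FILE1.txt line by line but still keeps the normalized lines of FILE2.txt in an awk hash, so it may need more work to cope with files approaching 100 GB:
nawk '
  # replace every token that contains a digit with a placeholder (assumed variable part)
  function norm(s) { gsub(/[A-Za-z0-9]*[0-9][A-Za-z0-9]*/, "#", s); return s }
  NR == FNR { seen[norm($0)]; next }   # first file: remember normalized lines of FILE2
  !(norm($0) in seen)                  # second file: print FILE1 lines with no counterpart
' FILE2.txt FILE1.txt > diff.txt
With only the digit-token rule, the two Abend lines would still differ on the SP name (EAOJH vs EREOIWS), so the normalization has to be tuned to whatever actually varies in the real log lines.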

Related

What is the structure of the binary files produced by nfcapd (one of the nfdump tools)?

I want to split files produced by nfcapd (a netflow producing daemon) into multiple files, because the file initially produced by nfcapd might be too big.
My problem is that I have no idea what the structure of the produced files is. I suppose there is a header and then a list of netflow records, but I can't figure out at which byte the header ends, at which bytes a netflow record begins and ends, or whether there is a footer.
I tried to understand it by reading the C source code on GitHub, but as I am not very strong in C, it is quite hard for me to comprehend.
At first I thought nfdump could solve my problem by reading a number of netflows at a time from the initial file, but there is no built-in way to do this: you can use nfdump to read the first N netflows, but you can't then read from N to 2N; you can only read from 1 to N.
If anyone knows a way to split those binary files into multiple files that can be used by nfdump, I would really like to know it.
You can set the rotation interval to less than 5 minutes (which is the default) using the -t parameter. That way smaller files are created from the start.
For example:
nfcapd -w 1 -l -p -t 60
Please note that -w should be set accordingly: if -t is 60 (seconds), -w should be 1 (minute).
There is more here: https://manpages.debian.org/testing/nfdump/nfcapd.1.en.html
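If the large files already exist, one option worth trying is to carve a capture into pieces by time range with nfdump's -t time-window filter and write each piece back out with -w. This is a sketch with placeholder file names and timestamps, not something taken from the question:
nfdump -r nfcapd.big -t 2018/01/01.12:00:00-2018/01/01.12:05:00 -w part1
nfdump -r nfcapd.big -t 2018/01/01.12:05:00-2018/01/01.12:10:00 -w part2
Each output file should then be readable by nfdump again with -r.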

Documentation for uinput

I am trying very hard to find the documentation for uinput, but the only thing I have found is linux/uinput.h. I have also found some tutorials on the internet, but no documentation at all!
For example I would like to know what UI_SET_MSCBIT does but I can't find anything about it.
How do people know how to use uinput?
Well, it takes some investigation effort for such subtle things. From the drivers/input/misc/uinput.c and include/uapi/linux/uinput.h files you can see the bits behind the UI_SET_* definitions, like this:
MSC
REL
LED
etc.
Run the next command in the kernel sources directory:
$ git grep --all-match -e 'MSC' -e 'REL' -e 'LED' -- Documentation/*
or use regular grep if your kernel tree doesn't have a .git directory:
$ grep -rl MSC Documentation/* | xargs grep -l REL | xargs grep -l LED
You'll get this file: Documentation/input/event-codes.txt, from which you can see:
EV_MSC: Used to describe miscellaneous input data that do not fit into other types.
EV_MSC events are used for input and output events that do not fall under other categories.
A few EV_MSC codes have special meaning:
MSC_TIMESTAMP: Used to report the number of microseconds since the last reset. This event should be coded as an uint32 value, which is allowed to wrap around with no special consequence. It is assumed that the time difference between two consecutive events is reliable on a reasonable time scale (hours). A reset to zero can happen, in which case the time since the last event is unknown. If the device does not provide this information, the driver must not provide it to user space.
I'm afraid this is the best you can find out there for UI_SET_MSCBIT.
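The same grep approach works for the ioctl itself. A small sketch, run from a kernel source tree (in older trees the second file is include/uapi/linux/input.h), to locate the UI_SET_MSCBIT definition and the MSC_* codes it enables:
$ grep -n 'UI_SET_MSCBIT' include/uapi/linux/uinput.h
$ grep -n 'MSC_' include/uapi/linux/input-event-codes.h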

Strange results using Linux find

I am trying to set up a backup shell script that shall run once per week on my server and keep the weekly backups for ten weeks and it all works well, except for one thing...
I have a folder that contains many rather large files, so the ten weekly backups of that folder take up quite a lot of disk space, and many of the larger files in that folder rarely change. So I thought I would split the backup of that folder in two: one part for the smaller files, which is included in the 'normal' weekly backup (and kept for ten weeks), and one file for the larger files, which is just updated every week without keeping the older weekly versions.
I have used the following command for the larger files:
/usr/bin/find /other/projects -size +100M -print0 | /usr/bin/xargs -0 /bin/tar -rvPf /backup/PRJ-files_LARGE.tar
That works as expected. The tar -v option is there for debugging. However, when archiving the smaller files, I use a similar command:
/usr/bin/find /other/projects -size -100M -print0 | /usr/bin/xargs -0 /bin/tar -rvPf /backup/PRJ-files_$FILE_END.tar
Where $FILE_END is the weekly number. The line above does not work. I had the script run the other day and it took hours and produced a file that was 70 GB, though the expected output size is about 14 GB (there are a lot of files). It seems there is some duplication of files in the resulting archive, though I have not been able to check that fully. Yesterday I ran the command above for the smaller files from the command line, and I could see that files I know to be larger than 100 MB were included.
However, just now I ran find /other/projects -size -100M from the command line and that produced the expected list of files.
So, if anyone has any ideas about what I am doing wrong, I would really appreciate tips or pointers. The file names include spaces and all sorts of characters, e.g. single quotes, if that has anything to do with it.
The only thing I can think of is that I am not using xargs properly, and admittedly I am not very familiar with it, but I still think that the problem lies in my use of find, since it is find that gives the input to xargs.
First of all, I do not know whether it is considered bad form to answer your own question, but I am doing it anyway, since I realised my error and I want to close this and hopefully help someone who runs into the same problem.
Now, once I realised what I did wrong, I am frankly a bit embarrassed that I did not see it earlier, but here it is:
I did some experimental runs from the command line, and after a while I realised that the output not only listed all the files, it also listed the directories themselves. Directories are of course files too, and they are smaller than 100M, so they have (most likely) been included, and when a directory is included, all files in it are included as well, regardless of their sizes. This would also explain why the output file was five times larger than expected.
So, in order to overcome this I added -type f, which includes only regular files, to the find command and lo and behold, it worked!
To recap, the adjusted command I use for the smaller files is now:
/usr/bin/find /other/projects -size -100M -type f -print0 | /usr/bin/xargs -0 /bin/tar -rvPf /backup/PRJ-files_$FILE_END.tar
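A side note on the xargs pipeline: with a very long file list, xargs invokes tar several times, which only works here because -r appends to the archive each time. If GNU tar is available (an assumption; the stock Solaris /bin/tar may not support these options), the NUL-separated list can be fed to tar directly and xargs dropped altogether:
/usr/bin/find /other/projects -size -100M -type f -print0 | tar --null -T - -rvPf /backup/PRJ-files_$FILE_END.tar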

Performance of sort command in unix

I am writing a custom apache log parser for my company and I noticed a performance issue that I can't explain. I have a text file log.txt with size 1.2GB.
The command sort log.txt is up to 3 seconds slower than the command cat log.txt | sort.
Does anybody know why this is happening?
cat file | sort is a Useless Use of Cat.
The purpose of cat is to concatenate (or "catenate") files. If it's only one file, concatenating it with nothing at all is a waste of time, and costs you a process.
It shouldn't take longer. Are you sure your timings are right?
Please post the output of:
time sort file
and
time cat file | sort
You need to run the commands a few times and get the average.
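A small loop makes it easier to average the timings and keeps terminal output out of the measurement (a sketch, assuming a bash-like shell where time covers the whole pipeline; > /dev/null discards the sorted output). Note that the first run may be slower simply because the file is not yet in the OS page cache:
for i in 1 2 3; do time sort log.txt > /dev/null; done
for i in 1 2 3; do time cat log.txt | sort > /dev/null; done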
Instead of worrying about the performance of sort, you should change your logging:
Eliminate unnecessarily verbose output to your log.
Periodically roll the log (based on either date or size).
...fix the errors that are being written to the log. ;)
Also, are you sure cat is reading the entire file? It may have a read buffer etc.

How can I compare two compressed-format (.tar, .gz, .Z) files in Unix

I have two gz files. I want to compare those files without extracting them. For example:
The first file is number.txt.gz; it contains:
1111,589,3698,
2222,598,4589,
3333,478,2695,
4444,258,3694,
The second file, xxx.txt.gz, contains:
1111,589,3698,
2222,598,4589,
I want to compare any column between those files. If column 1 in the first file is equal to column 1 of the second file, I want output like this:
1111,589,3698,
2222,598,4589,
You can't do this directly.
You can compare the entire contents by comparing the archives themselves, but not part of the data inside the compressed files.
You can also check whether selected files in an archive are identical without unpacking, because the archive metadata carries a CRC32 checksum, and comparing those checksums tells you this without unpacking the data.
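For gzip files specifically, the stored CRC can be listed without decompressing anything to disk; note that the CRC covers the whole uncompressed file, so it only tells you whether the files are identical, not which rows or columns differ:
gzip -l -v number.txt.gz xxx.txt.gz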
If you need to check and compare your data after it's written to those huge files, and you have time and space constraints preventing you from doing this, then you're using the wrong storage format. If your data storage format doesn't support your process then that's what you need to change.
My suggestion would be to throw your data into a database rather than writing it to compressed files. With sensible keys, comparison of subsets of that data can be accomplished with a simple query, and deleting no longer needed data becomes similarly simple.
Transactionality and strict SQL compliance are probably not priorities here, so I'd go with MySQL (with the MyISAM driver) as a simple, fast DB.
EDIT: Alternatively, Blorgbeard's suggestion is perfectly reasonable and feasible. In any programming language that has access to (de)compression libraries, you can read your way sequentially through the compressed file without writing the expanded text to disk; and if you do this side-by-side for two input files, you can implement your comparison with no space problem at all.
As for the time problem, you will find that reading and uncompressing the file (but not writing it to disk) is much faster than writing to disk. I recently wrote a similar program that takes a .ZIPped file as input and creates a .ZIPped file as output without ever writing uncompressed data to file; and it runs much more quickly than an earlier version that unpacked, processed and re-packed the data.
You cannot compare the files while they remain compressed, especially when they have been compressed using different techniques.
You must first decompress the files, and then find the difference between the results.
Decompression can be done with gunzip, tar, and uncompress (or zcat).
Finding the difference can be done with the diff command.
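Putting those two steps together without writing the decompressed data to disk, assuming a shell that supports process substitution:
diff <(zcat number.txt.gz) <(zcat xxx.txt.gz)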
I'm not 100% sure whether this is meant to match columns/fields or entire rows, but in the case of rows, something along these lines should work:
comm -12 <(zcat number.txt.gz) <(zcat xxx.txt.gz)
or if the shell doesn't support that, perhaps:
zcat number.txt.gz | { zcat xxx.txt.gz | comm -12 /dev/fd/3 - ; } 3<&0
The exact answer I want is this:
nawk -F"," 'NR==FNR {a[$1];next} ($3 in a)' <(gzcat file1.txt.gz) <(gzcat file2.txt.gz)
Instead of awk, nawk works perfectly, and since it's a gzip file, use gzcat.
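For the column-1-to-column-1 match described in the question itself, the same pattern works with $1 on both sides (a sketch, again assuming the shell supports process substitution and gzcat is available):
nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' <(gzcat xxx.txt.gz) <(gzcat number.txt.gz)
This prints the rows of number.txt.gz whose first column also appears as the first column of a row in xxx.txt.gz, which matches the example output above.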
