bash script: how to diff two files after stripping prefix from each line? - linux

I have two log files. Each line is formatted as follows:
<timestamp><rest of line>
with this timestamp format:
2015-10-06 04:35:55.909 REST OF LINE
I need to diff the two files modulo the timestamps, i.e. I need to compare the lines of the two files without their timestamps. Which Linux tools should I use?
I am on a RedHat 6 machine running bash, if it makes a difference.

You don't need to create temp files: use bash process substitution
diff <(cut -d" " -f3- log1) <(cut -d" " -f3- log2)
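If you do this comparison a lot, you can wrap it in a tiny shell function; a minimal sketch, where difflogs is a made-up name and the timestamp is assumed to occupy the first two space-separated fields:
# compare two logs, ignoring the leading date and time fields of each line
difflogs() { diff <(cut -d" " -f3- "$1") <(cut -d" " -f3- "$2"); }
difflogs log1 log2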

I would first generate the two files to compare, with the timestamp prefix removed, using the cut command like this:
cut -f 3- -d " " file_to_compare > cut_file
And then use the diff command.
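For example, a minimal end-to-end sketch, assuming the two logs are named log1 and log2 (placeholder names):
# strip the first two space-separated fields (date and time) from each log
cut -f 3- -d " " log1 > cut_file1
cut -f 3- -d " " log2 > cut_file2
# then compare the stripped copies
diff cut_file1 cut_file2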

You can use 'cut'. The timestamp is 23 characters long and is followed by a space, so the rest of the line starts at byte 25:
cut -b25- file1 > file1cut
cut -b25- file2 > file2cut
diff file1cut file2cut

To print all fields but the first two, the awk utility (and programming language) can be used:
awk '{$1=$2=""; print $0}' file1 > newfile1
awk '{$1=$2=""; print $0}' file2 > newfile2
diff newfile1 newfile2
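The temporary files can be avoided by combining this with the process substitution shown above; a sketch with the same placeholder file names:
diff <(awk '{$1=$2=""; print}' file1) <(awk '{$1=$2=""; print}' file2)
Blanking $1 and $2 leaves two leading spaces on every line, but since both files get the same treatment, diff is unaffected.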

Well, as you're looking for a tool, why not just use Kompare? It is very powerful and well known, and widely used by developers on Linux.
https://www.kde.org/applications/development/kompare/
https://docs.kde.org/trunk5/en/kdesdk/kompare/using.html
Kompare is a GUI front-end program that enables differences between source files to be viewed and merged. It can be used to compare differences in files or the contents of folders, and it supports a variety of diff formats and provides many options to customize the information level displayed.
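Assuming Kompare is installed, it can be launched on the two files straight from the shell (the file names are placeholders):
kompare log1 log2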

Related

Using GREP to find a list of genes (around 200) in a whole exome tab delimited text file

I would like to extract all rows containing genes of interest from a very large exome file (txt tab-delimited).
It is not practical to grep for them individually, so I thought I would put them in a text file as a list and use the following command.
grep -E Gene_list.txt Sample1_GREP.txt > Output.txt
This is taking ages to run, and the other alternatives I tried came nowhere near a solution.
200 patterns is not large for grep by any means. Note that your command is missing -f: as written, grep -E Gene_list.txt treats the file name Gene_list.txt itself as the pattern and searches for that string in Sample1_GREP.txt. Use -f to read the patterns from a file, and -F to match them as fixed strings rather than regular expressions, which is much faster. Also try GNU grep (sometimes installed as ggrep), which is faster than BSD grep, and use tr to translate the tab delimiters in Gene_list.txt to newlines, since grep expects one pattern per line:
tr '\t' '\n' < Gene_list.txt | ggrep -F -f - Sample1_GREP.txt > Output.txt
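If some gene names can occur as substrings of longer names (say, a hypothetical ABC1 inside ABC12), adding -w restricts matches to whole words; a sketch under the same assumptions:
tr '\t' '\n' < Gene_list.txt | ggrep -Fw -f - Sample1_GREP.txt > Output.txt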

How to use sed to extract a field from a delimited file

I am using CentOS 7 Linux.
I have a text file with a lot of lines in the same format, which is email,password.
example:
test#test.com,test
I would like to use sed to save only test#test.com and remove ,test, which means it will remove everything from the ',' onward on every line.
@Setop's answer is good - in general, using cut or awk is the usual practice when dealing with delimited files.
We can use sed as well, as per your question:
sed -i 's/,.*//' file # changes the file in-place
or, using two steps:
sed 's/,.*//' file > file.modified && mv file.modified file
s/,.*// replaces , and all characters after it with nothing
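If you would rather keep a backup when editing in place, GNU sed accepts a suffix after -i; a small sketch:
sed -i.bak 's/,.*//' file # the original is saved as file.bak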
This can get trickier if you have multiple fields and want a small subset of them.
cut -d, -f1 yourfile
or
awk -F, '{print $1}' yourfile
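Both tools also handle picking several fields at once, in case the format grows more columns; a sketch, with the column layout assumed:
cut -d, -f1,3 yourfile
awk -F, '{print $1, $3}' yourfile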

Cut and Awk command in linux

How can I extract a word between two words in a file using the cut and awk commands?
Let's say I have a file with the content below.
This is my file and it has lots of content along wiht password and want to extract PASSWORD=MYPASSWORDISHERE==and file is ending here.
Expected output:
1) using the awk command on Linux.
2) using the cut command on Linux.
MYPASSWORDISHERE==
Using awk (actually gawk, since match() with a third array argument is a GNU extension):
awk '{match($0,/PASSWORD=(.*==)/,a); print a[1];}' input.txt
Using cut you can try the following, though I'm not sure whether it works with your exact file:
cut -d"=" -s -f2,3 --output-delimiter="==" input.txt

Comparing files off first x number of characters

I have two text files that both have data that look like this:
Mon-000101,100.27242,9.608597,11.082,10.034,0.39,I,0.39,I,31.1,31.1,,double with 1355,,,,,,,,
Mon-000171,100.2923,9.52286,14.834,14.385,0.45,I,0.45,I,33.7,33.7,,,,,,,,,,
Mon-000174,100.27621,9.563802,11.605,10.134,0.95,I,1.29,I,30.8,30.8,,,,,,,,,,
I want to compare the two files based on the Mon-000101 field (as an example of one ID) to see where they differ. I tried some diff commands that I found in another question, but they didn't work. I'm out of ideas, so I'm turning to anybody with more experience than myself.
Thanks.
HazMatt:Desktop m$ diff NGC2264_classI_h7_notes.csv /Users/m/Downloads/allfitclassesI.txt
1c1
Mon-000399,100.25794,9.877631,12.732,12.579,0.94,I,-1.13,I,9.8,9.8,,"10,000dn vs 600dn brighter source at 6 to 12"" Mon-000402,100.27347,9.59Mon-146053,100.23425,9.571719,12.765,11.39,1.11,I,1.04,I,16.8,16.8,,"double 3"" confused with 411, appears brighter",,,,,,,,
\ No newline at end of file
---
Mon-146599 Mon-146599 4.54 I 4.54 III
\ No newline at end of file
This was my attempt and the output. The thing is, I know the files differ by eleven lines, corresponding to eleven mismatched values. I don't want to do this by hand (who would?). Maybe I'm misreading the diff output, but I'd expect more than this.
Have you tried:
diff <(grep Mon-00010 file_1) <(grep Mon-00010 file_2)
Process substitution feeds the filtered contents to diff; using backticks instead would substitute grep's output as if it were file names, which is not what you want.
First sort both the files and then try using diff
sort file1 > file1_sorted
sort file2 > file2_sorted
diff file1_sorted file2_sorted
Sorting arranges both files by the first ID field, so that you don't get spurious mismatches caused merely by line order.
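Once both files are sorted, comm is another option; it prints lines unique to each file in separate columns (a sketch reusing the sorted copies from above):
comm -3 file1_sorted file2_sorted # -3 suppresses lines common to both files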
I am not sure what you are searching for, but I'll try to help; otherwise, you could give some examples of input files and the desired output.
My input-files are:
prompt> cat in1.txt
Mon-000101,100.27242,9.608597,11.082,10.034,0.39,I,0.39,I,31.1,31.1,,double with 1355,,,,,,,,
Mon-000171,100.2923,9.52286,14.834,14.385,0.45,I,0.45,I,33.7,33.7,,,,,,,,,,
Mon-000174,100.27621,9.563802,11.605,10.134,0.95,I,1.29,I,30.8,30.8,,,,,,,,,
and
prompt> cat in2.txt
Mon-000101,111.27242,9.608597,11.082,10.034,0.39,I,0.39,I,31.1,31.1,,double with 1355,,,,,,,,
Mon-000172,100.2923,9.52286,14.834,14.385,0.45,I,0.45,I,33.7,33.7,,,,,,,,,,
Mon-000174,122.27621,9.563802,11.605,10.134,0.95,I,1.29,I,30.8,30.8,,,,,,,,,,
If you are just interested in the "ID" (whatever that means exactly), you have to separate it first. I assume the ID is the tag before the first comma, so it is possible to cut away everything except the ID and compare:
prompt> diff <(cut -d',' -f1 in1.txt) <(cut -d',' -f1 in2.txt)
2c2
< Mon-000171
---
> Mon-000172
If the ID is more complicated, you can extract it with grep and a regular expression.
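For example, grep -o prints only the matching part of each line; a sketch, assuming the IDs always look like Mon- followed by digits:
grep -o '^Mon-[0-9]*' in1.txt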
Additionally, diff -y gives you a little graphical output showing which lines differ. You can use it to compare the complete files, or combine it with the cutting explained before:
prompt> diff -y <(cut -d',' -f1 in1.txt) <(cut -d',' -f1 in2.txt)
Mon-000101 Mon-000101
Mon-000171 | Mon-000172
Mon-000174 Mon-000174

How do I randomly merge two input files to one output file using unix tools?

I have two text files, of different sizes, which I would like to merge into one file, but with the content mixed randomly; this is to create some realistic data for some unit tests. One text file contains the true cases, while the other the false.
I would like to use standard Unix tools to create the merged output. How can I do this?
Random sort using -R:
$ sort -R file1 file2 -o file3
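GNU coreutils also ships shuf, which does a true shuffle; note that sort -R sorts by a hash of the key, so duplicate lines end up next to each other. A sketch with placeholder names:
cat file1 file2 | shuf > file3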
My version of sort does not support -R either. Here is an alternative using awk: insert a random number in front of each line, sort by those numbers, then strip the numbers off again.
awk '{print int(rand()*1000), $0}' file1 file2 | sort -n | awk '{$1="";print $0}'
Note that rand() is not seeded here, so the "random" order is the same on every run; seed it with srand() as in the next answer if that matters.
This adds a random number to the beginning of each line with awk, sorts based on that number, and then removes it. It even works if you have duplicate lines (as pointed out by choroba) and is slightly more cross-platform.
awk 'BEGIN { srand() } { print rand(), $0 }' file1 file2 |
sort -n |
cut -f2- -d" "
