Comparing files by the first N characters of each line - linux

I have two text files that both have data that look like this:
Mon-000101,100.27242,9.608597,11.082,10.034,0.39,I,0.39,I,31.1,31.1,,double with 1355,,,,,,,,
Mon-000171,100.2923,9.52286,14.834,14.385,0.45,I,0.45,I,33.7,33.7,,,,,,,,,,
Mon-000174,100.27621,9.563802,11.605,10.134,0.95,I,1.29,I,30.8,30.8,,,,,,,,,,
I want to compare the two files based on the ID at the start of each line (Mon-000101 is an example of one ID) to see where they differ. I tried some diff commands that I found in another question, but they didn't work. I'm out of ideas, so I'm turning to anybody with more experience than myself.
Thanks.
HazMatt:Desktop m$ diff NGC2264_classI_h7_notes.csv /Users/m/Downloads/allfitclassesI.txt
1c1
Mon-000399,100.25794,9.877631,12.732,12.579,0.94,I,-1.13,I,9.8,9.8,,"10,000dn vs 600dn brighter source at 6 to 12"" Mon-000402,100.27347,9.59Mon-146053,100.23425,9.571719,12.765,11.39,1.11,I,1.04,I,16.8,16.8,,"double 3"" confused with 411, appears brighter",,,,,,,,
\ No newline at end of file
---
Mon-146599 Mon-146599 4.54 I 4.54 III
\ No newline at end of file
This was my attempt and its output. The thing is, I know the files differ by eleven lines, corresponding to eleven mismatched values. I don't want to do this by hand (who would?). Maybe I'm misreading the diff output, but I'd expect more than this.

Have you tried:
diff <(grep Mon-00010 file_1) <(grep Mon-00010 file_2)

First sort both files, then use diff:
sort file1 > file1_sorted
sort file2 > file2_sorted
diff file1_sorted file2_sorted
Sorting arranges both files by the leading ID field, so that diff doesn't report unwanted mismatches caused only by line ordering.
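If you want to be explicit that the ordering key is the leading ID field, a variation on the above (not from the original answer) is to sort on the first comma-separated field:
sort -t, -k1,1 file1 > file1_sorted
sort -t, -k1,1 file2 > file2_sorted
diff file1_sorted file2_sorted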

I am not sure what you are searching for, but I'll try to help. Otherwise, you could give some examples of input files and the desired output.
My input-files are:
prompt> cat in1.txt
Mon-000101,100.27242,9.608597,11.082,10.034,0.39,I,0.39,I,31.1,31.1,,double with 1355,,,,,,,,
Mon-000171,100.2923,9.52286,14.834,14.385,0.45,I,0.45,I,33.7,33.7,,,,,,,,,,
Mon-000174,100.27621,9.563802,11.605,10.134,0.95,I,1.29,I,30.8,30.8,,,,,,,,,
and
prompt> cat in2.txt
Mon-000101,111.27242,9.608597,11.082,10.034,0.39,I,0.39,I,31.1,31.1,,double with 1355,,,,,,,,
Mon-000172,100.2923,9.52286,14.834,14.385,0.45,I,0.45,I,33.7,33.7,,,,,,,,,,
Mon-000174,122.27621,9.563802,11.605,10.134,0.95,I,1.29,I,30.8,30.8,,,,,,,,,,
If you are just interested in the "ID" (whatever that means) you have to separate it out. I assume the ID is the tag before the first comma, so it is possible to cut away everything except the ID and compare:
prompt> diff <(cut -d',' -f1 in1.txt) <(cut -d',' -f1 in2.txt)
2c2
< Mon-000171
---
> Mon-000172
If the ID is more complicated, you can grep for it using regular expressions.
Additionally, diff -y gives you a side-by-side view of which lines differ. You can use it to compare the complete files or combine it with the cut shown before:
prompt> diff -y <(cut -d',' -f1 in1.txt) <(cut -d',' -f1 in2.txt)
Mon-000101 Mon-000101
Mon-000171 | Mon-000172
Mon-000174 Mon-000174
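As an additional sketch (not part of the original answer), comm can list the IDs that appear in only one of the two files, again assuming the ID is the first comma-separated field:
comm -3 <(cut -d',' -f1 in1.txt | sort) <(cut -d',' -f1 in2.txt | sort)
IDs only in in1.txt appear in the first output column, IDs only in in2.txt in the second.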

Related

Loop through each column in a CSV file and exporting distinct values to a file

I have a CSV file with columns A-O. 500k rows. In Bash I would like to loop through each column, get distinct values and output them to a file:
sort -k1 -n -t, -o CROWN.csv CROWN.csv && cat CROWN.csv | cut -f1 -d , | uniq > EMPLOYEEID.csv
sort -k2 -n -t, -o CROWN.csv CROWN.csv && cat CROWN.csv | cut -f2 -d , | uniq > SORTNAME.csv
This works, but it feels very manual and would not scale well to, say, 100 columns.
The code sorts the file in place by the given column, then the specified column is passed to uniq to get the distinct values, which are written out.
NB: The first row has the header information.
The above code works, but I'm looking to streamline it somewhat.
Assuming headers can be used as file names for each column:
head -1 test.csv | \
  tr "," "\n" | \
  sed "s/ /_/g" | \
  nl -ba -s$'\t' | \
  while IFS=$'\t' read field name; do
    cut -f$field -d',' test.csv | \
      tail -n +2 | sort -u > "${name}.csv"
  done
Explanation:
head - reads the first line
tr - replaces each , with a newline
sed - replaces whitespace with _ for cleaner file names (tr would also work, and you could then combine it with the previous step, but use sed if you need more complex transforms)
nl - adds the field number
-ba - number all lines
-s$'\t' - set the separator to a tab (not strictly necessary, as it is the default, but included for clarity)
while - reads through the field number/name pairs
cut - selects the field
tail - removes the header line; not all tail implementations have this option, so you can replace it with sed
sort -u - sorts and removes duplicates
>"$name.csv" - saves in the appropriate file name
note: this assumes that there are no , int the fields, otherwise you will need to use a csv parser
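For example, with a hypothetical two-column test.csv (not from the original question), the loop produces one output file per header:
printf 'Employee ID,Sort Name\n2,Smith\n1,Jones\n1,Jones\n' > test.csv
# after running the loop above:
cat Employee_ID.csv   # 1 and 2, one per line
cat Sort_Name.csv     # Jones and Smith, duplicates removed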
Doing all the columns in a single pass is much more efficient than rescanning the entire input file for each column.
awk -F , 'NR==1 { ncols = split($0, cols, /,/); next }
  { for (i = 1; i <= ncols; ++i)
      if (!seen[i ":" $i]++)
        print $i >> (cols[i] ".csv") }' CROWN.csv
If this is going to be part of a bigger task, maybe split the input file into several temporary files with fewer columns than the number of open file handles permitted on your system, rather than fix this script to handle an arbitrary number of columns.
You can inspect this system constant with ulimit -n; on some systems, you can increase it either by tweaking the system configuration or, in the worst case, by recompiling the kernel. (Your question doesn't identify your platform, but this should be easy enough to google.)
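A rough sketch of that splitting idea (the column ranges and file names here are hypothetical; choose ranges that keep the number of open output files under the ulimit -n value):
cut -d, -f1-50   CROWN.csv > CROWN_cols_01-50.csv
cut -d, -f51-100 CROWN.csv > CROWN_cols_51-100.csv
# run the awk one-liner above on each part, then delete the temporary files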
Addendum: I created a quick and dirty timing comparison of these answers at https://ideone.com/dnFj41; I encourage you to fork it and experiment with different shapes of input data. With an input file of 100 columns and (probably) no duplication in the columns -- but only a few hundred rows -- I got the following results:
0.001s Baseline test -- simply copy input file to an identical output file
0.242s tripleee -- this single-pass AWK script
0.561s Sorin -- multiple passes using simple shell script
2.154s Mihir -- multiple passes using AWK
Unfortunately, Carmen's answer could not be tested, because I did not have permissions to install Text::CSV_XS on Ideone.
An earlier version of this answer contained a Python attempt, but I was too lazy to finish debugging it. It's still there in the edit history if you are curious.

What do backticks do in this bash script code?

So I'm doing a problem with a bash script, this one: ./namefreq.sh ANA should return a list of two names (on separate lines), ANA and RENEE, both of which have frequency 0.120.
Basically I have a file, table.csv, used in the code below, that has names with a frequency number next to each, e.g. Anna,0.120.
I'm still unsure what the `` does in this code, and I'm also struggling to understand how it is able to print out two names with identical frequencies. The way I read the code is:
grep matches the word (-w) typed by the user (./bashscript.sh Anna) and stores the result in a; the cut command then extracts the 2nd field of that line, using the delimiter ",", which is the frequency from table.csv; then | cut -f1 -d"," prints the first fields, which are the names with that same frequency.
^ would this be correct?
thanks :)
#!/bin/bash
a=`grep -w $1 table.csv | cut -f2 -d','`
grep -w $a table.csv | cut -f1 -d',' | sort -d
When a command is in backticks or $(), the output of the command is substituted back into the command line in its place. So if the file has Anna,0.120
a=`grep -w Anna table.csv | cut -f2 -d','`
will execute the grep and cut commands, which will output 0.120, so it will be equivalent to
a=0.120
Then the command looks for all the lines that match 0.120, extracts the first field with cut, and sorts them.
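For reference, here is a sketch of the same script using the $( ) form of command substitution, which behaves the same but nests and quotes more cleanly (the added quotes around $1 and $a are a hardening step, not part of the original script):
#!/bin/bash
a=$(grep -w "$1" table.csv | cut -f2 -d',')
grep -w "$a" table.csv | cut -f1 -d',' | sort -d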

bash script: how to diff two files after stripping prefix from each line?

I have two log files. Each line is formatted as follows:
<timestamp><rest of line>
with this timestamp format:
2015-10-06 04:35:55.909 REST OF LINE
I need to diff the two files modulo the timestamps, i.e. I need to compare lines of the two files without their timestamps. What linux tools should I use?
I am on a RedHat 6 machine running bash if it makes a difference
You don't need to create temp files: use bash process substitution
diff <(cut -d" " -f3- log1) <(cut -d" " -f3- log2)
I would first generate the two files to compare with the timestamp prefix removed, using the cut command like this:
cut -f 3- -d " " file_to_compare > cut_file
And then use the diff command.
You can use cut, taking everything from byte 25 onward (the timestamp and the space after it occupy the first 24 bytes):
cat file1 | cut -b25- > file1cut
cat file2 | cut -b25- > file2cut
diff file1cut file2cut
To print all fields but the first two the awk utility (and programming language) can be used:
awk '{$1=$2=""; print $0}' file1 > newfile1
awk '{$1=$2=""; print $0}' file2 > newfile2
diff newfile1 newfile2
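One note not in the original answer: blanking $1 and $2 leaves their field separators behind, so every output line starts with two spaces; since both files get the same treatment, the diff is unaffected. The temporary files can also be avoided with process substitution, as in the first answer (same file1/file2 names assumed):
diff <(awk '{$1=$2=""; print $0}' file1) <(awk '{$1=$2=""; print $0}' file2)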
Well, as you're looking for a tool, why not just use Kompare? It is powerful and well known, and is used by many developers on Linux.
https://www.kde.org/applications/development/kompare/
https://docs.kde.org/trunk5/en/kdesdk/kompare/using.html
Kompare is a GUI front-end program that enables differences between source files to be viewed and merged. It can be used to compare files or the contents of folders, and it supports a variety of diff formats and provides many options to customize the level of information displayed.

How to grep within a grep

I have a bunch of massive text files, about 100MB each.
I want to grep to find entries that have 'INDIANA JONES' in it:
$ grep -ir 'INDIANA JONES' ./
Then, I would like to find the entries where there is the word PORTUGAL within 5,000 characters of the INDIANA JONES term. How would I do this?
# in pseudocode
grep -ir 'INDIANA JONES' ./ | grep 'PORTUGAL' within 5000 char
Use grep's -o flag to output up to 5000 characters surrounding the match, then search those characters for the second string. For example:
grep -ioE ".{0,5000}INDIANA JONES.{0,5000}" file.txt | grep "PORTUGAL"
If you need the original match, add the -n flag to the second grep and pipe into:
cut -f1 -d: > line_numbers.txt
then you could use awk to print those lines:
awk 'FNR==NR { a[$0]; next } FNR in a' line_numbers.txt file.txt
To avoid the temporary file, this could be written as:
awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -ioE ".{0,5000}INDIANA JONES.{0,5000}" file.txt | grep -n "PORTUGAL" | cut -f1 -d:) file.txt
For multiple files, use find and a bash loop:
for i in $(find . -type f); do
awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -ioE ".{0,5000}INDIANA JONES.{0,5000}" "$i" | grep -n "PORTUGAL" | cut -f1 -d:) "$i"
done
One way to deal with this is with gawk. You could set the record separator to either INDIANA JONES or PORTUGAL and then perform a length check on the record (after stripping newlines, assuming newlines do not count towards the limit of 5000). You may have to resort to find to run this recursively within a directory
awk -v RS='INDIANA JONES|PORTUGAL' '
    { a = $0; gsub("\n", "", a) }
    ((RT ~ /IND/ && prevRT ~ /POR/) || (RT ~ /POR/ && prevRT ~ /IND/)) && length(a) < 5000 { found = 1 }
    { prevRT = RT }
    END { if (found) print FILENAME }' file.txt
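To run this over a whole directory tree, one option (a sketch, with check.awk as a hypothetical file name) is to save the program to a file and drive it with find:
# check.awk contains the gawk program shown above
find . -type f -exec gawk -f check.awk {} \;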
Consider installing ack-grep.
sudo apt-get install ack-grep
ack-grep is a more powerful version of grep.
There's no trivial solution to your question (that I can think of) outside of a full script, but you can use the -A and -B flags on ack-grep to specify a number of trailing or leading lines to output, respectively.
That is a number of lines rather than a number of characters, but it is a step in that direction.
While this may not be a full solution, it might give you some idea of how to do it. Look up filters like ack, awk, sed, etc. and see if you can find one with a flag for this kind of behaviour.
The ack-grep manual:
http://manpages.ubuntu.com/manpages/hardy/man1/ack-grep.1p.html
EDIT:
I think the sad news is that what you're looking for is something like:
grep "\(INDIANA JONES\).\{1,5000\}PORTUGAL" filename
The problem is that, even on a small file, a query like this takes impossibly long.
I got it to work with a smaller repetition count; the size of the range is the problem.
For such a large set of files, you'll need to do this in more than one step.
A Solution:
The only solution I know of is the leading and trailing output from ack-grep.
Step 1: how long are your lines?
If you knew how many lines out you had to go (and you can estimate or calculate this in a few ways), you could grep the output of the first grep. Depending on what's in your file, you should be able to get a decent upper bound on how many lines make up 5000 characters: if lines average 100 characters, 50+ lines should cover you, but if they average 10 characters, you'll need 500+.
In short, you need to determine the maximum number of lines that could hold 5000 characters. You can guess or pick a generous range if you like, but that's up to you and your data.
With that, call: (if you needed 100 lines for 5000 chars)
ack-grep -ira "PORTUGAL" -A 100 -B 100 filename
and
ack-grep -ira "INDIANA JONES" -A 100 -B 100 filename
replace the 100s with what you need.
Step 2: parse the output
You'll need to take the matches that ack-grep returns and search within those sub-ranges for the other term.
Look for INDIANA JONES in the output of the PORTUGAL run, and look for PORTUGAL in the output of the INDIANA JONES run.
This takes a bit more work, likely involving a short bash script (I might see if I can get one working this week), but it tackles your massive-data problem by breaking it down into more manageable chunks.
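A minimal sketch of the two steps together, assuming 100 lines covers 5000 characters and using hypothetical portugal_context.txt and indiana_context.txt files to hold the step 1 output:
ack-grep -ira "PORTUGAL" -A 100 -B 100 filename > portugal_context.txt
ack-grep -ira "INDIANA JONES" -A 100 -B 100 filename > indiana_context.txt
grep -i "INDIANA JONES" portugal_context.txt
grep -i "PORTUGAL" indiana_context.txt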
grep 'INDIANA JONES' . -iR -l | while read filename; do head -c 5000 "$filename" | grep -n PORTUGAL -H --label="$filename" ; done
This works as follows:
grep 'INDIANA JONES' . -iR -l: searches all files in or below the current directory (-R), case-insensitively (-i), and only prints the names of the files that match (-l), without printing any content.
| while read filename; do ...|...|...; done: for each line of input, store it in the variable $filename and execute the pipeline.
Now, for each file that matched 'INDIANA JONES', we do
head -c 5000 "$filename" - extract the first 5000 characters
grep ... - search for PORTUGAL. Print the file name (-H), using the name we supply with --label="$filename" (needed because grep is reading from a pipe, not a file). Print line numbers too (-n).

How do I randomly merge two input files to one output file using unix tools?

I have two text files, of different sizes, which I would like to merge into one file, but with the content mixed randomly; this is to create some realistic data for some unit tests. One text file contains the true cases, while the other the false.
I would like to use standard Unix tools to create the merged output. How can I do this?
Random sort using -R:
$ sort -R file1 file2 -o file3
My version of sort does not support -R either. So here is an alternative using awk: insert a random number in front of each line, sort by those numbers, then strip the numbers off.
awk '{print int(rand()*1000), $0}' file1 file2 | sort -n | awk '{$1="";print $0}'
This adds a random number to the beginning of each line with awk, sorts based on that number, and then removes it. This will even work if you have duplicates (as pointed out by choroba) and is slightly more cross platform.
awk 'BEGIN { srand() } { print rand(), $0 }' file1 file2 |
sort -n |
cut -f2- -d" "
