Comparing two huge files in Unix - linux

Requirement is to compare two huge Unix files and writing the difference in third file based on a unique key (first field) after searching few options got the below command:
awk 'FNR==NR{a[$0];next}!($0 in a)' hosts.csv masterlist.csv>results.csv
Though this gives the differences, if for a field one file contains NULL (as a word) and other empty/space for null values how to ignore this in the command and compare other fields?
Also would like to make a generic script or utility with such options, don't need the code but just a suggestion would be helpful.

You can try this fix in your awk:
awk 'FNR==NR{if ($0 !~ /NULL| *|^$/){a[$0]}next}!($0 in a)' hosts.csv masterlist.csv>results.csv
As #fedorqui suggest in comments, here's another alternative:
awk 'FNR==NR{if ($0 !~ /NULL/ && NF){a[$0]}next}!($0 in a)' hosts.csv masterlist.csv>results.csv

try to compare them using binary. if you compress the file into a binary (serialization), you can then compare them quite rapidly. if there is a difference you can then go through the file and compare them using similar methods to git... check their source code. hope this helps

Related

use awk to left outer join two csv file based on multiple columns while keeping order of the first file observations

I have two csv files.
File 1
ID,Name,Gender,Salary,DOB
11,Jim,M,200,90
12,David,M,100,89
12,David,M,300,89
13,Lucy,F,150,86
14,Lily,F,200,85
13,Lucy,F,100,86
File 2
DOB,Name,Children
90,Jim,2
88,Michael,4
88,Lily,1
85,Lily,0
What I want to do is to left outer join File 2 into File 1 based on DOB and Name while keeping the order of File 1 observations.
So the output is expected to be
ID,Name,Gender,Salary,DOB,Children
11,Jim,M,200,90,2
12,David,M,100,89,
12,David,M,300,89,
13,Lucy,F,150,86,
14,Lily,F,200,85,0
13,Lucy,F,100,86,
I learned that we need to sort data if we use join command. So I was wondering whether I could use awk to do my work. But I am new with awk. Is there anyone can help me? By the way, if the data is very big, can I drop print command in awk but simply use > *.csv to save into a new csv file? It's because I found solutions to some related questions in this website often used {print ...}. Thank you.
awk to the rescue!
$ awk -F, 'NR==FNR{a[$1,$2]=$3; next} {print $0 FS a[$NF,$2]}' file2 file1
ID,Name,Gender,Salary,DOB,Children
11,Jim,M,200,90,2
12,David,M,100,89,
12,David,M,300,89,
13,Lucy,F,150,86,
14,Lily,F,200,85,0
13,Lucy,F,100,86,
join will require sorted input and you need embellishments to recover initial ordering. You can redirect the output to a file by adding > outputfile.csv

Is there an efficient way to separate lines into different file, awk in this case?

I am trying to separate a file into two different files based on whether the line contains certain string. If a line contain "ITS", this line and the line right after it will be write to file ITS.txt; if a line contains "V34" then this line and the line right after it will be write to file "V34.txt".
My awk code is
awk '/ITS/{print>"ITX.txt";getline;print>"ITX.txt";}; /V34/{print>"V34.txt";getline;print>"V34.txt";}' seqs.fna
It works well. But I am wondering whether there is an efficient way to do so?
seqs.fna (9-10G)
>16S.V34.S7.5_1
ACGGGAGGCAGCAGTAGGGAATCTTCC
>PCR.ITS.S8.14_2
CATTTAGAGGAAGTAAAAGTCGTAACA
>PCR.ITS.S7.11_3
CATTTAGAGGAAGTACAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTTTTGAAGGCTACAC
>16S.V34.S8.6_4
ACGGGCGGCAGCAGTAGGGAAT
>16S.V34.S8.13_5
ACGGGCGGCAGCAGTAGGGAATCTTCCGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTCTTCGGATCGTAAAACTCTGT
>16S.V34.S7.14_6
ACGGGGGGCAGCAGTAGGGAATCTTCCACAATGGGTGCAAACCTGATGGAGCAATGCCG
>16S.V34.S8.4_7
ACGGGAGGCAGCAGTAGGGAATCTTCCACAAT
>16S.V34.S8.14_8
CGTAGAGATGTGGAGGAACACCAGTGGCGAAG
>16S.V34.S8.8_9
CTGGGATAACACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTTGTAGTC
>16S.V34.S7.3_10
GGTCTGTAATTGACGCTGAGGTTCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCGGGTAGTC
getline has a few very specific uses and this would not be one of them. See http://awk.freeshell.org/AllAboutGetline. If you rewrote your script without getline you'd solve the problem yourself but given the input file you posted, this is all you need:
awk -F'.' '/^>/{out=$2".txt"} {print > out}' seqs.fna
To learn how to use awk correctly, read the book Effective Awk Programming, 4th Edition, by Arnold Robbins.

Split ordered file in Linux

I have a large delimited file (with pipe '|' as the delimiter) which I have managed to sort (using linux sort) according to first (numeric), second (numeric) and fourth column (string ordering since it is a timestamp value). The file is like this:
77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399
77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735
78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I was wondering if there is an easy way to split this file to multiple text files with an awk, sed, grep or perl one liner whenever the first column or the second column value changes. The final result for the example file should be 3 text files like that:
77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399
77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735
78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I could do that in Java of course, but I think it would be kind of overkill, if it can be done with a script. Also, is this possible that the filenames created use those two columns values, something like 77_141.txt for the first file, 77_171.txt for the second file and 78_115.txt for the third one?
awk is very handy for this kind of problems. This can be an approach:
awk -F"|" '{print >> $1"_"$2".txt"}' file
Explanation
-F"|" sets field separator as |.
{print > something} prints the lines into the file something.
$1"_"$2".txt" instead of something, set the output file as $1"_"$2, being $1 the first field based on the | separator. That is, 77, 78... And same for $2, being 141, 171...

Bash script key/value pair regardless of bash version

I am writing a curl bash script to test webservices. I will have file_1 which would contain the URL paths
/path/to/url/1/{dynamic_path}.xml
/path/to/url/2/list.xml?{query_param}
Since the values in between {} is dynamic, I am creating a separate file, which will have values for these params. the input would be in key-value pair i.e.,
dynamic_path=123
query_param=shipment
By combining two files, the input should become
/path/to/url/1/123.xml
/path/to/url/2/list.xml?shipment
This is the background of my problem. Now my questions
I am doing it in bash script, and the approach I am using is first reading the file with parameters and parse it based on '=' and store it in key/value pair. so it will be easy to replace i.e., for each url I will find the substring between {} and whatever the text it comes with, I will use it as the key to fetch the value from the array
My approach sounds okay (at least to me) BUT, I just realized that
declare -A input_map is only supported in bashscript higher than 4.0. Now, I am not 100% sure what would be the target environment for my script, since it could run in multiple department.
Is there anything better you could suggest ? Any other approach ? Any other design ?
P.S:
This is the first time i am working on bash script.
Here's a risky way to do it: Assuming the values are in a file named "values"
. values
eval "$( sed 's/^/echo "/; s/{/${/; s/$/"/' file_1 )"
Basically, stick a dollar sign in front of the braces and transform each line into an echo statement.
More effort, with awk:
awk '
NR==FNR {split($0, a, /=/); v[a[1]]=a[2]; next}
(i=index($0, "{")) && (j=index($0,"}")) {
key=substr($0,i+1, j-i-1)
print substr($0, 1, i-1) v[key] substr($0, j+1)
}
' values file_1
There are many ways to do this. You seem to think of putting all inputs in a hashmap, and then iterate over that hashmap. In shell scripting it's more common and practical to process things as a stream using pipelines.
For example, your inputs could be in a csv file:
123,shipment
345,order
Then you could process this file like this:
while IFS=, read path param; do
sed -e "s/{dynamic_path}/$path/" -e "s/{query_param}/$param/" file_1
done < input.csv
The output will be:
/path/to/url/1/123.xml
/path/to/url/2/list.xml?shipment
/path/to/url/1/345.xml
/path/to/url/2/list.xml?order
But this is just an example, there can be so many other ways.
You should definitely start by writing a proof of concept and test it on your deployment server. This example should work in old versions of bash too.

Using Awk to process a file where each record has different fixed-width fields

I have some data files from a legacy system that I would like to process using Awk. Each file consists of a list of records. There are several different record types and each record type has a different set of fixed-width fields (there is no field separator character). The first two characters of the record indicate the type, from this you then know which fields should follow. A file might look something like this:
AAField1Field2LongerField3
BBField4Field5Field6VeryVeryLongField7Field8
CCField99
Using Gawk I can set the FIELDWIDTHS, but that applies to the whole file (unless I am missing some way of setting this on a record-by-record basis), or I can set FS to "" and process the file one character at a time, but that's a bit cumbersome.
Is there a good way to extract the fields from such a file using Awk?
Edit: Yes, I could use Perl (or something else). I'm still keen to know whether there is a sensible way of doing it with Awk though.
Hopefully this will lead you in the right direction. Assuming your multi-line records are guaranteed to be terminated by a 'CC' type row you can pre-process your text file using simple if-then logic. I have presumed you require fields1,5 and 7 on one row and a sample awk script would be.
BEGIN {
field1=""
field5=""
field7=""
}
{
record_type = substr($0,1,2)
if (record_type == "AA")
{
field1=substr($0,3,6)
}
else if (record_type == "BB")
{
field5=substr($0,9,6)
field7=substr($0,21,18)
}
else if (record_type == "CC")
{
print field1"|"field5"|"field7
}
}
Create an awk script file called program.awk and pop that code into it. Execute the script using :
awk -f program.awk < my_multi_line_file.txt
You maybe can use two passes:
1step.awk
/^AA/{printf "2 6 6 12" }
/^BB/{printf "2 6 6 6 18 6"}
/^CC/{printf "2 8" }
{printf "\n%s\n", $0}
2step.awk
NR%2 == 1 {FIELDWIDTHS=$0}
NR%2 == 0 {print $2}
And then
awk -f 1step.awk sample | awk -f 2step.awk
You probably need to suppress (or at least ignore) awk's built-in field separation code, and use a program along the lines of:
awk '/^AA/ { manually process record AA out of $0 }
/^BB/ { manually process record BB out of $0 }
/^CC/ { manually process record CC out of $0 }' file ...
The manual processing will be a bit fiddly - I suppose you'll need to use the substr function to extract each field by position, so what I've got as one line per record type will be more like one line per field in each record type, plus the follow-on printing.
I do think you might be better off with Perl and its unpack feature, but awk can handle it too, albeit verbosely.
Could you use Perl and then select an unpack template based on the first two chars of the line?
Better use some fully featured scripting language like perl or ruby.
What about 2 scripts? E.g. 1st script inserts field separators based on the first characters, then the 2nd should process it?
Or first of all define some function in your AWK script, which splits the lines into variables based on the input - I would go this way, for the possible re-usage.

Resources