How to compare and replace fields using awk - linux

I have two files. If Field-9 of File-1 and Field-1 of File-2 are the same, then replace Field-7 of File-1 with Field-2 of File-2.
file1:
12345||||||756432101000||756432||||
aaaaa||||||986754812345||986754||||
ccccc||||||134567222222||134567||||
file2:
756432|AAAAAAAAAAA
986754|20030040000
The expected output is:
12345||||||AAAAAAAAAAA||756432||||
aaaaa||||||20030040000||986754||||
ccccc||||||134567222222||134567||||
I tried this code
awk -F"|" 'NR==FNR{a[$1]=$2} NR>FNR{$7=a[$2];print}' OFS='|' file2 file1
but instead of replacing the field, it gets deleted.

You are using the wrong column as the index of the array in the second block, and you are not checking for missing keys. This produces the output you posted:
awk -F '|' -v OFS='|' 'NR==FNR{a[$1]=$2;next}$9 in a{$7=a[$9]}1' file2 file1
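For readability, here is the same one-liner expanded with comments (a sketch of the same logic, with field numbers per the pipe-separated layout in the question):
awk -F '|' -v OFS='|' '
NR == FNR {       # reading file2: build a lookup table
    a[$1] = $2    # key = code (column 1), value = replacement text
    next          # skip to the next line of file2
}
$9 in a {         # reading file1: only when column 9 is a known key
    $7 = a[$9]    # replace column 7 with the mapped value
}
1                 # print every line of file1, modified or not
' file2 file1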

Related

How do I compare two files in unix based on their columns

I am fairly new to unix commands, but I have two .csv files where I would like to compare the first column, either with diff or comm. Every line is different if I compare the whole line, which is why I want to compare the first column in each file and have the difference printed out as a number, where each land code should not be counted more than once. The first file also has a header that I want to skip in the comparison.
sample from file1:
iso_code,continent,location,date,total_cases
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,2.0
AFG,Oslo,Norway,2020-09-06,324.0
AZE,Hamburg,Germany,2020-03-30,29.0
sample from file2:
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,5.0
ABW,Chil Ukrain,Aruba,2020-10-06,4449.0
ALB,Upsala,Sweden,2020-08-275.0,
AFG,Afghanistan,,2020-09-06,324.0
The expected output should be "2", as there are two occurrences of the same land code in the two files. Duplicates of the country code should only be counted one time. That is why the expected output should be 2 and not 3.
I have tried multiple solutions:
awk 'NR==FNR{c[$1]++;next};c[$1] == 0' owid-covid-data-filtered.csv owid-covid-data.csv | wc -l
With the awk command I get the output: 1
and
diff owid-covid-data.csv owid-covid-data-filtered.csv |cut -d' ' -f1 owid-covid-data-filtered.csv| wc -l
Overall, I want the occurrences that appear in column 1 of both file1 and file2.
From the condition c[$1] == 0 in the awk script from the question, I assumed you want to print lines from file2 that contain a code that is not present in file1.
As it is now clarified that you want to count the codes that are present in both files, see the end of this answer for the reverse check.
Slight modifications to your script will fix the problems:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1]++ == 0' file1 file2
Option -F , specifies comma (,) as field separator.
The statement if(NR!=1)c[$1]++; skips the header line in file1.
The post-increment operator in c[$1]++ == 0 will make the condition fail for the second or later occurrence of the same code in file2.
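To see the deduplication effect of the post-increment in isolation, here is a minimal sketch:
printf 'X\nX\nY\n' | awk 'c[$1]++ == 0'
This prints X and Y once each; the second X fails the condition because c["X"] is already 1.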
I omit the trailing | wc -l here to show the output lines.
I modified file2 to contain two lines with the same code in column 1 that is not present in file1.
With file2 shown here
AND,Europe,Andorra,2020-07-26,897.0
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
ALB,Europe,Albania,2020-08-23,8275.1
ALB,Europe,Albania,2020-08-23,8275.2
AFG,Asia,Afghanistan,2020-09-06,38324.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
and file1 from the question I get this output:
AND,Europe,Andorra,2020-07-26,897.0
ALB,Europe,Albania,2020-08-23,8275.1
(Only the first line with ALB is printed.)
You can also implement the counting in awk instead of using wc -l.
awk -F , 'NR==FNR { if(NR!=1)c[$1]++; next } c[$1]++ == 0 {count++} END {print count}' file1 file2
If you want to print the lines from file2 that contain a code that is present in file1, the script can be modified like this:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1] { c[$1]=0; print}' file1 file2
This prints
ABW,North America,Aruba,2020-03-13,2.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
(Only the first line for each code is printed.)
An alternative solution, as requested in a comment:
tail -n +2 file1|cut -f1 -d,|sort -u>code1
cut -f1 -d, file2|sort -u>code2
fgrep -vf code1 code2
rm code1 code2
Or combined in one command without using temporary files code1 and code2:
fgrep -f <(tail -n +2 file1|cut -f1 -d,|sort -u) <(cut -f1 -d, file2|sort -u)
Add | wc -l to count the lines instead of printing them.
Explanation:
tail -n +2 prints everything starting from the 2nd line
cut -f1 -d, prints the first field, delimited with ,
sort -u sorts the lines and removes duplicates
fgrep -f code1 code2 prints all lines from code2 that contain any of the strings from code1; with -v (as in the first variant) it prints the lines that do not match any of them
occurrences that are similar in both file1 and file2 column 1:
$ awk -F, 'NR==FNR{a[$1];next}$1 in a' file1 file2
Output:
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
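If you just want the final number, counting each matching code only once, here is a sketch combining both checks (n+0 guards against an empty count when nothing matches):
awk -F, 'NR==FNR{if(FNR!=1)a[$1];next} $1 in a && !seen[$1]++{n++} END{print n+0}' file1 file2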

AWK: Comparing substrings from two files and writing to a third file

I'm trying to compare two different files, say "file1" and "file2", in this way: if the substring of 5 characters at positions 8 to 12 matches in both files, then remove that matching row from file1. Finally, write the output to file3 (the output contains the remaining rows of file1 that do not match file2).
My output is the non-matching rows of file1.
Output (file3) = File1 - File2
File1
-----
aqcdfdf4555578782121
axcdfdf4555575782321
aecdfdf7555578782221
aqcdfdf9555578782121
File2
-----
aqcdfdf4555578782121
axcdfdf2555575782321
File3
-----
aecdfdf7555578782221
aqcdfdf9555578782121
I tried awk, but I need something that looks at a substring in the two files, since there are no delimiters in my files.
$ awk 'FNR==NR {a[$1]; next} $1 in a' f1 f2 > file3
You could try the following, written and tested with the shown samples in GNU awk. Once you are happy with the results on the terminal, redirect the output to file3 (append > file3 to the command).
awk '{str=substr($0,8,5)} FNR==NR{a[str];next} !(str in a)' file2 file1
Explanation: a detailed breakdown of the above.
awk ' ##Starting awk program from here.
{
str=substr($0,8,5) ##Creating str which has sub-string of current line from 8th to 12th character.
}
FNR==NR{ ##Checking condition FNR==NR which will run when Input_file2 is being read.
a[str] ##Creating array a with index of str here.
next ##next will skip all further statements from here.
}
!(str in a) ##Checking condition if str is NOT present in a then print that line from Input_file1.
' file2 file1 ##Mentioning Input_file names here.
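Once the terminal output looks right, redirect it to file3 as described above:
awk '{str=substr($0,8,5)} FNR==NR{a[str];next} !(str in a)' file2 file1 > file3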

Awk to remove and move rows/records to another file where 3rd field value is empty

I have the below pipe-separated file.
I want to remove the rows where the 3rd field is blank in File1 and write those removed lines from File1 into another file (File2).
I tried the below code. It works fine and removes the rows where the 3rd field is blank, but I am not able to figure out how to write those removed lines to another file as part of the same command.
So I need a single statement that removes the rows where the 3rd column is empty from File1 and writes those removed rows into another file (like File2):
awk -F"|" -v OFS"|" '$3!=""' File1.txt > test.txt
File1
billingtype|documentnumber|originaldocumentnumber
YMNC|420075416|765467
YMNC|429842808|74646464
YPBC|429842809
INV|430071605|7688888
YPBC|430071609
Output File
billingtype|documentnumber|originaldocumentnumber
YMNC|420075416|765467
YMNC|429842808|74646464
INV|430071605|7688888
File2
billingtype|documentnumber|originaldocumentnumber
YPBC|429842809
YPBC|430071609
$ awk 'BEGIN {FS=OFS="|"}
FNR==1 {print > "File2"}
{if($3!="") print;
else print > "File2"}' File1 > tmp && mv tmp File1
The header is printed to both files. The output goes to a temp file, which is then moved over the input file. Records with the missing third field are printed to the other file.
You can try this GNU sed (the W command writes the first line of the pattern space to the named file):
sed -i -Ee '1{W File2' -e 'b' -e '}' -e '/(.*\|){2}/!{W File2' -e 'd}' File1

find records longer/shorter than a particular col

This is my file, FILEABC.txt:
Name|address|age|country
john|london|12|UK
adam|newyork|39|US|X12|123
jake|madrid|45|ESP
ram|delhi
joh|cal|34|US|788
I wanted to find the number of fields per record in the file, so I have this command:
cat FILEABC.txt | awk --field-separator='|' '{print NF}' | sort -n |uniq -c
The result I get for this command is:
1 2
3 4
1 5
1 6
My requirement is: how do I find the records that have only 2 fields, 4 fields, and so on in my file?
For example, if I want to see the records having only 2 columns:
ram|delhi
If I want to see the records having more than 4 columns:
adam|newyork|39|US|X12|123
If you want to print only the records which have 2 fields, the following may help:
awk -F"|" 'NF==2' Input_file
If you need lines with more than 4 fields, change the above condition to NF>4; for lines with more than 5 fields, use NF>5, and so on.
Explanation: With -F"|" I make sure the field separator is a pipe. NF is a built-in awk variable holding the TOTAL number of fields in a line, so the script checks whether the number of fields equals 2. No explicit print is written because awk works on pairs of condition and action; when a condition is TRUE and no action is given, the default action print runs for that line.
In awk, the variable NF gives the total number of fields in a record/row. By default awk uses whitespace as the field separator; if you alter FS, NF is calculated based on the separator you set. So you can do:
awk -v FS='|' 'NF==2' infile
Which is the same as
# Usual Syntax : awk 'condition { action }' infile
awk -v FS='|' 'NF==2{ print }' infile
For more than 4 fields,
awk -v FS='|' 'NF > 4' infile
You can also use grep to filter two-column records:
grep '^[^|]*|[^|]*$' FILEABC.txt
It will output:
ram|delhi
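A similar grep sketch for records with more than 4 columns (at least 4 pipes means at least 5 fields):
grep -E '^([^|]*\|){4}' FILEABC.txt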

extracting unique values between 2 sets/files

Working in a linux/shell environment, how can I accomplish the following?
text file 1 contains:
1
2
3
4
5
text file 2 contains:
6
7
1
2
3
4
I need to extract the entries in file 2 which are not in file 1, so '6' and '7' in this example.
How do I do this from the command line?
many thanks!
$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
6
7
Explanation of how the code works:
If we're working on file1, track each line of text we see.
If we're working on file2, and have not seen the line text, then print it.
Explanation of details:
FNR is the current file's record number
NR is the current overall record number from all input files
FNR==NR is true only when we are reading file1
$0 is the current line of text
a[$0] is a hash with the key set to the current line of text
a[$0]++ tracks that we've seen the current line of text
!($0 in a) is true only when we have not seen the line text
Print the line of text if the above pattern returns true; this is the default awk behavior when no explicit action is given.
Using some lesser-known utilities:
sort file1 > file1.sorted
sort file2 > file2.sorted
comm -1 -3 file1.sorted file2.sorted
This will output duplicates, so if there is one occurrence of 3 in file1 but two in file2, this will still output one 3. If this is not what you want, pipe the output from sort through uniq before writing it to a file:
sort file1 | uniq > file1.sorted
sort file2 | uniq > file2.sorted
comm -1 -3 file1.sorted file2.sorted
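With bash process substitution you can skip the temporary files entirely (a sketch):
comm -13 <(sort -u file1) <(sort -u file2)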
There are lots of utilities in the GNU coreutils package that allow for all sorts of text manipulations.
I was wondering which of the following solutions was the "fastest" for "larger" files:
awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2 # awk1 by SiegeX
awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2 # awk2 by ghostdog74
comm -13 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)
grep -v -F -x -f file1 file2
Results of my benchmarks in short:
Do not use grep -Fxf, it's much slower (2-4 times in my tests).
comm is slightly faster than join.
If file1 and file2 are already sorted, comm and join are much faster than awk1 + awk2. (The awk solutions, of course, do not require sorted input.)
awk1 + awk2, supposedly, use more RAM and less CPU. Real run times are lower for comm, probably because it uses more threads. CPU times are lower for awk1 + awk2.
For the sake of brevity I omit full details. However, I assume that anyone interested can contact me or just repeat the tests. Roughly, the setup was
# Debian Squeeze, Bash 4.1.5, LC_ALL=C, slow 4 core CPU
$ wc file1 file2
321599 321599 8098710 file1
321603 321603 8098794 file2
Typical results of fastest runs
awk2: real 0m1.145s user 0m1.088s sys 0m0.056s user+sys 1.144
awk1: real 0m1.369s user 0m1.324s sys 0m0.044s user+sys 1.368
comm: real 0m0.980s user 0m1.608s sys 0m0.184s user+sys 1.792
join: real 0m1.080s user 0m1.756s sys 0m0.140s user+sys 1.896
grep: real 0m4.005s user 0m3.844s sys 0m0.160s user+sys 4.004
BTW, for the awkies: It seems that a[$0]=1 is faster than a[$0]++, and (!($0 in a)) is faster than (!a[$0]). So, for an awk solution I suggest:
awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2
How about:
diff file_1 file_2 | grep '^>' | cut -c 3-
This would print the entries in file_2 which are not in file_1. For the opposite result one just has to replace '>' with '<'. 'cut' removes the first two characters added by 'diff', which are not part of the original content.
The files don't even need to be sorted.
With grep (-F fixed strings, -x whole-line matches, -v invert the match, -f read patterns from file_1):
grep -F -x -v -f file_1 file_2
Here's another awk solution:
$ awk 'FNR==NR{a[$0]++;next}(!($0 in a))' file1 file2
6
7
$ cat file1 file1 file2 | sort | uniq -u
6
7
uniq -- report or filter out repeated lines in a file
... Repeated lines in the input will not be detected if they are not adjacent, so it may be necessary to sort the files first.
-u Only output lines that are not repeated in the input.
Print file1 twice to make sure all entries from file1 are skipped by uniq -u.
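The same trick without cat, letting sort read the files directly:
sort file1 file1 file2 | uniq -u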
cat file1 file2 | sort -u > unique
If you are really set on doing this from the command line, this site (search for "no duplicates found") has an awk example that searches for duplicates. It may be a good starting point.
However, I'd encourage you to use Perl or Python for this. Basically, the flow of the program would be (sketched here in Python):
def find_unique_values(file1, file2):
    # read all values from both files
    contents1 = [line.rstrip('\n') for line in open(file1)]
    contents2 = [line.rstrip('\n') for line in open(file2)]
    # for each value in file2, look for it in file1
    for value2 in contents2:
        found = False
        for value1 in contents1:
            if value2 == value1:
                found = True
        if not found:
            print(value2)

find_unique_values('file1', 'file2')
This isn't the most elegant way of doing it, since it has an O(n^2) time complexity, but it will do the job.
