AWK compare two columns in two separate files - linux

I would like to compare two files and do something like this: if the 5th column in the first file is equal to the 5th column in the second file, I would like to print the whole line from the first file. Is that possible? I searched for the issue but was unable to find a solution :(
The files are separated by tabulators and I tried something like this:
zcat file1.txt.gz file2.txt.gz | awk -F'\t' 'NR==FNR{a[$5];next}$5 in a {print $0}'
Did anybody try to do a similar thing? :)
Thanks in advance for your help!

Your script is fine, but piping through zcat merges both files into a single stream, so NR==FNR stays true for every line. You need to provide each file to awk individually, and in reverse order.
$ cat file1.txt
a b c d 100
x y z w 200
p q r s 300
1 2 3 4 400
$ cat file2.txt
. . . . 200
. . . . 400
$ awk 'NR==FNR{a[$5];next} $5 in a {print $0}' file2.txt file1.txt
x y z w 200
1 2 3 4 400
EDIT:
As pointed out in the comments, the generic solution above can be improved and tailored to OP's situation of starting with compressed tab-separated files:
$ awk -F'\t' 'NR==FNR{a[$5];next} $5 in a' <(zcat file2.txt.gz) <(zcat file1.txt.gz)
x y z w 200
1 2 3 4 400
Explanation:
NR is the number of the current record being processed and FNR is the number of the current record within its file. Thus NR == FNR is only true while awk is processing the first file given to it (which in our case is file2.txt).
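To see the two counters side by side, here is a quick sketch run on the sample files above:
$ awk '{print FILENAME, NR, FNR}' file2.txt file1.txt
file2.txt 1 1
file2.txt 2 2
file1.txt 3 1
file1.txt 4 2
file1.txt 5 3
file1.txt 6 4
NR keeps growing across files while FNR resets for each file, so the two are only equal while the first file is being read.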
a[$5] adds the value of the 5th column as an index to the array a. Arrays in awk are associative arrays, but often you don't care about associating a value and just want a convenient collection of things. This is a pithy way to build a collection of all the values seen in the 5th column of the first file. The next statement, which follows, says to immediately move on to the next available record without evaluating any more statements in the awk program.
Summarizing the above, this line says: "If you're reading the first file (file2.txt), save the value of column 5 in the array called a and move on to the next record without continuing with the rest of the awk program."
NR == FNR { a[$5]; next }
Hopefully it's clear from the above that the only way we can get past that first line of the awk program is if we are reading the second file (file1.txt in our case).
$5 in a evaluates to true if the value of the 5th column occurs as an index in
the a array. In other words, it is true for every record in file1.txt whose 5th
column we saw as a value in the 5th column of file2.txt.
In awk, when the pattern portion evaluates to true, the accompanying action is
invoked. When there's no action given, as below, the default action is triggered
instead, which is to simply print the current record. Thus, by just saying
$5 in a, we are telling awk to print all the records in file1.txt whose 5th
column also occurs in file2.txt, which of course was the given requirement.
$5 in a
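For instance, here is a minimal sketch of a pattern with no action on some made-up data; every record for which the pattern is true is printed unchanged:
$ printf 'a 1\nb 2\nc 3\n' | awk '$2 > 1'
b 2
c 3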

Related

Awk iteratively replacing strings from array

I've recently been trying to do the following in awk:
we have two files (F1.txt and F2.txt.gz). While streaming the second one, I want to replace all occurrences of entries from F1.txt with their substrings. I came to this point:
zcat F2.txt.gz |
awk 'NR==FNR {a[$1]; next}
     {for (i in a)
          $0=gsub(i, substr(i, 0, 2), $0)   # this does not work of course
     }
     {print $0}
    ' F1.txt -
Was wondering how to do this properly in Awk. Thanks!
Please correct these assumptions if wrong: you have two files, the first of which contains a set of entries. If the second file contains any one of these words, replace it with its first two characters.
Example:
==> file1 <==
Azerbaijan
Belarus
Canada
==> file2 <==
Caspian sea is in Azerbaijan
Belarus is in Europe
Canada is in metric system.
$ awk 'NR==FNR {a[$1]; next}
       {for(i=1;i<=NF;i++)
            if($i in a) $i=substr($i,1,2)}1' file1 file2
Caspian sea is in Az
Be is in Europe
Ca is in metric system.
Note that substring indexes start at 1 in awk.
Try changing
$0=gsub(i, substr(i, 0, 2), $0)
into
gsub(i, substr(i, 1, 2))
The return value of gsub() is the number of replacements made, not the string after the replacement, so assigning it to $0 clobbers the record. (Note also that substr() positions start at 1, not 0.)
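A quick sketch of the difference, on a throwaway string:
$ echo 'aaa' | awk '{n = gsub(/a/, "b"); print n, $0}'
3 bbb
gsub() modified $0 in place and returned the count 3; assigning that count back to $0, as in the original attempt, replaces the whole record with a number.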
$0=gsub(i, substr(i, 0, 2), $0) #this does not work of course
GNU AWK's gsub function alters the value of its 3rd argument in place (thus it must be assignable) and returns the number of substitutions made. You should not care about the return value if you just want the altered value.
Consider the following simple example. Let file1.txt's content be
a x
b y
c z
and file2.txt content be
quick fox jumped over lazy dog
then
awk 'FNR==NR{arr[$1]=$2;next}{for(i in arr){gsub(i,arr[i],$0)};print}' file1.txt file2.txt
gives output
quizk fox jumped over lxzy dog
Be warned that if there is any chain in your replacements, such as
a b
b c
then the output becomes dependent on the array traversal order.
(tested in gawk 4.2.1)
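Here is a minimal sketch of that order dependence, using a hypothetical rules.txt containing the chain above:
$ printf 'a b\nb c\n' > rules.txt
$ echo 'a' | awk 'FNR==NR{arr[$1]=$2;next}{for(i in arr) gsub(i,arr[i]); print}' rules.txt -
If the a->b rule happens to be visited first, the b->c rule then fires on the result and this prints c; if b->c is visited first, it prints b.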

How to insert a column at the start of a txt file using awk?

How do I use awk to insert a column at the start of a txt file, running from 1 to 2059, which corresponds to the number of rows in my file? I know the command will be something like this:
awk '{$1=" "}1' File
I'm not sure what to put between the quote marks to get 1-2059.
I also want to include a header in the header row, so 1 should technically only go in the second row.
ID Heading1
RQ1293939 -7.0494
RG293I32SJ -903.6868
RQ19238983 -0899977
rq747585950 988349303
FID ID Heading1
1 RQ1293939 -7.0494
2 RG293I32SJ -903.6868
3 RQ19238983 -0899977
4 rq747585950 988349303
So I need to insert the FID with 1 - 2059 running down the first column
What you show does not work, it just replaces the first field ($1) with a space and prints the result. If you do not have empty lines try:
awk 'NR==1 {print "FID\t" $0; next} {print NR-1 "\t" $0}' File
Explanations:
NR is the awk variable that counts the records (the lines, in our case), starting from 1. So NR==1 is a condition that holds only when awk processes the first line. In this case the action block says to print FID, a tab (\t), the original line ($0), and then move to next line.
The second action block is executed only if the first one has not been executed (due to the final next statement). It prints NR-1, that is the line number minus one, a tab, and the original line.
If you have empty lines and you want to skip them we will need a counter variable to keep track of the current non-empty line number:
awk 'NR==1 {print "FID\t" $0; next} NF==0 {print; next} {print ++cnt "\t" $0}' File
Explanations:
NF is the awk variable that counts the fields in a record (the space-separated words, in our case). So NF==0 is a condition that holds only on empty lines (or lines that contain only spaces). In this case the action block says to print the empty line and move to the next.
The last action block is executed only if none of the two others have been executed (due to their final next statement). It increments the cnt variable, prints it, prints a tab, and prints the original line.
Uninitialized awk variables (like cnt in our example) take value 0 when they are used for the first time as a number. ++cnt increments variable cnt before its value is used by the print command. So the first time this block is executed cnt takes value 1 before being printed. Note that cnt++ would increment after the printing.
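A quick sketch of the pre- versus post-increment difference:
$ awk 'BEGIN{print ++a; print b++; print b}'
1
0
1
++a increments before the value is used, while b++ yields the old value first, which is why the command above uses ++cnt.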
Assuming you don't really have a blank row between your header line and the rest of your data:
awk '{print (NR>1 ? NR-1 : "FID"), $0}' file
Use awk -v OFS='\t' '...' file if you want the output to be tab-separated or pipe it to column -t if you want it visually tabular.
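For instance, with the sample data above, the tab-separated variant would produce:
$ awk -v OFS='\t' '{print (NR>1 ? NR-1 : "FID"), $0}' file
FID     ID Heading1
1       RQ1293939 -7.0494
2       RG293I32SJ -903.6868
3       RQ19238983 -0899977
4       rq747585950 988349303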

Less rows than expected after comparing two files

I have two files to be compared:
"base" file from where I get values in the second column after comparing it with "temp" file
"temp" file which is continuously changing (e.g., in every loop)
"base" file:
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
"temp" file:
2.3
1.8
4.5
For comparison, the following code is used:
awk 'NR==FNR{A[$1]=$2;next} {i=int($1+.01)} i in A {print A[i]}' base temp
Therefore, it outputs:
b
a
d
As you can see, even though there are decimal numbers in the "temp" file, the corresponding letters are found and printed. However, I found that with a larger file (e.g., more than a couple thousand rows in the "temp" file) the code always outputs 158 rows less than the actual number of rows in the "temp" file. I do not understand why this happens and would like your support to circumvent it.
In the following example, "tmpctd" is the base file and "tmpsf" is the changing file.
awk 'NR==FNR{A[$1]=$2;next} {i=int($1+.01)} i in A {print A[i]}' tmpctd tmpsf
The above comparison produces 22623 rows, but "tmpsf" (i.e., the "temp" file) has 22781 rows, thus 158 fewer rows after comparing both files. For testing please find these files here: https://file.io/pxi24ZtPt0kD and https://file.io/tHgdI3dkbKhr.
Any hints are welcomed.
PS. I updated both links, sorry for that.
Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
FNR==NR{
  a[int($1)]
  next
}
($1 in a){
  print $2
}
' temp_file base_file
Explanation: a detailed explanation of the above.
awk '                 ##Starting awk program from here.
FNR==NR{              ##Condition FNR==NR is TRUE while the first file (temp_file) is being read.
  a[int($1)]          ##Creating array a indexed by the integer value of the 1st field of the current line.
  next                ##next skips all further statements for this record.
}
($1 in a){            ##If the first field is present in array a, then do the following.
  print $2            ##Printing the 2nd field of the current line.
}
' temp_file base_file ##Mentioning Input_file names here.
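Note that awk's int() truncates toward zero rather than rounding; that is what makes a decimal like 1.8 in the temp file match the integer 1 from the base file. A quick check:
$ awk 'BEGIN{print int(2.3), int(1.8), int(4.5)}'
2 1 4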

Splitting the first column of a file in multiple columns using AWK

File looks like this, but with millions of lines (TAB separated):
1_number_column_ranking_+ 100 200 Target "Hello"
I want to split the first column by the _ so it becomes:
1 number column ranking + 100 200 Target "Hello"
This is the code I have been trying:
awk -F"\t" '{n=split($1,a,"_");for (i=1;i<=n;i++) print $1"\t"a[i]}'
But it's not quite what I need.
Any help is appreciated (the other threads on this topic were not helpful for me).
No need to split; a replacement will do:
awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1'
Eg:
$ cat file
1_number_column_ranking_+ 100 200 Target "Hello"
$ awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1' file
1 number column ranking + 100 200 Target "Hello"
gsub replaces all occurrences; when no 3rd argument is given, it operates on $0.
The final 1 is a shortcut for {print}: it is a pattern that is always true, with the implied default action {print}.
Another awk, if the "_" appears only in the first column:
split the input fields on the regex "[_\t]" and do a dummy assignment like $1=$1 in the main block, so that $0 is reconstructed with OFS="\t".
$ cat steveman.txt
1_number_column_ranking_+ 100 200 Target "Hello"
$ awk -F"[_\t]" ' BEGIN { OFS="\t"} { $1=$1; print } ' steveman.txt
1 number column ranking + 100 200 Target "Hello"
$
Thanks @Ed; updated from -F"[_\t]+" to -F"[_\t]", which avoids collapsing empty fields.
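A quick sketch of why the + matters when the data can contain empty tab-delimited fields (made-up input):
$ printf 'a_b\tx\t\ty\n' | awk -F'[_\t]+' '{print NF}'
4
$ printf 'a_b\tx\t\ty\n' | awk -F'[_\t]' '{print NF}'
5
With the +, the two consecutive tabs count as one separator and the empty field between them is lost.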

awk sum every 4th number - field

So my input file is:
1;a;b;2;c;d;3;e;f;4;g;h;5
1;a;b;2;c;d;9;e;f;101;g;h;9
3;a;b;1;c;d;3;e;f;10;g;h;5
I want to sum the numbers and then write the result to a file (so I need every 4th field).
I tried many sum examples on the net but didn't find an answer to my problem.
My output file should look like:
159
Thanks!
Update:
a;b;2;c;d;g
3;e;3;s;g;k
h;5;2;d;d;l
The problem is the same, but now the numbers are in the 3rd field of each line.
So 2+3+2.
Output: 7
Apparently you want to sum every 3rd field, not every 4th. The following code loops through all the fields, summing each one at a 3k+1 position.
$ awk -F";" '{for (i=1; i<=NF; i+=3) sum+=$i} END{print sum}' file
159
The value is printed after processing the whole file, in the END {} block.
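For the updated input, where the number to sum is always the 3rd field, the loop is unnecessary; a minimal sketch on the updated sample:
$ awk -F';' '{sum+=$3} END{print sum}' file
7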
