Less rows than expected after comparing two files - linux

I have two files to be compared:
"base" file from where I get values in the second column after comparing it with "temp" file
"temp" file which is continuously changing (e.g., in every loop)
"base" file:
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
"temp" file:
2.3
1.8
4.5
For comparison, the following code is used:
awk 'NR==FNR{A[$1]=$2;next} {i=int($1+.01)} i in A {print A[i]}' base temp
Therefore, it outputs:
b
a
d
As noticed, even though there are decimals numbers in "temp" file, the corresponding letters are found and printed. However, I found that with a larger file (e.g., more than a couple of thousands row records in "temp" file) the code always outputs "158" rows less than the actual number of rows in the "temp" file. I do not get why this happens and would like your support to circumvent this.
In the following example, "tmpctd" is the base file and "tmpsf" is the changing file.
awk 'NR==FNR{A[$1]=$2;next} {i=int($1+.01)} i in A {print A[i]}' tmpctd tmpsf
The above comparison produces 22623 rows, but the "tmpsf" (i.e., "temp" file) has 22781 rows. Thus, 158 rows less after comparing both files. For testing please find these files here: https://file.io/pxi24ZtPt0kD and https://file.io/tHgdI3dkbKhr.
Any hints are welcomed.
PS. I updated both links, sorry for that.

Could you please try following, written and tested with shown samples in GNU awk.
awk '
FNR==NR{
a[int($1)]
next
}
($1 in a){
print $2
}
' temp_file base_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when temp_file is being read.
a[int($1)] ##Creating array a which has index as integer value of 1st field of current line.
next ##next will skip all further statements from here.
}
($1 in a){ ##Checking condition if first field is present in array a then do following.
print $2 ##Printing 2nd field of currnet line.
}
' temp_file base_file ##Mentioning Input_file names here.

Related

How to insert a column at the start of a txt file using awk?

How to insert a column at the start of a txt file running from 1 to 2059 which corresponds to the number of rows I have in my file using awk. I know the command will be something like this:
awk '{$1=" "}1' File
Not sure what to put between the speech-marks 1-2059?
I also want to include a header in the header row so 1 should only go in the second row technically.
**ID** Heading1
RQ1293939 -7.0494
RG293I32SJ -903.6868
RQ19238983 -0899977
rq747585950 988349303
FID **ID** Heading1
1 RQ1293939 -7.0494
2 RG293I32SJ -903.6868
3 RQ19238983 -0899977
4 rq747585950 988349303
So I need to insert the FID with 1 - 2059 running down the first column
What you show does not work, it just replaces the first field ($1) with a space and prints the result. If you do not have empty lines try:
awk 'NR==1 {print "FID\t" $0; next} {print NR-1 "\t" $0}' File
Explanations:
NR is the awk variable that counts the records (the lines, in our case), starting from 1. So NR==1 is a condition that holds only when awk processes the first line. In this case the action block says to print FID, a tab (\t), the original line ($0), and then move to next line.
The second action block is executed only if the first one has not been executed (due to the final next statement). It prints NR-1, that is the line number minus one, a tab, and the original line.
If you have empty lines and you want to skip them we will need a counter variable to keep track of the current non-empty line number:
awk 'NR==1 {print "FID\t" $0; next} NF==0 {print; next} {print ++cnt "\t" $0}' File
Explanations:
NF is the awk variable that counts the fields in a record (the space-separated words, in our case). So NF==0 is a condition that holds only on empty lines (or lines that contain only spaces). In this case the action block says to print the empty line and move to the next.
The last action block is executed only if none of the two others have been executed (due to their final next statement). It increments the cnt variable, prints it, prints a tab, and prints the original line.
Uninitialized awk variables (like cnt in our example) take value 0 when they are used for the first time as a number. ++cnt increments variable cnt before its value is used by the print command. So the first time this block is executed cnt takes value 1 before being printed. Note that cnt++ would increment after the printing.
Assuming you don't really have a blank row between your header line and the rest of your data:
awk '{print (NR>1 ? NR-1 : "FID"), $0}' file
Use awk -v OFS='\t' '...' file if you want the output to be tab-separated or pipe it to column -t if you want it visually tabular.

Merging two txt files based on a common column with different row numbers

I would like to merge two whitespace-delimited files without sorting them first based on the "phenotype" column. File 1 contains the same phenotype several times, while file 2 has each phenotype only once. I need to match "phenotype" from file 1 to "category" in file 2.
File 1:
chr pos pval_EAS phenotype FDR
1 1902906 0.234 biomarkers-30600-both_sexes-irnt.tsv.gz 1
2 1475898 0.221 biomarkers-30600-both_sexes-irnt.tsv.gz 1
2 568899 0.433 continuous-4566-both_sexes-irnt.tsv.gz 1
2 2435478 0.113 continuous-4566-both_sexes-irnt.tsv.gz 1
4 1223446 0.112 phecode-554-both_sexes-irnt.tsv.gz 0.345
4 3456573 0.0003 phecode-554-both_sexes-irnt.tsv.gz 0.989
File 2:
phenotype Category
biomarkers-30600-both_sexes-irnt.tsv.bgz Metabolic
continuous-4566-both_sexes-irnt.tsv.gz Neoplasms
phecode-554-both_sexes-irnt.tsv.gz Immunological
I tried the following, but I don't get the desired output:
awk -F' ' 'FNR==NR{a[$1]=$4; next} {print $0 a[$6]}' file2 file1 > file3
With your shown samples, please try following.
awk 'FNR==NR{arr[$1]=$2;next} ($4 in arr){print $0,arr[$4]}' file2 file1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when file2 is being read.
arr[$1]=$2 ##Creating array arr with index of $1 and value is $2.
next ##next will skip all further statements from here.
}
($4 in arr){ ##Checking condition if 4th field is in arr then do following.
print $0,arr[$4] ##Printing current line along with value of arr with 4th field as index number.
}
' file2 file1 ##Mentioning Input_file names here.
Bonus solution: In case you want to print those lines which are not matching values and want to print with N/A then do following.
awk 'FNR==NR{arr[$1]=$2;next} {print $0,(($4 in arr)?arr[$4]:"N/A")}' file2 file1

AWK: Comparing substrings from two files and write to third file

I'm trying to compare two different files, let's say "file1" and "file2", in this way. If the substring of characters i.e 5 characters at position (8 to 12) matches in both files - file1 and file2, then remove that matching row from file 1. Finally, write the output to file3.(output contains the remaining rows which are not matching with file 2)
My output is the non matching rows of file1.
Output (file3) = File1 - File2
File1
-----
aqcdfdf**45555**78782121
axcdfdf**45555**75782321
aecdfdf**75555**78782221
aqcdfdf**95555**78782121
File2
-----
aqcdfdf**45555**78782121
axcdfdf**25555**75782321
File3
-----
aecdfdf**75555**78782221
aqcdfdf**95555**78782121
I tried awk but i need some thing which looks at substring of the two files, since there are no delimiters in my files.
$ awk 'FNR==NR {a[$1]; next} $1 in a' f1 f2 > file3
Could you please try following, written and tested with shown samples in GNU awk. Once happy with results on terminal then redirect output of following command to > file3(append > file3 to following command).
awk '{str=substr($0,8,5)} FNR==NR{a[str];next} !(str in a)' file2 file1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
str=substr($0,8,5) ##Creating str which has sub-string of current line from 8th to 12th character.
}
FNR==NR{ ##Checking condition FNR==NR which will run when Input_file2 is being read.
a[str] ##Creating array a with index of str here.
next ##next will skip all further statements from here.
}
!(str in a) ##Checking condition if str is NOT present in a then print that line from Input_file1.
' file2 file1 ##Mentioning Input_file names here.

Splitting the first column of a file in multiple columns using AWK

File looks like this, but with millions of lines (TAB separated):
1_number_column_ranking_+ 100 200 Target "Hello"
I want to split the first column by the _ so it becomes:
1 number column ranking + 100 200 Target "Hello"
This is the code I have been trying:
awk -F"\t" '{n=split($1,a,"_");for (i=1;i<=n;i++) print $1"\t"a[i]}'
But it's not quite what I need.
Any help is appreciated (the other threads on this topic were not helpful for me).
No need to split, just replace would do:
awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1'
Eg:
$ cat file
1_number_column_ranking_+ 100 200 Target "Hello"
$ awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1' file
1 number column ranking + 100 200 Target "Hello"
gsub will replace all occurances, when no 3rd argument given, it will replace in $0.
Last 1 is a shortcut for {print}. (always true, implied {print}.)
Another awk, if the "_" appears only in the first column.
Split the input field by regex "[_\t]+" and just do a dummy operation like $1=$1 in the main section, so that $0 is reconstructed with OFS="\t"
$ cat steveman.txt
1_number_column_ranking_+ 100 200i Target "Hello"
$ awk -F"[_\t]" ' BEGIN { OFS="\t"} { $1=$1; print } ' steveman.txt
1 number column ranking + 100 200i Target "Hello"
$
Thanks #Ed, updated from -F"[_\t]+" to -F"[_\t]" that will avoid concatenating empty fields.

AWK compare two columns in two seperate files

I would like to compare two files and do something like this: if the 5th column in the first file is equal to the 5th column in the second file, I would like to print the whole line from the first file. Is that possible? I searched for the issue but was unable to find a solution :(
The files are separated by tabulators and I tried something like this:
zcat file1.txt.gz file2.txt.gz | awk -F'\t' 'NR==FNR{a[$5];next}$5 in a {print $0}'
Did anybody tried to do a similar thing? :)
Thanks in advance for help!
Your script is fine, but you need to provide each file individually to awk and in reverse order.
$ cat file1.txt
a b c d 100
x y z w 200
p q r s 300
1 2 3 4 400
$ cat file2.txt
. . . . 200
. . . . 400
$ awk 'NR==FNR{a[$5];next} $5 in a {print $0}' file2.txt file1.txt
x y z w 200
1 2 3 4 400
EDIT:
As pointed out in the comments, the generic solution above can be improved and tailored to OP's situation of starting with compressed tab-separated files:
$ awk -F'\t' 'NR==FNR{a[$5];next} $5 in a' <(zcat file2.txt) <(zcat file1.txt)
x y z w 200
1 2 3 4 400
Explanation:
NR is the number of the current record being processed and FNR is the number
of the current record within its file . Thus NR == FNR is only
true when awk is processing the first file given to it (which in our case is file2.txt).
a[$5] adds the value of the 5th column as an index to the array a. Arrays in awk are associative arrays, but often you don't care about associating a value and just want to make a nice collection of things. This is a
pithy way to make a collection of all the values we've seen in 5th column of the
first file. The next statement, which follows, says to immediately get the next
available record without looking at any anymore statements in the awk program.
Summarizing the above, this line says "If you're reading the first file (file2.txt),
save the value of column 5 in the array called a and move on to the record without
continuing with the rest of the awk program."
NR == FNR { a[$5]; next }
Hopefully it's clear from the above that the only way we can past that first line of
the awk program is if we are reading the second file (file1.txt in our case).
$5 in a evaluates to true if the value of the 5th column occurs as an index in
the a array. In other words, it is true for every record in file1.txt whose 5th
column we saw as a value in the 5th column of file2.txt.
In awk, when the pattern portion evaluates to true, the accompanying action is
invoked. When there's no action given, as below, the default action is triggered
instead, which is to simply print the current record. Thus, by just saying
$5 in a, we are telling awk to print all the records in file1.txt whose 5th
column also occurs in file2.txt, which of course was the given requirement.
$5 in a

Resources