Using BASH to annotate Intervals - linux

I have a very large text file with intervals of 500 (let's call it the main file). It looks something like this:
Line1 0 500
Line1 500 1000
Line1 1000 1500
I have a second file that has different annotations at various intervals (let's call it the secondary file):
Annotation1 379 498
Annotation2 1002 1048
....
I want to create a third file that annotates the main file with the secondary file, to look something like this:
Line1 0 500 Annotation1
Line1 500 1000 NA
Line1 1000 1500 Annotation2
Where annotations overlap, I would prefer that the first annotation that fits the interval be used.
Any help would be greatly appreciated!

awk 'NR==FNR{a[$2]=$1;next}{for(i in a)if(i-$2>=0 && $3-i>0)$0=$0 OFS a[i]}1' 2.txt 1.txt
Brief explanation,
NR==FNR{a[$2]=$1;next}: in 2.txt, record $2 as the key and $1 as the value in array a
For each record in 1.txt, scan array a to see if any key falls within the record's range.
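For example, running it on the sample data (assuming 1.txt holds the main intervals and 2.txt the annotations from the question):
$ awk 'NR==FNR{a[$2]=$1;next}{for(i in a)if(i-$2>=0 && $3-i>0)$0=$0 OFS a[i]}1' 2.txt 1.txt
Line1 0 500 Annotation1
Line1 500 1000
Line1 1000 1500 Annotation2
Note that this variant leaves unannotated lines unchanged rather than appending NA.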

Using awk:
$ awk 'NR==FNR{
    min[$1]=$2
    max[$1]=$3
    next
}
{
    for(i in min){
        if($2<=min[i] && $3>=max[i]){
            print $0,i
            next
        }
    }
    print $0,"NA"
}' file2 file1
Line1 0 500 Annotation1
Line1 500 1000 NA
Line1 1000 1500 Annotation2
The first block statement stores the minimum and maximum values from the second file in the arrays min and max, keyed by the annotation name.
The second block statement loops through the arrays to find an annotation whose range is covered by the min and max values of the current line. If no range matches, the string NA is printed instead.
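One caveat: for (i in min) visits keys in an unspecified order, so the "first annotation that fits" preference from the question is not guaranteed. A sketch that checks annotations in the order they appear in the secondary file (the arrays name, lo and hi are illustrative names):
$ awk 'NR==FNR{name[++n]=$1; lo[n]=$2; hi[n]=$3; next}
{
    for(j=1; j<=n; j++){
        if($2<=lo[j] && $3>=hi[j]){ print $0, name[j]; next }
    }
    print $0, "NA"
}' file2 file1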

Related

How do I use grep to get numbers larger than 50 from a txt file

I am relatively new to grep and unix. I am trying to get the names of people who have won more than 50 races from a txt file. So far I have used cat file.txt | grep -E "[5-9][0-9]$", but this is only giving me numbers from 50-99. How could I get it from 50 to 200? Thank you!!
driver races wins
Some_Man 90 160
Some_Man 10 80
The above is similar to the format of the data, although it is not tabulated.
Do you have to use grep? You could use awk like this (replace 3 with the field number that actually holds the wins, and $1 with the name field):
awk '{if($3>50) print $1}' < file.txt
This assumes your fields are delimited by spaces; otherwise use the -F flag to specify the delimiter.
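For example, against the sample data above (NR>1 skips the header row; this assumes the wins are in the third field):
$ awk 'NR>1 && $3>50 {print $1}' file.txt
Some_Man
Some_Man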
If you must use grep, then it is a regular expression like the one you wrote. Note that your pattern stops at 99, and extending it to 1[0-9][0-9] only reaches 199, so 200 needs its own alternative:
grep -E '\b([5-9][0-9]|1[0-9][0-9]|200)$' file.txt
Input:
Rank Country Driver Races Wins
1 [United_Kingdom] Lewis_Hamilton 264 94
2 [Germany] Sebastian_Vettel 254 53
3 [Spain] Fernando_Alonso 311 32
4 [Finland] Kimi_Raikkonen 326 21
5 [Germany] Nico_Rosberg 200 23
Awk would be a better candidate for this:
awk '$4>=50 && $4<=200 { print $0 }' file
Check whether the fourth space-delimited field ($4; change this to whatever field number it actually is) is both greater than or equal to 50 and less than or equal to 200, and print the line ($0) if the condition is met.
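For example, applying the same range test to the Wins column ($5) of the input above:
$ awk '$5>=50 && $5<=200' file
1 [United_Kingdom] Lewis_Hamilton 264 94
2 [Germany] Sebastian_Vettel 254 53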

Splitting the first column of a file into multiple columns using AWK

File looks like this, but with millions of lines (TAB separated):
1_number_column_ranking_+ 100 200 Target "Hello"
I want to split the first column by the _ so it becomes:
1 number column ranking + 100 200 Target "Hello"
This is the code I have been trying:
awk -F"\t" '{n=split($1,a,"_");for (i=1;i<=n;i++) print $1"\t"a[i]}'
But it's not quite what I need.
Any help is appreciated (the other threads on this topic were not helpful for me).
No need to split, just replace would do:
awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1'
Eg:
$ cat file
1_number_column_ranking_+ 100 200 Target "Hello"
$ awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1' file
1 number column ranking + 100 200 Target "Hello"
gsub replaces all occurrences; when no 3rd argument is given, it replaces in $0.
The final 1 is a shortcut for {print}: it is always true, and the default action is {print}.
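A minimal illustration of the third argument:
$ echo 'a_b c_d' | awk '{gsub("_","-")}1'
a-b c-d
$ echo 'a_b c_d' | awk '{gsub("_","-",$1)}1'
a-b c_d
Without the third argument, gsub works on $0 and every underscore is replaced; with $1 as the target, only the first field is touched.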
Another awk, if the "_" appears only in the first column.
Split the input fields on the regex "[_\t]" and just do a dummy operation like $1=$1 in the main section, so that $0 is reconstructed with OFS="\t".
$ cat steveman.txt
1_number_column_ranking_+ 100 200i Target "Hello"
$ awk -F"[_\t]" ' BEGIN { OFS="\t"} { $1=$1; print } ' steveman.txt
1 number column ranking + 100 200i Target "Hello"
$
Thanks @Ed: updated from -F"[_\t]+" to -F"[_\t]", which avoids collapsing consecutive separators and thus losing empty fields.
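The difference shows up when the input contains an empty (tab-tab) field; here the doubled tab after x would be swallowed by the + version (GNU cat -A makes tabs visible as ^I):
$ printf 'a_b\tx\t\ty\n' | awk -F"[_\t]+" 'BEGIN{OFS="\t"}{$1=$1; print}' | cat -A
a^Ib^Ix^Iy$
$ printf 'a_b\tx\t\ty\n' | awk -F"[_\t]" 'BEGIN{OFS="\t"}{$1=$1; print}' | cat -A
a^Ib^Ix^I^Iy$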

AWK compare two columns in two separate files

I would like to compare two files and do something like this: if the 5th column in the first file is equal to the 5th column in the second file, I would like to print the whole line from the first file. Is that possible? I searched for the issue but was unable to find a solution :(
The files are separated by tabulators and I tried something like this:
zcat file1.txt.gz file2.txt.gz | awk -F'\t' 'NR==FNR{a[$5];next}$5 in a {print $0}'
Has anybody tried to do a similar thing? :)
Thanks in advance for help!
Your script is fine, but you need to provide each file individually to awk and in reverse order.
$ cat file1.txt
a b c d 100
x y z w 200
p q r s 300
1 2 3 4 400
$ cat file2.txt
. . . . 200
. . . . 400
$ awk 'NR==FNR{a[$5];next} $5 in a {print $0}' file2.txt file1.txt
x y z w 200
1 2 3 4 400
EDIT:
As pointed out in the comments, the generic solution above can be improved and tailored to OP's situation of starting with compressed tab-separated files:
$ awk -F'\t' 'NR==FNR{a[$5];next} $5 in a' <(zcat file2.txt.gz) <(zcat file1.txt.gz)
x y z w 200
1 2 3 4 400
Explanation:
NR is the number of the current record being processed and FNR is the number of the current record within its file. Thus NR == FNR is only true while awk is processing the first file given to it (which in our case is file2.txt).
a[$5] adds the value of the 5th column as an index to the array a. Arrays in awk are associative arrays, but often you don't care about associating a value and just want to build a collection of things. This is a pithy way to collect all the values seen in the 5th column of the first file. The next statement that follows says to immediately fetch the next record without looking at any more statements in the awk program.
Summarizing the above, this line says "If you're reading the first file (file2.txt), save the value of column 5 in the array called a and move on to the next record without continuing with the rest of the awk program."
NR == FNR { a[$5]; next }
Hopefully it's clear from the above that the only way we can get past that first line of the awk program is if we are reading the second file (file1.txt in our case).
$5 in a evaluates to true if the value of the 5th column occurs as an index in
the a array. In other words, it is true for every record in file1.txt whose 5th
column we saw as a value in the 5th column of file2.txt.
In awk, when the pattern portion evaluates to true, the accompanying action is
invoked. When there's no action given, as below, the default action is triggered
instead, which is to simply print the current record. Thus, by just saying
$5 in a, we are telling awk to print all the records in file1.txt whose 5th
column also occurs in file2.txt, which of course was the given requirement.
$5 in a
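As a quick check that in is a pure membership test (unlike referencing a[$5] in an expression such as if (a[$5]), it does not create the key):
$ awk 'BEGIN{ a["x"]; print ("x" in a); print ("y" in a) }'
1
0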

Awk statement generating extra output line; skip blank input lines

jon,doe 5 5
sam,smith 10 5
I am required to calculate averages for rows and columns. Basically, the input file contains a name, score1, and score2, and I am required to read the contents from the file and then calculate the average row-wise and column-wise. I am getting the desired result, but there is one extra '0' caused by white space; I would appreciate it if someone could help.
awk 'BEGIN {print "name\tScore1\tScore2\tAverage"} {s+=$2} {k+=$3} {print $1,"\t",$2,"\t",$3,"\t",($2+$3)/2} END {print "Average", s/2,k/2}' input.txt
This is the output that I am getting:
name Score1 Score2 Average
jon,doe 5 5 5
sam,smith 10 5 7.5
0
Average 7.5 5
It looks like you have an extra empty or blank (all-whitespace) line in your input file.
Adding NF==0 {next} as the first pattern-action pair will skip all empty or blank lines and give the desired result.
NF==0 only matches if no fields (data) were found in the input line.
The next statement skips remaining statements for the current input line (record) and continues processing on the next line (record).
awk 'BEGIN {print "name\tScore1\tScore2\tAverage"} NF==0 {next} {s+=$2} {k+=$3} {print $1,"\t",$2,"\t",$3,"\t",($2+$3)/2} END {print "Average", s/2,k/2}' input.txt
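With the blank line skipped, the same input now produces the desired output:
name Score1 Score2 Average
jon,doe 5 5 5
sam,smith 10 5 7.5
Average 7.5 5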

Find the first line having a variable value bigger than a specific number

I have a huge text file and I want to know how I can find the first line in which the value of a variable is bigger than 1000.
Assume that the variable and its value have only one space in between, like this:
abcd 24
Find the first occurrence of abcd greater than 1000 and print the line number and matching line and quit:
$ awk '$1=="abcd" && $2>1000{print NR, $0; exit}' file
To find any variable greater than 1000 just drop the first condition:
$ awk '$2>1000{print NR, $0; exit}' file
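For example, with an illustrative file where the first value over 1000 appears on the second line:
$ cat file
abcd 24
efgh 2000
abcd 1500
$ awk '$2>1000{print NR, $0; exit}' file
2 efgh 2000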
