How to print the last column from a specific row? - linux

I've got a file, datafile, which more or less looks like this:
*** some text ***
Results
1 50
2 -75
3 80
*** some text ***
What I'd like to do is:
Find the line that contains the string "Results".
List those three results but only the last column is significant.
Show only the positive ones.
I was trying to solve my problem with an awk command, which for each result looks like this:
res1=$(awk '/Results/{nr[NR+1]}; NR in nr' datafile | awk '{print $NF}')
I hoped to get the first positive result by:
if [ $res1 -gt 0 ]; then
    echo "$res1"
fi
But instead of the expected result I got the error "integer expression expected", so it seems the variable res1 isn't a numeric value. Any idea how to define it properly?

Something like this might work:
$ awk '$0 == "Results" { f = 3; next } f && f-- && $NF > 0 { print $NF }' input
50
80
Basically, the variable f is set to 3 when the line Results is seen.
Then the last column of the following lines is printed as long as f > 0 and $NF > 0 ($NF is the last column).
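If you also need the values back in the shell, as in the original res1 attempt, a minimal sketch (assuming bash 4+ for mapfile, and the file name datafile from the question) would be:

# read only the positive results into a bash array
mapfile -t res < <(awk '$0 == "Results" { f = 3; next } f && f-- && $NF > 0 { print $NF }' datafile)

# ${res[0]} is then the first positive result, if any
if [ "${#res[@]}" -gt 0 ]; then
    echo "first positive result: ${res[0]}"
fi

Since awk already filters out the non-positive values, the -gt test in the shell is no longer needed.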

Related

Awk column value in file 1 is in the range of two columns in file 2

I modified the question based on the comments.
I would like to match two files: if $4 in file 1 is in the range of $3 and $4 in file 2, I would like to print file 1 with $6 in file 2. If there is no match, I would like to print NA in the output. If there are overlapping ranges, I would like to print the first match (sorting based on $4 of file 1).
File 1:
1 rs537182016 0 8674590 A C
1 rs575272151 0 69244805 G C
1 rs544419019 0 69244469 G C
1 rs354682 0 1268900 G C
File 2:
18 16 8674587 8784575 + ABAT
10349 17 69148007 69244815 - ABCA10
23461 17 69244435 69327182 - ABCA5
Output:
1 rs537182016 0 8674590 A C ABAT
1 rs575272151 0 69244805 G C ABCA10
1 rs544419019 0 69244469 G C ABCA10
1 rs354682 0 1268900 G C NA
I tried the following based on previous answers, but it did not work. The output is an empty file.
awk 'FNR == NR { val[$1] = $4 }
FNR != NR { if ($1 in val && val[$1] >= $3 && val[$1] <= $4)
print $1, val[$1], $6
}' file1 file2 > file3
Assumptions:
in the case of multiple matches OP has stated we only use the 'first' match; OP hasn't (yet) defined 'first' so I'm going to assume it means the order in which lines appear in file2 (aka the line number)
One awk idea:
awk '
FNR==NR { min[++n]=$3 # save 1st file values in arrays; use line number as index
max[n]=$4
col6[n]=$6
next
}
{ for (i=1;i<=n;i++) # loop through 1st file entries
if (min[i] <= $4 && $4 <= max[i]) { # if we find a range match then ...
print $0, col6[i] # print current line plus column #6 from 1st file and then ...
next # skip to next line of input; this keeps us from matching on additional entries from 1st file
}
print $0, "NA" # if we got here there was no match so print current line plus "NA"
}
' file2 file1
NOTE: make note of the order of the input files; the first answer (below) was based on an input of file1 file2; this answer requires the order of the input files to be flipped, ie, file2 file1
This generates:
1 rs537182016 0 8674590 A C ABAT
1 rs575272151 0 69244805 G C ABCA10
1 rs544419019 0 69244469 G C ABCA10
1 rs354682 0 1268900 G C NA
NOTE: the following is based on OP's original question and expected output (revision #2); OP has since modified the expected output to such an extent that the following answer is no longer valid ...
Assumptions:
in file1 both rs575272151 / 69244805 and rs544419019 / 69244469 match 2 different (overlapping) ranges from file2 but OP has only shown one set of matches in the expected output; from this I'm going to assume ...
once a match is found for an entry from file1, remove said entry from any additional matching; this will eliminate multiple matches for file1 entries
once a match is found for a line from file2 then stop looking for matches for that line (ie, go to the next input line from file2); this will eliminate multiple matches for file2
OP has not provided any details on how to determine which multi-match to keep so we'll use the first match we find
One awk idea:
awk '
FNR==NR { val[$2]=$4; next }
{ for (i in val) # loop through list of entries from 1st file ...
if ($3 <= val[i] && val[i] <= $4) { # looking for a range match and if found ...
print $0,i # print current line plus 2nd field from 1st file and then ...
delete val[i] # remove 1st file entry from further matches and ...
next # skip to next line of input from 2nd file, ie, stop looking for additional matches for the current line
}
}
' file1 file2
This generates:
18 16 8674587 8784575 + ABAT rs537182016
10349 17 69148007 69244815 - ABCA10 rs575272151
23461 17 69244435 69327182 - ABCA5 rs544419019
NOTES:
the for (i in val) construct is not guaranteed to process the array entries in a consistent order; the net result is that when there are multiple matches we simply match on the 'first' array entry provided by awk; if this 'random' nature of for (i in val) is not acceptable then OP will need to update the question with additional details on how to handle multiple matches
for this particular case we actually generate the same output as expected by OP, but the assignments of rs575272151 and rs544419019 could just as easily be reversed (due to the nature of the for (i in val) construct)
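If 'first' were instead taken to mean file1 line order, a hedged variation of the above (a sketch, not part of the original answer) keeps the file1 keys in an ordered array and walks them in that order:

awk '
FNR==NR { key[++n]=$2; val[$2]=$4; next }      # remember file1 keys in input order
        { for (j=1; j<=n; j++) {               # walk file1 entries in their original order
              i = key[j]
              if (i in val && $3 <= val[i] && val[i] <= $4) {
                  print $0, i                  # print current file2 line plus the matching file1 key
                  delete val[i]                # keep this file1 entry from matching again
                  next                         # stop at the first (line-order) match
              }
          }
        }
' file1 file2

This removes the dependence on awk's array traversal order while keeping the same one-match-per-line behaviour.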

Sum of 2nd and 3rd column for same value in 1st column

I want to sum the values in the 2nd and 3rd columns for the same value in the 1st column:
1555971000 6 1
1555971000 0 2
1555971300 2 0
1555971300 3 0
Output would be like
1555971000 6 3
1555971300 5 0
I have tried the command below:
awk -F" " '{b[$2]+=$1} END { for (i in b) { print b[i],i } } '
but this seems to work for only one column.
Here is another way, reading Input_file 2 times; it will provide output in the same sequence as Input_file.
awk 'FNR==NR{a[$1]+=$2;b[$1]+=$3;next} ($1 in a){print $1,a[$1],b[$1];delete a[$1]}' Input_file Input_file
If the data in 'd' is used as-is without sorting (like values of $1 must already be adjacent, as in the sample), tried on GNU awk:
awk 'BEGIN{f=1} {if($1==a||f){b+=$2;c+=$3;f=0} else{print a,b,c;b=$2;c=$3} a=$1} END{print a,b,c}' d
With sorting first, on GNU awk:
awk '{w[NR]=$0} END{asort(w);f=1;for(;i++<NR;){split(w[i],v);if(v[1]==a||f){f=0;b+=v[2];c+=v[3]} else{print a,b,c;b=v[2];c=v[3];} a=v[1]} print a,b,c;}' d
You can do it with awk by first saving the fields of the first record, and then, for all subsequent records, checking whether the first field matches; if so, add the contents of fields two and three and continue. If the first field fails to match, output the saved first field and the running sums, e.g.
awk 'NR==1 { a=$1; b=$2; c=$3; next }
{
    if ($1 == a) {
        b+=$2; c+=$3;
    }
    else {
        print a, b, c; a=$1; b=$2; c=$3;
    }
} END { print a, b, c; }' file
With your input in file, you can copy and paste the foregoing into your terminal and obtain the following:
Example Use/Output
$ awk 'NR==1 { a=$1; b=$2; c=$3; next }
> {
>     if ($1 == a) {
>         b+=$2; c+=$3;
>     }
>     else {
>         print a, b, c; a=$1; b=$2; c=$3;
>     }
> } END { print a, b, c; }' file
1555971000 6 3
1555971300 5 0
Using awk Arrays
A shorter, more succinct alternative using arrays, which does not require your input to be in sorted order, would be:
awk '{a[$1]+=$2; b[$1]+=$3} END{ for (i in a) print i, a[i], b[i] }' file
(same output)
Using arrays allows the summing of columns for like values of field 1 to work equally well if your data file contained the following lines in random order, e.g.
1555971300 2 0
1555971000 0 2
1555971000 6 1
1555971300 3 0
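For instance, if file held those reordered lines, the same command should produce the same sums, although for (i in a) makes no guarantee about the output order:

$ awk '{a[$1]+=$2; b[$1]+=$3} END{ for (i in a) print i, a[i], b[i] }' file
1555971000 6 3
1555971300 5 0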
Another awk that would work regardless of the order of records, whether or not they are sorted:
awk '{r[$1]++}
r[$1]==1{o[++c]=$1}
{f[$1]+=$2;s[$1]+=$3}
END{for(i=1;i<=c;i++){print o[i],f[o[i]],s[o[i]]}}' file
Assuming when you wrote:
awk -F" " '{b[$2]+=$1} END { for (i in b) { print b[i],i } } '
you meant to write:
awk '{ b[$1]+=$2 } END{ for (i in b) print i,b[i] }'
It shouldn't be a huge leap to figure out:
$ awk '{ b[$1]+=$2; c[$1]+=$3 } END{ for (i in b) print i,b[i],c[i] }' file
1555971000 6 3
1555971300 5 0
Please get the book "Effective Awk Programming", 4th Edition, by Arnold Robbins and just read a paragraph or 2 about fields and arrays.

Splitting the first column of a file in multiple columns using AWK

File looks like this, but with millions of lines (TAB separated):
1_number_column_ranking_+ 100 200 Target "Hello"
I want to split the first column by the _ so it becomes:
1 number column ranking + 100 200 Target "Hello"
This is the code I have been trying:
awk -F"\t" '{n=split($1,a,"_");for (i=1;i<=n;i++) print $1"\t"a[i]}'
But it's not quite what I need.
Any help is appreciated (the other threads on this topic were not helpful for me).
No need to split; just replacing would do:
awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1'
Eg:
$ cat file
1_number_column_ranking_+ 100 200 Target "Hello"
$ awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1' file
1 number column ranking + 100 200 Target "Hello"
gsub will replace all occurrences; when no 3rd argument is given, it will replace in $0.
The last 1 is a shortcut for {print} (always true, with the implied action {print}).
Another awk, if the "_" appears only in the first column.
Split the input line on the regex "[_\t]" and just do a dummy assignment like $1=$1 in the main block, so that $0 is reconstructed with OFS="\t".
$ cat steveman.txt
1_number_column_ranking_+ 100 200i Target "Hello"
$ awk -F"[_\t]" ' BEGIN { OFS="\t"} { $1=$1; print } ' steveman.txt
1 number column ranking + 100 200i Target "Hello"
$
Thanks @Ed; updated from -F"[_\t]+" to -F"[_\t]", which avoids collapsing empty fields.
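As a side note, the $1=$1 assignment works because assigning to any field makes awk rebuild $0 using OFS; a tiny standalone illustration (made-up input, not from the question):

$ echo 'a b c' | awk 'BEGIN{OFS="-"} {$1=$1; print}'
a-b-c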

Linux SHELL script, read each row for different number of columns

I have a file with, for example, these values in it:
1 value1.1 value1.2
2 value2.1
3 value3.1 value3.2 value3.3
I need to read values from it using a shell script, but the number of columns in each row is different!
I know that if, for example, I want to read the second column, I can do it like this (with the row number as an input parameter):
$ awk -v key=1 '$1 == key { print $2 }' input.txt
value1.1
But as I mentioned, the number of columns is different for each row.
How can I make this reading dynamic?
For example:
if the input parameter is 1, I should read the columns from the first row, so the output should be
value1.1 value1.2
if the input parameter is 2, I should read the columns from the second row, so the output should be
value2.1
if the input parameter is 3, I should read the columns from the third row, so the output should be
value3.1 value3.2 value3.3
The point is that the number of columns is not static and I should read the columns from that specific row until the end of the row.
Thank you
If the input parameter is the row number (passed as key), then you can simply say:
awk -v key=1 'NR==key' input.txt
UPDATED
If you want to process the column data, there are several ways.
With awk you can say something like:
awk -v key=3 'NR==key {
for (i=1; i<=NF; i++)
printf "column %d = %s\n", i, $i
}' input.txt
which outputs:
column 1 = 3
column 2 = value3.1
column 3 = value3.2
column 4 = value3.3
In awk you can access each column value by $1, $2, $3 directly or by $i indirectly where variable i holds either of 1, 2, 3.
If you prefer going with bash, try something like:
line=$(awk -v key=3 'NR==key' input.txt)
set -- $line # split into columns
for ((i=1; i<=$#; i++)); do
echo column $i = ${!i}
done
which outputs the same results.
In bash the indirect access is a little bit complex and you need to say ${!i} where i is a variable name.
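If, as in the expected output above, only the value columns are wanted without the leading row number, one possible tweak (a sketch, not part of the original answer) is to start the awk loop at field 2:

$ awk -v key=2 'NR==key { for (i=2; i<=NF; i++) printf "%s%s", $i, (i<NF ? " " : "\n") }' input.txt
value2.1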
Hope this helps.

awk print number of row only in uniq column

I have a data set like this:
1 A
1 B
1 C
2 A
2 B
2 C
3 B
3 C
And I have a script which calculates:
Number of occurrences of the search string
Number of rows
awk -v search="A" \
'BEGIN{count=0} $2 == search {count++} END{print count "\n" NR}' input
That works perfectly fine.
I would like to add to my awk one-liner the number of unique values in the first column.
So the output should be separated by \n:
2
8
3
I can do this in separate awk code, but I am not able to integrate it into my original awk code.
awk '{a[$1]++}END{for(i in a){print i}}' input | wc -l
Any idea how to integrate it into one awk solution without piping?
Looks like you want this:
awk -v search="A" '{a[$1]++}
$2 == search {count++}
END{OFS="\n";print count+0, NR, length(a)}' file
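With the sample data and search="A", this should print (note that length() on an array needs GNU awk or another awk that supports it):

2
8
3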
