how to conditionally replace values in columns with value of specific column in the same line by Unix and awk commands - linux

I want to conditionally replace values in columns with value of specific column in the same line in one file, by Unix and awk commands.
For example, I have myfile.txt (3 lines, 5 columns, tab-delimited):
1 A . C .
2 C T . T
3 T C C .
There are "." in columns 3 to 5. I want to replace those "." in columns 3 - 5 with the value in column 2 on the same line.
Could you please show me any directions on that?

This seems to do what you're asking for:
% awk 'BEGIN {
FS = OFS = "\t"
}
{
for (column = 3; column <= NF; ++column) {
if ($column == ".") {
$column = $2
}
}
print
}
' test.tsv
1 A A C A
2 C T C T
3 T C C T
You've asked a few questions (and accepted no answers!) on awk now. May
I humbly suggest a tutorial?

awk 'BEGIN{FS=OFS="\t"} {for(i=3;i<=5;i++) if($i==".") $i=$2; print}' myfile.txt
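Where the separators are assigned matters in commands like these. awk splits a record into fields before the main block runs, so an FS assigned there only takes effect from the next record on, and unless OFS is also set, any line that awk rebuilds after a field assignment is joined with single spaces instead of tabs. A minimal sketch, assuming the three sample lines are tab-delimited; the temporary file and the result variable are only there for illustration:

```shell
# Recreate the tab-delimited sample from the question in a temporary file.
sample=$(mktemp)
printf '1\tA\t.\tC\t.\n2\tC\tT\t.\tT\n3\tT\tC\tC\t.\n' > "$sample"

# Set both separators in BEGIN so they apply from the first record on,
# then copy column 2 into every "." found in columns 3 to NF.
result=$(awk 'BEGIN { FS = OFS = "\t" }
{
    for (i = 3; i <= NF; i++)
        if ($i == ".") $i = $2
    print
}' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

The printed lines should still be tab-delimited, which is easy to confirm by piping the output through od -c or cat -A where available.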

Related

Awk column value in file 1 is in the range of two columns in file 2

I modified the question based on the comments.
I would like to match two files: if $4 in file 1 is in the range of $3 and $4 in file 2, I would like to print file 1 with $6 in file 2. If there is no match, I would like to print NA in the output. If there are overlapping ranges, I would like to print the first match (sorting based on $4 of file 1).
File 1:
1 rs537182016 0 8674590 A C
1 rs575272151 0 69244805 G C
1 rs544419019 0 69244469 G C
1 rs354682 0 1268900 G C
File 2:
18 16 8674587 8784575 + ABAT
10349 17 69148007 69244815 - ABCA10
23461 17 69244435 69327182 - ABCA5
Output:
1 rs537182016 0 8674590 A C ABAT
1 rs575272151 0 69244805 G C ABCA10
1 rs544419019 0 69244469 G C ABCA10
1 rs354682 0 1268900 G C NA
I tried the following based on previous answers, but it did not work. The output is an empty file.
awk 'FNR == NR { val[$1] = $4 }
FNR != NR { if ($1 in val && val[$1] >= $3 && val[$1] <= $4)
print $1, val[$1], $6
}' file1 file2 > file3
Assumptions:
in the case of multiple matches OP has stated we only use the 'first' match; OP hasn't (yet) defined 'first' so I'm going to assume it means the order in which lines appear in file2 (aka the line number)
One awk idea:
awk '
FNR==NR { min[++n]=$3 # save 1st file values in arrays; use line number as index
max[n]=$4
col6[n]=$6
next
}
{ for (i=1;i<=n;i++) # loop through 1st file entries
if (min[i] <= $4 && $4 <= max[i]) { # if we find a range match then ...
print $0, col6[i] # print current line plus column #6 from 1st file and then ...
next # skip to next line of input; this keeps us from matching on additional entries from 1st file
}
print $0, "NA" # if we got here there was no match so print current line plus "NA"
}
' file2 file1
NOTE: make note of the order of the input files; the first answer (below) was based on an input of file1 file2; this answer requires the order of the input files to be flipped, ie, file2 file1
This generates:
1 rs537182016 0 8674590 A C ABAT
1 rs575272151 0 69244805 G C ABCA10
1 rs544419019 0 69244469 G C ABCA10
1 rs354682 0 1268900 G C NA
NOTE: following is based on OP's original question and expected output (revision #2); OP has since modified the expected output to such an extent that the following answer is no longer valid ...
Assumptions:
in file1 both rs575272151 / 69244805 and rs544419019 / 69244469 match 2 different (overlapping) ranges from file2 but OP has only shown one set of matches in the expected output; from this I'm going to assume ...
once a match is found for an entry from file1, remove said entry from any additional matching; this will eliminate multiple matches for file1 entries
once a match is found for a line from file2 then stop looking for matches for that line (ie, go to the next input line from file2); this will eliminate multiple matches for file2
OP has not provided any details on how to determine which multi-match to keep so we'll use the first match we find
One awk idea:
awk '
FNR==NR { val[$2]=$4; next }
{ for (i in val) # loop through list of entries from 1st file ...
if ($3 <= val[i] && val[i] <= $4) { # looking for a range match and if found ...
print $0,i # print current line plus 2nd field from 1st file and then ...
delete val[i] # remove 1st file entry from further matches and ...
next # skip to next line of input from 2nd file, ie, stop looking for additional matches for the current line
}
}
' file1 file2
This generates:
18 16 8674587 8784575 + ABAT rs537182016
10349 17 69148007 69244815 - ABCA10 rs575272151
23461 17 69244435 69327182 - ABCA5 rs544419019
NOTES:
the for (i in val) construct is not guaranteed to process the array entries in a consistent manner; net result is that in the instance where there are multiple matches we simply match on the 'first' array entry provided by awk; if this 'random' nature of the for (i in val) is not acceptable then OP will need to update the question with additional details on how to handle multiple matches
for this particular case we actually generate the same output as expected by OP, but the assignments of rs575272151 and rs544419019 could just as easily be reversed (due to the nature of the for (i in val) construct)
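If the arbitrary order of for (i in val) is a concern, one option is to index the file1 entries by line number and scan them numerically, so that the earliest unused file1 line always wins. A minimal sketch under that assumption; the array names id, pos and used are made up for illustration, and the sample data is recreated in temporary files:

```shell
# Recreate the sample inputs from the question (whitespace-delimited).
f1=$(mktemp); f2=$(mktemp)
cat > "$f1" <<'EOF'
1 rs537182016 0 8674590 A C
1 rs575272151 0 69244805 G C
1 rs544419019 0 69244469 G C
1 rs354682 0 1268900 G C
EOF
cat > "$f2" <<'EOF'
18 16 8674587 8784575 + ABAT
10349 17 69148007 69244815 - ABCA10
23461 17 69244435 69327182 - ABCA5
EOF

result=$(awk '
FNR == NR { id[++n] = $2; pos[n] = $4; next }   # file1: keep IDs and positions in input order
{
    for (i = 1; i <= n; i++)                    # file2: scan file1 entries in line order
        if (!(i in used) && $3 <= pos[i] && pos[i] <= $4) {
            print $0, id[i]
            used[i] = 1                         # each file1 entry is matched at most once
            next                                # at most one match per file2 line
        }
}' "$f1" "$f2")

printf '%s\n' "$result"
rm -f "$f1" "$f2"
```

With the sample data this gives the same three lines as above, but the pairing no longer depends on how awk happens to walk the array.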

Sum of 2nd and 3rd column for same value in 1st column

I want to sum the value in column 2nd and 3rd column for same value in 1st column
1555971000 6 1
1555971000 0 2
1555971300 2 0
1555971300 3 0
Output would be like
1555971000 6 3
1555971300 5 0
I have tried below command
awk -F" " '{b[$2]+=$1} END { for (i in b) { print b[i],i } } '
but this seems to be for only one column.
Here is another way with reading Input_file 2 times and it will provide output in same sequence as Input_file's sequence.
awk 'FNR==NR{a[$1]+=$2;b[$1]+=$3;next} ($1 in a){print $1,a[$1],b[$1];delete a[$1]}' Input_file Input_file
If the data is in a file d and rows with the same key are already adjacent, no sort is needed (tried on GNU awk):
awk 'BEGIN{f=1} {if($1==a||f){b+=$2;c+=$3;f=0} else{print a,b,c;b=$2;c=$3} a=$1} END{print a,b,c}' d
With a sort, using GNU awk (asort is a gawk extension):
awk '{w[NR]=$0} END{asort(w);f=1;for(;i++<NR;){split(w[i],v);if(v[1]==a||f){f=0;b+=v[2];c+=v[3]} else{print a,b,c;b=v[2];c=v[3];} a=v[1]} print a,b,c;}' d
You can do it with awk by tracking the current key and its running sums: while the first field matches the saved key, add the contents of fields two and three to the sums; when it changes, output the saved key with its running sums and start over from the new record. This assumes records with the same first field are adjacent, and the print is skipped on the first record, where nothing has been saved yet, e.g.
awk '{
if ($1 == a) {
b+=$2; c+=$3;
}
else {
if (NR > 1) print a, b, c;
a=$1; b=$2; c=$3;
}
} END { print a, b, c; }' file
With your input in file, you can copy and paste the foregoing into your terminal and obtain the following:
Example Use/Output
$ awk '{
> if ($1 == a) {
> b+=$2; c+=$3;
> }
> else {
> if (NR > 1) print a, b, c;
> a=$1; b=$2; c=$3;
> }
> } END { print a, b, c; }' file
1555971000 6 3
1555971300 5 0
Using awk Arrays
A shorter alternative using arrays, which does not require your input to be in sorted order, would be:
awk '{a[$1]+=$2; b[$1]+=$3} END{ for (i in a) print i, a[i], b[i] }' file
(same output)
Using arrays, summing the columns per distinct first field works equally well if your data file contains the lines in random order, e.g.
1555971300 2 0
1555971000 0 2
1555971000 6 1
1555971300 3 0
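One caveat with the array version: the order in which for (i in a) visits the keys is unspecified, so the output lines may come out in any order. A small sketch, assuming numeric ordering by the first column is what is wanted, pipes the result through sort; the temporary file stands in for file:

```shell
# Recreate the shuffled sample in a temporary file.
data=$(mktemp)
printf '1555971300 2 0\n1555971000 0 2\n1555971000 6 1\n1555971300 3 0\n' > "$data"

# Sum columns 2 and 3 per key, then sort numerically on the key,
# because awk does not guarantee any order for "for (i in a)".
result=$(awk '{ a[$1] += $2; b[$1] += $3 }
END { for (i in a) print i, a[i], b[i] }' "$data" | sort -k1,1n)

printf '%s\n' "$result"
rm -f "$data"
```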
Another awk that works whether or not the records are sorted, and prints the keys in order of first appearance:
awk '{r[$1]++}
r[$1]==1{o[++c]=$1}
{f[$1]+=$2;s[$1]+=$3}
END{for(i=1;i<=c;i++){print o[i],f[o[i]],s[o[i]]}}' file
Assuming when you wrote:
awk -F" " '{b[$2]+=$1} END { for (i in b) { print b[i],i } } '
you meant to write:
awk '{ b[$1]+=$2 } END{ for (i in b) print i,b[i] }'
It shouldn't be a huge leap to figure out:
$ awk '{ b[$1]+=$2; c[$1]+=$3 } END{ for (i in b) print i,b[i],c[i] }' file
1555971000 6 3
1555971300 5 0
Please get the book "Effective Awk Programming", 4th Edition, by Arnold Robbins and just read a paragraph or 2 about fields and arrays.

Extract value based on column header from Comma separated file using bash

I want to extract the first value from a CSV for a specific column name using bash. For example, I want to extract the first value of column "bb". Columns can be in any order.
aa,bb,cc
1,2,3
4,5,6
The output should be 2.
Awk solution:
awk -F',' 'NR == 1{ for(i=1; i<=NF; i++) if ($i == "bb") { pos = i; break } }
NR == 2{ print $pos; exit }' file.csv
The output:
2
Alternatively, using csvkit, selecting the column by name so that its position does not matter:
csvcut -c bb file.csv | awk 'NR==2'
Output :
2
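The header lookup can also be parameterized so the column name is not hard-coded in the awk program. A minimal sketch, assuming a simple CSV with no quoted fields or embedded commas (a plain -F',' split does not handle those); col is a hypothetical variable name chosen for illustration:

```shell
# Recreate the sample CSV from the question.
csv=$(mktemp)
printf 'aa,bb,cc\n1,2,3\n4,5,6\n' > "$csv"

# col holds the header to look up; pos stays unset if it is not found.
result=$(awk -F',' -v col="bb" '
NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) pos = i; next }
pos     { print $pos; exit }      # first data row; skipped when the header is missing
' "$csv")

printf '%s\n' "$result"
rm -f "$csv"
```

If the requested header is missing, nothing is printed, which a calling script can test for.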

Swap column with condition in linux

I have 4 columns in my data file. Column 1 is the chr number, the second column is the start site and the third is the end site. The fourth column is the strand, + or -. If the 4th column is the negative strand I want to swap column 3 with column 2, but for the + strand I want no change.
Chr1 94847 3737474 +
Chr1 27374 3948475 +
Chr1 93947 9283736 -
So the first two rows are good, but for the third row I want to swap column 2 with column 3, as the strand in the 4th column is -.
I tried this code but the system generates an error with & operand...
cat hg-19_promoter_knownGene.filtered.bed | awk 'BEGIN {OFS="\t"} { if ($4 == "+") {print $1,$2,$3} & if ($4 == "-") { print $1,$3,$2 }}' > hg-19_promoter_knownGene.filter3.bed
To start with, change the "&" to ";".
This should work:
awk 'BEGIN {OFS="\t"} { if ($4 == "+") {print $1,$2,$3} ; if ($4 == "-") { print $1,$3,$2 }}'
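Note that the command above prints only three columns, so the strand is dropped from the output. If the strand should be kept, one option is to swap the two fields in place and print the whole record. A minimal sketch, assuming the BED file is tab-delimited; tmp is just a scratch variable and the temporary file stands in for the real input:

```shell
# Recreate the three sample rows as a tab-delimited file.
bed=$(mktemp)
printf 'Chr1\t94847\t3737474\t+\nChr1\t27374\t3948475\t+\nChr1\t93947\t9283736\t-\n' > "$bed"

# On "-" rows exchange fields 2 and 3 via a scratch variable;
# the trailing 1 prints every row, changed or not.
result=$(awk 'BEGIN { FS = OFS = "\t" }
$4 == "-" { tmp = $2; $2 = $3; $3 = tmp }
1' "$bed")

printf '%s\n' "$result"
rm -f "$bed"
```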

AWK write to new column base on if else of other column

I have a CSV file with columns A,B,C,D. Column D contains values on a scale of 0 to 1. I want to use AWK to write to a new column E based on the values in column D.
For example:
if value in column D <0.7, value in column E = 0.
if value in column D>=0.7, value in column E =1.
I am able to print the output of column E but not sure how to write it to a new column. It's possible to write the output of my code to a new file and then paste it back to the old file, but I was wondering if there was a more efficient way. Here is my code:
awk -F"," 'NR>1 {if ($3>=0.7) $4= "1"; else if ($3<0.7) $4= "0"; print $4;}' test_file.csv
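The awk command below should give you the intended output; it labels the new column on the header line and reads the file directly instead of going through cat:
awk -F "," 'NR==1{print $0",E";next} {if($4>=0.7)print $0",1";else print $0",0"}' yourfile.csv > test_file.csv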
You can use:
awk -F, 'NR>1 {$0 = $0 FS (($4 >= 0.7) ? 1 : 0)} 1' test_file.csv
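A variant that also names the new column in the header can be sketched as follows, assuming the first line is a header and column D is the fourth field. The data rows here are made up for illustration, and the output should go to a different file than the input, since redirecting onto the file being read would truncate it:

```shell
# Hypothetical sample with a header row and three data rows.
csv=$(mktemp)
printf 'A,B,C,D\nx,y,z,0.65\nx,y,z,0.7\nx,y,z,0.92\n' > "$csv"

result=$(awk 'BEGIN { FS = OFS = "," }
NR == 1 { print $0, "E"; next }            # label the new column in the header
        { print $0, ($4 >= 0.7 ? 1 : 0) }  # 1 when D is at least 0.7, else 0
' "$csv")

printf '%s\n' "$result"
rm -f "$csv"
```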
