Swap columns with a condition in Linux

I have 4 columns in my data file. Column 1 is the chr number, the second column is the start site, and the third is the end site. The fourth column is the strand, + or -. If the 4th column is the negative strand, I want to swap column 3 with column 2; for the + strand I want no change.
Chr1 94847 3737474 +
Chr1 27374 3948475 +
Chr1 93947 9283736 -
So the first two rows are good, but for the third row I want to swap column 2 with column 3, as the strand in the 4th column is -.
I tried this code, but the system generates an error at the & operator...
cat hg-19_promoter_knownGene.filtered.bed | awk 'BEGIN {OFS="\t"} { if ($4 == "+") {print $1,$2,$3} & if ($4 == "-") { print $1,$3,$2 }}' > hg-19_promoter_knownGene.filter3.bed

To start with, change the "&" to ";".
This should work:
awk 'BEGIN {OFS="\t"} { if ($4 == "+") {print $1,$2,$3} ; if ($4 == "-") { print $1,$3,$2 }}'
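If you want to keep the strand column too, a minimal sketch (assuming whitespace-separated input on stdin, with two hypothetical sample rows) is to swap the two fields in place and print all four:

```shell
# Swap fields 2 and 3 only on "-" strand lines, then print all four fields.
printf 'Chr1 94847 3737474 +\nChr1 93947 9283736 -\n' |
awk 'BEGIN{OFS="\t"} $4=="-"{t=$2; $2=$3; $3=t} {print $1,$2,$3,$4}'
```

The action `$4=="-"{...}` runs only on minus-strand lines; every line is then printed tab-separated, so plus-strand rows pass through unchanged.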

Related

Sum of 2nd and 3rd column for same value in 1st column

I want to sum the values in the 2nd and 3rd columns for rows with the same value in the 1st column:
1555971000 6 1
1555971000 0 2
1555971300 2 0
1555971300 3 0
Output would be like
1555971000 6 3
1555971300 5 0
I have tried below command
awk -F" " '{b[$2]+=$1} END { for (i in b) { print b[i],i } } '
but this seems to handle only one column.
Here is another way: read the Input_file twice, and the output comes out in the same order as the Input_file:
awk 'FNR==NR{a[$1]+=$2;b[$1]+=$3;next} ($1 in a){print $1,a[$1],b[$1];delete a[$1]}' Input_file Input_file
If the data in 'd' is grouped but not sorted (tried on GNU awk):
awk 'BEGIN{f=1} {if($1==a||f){b+=$2;c+=$3;f=0} else{print a,b,c;b=$2;c=$3} a=$1} END{print a,b,c}' d
With sorting done by GNU awk itself:
awk '{w[NR]=$0} END{asort(w);f=1;for(;i++<NR;){split(w[i],v);if(v[1]==a||f){f=0;b+=v[2];c+=v[3]} else{print a,b,c;b=v[2];c=v[3];} a=v[1]} print a,b,c;}' d
You can do it with awk by first saving the fields of the first record and then, for each subsequent record, checking whether the first field matches; if so, add fields two and three to the running sums and continue. When the first field fails to match, output the saved first field and the running sums, e.g.
awk 'NR==1 { a=$1; b=$2; c=$3; next }
{
if ($1 == a) {
b+=$2; c+=$3;
}
else {
print a, b, c; a=$1; b=$2; c=$3;
}
} END { print a, b, c; }' file
With your input in file, you can copy and paste the foregoing into your terminal and obtain the following:
Example Use/Output
$ awk 'NR==1 { a=$1; b=$2; c=$3; next }
> {
> if ($1 == a) {
> b+=$2; c+=$3;
> }
> else {
> print a, b, c; a=$1; b=$2; c=$3;
> }
> } END { print a, b, c; }' file
1555971000 6 3
1555971300 5 0
Using awk Arrays
A shorter more succinct alternative using arrays that does not require your input to be in sorted order would be:
awk '{a[$1]+=$2; b[$1]+=$3} END{ for (i in a) print i, a[i], b[i] }' file
(same output)
Using arrays allows the summing of columns for like field1 to work equally well if your data file contained the following lines in random order, e.g.
1555971300 2 0
1555971000 0 2
1555971000 6 1
1555971300 3 0
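As a quick check, a sketch feeding those shuffled lines on stdin (the output is piped to sort, since for-in traversal order is unspecified):

```shell
# Sum columns 2 and 3 per distinct key in column 1, regardless of input order.
printf '1555971300 2 0\n1555971000 0 2\n1555971000 6 1\n1555971300 3 0\n' |
awk '{a[$1]+=$2; b[$1]+=$3} END{ for (i in a) print i, a[i], b[i] }' |
sort
```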
Another awk that works regardless of the order of the records, sorted or not:
awk '{r[$1]++}
r[$1]==1{o[++c]=$1}
{f[$1]+=$2;s[$1]+=$3}
END{for(i=1;i<=c;i++){print o[i],f[o[i]],s[o[i]]}}' file
Assuming when you wrote:
awk -F" " '{b[$2]+=$1} END { for (i in b) { print b[i],i } } '
you meant to write:
awk '{ b[$1]+=$2 } END{ for (i in b) print i,b[i] }'
It shouldn't be a huge leap to figure out:
$ awk '{ b[$1]+=$2; c[$1]+=$3 } END{ for (i in b) print i,b[i],c[i] }' file
1555971000 6 3
1555971300 5 0
Please get the book "Effective Awk Programming", 4th Edition, by Arnold Robbins and just read a paragraph or 2 about fields and arrays.

Extract part of Column Awk

I am trying to count the number of occurrences per second in a log file for a searched term. I've been using AWK and have the issue of the time stamp being located in a column with additional information. Is it possible to get the number of occurrences per second by only looking at the time pattern 00:00:00 - 24:00:00?
Data example:
[01/May/2018:23:59:59.532
[01/May/2018:23:59:59.848
[01/May/2018:23:59:59.851
[01/May/2018:23:59:59.911
[01/May/2018:23:59:59.923
[01/May/2018:23:59:59.986
[01/May/2018:23:59:59.988
[01/May/2018:23:59:59.756
[01/May/2018:23:59:59.786
[01/May/2018:23:59:59.883
So far I can extract the data easily enough using:
awk '/00:00:00/,/24:00:00/{if(/search_term/) a[$4]++} END{for(k in a) print k " - " a[k]}' file.log |sort
This will return:
[02/May/2018:10:40:05.903 - 1
[02/May/2018:10:40:05.949 - 1
[02/May/2018:10:40:05.975 - 1
[02/May/2018:10:40:05.982 - 2
[02/May/2018:10:40:06.022 - 1
[02/May/2018:10:40:06.051 - 1
[02/May/2018:10:40:06.054 - 1
[02/May/2018:10:40:06.086 - 1
[02/May/2018:10:40:06.094 - 1
[02/May/2018:10:40:06.126 - 1
What I'm aiming for is more:
10:40:05 - 5
10:40:06 - 6
No idea if I'm even thinking about this correctly. New to AWK in general.
Use colon and dot as the field separators; then the hours are in column 2, minutes in column 3, and seconds in column 4:
awk -F'[:.]' '
{count[$2 ":" $3 ":" $4]++}
END {for (time in count) print time " - " count[time]}
' file
10:40:05 - 4
10:40:06 - 6
Output will not necessarily be sorted. If you're using GNU awk, use
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (time in count)
print time " - " count[time]
}
or simply pipe the output to sort.
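Putting it together, a self-contained sketch with a few of the sample timestamps fed on stdin and the result piped to sort:

```shell
# Count log lines per second by splitting on ":" and "." and
# rebuilding the hh:mm:ss key from fields 2-4.
printf '[02/May/2018:10:40:05.903\n[02/May/2018:10:40:05.949\n[02/May/2018:10:40:06.022\n' |
awk -F'[:.]' '{count[$2 ":" $3 ":" $4]++}
  END {for (time in count) print time " - " count[time]}' |
sort
```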
One thing you can do is this:
awk 'BEGIN{FIELDWIDTHS = "1 11 1 12"} {print $4}' datetimes
Specify the field widths and $4 will give you the time, for example. If you don't care about milliseconds, use "1 11 1 8 4". (Note that FIELDWIDTHS is GNU awk-specific.)
You can use substr of the line as the index of an array. For example, given this file:
cat 1.txt
[01/May/2018:23:59:59.532
[01/May/2018:01:59:59.848
[01/May/2018:02:59:59.851
[01/May/2018:02:59:59.911
[01/May/2018:02:59:59.923
[01/May/2018:02:00:59.986
you can use an awk command like this:
awk '{a[substr($0,index($0,":")+1,8)]++} END{for(i in a) print i" - "a[i]}' 1.txt
where substr($0,index($0,":")+1,8) extracts the 8 characters that follow the first ":" and uses them as the array index.
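A runnable sketch of the same idea, with a few of the sample lines fed on stdin instead of read from 1.txt:

```shell
# Take the 8 characters after the first ":" (the hh:mm:ss part)
# and count how many lines share each second.
printf '[01/May/2018:23:59:59.532\n[01/May/2018:02:59:59.851\n[01/May/2018:02:59:59.911\n' |
awk '{a[substr($0,index($0,":")+1,8)]++} END{for(i in a) print i" - "a[i]}' |
sort
```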

Extract value based on column header from Comma separated file using bash

I want to extract the 1st value from a CSV for a specific column name using bash. For example, I want to extract the first value of column "bb". Columns can be in any order:
aa,bb,cc
1,2,3
4,5,6
The output should be 2.
Awk solution:
awk -F',' 'NR == 1{ for(i=1; i<=NF; i++) if ($i == "bb") { pos = i; break } }
NR == 2{ print $pos; exit }' file.csv
The output:
2
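Since the columns can be in any order, a quick sketch with bb moved to the first position (fed on stdin) confirms the header scan still finds it:

```shell
# Scan the header row for "bb", remember its position,
# then print that field from the first data row and stop.
printf 'bb,aa,cc\n2,1,3\n5,4,6\n' |
awk -F',' 'NR == 1{ for(i=1; i<=NF; i++) if ($i == "bb") { pos = i; break } }
           NR == 2{ print $pos; exit }'
```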
Using csvkit:
csvcut -c 2 file.csv | awk 'NR==2'
Output :
2

AWK: write to a new column based on if/else of another column

I have a CSV file with columns A,B,C,D. Column D contains values on a scale of 0 to 1. I want to use AWK to write a new column E based on the values in column D.
For example:
if value in column D <0.7, value in column E = 0.
if value in column D>=0.7, value in column E =1.
I am able to print the output of column E but am not sure how to write it to a new column. It's possible to write the output of my code to a new file and then paste it back onto the old file, but I was wondering if there is a more efficient way. Here is my code:
awk -F"," 'NR>1 {if ($3>=0.7) $4= "1"; else if ($3<0.7) $4= "0"; print $4;}' test_file.csv
The awk command below should give you the intended output:
awk -F "," '{if($4>=0.7)print $0",1";else if($4<0.7)print $0",0"}' yourfile.csv > test_file.csv
You can use:
awk -F, 'NR>1 {$0 = $0 FS (($4 >= 0.7) ? 1 : 0)} 1' test_file.csv
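A sketch of that command on the sample scale values (a hypothetical header row A,B,C,D is assumed), fed on stdin:

```shell
# Append ",1" when column D >= 0.7 and ",0" otherwise; the trailing "1"
# prints every record, so the header passes through untouched.
printf 'A,B,C,D\n1,2,3,0.8\n4,5,6,0.3\n' |
awk -F, 'NR>1 {$0 = $0 FS (($4 >= 0.7) ? 1 : 0)} 1'
```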

how to conditionally replace values in columns with value of specific column in the same line by Unix and awk commands

I want to conditionally replace values in columns with value of specific column in the same line in one file, by Unix and awk commands.
For example, I have myfile.txt (3 lines, 5 columns, tab-delimited):
1 A . C .
2 C T . T
3 T C C .
There are "." in columns 3 to 5. I want to replace those "." in columns 3 - 5 with the value in column 2 on the same line.
Could you please show me any directions on that?
This seems to do what you're asking for:
% awk 'BEGIN {
FS = OFS = "\t"
}
{
for (column = 3; column <= NF; ++column) {
if ($column == ".") {
$column = $2
}
}
print
}
' test.tsv
1 A A C A
2 C T C T
3 T C C T
You've asked a few questions (and accepted no answers!) on awk now. May I humbly suggest a tutorial?
awk 'BEGIN{FS=OFS="\t"} {for(i=3;i<=5;i++) if($i==".") $i=$2; print}' myfile.txt
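A runnable sketch of that one-liner on two of the sample lines, with FS and OFS set in a BEGIN block so the very first line is also split on tabs and the output stays tab-delimited:

```shell
# Replace "." in columns 3-5 with the value from column 2 on the same line.
printf '1\tA\t.\tC\t.\n2\tC\tT\t.\tT\n' |
awk 'BEGIN{FS=OFS="\t"} {for(i=3;i<=5;i++) if($i==".") $i=$2; print}'
```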
