Print first and every nth column using awk [duplicate]

(Duplicate of: Print the 1st and every nth column of a text file using awk)
I want to print the 1st column (gene) and all the raw_counts columns in a tab-separated file.
I've tried:
BEGIN {FS = "\t"}
{for (i = 3; i <= NF; i += 1) printf ("%s%c", $i, i + 1 <= NF ? "\t" : "\n");}
but the output is the same as the input.
awk -f prog.awk < input.csv > output.csv
Input Data:
head -3 input.txt
Hybridization REF TCGA-A3-3306-01A-01R-0864-07 TCGA-A3-3306-01A-01R-0864-07 TCGA-A3-3306-01A-01R-0864-07 TCGA-A3-3307-01A-01R-0864-07 TCGA-A3-3307-01A-01R-0864-07 TCGA-A3-3307-01A-01R-0864-07
gene raw_counts median_length_normalized RPKM raw_counts median_length_normalized RPKM
?|100130426 1 0.122549019607843 0.0330807728010661 0 0 0
Desired output:
Hybridization REF TCGA-A3-3306-01A-01R-0864-07 TCGA-A3-3307-01A-01R-0864-07
gene raw_counts raw_counts RPKM
?|100130426 1 0

A couple of tweaks:
start the loop counter at 2
increment the loop counter by +3 on each pass
Modifying OP's code:
$ awk 'BEGIN {FS=OFS="\t"} {printf "%s",$1; for (i=2;i<=NF;i+=3) printf "%s%s",OFS,$i; print ""}' input.csv
gene raw_counts raw_counts raw_counts raw_counts raw_counts
After multiple changes to the sample input and expected output, the latest version is:
$ cat input.csv
Hybridization REF TCGA-A3-3306-01A-01R-0864-07 TCGA-A3-3306-01A-01R-0864-07 TCGA-A3-3306-01A-01R-0864-07 TCGA-A3-3307-01A-01R-0864-07 TCGA-A3-3307-01A-01R-0864-07 TCGA-A3-3307-01A-01R-0864-07
gene raw_counts median_length_normalized RPKM raw_counts median_length_normalized RPKM
?|100130426 1 0.122549019607843 0.0330807728010661 0 0 0
The above awk generates:
Hybridization REF TCGA-A3-3306-01A-01R-0864-07 TCGA-A3-3307-01A-01R-0864-07
gene raw_counts raw_counts
?|100130426 1 0

You can do this:
awk 'BEGIN{FS=OFS="\t"}
FNR==1{
header[1]                                          # always keep the 1st column
for(i=2;i<=NF;i++) if($i=="raw_counts") header[i]  # remember each raw_counts column
}
{
sep=""                                             # reset the separator for each line
for (i=1;i<=NF;i++)
if(i in header) {printf("%s%s", sep, $i); sep=OFS}
print ""
}' file
First time through, it prints your headers and from then on only the values associated with those headers. (This assumes the raw_counts names appear on the first line of the file.)

UPDATE 1: got the whole thing to work end to end (the https is essential). Note: I used bsd-tar instead of gnu-tar.
curl -s -L -f -g '
https://gdac.broadinstitute.org/runs/stddata__2016_01_28/
data/KIPAN/20160128/
gdac.broadinstitute.org_KIPAN.
Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3
__gene_expression__data.Level_3.2016012800.0.0.tar.gz' |
tar -xvO -f- |
mawk '{ print $1 '"$( jot -s '' -w ',$%d' - 2 14 3 )"' }' OFS='\t' |
gcat -n
1 8c6f0954749188f5266253fd418d6d9f KIPAN.rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__gene_expression__data.data.txt
2 Hybridization REF TCGA-A3-3306-01A-01R-0864-07 TCGA-A3-3307-01A-01R-0864-07 TCGA-A3-3308-01A-02R-1325-07 TCGA-A3-3311-01A-02R-1325-07
3 gene raw_counts raw_counts raw_counts raw_counts raw_counts
4 ?|100130426 1 0 0 3 0
5 ?|100133144 70 47 159 168 236
6 ?|100134869 46 19 50 71 138
7 ?|10357 135 138 245 325 266
8 ?|10431 1715 1625 2743 3801 5171
9 ?|136542 0 0 0 0 0
10 ?|155060 464 476 1874 771 740
.
.
20531 ZYX|7791 4944 8779 19825 19639 12883
20532 ZZEF1|23140 3233 3306 8731 9156 17034
20533 ZZZ3|26009 3061 2405 4269 5967 6594
20534 psiTPTE22|387590 379 377 675 669 1385
20535 tAKR|389932 5 5 31 27 36
Don't waste time looping over the columns: use either seq or jot to dynamically generate static awk code on the fly:
gawk -be '{ print $1 '"$( jot -s '' -w ',$%d' - 2 14 3 )"' }'
# gawk profile, created Fri Feb 3 13:08:10 2023
# Rule(s)
1 {
1 print $1, $2, $5, $8, $11, $14
}
gene raw_counts raw_counts raw_counts raw_counts raw_counts
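jot is a BSD utility; on GNU/Linux systems without it, seq can generate the same field list (a sketch, assuming the same column range 2 through 14 in steps of 3):
awk '{ print $1 '"$( printf ',$%d' $(seq 2 3 14) )"' }' OFS='\t'
printf repeats its format once per number from seq, yielding the same ,$2,$5,$8,$11,$14 that jot emits above.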

Related

Executing Concatenation for all rows

I'm working with GWAS data.
Using p-link command I was able to get SNPslist, SNPs.map, SNPs.ped.
Here are the data files and commands I have for 2 SNPs (rs6923761, rs7903146):
$ cat SNPs.map
0 rs6923761 0 0
0 rs7903146 0 0
$ cat SNPs.ped
6 6 0 0 2 2 G G C C
74 74 0 0 2 2 A G T C
421 421 0 0 2 2 A G T C
350 350 0 0 2 2 G G T T
302 302 0 0 2 2 G G C C
bash commands I used:
echo -n IID > SNPs.csv
cat SNPs.map | awk '{printf ",%s", $2}' >> SNPs.csv
echo >> SNPs.csv
cat SNPs.ped | awk '{printf "%s,%s%s,%s%s\n", $1, $7, $8, $9, $10}' >> SNPs.csv
cat SNPs.csv
Output:
IID,rs6923761,rs7903146
6,GG,CC
74,AG,TC
421,AG,TC
350,GG,TT
302,GG,CC
This was for 2 SNPs, so I could find their positions manually and build the command above. But now I have 2000 SNP IDs and their values. I need help with a bash command that can parse over 2000 SNPs in the same way.
One awk idea that replaces all of the current code:
awk '
BEGIN { printf "IID" }
# process 1st file:
FNR==NR { printf ",%s", $2; next }
# process 2nd file:
FNR==1 { print "" } # terminate 1st line of output
{ printf "%s", $1             # print 1st column (fixed format, so data is never treated as printf directives)
for (i=7;i<=NF;i=i+2) # loop through columns 7-NF, incrementing index +2 on each pass
printf ",%s%s", $i, $(i+1) # print (i)th and (i+1)th columns
print "" # terminate line
}
' SNPs.map SNPs.ped
NOTE: remove comments to declutter code
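With the comments stripped, the same program fits on one line:
awk 'BEGIN{printf "IID"} FNR==NR{printf ",%s",$2; next} FNR==1{print ""} {printf "%s",$1; for(i=7;i<=NF;i+=2) printf ",%s%s",$i,$(i+1); print ""}' SNPs.map SNPs.ped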
This generates:
IID,rs6923761,rs7903146
6,GG,CC
74,AG,TC
421,AG,TC
350,GG,TT
302,GG,CC
You can use the --recodeA flag in plink to get your IIDs as rows and SNPs as columns.
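For example, a minimal sketch assuming plink 1.07-style flags and that the data live in SNPs.ped/SNPs.map:
plink --file SNPs --recodeA --out SNPs_recoded
This should produce SNPs_recoded.raw, with one row per individual (IID) and one additively coded (0/1/2) column per SNP.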

how to write awk code with specific condition

I want to write code that operates on a column of data: whenever a number is negative, I want to make it positive by multiplying the number by itself.
example
data
10
11
-12
-13
-14
expected output
10
11
144
169
196
This is what I've tried:
awk 'int($0)<0 {$4 = int($0) + 360}
END {print $4}' data.txt
but I don't even get any output. Can anyone help me?
awk '$0 < 0 { $0 = $0 * $0 } 1' data.txt
The first condition multiplies the value by itself when it's negative. The condition 1 is always true, so the line is printed unconditionally.
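Equivalently, with the print spelled out:
awk '$0 < 0 { $0 = $0 * $0 } { print }' data.txt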
Also:
awk '{print ($0<0 ? $0*$0 : $0)}' input
Or raise each value to the power 2 when the line contains a minus sign (squaring it), and to the power 1 otherwise (leaving it unchanged):
$ awk '{print $0 ^ (/-/ ? 2 : 1)}' file
10
11
144
169
196
You could also match only numbers that start with - and, in that case, multiply them by themselves:
awk '{print (/^-[0-9]+$/ ? $0 * $0 : $0)}' data.txt
Output
10
11
144
169
196

Add the elements of 2 rows based on pattern condition

I want to add 2 rows based on a pattern
I have this table
1 - 513 1478 966 1
2 - 1594 2130 537 1
3 + 2171 2539 369 1
4 - 2587 3159 573 1
What I am looking for is to add a 7th column ($7) whose running value starts at 0: if $2 is "-", subtract 1 from it; otherwise add 1.
like this:
1 - 513 1478 966 1 -1
2 - 1594 2130 537 1 -2
3 + 2171 2539 369 1 -1
4 - 2587 3159 573 1 -2
I wrote this
awk '$7==0,i=1;{for i in $1 do {if($2="-"){$7=$7+1}else{$7=$7-1} done print}'
The issue with my code is that, if I remove the for loop, it turns the entire $2 column into - and the entire $7 column into -1.
Your code does not work at all; it produces a couple of syntax errors. In any case, I think you are overthinking the problem. If I haven't misunderstood you, the solution is simpler:
awk 'BEGIN {v=0} {if ($2=="-") {v=v-1} else {v=v+1}; $7=v; print}'
Use a variable v to keep the last value, adding or subtracting one depending on the content of $2. Once v is updated, assign it to $7 and print the entire record. On the next line, v already holds the last value of the seventh column.
Using #RavinderSingh13's trick of appending "1" to the sign, so "-" becomes -1 and "+" becomes +1 when awk coerces the string to a number:
$ awk '{print $0 "\t" (c+=$2"1")}' file
1 - 513 1478 966 1 -1
2 - 1594 2130 537 1 -2
3 + 2171 2539 369 1 -1
4 - 2587 3159 573 1 -2
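To see that coercion on its own (an illustrative one-liner, not from the original answer):
$ echo '- +' | awk '{print ($1 "1") + 0, ($2 "1") + 0}'
-1 1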
This should be as simple as the following.
awk 'BEGIN{OFS="\t\t"} {$2=$2"1";$(NF+1)=$2*$NF+prev;prev=$NF} 1' Input_file
Brief explanation:
Appending "1" to $2 (the 2nd field) of every line, so "-" becomes "-1" and "+" becomes "+1".
Multiplying the value of $2 by the last field (i.e., applying the +1/-1 sign to it) and saving the result in a newly created last field.
Adding the previous line's last-field value to it, as per the OP's question.
Saving the current last field (the newly created one, $(NF+1)) in the prev variable so it can be used in the next line's calculation.
Detailed explanation:
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this awk program from here.
OFS="\t\t" ##Setting value of 2 times TAB for each line here.
} ##Close BEGIN section of this code here.
{
$2=$2"1" ##Concatenating 1 to value of $2 here.
$(NF+1)=$2*$NF+prev ##Creating new last field whose value is $2*$NF and adding prev variable to it.
prev=$NF ##Setting current last field value to variable prev here.
}
1 ##Printing edited/non-edited lines here.
' Input_file ##mentioning Input_file name here.
Output will be as follows for provided samples.
1 -1 513 1478 966 1 -1
2 -1 1594 2130 537 1 -2
3 +1 2171 2539 369 1 -1
4 -1 2587 3159 573 1 -2

Appending the line even though there is no match with awk

I am trying to compare two files and append another column when a certain condition is satisfied.
file1.txt
1 101 111 . BCX 123
1 298 306 . CCC 234
1 299 305 . DDD 345
file2.txt
1 101 111 BCX P1#QQQ
1 299 305 DDD P2#WWW
The output should be:
1 101 111 . BCX 123;P1#QQQ
1 298 306 . CCC 234
1 299 305 . DDD 345;P2#WWW
What I can do is, to only do this for the lines having a match:
awk 'NR==FNR{ a[$1,$2,$3,$4]=$5; next }{ s=SUBSEP; k=$1 s $2 s $3 s $5 }k in a{ print $0,a[k] }' file2.txt file1.txt
1 101 111 . BCX 123 P1#QQQ
1 299 305 . DDD 345 P2#WWW
But then, I am missing the second line in file1.
How can I still keep it even though there is no match with file2 regions?
If you want to print every line, you need your print command not to be limited by your condition.
awk '
NR==FNR {
a[$1,$2,$3,$4]=$5; next
}
{
s=SUBSEP; k=$1 s $2 s $3 s $5
}
k in a {
$6=$6 ";" a[k]
}
1' file2.txt file1.txt
The 1 is shorthand that says "print every line". It's a condition (with no action block) that always evaluates to true.
The k in a condition simply replaces your existing 6th field with the concatenated one. If the condition is not met, the replacement doesn't happen, but we still print because of the 1.
The following awk may help here:
awk 'FNR==NR{a[$1,$2,$3,$4]=$NF;next} (($1,$2,$3,$5) in a){print $0";"a[$1,$2,$3,$5];next} 1' file2.txt file1.txt
Output will be as follows.
1 101 111 . BCX 123;P1#QQQ
1 298 306 . CCC 234
1 299 305 . DDD 345;P2#WWW
Another awk:
$ awk ' {t=5-(NR==FNR); k=$1 FS $2 FS $3 FS $t}
NR==FNR {a[k]=$NF; next}
k in a {$0=$0 ";" a[k]}1' file2 file1
1 101 111 . BCX 123;P1#QQQ
1 298 306 . CCC 234
1 299 305 . DDD 345;P2#WWW
The last component of the key is either the 4th or the 5th field, depending on whether we are reading the first or the second input file; set it accordingly and use a single k variable in the script. Note that
t=5-(NR==FNR)
can be written more conventionally as
t=NR==FNR?4:5
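With that substitution, the same script reads:
$ awk '{t=(NR==FNR?4:5); k=$1 FS $2 FS $3 FS $t}
NR==FNR {a[k]=$NF; next}
k in a {$0=$0 ";" a[k]}1' file2 file1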

"Finding and extracting matches with single hit" from blat output, Mac vs. linux syntax?

Problem: the output file "single_hits.txt" is blank:
cut -f10 genome_v_trans.pslx | sort | uniq -c | grep ' 1 ' | sed -e 's/ 1 /\\\</' -e 's/$/\\\>/' > single_hits.txt
I downloaded the script, written for Linux, to use on Mac OS X 10.7.5. Some changes need to be made, as it is not working. I have nine "contigs" of DNA data that need to be filtered to remove all but unique contigs. blat is used to compare two datasets and output a .pslx file with these contigs, which worked:
964 0 0 0 0 0 3 292 + m.1 1461 0 964 3592203 ...
501 0 0 0 0 0 3 468 - m.1 1461 960 1461 5269699 ...
1168 0 0 0 1 2 7 1232 - m.7292 1170 0 1170 5233270 ...
Then this script is supposed to remove identical contigs, such as the top two (m.1).
This seems to work on the limited data you gave,
grep -v `awk '{print $10}' genome_v_trans.pslx | uniq -d` genome_v_trans.pslx
unless you want <> in place of the duplicates; in that case you can use sed to substitute the duplicate entries with something like:
IFS=$(echo -en "\n\b") && for a in $(awk '{print $10}' genome_v_trans.pslx | uniq -d); do sed -i "s/$a/<>/g" genome_v_trans.pslx; done && unset IFS
results in:
964 0 0 0 0 0 3 292 + <> 1461 0 964 3592203 ...
501 0 0 0 0 0 3 468 - <> 1461 960 1461 5269699 ...
1168 0 0 0 1 2 7 1232 - m.7292 1170 0 1170 5233270 ...
or if you wanted that in the singlehits file:
IFS=$(echo -en "\n\b") && for a in $(awk '{print $10}' dna.txt | uniq -d); do sed "s/$a/<>/g" dna.txt >> singlehits.txt; done && unset IFS
SINGLE_TMP=/tmp/_single_tmp_$$ && awk '{if ($10 == "<>") print}' singlehits.txt > "$SINGLE_TMP" && mv "$SINGLE_TMP" singlehits.txt && unset SINGLE_TMP
or, more elegantly: sed -ni '/<>/p' singlehits.txt (note that the BSD sed shipped with Mac OS X requires an explicit suffix argument for -i, e.g. sed -i '' -n '/<>/p').
singlehits.txt:
964 0 0 0 0 0 3 292 + <> 1461 0 964 3592203 ...
501 0 0 0 0 0 3 468 - <> 1461 960 1461 5269699 ...
