extract columns from multiple text files with bash

extract columns from multiple text files with bash - linux

I am trying to extract columns from multiple text files(3000 files). The sample of my text file is shown below.
res ABS sum
SER A 1 161.15 138.3
CYS A 2 66.65 49.6
PRO A 3 21.48 15.8
ALA A 4 77.68 72.0
ILE A 5 15.70 9.0
HIS A 6 10.88 5.9
I would like to print
1) resnames(first column) only if the sum(last column) is >25.
2) I would like to store the output in to one file
3) I would like to add a new column to the outputfile with the name of the txt file from where the data was extracted and also need to print the total number of resnames( from all text files only if sum is >25)
I would like to get the following output
SER AA.txt
CYS AA.txt
ALA AA.txt
SER BB.txt
Total numberof SER- 2
Total number of ALA- 1
Total number of CYS- 1
How can I get this output with Bash? I tried the following code
for i in files/*.txt
do
awk 'BEGIN{FS=OFS=" "}{if($5 > 25) print $1,i}'
done
Any suggestions please?

Try:
awk '{ a[$1]++ }
END { for (k in a) print "Total number of " k " - " a[k] }' FILES
(Not tested)

awk '{
if ($NF ~ /([0-9])+(\.)?([0-9])+/ && $NF > 25) {
print $1, FILENAME;
res[$1]++;
}
}
END {
for (i in res) {
print "Total number of ", i, "-", res[i];
}
}' res.txt
Here's the output I get for your example:
SER res.txt
CYS res.txt
ALA res.txt
Total number of SER - 1
Total number of CYS - 1
Total number of ALA - 1

Related

How to extract some values in a column and calculate them, then generate a new column in linux

Here is my vcf file, now I want to extract some values form the last column V350092589_L01_84 to creat new columns.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT V350092589_L01_84
chr19 11224265 . A G 23868.64 PASS AC=1;AF=0.500;AN=2;DP=3417;ExcessHet=0.0000;FS=8.538;MLEAC=1;MLEAF=0.500;MQ=41.37;MQRankSum=1.59;QD=7.57;ReadPosRankSum=9.38;SOR=0.783 GT:AD:DP:GQ:PL 0/1:2029,1125:3154:99:23876,0,49821
chr19 11227576 . C T 8055.64 PASS AC=1;AF=0.500;AN=2;DP=1025;ExcessHet=0.0000;FS=3.316;MLEAC=1;MLEAF=0.500;MQ=41.34;MQRankSum=-4.736e+00;QD=9.12;ReadPosRankSum=2.55;SOR=0.982 GT:AD:DP:GQ:PL 0/1:533,350:883:99:8063,0,15924
Specifically.
Extracting the sub-column DP=xxxx which is in the last column and generating a new column.
Extract the sub-column AD: 2092, 1125; AD: 553, 350 and calculate a ratio value, e.g. Ratio=1125/(1125+2092), Ratio=350/(350+553). Then take the Ratio value and generate a new column
Then,combine the two columns to a new csv file, like this
DP Ratio
3154 0.34
883 0.38

You can do it easily in awk, but your numbers seem to be off by 0.02 (see below for reason). For instance, you can do:
awk '
FNR==1 { # if first record
printf "DP\tRatio\n" # output heading
next # skip to next record
}
{
nf = $NF # save copy of last field
gsub (/[,:]/, " ", nf) # replace "," and ":" with " "
split (nf, arr, " ") # split into arr on space
printf "%s\t%.2f\n", arr[4], arr[3]/(arr[2]+arr[3]) # output result
}
' file
Where for example with record 2, arr[3] == 1125, and arr[2] == 2029. The result of the computation is 0.36 not 0.34? (you used 2092 in your calculation instead of the actual 2029 and 553 instead of the actual 533 -- mystery solved)
Example Use/Output
You can just paste the script into an xterm with your data file (named file or whatever you change the filename to) as:
$ awk '
> FNR==1 { # if first record
> printf "DP\tRatio\n" # output heading
> next # skip to next record
> }
> {
> nf = $NF # save copy of last field
> gsub (/[,:]/, " ", nf) # replace "," and ":" with " "
> split (nf, arr, " ") # split into arr on space
> printf "%s\t%.2f\n", arr[4], arr[3]/(arr[2]+arr[3]) # output result
> }
> ' file
DP Ratio
3154 0.36
883 0.40
So the output based on the file with the fields split as shown above is:
DP Ratio
3154 0.36
883 0.40
Creating an Awk Script
Creating a script with awk to run from the command line is a convenient way to make the awk commands reusable. For example you can create a file say, splitrec.awk containing:
#!/bin/awk
FNR == 1 { # if first record
printf "DP\tRatio\n" # output heading
next # skip to next record
}
{
nf = $NF # save copy of last field
gsub (/[,:]/, " ", nf) # replace "," and ":" with " "
split (nf, arr, " ") # split into arr on space
printf "%s\t%.2f\n", arr[4], arr[3]/(arr[2]+arr[3]) # output result
}
Then your use of the script becomes:
$ awk -f splitrec.awk file
DP Ratio
3154 0.36
883 0.40

Here's a way to extract out the fields, the rest is trivial :
{m,g} '!_<NR ? sub(OFS, _, $!(NF=NF)) : $_="DP\tRatio" ' \
OFS='\t' \
FS='^.*;DP=|[;][^ \t]+[ \t]+[^ \t]+[ \t]+[^:]+:|[0-9]+:[0-9]+,.*,.*$|[:,]'
DP Ratio
3417 2029 1125 3154
1025 533 350 883
A less intuitive way of doing it would be (adding NF check to protect blank lines)
{m,g}awk '+_~NF ? !_ : !_<NR ? sub(".",_,$!--NF)--NF : $_="DP\tRatio"'

How to run a function on more than one file in AWK , when the answer of the function needs to be saved to use it on the next file?

I wrote a function in AWk that will print the difference of two matrix
like f1 contains this matrix:
12 35 68 99
2 6
1
and f2 contains :
10 25 100
2 5 4
2
It will print in a file called tmp
2 10 -32 99
0 1 4
The code is :
function matrix_difference(file1,file2) {
printf "" > "tmp"
for (o=1;o<=NR;o++){
for (x=1;x<=NF;x++){
d=A[file1,o,x]
p=A[file2,o,x]
sum=d-p
printf sum " ">> "tmp"
}
print "" >> "tmp"
}
close("tmp")
}
I tried to write in AWK a code which gets a number of files that contains a matrix and prints : "The difference is (name of files with - between each name) is : " it will print the difference in all the files . IF there are four files . f1 f2 f3 f4 it prints f1-f2-f3-f4
I tried to write the code but It doesn't work only on two files , I tried to do a loop but it doesn't work . It only works on two files only when I write matrix_difference(ARGV[1],ARGV[2]).
#!/usr/bin/awk -f
{
for (i=1;i<=NF;i++)
A[FILENAME,FNR,i]=$i
}
{
if ( FNR ==1)
print "The matrix " FILENAME " is :"
print $0
}
END{
matrix_difference(ARGV[1],ARGV[2])
for ( m=1;m<=NR;m++){
getline x <"tmp"
print x
}
print " The matrix difference A-B-C-D is:"
}
function matrix_difference(file1,file2) {
printf "" > "tmp"
for (o=1;o<=NR;o++){
for (x=1;x<=NF;x++){
d=A[file1,o,x]
p=A[file2,o,x]
sum=d-p
printf sum " ">> "tmp"
}
print "" >> "tmp"
}
close("tmp")
}
also when I write the difference of Matrix then the files name don't know how to do that.

here is a sample code that will work with multiple files, after the first one, everything else will be subtracted. Assumes all matrices have matching dimensions.
$ awk 'NR==FNR {for(i=1;i<=NF;i++) a[NR,i]=$i; next}
{for(i=1;i<=NF;i++) a[FNR,i]-=$i}
END {for(i=1;i<=FNR;i++)
for(j=1;j<=NF;j++)
printf "%d%s",a[i,j],(j==NF?ORS:OFS)}' file1 file2
For the input files
==> file1 <==
1 2
3 4
==> file2 <==
1 0
0 1
script returns
0 2
3 3
try with more files
$ awk '...' file1 file2 file2
-1 2
3 2
you can convert to a function, not sure it helps!?
$ awk 'function sum(sign) {for(i=1;i<=NF;i++) a[FNR,i]+=sign*$i}
NR==FNR {sum(+1); next}
{sum(-1)}
END {for(i=1;i<=FNR;i++)
for(j=1;j<=NF;j++)
printf "%d%s",a[i,j],(j==NF?ORS:OFS)}'

Substitute values based on condition and compare values from multiple files with AWK

I've spent the day trying to figure this out but didn't succeed. I have two files like this:
File1:
chr id pos
14 ABC-00 123
13 AFC-00 345
5 AFG-99 988
File2:
index id chr
1 ABC-00 14
2 AFC-00 11
3 AFG-99 7
I wanna check if the value for chr in File 1 != from chr in File 2 for the same ID, if this is true I want to print some columns from both files to have an output like this one below.
Expected output file:
ID OLD_chr(File1) NEW_chr(File2)
AFC-00 13 11
AFG-99 5 7
.....
Total number of position changes: 2
I got one caveat though. In File 1 I have to substitute some values in the $1 column before comparing the files. Like this:
30 and 32 >> X
31 >> Y
33 >> MT
Because in File 2 that's how those values are coded. And then compare the two files. How in the hell can I achieve this?
I've tried to recode File 1:
awk '{
if($1=30 || $1=32) gsub(/30|32/,"X",$1);
if($1=31) gsub(/31/,"Y",$1);
if($1=33) gsub(/33/,"MT",$1);
print $0
}' File 1 > File 1 Recoded
And I was trying to match the columns and print the output with:
awk 'NR==FNR{a[$1]=$1;next} (a[$1] !=$3){print $2, a[$1], $3 }' File 1 File 2 > output file

$ cat tst.awk
BEGIN {
map[30] = map[32] = "X"
map[31] = "Y"
map[33] = "MT"
print "ID", "Old_chr("ARGV[1]")", "NEW_chr("ARGV[2]")"
}
NR==FNR {
a[$2] = ($1 in map ? map[$1] : $1)
next
}
a[$2] != $3 {
print $2, a[$2], $3
cnt++
}
END {
print "Total number of position changes: " cnt+0
}
.
$ awk -f tst.awk file1 file2
ID Old_chr(file1) NEW_chr(file2)
AFC-00 13 11
AFG-99 5 7
Total number of position changes: 2

Like this:
awk '
BEGIN{ # executed at the BEGINning
print "ID OLD_chr("ARGV[1]") NEW_chr("ARGV[2]")"
}
FNR==NR{ # this code block for File1
if ($1 == 30 || $1 == 32) $1 = "X"
if ($1 == 31) $1 = "Y"
if ($1 == 33) $1 = "MT"
a[$2]=$1
next
}
{ # this for File2
if (a[$2] != $3) {
print $2, a[$2], $3
count++
}
}
END{ # executed at the END
print "Total number of position changes: " count+0
}
' File1 File2
ID OLD_chr(File1) NEW_chr(File2)
AFC-00 13 11
AFG-99 5 7
Total number of position changes: 2

NTILE a column in csv - Linux

I have a csv file that reads like this:
a,b,c,2
d,e,f,3
g,h,i,3
j,k,l,4
m,n,o,5
p,q,r,6
s,t,u,7
v,w,x,8
y,z,zz,9
I want to assign quintiles to this data (like we do it in sql), using preferably bash command in linux. The quintiles, if assigned as a new column, will make the final output look like:
a,b,c,2, 1
d,e,f,3, 1
g,h,i,3, 2
j,k,l,4, 2
m,n,o,5, 3
p,q,r,6, 3
s,t,u,7, 4
v,w,x,8, 4
y,z,z,9, 5
The only thing i am able to achieve is to add a new incremental column to the csv file:
`awk '{$3=","a[$3]++}1' f1.csv > f2.csv`
But not sure how do the quintiles. Please help. thanks.

awk '{a[NR]=$0}
END{
for(i=1;i<=NR;i++) {
p=100/NR*i
q=1
if(p>20){q=2}
if(p>40){q=3}
if(p>60){q=4}
if(p>80){q=5}
print a[i] ", " q
}
}' file
Output:
a,b,c,2, 1
d,e,f,3, 2
g,h,i,3, 2
j,k,l,4, 3
m,n,o,5, 3
p,q,r,6, 4
s,t,u,7, 4
v,w,x,8, 5
y,z,zz,9, 5

Short wc + awk approach:
awk -v n=$(cat file | wc -l) \
'BEGIN{ OFS=","; n=sprintf("%.f\n", n*0.2); c=1 }
{ $(NF+1)=" "c }!(NR % n){ ++c }1' file
n=$(cat file | wc -l) - get the total number of lines of the input file file
n*0.2 - 1/5th (20 percent) of the range
$(NF+1)=" "c - set next last field with current rank value c
The output:
a,b,c,2, 1
d,e,f,3, 1
g,h,i,3, 2
j,k,l,4, 2
m,n,o,5, 3
p,q,r,6, 3
s,t,u,7, 4
v,w,x,8, 4
y,z,zz,9, 5

In bash how can I split a column in several column of fixed dimension

how can I split a single column in several column of fixed dimension, for example I have a column like this:
1
2
3
4
5
6
7
8
and for size p. ex 4, I want to obtain
1 5
2 6
3 7
4 8
or for size p. ex 2, I want to obtain
1 3 5 7
2 4 6 8

Using awk:
awk '
BEGIN {
# Numbers of rows to print
n=4;
}
{
# Add to array with key = 0, 1, 2, 3, 0, 1, 2, ..
l[(NR-1)%n] = l[(NR-1)%n] " " $0
};
END {
# print the array
for (i = 0; i < length(l); i++) {
print l[i];
}
}
' file

OK, this is a bit long winded and not infallible but the following should work:
td=$( mktemp -d ); split -l <rows> <file> ${td}/x ; paste $( ls -1t ${td}/x* ) ; rm -rf ${td}; unset td
Where <cols> is the number of rows you want and file is your input file.
Explanation:
td=$( mktemp -d )
Creates a temporary directory so that we can put temporary files into it. Store this in td - it's possible that your shell has a td variable already but if you sub-shell for this your scope should be OK.
split -l <rows> <file> f ${td}/x
Split the original file into many smaller file, each <rows> long. These will be put into your temp directory and all files will be prefixed with x
paste $( ls -1t ${td}/x* )
Write these files out so that the lines in consecutive columns
rm -rf ${td}
Remove the files and directory.
unset td
Clean the environment.

Assuming you know the number of rows in your column (here, 8):
n=8
# to get output with 4 rows:
seq $n | pr -ts" " -$((n/4))
1 5
2 6
3 7
4 8
# to get output with 2 rows:
seq $n | pr -ts" " -$((n/2))
1 3 5 7
2 4 6 8

If you know the desired output width you can use column.
# Display in columns for an 80 column display
cat file | column -c 80

$ cat tst.awk
{ a[NR] = $0 }
END {
OFS=","
numRows = (numRows ? numRows : 1)
numCols = ceil(NR / numRows)
for ( rowNr=1; rowNr<=numRows; rowNr++ ) {
for ( colNr=1; colNr<=numCols; colNr++ ) {
idx = rowNr + ( (colNr - 1) * numRows )
printf "%s%s", a[idx], (colNr<numCols ? OFS : ORS)
}
}
}
function ceil(x, y){y=int(x); return(x>y?y+1:y)}
$ awk -v numRows=2 -f tst.awk file
1,3,5,7
2,4,6,8
$ awk -v numRows=4 -f tst.awk file
1,5
2,6
3,7
4,8
Note that above produces a CSV with the same number of fields in every row even when the number of input rows isn't an exact multiple of the desired number of output rows:
$ seq 10 | awk -v numRows=4 -f tst.awk
1,5,9
2,6,10
3,7,
4,8,
See https://stackoverflow.com/a/56725452/1745001 for how to do the opposite, i.e. generate a number of rows given a specified number of columns.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

extract columns from multiple text files with bash - linux

Try: awk '{ a[$1]++ } END { for (k in a) print "Total number of " k " - " a[k] }' FILES (Not tested)

Related

How to extract some values in a column and calculate them, then generate a new column in linux

How to run a function on more than one file in AWK , when the answer of the function needs to be saved to use it on the next file?

Substitute values based on condition and compare values from multiple files with AWK

NTILE a column in csv - Linux

In bash how can I split a column in several column of fixed dimension

Categories

Resources