Counting number of rows depending on more than 1 column condition - linux

I have a data file like this
H1 H2 H3 E1 E2 E3 C1 C2 C3
0 0 0 0 0 0 0 0 1
1 0 0 0 1 0 0 0 1
0 1 0 0 1 0 1 0 1
Now I want to count the rows where H1, H2, H3 have the same pattern as E1, E2, E3. For example, I want to count the number of times H1,H2,H3 and E1,E2,E3 are both 010 or both 000.
I tried this code, but it doesn't really work:
awk -F "" '!($1==0 && $2==1 && $3==0 && $4==0 && $5==1 && $6==0)' file | wc -l

Something like
>>> awk '$1$2$3 == $4$5$6' input | wc -l
2
What does it do?
$1$2$3 == $4$5$6 checks whether the string formed by columns 1, 2 and 3 is equal to the string formed by columns 4, 5 and 6. When it is true, awk takes the default action of printing the entire line, and wc counts those lines.
Or, if you want a complete awk solution, you can write
>>> awk '$1$2$3 == $4$5$6{count++} END{print count}' input
2
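The commands above count every row where the two triples match, whatever the pattern. If only specific patterns should count (010 or 000, as in the question), a minimal sketch could look like the following. The temporary file and the variable names h, e and count are illustrative assumptions, not part of the original answer.

```shell
# Sample data from the question, written to a temporary file (hypothetical path).
input=$(mktemp)
printf '%s\n' \
    'H1 H2 H3 E1 E2 E3 C1 C2 C3' \
    '0 0 0 0 0 0 0 0 1' \
    '1 0 0 0 1 0 0 0 1' \
    '0 1 0 0 1 0 1 0 1' > "$input"

# Count data rows where H1..H3 equals E1..E3 and the shared pattern is 010 or 000.
# NR > 1 skips the header line; "count + 0" prints 0 rather than an empty line
# when no row matches.
count=$(awk 'NR > 1 {
    h = $1 $2 $3          # concatenation yields a string such as "010"
    e = $4 $5 $6
    if (h == e && (h == "010" || h == "000")) count++
}
END { print count + 0 }' "$input")

echo "$count"
rm -f "$input"
```

With the sample data this prints 2, the same as the answer above, because the only matching rows happen to be 000 and 010.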

Related

How to reorder columns of hundreds of tab-delimited files in Linux?

I have a couple of hundred large tab-delimited files, but the order of the columns differs across files (the same columns, in different positions). Hence, I need to reorder the columns in all the files and write them back in tab-delimited format.
I would like to write a shell script that takes a specified column order, reorders the columns in every file, and writes the result back. Can someone help me with it?
Here is what the header of my files looks like:
file1)
sLS72 chrX
A B E C F H
2 1 4 5 7 8
0 0 0 0 0 0
and the header of my second file:
S721 chrX
A E B F H C
12 11 2 3 4 1
0 0 0 0 0 0
here is the order of the columns that I want to achieve:
Order=[A ,B ,C ,E,F,H]
and here is the expected outputs for each file based on this ordering:
sLS72 chrX
A B C E F H
2 1 5 4 7 8
0 0 0 0 0 0
file 2:
S721 chrX
A B C E F H
12 2 1 11 3 4
0 0 0 0 0 0
I was trying to use awk:
awk -F'\t' '{s2=$A; $3=$B; $4=$C; $5=$E; $1=s}1' OFS='\t' in file
but the problem is that, first, the order of the columns differs between files and, second, the column names are on the second line of the file. In other words, the first line is a header that I don't want to change, and the second line holds the column names, so I want to order all files based on that line. It's kind of tricky.
$ awk -v order="A B C E F H" '
BEGIN {n=split(order,ho)}
FNR==1 {print; next}
FNR==2 {for(i=1;i<=NF;i++) hn[$i]=i}
{for(i=1;i<=n;i++) printf "%s",$hn[ho[i]] (i==n?ORS:OFS)}' file1 > tmp && mv tmp file1
sLS72 chrX
A B C E F H
2 1 5 4 7 8
0 0 0 0 0 0
If you are working on multiple files at the same time, change it to
$ awk -v ...
{... printf "%s",$hn[ho[i]] (i==n?ORS:OFS) > (FILENAME"_reordered") }' dir/files*
and do a mass rename afterwards. The alternative is to run the original script in a loop, once for each file.
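The loop variant might look like the sketch below. It also sets FS and OFS to a tab so the rewritten files stay tab-delimited, which is an assumption based on the question's wording, not something the answer above does. The working directory, the sample files and the .tmp suffix are hypothetical.

```shell
# Build two small tab-delimited sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf 'sLS72 chrX\nA\tB\tE\tC\tF\tH\n2\t1\t4\t5\t7\t8\n0\t0\t0\t0\t0\t0\n' > "$workdir/file1"
printf 'S721 chrX\nA\tE\tB\tF\tH\tC\n12\t11\t2\t3\t4\t1\n0\t0\t0\t0\t0\t0\n' > "$workdir/file2"

for f in "$workdir"/file*; do
    awk -v order="A B C E F H" '
    BEGIN  { FS = OFS = "\t"; n = split(order, ho, " ") }   # wanted order, split on spaces
    FNR==1 { print; next }                                  # line 1: header, passed through
    FNR==2 { for (i = 1; i <= NF; i++) hn[$i] = i }         # line 2: map column name to position
           { for (i = 1; i <= n; i++)                       # lines 2 and up: emit in wanted order
                 printf "%s%s", $(hn[ho[i]]), (i == n ? ORS : OFS) }
    ' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

# Capture the first data row of each file for inspection, then clean up.
row1=$(sed -n 3p "$workdir/file1")
row2=$(sed -n 3p "$workdir/file2")
printf '%s\n%s\n' "$row1" "$row2"
rm -rf "$workdir"
```

Running awk once per file means the name-to-position map is rebuilt from each file's own second line, and the header line lands in the same output file as the data.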

Linux: filter text rows by the sum of specific columns

From raw sequencing data I created a count file (.txt) with the counts of unique sequences per sample.
The data looks like this:
sequence seqLength S1 S2 S3 S4 S5 S6 S7 S8
AAAAA... 46 0 1 1 8 1 0 1 5
AAAAA... 46 50 1 5 0 2 0 4 0
...
TTTTT... 71 0 0 5 7 5 47 2 2
TTTTT... 81 5 4 1 0 7 0 1 1
I would like to filter the sequences by row sum, so that rows whose total across all samples (the sum of S1 to S8) is lower than, for example, 100 are removed.
This can probably be done with awk, but I have no experience with this text-processing utility.
Can anyone help?
Give this a try:
awk 'NR>1 {sum=0; for (i=3; i<=NF; i++) { sum+= $i } if (sum > 100) print}' file.txt
NR>1 skips line 1 (the header).
Then it sums the items in each row, starting from field 3 (S1 to S8 in your example):
{sum=0; for (i=3; i<=NF; i++) { sum+= $i }
Finally, it prints only the rows whose sum is greater than 100: if (sum > 100) print}'
You can adjust the condition on the sum as needed; I hope this gives you an idea of how to do it with awk.
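As one possible adjustment, the sketch below keeps the header line in the output and takes the threshold from a variable, keeping rows whose sum is at least that value (so rows below 100 are removed, as the question describes). The sample rows, the sequence names seqA to seqC and the variable name min are made up for illustration.

```shell
# Hypothetical sample data in the question's layout: name, length, then S1..S8.
input=$(mktemp)
printf '%s\n' \
    'sequence seqLength S1 S2 S3 S4 S5 S6 S7 S8' \
    'seqA 46 0 1 1 8 1 0 1 5' \
    'seqB 46 50 1 5 0 2 0 4 100' \
    'seqC 71 0 0 5 7 5 47 2 2' > "$input"

filtered=$(awk -v min=100 '
NR == 1 { print; next }                               # keep the header line
{ sum = 0; for (i = 3; i <= NF; i++) sum += $i }      # sum fields 3..NF (S1..S8)
sum >= min                                            # print rows whose sum reaches the threshold
' "$input")

echo "$filtered"
rm -f "$input"
```

Here only seqB (sum 162) survives; seqA (17) and seqC (68) are dropped.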
The following awk may also help.
awk 'FNR>1{for(i=3;i<=NF;i++){sum+=$i};if(sum>100){print > "out_file"};sum=""}' Input_file
In case you need a different output file for each matching row, the following may help.
awk 'FNR>1{for(i=3;i<=NF;i++){sum+=$i};if(sum>100){print > ("out_file" ++n)};sum=""}' Input_file

How to get the rows corresponding to a specific column value using a Linux command

I have output like the one below. I want the values of the first column that correspond to a given value of the second column.
For example, values 0 and 1 in column 1 belong to value 0 in column 2.
So I need a command to which I pass 0 (a second-column value) and get back 0,1.
dmpgdo dbsconfig 0 | grep AMP | grep Online | awk -F' ' '{print $1,$4}'
0 0
1 0
2 1
3 1
4 2
5 2
6 3
7 3
Will this do?
printf "0 0\n1 0\n2 1\n3 1\n4 2\n5 2\n6 3\n7 3" | awk '{if ($2 == 0) print $1}'
0
1
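To pass the second-column value in from the shell instead of hard-coding it, awk's -v option can be used. A minimal sketch, where the function name lookup and the variable want are hypothetical, and printf stands in for the dmpgdo pipeline from the question:

```shell
# Print column 1 of every input row whose column 2 equals the first argument.
lookup() {
    awk -v want="$1" '$2 == want { print $1 }'
}

# Sample rows from the question, fed on stdin in place of the real command output.
sample='0 0\n1 0\n2 1\n3 1\n4 2\n5 2\n6 3\n7 3\n'

# One value per line, then joined with commas using paste.
joined=$(printf "$sample" | lookup 0 | paste -sd, -)
echo "$joined"
```

For the value 0 this prints 0,1. Dropping the paste step gives one value per line, as in the answer above.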

add header to columns from list text file awk

I have a very large text file with hundreds of columns. I want to add a header to every column from an independent text file containing a list.
My large file looks like this:
largefile.txt
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
my list of headers:
headers.txt
h1
h2
h3
wanted output:
output.txt
h1 h2 h3 h4 h5 h6 h7 etc..
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
$ awk 'NR==FNR{h=h OFS $0; next} FNR==1{print OFS OFS h} 1' head large | column -s ' ' -t
h1 h2 h3
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
or if you prefer:
$ awk -v OFS='\t' 'NR==FNR{h=h OFS $0; next} FNR==1{print OFS OFS h} {$1=$1}1' head large
h1 h2 h3
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
Well, here's one. OFS is tab for eye candy. From the OP I concluded that the headers should start from the fourth field, hence +3s in the code.
$ awk -v OFS="\t" ' # tab OFS
NR==FNR { a[NR]=$1; n=NR; next } # has headers
FNR==1 { # print headers in the beginning of 2nd file
$1=$1 # rebuild record for tabs
b=$0 # buffer record
$0="" # clear record
for(i=1;i<=n;i++) # spread head to fields
$(i+3)=a[i]
print $0 ORS b # output head and buffered first record
}
{ $1=$1 }1' head data # implicit print with record rebuild
h1 h2 h3
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
Then again, this would also do the trick:
$ awk 'NR==FNR{h=h (NR==1?"":OFS) $0;next}FNR==1{print OFS OFS OFS h}1' head data
h1 h2 h3
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
Use paste to pivot the headers into a single line and then cat them together with the main file (- instead of a file name means stdin to cat):
$ paste -s -d' ' headers.txt | cat - largefile.txt
If you really need the headers to line up as in your example output you can preprocess (either manually or with a command) the headers file, or you can finish with sed (for just one option) as below:
$ paste -s -d' ' headers.txt | cat - largefile.txt | sed '1 s/^/ /'
h1 h2 h3
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
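If the data is tab-separated and the headers should start at the fourth column (both are assumptions drawn from the answers above, since the question does not state them), the offset can be produced with printf instead of sed. The scratch directory and the shortened sample rows are hypothetical.

```shell
workdir=$(mktemp -d)
printf 'h1\nh2\nh3\n' > "$workdir/headers.txt"
printf 'chrom\tstart\tend\t0\t1\t0\nchrom\tstart\tend\t0\t0\t0\n' > "$workdir/largefile.txt"

# Three leading tabs leave the first three columns unlabeled;
# paste -s joins the header lines into one line, tab-separated by default.
{
    printf '\t\t\t'
    paste -s "$workdir/headers.txt"
    cat "$workdir/largefile.txt"
} > "$workdir/output.txt"

first_line=$(head -n 1 "$workdir/output.txt")
total_lines=$(wc -l < "$workdir/output.txt" | tr -d ' ')
cat "$workdir/output.txt"
rm -rf "$workdir"
```

The data lines are copied through untouched, so this only reads the large file once and never rebuilds its records.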

How to select rows in which column two and three are not equal to each other and to 0 or 1?(with awk)

I have a file like this:
AX-75448119 0 1
AX-75448118 0.45 0.487179
AX-75474642 0 0
AX-75474643 0.25 0.820513
AX-75448113 1 0
AX-75474641 1 1
and I want to select the rows in which columns 2 and 3 are not both equal to each other and equal to 0 or 1. (That is, if columns 2 and 3 are equal to each other but the value is 0.5, or any number other than 0 and 1, I would still like to keep that row.)
so the output would be:
AX-75448119 0 1
AX-75448118 0.45 0.487179
AX-75474643 0.25 0.820513
AX-75448113 1 0
I know how to write the command that selects the rows in which columns 2 and 3 are equal to each other and equal to 0 or 1, which is this:
awk '$2=$3==1 || $2=$3==0' test.txt | wc -l
but I want exactly the opposite: to select every row that is not in the output of the above command.
Thanks, I hope I was able to explain what I want.
This might work for you, if I understood your requirements correctly.
awk ' $2 != $3 { print; next } $2 == $3 && $2 != 0 && $2 != 1 { print }' INPUTFILE
See it in action at Ideone.com
This might work for you:
awk '($2==0 || $2==1) && ($3==0 || $3==1) && $2==$3{next}1' file
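Since the goal is the exact opposite of "equal to each other and equal to 0 or 1", another way to write it is to negate that whole condition. Note that in awk a single = is assignment, so $2=$3==1 in the question's attempt assigns the result of $3==1 to $2 rather than comparing. A minimal sketch on the question's sample data, with a hypothetical temporary file:

```shell
input=$(mktemp)
printf '%s\n' \
    'AX-75448119 0 1' \
    'AX-75448118 0.45 0.487179' \
    'AX-75474642 0 0' \
    'AX-75474643 0.25 0.820513' \
    'AX-75448113 1 0' \
    'AX-75474641 1 1' > "$input"

# Keep every row EXCEPT those where column 2 equals column 3 and that value is 0 or 1.
selected=$(awk '!($2 == $3 && ($2 == 0 || $2 == 1))' "$input")

echo "$selected"
rm -f "$input"
```

This prints the four rows listed as the wanted output in the question; appending | wc -l counts them.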
