How to calculate the sum of numbers in rows in a space-delimited file? - linux

I have a space-delimited file which looks like this:
probeset_id submitted_id chr snp_pos alleleA alleleB 562_201 562_202 562_203 562_204 562_205 562_206 562_207 562_208 562_209 562_210 562_211 562_212 562_213 562_214 562_215 562_216 562_217 562_218 562_219 562_220 562_221 562_222 562_223 562_224 562_225 562_226 562_227 562_228 562_229 562_230 562_231 562_232 562_233 562_234 562_235 562_236 562_237 562_238 562_239 562_240 562_241 562_242 562_243 562_244 562_245 562_246 562_247 562_248 562_249 562_250 562_251 562_252 562_253 562_254 562_255 562_256 562_257 562_258 562_259 562_260 562_261 562_262 562_263 562_264 562_265 562_266 562_267 562_268 562_269 562_270 562_271 562_272 562_273 562_274 562_275 562_276 562_277 562_278 562_279 562_280 562_281 562_283 562_284 562_285 562_289 562_291 562_292 562_294 562_295 562_296 562_400 562_401 562_402 562_403 562_404 562_405
AX-75448119 Chr1_41908741 1 41908741 T C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 0 1 0 0 0 0 2 2 0 0 0 0 0 1 0 0 0 0 0
AX-75448118 Chr1_41908545 1 41908545 T C 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 2 2 2 2 2 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 1 2 2 2 0 1 1 1 2 -1 1 2 0 0 2 1 1 0 1 0 1 2 1 0 0 1 2 2 1 2 2 0 1 2 2 2 2 2 2 0 1 0 0 0 1 2 2 2 2 0
What I would like to do is to get the sum of all numbers in each row, ignoring any negative number (only -1 exists), so I would like to have this as the result:
AX-75448119 Chr1_41908741 1 41908741 T C 13
(which is 1+1+1+1+1+1+1+1+2+2+1)
and
AX-75448118 Chr1_41908545 1 41908545 T C 98
where in this case the -1 is ignored.
I was thinking of using awk in Linux, which I usually use for space-delimited files, but I only know how to use it for columns, not for rows.

This may be what you're looking for, using pure awk:
awk 'NR >=2 {for (i=7;i<=NF;i++) if ($i !~ /^-/) sum += $i; print $1,$2,$3,$4,$5,$6,sum; sum = 0}' data.txt
Output:
AX-75448119 Chr1_41908741 1 41908741 T C 13
AX-75448118 Chr1_41908545 1 41908545 T C 98
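If you'd rather not list the first six fields by hand, here is a variation (a sketch; decreasing NF to drop trailing fields works in gawk and mawk, though not necessarily in every old awk):
awk 'NR >= 2 {
    sum = 0
    for (i = 7; i <= NF; i++)
        if ($i !~ /^-/)    # skip negative values such as -1
            sum += $i
    NF = 6                 # truncate the record to the first six fields
    print $0, sum
}' data.txt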

I would like to suggest a Perl script:
#!/usr/bin/env perl
while (<>) {
    my ($line, $sum, $next);
    # repeat while there are two (or more) integers after the "... T C" prefix:
    while (/^(AX-\d+\s+\S+\s+\d+\s+\d+\s+\w+\s+\w+\s+)(\d+)\s+(-?\d+)/) {
        $line = $1;
        $sum  = $2;
        $next = $3;
        $sum += $next if ($next > 0);   # do not add negative numbers.
        # replace the two integers by their sum.
        s/$line\d+\s+$next/$line$sum/;
    }
    print;
}
which you can run like: cat data | ./script.pl
I get:
probeset_id submitted_id chr snp_pos alleleA alleleB 562_201 562_202 562_203 562_204 562_205 562_206 562_207 562_208 562_209 562_210 562_211 562_212 562_213 562_214 562_215 562_216 562_217 562_218 562_219 562_220 562_221 562_222 562_223 562_224 562_225 562_226 562_227 562_228 562_229 562_230 562_231 562_232 562_233 562_234 562_235 562_236 562_237 562_238 562_239 562_240 562_241 562_242 562_243 562_244 562_245 562_246 562_247 562_248 562_249 562_250 562_251 562_252 562_253 562_254 562_255 562_256 562_257 562_258 562_259 562_260 562_261 562_262 562_263 562_264 562_265 562_266 562_267 562_268 562_269 562_270 562_271 562_272 562_273 562_274 562_275 562_276 562_277 562_278 562_279 562_280 562_281 562_283 562_284 562_285 562_289 562_291 562_292 562_294 562_295 562_296 562_400 562_401 562_402 562_403 562_404 562_405
AX-75448119 Chr1_41908741 1 41908741 T C 13
AX-75448118 Chr1_41908545 1 41908545 T C 98
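For comparison, a more direct Perl sketch that splits each line into fields instead of doing repeated regex substitutions; it assumes the same six leading columns and, unlike the script above, drops the header line:
#!/usr/bin/env perl
use strict;
use warnings;

while (<>) {
    next if $. == 1;                 # skip the header line
    my @f = split;                   # split on whitespace
    my $sum = 0;
    for my $v (@f[6 .. $#f]) {       # the genotype columns start at field 7
        $sum += $v unless $v < 0;    # ignore negative values such as -1
    }
    print join(" ", @f[0 .. 5], $sum), "\n";
}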

In case you really wanted to avoid perl (why?) you could do this hacky thing, which, obviously, doesn't perform too well:
while read f1 f2 f3 f4 f5 f6 line
do
    echo "$f1 $f2 $f3 $f4 $f5 $f6 $(echo "$line" |
        xargs -n1 | grep -v '^-' | paste -sd+ | bc)"
done < input
I get:
AX-75448119 Chr1_41908741 1 41908741 T C 13
AX-75448118 Chr1_41908545 1 41908545 T C 98
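A variant of the same loop (a sketch) that stays in shell arithmetic, so it does not have to spawn xargs, grep, paste and bc for every line; the tail -n +2 drops the header line, which the loop above would otherwise feed to bc:
tail -n +2 input | while read -r f1 f2 f3 f4 f5 f6 rest
do
    sum=0
    for v in $rest
    do
        [[ $v == -* ]] || (( sum += v ))
    done
    echo "$f1 $f2 $f3 $f4 $f5 $f6 $sum"
done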

Slightly changed version of @steve's awk solution. (Note that the $i != -1 test is equivalent to !~ /^-/ here only because -1 is the sole negative value in the data.)
awk '
NR>1{
    s = 0;
    for (i = 7 ; i <= NF ; i++)
    {
        if ($i != -1)
        {
            s+=$i;
        }
    }
    for (j = 1 ; j < 7 ; j++)
    {
        printf("%s ", $j);
    }
    print s;
}' file
Test:
AX-75448119 Chr1_41908741 1 41908741 T C 13
AX-75448118 Chr1_41908545 1 41908545 T C 98

Related

How to filter a matrix based on another column

I want to filter a matrix file using a column from another file.
I have 2 tab-separated files; one contains a matrix. I want to filter my matrix file (FileA) based on the first column of FileB: if a header (column name) of FileA is present in the first column of FileB, I want to keep that column in a new file. All the solutions I could try were based on filtering rows, not fields. Any help is appreciated. Thanks!
FileA
A B C D E F G H I J K L M N
R1 0 0 0 0 0 0 0 0 0 1 0 0 1 1
R2 1 1 0 1 0 0 0 0 1 0 1 0 0 0
R3 0 0 0 0 0 0 0 0 0 0 0 0 0 1
R4 1 1 0 1 0 0 0 1 0 1 0 1 0 0
R5 0 0 0 0 1 0 1 0 1 0 1 0 1 0
FileB
A Green
B Purple
K Blue
L Blue
Z Green
M Purple
N Red
O Red
U Red
My expected output is:
ExpectedOutput
A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0
Oh, what the heck, I'm not sure having you post an R script is really going to make any difference other than satisfying my need to be pedantic so here y'go:
$ cat tst.awk
NR == FNR {
    outFldNames2Nrs[$1] = ++numOutFlds
    next
}
FNR == 1 {
    $0 = "__" FS $0
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        outFldNr = outFldNames2Nrs[$inFldNr]
        out2inFldNrs[outFldNr] = inFldNr
    }
}
{
    printf "%s", $1
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = out2inFldNrs[outFldNr]
        if (inFldNr) {
            printf "%s%s", OFS, $inFldNr
        }
    }
    print ""
}
$ awk -f tst.awk fileB fileA
__ A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0
I'm using the term "field name" to apply to the letter at the top of each column ("field" in awk). Try to figure the rest out for yourself by looking at the man pages and adding "print"s if/when useful, and then feel free to ask questions if you have any.
I added __ at the front of your header line so you'd have the same number of columns in every line of output - that makes it easier to pass along to other tools to manipulate further but it's easy to tweak the code to not do that if you don't like it.
As @EdMorton mentions, bash may not be a suitable tool for manipulating a complex data structure such as a table, from a maintainability and robustness point of view.
Here is a bash script example, just for information:
#!/bin/bash
declare -A seen                     # header names seen in FileB's 1st column
declare -a ary include
while read -r alpha color; do
    seen["$alpha"]=1
done < FileB
while read -r -a ary; do
    if (( nr++ == 0 )); then        # handle header line
        echo -n " "
        for (( i=0; i<${#ary[@]}; i++ )); do
            alpha="${ary[$i]}"
            if [[ ${seen["$alpha"]} = 1 ]]; then
                echo -n " $alpha"
                include[$((i+1))]=1 # data rows are shifted by the row label
            fi
        done
    else                            # data line: print label, then kept columns
        echo -n "${ary[0]}"
        for (( i=1; i<${#ary[@]}; i++ )); do
            if [[ ${include[$i]} = 1 ]]; then
                echo -n " ${ary[$i]}"
            fi
        done
    fi
    echo
done < FileA
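Assuming you save it as filter.sh (a name I'm making up), you would run it with FileA and FileB in the current directory:
$ bash filter.sh
  A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0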
If python is your option, you can say instead something like:
import pandas as pd

dfb = pd.read_csv("./FileB", sep=r"\s+", header=None)
vb = [x[0] for x in dfb.values.tolist()]   # first column of FileB
dfa = pd.read_csv("./FileA", sep=r"\s+")
va = dfa.columns.tolist()                  # header names of FileA
print(dfa[sorted(set(va) & set(vb))])      # keep only the shared names
Output:
A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0
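Note that sorted(set(va) & set(vb)) only happens to reproduce FileA's column order because the shared names sort alphabetically here; a variant (a sketch) that preserves FileA's order explicitly:
cols = [c for c in dfa.columns if c in set(vb)]  # keep FileA's original order
print(dfa[cols])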

Searching a file for a string in the first field depending on the input from another file and piping the result to a new file

I have an input file like the one below:
Model related text
Model specifications
*ELEMENT_SHELL
$# eid pid n1 n2 n3 n4 n5 n6 n7 n8
76737 1 79322 79323 79324 79511 0 0 0 0
76738 1 79510 79203 79204 79512 0 0 0 0
76739 1 79511 79324 79325 79513 0 0 0 0
76740 1 79512 79204 79205 79514 0 0 0 0
76741 1 79514 79205 79206 79515 0 0 0 0
76742 1 79515 79206 79207 79516 0 0 0 0
76743 1 79516 79207 79208 79517 0 0 0 0
76744 1 79517 79208 79209 79518 0 0 0 0
76745 1 79518 79209 79210 79519 0 0 0 0
76746 1 79519 79210 79211 79520 0 0 0 0
In another file, File2.txt, I have only numbers like:
76737
76738
76739
76740
76741
I have to compare each number from File2.txt with the number in the first column of each line of File1.txt and, if it matches, the complete line from File1.txt should be output to model.txt.
The output would be
Model related text
Model specifications
*ELEMENT_SHELL
$# eid pid n1 n2 n3 n4 n5 n6 n7 n8
76737 1 79322 79323 79324 79511 0 0 0 0
76738 1 79510 79203 79204 79512 0 0 0 0
76739 1 79511 79324 79325 79513 0 0 0 0
76740 1 79512 79204 79205 79514 0 0 0 0
76741 1 79514 79205 79206 79515 0 0 0 0
Can anybody suggest a solution with awk, sed, etc.?
This can be done very easily using awk:
awk 'FNR==NR{ value[$1]; next} $1 in value || FNR < 5'
Test:
$ awk 'FNR==NR{ value[$1]; next} $1 in value || FNR < 5' file2 file1
Model related text
Model specifications
*ELEMENT_SHELL
$# eid pid n1 n2 n3 n4 n5 n6 n7 n8
76737 1 79322 79323 79324 79511 0 0 0 0
76738 1 79510 79203 79204 79512 0 0 0 0
76739 1 79511 79324 79325 79513 0 0 0 0
76740 1 79512 79204 79205 79514 0 0 0 0
76741 1 79514 79205 79206 79515 0 0 0 0
If you are not interested in the leading headers in the output, the script can be further simplified as
awk 'FNR==NR{ value[$1]; next} $1 in value' file2 file1
76737 1 79322 79323 79324 79511 0 0 0 0
76738 1 79510 79203 79204 79512 0 0 0 0
76739 1 79511 79324 79325 79513 0 0 0 0
76740 1 79512 79204 79205 79514 0 0 0 0
76741 1 79514 79205 79206 79515 0 0 0 0
What it does:
FNR==NR checks whether the number of records read from the current file equals the total number of records read; this is true only while the first file on the command line, here file2, is being read
value[$1]; next creates an entry in an associative array indexed by $1 (the number from file2), then skips to the next input line
$1 in value checks whether column 1 of file1 is present as an index in the associative array
EDIT
Print only the first occurrence.
You can use delete to remove the entry from the associative array once the line has been printed. This ensures that the line is not printed for a second occurrence.
awk 'FNR==NR{ value[$1]; next} $1 in value{ print; delete value[$1] }'
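For example, with a hypothetical file1.dup that repeats an eid:
$ printf '76737 a\n76737 b\n76738 c\n' > file1.dup
$ awk 'FNR==NR{ value[$1]; next} $1 in value{ print; delete value[$1] }' file2 file1.dup
76737 a
76738 c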

How to select only those lines which have the same string in all columns, with the first line being a different header, in Linux

I have a text file with 200 columns like:
sample1 0 12 11 23 12
sample2 3 16 89 12 0
sample3 0 0 0 0 0
sample4 33 22 0 0 0
sample5 0 0 0 0 0
And I want only those lines which have only 0 from column 2 to 6. The desired output is:
sample3 0 0 0 0 0
sample5 0 0 0 0 0
Like this, for example:
$ awk '!$2 && !$3 && !$4 && !$5 && !$6' file
sample3 0 0 0 0 0
sample5 0 0 0 0 0
Which is the same as:
$ awk '!($2 || $3 || $4 || $5 || $6)' file
sample3 0 0 0 0 0
sample5 0 0 0 0 0
As per your comment
that is for example but i want to do that from column 2 to 200th
This can be a way:
$ awk '{for (i=2;i<=200;i++) if ($i) {next}}1' file
sample3 0 0 0 0 0
sample5 0 0 0 0 0
Note that $i refers to the field at position i, and a field evaluates as true when it holds a "true" (non-zero, non-empty) value; hence $i is false when it is 0.
Based on that approach, we loop through all the values; as soon as one value is true, meaning not 0, we execute next and the line is not analyzed any more. For the rest of the cases (the 2nd to 200th columns all being 0 or empty), next is never reached, so awk goes on to evaluate the 1, a true pattern whose default action is {print $0}.
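The same idea written with an explicit comparison and NF instead of the hard-coded 200, so it adapts to however many columns the file actually has (a sketch):
$ awk '{for (i=2; i<=NF; i++) if ($i != 0) next} 1' file
sample3 0 0 0 0 0
sample5 0 0 0 0 0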

How to combine two AWK commands?

I have two awk command lines which work perfectly:
awk 'NR >=2 {for (i=7;i<=NF;i++) if ($i ~ /^-/) sum1 += $i; print $1,$2,$3,$4,$5,$6,sum1, ; sum1 = 0}' test.txt
awk 'NR >=2 {for (i=7;i<=NF;i++) if ($i ~! /^-/) sum += $i; print $1,$2,$3,$4,$5,$6,sum, sum/192 ; sum = 0}' test.txt
I want to combine these two commands into one command so I would be able to get both sum and sum1, and I want to print them both! And if possible use an equation!
Something like this:
awk 'NR >=2 {for (i=7;i<=NF;i++) if ($i ~! /^-/) sum += $i; {for (i=7;i<=NF;i++) if ($i ~ /^-/) sum1 += $i ; print $1,$2,$3,$4,$5,$6,sum,sum1, sum/(192 +(sum1*2)) ; sum = 0 ; sum1 = 0}' test.txt
or
awk 'NR >=2 {for (i=7;i<=NF;i++) if ($i ~! /^-/) sum += $i && {for (i=7;i<=NF;i++) if ($i ~ /^-/) sum1 += $i ; print $1,$2,$3,$4,$5,$6,sum,sum1, sum/(192 +(sum1*2)) ; sum = 0 ; sum1 = 0}' test.txt
but I get this error:
awk: cmd. line:1:
^ unexpected newline or end of string
In case it helps, my file is something like this:
probeset_id submitted_id chr snp_pos alleleA alleleB 562_201 562_202 562_203 562_204 562_205 562_206 562_207 562_208 562_209 562_210 562_211 562_212 562_213 562_214 562_215 562_216 562_217 562_218 562_219 562_220 562_221 562_222 562_223 562_224 562_225 562_226 562_227 562_228 562_229 562_230 562_231 562_232 562_233 562_234 562_235 562_236 562_237 562_238 562_239 562_240 562_241 562_242 562_243 562_244 562_245 562_246 562_247 562_248 562_249 562_250 562_251 562_252 562_253 562_254 562_255 562_256 562_257 562_258 562_259 562_260 562_261 562_262 562_263 562_264 562_265 562_266 562_267 562_268 562_269 562_270 562_271 562_272 562_273 562_274 562_275 562_276 562_277 562_278 562_279 562_280 562_281 562_283 562_284 562_285 562_289 562_291 562_292 562_294 562_295 562_296 562_400 562_401 562_402 562_403 562_404 562_405
AX-75448119 Chr1_41908741 1 41908741 T C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 0 1 0 0 0 0 2 2 0 0 0 0 0 1 0 0 0 0 0
AX-75448118 Chr1_41908545 1 41908545 T C 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 2 2 2 2 2 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 1 2 2 2 0 1 1 1 2 -1 1 2 0 0 2 1 1 0 1 0 1 2 1 0 0 1 2 2 1 2 2 0 1 2 2 2 2 2 2 0 1 0 0 0 1 2 2 2 2 0
and I want the result to be like this:
AX-75448119 Chr1_41908741 1 41908741 T C 13 0 0.067
AX-75448118 Chr1_41908545 1 41908545 T C 98 -1 0.515
Here is the nicely formatted version with explanation:
awk '
NR>1{
    # Initialize the variables to 0 for every iteration
    sum=0;
    sum1=0;
    # Loop from the 7th column till the end
    for(i=7;i<=NF;i++)
    {
        # Test if the value in that column is greater than zero
        if($i>0)
        {
            # If test returns true, add value to variable sum
            sum+=$i;
        }
        else
        {
            # If test returns false, add value to variable sum1
            sum1+=$i;
        }
    }
    # Loop again through columns 1-6
    for(i=1;i<7;i++)
        # Print the values of those columns
        printf("%s ",$i);
    # Print variables and function
    printf("%d %d %f\n",sum,sum1,sum/(192 + (sum1*2)))
}' test.txt
Test:
AX-75448119 Chr1_41908741 1 41908741 T C 13 0 0.067708
AX-75448118 Chr1_41908545 1 41908545 T C 98 -1 0.515789
This should do the trick (and avoids multiple iterations).
$ awk 'BEGIN{sum=0;sum1=0} NR >=2 {for (i=7;i<=NF;i++) if ($i !~ /^-/) sum += $i; else if ($i ~ /^-/) sum1 += $i; print $1,$2,$3,$4,$5,$6,sum, sum1,sum/(192 +(sum1*2)) ; sum=0;sum1=0}' test.txt
AX-75448119 Chr1_41908741 1 41908741 T C 13 0 0.0677083
AX-75448118 Chr1_41908545 1 41908545 T C 98 -1 0.515789
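For what it's worth, the syntax errors in the question come from ~! (awk's negated match operator is spelled !~) and from the unbalanced braces; a minimal corrected combination of the two original commands might look like this (a sketch):
awk 'NR >= 2 {
    for (i = 7; i <= NF; i++)
        if ($i !~ /^-/) sum += $i; else sum1 += $i
    print $1, $2, $3, $4, $5, $6, sum, sum1, sum / (192 + (sum1 * 2))
    sum = 0; sum1 = 0
}' test.txt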

Convert column pattern

I have this kind of file:
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1



8 0 1


10 0 1

11 0 1
The record separator (RS) is an empty line by default.
If there is a double blank line, we have to substitute one of them by a record $1 0 0, where $1 is the number of the record before, increased by one.
If the separator is an empty line + 1 extra empty line, we have to increase $1 by 1.
If the separator is an empty line + 2 extra empty lines, we have to increase $1 by 2.
...
and I need to get this output:
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1

6 0 0

7 0 0

8 0 1

9 0 0

10 0 1

11 0 1
Thanks in advance!
awk 'NF{f=0;n=$1;print;next}f{print ++n " 0 0"}{print;f=1}' ./infile
Output:
$ awk 'NF{f=0;n=$1;print;next}f{print ++n " 0 0"}{print;f=1}' ./infile
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1

6 0 0

7 0 0

8 0 1

9 0 0

10 0 1

11 0 1
Explanation
NF{f=0;n=$1;print;next}: if the current line has data, unset flag f, save the number in the first field to n, print the line and skip the rest of the script
{print;f=1}: We only reach this action if the current line is blank. If so, print the line and set the flag f
f{print ++n " 0 0"}: We only execute this action if the flag f is set, which only happens when the previous line was also blank. If we enter this action, we print the missing record with an incremented n
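Spread out with comments, the same program reads:
awk '
NF {                      # data line: remember its leading number and print it
    f = 0
    n = $1
    print
    next
}
f {                       # a second consecutive blank line means a number is
    print ++n " 0 0"      # missing, so emit a filler record for it
}
{                         # any blank line: print it and arm the flag
    print
    f = 1
}' ./infile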
You can try something like this. The benefit of this approach is that your input file need not have the extra empty lines for the missing numbers.
awk -v RS="" -v ORS="\n\n" -v OFS="\n" '
BEGIN{getline; col=$1;line=$0;print line}
$1==col{print $0;next }
($1==col+1){print $0;col=$1;next}
{x=$1;y=$0; col++; while (col < x) {print col" 0 0";col++};print y;next}' file
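The same script, spread out with comments (a sketch):
awk -v RS="" -v ORS="\n\n" -v OFS="\n" '
BEGIN {                                   # RS="" reads one blank-line-separated
    getline                               # group per record; grab the first one,
    col = $1                              # remember its leading number
    print                                 # and print it
}
$1 == col     { print; next }             # same number: pass the group through
$1 == col + 1 { print; col = $1; next }   # next number: pass through and advance
{                                         # a numeric gap: fill in missing records
    x = $1
    for (col++; col < x; col++)
        print col " 0 0"
    print
}' file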
Input File:
[jaypal:~/Temp] cat file
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1



8 0 1


10 0 1

11 0 1
Script Output:
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1

6 0 0

7 0 0

8 0 1

9 0 0

10 0 1

11 0 1
