Problems combining awk scripts - Linux

I am trying to use awk to parse a tab-delimited table: there are several duplicate entries in the first column, and I need to remove the duplicate row that has the smaller sum of the other 4 columns. I can remove the first or second row easily, and I can sum the columns, but I'm having trouble combining the two. For my purposes there will never be more than 2 duplicates.
Example file: http://pastebin.com/u2GBnm2D
Desired output in this case would be to remove the rows:
lmo0330 1 1 0 1
lmo0506 7 21 2 10
And keep the other two rows with the same gene id in the column. The final parsed file would look like this: http://pastebin.com/WgDkm5ui
Here's what I have tried (as written this doesn't do anything useful, but on its own the first part removes the second duplicate and the second part sums the counts):
awk 'BEGIN {!a[$1]++} {for(i=1;i<=NF;i++) t+=$i; print t; t=0}'
I tried modifying the 2nd part of the script in the best answer of this question: Removing lines containing a unique first field with awk?
awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile
But unfortunately I don't really understand what's going on well enough to get it working. Can anyone help me out? I think I need to replace the a[$1] > 1 part with a test that keeps whichever duplicate has the larger sum and removes the other one.
EDIT: I'm also using GNU Awk 3.1.7 if that matters.

You can use this awk command:
awk 'NR == 1 {
    print;
    next
} {
    s = $2+$3+$4+$5
} s >= sum[$1] {
    sum[$1] = s;
    if (!($1 in rows))
        a[++n] = $1;
    rows[$1] = $0
} END {
    for(i=1; i<=n; i++)
        print rows[a[i]]
}' file | column -t
Output:
gene SRR034450.out.rpkm_0 SRR034451.out.rpkm_0 SRR034452.out.rpkm_0 SRR034453.out.rpkm_0
lmo0001 160 323 533 293
lmo0002 135 317 504 306
lmo0003 1 4 5 3
lmo0004 35 59 58 48
lmo0005 113 218 257 187
lmo0006 279 519 653 539
lmo0007 563 1053 1165 1069
lmo0008 34 84 203 107
lmo0009 13 45 90 49
lmo0010 57 210 237 169
lmo0011 65 224 247 179
lmo0012 65 226 250 215
lmo0013 342 500 738 682
lmo0014 662 1032 1283 1311
lmo0015 321 413 631 637
lmo0016 175 253 273 325
lmo0017 3 6 6 6
lmo0018 33 38 46 45
lmo0019 13 1 39 1
lmo0020 3 12 28 15
lmo0021 3 4 14 12
lmo0022 2 3 5 1
lmo0023 2 0 3 2
lmo0024 1 0 2 6
lmo0330 1 1 1 3
lmo0506 151 232 60 204
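The command above keeps, for each gene id, the row whose $2+$3+$4+$5 total is largest (with >=, a tie keeps the later row), and the a[++n] array preserves the first-seen order of the ids for the END loop. If you would rather build on the two-pass FNR==NR idiom you quoted, a sketch along these lines should also work with GNU Awk 3.1.7 (note that if both duplicates have exactly the same total, this version prints both):
awk 'NR == FNR {                      # first pass: record the largest total per gene id
    s = $2 + $3 + $4 + $5
    if (!($1 in max) || s > max[$1]) max[$1] = s
    next
}
FNR == 1 { print; next }              # second pass: always keep the header line
$2 + $3 + $4 + $5 == max[$1]          # ...and every row equal to the recorded maximum
' ./infile ./infile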

Related

Extract columns from multiple text files with bash or awk or sed?

I am trying to extract column1 and column4 from multiple text files.
file1.txt:
#rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
CFLAU10s46802|kraken:taxid|33189 1 125 2 105 84 1.68 36.8 24
CFLAU10s46898|kraken:taxid|33189 1 116 32 116 100 23.5862 35.7 19.4
CFLAU10s46988|kraken:taxid|33189 1 105 2 53 50.4762 1.00952 36.9 11
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 1 1102 2 88 7.98548 0.15971 36.4 10
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 1 2133 6 113 5.2977 0.186592 36.6 13
file2.txt:
#rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
CFLAU10s46802|kraken:taxid|33189 1 125 5 105 84 1.68 36.8 24
CFLAU10s46898|kraken:taxid|33189 1 116 40 116 100 23.5862 35.7 19.4
CFLAU10s46988|kraken:taxid|33189 1 105 6 53 50.4762 1.00952 36.9 11
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 1 1102 2 88 7.98548 0.15971 36.4 10
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 1 2133 6 113 5.2977 0.186592 36.6 13
Output format (save the output as merged.txt in another directory): in the output file, column 1 (#rname) should appear only once, since it is the same in every file, but there should be one numreads column (column 4) per input file, and each of those columns should be named after its source file.
Output file looks like:
#rname file1_numreads file2_numreads
CFLAU10s46802|kraken:taxid|33189 2 5
CFLAU10s46898|kraken:taxid|33189 32 40
CFLAU10s46988|kraken:taxid|33189 2 6
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 2 88
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 6 113
Your suggestions would be appreciated.
Here is something I put together. awk gurus might have a simpler, shorter version, but I am still learning awk.
Create a file script.awk and make it executable. Put in it:
#!/usr/bin/awk -f
BEGIN { FS="\t" }
# process files, ignoring comments
!/^#/ {
    # keep the first column values.
    # Only add a new value if it is not already in the array.
    if (!($1 in firstcolumns)) {
        firstcolumns[$1] = $1
    }
    # extract the 4th column of file1, put it in the array as (column 1).1
    if (FILENAME == ARGV[1]) {
        results[$1 ".1"] = $4
    }
    # extract the 4th column of file2, put it in the array as (column 1).2
    if (FILENAME == ARGV[2]) {
        results[$1 ".2"] = $4
    }
}
# print the results
END {
    # for each first column value...
    for (key in firstcolumns) {
        # Print the first column, then (column 1).1, then (column 1).2
        print key "\t" results[key ".1"] "\t" results[key ".2"]
    }
}
Call it like this: ./script.awk file1.txt file2.txt.
Since awk parses the files line by line, I keep the possible values of the first column in an array (firstcolumns).
For each line, if the 4th column comes from file1.txt (ARGV[1]) I store it in the results array under (firstcolumn).1.
For each line, if the 4th column comes from file2.txt (ARGV[2]) I store it in the results array under (firstcolumn).2.
In the END block, loop through the possible firstcolumn values and print the values (firstcolumn).1 and (firstcolumn).2, separated by "\t" for tabs.
Results:
$ ./so.awk file1.txt file2.txt
AUZW01004514.1 C4 C4
CFLAU10s46988|kraken:taxid|33189 2 6
CFLAU10s46802|kraken:taxid|33189 2 5
AUZW01004739.1 C4 C4
CFLAU10s46898|kraken:taxid|33189 32 40
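For what it's worth, a variant that generalizes to any number of input files, emits a header row, and keeps the rnames in first-seen order might look like this (only a sketch: it assumes the files really are tab-delimited, so that rnames containing spaces stay in $1, and that every file starts with the #rname header line):
#!/usr/bin/awk -f
# merge column 4 (numreads) from any number of coverage files, keyed on column 1 (#rname)
BEGIN { FS = OFS = "\t" }
FNR == 1 {                                # header line of each input file
    nfiles++
    label = FILENAME
    sub(/\.[^.]*$/, "", label)            # file1.txt -> file1
    header = header OFS label "_numreads"
    next
}
{
    if (!($1 in seen)) {                  # remember each rname once, in first-seen order
        seen[$1] = 1
        order[++nids] = $1
    }
    reads[$1, nfiles] = $4                # column 4 (numreads) of the current file
}
END {
    print "#rname" header
    for (i = 1; i <= nids; i++) {
        line = order[i]
        for (f = 1; f <= nfiles; f++)
            line = line OFS reads[order[i], f]
        print line
    }
}
Saved as, say, merge.awk (a name picked here just for illustration) and made executable, it would be called as ./merge.awk file1.txt file2.txt > merged.txt.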

How can I use multiple operations in awk to edit a text file

I have a text file like this small example:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 150 151 2 BA
chr10:103909786-103910082 152 153 1 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 294 295 4 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 2932 2933 2 CA
chr10:104573088-104576021 58 59 1 BA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
In this file there are 5 tab-separated columns. The first column is considered the ID; for example, in the first row the whole "chr10:103909786-103910082" is the ID.
1- In the 1st step I would like to filter out rows based on the 4th column: if the number in the 4th column is less than 10 and the group in the 5th column is BA, that row is filtered out; likewise, if the number in the 4th column is less than 5 and the group in the 5th column is CA, that row is filtered out.
2- In the 2nd step, for each ID (the repeated values in the 1st column represent the same ID), I sum the values in the 4th column separately for the rows classified as BA and for the rows classified as CA, which gives one BA value and one CA value per ID.
3- In the 3rd step I take the ratio CA/BA of those two sums, so every ID appears only once in the output. The expected output for the small example would look like this:
1- after filtration:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
2- after summarizing each group (CA and BA):
chr10:103909786-103910082 147 148 35 BA
chr10:103909786-103910082 274 275 35 CA
chr10:104573088-104576021 2925 2926 144 CA
chr10:104573088-104576021 819 820 45 BA
3- the final output (the ratio is computed from the values in the 4th column):
chr10:103909786-103910082 1
chr10:104573088-104576021 3.2
in the above lines, 1 = 35/35 and 3.2 = 144/45.
I am trying to do that in awk:
awk -F "\t" '{ (if($4 < -10 & $5==BA)), (if($4 < -5 & $5==CA)) ; print $2 = BA/CA} file.txt > out.txt
I tried to follow the steps mentioned above in the code but did not succeed. Do you know how to solve the problem?
If the records with the same ID are always consecutive, you can do that:
awk 'ID != $1 {
    if (ID) {
        print ID, a["CA"]/a["BA"]
        a["CA"] = a["BA"] = 0
    }
    ID = $1
}
$5=="BA" && $4>=10 || $5=="CA" && $4>=5 { a[$5] += $4 }
END { print ID, a["CA"]/a["BA"] }' file.txt
The first block tests whether the ID has changed; if so, it prints the previous ID with its ratio and resets the accumulators.
The second block filters out unwanted records and accumulates the 4th-column values per group (BA or CA).
The END block prints the result for the last ID.
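If the records with the same ID are not guaranteed to be consecutive, a variant that accumulates per ID and prints everything in the END block should work too (a sketch; for-in order is arbitrary in awk, so pipe the output through sort if you need sorted IDs, and an ID left with no BA rows after filtering would cause a division by zero):
awk -F "\t" '
$5=="BA" && $4>=10 || $5=="CA" && $4>=5 {   # same filter as above
    sum[$1, $5] += $4                       # accumulate column 4 per (ID, group)
    ids[$1] = 1
}
END {
    for (id in ids)
        print id, sum[id, "CA"] / sum[id, "BA"]
}' file.txt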

sort and uniquize a tab separated file for two columns [closed]

I have a very large tab-separated file, part of which looks like this:
33 x 171 297 126
4 x 171 300 129
2 x 171 303 132
11 y 163 289 126
5 y 163 290 127
3 y 163 291 128
2 y 163 292 129
2 y 170 289 119
2 z 166 307 141
2 z 166 308 142
6 z 166 309 143
4 z 166 329 163
2 z 166 330 164
I want to sort and select only one line for each of x, y, and z, based on the highest value associated with it in the first column (on Unix).
You can do this with awk:
awk '
{
    key = $2;
    flag = 0;
    if (key in value) { max = value[key]; flag = 1 };
    if (flag == 0 || max < $1) { value[key] = $1; line[key] = $0 };
}
END {
    for (key in line) { print line[key] };
}
' data.tsv
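A more compact way to express the same idea (just a sketch; like the version above, it prints the surviving lines in awk's arbitrary array order, so pipe through sort if the order matters):
awk '!($2 in value) || $1 > value[$2] { value[$2] = $1; line[$2] = $0 }
     END { for (key in line) print line[key] }' data.tsv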

awk with zero output

I have four columns and I would like to do this:
INPUT=
429 0 10 0
287 115 89 64
0 629 0 10
542 0 7 0
15 853 0 12
208 587 5 4
435 203 12 0
604 411 27 3
0 232 0 227
471 395 5 5
802 706 15 15
1288 1135 11 23
1063 386 13 2
603 678 7 14
0 760 0 11
awk '{if (($2+$4)/($1+$3)<0.2 || ($1+$3)==0) print $0; else if (($1+$3)/($2+$4)<0.2 || ($2+$4)==0) print $0; else print $0}' INPUT
But I get this error message:
awk: cmd. line:1: (FILENAME=- FNR=3) fatal: division by zero attempted
Even though I have added the condition:
...|| ($1+$3)==0...
Can somebody explain to me what I am doing wrong?
Thank you so much.
PS: print $0 is just for illustration.
Move the "($1+$3) == 0" to be the first clause of the if statement. Awk will evalulate them in turn. Hence it still attempts the first clause of the if statement first, triggering the divide by zero attempt. If the first clause is true, it won't even attempt to evaulate the second one. So:-
awk '{if (($1+$3)==0 || ($2+$4)/($1+$3)<0.2) print $0; else if (($1+$3)/($2+$4)<0.2 || ($2+$4)==0) print $0; else print $0}' INPUT
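To see the short-circuit behaviour in isolation, a throwaway one-liner like this (purely illustrative) prints "safe" instead of aborting, because once the left-hand side of || is true the right-hand side is never evaluated:
awk 'BEGIN { x = 0; if (x == 0 || 1/x < 0.2) print "safe" }'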
You're already dividing by zero in your conditional statement ($1+$3 is 0 on the third line of your input, which is the line the error message points at with FNR=3). That's where the error comes from. You should change the ordering in your conditional statement: first verify that $1+$3 != 0, and only then use it in the division that defines your next condition.

Filter a large text file using ID in another text file

I have two text files: one is composed of about 60,000 rows and 14 columns, and the other has a single column containing a subset of one of the columns (the first column) of the first file. I would like to filter File 1 based on the ID names in File 2. I tried some commands from the net but none of them were useful. Here are a few lines of the two text files (I'm on a Linux system).
File 1:
Contig100 orange1.1g013919m 75.31 81 12 2 244 14 2 78 4e-29 117 1126 435
Contig1000 orange1.1g045442m 65.50 400 130 2 631 1809 2 400 1e-156 466 2299 425
Contig10005 orange1.1g003445m 83.86 824 110 2 3222 808 1 820 0.0 1322 3583 820
Contig10006 orange1.1g047384m 81.82 22 4 0 396 331 250 271 7e-05 41.6 396 412
File 2:
Contig1
Contig1000
Contig10005
Contig10017
Please let me know your suggestions for solving this issue.
Thanks in advance.
You can do this with Python:
with open('filter.txt', 'r') as f:
    # read the wanted IDs into a set (one per line) so that, for example,
    # Contig1 does not accidentally match Contig1000 by substring
    mask = set(line.strip() for line in f)
with open('data.txt', 'r') as f:
    while True:
        l = f.readline()
        if not l:
            break
        fields = l.split()
        # keep the line if its first whitespace-separated field is a wanted ID
        if fields and fields[0] in mask:
            print(l.rstrip('\n'))
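A minimal way to run it, assuming you save it as filter_ids.py (a name chosen here just for illustration) and that your inputs really are called data.txt and filter.txt as hard-coded above, would be:
python3 filter_ids.py > filtered.txt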
If you're on Linux/Mac, you can do it on the command line (the $ symbolizes the command prompt, don't type it).
First, we create a file2-patterns from your file2 by appending .* to each line:
$ while read line; do echo "$line .*"; done < file2 > file2-patterns
And have a look at that file:
$ cat file2-patterns
Contig1 .*
Contig1000 .*
Contig10005 .*
Contig10017 .*
Now we can use these patterns to filter out lines from file1.
$ grep -f file2-patterns file1
Contig1000 orange1.1g045442m 65.50 400 130 2 631 1809 2 400 1e-156 466 2299 425
Contig10005 orange1.1g003445m 83.86 824 110 2 3222 808 1 820 0.0 1322 3583 820
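Since the rest of this page is about awk, the same filtering can also be done with the classic two-file awk idiom (a sketch; it compares the first whitespace-separated field of file1 against the IDs in file2 exactly, so Contig1 will not pull in Contig1000):
awk 'NR == FNR { ids[$1]; next }   # first file: remember every wanted ID
     $1 in ids                     # second file: print rows whose first field is a wanted ID
' file2 file1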

Resources