This is probably a very basic problem but I am stumped.
I am attempting to create a new file from two large tab-delimited files with a common column. The heads of the two files are:
file1
k141_1 319 4 0
k141_2 400 9 0
k141_3 995 43 0
k141_4 670 21 0
k141_5 372 8 0
k141_6 359 9 0
k141_7 483 18 0
k141_8 1826 76 0
k141_9 566 15 0
k141_10 462 14 0
file2
U k141_1 0
U k141_11 0
U k141_24 0
U k141_30 0
C k141_32 2 18 77133,212695,487010, 5444279,5444689,68971626, TIEYSSLHACRSTLEDPT, cellular organisms; Bacteria;
C k141_38 1566886 16 1566886, 50380646, ELVMDREAWCAAIHGV, cellular organisms; Bacteria; Terrabacteria group; Actinobacteria; Actinobacteria; Corynebacteriales; Mycobacteriaceae; Mycobacterium; Mycobacterium sp. WCM 7299;
U k141_46 0
C k141_57 186802 23 1496,1776046,1776047, 64601048,64601468,64601628,64603689,64604310,64605360,71436886,71436980,71437249,71437272,71437295, CLLYTSDAADDLLCVDLGGRRII, cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Clostridia; Clostridiales;
U k141_64 0
C k141_73 131567 14 287,305,1496,2209,1483596, 47871795,47873311,47873322,47880313,47880625,53485494,53485498,62558724,71434583,71434608, LSRGLGDVYKRQIL,SCLVGSEMCIRDRY,YLSLIHISEPTRQE, cellular organisms;
I want the new file to contain all 4 columns from file 1 and the 8th column of file 2 (the taxonomic information separated by semicolons).
I have attempted to sort the files based on the common column, but the outputs are not the same despite the columns having the exact same values.
For example,
[user#compute02 Contigs]$ sort -k 1 file1 | head
k141_1000 312 253 0
k141_1001 553 13 0
k141_1002 518 19 0
k141_1003 812 30 0
k141_1004 327 13 0
k141_1005 454 18 0
k141_100 595 20 0
k141_1006 1585 78 0
k141_1007 537 23 0
[user#compute02 Contigs]$ sort -k 2 file2 | head
U k141_1 0
C k141_1000 305 26 305, 62554095,62558735, PVSYTHLRAHETRGNLVCRLLLEKKK, cellular organisms; Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales; Burkholderiaceae; Ralstonia; Ralstonia solanacearum;
C k141_1001 946362 11 946362, 5059526, SGRNGLPLKVR, cellular organisms; Eukaryota; Opisthokonta; Choanoflagellida; Craspedida; Salpingoecidae; Salpingoeca; Salpingoeca rosetta;
C k141_1002 131567 15 287,305,2209,1483596, 47870166,47873029,47873592,53485045,55518854,62558495, RTCLLYTSPSPRDKR,NLSLIHISEPTRQEA,EPVSYTHLRAHETRG, cellular organisms;
C k141_100 2 14 287,1496,1776047, 53544868,64603691,71437007, SRSSAASDVYKRQV, cellular organisms; Bacteria;
U k141_1003 0
C k141_1004 2 14 518,1776046,1776047, 28571314,64603094,64605737, LFFFNDTATTEIYT, cellular organisms; Bacteria;
U k141_1005 0
C k141_1006 948 13 948, 73024016, QAPLSMGFSRQEY, cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria; Rickettsiales; Anaplasmataceae; Anaplasma; phagocytophilum group; Anaplasma phagocytophilum;
C k141_1007 287 14 287, 50594737, RRQRQMCIRDRVGS, cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Pseudomonadaceae; Pseudomonas; Pseudomonas aeruginosa group; Pseudomonas aeruginosa;
Any assistance would be greatly appreciated :)
This solution should work.
# For every ID in column 1 of file1, print its file1 line followed by
# everything after column 7 of the matching file2 line (the taxonomy).
for i in `awk '{print $1}' file1.txt`
do
    F1=`grep -w "$i" file1.txt`
    F2=`grep -w "$i" file2.txt | awk '{$1=$2=$3=$4=$5=$6=$7=""; print $0}'`
    echo $F1 $F2
done
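An alternative worth trying is join, which avoids re-reading the files for every ID. This is only a sketch, assuming both files are tab-delimited, the contig ID is column 1 of file1 and column 2 of file2, and the taxonomy is column 8 of file2. Note that sort -k 1 uses everything from field 1 to the end of the line as the key, so the differing trailing columns push k141_100 to different places in the two files; -k1,1 (and -k2,2 for file2) restricts the key to the ID field alone:
LC_ALL=C sort -t $'\t' -k1,1 file1 > file1.sorted
LC_ALL=C sort -t $'\t' -k2,2 file2 > file2.sorted
# keep the 4 columns of file1 and append column 8 of file2 (empty for the short "U" lines)
LC_ALL=C join -t $'\t' -1 1 -2 2 -o 1.1,1.2,1.3,1.4,2.8 file1.sorted file2.sorted > merged.tsv
Setting LC_ALL=C keeps sort and join in agreement about the ordering of the keys.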
I have a text file like this small example:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 150 151 2 BA
chr10:103909786-103910082 152 153 1 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 294 295 4 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 2932 2933 2 CA
chr10:104573088-104576021 58 59 1 BA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
In this file there are 5 tab-separated columns. The first column is considered the ID; for example, in the first row the whole "chr10:103909786-103910082" is the ID.
1- In the 1st step I would like to filter out rows based on the 4th column: if the number in the 4th column is less than 10 and the group in the 5th column of the same row is BA, that row is filtered out; likewise, if the number in the 4th column is less than 5 and the group in the 5th column is CA, that row is filtered out.
2- In the 2nd step I sum up the 4th column per ID and group. The 1st column contains repeated values that represent the same ID, and each ID has both BA and CA in the 5th column. To get one value for CA, I add up all values in the 4th column that belong to the same ID and are classified as CA, and I do the same for BA.
3- In the 3rd step I want one ratio per ID, so in the output every ID appears only once; the final value for each ID is the ratio CA/BA of the two sums. The expected output for the small example would look like this:
1- after filtration:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
2- after summarizing each group (CA and BA):
chr10:103909786-103910082 147 148 35 BA
chr10:103909786-103910082 274 275 35 CA
chr10:104573088-104576021 2925 2926 144 CA
chr10:104573088-104576021 819 820 45 BA
3- the final output (the ratio is computed from the values in the 4th column):
chr10:103909786-103910082 1
chr10:104573088-104576021 3.2
In the above lines, 1 = 35/35 and 3.2 = 144/45.
I am trying to do that in awk:
awk -F "\t" '{ (if($4 < -10 & $5==BA)), (if($4 < -5 & $5==CA)) ; print $2 = BA/CA} file.txt > out.txt
I tried to follow the steps mentioned above in the code, but did not succeed. Do you know how to solve the problem?
If the records with the same ID are always consecutive, you can do it like this:
awk 'ID!=$1 {
if (ID) {
print ID, a["CA"]/a["BA"]; a["CA"]=a["BA"]=0;
}
ID=$1
}
$5=="BA" && $4>=10 || $5=="CA" && $4>=5 { a[$5]+=$4 }
END{ print ID, a["CA"]/a["BA"] }' file.txt
The first block tests whether the ID has changed; if so, it prints the previous ID with its CA/BA ratio and resets the sums.
The second block filters out unwanted records and accumulates the 4th-column values per group.
The END block displays the result for the last ID.
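With the sample file from the question, this should print exactly the expected final output:
chr10:103909786-103910082 1
chr10:104573088-104576021 3.2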
I have a very large tab-separated file, a part of which looks like this:
33 x 171 297 126
4 x 171 300 129
2 x 171 303 132
11 y 163 289 126
5 y 163 290 127
3 y 163 291 128
2 y 163 292 129
2 y 170 289 119
2 z 166 307 141
2 z 166 308 142
6 z 166 309 143
4 z 166 329 163
2 z 166 330 164
I want to sort and select only one line for each of x, y, and z, based on the highest value associated with it in the first column (in Unix).
You can do this with awk:
awk '
{
  key = $2;                                 # group lines by the second column (x, y, z)
  flag = 0;
  if (key in value) { max = value[key]; flag = 1 };
  # keep this line if the key is new, or if its first column beats the stored maximum
  if (flag == 0 || max < $1) { value[key] = $1; line[key] = $0 };
}
END {
  for (key in line) { print line[key] };    # print the winning line for every key
}
' data.tsv
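For comparison, a shorter sort-based sketch (assuming the data is in data.tsv and the first column is numeric) sorts by the key and then by the first column in descending numeric order, and keeps the first line seen for each key:
sort -k2,2 -k1,1nr data.tsv | awk '!seen[$2]++'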
I have two text files: one is composed of about 60,000 rows and 14 columns, and the other has a single column containing a subset of one of the columns (the first column) of the first file. I would like to filter File 1 based on the ID names in File 2. I tried some commands found on the net, but none of them were useful. Here are a few lines of the two text files (I'm on a Linux system):
File 1:
Contig100 orange1.1g013919m 75.31 81 12 2 244 14 2 78 4e-29 117 1126 435
Contig1000 orange1.1g045442m 65.50 400 130 2 631 1809 2 400 1e-156 466 2299 425
Contig10005 orange1.1g003445m 83.86 824 110 2 3222 808 1 820 0.0 1322 3583 820
Contig10006 orange1.1g047384m 81.82 22 4 0 396 331 250 271 7e-05 41.6 396 412
File 2:
Contig1
Contig1000
Contig10005
Contig10017
Please let me know your suggestions for solving this issue.
Thanks in advance.
You can do this with python:
# Read the IDs to keep into a set, then print the lines of the data file
# whose first column is one of those IDs.
with open('filter.txt', 'r') as f:
    mask = set(f.read().split())

with open('data.txt', 'r') as f:
    while True:
        l = f.readline()
        if not l:
            break
        fields = l.split()
        if fields and fields[0] in mask:
            print(l.rstrip('\n'))
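Assuming the script is saved as, say, filter_by_id.py (the name is only an example) with the ID list in filter.txt and the table in data.txt, it can be run as python3 filter_by_id.py > filtered.txt to write the kept lines to a new file.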
If you're on Linux/Mac, you can do it on the command line (the $ symbolizes the command prompt, don't type it).
First, we create a file called file2-patterns from your file2 by appending " .*" to each line:
$ while read line; do echo "$line .*"; done < file2 > file2-patterns
And have a look at that file:
$ cat file2-patterns
Contig1 .*
Contig1000 .*
Contig10005 .*
Contig10017 .*
Now we can use these patterns to select the matching lines from file1:
$ grep -f file2-patterns file1
Contig1000 orange1.1g045442m 65.50 400 130 2 631 1809 2 400 1e-156 466 2299 425
Contig10005 orange1.1g003445m 83.86 824 110 2 3222 808 1 820 0.0 1322 3583 820
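If the match on the first column should be exact (so that Contig1 does not also pick up Contig100, Contig1000 and so on), an awk sketch along these lines could be used instead, with the same file names as in the question:
# load the IDs from file2 into an array, then print file1 lines whose first field is listed
awk 'NR==FNR { ids[$1]; next } $1 in ids' file2 file1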
I am trying to use awk to parse a tab delimited table -- there are several duplicate entries in the first column, and I need to remove the duplicate rows that have a smaller total sum of the other 4 columns in the table. I can remove the first or second row easily, and sum the columns, but I'm having trouble combining the two. For my purposes there will never be more than 2 duplicates.
Example file: http://pastebin.com/u2GBnm2D
Desired output in this case would be to remove the rows:
lmo0330 1 1 0 1
lmo0506 7 21 2 10
And keep the other two rows with the same gene id in the column. The final parsed file would look like this: http://pastebin.com/WgDkm5ui
Here's what I have tried (this doesn't do anything, but the first part removes the second duplicate and the second part sums the counts):
awk 'BEGIN {!a[$1]++} {for(i=1;i<=NF;i++) t+=$i; print t; t=0}'
I tried modifying the 2nd part of the script in the best answer of this question: Removing lines containing a unique first field with awk?
awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile
But unfortunately I don't really understand what's going on well enough to get it working. Can anyone help me out? I think I need to replace the a[$1] > 1 part with [remove the first or second duplicate count, depending on which is larger].
EDIT: I'm also using GNU Awk 3.1.7 if that matters.
You can use this awk command:
awk 'NR == 1 {
    print;                        # keep the header line as-is
    next
} {
    s = $2+$3+$4+$5               # total of the four count columns for this row
} s >= sum[$1] {                  # keep the row with the larger total for each gene id
    sum[$1] = s;
    if (!($1 in rows))
        a[++n] = $1;              # remember the order in which gene ids first appear
    rows[$1] = $0
} END {
    for(i=1; i<=n; i++)
        print rows[a[i]]          # print the kept rows in their original order
}' file | column -t
Output:
gene SRR034450.out.rpkm_0 SRR034451.out.rpkm_0 SRR034452.out.rpkm_0 SRR034453.out.rpkm_0
lmo0001 160 323 533 293
lmo0002 135 317 504 306
lmo0003 1 4 5 3
lmo0004 35 59 58 48
lmo0005 113 218 257 187
lmo0006 279 519 653 539
lmo0007 563 1053 1165 1069
lmo0008 34 84 203 107
lmo0009 13 45 90 49
lmo0010 57 210 237 169
lmo0011 65 224 247 179
lmo0012 65 226 250 215
lmo0013 342 500 738 682
lmo0014 662 1032 1283 1311
lmo0015 321 413 631 637
lmo0016 175 253 273 325
lmo0017 3 6 6 6
lmo0018 33 38 46 45
lmo0019 13 1 39 1
lmo0020 3 12 28 15
lmo0021 3 4 14 12
lmo0022 2 3 5 1
lmo0023 2 0 3 2
lmo0024 1 0 2 6
lmo0330 1 1 1 3
lmo0506 151 232 60 204