Adding rows to a .txt file with 2 tab separated columns - text

I have a two-column (tab-separated) .txt file that looks like:
1.00 GO:0005789,GO:0016021,GO:0005509,GO:0005506
3.33 GO:0005615,GO:0030325,GO:0009653
1.67 GO:0005615,GO:0030325
26.76 GO:0005737,GO:0003993,GO:0004726,GO:0004725
And I want to transform it into a two-column .txt file like:
1.00 GO:0005789
1.00 GO:0016021
1.00 GO:0005509
1.00 GO:0005506
3.33 GO:0005615
3.33 GO:0030325
3.33 GO:0009653
1.67 GO:0005615
1.67 GO:0030325
26.76 GO:0005737
26.76 GO:0003993
26.76 GO:0004726
26.76 GO:0004725
I tried sed 's/\(^[^,]*\).*/\1/g' <in.txt, but it deletes every GO term except the first one on each line. It gives me this:
1.00 GO:0005789
3.33 GO:0005615
1.67 GO:0005615
26.76 GO:0005737
Any suggestions? Solutions using sed or anything else are welcome.
Thanks in advance.

Use awk for that:
awk -F',| +|\t' '{for(i=2;i<=NF;i++){print $1" "$i}}' input.txt

You could use awk for this:
$ cat test.txt
1.00 GO:0005789,GO:0016021,GO:0005509,GO:0005506
3.33 GO:0005615,GO:0030325,GO:0009653
1.67 GO:0005615,GO:0030325
26.76 GO:0005737,GO:0003993,GO:0004726,GO:0004725
$ awk -F'[\t,]' '{for (i=2;i<=NF;i++) print $1"\t"$i }' test.txt
Result:
1.00 GO:0005789
1.00 GO:0016021
1.00 GO:0005509
1.00 GO:0005506
3.33 GO:0005615
3.33 GO:0030325
3.33 GO:0009653
1.67 GO:0005615
1.67 GO:0030325
26.76 GO:0005737
26.76 GO:0003993
26.76 GO:0004726
26.76 GO:0004725
Explanation
-F sets the field separator. Here it is the bracket expression [\t,], so a field ends at either a tab or a comma.
NF is the number of fields on the current line. We loop from field 2 through field NF, and for each one we print the first field, a tab, and the current field.
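A variation on the answers above, as a minimal sketch: keep the tab as the only field separator and break the second column apart with split(), so the first column is never touched by the comma handling. The helper name split_terms and the two sample lines are made up for illustration.

```shell
# split_terms is a hypothetical helper name, not something from the answers above.
# It reads tab-separated lines and prints one "value<TAB>term" row per GO term.
split_terms() {
    awk -F'\t' -v OFS='\t' '{
        n = split($2, terms, ",")      # break column 2 on commas
        for (i = 1; i <= n; i++)
            print $1, terms[i]         # repeat column 1 for every term
    }' "$@"
}

# Small input based on the first lines of the question.
printf '1.00\tGO:0005789,GO:0016021\n3.33\tGO:0005615\n' | split_terms
```

With a file name as argument (for example split_terms in.txt > out.txt) it reads the file instead of standard input.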

Related

How to replace a certain number of row values/text in a certain column with different text/values in Linux?

I have a file with a column (5 in the pic) that is filled with X's. I need to replace those X's with different letters. However, I need a certain amount of rows in that column to be the same letter, and then for another set of rows to be another letter, and so on. Example shown in 2nd pic. To be more clear, I need lines 2-10 to have the letter A in column 5, lines 11-20 to letter B, lines 21-30 the letter C, and so on. Is there a way to do this in Linux replacing those X's in the file by giving the row/line ranges and the letter I want, but NOT saving to a new file? I need a faster way than by hand because I have over a million lines in the file, and I have about 5,000 files to change.
What I have
What I need
awk '{ match($0,$5);printf "%s%c%s\n",substr($0,1,RSTART-1),64+$6,substr($0,RSTART+RLENGTH) }' file
Using awk, the match function finds the first occurrence of the text of the 5th space-delimited field in the line (this assumes that text does not appear earlier in the line). We then print the start of the line up to that match, the character whose code is 64 plus the 6th field (1 gives A, 2 gives B, and so on), and then the rest of the line.
sample used:
ATOM 1 CA HIE X 1 105.967 123.567 112.345 0.00 0.00
ATOM 1 CA HIE X 2 105.967 123.567 112.345 0.00 0.00
ATOM 1 CA HIE X 3 105.967 123.567 112.345 0.00 0.00
Output:
ATOM 1 CA HIE A 1 105.967 123.567 112.345 0.00 0.00
ATOM 1 CA HIE B 2 105.967 123.567 112.345 0.00 0.00
ATOM 1 CA HIE C 3 105.967 123.567 112.345 0.00 0.00
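The answer above picks the letter from the 6th field, while the question asks for letters by line range. A minimal sketch of that variant follows. It assumes line 1 is left alone, lines 2-10 get A, 11-20 get B, 21-30 get C, and so on; the helper name relabel_by_line and the generated input are made up. Assigning to $5 rejoins the fields with single spaces, so it is not suitable as written for files that rely on fixed column positions.

```shell
# relabel_by_line is a hypothetical helper. Assumptions: line 1 is left alone,
# lines 2-10 get A, 11-20 get B, 21-30 get C, and so on (one letter per block
# of ten line numbers). Assigning to $5 rejoins the fields with single spaces.
relabel_by_line() {
    awk 'NR > 1 { $5 = sprintf("%c", 65 + int((NR - 1) / 10)) } 1' "$@"
}

# Build 21 made-up lines shaped like the sample and show lines 10, 11 and 21.
awk 'BEGIN { for (i = 1; i <= 21; i++) print "ATOM", i, "CA", "HIE", "X", i }' |
    relabel_by_line | sed -n '10p;11p;21p'
```

After 26 blocks the character code runs past Z, so a real file with a million lines would need a different mapping. To change a file without keeping a second copy, one common pattern is to write to a temporary file and move it over the original; GNU awk also has an -i inplace option.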

AWK field substitution based on lookup table

I am trying to replace values in column 1 of file1 using a lookup table. A sample (tab separated):
chr1 1243 A T 0.14
chr5 1432 G C 0.0006
chr10 731 T C 0.9421
chr11 98234 T G .000032
chr12 1284 A T 0.93428
chr17 941 G T 0.1111
chr19 134325 T C 0.00001
chr21 9824 T C 0.9
Lookup table:
chr1 NC_000001.11
chr2 NC_000002.12
chr3 NC_000003.12
chr4 NC_000004.12
chr5 NC_000005.10
chr6 NC_000006.12
chr7 NC_000007.14
chr8 NC_000008.11
chr9 NC_000009.12
chr10 NC_000010.11
chr11 NC_000011.10
chr12 NC_000012.12
chr13 NC_000013.11
chr14 NC_000014.9
chr15 NC_000015.10
chr16 NC_000016.10
chr17 NC_000017.11
chr18 NC_000018.10
chr19 NC_000019.10
chr20 NC_000020.11
chr21 NC_000021.9
chr22 NC_000022.11
script being used:
awk 'FNR==NR{a[$1]=$2;next} {for (i in a)sub(i,a[i]);print' lookup.txt file1 > new_table.txt
output with comment on which line is correct/incorrect (with right answer in brackets):
NC_000001.11 1243 A T 0.14 #correct
NC_000005.10 1432 G C 0.0006 #correct
NC_000001.110 731 T C 0.9421 #incorrect (NC_000010.11)
NC_000001.111 98234 T G .000032 #incorrect (NC_000011.10)
NC_000012.12 1284 A T 0.93428 #correct
NC_000001.117 941 G T 0.1111 #incorrect (NC_000017.11)
NC_000001.119 134325 T C 0.00001 #incorrect (NC_000019.10)
NC_000021.9 9824 T C 0.9 #correct
I don't understand the pattern of why it isn't working and would welcome any help with the awk script. I thought it was just those with double digits e.g. chr17 but then chr21 seems to work fine.
Many thanks
Shouldn't it be:
awk 'FNR==NR{a[$1]=$2;next}{$1=a[$1]}1' lookup.txt file1
?
Output:
NC_000001.11 1243 A T 0.14
NC_000005.10 1432 G C 0.0006
NC_000010.11 731 T C 0.9421
NC_000011.10 98234 T G .000032
NC_000012.12 1284 A T 0.93428
NC_000017.11 941 G T 0.1111
NC_000019.10 134325 T C 0.00001
NC_000021.9 9824 T C 0.9
Explanation:
# true as long as we are reading the first file, lookup.txt
FNR==NR {
# create a lookup array 'a' indexed by field 1 of lookup.txt
a[$1]=$2
# don't process further actions
next
}
# because of the 'next' statement above, this will be only executed
# when we are processing the second file, file1
{
# translate field 1. use the value from the lookup array
$1=a[$1]
}
# always true. print the line
1
PS: If there's the possibility that entries can't be found in the lookup table, you could use a special text for them:
awk 'FNR==NR{a[$1]=$2;next}{$1=($1 in a)?a[$1]:"NOT FOUND "$1}1' lookup.txt file1
I believe sub is the problem in the OP's attempt (not checked thoroughly). This could be done simply by:
awk 'FNR==NR{arr[$1]=$2;next} ($1 in arr){$1=arr[$1];print}' lookup_table Input_file
Problem with the OP's attempt (for illustration only; do not run this to produce the expected output). The OP's code as posted does not look complete, so to see why it gives wrong output I rewrote it as follows:
awk 'FNR==NR{a[$1]=$2;next} {for (i in a){line=$0;if(sub(i,a[i])){print "(Previous line)"line">>>(array key)"i"....(array value)"a[i]"............(new line)"$0}}}' lookup_table Input_file
It prints a line only when a substitution actually happens, so we can see what goes wrong in the OP's code:
(Previous line)chr1 1243 A T 0.14>>>(array key)chr1....(array value)NC_000001.11............(new line)NC_000001.11 1243 A T 0.14
(Previous line)chr5 1432 G C 0.0006>>>(array key)chr5....(array value)NC_000005.10............(new line)NC_000005.10 1432 G C 0.0006
(Previous line)chr10 731 T C 0.9421>>>(array key)chr1....(array value)NC_000001.11............(new line)NC_000001.110 731 T C 0.9421
(Previous line)chr11 98234 T G .000032>>>(array key)chr1....(array value)NC_000001.11............(new line)NC_000001.111 98234 T G .000032
(Previous line)chr12 1284 A T 0.93428>>>(array key)chr12....(array value)NC_000012.12............(new line)NC_000012.12 1284 A T 0.93428
(Previous line)chr17 941 G T 0.1111>>>(array key)chr1....(array value)NC_000001.11............(new line)NC_000001.117 941 G T 0.1111
(Previous line)chr19 134325 T C 0.00001>>>(array key)chr1....(array value)NC_000001.11............(new line)NC_000001.119 134325 T C 0.00001
(Previous line)chr21 9824 T C 0.9>>>(array key)chr21....(array value)NC_000021.9............(new line)NC_000021.9 9824 T C 0.9
Here we can see what happens.
The first line is fine: the array key chr1 is replaced by the array value NC_000001.11. For chr10, chr11, chr17 and chr19, however, the key that matched was also chr1. sub treats the key as a regex, and chr1 matches the beginning of chr10, so only the chr1 part is replaced and the trailing digit is left behind (NC_000001.110). chr12 and chr21 came out right only because the loop happened to reach those keys before chr1 and chr2; the order in which for (i in a) visits keys is not specified.
It looks like sub is causing the issue: it treats the key as a regex and replaces the first match anywhere in the line. Instead, replace only the first field with the value stored under the index $1, and print the line with the shorthand 1:
awk 'FNR==NR{a[$1]=$2;next} {sub(/^[^ \t]+/,a[$1])}1' lookup.txt file1 > new_table.txt
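The pattern the question asks about can be reproduced in isolation. The sketch below uses a made-up two-entry lookup table and a single data line in a temporary directory. It first shows how sub() with the key chr1 mangles chr10, then an exact-key lookup that also keeps the tab separators (the -F and OFS settings are an addition of this sketch, not part of the answers above).

```shell
# Made-up two-entry lookup table and one data line, both tab-separated.
tmp=$(mktemp -d)
printf 'chr1\tNC_000001.11\nchr10\tNC_000010.11\n' > "$tmp/lookup.txt"
printf 'chr10\t731\tT\tC\t0.9421\n' > "$tmp/file1"

# sub() takes a regex, and "chr1" also matches the first four characters of "chr10".
prefix_bug=$(awk '{ sub("chr1", "NC_000001.11"); print }' "$tmp/file1")

# Exact-key lookup on field 1; -F and OFS keep the tabs between columns.
fixed=$(awk -F'\t' -v OFS='\t' '
    FNR == NR { a[$1] = $2; next }   # first file: build the lookup array
    ($1 in a) { $1 = a[$1] }         # second file: swap field 1 if known
    1                                # print every line
' "$tmp/lookup.txt" "$tmp/file1")

printf '%s\n%s\n' "$prefix_bug" "$fixed"
rm -rf "$tmp"
```

The first printed line starts with NC_000001.110 and the second with NC_000010.11. Whether a given chromosome was hit in the original run depended on whether the loop visited the shorter key first.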

Compare multiple rows to pick the one with highest value

I would like to compare the rows in the second column, and get the row with the highest value in the consecutive columns, with priority of column 3> 4 > 5. I sorted my dataset for the second column so the same values will be together.
My dataset looks like this:
X1 A 0.38 24.68 2.93
X2 A 0.38 20.22 14.54
X3 A 0.38 20.08 00.48
X3.3 A 0.22 11.55 10.68
C43 B 0.22 11.55 20.08
C4.2 C 0.22 11.55 3.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
C44 D 0.22 1.10 1.24
P1 E 0.42 0.42 0.42
P2 E 0.42 0.42 0.42
P3 E 0.42 0.42 0.42
Here, if a row has the same value in the second column as another row, then I compare their values in the third column and pick the row with the highest value in the third column.
If the rows have the same second and third columns, then I go to the fourth column, compare their values there, and take the row with the highest value.
If the rows sharing the second column also share the values in the third and fourth columns, then I pick the row with the highest value in the fifth column.
If the second, third, fourth, and fifth columns are all the same (complete duplicates), then I print them all, but add 'duplicate' next to their fifth column.
If a row does not share its second-column value with any other row, then there is no comparison and I keep this row.
Therefore, my expected output will be:
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42duplicate
P2 E 0.42 0.42 0.42duplicate
P3 E 0.42 0.42 0.42duplicate
What I tried at the moment fails, because I can only compare based on second column and not with multiple columns conditioning and I cannot keep complete duplicates.
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++'
X1 A 0.38 24.68 2.93
C43 B 0.22 11.55 20.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
P1 E 0.42 0.42 0.42
I would appreciate learning how to fix it.
I'm afraid the code below is not sophisticated, how about:
awk -v OFS="\t" '$1=$1' "data.txt" | sort -k2,2 -k3nr -k4nr -k5nr > "tmp.txt"
awk -v OFS="\t" '
    NR==FNR {
        vals = $3","$4","$5
        if (max[$2] == "") max[$2] = vals
        else if (max[$2] == vals) dupe[$2] = 1
        next
    }
    {
        vals = $3","$4","$5
        if (dupe[$2]) $6 = "duplicate"
        if (max[$2] == vals) print
    }
' "tmp.txt" "tmp.txt"
rm -f "tmp.txt"
It saves the sorted result in a temporary file, tmp.txt.
The 2nd awk script then reads that file twice (two passes).
In the 1st pass, it records the "max value" (columns 3 to 5 of the first, i.e. highest, row) for each value of the 2nd column. It also detects duplicates of that row and sets the dupe flag if any are found.
In the 2nd pass, it assigns the string duplicate to field $6 if the line's group has the dupe flag.
Then it prints only the line(s) that have the max value for each value of the 2nd column.
This may not be the most elegant solution, but it works:
cat data.txt | awk -v OFS="\t" '$1=$1' | sort -k2,2 -k3nr -k4nr -k5nr | awk '!a[$2]++' | cut -f2- > /tmp/fgrep.$$
cat data.txt | fgrep -f /tmp/fgrep.$$ | awk '{
    rec[NR] = $0
    idx = sprintf("%s %s %s %s",$2,$3,$4,$5)
    irec[NR] = idx
    dup[idx]++
}
END{
    # walk the records by number so the input order is kept
    for(i=1;i<=NR;i++){
        if(dup[irec[i]]> 1){
            print rec[i] "duplicate"
        }else{
            print rec[i]
        }
    }
}'
rm /tmp/fgrep.$$
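Both answers above go through a temporary file. As a minimal sketch of an alternative, the sorted stream can be handled in one awk pass by buffering the rows that tie with the first (highest) row of each group and printing them when the group ends. The helper name pick_best is made up, the input is assumed to be whitespace-separated, and the tab before duplicate is a choice of this sketch.

```shell
# pick_best is a hypothetical helper; it expects whitespace-separated rows on stdin.
pick_best() {
    # LC_ALL=C so that "." is always read as the decimal point by sort -n.
    LC_ALL=C sort -k2,2 -k3,3nr -k4,4nr -k5,5nr | awk '
        # Print the buffered top rows of the group that just ended.
        function emit(   i) {
            for (i = 1; i <= n; i++)
                print buf[i] (n > 1 ? "\tduplicate" : "")
            n = 0
        }
        # New group: after sorting, its first row holds the maximum.
        $2 != grp { emit(); grp = $2; best = $3 " " $4 " " $5 }
        # Keep every row that ties with that maximum.
        ($3 " " $4 " " $5) == best { buf[++n] = $0 }
        END { emit() }
    '
}

pick_best <<'EOF'
X1 A 0.38 24.68 2.93
X2 A 0.38 20.22 14.54
X3 A 0.38 20.08 00.48
X3.3 A 0.22 11.55 10.68
C43 B 0.22 11.55 20.08
C4.2 C 0.22 11.55 3.08
C4.5 C 0.22 11.55 31.08
C42 D 0.96 21.15 11.24
C44 D 0.22 1.10 1.24
P1 E 0.42 0.42 0.42
P2 E 0.42 0.42 0.42
P3 E 0.42 0.42 0.42
EOF
```

Like the answers above, it compares columns 3 to 5 as text when looking for ties, so 0.5 and 0.50 would not count as duplicates.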

compare 2nd column of two or more files and print union of all files

I have four tab separated files 1.txt, 2.txt, 3.txt, 4.txt. Each having following format
89 ABI1 0.19
93 ABL1 0.15
94 ABL2 0.07
170 ACSL3 0.21
I want to compare 2nd column of all files and print union (based on 2nd column) into new file, like following:
1.txt 2.txt 3.txt 4.txt
ABL2 0.07 0.01 0.11 0.009
AKT1 0.31 0.05 0.05 0.017
AKT2 0.33 0.05 0.01 0.004
How is it possible in awk?
I tried following but this only compares first columns,
awk 'NR==FNR {h[$1] = $0; next} {print $1,h[$1]}' OFS="\t" 2.txt 1.txt
but when I change it to compare 2nd column it doesn't work
awk 'NR==FNR {h[$2] = $0; next} {print $1,h[$2]}' OFS="\t" 2.txt 1.txt
Also this only works on two files at a time.
Is there any way to do it on four files by comparing 2nd column in awk?
Using join on sorted input files, and assuming a shell that understands process substitutions with <(...) (I've used a copy of the data that you provided for every input file, just adding a line at the top for identification, this is the AAA line):
$ join <( join -1 2 -2 2 -o 0,1.3,2.3 1.txt 2.txt ) \
<( join -1 2 -2 2 -o 0,1.3,2.3 3.txt 4.txt )
AAA 1 2 3 4
ABI1 0.19 0.19 0.19 0.19
ABL1 0.15 0.15 0.15 0.15
ABL2 0.07 0.07 0.07 0.07
ACSL3 0.21 0.21 0.21 0.21
There are three joins here. The first two to be performed are the ones in <(...). The first of these joins the first two files, while the second joins the last two files. The result of one of these joins looks like
AAA 1 2
ABI1 0.19 0.19
ABL1 0.15 0.15
ABL2 0.07 0.07
ACSL3 0.21 0.21
The option -o 0,1.3,2.3 means "output the join field along with field 3 from both files". -1 2 -2 2 means "use field 2 of each file as join field (rather than field 1)".
The outermost join takes the two results and performs the final join that produces the output.
If the input files are not sorted on the join field:
$ join <( join -1 2 -2 2 -o 0,1.3,2.3 <(sort -k2,2 1.txt) <(sort -k2,2 2.txt) ) \
<( join -1 2 -2 2 -o 0,1.3,2.3 <(sort -k2,2 3.txt) <(sort -k2,2 4.txt) )
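By default join keeps only keys that appear in both inputs, so the commands above produce the intersection of the files. Since the question asks for a union and for awk, here is a minimal sketch of one way to do that for any number of files. The helper name union_by_name, the NA placeholder for missing values, and the two small input files are assumptions made for illustration.

```shell
# union_by_name is a hypothetical helper. Assumptions: tab-separated input,
# the key in column 2, the value in column 3, and "NA" for a missing value.
union_by_name() {
    awk -F'\t' -v OFS='\t' '
        FNR == 1 { nfiles++; header = header OFS FILENAME }   # a new input file starts
        {
            if (!($2 in seen)) { seen[$2] = 1; names[++n] = $2 }  # remember first-seen order
            val[$2 SUBSEP nfiles] = $3
        }
        END {
            print header
            for (i = 1; i <= n; i++) {
                line = names[i]
                for (f = 1; f <= nfiles; f++) {
                    key = names[i] SUBSEP f
                    line = line OFS ((key in val) ? val[key] : "NA")
                }
                print line
            }
        }
    ' "$@"
}

# Two made-up input files in a temporary directory.
tmp=$(mktemp -d)
printf '89\tABI1\t0.19\n94\tABL2\t0.07\n' > "$tmp/1.txt"
printf '94\tABL2\t0.01\n207\tAKT1\t0.05\n' > "$tmp/2.txt"
( cd "$tmp" && union_by_name 1.txt 2.txt )
rm -rf "$tmp"
```

Rows come out in the order the keys are first seen, and the header row starts with an empty cell above the key column. An empty input file would not get a column, because the FNR == 1 rule never fires for it.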

Adding time into a .plot file without adding a new line using awk

I am writing a shell script that runs mpstat and iostat to get CPU and disk usage, extracts information from their output, and puts it into a .plot file to graph later with bargraph.pl. The trouble starts when I use awk to get the time from mpstat like this:
mpstat | awk 'FNR == 4 {print $1;}' >> CPU_usage.plot
It prints a newline after the time. I tried using printf, which works for my other lines of code and gets the specific information I need without adding a newline, but I don't know how to format the time with it. Is there any way to do this with awk, or any other method I can use? Thanks in advance.
When I run mpstat, this is what bash returns:
Linux 3.4.0+ (DESKTOP-JM295S0) 04/30/2017 _x86_64_ (4 CPU)
03:56:43 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
03:56:43 PM all 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
What I'm trying to accomplish is to take the time, %usr, %sys, and %idle values and put them into a file called CPU_usage.plot. This is what I want in the file:
03:56:43 0.00 0.00 100.00
What I got instead is:
03:56:43
0.00 0.00 100.00
This is my code:
mpstat | awk 'FNR == 4 {print $1;}' >> CPU_usage.plot
mpstat | awk 'FNR == 4 {printf " %f", $4;}' >> CPU_usage.plot
mpstat | awk 'FNR == 4 {printf " %f", $6;}' >> CPU_usage.plot
mpstat | awk 'FNR == 4 {printf " %f\n", $13;}' >> CPU_usage.plot
Use the following awk approach:
mpstat | awk 'NR==4{print $1,$4,$6,$13}' OFS="\t" >> CPU_usage.plot
Now, CPU_usage.plot file should contain:
03:56:43 0.00 0.00 100.00
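The stray line break in the question comes from print, which always appends a newline, whereas printf writes only what its format string says. A minimal sketch of the same extraction with a single printf follows. It uses a canned copy of the mpstat output quoted in the question so it runs without mpstat; the helper name sample_mpstat and the blank second line are assumptions.

```shell
# Canned copy of the mpstat output quoted in the question, so the sketch runs
# without mpstat installed. The blank second line is an assumption about the
# real output; it is what makes the data row line 4.
sample_mpstat() {
    cat <<'EOF'
Linux 3.4.0+ (DESKTOP-JM295S0) 04/30/2017 _x86_64_ (4 CPU)

03:56:43 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
03:56:43 PM all 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
EOF
}

# One printf builds the whole line; %s keeps the values exactly as mpstat printed them.
sample_mpstat | awk 'NR == 4 { printf "%s %s %s %s\n", $1, $4, $6, $13 }'
```

Replacing sample_mpstat with mpstat and appending >> CPU_usage.plot gives the one-line record, with a single mpstat call instead of four.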
