I have the following file
ENST001 ENST002 4 4 4 88 9 9
ENST004 3 3 3 99 8 8
ENST009 ENST010 ENST006 8 8 8 77 8 8
Basically I want to count how many times ENST* appears in each line, so the expected result is
2
1
3
Any suggestions, please?
Try this:
awk '{print gsub("ENST[0-9]+","")}' INPUTFILE
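gsub() replaces every match of the regular expression (here with the empty string) and returns the number of substitutions it made, so printing its return value gives the per-line count of ENST IDs: 2, 1 and 3 for the sample above.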
I have several lines of text and I want to extract the number after a specific word using awk. I tried several things, but they did not work.
First, create the test file with vi test.text. There are 3 columns (the 3 fields are generated by some other pipeline commands using awk):
Index AllocTres CPUTotal
1 cpu=1,mem=256G 18
2 cpu=2,mem=1024M 16
3 4
4 cpu=12,gres/gpu=3 12
5 8
6 9
7 cpu=13,gres/gpu=4,gres/gpu:ret6000=2 20
8 mem=12G,gres/gpu=3,gres/gpu:1080ti=1 21
Please note there are several empty fields in this file.
What I want to achieve is to extract the number after the first gres/gpu= in each line (if gres/gpu= does not occur in a line, the default number is 0), using a pipeline like cat test.text | awk '{some_commands}' to output 4 columns:
Index AllocTres CPUTotal GPUAllocated
1 cpu=1,mem=256G 18 0
2 cpu=2,mem=1024M 16 0
3 4 0
4 cpu=12,gres/gpu=3 12 3
5 8 0
6 9 0
7 cpu=13,gres/gpu=4,gres/gpu:ret6000=2 20 4
8 mem=12G,gres/gpu=3,gres/gpu:1080ti=1 21 3
Firstly: awk does not need cat; it can read files on its own. Combining cat and awk is generally discouraged as a useless use of cat.
For this task I would use GNU AWK the following way. Let file.txt content be (note the empty lines, which correspond to the rows with an empty AllocTres field)
cpu=1,mem=256G
cpu=2,mem=1024M

cpu=12,gres/gpu=3


cpu=13,gres/gpu=4,gres/gpu:ret6000=2
mem=12G,gres/gpu=3,gres/gpu:1080ti=1
then
awk 'BEGIN{FS="gres/gpu="}{print $2+0}' file.txt
output
0
0
0
3
0
0
4
3
Explanation: I inform GNU AWK that the field separator (FS) is gres/gpu=, then for each line I print the 2nd field increased by zero. For lines without gres/gpu=, $2 is an empty string; used in arithmetic context this is the same as zero, so zero plus zero gives zero. For lines with at least one gres/gpu=, adding zero makes GNU AWK take the longest prefix of $2 that is a legal number, thus 3 (4th line) becomes 3, 4,gres/gpu:ret6000=2 (7th line) becomes 4, and 3,gres/gpu:1080ti=1 (8th line) becomes 3.
(tested in GNU Awk 5.0.1)
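The same field-separator trick can also produce the requested 4-column output directly from test.text; a sketch (the header line is simply passed through with the new column name appended):
awk 'BEGIN{FS="gres/gpu="} FNR==1{print $0, "GPUAllocated"; next} {print $0, $2+0}' test.text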
With your shown samples, you can try the following code, written and tested in GNU awk. A simple explanation: awk's match() function is used with the regex gres\/gpu=([0-9]+) (the / is escaped) and one capturing group that captures all the digits coming after =. Once a match is found, the current line is printed followed by the array arr's 1st element +0 (so that zero is printed for lines with no match).
awk '
FNR==1{
  print $0,"GPUAllocated"
  next
}
{
  match($0,/gres\/gpu=([0-9]+)/,arr)
  print $0,arr[1]+0
}
' Input_file
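Note: the three-argument form of match() used here, where the capture groups are stored in the array arr, is a GNU awk extension, which is why gawk is required.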
Using sed
$ sed '1s/$/\tGPUAllocated/;s~.*gres/gpu=\([0-9]\).*~& \t\1~;1!{\~gres/gpu=[0-9]~!s/$/ \t0/}' input_file
Index AllocTres CPUTotal GPUAllocated
1 cpu=1,mem=256G 18 0
2 cpu=2,mem=1024M 16 0
3 4 0
4 cpu=12,gres/gpu=3 12 3
5 8 0
6 9 0
7 cpu=13,gres/gpu=4,gres/gpu:ret6000=2 20 4
8 mem=12G,gres/gpu=3,gres/gpu:1080ti=1 21 3
awk '
BEGIN{FS="\t"}
NR==1{
  $(NF+1)="GPUAllocated"
}
NR>1{
  $(NF+1)=FS 0
}
/gres\/gpu=/{
  split($0, a, "=")
  gp=a[3]; gsub(/[ ,].*/, "", gp)
  $NF=FS gp
}1' test.text
Index AllocTres CPUTotal GPUAllocated
1 cpu=1,mem=256G 18 0
2 cpu=2,mem=1024M 16 0
3 4 0
4 cpu=12,gres/gpu=3 12 3
5 8 0
6 9 0
7 cpu=13,gres/gpu=4,gres/gpu:ret6000=2 20 4
8 mem=12G,gres/gpu=3,gres/gpu:1080ti=1 21 3
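In this variant the input is treated as tab-separated (FS="\t"): the header line gets a new GPUAllocated field, all remaining lines first get a default value of 0, and on lines containing gres/gpu= the record is split on "=". For these samples gres/gpu= is the second = on the line, so the third piece starts with the GPU count; everything from the first space or comma onwards is stripped off, and the result overwrites the default in the last field.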
I want to replace specific lines in one file (File 1) with data contained in another file (File 2). For example:
File 1 (Input code):
other lines...
11 !!! Regular Expression
10 0.685682*100
11 0.004910*100
12 0.007012*100
13 0.146041*100
14 0.067827*100
15 0.019460*100
16 0.019277*100
17 0.001841*100
18 0.047950*100
other lines...
File 2 (to add new data):
1 0.36600*100
2 0.44466*100
3 0.0.046*100
4 0.15544*100
5 0.16600*100
6 0.14477*100
7 0.01927*100
8 0.00188*100
9 0.05566*100
How could I replace lines 10 to 18 of the input data (File 1) with the data contained in File 2? I tried using sed as follows:
sed '/!!! Regular Expression/r File2' File1
and I get the following:
1 !!! Regular Expression
2 0.36600*100
3 0.44466*100
4 0.0.046*100
5 0.15544*100
6 0.16600*100
7 0.14477*100
8 0.01927*100
9 0.00188*100
10 0.05566*100
11 0.685682*100
12 0.004910*100
13 0.007012*100
14 0.146041*100
15 0.067827*100
16 0.019460*100
17 0.019277*100
18 0.001841*100
19 0.047950*100
My problem is that this command inserts the lines contained in File 2 but does not replace the old ones. How can I replace only these lines (from 10 to 18) with the new data?
Thanks in advance.
replace specific lines in one file from 10 to 18 with the data contained in File 2
Let's split "replacing" into "deleting" and "inserting".
[Delete] specific lines in one file from 10 to 18
That's easy:
sed '10,18d'
would delete lines from 10 to 18.
[Insert] the data contained in File 2 [to line 10]
That's also easy:
sed '9r file2'
It appends the content of file2 after line 9, so the first line of file2 becomes the new line 10.
All together:
sed '10,18d; 9r file2'
Example:
# seq 8 | sed '3,6d; 2r '<(seq -f 2%.0f 5)
1
2
21
22
23
24
25
7
8
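Applied to the question's files (File2 holds the nine replacement lines, and lines 10 to 18 of File1 are to be replaced), that becomes:
sed '10,18d; 9r File2' File1 > newfile
or, with GNU sed's -i option, to edit File1 in place:
sed -i '10,18d; 9r File2' File1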
I have three files with different column and row size. For example,
ifile1.txt        ifile2.txt      ifile3.txt
  1   2   2         1  6            3   8
  2   5   6         3  8            9   0
  3   8   7         6  8           23   6
  6   7   6        23  6           44   5
  9  87  87        44  7           56   7
 23   6   6        56  8           78  89
 44   5  76        99  0           95  65
 56   6   7                        99  78
 78   7   8                       106   0
 95   6   7                       110   6
 99   6   4
106   5  34
110   6   4
Here ifile1.txt has 3 columns and 13 rows,
ifile2.txt has 2 columns and 7 rows, and
ifile3.txt has 2 columns and 10 rows.
The 1st column of each ifile is the ID; this ID is sometimes missing in ifile2.txt and ifile3.txt.
I would like to make an outfile.txt with 4 columns: the 1st column holds all the IDs as in ifile1.txt, the 2nd column is $3 from ifile1.txt, the 3rd and 4th columns are $2 from ifile2.txt and ifile3.txt, and the IDs missing from ifile2.txt and ifile3.txt are marked with the special character '?'.
Desired output:
outfile.txt
1 2 6 ?
2 6 ? ?
3 7 8 8
6 6 8 ?
9 87 ? 0
23 6 6 6
44 76 7 5
56 7 8 7
78 8 ? 89
95 7 ? 65
99 4 0 78
106 34 ? 0
110 4 ? 6
I was trying the following algorithm, but am not able to turn it into a script:
for each i in $1, awk '{printf "%3s %3s %3s %3s\n", $1, $3 (from ifile1.txt),
check if i is present in $1 (ifile2.txt), then
write corresponding $2 values from ifile2.txt
else write ?
similarly check for ifile3.txt
You can do that with GNU AWK using this script:
script.awk
# read lines from the three files
ARGIND == 1 { file1[ $1 ] = $3
              # init the other files with ?
              file2[ $1 ] = "?"
              file3[ $1 ] = "?"
              next
            }
ARGIND == 2 { file2[ $1 ] = $2
              next
            }
ARGIND == 3 { file3[ $1 ] = $2
              next
            }
# output the collected information
END { PROCINFO["sorted_in"] = "@ind_num_asc"  # gawk: emit the IDs in ascending numeric order
      for( k in file1 ) {
          printf("%3s%6s%6s%6s\n", k, file1[ k ], file2[ k ], file3[ k ])
      }
    }
Run the script like this: awk -f script.awk ifile1.txt ifile2.txt ifile3.txt > outfile.txt
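Note that ARGIND (the index of the file currently being read) and PROCINFO["sorted_in"] are GNU awk extensions; with a plain POSIX awk you would have to distinguish the files via FNR==NR or FILENAME and sort the output yourself, e.g. by piping it through sort -n.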
I would like to extract the sum, average and maximum for every interval of 6 numbers from a column.
I found many discussions related to this problem, but all of those are for the whole column, e.g.
To compute sum of a column:
awk '{sum+=$1} END { print sum}'
To calculate Average:
awk '{sum+=$1} END { print sum/NR}'
To find the maximum or minimum, the sort command can be used.
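For example, one way to get the maximum of the 2nd column:
awk '{print $2}' inputfile.txt | sort -n | tail -1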
I need all of these per interval. E.g., my input file is
inputfile.txt
1 3
2 5
3 4
4 3
5 2
6 1
7 3
8 3
9 4
10 2
11 2
12 2
13 5
14 4
15 2
16 3
17 7
18 3
Output files are
sum.txt
1 18
2 16
3 24
average.txt
1 3
2 2.67
3 4
maximum.txt
1 5
2 4
3 7
Please search the site before you ask a question; many related posts already exist.
You can try something like the below and modify it accordingly.
Input
[akshay#localhost tmp]$ cat input.txt
1 3
2 5
3 4
4 3
5 2
6 1
7 3
8 3
9 4
10 2
11 2
12 2
13 5
14 4
15 2
16 3
17 7
18 3
Script
[akshay#localhost tmp]$ cat test.awk
{
    sum += $2                   # accumulate the 2nd column within the current group of 6
    max = max > $2 ? max : $2   # track the group's maximum
}
!(FNR%6){                       # every 6th line: write out this group's statistics
    print ++c,sum > "sum.txt"
    print c,sum/6 > "average.txt"
    print c,max > "maximum.txt"
    sum = max = ""              # reset for the next group
}
Output
[akshay#localhost tmp]$ awk -f test.awk input.txt
Sum
[akshay#localhost tmp]$ cat sum.txt
1 18
2 16
3 24
Average
[akshay#localhost tmp]$ cat average.txt
1 3
2 2.66667
3 4
Maximum
[akshay#localhost tmp]$ cat maximum.txt
1 5
2 4
3 7
This site isn't meant for writing whole programs, and that's basically what you're asking us to do.
What you need to do is keep track of how many lines you've read and then every 6th line you produce output. Consider something like this:
awk '{sum += $2} (NR%6)==0 {print(sum); sum=0}' input.txt
I'm not going to explain what I did; I expect you to search the internet for awk tutorials and get an understanding of what it does yourself.
Is there an efficient algorithm to search and dump all common substrings (of length 3 or longer) between 2 strings?
Example input:
Length : 0 5 10 15 20 25 30
String 1 : ABC-DEF-GHI-JKL-ABC-ABC-STU-MWX-Y
String 2 : ABC-JKL-MNO-ABC-DEF-PQR-DEF-ZWX-Y
Example output:
In string 1 2
---------------------------
ABC-DEF 0 12
ABC-DE 0 12
BC-DEF 1 13
:
-ABC- 15,19 11
-JKL- 11 3
-DEF- 3 15
-JKL 11 3
JKL- 12 4
-DEF 3 15,23
DEF- 4 16
WX-Y 29 29
ABC- 0,16,20 0,12
-ABC 15,19 11
DEF- 4 16,24
DEF 4 16,24
ABC 0,16,20 0,12
JKL 12 4
WX- 29 29
X-Y 30 30
-AB 15,19 11
BC- 1,17,21 1,13
-DE 3 15,23
EF- 5 17,25
-JK 11 3
KL- 13 5
:
In the example, "-D" and "-M" are also common substrings, but they are not required because their length is only 2. (There might be some entries missing from the example output because there are so many of them...)
You can find all common substrings using a data structure called a generalized suffix tree.
Libstree contains some example code for finding the longest common substring. That example code can be modified to obtain all common substrings.
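For small inputs like the example above, you can also just enumerate substrings; a brute-force sketch in awk (not the suffix-tree approach, and it prints only the substrings, without their positions):
awk 'BEGIN {
    s1 = "ABC-DEF-GHI-JKL-ABC-ABC-STU-MWX-Y"
    s2 = "ABC-JKL-MNO-ABC-DEF-PQR-DEF-ZWX-Y"
    # remember every substring of s1 whose length is at least 3
    for (len = 3; len <= length(s1); len++)
        for (i = 1; i + len - 1 <= length(s1); i++)
            seen[substr(s1, i, len)] = 1
    # print each substring of s2 that also occurs in s1, once
    for (len = 3; len <= length(s2); len++)
        for (i = 1; i + len - 1 <= length(s2); i++) {
            t = substr(s2, i, len)
            if ((t in seen) && !(t in done)) { done[t] = 1; print t }
        }
}'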