Grep rows with a length of 3 - linux

Hi, I have a table which looks like this:
chr10 84890986 84891021 2 17.5 2 93 0 61 48 2 48 0 1.16 GA
chr10 84897562 84897613 2 25.5 2 100 0 102 50 49 0 0 1 AC
chr10 84899819 84899844 2 12.5 2 100 0 50 0 0 52 48 1 GT
chr10 84905282 84905318 6 5.8 6 87 6 54 80 19 0 0 0.71 AAAAAC
chr10 84955235 84955267 2 16 2 100 0 64 50 0 0 50 1 AT
chr10 84972254 84972288 2 17 2 93 0 59 2 0 47 50 1.16 GT
chr10 85011399 85011478 3 25.7 3 80 12 63 58 1 40 0 1.06 GAA
chr10 85011461 85011525 3 20.7 3 87 6 74 39 0 60 0 0.97 GAG
chr10 85014721 85014841 5 23.8 5 78 8 66 0 69 0 29 1 TTCCC
chr10 85021530 85021701 5 38.4 5 84 13 53 74 0 24 0 0.85 AAGAG
chr10 85045413 85045440 3 9 3 100 0 54 66 33 0 0 0.92 CAA
chr10 85059334 85059364 5 6 5 92 0 51 20 3 0 76 0.92 ATTTT
chr10 85072010 85072038 2 14 2 100 0 56 50 50 0 0 1 CA
chr10 85072037 85072077 4 10 4 84 10 55 25 22 0 52 1.47 ATCT
chr10 85084308 85084338 6 5 6 91 0 51 83 13 3 0 0.77 CAAAAA
chr10 85096597 85096640 3 14.7 3 95 4 79 69 30 0 0 0.88 AAC
chr10 85151154 85151190 6 6.5 6 87 12 51 0 11 0 88 0.5 TTTCTT
chr10 85168255 85168320 4 16.2 4 100 0 130 50 0 49 0 1 AGGA
chr10 85173155 85173184 2 14.5 2 100 0 58 48 0 0 51 1 TA
chr10 85196836 85196861 2 12.5 2 100 0 50 52 48 0 0 1 AC
chr10 85215511 85215546 2 17.5 2 100 0 70 51 48 0 0 1 AC
chr10 85225048 85225075 2 13.5 2 100 0 54 51 48 0 0 1 AC
chr10 85242322 85242357 2 17.5 2 93 0 61 0 2 48 48 1.16 TG
chr10 85245934 85245981 4 11 4 79 20 51 27 2 0 70 0.99 ATTT
chr10 85249139 85249230 5 18.8 5 88 6 116 0 60 0 39 0.97 TTCCC
chr10 85251100 85251153 5 11 5 97 2 92 0 0 37 62 0.96 GTTTG
chr10 85268725 85268752 4 6.8 4 100 0 54 0 25 0 74 0.83 CTTT
chr10 85268767 85268798 4 7.8 4 100 0 62 0 0 22 77 0.77 TTTG
chr10 85269189 85269239 6 8.8 6 79 16 54 84 2 12 2 0.8 AAAAGA
chr10 85330217 85330253 2 18 2 100 0 72 0 0 50 50 1 TG
chr10 85332256 85332314 4 15 4 82 7 75 70 1 27 0 0.97 AAGA
chr10 85337969 85337996 2 13.5 2 100 0 54 0 0 48 51 1 TG
chr10 85344795 85344957 2 75.5 2 83 12 198 45 4 3 45 1.42 TA
chr10 85349732 85349765 5 6.8 5 93 6 59 84 15 0 0 0.61 AAAAC
chr10 85353082 85353109 5 5.4 5 100 0 54 0 22 18 59 1.38 CTGTT
I want to extract all rows which have 3 and only 3 characters in the last column. My attempt so far is this:
grep -E "['ACTG']['ACTG']['ACTG']{1,3}$"
But this gives me everything of length 3 and longer. I have tried many different combinations, but nothing seems to give me what I want. Any ideas?

If you would like to try awk, you can do:
awk '$NF~/\<...\>/' file
chr10 85011399 85011478 3 25.7 3 80 12 63 58 1 40 0 1.06 GAA
chr10 85011461 85011525 3 20.7 3 87 6 74 39 0 60 0 0.97 GAG
chr10 85045413 85045440 3 9 3 100 0 54 66 33 0 0 0.92 CAA
chr10 85096597 85096640 3 14.7 3 95 4 79 69 30 0 0 0.88 AAC
It tests whether the last field $NF consists of exactly 3 characters (the three dots, delimited by the word boundaries \< and \>).
This regex would also do: awk '$NF~/^...$/'
Or, if you need to restrict the match to those exact characters (note: the {3} interval expression needs gawk 4.x, or the --re-interval switch on older versions):
awk '$NF~/^[ACTG]{3}$/' file
Using grep
grep -E " [ACTG]{3}$" file
chr10 85011399 85011478 3 25.7 3 80 12 63 58 1 40 0 1.06 GAA
chr10 85011461 85011525 3 20.7 3 87 6 74 39 0 60 0 0.97 GAG
chr10 85045413 85045440 3 9 3 100 0 54 66 33 0 0 0.92 CAA
chr10 85096597 85096640 3 14.7 3 95 4 79 69 30 0 0 0.88 AAC
You need the leading space to separate the last column, and {3} to match 3 and only 3 characters.

If you want to print the rows which have exactly three characters in the last column, then you could use the grep command below.
grep -E " [ACTG]{3}$"
[ACTG]{3} Matches exactly three characters from the given list.

You have to grep either " ['ACTG']['ACTG']['ACTG']$" or " ['ACTG']{3}$".
Currently, you are grepping 3 to 5 characters from that class.
Also, the quotes are unnecessary: ['ACTG'] means "match any one character between the []", i.e. any of the 5 characters ', A, C, T, G. So just grep " [ACTG]{3}$".
Be sure to use a delimiter for the left part (a space ' ', a tab \t if the file is tab delimited, or a word boundary \b or \W).
If your lines all end with [ACTG]+, you can even simply grep -E "\W.{3}$"

Another way that you could do this would be using awk:
$ awk '$NF ~ /^[ACTG][ACTG][ACTG]$/' file
chr10 85011399 85011478 3 25.7 3 80 12 63 58 1 40 0 1.06 GAA
chr10 85011461 85011525 3 20.7 3 87 6 74 39 0 60 0 0.97 GAG
chr10 85045413 85045440 3 9 3 100 0 54 66 33 0 0 0.92 CAA
chr10 85096597 85096640 3 14.7 3 95 4 79 69 30 0 0 0.88 AAC
This prints all lines whose last field exactly matches 3 of the characters "A", "C", "T" or "G".

Two hours late, but this is one way in awk:
awk 'length($NF)==3' file
This can easily be edited for different lengths and fields, as in the sketch below.
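For instance, a small sketch of how the same check can be parameterized (the -v variables are my own illustration; in the sample data above the motif happens to be field 15, which is also the last field):
# pass the wanted length and field number in from the shell; $col means "the field whose number is stored in col"
awk -v len=3 -v col=15 'length($col)==len' file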

As I was looking for answers myself, I found that a Perl-compatible regex works well here:
this does the trick with much more compact code: grep -P '\t...$' (assuming the file is tab delimited).
$ cat roi_new.bed | grep -P "\t...$"
chr10 81038152 81038182 3 9.7 3 92 7 51 30 0 0 70 0.88 TTA
chr10 81272294 81272320 3 8.7 3 100 0 52 0 30 69 0 0.89 GGC
chr10 81287690 81287720 3 10 3 100 0 60 66 33 0 0 0.92 CAA

Related

I would like to find consecutive numbers in column A and column B in python (pandas)

I would like to find consecutive numbers in column A and column B in Python; column A should be ascending but column B descending. I am attaching an example file.
Input file
nucleotide               Pos_A  Pos_B
Connection_Pos20_Pos102  20     102
Connection_Pos19_Pos102  19     102
Connection_Pos20_Pos101  20     101
Connection_Pos18_Pos102  18     102
Connection_Pos19_Pos101  19     101
Connection_Pos20_Pos100  20     100
Connection_Pos17_Pos102  17     102
Connection_Pos18_Pos101  18     101
Connection_Pos19_Pos100  19     100
Connection_Pos20_Pos99   20     99
Connection_Pos16_Pos102  16     102
Connection_Pos17_Pos101  17     101
Connection_Pos18_Pos100  18     100
Connection_Pos19_Pos99   19     99
Connection_Pos20_Pos98   20     98
Connection_Pos15_Pos102  15     102
Connection_Pos16_Pos101  16     101
Connection_Pos17_Pos100  17     100
Connection_Pos18_Pos99   18     99
Connection_Pos19_Pos98   19     98
Connection_Pos20_Pos97   20     97
Connection_Pos14_Pos102  14     102
Connection_Pos15_Pos101  15     101
Connection_Pos16_Pos100  16     100
Output:
nucleotide               Pos_A  Pos_B  Consecutive ID  Consecutive Number (Size)
Connection_Pos20_Pos102  20     102    101             1
Connection_Pos19_Pos102  19     102    100             2
Connection_Pos20_Pos101  20     101    100             2
Connection_Pos18_Pos102  18     102    99              3
Connection_Pos19_Pos101  19     101    99              3
Connection_Pos20_Pos100  20     100    99              3
Connection_Pos17_Pos102  17     102    98              4
Connection_Pos18_Pos101  18     101    98              4
Connection_Pos19_Pos100  19     100    98              4
Connection_Pos20_Pos99   20     99     98              4
Connection_Pos16_Pos102  16     102    97              5
Connection_Pos17_Pos101  17     101    97              5
Connection_Pos18_Pos100  18     100    97              5
Connection_Pos19_Pos99   19     99     97              5
Connection_Pos20_Pos98   20     98     97              5
Connection_Pos15_Pos102  15     102    96              6
Connection_Pos16_Pos101  16     101    96              6
Connection_Pos17_Pos100  17     100    96              6
Connection_Pos18_Pos99   18     99     96              6
Connection_Pos19_Pos98   19     98     96              6
Connection_Pos20_Pos97   20     97     96              6
Connection_Pos14_Pos102  14     102    95              7
Connection_Pos15_Pos101  15     101    95              7
Connection_Pos16_Pos100  16     100    95              7
Connection_Pos17_Pos99   17     99     95              7
Connection_Pos18_Pos98   18     98     95              7
Connection_Pos19_Pos97   19     97     95              7
Connection_Pos20_Pos96   20     96     95              7
For Consecutive ID, if Pos_B's shifted difference != 1, then we want to subtract 1, so we mark those indexes as -1 with mul(-1) and cumsum them:
df['ID'] = df.Pos_B.shift().sub(df.Pos_B).ne(1).mul(-1).cumsum() + df.Pos_B[0]
For Consecutive Number, if Pos_A's shifted difference != -1, then we want to add 1, so we mark those indexes as 1 and cumsum again:
df['Number'] = df.Pos_A.shift().sub(df.Pos_A).ne(-1).mul(1).cumsum()
Result:
nucleotide Pos_A Pos_B ID Number
0 Connection_Pos20_Pos102 20 102 101 1
1 Connection_Pos19_Pos102 19 102 100 2
2 Connection_Pos20_Pos101 20 101 100 2
3 Connection_Pos18_Pos102 18 102 99 3
4 Connection_Pos19_Pos101 19 101 99 3
5 Connection_Pos20_Pos100 20 100 99 3
6 Connection_Pos17_Pos102 17 102 98 4
7 Connection_Pos18_Pos101 18 101 98 4
8 Connection_Pos19_Pos100 19 100 98 4
9 Connection_Pos20_Pos99 20 99 98 4
10 Connection_Pos16_Pos102 16 102 97 5
11 Connection_Pos17_Pos101 17 101 97 5
12 Connection_Pos18_Pos100 18 100 97 5
13 Connection_Pos19_Pos99 19 99 97 5
14 Connection_Pos20_Pos98 20 98 97 5
15 Connection_Pos15_Pos102 15 102 96 6
16 Connection_Pos16_Pos101 16 101 96 6
17 Connection_Pos17_Pos100 17 100 96 6
18 Connection_Pos18_Pos99 18 99 96 6
19 Connection_Pos19_Pos98 19 98 96 6
20 Connection_Pos20_Pos97 20 97 96 6
21 Connection_Pos14_Pos102 14 102 95 7
22 Connection_Pos15_Pos101 15 101 95 7
23 Connection_Pos16_Pos100 16 100 95 7
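For reference, a minimal, self-contained sketch of the two lines from this answer, run on the first few rows of the sample (the DataFrame construction here is mine, added only so the snippet runs on its own):
import pandas as pd

# First four rows of the sample input, just enough to make this runnable.
df = pd.DataFrame({
    'nucleotide': ['Connection_Pos20_Pos102', 'Connection_Pos19_Pos102',
                   'Connection_Pos20_Pos101', 'Connection_Pos18_Pos102'],
    'Pos_A': [20, 19, 20, 18],
    'Pos_B': [102, 102, 101, 102],
})

# Consecutive ID: start a new value whenever Pos_B does not drop by exactly 1.
df['ID'] = df.Pos_B.shift().sub(df.Pos_B).ne(1).mul(-1).cumsum() + df.Pos_B[0]
# Consecutive Number: increase the counter whenever Pos_A does not rise by exactly 1.
df['Number'] = df.Pos_A.shift().sub(df.Pos_A).ne(-1).mul(1).cumsum()

print(df)   # for these rows: ID = 101, 100, 100, 99 and Number = 1, 2, 2, 3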
Do it one by one, then groupby with ngroup:
s1 = df.Pos_A.diff().le(0).cumsum()
s2 = df.Pos_B.diff().ge(0).cumsum()
df['out'] = df.groupby([s1,s2]).ngroup()+1
Out[452]:
0 1
1 2
2 2
3 3
4 3
5 3
6 4
7 4
8 4
9 4
10 5
11 5
12 5
13 5
14 5
15 6
16 6
17 6
18 6
19 6
20 6
21 7
22 7
23 7
24 7
25 7
26 7
27 7
dtype: int64

pandas read_csv not reading entire file

I have a really strange problem and don't know how to solve it.
I am using Ubuntu 18.04.2 together with Python 3.7.3 64-bit and use VScode as an editor.
I am reading data from a database and writing it to a csv file with csv.writer:
import pandas as pd
import csv

with open(raw_path + station + ".csv", "w+") as f:
    file = csv.writer(f)
    # Write header into csv
    colnames = [par for par in param]
    file.writerow(colnames)
    # Write data into csv
    for row in data:
        file.writerow(row)
This works perfectly fine; it produces a .csv file with all the data I read from the database up to the current timestep. However, in a later step I have to read this data into a pandas dataframe and merge it with another pandas dataframe. I read the files like this:
data1 = pd.read_csv(raw_path + file1, sep=',')
data2 = pd.read_csv(raw_path + file2, sep=',')
And then merge the data like this:
comb_data = pd.merge(data1, data2, on="datumsec", how="left").fillna(value=-999)
For 5 out of 6 locations where I do this, everything works perfectly fine: the combined dataset has the same length as the two separate ones. However, for one location pd.read_csv does not seem to read the csv files properly. I checked whether the problem is already in the database readout, but everything is OK there; I can open both files with Sublime and they have the same length. However, when I read them with pandas.read_csv, one shows fewer lines. The best part is that this problem appears totally at random. Sometimes it works and reads the entire file, sometimes not. AND it occurs at different places in the file. Sometimes it stops after approx. 20000 entries, sometimes at 45000, sometimes somewhere else, just totally at random.
Here is an overview of my test output when I print all the lengths of the files
print(len(data1)): 57105
print(len(data2)): 57105
Both values directly after readout from the database, before writing anything anywhere.
After saving the data as csv as described above and opening it in Excel or Sublime or anything else, I can confirm that the data contains 57105 rows. Everything is where it is supposed to be.
However, if I then read the data back with pd.read_csv:
print(len(data1)): 48612
print(len(data2)): 57105
both values after reading in the data from the csv file
data1 48612
datumsec tl rf ff dd ffx
0 1538352000 46 81 75 288 89
1 1538352600 47 79 78 284 93
2 1538353200 45 82 79 282 93
3 1538353800 44 84 71 284 91
4 1538354400 43 86 77 288 96
5 1538355000 43 85 78 289 91
6 1538355600 46 80 79 286 84
7 1538356200 51 72 68 285 83
8 1538356800 52 71 68 281 73
9 1538357400 48 75 68 276 80
10 1538358000 45 78 62 271 76
11 1538358600 42 82 66 273 76
12 1538359200 43 81 70 274 78
13 1538359800 44 80 68 275 78
14 1538360400 45 78 66 279 72
15 1538361000 45 78 67 282 73
16 1538361600 43 79 63 275 71
17 1538362200 43 81 69 280 74
18 1538362800 42 80 70 281 76
19 1538363400 43 78 69 285 77
20 1538364000 43 78 71 285 77
21 1538364600 44 75 61 288 71
22 1538365200 45 73 56 290 62
23 1538365800 45 72 44 297 57
24 1538366400 44 73 51 286 57
25 1538367000 43 76 61 281 70
26 1538367600 40 79 66 284 73
27 1538368200 39 78 70 291 76
28 1538368800 38 80 71 287 81
29 1538369400 36 81 74 285 81
... ... .. ... .. ... ...
48582 1567738800 7 100 0 210 0
48583 1567739400 6 100 0 210 0
48584 1567740000 5 100 0 210 0
48585 1567740600 6 100 0 210 0
48586 1567741200 4 100 0 210 0
48587 1567741800 4 100 0 210 0
48588 1567742400 5 100 0 210 0
48589 1567743000 4 100 0 210 0
48590 1567743600 4 100 0 210 0
48591 1567744200 4 100 0 209 0
48592 1567744800 4 100 0 209 0
48593 1567745400 5 100 0 210 0
48594 1567746000 6 100 0 210 0
48595 1567746600 5 100 0 210 0
48596 1567747200 5 100 0 210 0
48597 1567747800 5 100 0 210 0
48598 1567748400 5 100 0 210 0
48599 1567749000 6 100 0 210 0
48600 1567749600 6 100 0 210 0
48601 1567750200 5 100 0 210 0
48602 1567750800 4 100 0 210 0
48603 1567751400 5 100 0 210 0
48604 1567752000 6 100 0 210 0
48605 1567752600 7 100 0 210 0
48606 1567753200 6 100 0 210 0
48607 1567753800 5 100 0 210 0
48608 1567754400 6 100 0 210 0
48609 1567755000 7 100 0 210 0
48610 1567755600 7 100 0 210 0
48611 1567756200 7 100 0 210 0
[48612 rows x 6 columns]
datumsec tl rf schnee ival6
0 1538352000 115 61 25 107
1 1538352600 115 61 25 107
2 1538353200 115 61 25 107
3 1538353800 115 61 25 107
4 1538354400 115 61 25 107
5 1538355000 115 61 25 107
6 1538355600 115 61 25 107
7 1538356200 115 61 25 107
8 1538356800 115 61 25 107
9 1538357400 115 61 25 107
10 1538358000 115 61 25 107
11 1538358600 115 61 25 107
12 1538359200 115 61 25 107
13 1538359800 115 61 25 107
14 1538360400 115 61 25 107
15 1538361000 115 61 25 107
16 1538361600 115 61 25 107
17 1538362200 115 61 25 107
18 1538362800 115 61 25 107
19 1538363400 115 61 25 107
20 1538364000 115 61 25 107
21 1538364600 115 61 25 107
22 1538365200 115 61 25 107
23 1538365800 115 61 25 107
24 1538366400 115 61 25 107
25 1538367000 115 61 25 107
26 1538367600 115 61 25 107
27 1538368200 115 61 25 107
28 1538368800 115 61 25 107
29 1538369400 115 61 25 107
... ... ... ... ... ...
57075 1572947400 -23 100 -2 -999
57076 1572948000 -23 100 -2 -999
57077 1572948600 -22 100 -2 -999
57078 1572949200 -23 100 -2 -999
57079 1572949800 -24 100 -2 -999
57080 1572950400 -23 100 -2 -999
57081 1572951000 -21 100 -1 -999
57082 1572951600 -21 100 -1 -999
57083 1572952200 -23 100 -1 -999
57084 1572952800 -23 100 -1 -999
57085 1572953400 -22 100 -1 -999
57086 1572954000 -23 100 -1 -999
57087 1572954600 -22 100 -1 -999
57088 1572955200 -24 100 0 -999
57089 1572955800 -24 100 0 -999
57090 1572956400 -25 100 0 -999
57091 1572957000 -26 100 -1 -999
57092 1572957600 -26 100 -1 -999
57093 1572958200 -27 100 -1 -999
57094 1572958800 -25 100 -1 -999
57095 1572959400 -27 100 -1 -999
57096 1572960000 -29 100 -1 -999
57097 1572960600 -28 100 -1 -999
57098 1572961200 -28 100 -1 -999
57099 1572961800 -27 100 -1 -999
57100 1572962400 -29 100 -2 -999
57101 1572963000 -29 100 -2 -999
57102 1572963600 -29 100 -2 -999
57103 1572964200 -30 100 -2 -999
57104 1572964800 -28 100 -2 -999
[57105 rows x 5 columns]
To me there is no obvious reason in the data why pandas should have problems reading the entire file, and apparently there is none, considering that sometimes it reads the entire file and sometimes it does not.
I am really clueless about this. Do you have any idea what the problem could be and how to cope with it?
I finally solved my problem, and as expected it was not within the file itself. I am using multiple processes to run the named functions and some other things in parallel. The reading from the database plus writing to the csv file, and the reading from the csv file, are performed in two different processes. Therefore the second process (reading from csv) did not know that the csv file was still being written and read only what was already available in it. Because the file was opened by a different process, no exception was thrown when it was opened.
I thought I had already taken care of this, but obviously not thoroughly enough to exclude every possible case.
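A minimal sketch of the kind of synchronization described above, assuming the writer and the reader run as separate multiprocessing.Process steps (the function and file names here are hypothetical, purely for illustration):
import pandas as pd
from multiprocessing import Process

def write_station_csv(path):
    # hypothetical stand-in for the database readout + csv.writer step
    pass

def read_station_csv(path):
    # only safe once the writer has finished
    return pd.read_csv(path, sep=',')

if __name__ == '__main__':
    path = 'station.csv'                        # hypothetical file name
    writer = Process(target=write_station_csv, args=(path,))
    writer.start()
    writer.join()                               # wait until the csv is completely written
    data = read_station_csv(path)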
I had completely the same problem with a different application and also did not understand what was wrong, because sometimes it worked and sometimes it didn't.
In a for loop, I was extracting the last two rows of a dataframe that I was creating in the same file. Sometimes the extracted rows were not the last two at all, but most of the time it worked fine. I guess the program started extracting the last two rows before the writing process was done.
I paused the script for half a second to make sure the writing process is done:
import time
time.sleep(0.5)
However, I don't think this is a very elegant solution, since it might not be sufficient if somebody with a slower computer runs the script, for instance.
Vroni, how did you solve this in the end? Is there a way to specify that a certain step must not run in parallel with other tasks? I did not define anything about parallel processing in my program, so if this is the cause, I think it is happening automatically.

Convert Glyph path to SVG

I have the following glyph path:
<glyph glyph-name="right-nav-workflow" unicode="" d="M251 101c0 65 0 131 0 196 26 0 52 0 78 0 0 27 0 54 0 81-29 7-52 23-67 50-11 20-14 41-10 64 8 45 48 79 92 81 49 1 88-29 102-79 2 0 4 0 6 0 20 0 40 0 60 0 5 0 8 1 11 5 32 32 64 64 96 96 2 1 3 3 4 4 42-42 84-84 127-127-1 0-2-2-4-4-32-32-65-65-98-98-2-2-4-6-4-10 0-21 0-43 0-64 30-8 53-24 67-52 12-21 15-44 9-68-11-46-55-78-102-75-48 3-88 42-92 91-4 48 28 91 78 104 0 1 1 3 1 4 0 21 0 42 0 62 0 3-2 5-3 7-28 28-55 55-83 83-1 1-3 3-5 3-22 0-45 0-68 0-5-19-13-36-27-50-14-14-31-23-50-27 0-27 0-53 0-81 26 0 52 0 78 0 0-65 0-130 0-196-65 0-131 0-196 0z m157 156c-40 0-79 0-118 0 0-39 0-78 0-117 40 0 79 0 118 0 0 39 0 78 0 117z m275-58c0 33-26 59-59 59-32 0-59-26-59-59 0-33 27-59 60-59 32 0 58 26 58 59z m-334 216c33 0 59 27 59 59 0 33-26 59-59 59-33 0-59-26-59-59 0-33 26-59 59-59z m275-11c23 23 46 46 69 69-23 23-47 46-69 68-23-22-46-46-69-69 23-22 46-45 69-68z" horiz-adv-x="1000" />
How can I convert this into SVG? I tried to save it in a text file with an svg extension, but that does not seem to work.
Take the raw drawing data, put it in a path element, remove the glyph-specific attributes, and add an appropriate viewBox. Now you have something that works as an inline SVG. If you want to save it as a standalone SVG, you also need to add a namespace declaration.
<svg height="400px" width="400px" viewBox="0 0 1000 1000">
<path transform="scale(1,-1) translate(0,-650)" fill="none" stroke="red" stroke-width="1" d="M251 101c0 65 0 131 0 196 26 0 52 0 78 0 0 27 0 54 0 81-29 7-52 23-67 50-11 20-14 41-10 64 8 45 48 79 92 81 49 1 88-29 102-79 2 0 4 0 6 0 20 0 40 0 60 0 5 0 8 1 11 5 32 32 64 64 96 96 2 1 3 3 4 4 42-42 84-84 127-127-1 0-2-2-4-4-32-32-65-65-98-98-2-2-4-6-4-10 0-21 0-43 0-64 30-8 53-24 67-52 12-21 15-44 9-68-11-46-55-78-102-75-48 3-88 42-92 91-4 48 28 91 78 104 0 1 1 3 1 4 0 21 0 42 0 62 0 3-2 5-3 7-28 28-55 55-83 83-1 1-3 3-5 3-22 0-45 0-68 0-5-19-13-36-27-50-14-14-31-23-50-27 0-27 0-53 0-81 26 0 52 0 78 0 0-65 0-130 0-196-65 0-131 0-196 0z m157 156c-40 0-79 0-118 0 0-39 0-78 0-117 40 0 79 0 118 0 0 39 0 78 0 117z m275-58c0 33-26 59-59 59-32 0-59-26-59-59 0-33 27-59 60-59 32 0 58 26 58 59z m-334 216c33 0 59 27 59 59 0 33-26 59-59 59-33 0-59-26-59-59 0-33 26-59 59-59z m275-11c23 23 46 46 69 69-23 23-47 46-69 68-23-22-46-46-69-69 23-22 46-45 69-68z"/>
</svg>
Update: Robert points out that the SVG Fonts spec has a y axis (0 at the bottom) that is inverted relative to the SVG norm (0 at the top), so you also need to flip the drawing along the x axis using scale(1,-1).
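For a standalone .svg file, a minimal sketch of the same markup with the namespace declaration added (the path element is exactly the one from the inline example above, abbreviated here):
<svg xmlns="http://www.w3.org/2000/svg" height="400px" width="400px" viewBox="0 0 1000 1000">
  <!-- same <path> element as in the inline example above -->
</svg>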

extracting string between two regular expressions "|" patterns

I want to extract all the strings between gi| and |. The position of the string is consistent in all the lines.
I am trying this one:
cat ERR594382_second_cat.test | sed -n '/gi\|/,/\|/p'
But it is not working.
Here is a head of my file :
head ERR594382_second_cat.test
ERR594382.28316455_3_6_1 gi|914605561|ref|WP_050599988.1| 22 54 67 99 4.03e-15 77.0 100.000 33 0 0 225971;1306953 Bacteria Erythrobacter citreus;Erythrobacter citreus LAMA 915 ribonuclease D [Erythrobacter citreus]
ERR594382.28316455_65_2_3 gi|914605561|ref|WP_050599988.1| 13 46 11 44 2.15e-17 82.8 100.000 34 0 0 225971;1306953 Bacteria Erythrobacter citreus;Erythrobacter citreus LAMA 915 ribonuclease D [Erythrobacter citreus]
ERR594382.28316459_1_1_2 gi|1270336953|gb|PHR32068.1| 8 53 863 903 6.98e-08 56.6 63.043 46 12 1 2024840 Bacteria Methylophaga sp. phosphohydrolase [Methylophaga sp.]
ERR594382.28316464_2_2_3 gi|705244733|gb|AIW56710.1| 2 33 145 176 5.76e-12 67.8 93.750 32 2 0 340016 Viruses uncultured virus ribonucleotide reductase, partial [uncultured virus]
ERR594382.28316464_53_5_5 gi|1200458341|gb|OUV73944.1| 1 31 557 587 9.54e-11 64.3 80.645 31 6 0 1986721 Bacteria Flavobacteriales bacterium TMED123 hypothetical protein CBC83_04720 [Flavobacteriales bacterium TMED123]
ERR594382.28316465_3_3_2 gi|787065740|dbj|BAR36435.1| 1 46 204 249 5.55e-10 63.2 58.696 46 19 0 1407671 Viruses uncultured Mediterranean phage uvMED hypothetical protein [uncultured Mediterranean phage uvMED]
ERR594382.28316465_67_4_3 gi|787065740|dbj|BAR36435.1| 2 34 224 256 1.31e-07 55.1 66.667 33 11 0 1407671 Viruses uncultured Mediterranean phage uvMED hypothetical protein [uncultured Mediterranean phage uvMED]
ERR594382.28316466_18_6_3 gi|1200295886|gb|OUU17830.1| 1 33 92 124 1.73e-12 70.1 100.000 33 0 0 1986638 Bacteria Alphaproteobacteria bacterium TMED37 hypothetical protein CBB97_21775 [Candidatus Endolissoclinum sp. TMED37]
ERR594382.28316470_37_1_1 gi|787067413|dbj|BAR37857.1| 16 43 60 87 1.94e-09 58.9 96.429 28 1 0 1407671 Viruses uncultured Mediterranean phage uvMED terminase large subunit [uncultured Mediterranean phage uvMED]
ERR594382.28316474_2_5_1 gi|1219813777|gb|ASN63501.1| 1 33 62 94 3.55e-12 64.3 81.818 33 6 0 340016 Viruses uncultured
You could use grep or pcregrep (e.g. if you are on macOS) with this:
pcregrep -o "gi\|\K.+?(?=\|)" file
or with:
grep -oP "gi\|\K.+?(?=\|)" file
The \K can be read as "discard everything matched so far and return only what follows", and .+?(?=\|) then matches characters non-greedily until the next | is found.
The easiest way, provided your delimiter is fixed, could be cut:
cut -f2 -d"|" file
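Run against the sample lines above, either approach should print the accession numbers between gi| and the following |, something like:
914605561
914605561
1270336953
705244733
1200458341
787065740
787065740
1200295886
787067413
1219813777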

Obtaining UNIQUE VALUE occurrences count in a set of COLUMNS using AWK

IGNORING columns 1 & 2 (considering only the rest of the columns), I would like to obtain the occurrence COUNT of UNIQUE EVEN values (ignoring ODD ones) for the following set of data.
I have tried:
awk '{ a[$3, $4, $5, $6, $7]++ } END { for (b in a) { cnt+=1 } {print cnt}}' file
I obtain 76, but this is not the value I expect.
> 0 0
> 1 0 0
> 2 0 2
> 3 0 0 6
> 4 0 0 8
> 5 0 0 10
> 6 0 2 14
> 7 0 2 16
> 8 0 0 6 20
> 9 0 0 8 24
> 10 0 0 8 26
> 11 0 0 10 32
> 12 0 0 10 34
> 13 0 2 14 40
> 14 0 2 16 42
> 15 0 0 8 24 48
> 16 0 0 8 24 50
> 17 0 0 8 26 56
> 18 0 0 10 32 60
> 19 0 0 10 34 64
> 20 0 0 10 34 66
> 21 0 2 14 40 72
> 22 0 0 8 24 48 76
> 23 0 0 8 24 50 82
> 24 0 0 8 26 56 88
> 25 0 0 8 26 56 90
> 26 0 0 10 32 60 96
> 27 0 0 10 32 60 98
> 28 0 0 10 34 64 104
> 29 0 0 10 34 64 106
> 30 0 0 10 34 66 112
> 31 0 0 10 34 66 114
> 0 1
> 1 1 2 5
> 2 1 2
> 3 1 2 12 23 19
> 4 1 2 12 23
> 5 1 2 12
> 6 1 2 12 28
> 7 1 2 12 28 36
> 8 1 2 12 30 47 45
> 9 1 2 12 30 47
> 10 1 2 12 30
> 11 1 2 12 30 52
> 12 1 2 12 28 38
> 13 1 2 12 28 38 62
> 14 1 2 12 28 38 62 68
> 15 1 2 12 30 54 75
> 16 1 2 12 30 54
> 17 1 2 12 30 54 78
> 18 1 2 12 30 54 78 84
> 19 1 2 12 30 54 78 84 92
> 20 1 2 12 28 38 62 70
> 21 1 2 12 28 38 62 70 108
> 22 1 2 12 30 54 80
> 23 1 2 12 30 54 78 86
> 24 1 2 12 30 54 78 86 120
> 25 1 2 12 30 54 78 84 94
> 26 1 2 12 30 54 78 84 94 124
> 27 1 2 12 30 54 78 84 92 102
> 28 1 2 12 30 54 78 84 92 102 128
> 29 1 2 12 28 38 62 70 110
> 30 1 2 12 28 38 62 70 110 130
> 31 1 2 12 28 38 62 70 108 116
> 0 2
> 1 2 2 5
> 2 2 2
> 3 2 2 5 6
> 4 2 2 5 6 18
> 5 2 2 5 6 18 22
> 6 2 2 14
> 7 2 2 16
> 8 2 2 5 6 20
> 9 2 2 5 6 20 44
> 10 2 2 5 6 18 26
> 11 2 2 5 6 18 22 32
> 12 2 2 5 6 18 22 32 58
> 13 2 2 14 40
> 14 2 2 16 42
> 15 2 2 5 6 20 44 50 75
> 16 2 2 5 6 20 44 50
> 17 2 2 5 6 18 26 56
> 18 2 2 5 6 18 22 32 60
> 19 2 2 14 40 72 109 101
> 20 2 2 14 40 72 109
> 21 2 2 14 40 72
> 22 2 2 5 6 20 44 50 80
> 23 2 2 5 6 20 44 50 80 118
> 24 2 2 5 6 20 44 50 80 118 120
> 25 2 2 5 6 20 44 50 80 118 120 122
> 26 2 2 14 40 72 109 101 102 127
> 27 2 2 14 40 72 109 101 102
> 28 2 2 14 40 72 109 101 104
> 29 2 2 14 40 72 116 133 131
> 30 2 2 14 40 72 116 133
> 31 2 2 14 40 72 116
> 0 3
> 1 3 0
> 2 3 0 4
> 3 3 0 6
> 4 3 0 6 18
> 5 3 0 6 18 22
> 6 3 0 4 16 37
> 7 3 0 4 16
> 8 3 0 6 20
> 9 3 0 6 18 26 47
> 10 3 0 6 18 26
> 11 3 0 6 18 22 32
> 12 3 0 6 18 22 32 58
> 13 3 0 4 16 42 69
> 14 3 0 4 16 42
> 15 3 0 6 18 26 47 48
> 16 3 0 6 18 26 47 48 74
> 17 3 0 6 18 26 56
> 18 3 0 6 18 22 32 60
> 19 3 0 6 18 22 32 58 64
> 20 3 0 6 18 22 32 58 66
> 21 3 0 6 18 22 32 58 66 108
> 22 3 0 6 18 26 47 48 76
> 23 3 0 6 18 26 56 86
> 24 3 0 6 18 26 56 88
> 25 3 0 6 18 26 56 90
> 26 3 0 6 18 22 32 60 96
> 27 3 0 6 18 22 32 60 98
> 28 3 0 6 18 22 32 58 64 104
> 29 3 0 6 18 22 32 58 64 106
> 30 3 0 6 18 22 32 58 66 112
> 31 3 0 6 18 22 32 58 66 114
> 0 4
> 1 4 0
> 2 4 2
> 3 4 0 6
> 4 4 0 8
> 5 4 0 10
> 6 4 2 16 37
> 7 4 2 16
> 8 4 0 8 24 45
> 9 4 0 8 24
> 10 4 0 8 26
> 11 4 0 8 26 52
> 12 4 2 16 37 38
> 13 4 2 16 42 69
> 14 4 2 16 42
> 15 4 0 8 24 48
> 16 4 0 8 24 50
> 17 4 0 8 26 56
> 18 4 0 8 26 52 60
> 19 4 2 16 37 38 64
> 20 4 2 16 42 69 70
> 21 4 2 16 42 69 72
> 22 4 0 8 24 48 76
> 23 4 0 8 24 50 82
> 24 4 0 8 26 56 88
> 25 4 0 8 26 52 60 94
> 26 4 0 8 26 52 60 96
> 27 4 0 8 26 52 60 98
> 28 4 2 16 37 38 64 104
> 29 4 2 16 42 69 70 110
> 30 4 2 16 42 69 70 112
> 31 4 2 16 42 69 70 114
You can try this awk command to count unique values, ignoring the 1st and 2nd columns:
awk '{$1=$2=""; !seen[$0]++} END{print length(seen)}' file
130
If you are counting unique values excluding the 1st and 2nd columns and also ignoring odd numbers, then use:
awk '{for (i=3; i<=NF; i++) !($i%2) && !seen[$i]++} END{print length(seen)}' file
63
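Note that calling length() on an array is not part of POSIX awk (it works in gawk); a sketch of an equivalent that keeps its own counter instead, using the same even/unique logic:
awk '{for (i=3; i<=NF; i++) if ($i % 2 == 0 && !seen[$i]++) cnt++} END{print cnt+0}' file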
