If the last column contains fewer than 2 values, the whole row should be removed.
Sample data:
18106|1.0.4.0/22|223 121 1836
3549|1.0.10.0/24|421 21
5413|1.0.0.0/16|789
2152|1.4.0.0/16|745 89 1876
3549|1.0.8.0/22|680
Expected output:
18106|1.0.4.0/22|223 121 1836
3549|1.0.10.0/24|421 21
2152|1.4.0.0/16|745 89 1876
Is there any way to do it?
If there are no spaces after the single value, you can just eliminate lines with no space after the last |:
grep -v '|[^ |]*$'
[...] is a character class. [ |] matches a space or |.
^ inside a character class negates it, i.e. [^ |] matches anything but space or |.
* means "repeated zero or more times"
$ matches the end of line
-v shows the lines not matching the pattern
So the whole thing means "skip lines that contain vertical bar followed by characters different to space and vertical bar till the end of the line"
It doesn't work for your sample data, though, as there's a space after 789. So, check there's a space followed by non-space after the last |:
grep '|[^ |]\+ [^ |]\+'
Here, \+ means "repeated one or more times".
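A slightly stricter variant of the same idea, as a rough sketch: anchoring the pattern at the end of the line ensures that only the last field is examined, and the ` *` before `$` tolerates trailing spaces. The function name `keep_multi_value_rows` is made up for illustration, and `-E` switches grep to extended regular expressions so that `+` needs no backslash. It assumes the last field does not start with a space.

```shell
# Keep only rows whose last |-separated field holds two or more
# space-separated values. keep_multi_value_rows is a hypothetical name.
keep_multi_value_rows() {
    # \|            the last vertical bar (no further | may follow it)
    # [^ |]+        the first value
    # ( +[^ |]+)+   one or more further values, each preceded by spaces
    #  *$           optional trailing spaces, then end of line
    grep -E '\|[^ |]+( +[^ |]+)+ *$'
}

# Sample data from the question; the 789 row carries a trailing space.
printf '%s\n' \
    '18106|1.0.4.0/22|223 121 1836' \
    '3549|1.0.10.0/24|421 21' \
    '5413|1.0.0.0/16|789 ' \
    '2152|1.4.0.0/16|745 89 1876' \
    '3549|1.0.8.0/22|680' | keep_multi_value_rows
```

Only the three rows with two or more values in the last field are printed.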
Short awk solution:
awk -F'\\|' 'split($NF,a," ")>=2' file
The output:
18106|1.0.4.0/22|223 121 1836
3549|1.0.10.0/24|421 21
2152|1.4.0.0/16|745 89 1876
split($NF,a," ") splits the last field on whitespace and returns the number of pieces
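The same one-liner can be spelled out a little more, in case the threshold needs to change later. This is only a sketch: `filter_by_last_field` and the `min_values` variable are names invented here, not part of the answer above.

```shell
# Print rows whose last |-separated field splits into at least
# min_values whitespace-separated pieces (default 2).
# filter_by_last_field is a hypothetical helper name.
filter_by_last_field() {
    awk -F'|' -v min_values="${1:-2}" '
        {
            # split() with " " as separator ignores leading and trailing
            # blanks and returns the number of pieces it found
            count = split($NF, parts, " ")
            if (count >= min_values) {
                print
            }
        }'
}

printf '%s\n' \
    '18106|1.0.4.0/22|223 121 1836' \
    '5413|1.0.0.0/16|789' \
    '3549|1.0.10.0/24|421 21' | filter_by_last_field
```

Calling it as `filter_by_last_field 3` would keep only rows with three or more values in the last field.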
I have created this file concacaf.txt with the following input
David Canada 5
Larin Canada 5
Borges Costa Rica 2
Buchanan Canada 2
Davis Panama 2
Gray Jamaica 2
Henriquez El Salvador 2
Is there a way that I can either use the cut command and treat Costa Rica or El Salvador as a single word or modify the text so that when I use:
cut -f 1,3 -d ' ' concacaf.txt
I get 'Borges 2' instead of 'Borges Rica'. Thanks
It is not possible using cut but it is possible using sed:
sed -E 's/^([^ ]*) .* ([^ ]*)$/\1 \2/' concacaf.txt
It searches for the first word ([^ ]*, a sequence of non-space characters) at the beginning of the line and the word at the end of the line and replaces the entire line with the first word and the last word and a space between them.
The option -E tells sed to use extended regular expressions (by default it uses basic regular expressions, where the grouping parentheses need to be escaped).
The sed command is s (substitute). It searches in each line using a regular expression and replaces the matching substring with the provided replacement string. In the replacement string, \1 represents the substring matching the first capturing group, \2 the second group and so on.
The regular expression is explained below:
^ # matches the beginning of line
( # starts a group (it is not a matcher)
[^ ] # matches any character that is not a space (there is a space after `^`)
* # the previous sub-expression, zero or more times
) # close the group; the matched substring is captured
# there is a space here in the expression; it matches a space
.* # match any character, any number of times
# match a space
([^ ]*) # another group that matches a sequence of non-space characters
$ # match the end of the line
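If a regular expression feels heavy for this, awk offers a different route to the same result (a swapped-in technique, not the sed approach explained above). A minimal sketch, assuming fields are separated by blanks and that the number is always the last word; `first_and_last` is a made-up function name:

```shell
# awk splits each line on runs of blanks, so $1 is the first word and
# $NF the last one, however many words lie in between.
# first_and_last is a hypothetical name; blank lines are skipped.
first_and_last() {
    awk 'NF { print $1, $NF }'
}

printf '%s\n' \
    'David Canada 5' \
    'Borges Costa Rica 2' \
    'Henriquez El Salvador 2' | first_and_last
```

For the second line this gives `Borges 2` rather than `Borges Rica`.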
You can use rev to cut out that last field containing the integer:
$ cat concacaf.txt | rev | cut -d' ' -f2- | rev
David Canada
Larin Canada
Borges Costa Rica
Buchanan Canada
Davis Panama
Gray Jamaica
Henriquez El Salvador
For instance, let's say I have a text file:
worker1, 0001, company1
worker2, 0002, company2
worker3, 0003, company3
How would I use sed to take the first 2 characters of the first column (here "wo"), remove the rest of that column, and attach them to the second column, so that the output looks like this:
wo0001,company1
wo0002,company2
wo0003,company3
$ sed -E 's/^(..)[^,]*, ([^,]*,) /\1\2/' file
wo0001,company1
wo0002,company2
wo0003,company3
s/ begin substitution
^(..) match the first two characters at the beginning of the line, captured in a group
[^,]* match any amount of non-comma characters of the first column
, match a comma and a space character
([^,]*,) match the second field and comma captured in a group (any amount of non-comma characters followed by a comma)
match the next space character
/\1\2/ replace with the first and second capturing group
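For comparison, here is how the same transformation might look in awk, which treats the columns as fields instead of matching them with a pattern. This is a sketch under the assumption that every line has exactly three fields separated by a comma and optional spaces; `merge_prefix` is an invented name.

```shell
# -F', *' makes a comma followed by any number of spaces the separator,
# so $1, $2 and $3 are the three columns without surrounding blanks.
# merge_prefix is a hypothetical name.
merge_prefix() {
    awk -F', *' '{ print substr($1, 1, 2) $2 "," $3 }'
}

printf '%s\n' \
    'worker1, 0001, company1' \
    'worker2, 0002, company2' | merge_prefix
```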
I need to replace only white spaces of a tab delimited file with underscores (but keeping the tabulation and the division in lines). The file is composed of 5 million lines and 8 columns, here the first two lines as example:
Contig505_strand1_frame2_coord21-810 sp|Q06605|GRZ1_RAT Granzyme-like protein 1 OS=Rattus norvegicus PE=2 SV=1 32.245 245 153 6 5.15e-33 123
Contig505_strand1_frame2_coord21-810 sp|P36178|CTRB2_LITVA Chymotrypsin BII OS=Litopenaeus vannamei PE=1 SV=1 34.483 232 140 7 1.78e-32 122
For now I am using these commands in sequence, but it's very slow. Is there a quicker way to do it?
tr -s '\t' ';' <inputfile.txt >file2.txt
tr -s '[:blank:]' '_' <file2.txt >file3.txt
tr -s ';' '\t' <file3.txt >file4.txt
thank you!
[:blank:] includes tabs, so I think if you want to replace one or more spaces with an underscore this may work better:
sed -E 's/ +/_/g' inputfile.txt > file2.txt
The sed (stream editor) command searches for one or more spaces and replaces them with an underscore. The 'g' is for global, meaning do the replacement multiple times on a line if found. The default action is to replace only the first occurrence.
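A quick way to convince yourself that the tabs survive is to run the command on a small tab-delimited sample first. The sketch below wraps it in a function (`spaces_to_underscores` is just an illustrative name) and feeds it one made-up line with two tabs:

```shell
# Replace every run of one or more spaces with a single underscore.
# Tabs are not spaces, so the column separators are left alone.
# spaces_to_underscores is a hypothetical name.
spaces_to_underscores() {
    sed -E 's/ +/_/g'
}

# Three tab-separated columns; only the middle one contains spaces.
printf 'Contig505\tGranzyme-like protein 1\t32.245\n' | spaces_to_underscores
```

The middle column comes out as `Granzyme-like_protein_1`, still separated from its neighbours by tabs.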
I have a huge file.
On every line from line 3 to the next-to-last line (number of lines in the file minus 1),
starting at character position 75 on the line, I need to change the string to 123456789.
Any thoughts or suggestions? The existing characters on each line are not duplicates, so I can't search on them.
The joys of hiding PII data.
In vim, you can do this:
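:3,$-1s/\(^.\{74\}\)\@<=........./123456789/
which basically does a lookbehind (\@<=) for exactly 74 characters from the beginning of the line, so the match starts at column 75, and replaces the next nine characters with your string. The range 3,$-1 limits the substitution to line 3 through the next-to-last line.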
Let's consider this test file:
$ cat testfile
.........-.........-.........-.........-.........-.........-.........-....ReplaceMeKeep
.........-.........-.........-.........-.........-.........-.........-....OldData..Keep
Using sed
This replaces the nine characters starting at column 75 with 123456789:
$ sed -E 's/(.{74}).{0,9}/\1123456789/' testfile
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
Using awk
This puts the new string in place of the first nine characters starting at position 75:
$ awk '{print substr($0,1,74) "123456789" substr($0,75+9)}' testfile
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
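Neither command above restricts the change to line 3 through the next-to-last line, which the question asks for. One way to add that, sketched in awk, is to hold each line back until the next one has been read, so that the final line can be recognised and left alone. The function name `mask_columns` and its two arguments are assumptions made for this example, and lines shorter than the start column are not handled specially.

```shell
# Overwrite length(new) characters starting at column `start`, but only
# on lines 3 through the next-to-last one.
# mask_columns is a hypothetical name; usage: mask_columns START NEW
mask_columns() {
    awk -v start="$1" -v new="$2" '
        NR > 1 {
            # prev holds line NR-1, which is now known not to be the last
            if (NR - 1 >= 3) {
                prev = substr(prev, 1, start - 1) new substr(prev, start + length(new))
            }
            print prev
        }
        { prev = $0 }
        END { if (NR > 0) print prev }   # the last line is printed unchanged
    '
}

# Small demonstration: five identical lines, overwrite columns 3-4 with XX.
printf '%s\n' abcdef abcdef abcdef abcdef abcdef | mask_columns 3 XX
```

For the case in the question the call would be along the lines of `mask_columns 75 123456789 < inputfile > outputfile`, with the file names being placeholders.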
I have a text file with lines like this:
Sequences (1:4) Aligned. Score: 4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100
...
I want to be able to extract the values into a new tab delimited text file:
1 4 4
100 3011 77
12 345 100
(like this but with tabs instead of spaces)
Can anyone suggest anything? Some combination of sed or cut maybe?
You can use Perl:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/'
Or, to save to file:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/' > data2.txt
Little explanation:
Regex here is in the form:
s/RULES_HOW_TO_MATCH/HOW_TO_REPLACE/
How to match = .*?(\d+):(\d+).*?(\d+)
How to replace = $1\t$2\t$3
In our case, we used the following tokens to declare how we want to match the string:
.*? - match any character ('.') as few times as possible ('*?' is the non-greedy form of '*'), stopping as soon as the next token in the regex (which is \d in our case) can match.
\d+:\d+ - match at least one digit followed by colon and another number
.*? - same as above
\d+ - match at least one digit
Additionally, if some token in the regex is in parentheses, it means "save it so I can reference it later". The first parenthesized group will be known as '$1', the second as '$2' etc. In our case:
.*?(\d+):(\d+).*?(\d+)
$1 $2 $3
Finally, we're taking $1, $2, $3 and printing them out separated by tab (\t):
$1\t$2\t$3
You could use sed:
sed 's/[^0-9]*\([0-9]*\)/\1\t/g' infile
Here's a BSD sed compatible version:
sed 's/[^0-9]*\([0-9]*\)/\1'$'\t''/g' infile
The above solutions leave a trailing tab in the output; append s/\t$// or s/'$'\t''$// respectively to remove it.
If you know there will always be 3 numbers per line, you could go with grep:
<infile grep -o '[0-9]\+' | paste - - -
Output in all cases:
1 4 4
100 3011 77
12 345 100
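Another possibility, shown here only as a sketch, is to let awk treat every run of non-digit characters as the field separator. It assumes each line starts with non-digit text (as "Sequences (" does), so the first field is empty and the three numbers land in $2, $3 and $4; `extract_numbers` is a made-up name.

```shell
# Every run of non-digits acts as a separator. The text before the first
# number makes $1 empty, so the three numbers are $2, $3 and $4.
# extract_numbers is a hypothetical name.
extract_numbers() {
    awk -F'[^0-9]+' -v OFS='\t' '{ print $2, $3, $4 }'
}

printf '%s\n' \
    'Sequences (1:4) Aligned. Score: 4' \
    'Sequences (100:3011) Aligned. Score: 77' \
    'Sequences (12:345) Aligned. Score: 100' | extract_numbers
```

Unlike the sed versions, this leaves no trailing tab to clean up.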
My solution using sed:
sed 's/[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)/\1\t\2\t\3/' file.txt