removed entry based on the length of the values

If the last column contains fewer than 2 values, the whole row should be removed.
sample data:
18106|1.0.4.0/22|223 121 1836
3549|1.0.10.0/24|421 21
5413|1.0.0.0/16|789
2152|1.4.0.0/16|745 89 1876
3549|1.0.8.0/22|680
Expected output:
18106|1.0.4.0/22|223 121 1836
3549|1.0.10.0/24|421 21
2152|1.4.0.0/16|745 89 1876
Is there any way to do it?

If there are no spaces after the single value, you can just eliminate lines with no space after the last |:
grep -v '|[^ |]*$'
[...] is a character class. [ |] matches a space or |.
^ inside a character class negates it, i.e. [^ |] matches anything but space or |.
* means "repeated zero or more times"
$ matches the end of line
-v shows the lines not matching the pattern
So the whole thing means "skip lines that contain vertical bar followed by characters different to space and vertical bar till the end of the line"
It doesn't work for your sample data, though, as there's a space after 789. So, check there's a space followed by non-space after the last |:
grep '|[^ |]\+ [^ |]\+'
Here, \+ means "repeated one or more times".
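If you would rather not depend on \+, the same check can be written with extended regular expressions. This is only a sketch, not part of the answer above: the function name keep_multi is made up for illustration, and the 789 line carries the trailing space mentioned earlier.

```shell
# Keep lines in which a | is followed by two values separated by a space.
# In the sample data only the last field contains spaces.
# With -E, + needs no backslash; [|] is a literal vertical bar.
keep_multi() {
    grep -E '[|][^ |]+ [^ |]+'
}

printf '%s\n' \
    '18106|1.0.4.0/22|223 121 1836' \
    '3549|1.0.10.0/24|421 21' \
    '5413|1.0.0.0/16|789 ' \
    '2152|1.4.0.0/16|745 89 1876' \
    '3549|1.0.8.0/22|680' | keep_multi
```

The two single-value rows (789 and 680) are dropped; the other three are printed unchanged.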

Short awk solution:
awk -F'\\|' 'split($NF,a," ")>=2' file
The output:
18106|1.0.4.0/22|223 121 1836
3549|1.0.10.0/24|421 21
2152|1.4.0.0/16|745 89 1876
split($NF,a," ") - splits the last field on whitespace and returns the number of chunks
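If the threshold might change, the same split() test can take it as a parameter. A rough sketch under assumptions: min_values and the awk variable min are hypothetical names, and -F'|' relies on awk treating a single-character separator literally.

```shell
# Print rows whose last |-separated field holds at least N values
# (N is the first argument, defaulting to 2).
min_values() {
    awk -F'|' -v min="${1:-2}" 'split($NF, a, " ") >= min'
}

# Keep only rows with three or more values in the last column.
printf '%s\n' \
    '18106|1.0.4.0/22|223 121 1836' \
    '3549|1.0.10.0/24|421 21' \
    '5413|1.0.0.0/16|789' | min_values 3
```

Because split() with " " as the separator ignores leading and trailing whitespace, a value followed by a stray space still counts as one value.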

Related

Is there a way that I can use the cut command with space as delimiter and treat a word with space like Costa Rica as a single word?

I have created this file concacaf.txt with the following input
David Canada 5
Larin Canada 5
Borges Costa Rica 2
Buchanan Canada 2
Davis Panama 2
Gray Jamaica 2
Henriquez El Salvador 2
Is there a way that I can either use the cut command and treat Costa Rica or El Salvador as a single word or modify the text so that when I use:
cut -f 1,3 -d ' ' concacaf.txt
I get 'Borges 2' instead of 'Borges Rica'. Thanks

It is not possible using cut but it is possible using sed:
sed -E 's/^([^ ]*) .* ([^ ]*)$/\1 \2/' concacaf.txt
It searches for the first word ([^ ]*, a sequence of non-space characters) at the beginning of the line and the word at the end of the line and replaces the entire line with the first word and the last word and a space between them.
The option -E tells sed to use extended regular expressions (by default it uses basic regular expressions, where the grouping parentheses need to be escaped).
The sed command is s (substitute). It searches each line using a regular expression and replaces the matching substring with the provided replacement string. In the replacement string, \1 represents the substring matching the first capturing group, \2 the second group and so on.
The regular expression is explained below:
^ # matches the beginning of line
( # starts a group (it is not a matcher)
[^ ] # matches any character that is not a space (there is a space after `^`)
* # the previous sub-expression, zero or more times
) # close the group; the matched substring is captured
# there is a space here in the expression; it matches a space
.* # match any character, any number of times
# match a space
([^ ]*) # another group that matches a sequence of non-space characters
$ # match the end of the line
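The same "first word plus last word" idea can also be sketched in awk, which splits on whitespace by default. This is an alternative for comparison, not the sed approach itself, and the function name first_and_last is hypothetical.

```shell
# Print the first and the last whitespace-separated word of every line,
# however many words the country name in between contains.
first_and_last() {
    awk '{ print $1, $NF }'
}

printf '%s\n' 'Borges Costa Rica 2' 'Henriquez El Salvador 2' | first_and_last
# prints:
# Borges 2
# Henriquez 2
```

With the real file this would be called as first_and_last < concacaf.txt.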

You can use rev to cut out that last field containing the integer:
$ cat concacaf.txt | rev | cut -d' ' -f2- | rev
David Canada
Larin Canada
Borges Costa Rica
Buchanan Canada
Davis Panama
Gray Jamaica
Henriquez El Salvador

How do I remove text using sed?

For instance, let's say I have a text file:
worker1, 0001, company1
worker2, 0002, company2
worker3, 0003, company3
How would I use sed to take the first 2 characters of the first column so "wo" and remove the rest of the text and attach it to the second column so the output would look like this:
wo0001,company1
wo0002,company2
wo0003,company3

$ sed -E 's/^(..)[^,]*, ([^,]*,) /\1\2/' file
wo0001,company1
wo0002,company2
wo0003,company3
s/ begin substitution
^(..) match the first two characters at the beginning of the line, captured in a group
[^,]* match any amount of non-comma characters of the first column
, match a comma and a space character
([^,]*,) match the second field and comma captured in a group (any amount of non-comma characters followed by a comma)
match the next space character
/\1\2/ replace with the first and second capturing group
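For comparison, the same transformation can be sketched in awk by splitting on the comma-plus-space separator. It assumes every line has exactly three fields separated by ", " as in the sample; merge_fields is a made-up name.

```shell
# Join the first two characters of field 1 with field 2,
# then append a comma and field 3.
merge_fields() {
    awk -F', ' '{ print substr($1, 1, 2) $2 "," $3 }'
}

printf '%s\n' 'worker1, 0001, company1' 'worker2, 0002, company2' | merge_fields
# prints:
# wo0001,company1
# wo0002,company2
```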

replace only white spaces (no tabs, no line end) of a tabular file with underscores

I need to replace only white spaces of a tab delimited file with underscores (but keeping the tabulation and the division in lines). The file is composed of 5 million lines and 8 columns, here the first two lines as example:
Contig505_strand1_frame2_coord21-810 sp|Q06605|GRZ1_RAT Granzyme-like protein 1 OS=Rattus norvegicus PE=2 SV=1 32.245 245 153 6 5.15e-33 123
Contig505_strand1_frame2_coord21-810 sp|P36178|CTRB2_LITVA Chymotrypsin BII OS=Litopenaeus vannamei PE=1 SV=1 34.483 232 140 7 1.78e-32 122
For now I am using these commands in sequence, but it's very slow. Is there a quicker way to do it?
tr -s '\t' ';' <inputfile.txt >file2.txt
tr -s '[:blank:]' '_' <file2.txt >file3.txt
tr -s ';' '\t' <file3.txt >file4.txt
thank you!

[:blank:] includes tabs, so I think if you want to replace one or more spaces with an underscore this may work better:
sed -E 's/ +/_/g' inputfile.txt > file2.txt
The sed (stream editor) command searches for one or more spaces and replaces them with an underscore. The 'g' is for global, meaning do the replacement multiple times on a line if found. The default action is to replace only the first occurrence.
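Since only the space character needs to change, a single tr pass may be enough and avoids the regex engine altogether. A minimal sketch, assuming each space should become its own underscore (unlike the sed command above, runs of spaces are not collapsed); spaces_to_underscores is a hypothetical name.

```shell
# Translate every space to an underscore; tabs and newlines are untouched.
spaces_to_underscores() {
    tr ' ' '_'
}

printf 'Contig505\tGranzyme-like protein 1\t32.245\n' | spaces_to_underscores
```

On the real data this would be run as spaces_to_underscores < inputfile.txt > file2.txt, one pass instead of three.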

vi sed or awk. every line in a text file. replace 9 characters starting at position 75

I have a huge file
from line 3 to end of (#lines in file -1 )
starting at character position 75 on the line. I need to change the string to 123456789.
Any thoughts or suggestions? The existing characters on each line are not duplicates, so I can't search on them.
The joys of hiding pii data

In vim, you can do this:
%s/\(^.\{74\}\)\@<=........./123456789/g
which does a lookbehind (\@<=) for the first 74 characters of the line, and replaces the next nine characters (positions 75 to 83) with your string.

Let's consider this test file:
$ cat testfile
.........-.........-.........-.........-.........-.........-.........-....ReplaceMeKeep
.........-.........-.........-.........-.........-.........-.........-....OldData..Keep
Using sed
This replaces the nine characters starting at column 75 with 123456789:
$ sed -E 's/(.{74}).{0,9}/\1123456789/' testfile
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
Using awk
This puts the new string in place of the first nine characters starting at position 75:
$ awk '{print substr($0,1,74) "123456789" substr($0,75+9)}' testfile
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
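The question also restricts the change to line 3 through the next-to-last line. One way to sketch that with sed is to branch past the substitution on the lines that must stay untouched. This is an illustration under assumptions: mask_field is a made-up name, and lines shorter than 83 characters are simply left alone.

```shell
# Skip lines 1-2 and the last line; on every other line replace
# the nine characters at positions 75-83 with 123456789.
mask_field() {
    sed -E -e '1,2b' -e '$b' -e 's/^(.{74}).{9}/\1123456789/'
}

# Build a four-line demo input: 74 filler dots, nine data characters, "Keep".
pad=$(printf '%74s' '' | tr ' ' '.')
printf '%s\n' \
    "${pad}Line1DataKeep" \
    "${pad}Line2DataKeep" \
    "${pad}Line3DataKeep" \
    "${pad}Line4DataKeep" | mask_field
```

Only the third line of the demo is rewritten; the first two and the last come through unchanged.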

Extract certain text from each line of text file using UNIX or perl

I have a text file with lines like this:
Sequences (1:4) Aligned. Score: 4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100
...
I want to be able to extract the values into a new tab delimited text file:
1 4 4
100 3011 77
12 345 100
(like this but with tabs instead of spaces)
Can anyone suggest anything? Some combination of sed or cut maybe?

You can use Perl:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/'
Or, to save to file:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/' > data2.txt
Little explanation:
Regex here is in the form:
s/RULES_HOW_TO_MATCH/HOW_TO_REPLACE/
How to match = .*?(\d+):(\d+).*?(\d+)
How to replace = $1\t$2\t$3
In our case, we used the following tokens to declare how we want to match the string:
.*? - match any character ('.') as few times as possible ('*?' is the lazy form of '*'), stopping as soon as the next token in the regex (\d in our case) can match.
\d+:\d+ - match at least one digit, followed by a colon and at least one more digit
.*? - same as above
\d+ - match at least one digit
Additionally, if some token in the regex is in parentheses, it means "save it so I can reference it later". The first pair of parentheses will be known as '$1', the second as '$2', etc. In our case:
.*?(\d+):(\d+).*?(\d+)
$1 $2 $3
Finally, we're taking $1, $2, $3 and printing them out separated by tab (\t):
$1\t$2\t$3

You could use sed:
sed 's/[^0-9]*\([0-9]*\)/\1\t/g' infile
Here's a BSD sed compatible version:
sed 's/[^0-9]*\([0-9]*\)/\1'$'\t''/g' infile
The above solutions leave a trailing tab in the output; append s/\t$// or s/'$'\t''$// respectively to remove it.
If you know there will always be 3 numbers per line, you could go with grep:
<infile grep -o '[0-9]\+' | paste - - -
Output in all cases:
1 4 4
100 3011 77
12 345 100
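Another option, sketched here with awk, is to treat every run of non-digits as the field separator. It assumes each line begins with non-digit text (so field 1 is empty) and contains exactly three numbers; extract_numbers is a hypothetical name.

```shell
# Split on runs of non-digit characters; fields 2-4 are then the three
# numbers, printed tab-separated with no trailing tab.
extract_numbers() {
    awk -F'[^0-9]+' -v OFS='\t' '{ print $2, $3, $4 }'
}

printf '%s\n' \
    'Sequences (1:4) Aligned. Score: 4' \
    'Sequences (100:3011) Aligned. Score: 77' \
    'Sequences (12:345) Aligned. Score: 100' | extract_numbers
```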

My solution using sed (GNU sed, which understands \t in the replacement):
sed 's/[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*/\1\t\2\t\3/' file.txt
