grep - removing a line that contains anything other than specified characters

I'm trying to find a way to delete any lines that contain characters other than the ones I specify. For example, if I specify the characters a,e,i,o,u,r,s,t and I have this list of words:
rat
tar
set
meow
Then "meow" should be deleted from the list because it contains the letters "m" and "w", which I haven't okayed. Any ideas?

Alternatively you can do this:
$ grep -v '[^aeiourst]' file.txt
rat
tar
set
The pattern matches lines that contain any character not specified in the list. This is clearly explained in the grep manual page:
A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that list; if the first character of the list is the caret ^ then it matches any character not in the list. For example, the regular expression [0123456789] matches any single digit.
Since you want to remove the lines that match that pattern, the -v/--invert-match option is added. This is also well explained in the grep manual page:
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)

This should do it for you. It has the letters you specified in a set, enclosed by []. * denotes that they can occur any number of times (including zero, so empty lines match too). ^ anchors the match at the start of the line and $ anchors it at the end, so the whole line must consist only of those letters.
grep '^[aeiourst]*$' file.txt
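As a quick check, both approaches can be run side by side on the sample words. This is only a sketch: the temporary file, the variable names and the extra empty line are assumptions added for illustration, and -x (match the whole line) is a standard grep option used here as an alternative to writing the anchors by hand.

```shell
#!/bin/sh
# Sample words from the question, plus an empty line to show the edge case.
tmp=$(mktemp)
printf '%s\n' rat '' tar set meow > "$tmp"

allowed='aeiourst'   # permitted characters (assumed to contain no ], ^ or -)

# 1. Drop every line that contains a character outside the set.
inverted=$(grep -v "[^$allowed]" "$tmp")

# 2. Keep every line that consists only of characters from the set.
anchored=$(grep "^[$allowed]*\$" "$tmp")

# 3. Same as 2, with -x (whole-line match) instead of explicit anchors.
whole=$(grep -x "[$allowed]*" "$tmp")

# 4. Variant that also drops empty lines: require at least one character.
nonempty=$(grep -xE "[$allowed]+" "$tmp")

printf '%s\n' "$inverted"
rm -f "$tmp"
```

The first three commands select the same lines, including the empty one; only the last variant filters it out.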

Related

how to add a word after specific patterns into a file linux

I have around 3000 files (phylogenetic tree files) in which there are some specific genes where I want to insert {Foreground} before the : .
For instance;
(CBREN_CBREN.CBN09275:0.1505047394,((((((CBREN_CBREN.CBN30237:0.1134434184,CDOUG_CDOUG.g12077.t1:0.1229043127)92:0.0214649873,(CTRO_Csp11.Scaffold630.g17672.t1:0.0631318986,CWALLA_CWALL.g8382.t1:0.0753910535)92:0.0239057141)93:0.0325662116,((CBRI_CBG17629:0.0312071500,CNIGO_CNIGO.Cni-ugt-54:0.0736951024)99:0.0494942769,(CSINI_CSINI.Csp5_scaffold_00095.g4122.t1:0.0606700444,(CTRIBU_CTRIB.g6645.t1:0.0736535896,CZANZI_CZANZ.g13363.t1:0.0688400206)58:0.0091500887)100:0.0582326665)83:0.0238218345)64:0.0211630102,(CLAT_CLATE.FL83_14023:0.0101547146,CREMA_CREMA.FL82_03023:0.0239757985)100:0.0954119437)99:0.0555252013,(CELE_T25B9.7:0.1602734533,CINO_Sp34_40094810.t1:0.2305325582)93:0.0423976759)99:0.1230996301,(CJPJ_00498800.t1:0.0895372175,CJPJ_00700900.t1:0.1411739758)100:0.1285915300)100:0.8994859943,(((((((CBRI_CBG13049:0.0452507889,CNIGO_CNIGO.Cnig_chr_II.g5360.1:0.0660490258)68:0.0042384566,CNIGO_CNIGO.Cnig_chr_II.g5361.1:0.0321380678)100:0.0949970282,(CSINI_CSINI.Csp5_scaffold_00169.g6176.t1:0.0626931406,CZANZI_CZANZ.g6858.t1:0.0894503797)100:0.0715162764)100:0.0539156634,(CLAT_CLATE.FL83_09404:0.0252696400,CREMA_CREMA.FL82_14428:0.0155270060)100:0.0771958234)73:0.0198698760,CELE_T03D3.1:0.2195730426)85:0.0368288871,CDOUG_CDOUG.g745.t1:0.1814046140)65:0.0156158577,CWALLA_CWALL.g18942.t1:0.1591453045)78:0.0306577438);
What I need is to include {Foreground} before : of each CELE..., CBRI.. and CTRO..
For instance,
CTRO_Csp11.Scaffold630.g17672.t1{Foreground}:0.0631318986
CELE_T03D3.1{Foreground}:0.2195730426
CBRI_CBG17629{Foreground}:0.0312071500
in one file for these tree matches, and one by one as separate files.
I tried
cat OG0000733.tree |sed -e 's/CINO_Sp34_........\.t1/&{Foreground}/g' > edited.tree
but the number of characters after _ is different for each gene.

Use
sed -E 's/(CELE|CBRI|CTRO)[^:]*/&{Foreground}/g' OG0000733.tree > edited.tree
The (CELE|CBRI|CTRO)[^:]* expression finds CELE, CBRI or CTRO and then any zero or more characters other than a colon.
Replacement is the whole matched text (&) and a {Foreground} string.
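A small sketch of that command on a shortened tree, including one possible reading of the "one by one as separate files" request (one output per gene prefix). The sample string, the temporary directory and the out_PREFIX.tree names are hypothetical; for the real data the same sed call could be wrapped in a loop such as for f in *.tree.

```shell
#!/bin/sh
# Shortened stand-in for a tree file (made-up branch lengths, not real data).
tree='(CELE_T25B9.7:0.16,(CINO_Sp34_40094810.t1:0.23,CBRI_CBG17629:0.03)93:0.04,CTRO_Csp11.Scaffold630.g17672.t1:0.06);'

# All three prefixes at once: & re-inserts the matched gene name.
all=$(printf '%s\n' "$tree" | sed -E 's/(CELE|CBRI|CTRO)[^:]*/&{Foreground}/g')
printf '%s\n' "$all"

# One result file per prefix; out_CELE.tree etc. are hypothetical names.
outdir=$(mktemp -d)
for p in CELE CBRI CTRO; do
    printf '%s\n' "$tree" |
        sed -E "s/${p}[^:]*/&{Foreground}/g" > "$outdir/out_$p.tree"
done
cele_only=$(cat "$outdir/out_CELE.tree")
rm -rf "$outdir"
```

Because [^:]* stops at the first colon, the tag always lands directly before the branch length, however long the gene name is.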

Replace the . by something that repeats any number of times, but never reaches too far. It seems it could be a colon:
CINO_Sp34_[^:]*\.t1

Regular Expression symbol "?" not working in shell

I have one file with the following data. I am using egrep as the command
which is used for extended regular expression pattern
test
best
+see
done++
feett
ttesingt
I want the output as below
best
+see
done++
vino+
I am using the below command for the output
egrep 't?' filename.
We know that ? means zero or one occurrence of the previous
character. So in my case t is optional, and if it is present there should be
only one t, but I am getting all lines as output.
Please let me know how to achieve the required output.

? means: the preceding item is optional and matched at most once.
In egrep 't?' filename, the t is therefore optional: the pattern also matches where t occurs zero times, so every line matches and the whole file is printed.
Example: egrep 'tt?' filename requires one t and makes the second t optional, so it selects the same lines as
egrep 't' filename

In short, egrep 't?' filename essentially means "find me a 't' character, but if it's absent, that's OK too".
Let's start with egrep 't'. This asks to find one t character. It doesn't say only one character, so it will match "at", "att", "attt" and so on.
Then you add "?" to it -- egrep 't?', making "t" optional, now it matches "a", "at", "att", "abc", "xyz" and basically any other string you can imagine.
In case of one-character searches the "?" modifier doesn't really make any sense, only when there's more to find, like for example egrep 'ab?c' that matches "abc", "acd", but not "abb".
How to make your example work?
A simple way will be just to chain two egreps together:
egrep t filename | egrep -v tt
First egrep gives you only lines containing at least one t, the second one will throw away all lines with two (or more) t characters together.
More complex solution using regexp will look like this:
egrep '^((^|[^t]+)t($|[^t]+))+$' filename
I would personally prefer the first one :)
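If the goal is read as "lines with at most one t" (an assumption, but it fits the best, +see and done++ lines of the desired output), the ? only becomes useful once the pattern is anchored to the whole line. A sketch using grep -E, the current spelling of egrep, with a hypothetical temporary file holding the sample data:

```shell
#!/bin/sh
tmp=$(mktemp)
printf '%s\n' test best +see 'done++' feett ttesingt > "$tmp"

# 't?' also matches the empty string, so every line is selected.
count_all=$(grep -c -E 't?' "$tmp")

# Whole line = non-t characters, an optional single t, more non-t characters.
at_most_one=$(grep -E '^[^t]*t?[^t]*$' "$tmp")

printf '%s\n' "$count_all" "$at_most_one"
rm -f "$tmp"
```

The first command counts all six lines; the second keeps only best, +see and done++.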

Vim or sed : Replace character(s) within a pattern

I wanted to replace underscores with hyphens in all places where the character ('_') is preceded and followed by uppercase letters, e.g. QWQW_IOIO, OP_FD_GF_JK, TRT_JKJ, etc. The replacement is needed throughout one document.
I tried to replace this in vim using:
:%s/[A-Z]_[A-Z]/[A-Z]-[A-Z]/g
But that replaced QWQW_IOIO with QWQ[A-Z]-[A-Z]OIO :(
I tried using a sed command:
sed -i '/[A-Z]_[A-Z]/ s/_/-/g' ./file_name
This replaced every underscore on each matching line, e.g. the line
QWQW_IOIO variable may contain '_' or '-'
was replaced by
QWQW-IOIO variable may contain '-' or '-'

You had the right idea with your first vim approach. But you need to use a capturing group to remember what character was found in the [A-Z] section. Those are nicely explained here and under :h /\1. As a side note, I would recommend using \u instead of [A-Z], since it is both shorter and faster. That means the solution you want is:
:%s/\(\u\)_\(\u\)/\1-\2/g
Or, if you would like to use the "very magic" setting (\v) to make it more readable:
:%s/\v(\u)_(\u)/\1-\2/g
Another option would be to limit the part of the search that gets replaced with the \zs and \ze atoms:
:%s/\u\zs_\ze\u/-/g
This is the shortest solution I'm aware of.

This should do what you want, assuming GNU sed.
sed -i -r -e 's/([A-Z]+)_([A-Z]+)/\1-\2/g' ./file_name
Explanation:
-r flag enables extended regex
[A-Z]+ is "one or more uppercase letters"
() groups a pattern together and creates a numbered memorized match
\1, \2 put those memorized matches in the replacement.
So basically this finds a chunk of uppercase letters followed by an underscore, followed by another chunk of uppercase letters, memorizes only the letter chunks as 2 groups,
([A-Z]+)_([A-Z]+)
Then it replays those groups, but with a hyphen in between instead of an underscore.
\1-\2
The g flag at the end says to do this even if the pattern shows up multiple times on one line.
Note that this falls apart a little in this case:
QWQW_IOIO_ABAB
Because it matches the first time, but not the second; the second part won't match because IOIO was consumed by the first match. So that would result in
QWQW-IOIO_ABAB
This version drops the + so it only matches one uppercase letter, and won't break in the same way:
sed -i -r -e 's/([A-Z])_([A-Z])/\1-\2/g' ./file_name
It still has a small flaw, if you have a string like this:
A_B_C
Same issue as before, just one letter now instead of multiple.
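One way around the overlapping-match problem, sketched here as a suggestion rather than as part of the answer above, is to drop the g flag and let sed repeat the substitution in a loop until nothing is left to replace. -E is used in place of -r; the sample strings are made up.

```shell
#!/bin/sh
# Single pass with g: matches cannot overlap, so the second underscore survives.
single=$(printf '%s\n' 'A_B_C' | sed -E 's/([A-Z])_([A-Z])/\1-\2/g')

# Loop: ":a" defines a label and "ta" jumps back to it whenever the s command
# replaced something, so the substitution repeats until nothing matches.
looped=$(printf '%s\n' 'A_B_C QWQW_IOIO_ABAB may contain _ or -' |
    sed -E -e ':a' -e 's/([A-Z])_([A-Z])/\1-\2/' -e 'ta')

printf '%s\n' "$single" "$looped"
```

The single pass yields A-B_C, while the loop converts every underscore between uppercase letters and leaves the standalone _ alone.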

extract first instance per line (maybe grep?)

I want to extract the first instance of a string per line in linux. I am currently trying grep but it yields all the instances per line. Below I want the strings (numbers and letters) after "tn="...but only the first set per line. The actual characters could be any combination of numbers or letters. And there is a space after them. There is also a space before the tn=
Given the following file:
hello my name is dog tn=12g3 fun 23k3 hello tn=1d3i9 cheese 234kd dks2 tn=6k4k ksk
1263 chairs are good tn=k38493kd cars run vroom it95958 tn=k22djd fair gold tn=293838 tounge
Desired output:
12g3
k38493

Here's one way you can do it if you have GNU grep, which (mostly) supports Perl Compatible Regular Expressions with -P. Also, the non-standard switch -o is used to only print the part matching the pattern, rather than the whole line:
grep -Po '^.*?tn=\K\S+' file
The pattern matches the start of the line ^, followed by any characters .*?, where the ? makes the match non-greedy. After the first match of tn=, \K "kills" the previous part so you're only left with the bit you're interested in: one or more non-space characters \S+.
As in Ed's answer, you may wish to add a space before tn to avoid accidentally matching something like footn=.... You might also prefer to use something like \w to match "word" characters (equivalent to [[:alnum:]_]).

Just split the input in tn=-separators and pick the second one. Then, split again to get everything up to the first space:
$ awk -F"tn=" '{split($2,a, " "); print a[1]}' file
12g3
k38493kd

$ awk 'match($0,/ tn=[[:alnum:]]+/) {print substr($0,RSTART+4,RLENGTH-4)}' file
12g3
k38493kd
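For the match() variant: match sets RSTART to the position of the leftmost match and RLENGTH to its length, and the 4 skips the four characters of " tn=". The sketch below reproduces it on the two sample lines; the temporary file is hypothetical, and [A-Za-z0-9] is swapped in for [[:alnum:]] only so that it also runs on older awk builds without POSIX character classes.

```shell
#!/bin/sh
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
hello my name is dog tn=12g3 fun 23k3 hello tn=1d3i9 cheese 234kd dks2 tn=6k4k ksk
1263 chairs are good tn=k38493kd cars run vroom it95958 tn=k22djd fair gold tn=293838 tounge
EOF

# match() finds the leftmost " tn=<letters/digits>"; RSTART and RLENGTH
# describe that match, and skipping 4 characters drops the " tn=" prefix.
first=$(awk 'match($0, / tn=[A-Za-z0-9]+/) {
    print substr($0, RSTART + 4, RLENGTH - 4)
}' "$tmp")

printf '%s\n' "$first"
rm -f "$tmp"
```

Because match() always reports the leftmost occurrence, the later tn= entries on each line are ignored.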

How to replace in vim

I have a line in a source file: [12 13 15]. In vim, I type:
:%s/\([0-90-9]\) /\0, /g
wanting to add a comma after 12 and 13. It works, but not quite, as it inserts an extra space: [12 , 13 , 15].
How can I achieve the desired effect?

Use \1 in the replacement expression, not \0.
\1 is the text captured by the first \(...\). If there were any more pairs of escaped parens in your pattern, \2 would match the text capture between the pair starting at the second \(, \3 at the third \(, and so on.
\0 is the entire text matched by the whole pattern, whether in parentheses or not. In your case this includes the space at the end of your pattern.
Also note that [0-90-9] is the same as [0-9]: each [...] collection matches just one character. It happens to work anyway, because in your data ‘a digit followed by a space’ matches in the same places as ‘2 digits followed by a space’. (If you actually needed to only insert commas after 2 digits, you could write [0-9][0-9].)
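The same difference between "whole match" and "first group" can be reproduced outside Vim. The sketch below uses sed as an analogy only: in sed the whole match is written & where Vim uses \0, and the variable names are made up.

```shell
#!/bin/sh
line='[12 13 15]'

# & is the whole match ("2 " including the space), like \0 in Vim,
# so the matched space is kept in front of the inserted ", ".
whole=$(printf '%s\n' "$line" | sed 's/\([0-9]\) /&, /g')

# \1 is only the captured digit, so the matched space is dropped.
group=$(printf '%s\n' "$line" | sed 's/\([0-9]\) /\1, /g')

printf '%s\n' "$whole" "$group"
```

The first result shows the extra space from the question, the second the intended [12, 13, 15].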

"I have a line in a source file:..."
But then you type :%s/..., which performs the substitution on every line that matches. Or is that the only line in your file?
If it is the only line, you don't need a group or [0-9]; just :%s/ \+/,/g will do the job.

The fine answers already point to interesting solutions, but here's another one,
making use of the \zs, which marks the start of the match. In this pattern:
/[0-9]\zs /
The searched text is /[0-9] /, but only the space counts as a match. Note
that you can use the class \d to simplify the digit character class, so the
following command shall work for your needs:
:s/\d\d\zs /, /g
This matches only the space and replaces it with a comma followed by a space.
You said you have multiple lines and that only certain lines need this change.
You can either visually select the lines to be changed or use the :global
command, which searches for lines matching a pattern and applies a command to
them. For that you'd need a pattern that matches the lines to be changed
without being more precise than it has to be. If the lines that begin with
optional spaces, a [ and two digits are the only lines to be changed,
then this would work for you:
:g/^\s*\[\d\d/s/\d\d\zs /, /g
Check the help for pattern.txt for \ze and similar and
:global.
Homework: use the help to understand \zs and see how this works:
:s/\d\d\zs\ze /,/g
