I have the following data:
T47101 UNIPROID FGFR1_HUMAN
T47101 ECNUMBER EC 2.7.10.1
T47101 SEQUENCE MWSWKCLLFWAVLVTATLCTARPSPTLPEQAQPWGAPVEVESFLVHPGDLLQLRCRLRDDVQSINWLRDGVQLAESNRTRITGEEVEVQDSVPADSGLYACVT
T47101 DRUGINFO D09HNV Intedanib Approved
T47101 DRUGINFO D01PZD Romiplostim Approved
T47101 DRUGINFO D02WVT E-3810 Phase 3
There's a lot of filler in here. The only things I'm interested in are the words that follow UNIPROID, which always differ but always end in _HUMAN. I want to keep that information (e.g. FGFR1_HUMAN).
The other thing I'm interested in is everything that follows the word SEQUENCE. I want to keep the letters that follow it. I want to get rid of everything other than those two things.
I don't have much experience with using this, so I need all the help I can get.
Here is what I currently have:
Find: .+UNIPROID(\D).+
I have regular expression checked and .matches newline unchecked.
Edit: The command I have now is able to delete everything up until FGFR1_Human, but I'm unsure how to proceed.
You could use the following.
Find what:
.+?(?|UNIPROID\h+(\w+_HUMAN)|SEQUENCE\h+(\w+)|$)
The pattern matches
.+? Match any char except a newline 1+ times, non-greedy (lazy)
(?| Branch reset group, so both alternatives capture into group 1
UNIPROID\h+(\w+_HUMAN) Match UNIPROID, 1+ horizontal whitespace chars, and capture 1+ word chars followed by _HUMAN in group 1
| Or
SEQUENCE\h+(\w+) Match SEQUENCE, 1+ horizontal whitespace chars, and capture 1+ word chars in group 1 (the branch reset reuses the group number)
| Or
$ End of the line, so lines containing neither keyword are matched in full and replaced with nothing
) Close group
Replace with:
$1
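If you would rather do this outside the editor, the same idea can be sketched in Python. This is only a sketch: Python's re module has no branch reset group and no \h, so it uses two ordinary groups and [ \t] instead. The sample text and the name extract_fields are made up for illustration.

```python
import re

# Hypothetical sample in the same layout as the data in the question.
SAMPLE = """\
T47101 UNIPROID FGFR1_HUMAN
T47101 ECNUMBER EC 2.7.10.1
T47101 SEQUENCE MWSWKCLLFWAVLVTATLCTARPSP
T47101 DRUGINFO D09HNV Intedanib Approved
"""

# Group 1 holds the UNIPROID value, group 2 the SEQUENCE value.
LINE_RE = re.compile(
    r"^.*?\b(?:UNIPROID[ \t]+(\w+_HUMAN)|SEQUENCE[ \t]+(\w+))[ \t]*$",
    re.MULTILINE,
)


def extract_fields(text):
    """Return the kept values, in file order; other lines are dropped."""
    return [m.group(1) or m.group(2) for m in LINE_RE.finditer(text)]


print(extract_fields(SAMPLE))
```

Lines that contain neither keyword simply produce no match here, which plays the role of the `$` alternative in the editor version.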
I am new here. I wanted to ask a question on using REGEX for an entity in DialogFlow
I wanted the entity to accept all text and spaces except for the symbol *
I have tried to use [A-Za-z0-9 ][^*], but it is not working. Any advice? Thanks!
In your regex, [^*] is a negated character class: it matches any single character except an asterisk. Inside square brackets the asterisk is already literal; outside a character class you need \* to refer to a literal asterisk rather than the "zero or more" quantifier.
If you want to match a line of letters or numbers as in the [A-Za-z0-9] example you give, but only if that string does not include an asterisk, then this expression should work for you:
^[a-zA-Z0-9]+$
This means "match a whole line of text if it only contains one or more of the characters a-z, A-Z, or 0-9".
If you want to match any character or group of characters in a line except for the asterisk, then you could use something like this:
(?!\*)([a-zA-Z0-9]+)(?<!\*)
The first part is called a "negative lookahead," and it looks forward to ensure we're not matching the asterisk. The last part is called a "negative lookbehind," and it looks backwards to make sure we're not matching the asterisk. The middle part is your "capture group," and confirms that you're matching any letters or numbers in a given string, but excluding the * character.
If this regex gets input like *abc, it will capture abc. If it encounters abc*, it will still capture abc. If it encounters abc*def, it will capture abc and def as two separate matches (each in capture group 1), because the match breaks at the asterisk.
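The three inputs above can be checked quickly in Python. This is a sketch using Python's re module, which is an assumption: DialogFlow's regex engine may behave differently, and it may not support lookarounds at all. For these inputs the plain character class returns the same runs as the version with lookarounds.

```python
import re

WITH_LOOKAROUNDS = re.compile(r"(?!\*)([a-zA-Z0-9]+)(?<!\*)")
PLAIN_CLASS = re.compile(r"[a-zA-Z0-9]+")

for text in ["*abc", "abc*", "abc*def"]:
    # findall returns one entry per match, so abc*def yields two entries.
    print(text, WITH_LOOKAROUNDS.findall(text), PLAIN_CLASS.findall(text))
```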
This link explains the concept of lookarounds in Regex. You can also use this Regex tester to get started practicing your Regular Expressions with explanations of what each block of characters does.
EDITED TO ADD If you're just interested in matching single characters rather than groups of characters, you can use [A-Za-z0-9] and match any upper or lowercase letter and any single digit. You don't need to exclude the * character, because the character group is already exclusive.
This is a slight duplicate of the question below, so responses here may also help you. Hope this helps!
How can I exclude asterisk in a regex expression
[A-Za-z0-9 ][^*]
What your regex will do is match 2 consecutive characters. First, it will look for any one character from A-Z, a-z, 0-9, or a space. Then, it will look at the negated set that contains *, which matches ANY character except *.
You can type your regex into https://regexr.com/ to see a breakdown of how it matches and test some strings.
For example, your regex would match these:
Aa
AA
a&
A1
0_
But would not match these:
A*
a*
1*
And WOULD NOT match anything longer than 2 characters. If you really want to match any string with any characters except *, this should work:
[^\*]+
What that will do is match any number of consecutive characters that are not *. (The + means match 1 or more characters in the set.) Escaping * is optional here: it is a reserved character in regex, but inside a character class most regex parsers treat it as the literal char *, so the backslash is a readability habit rather than a requirement. (By the same token, you could use \s instead of the blank space in your original regex, keeping in mind that \s also matches tabs and newlines.)
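To see the difference between the two patterns, here is a small sketch in Python's re module (an assumption; DialogFlow's engine may differ in details). The helper name accepts is made up; it checks whether a pattern covers the whole input.

```python
import re

TWO_CHARS = re.compile(r"[A-Za-z0-9 ][^*]")  # the pattern from the question
NO_STAR = re.compile(r"[^*]+")               # any run of non-asterisk chars


def accepts(pattern, text):
    """True when the pattern matches the entire text."""
    return pattern.fullmatch(text) is not None


for text in ["a&", "A*", "abc", "hello world 123", "abc*def"]:
    print(repr(text), accepts(TWO_CHARS, text), accepts(NO_STAR, text))
```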
I want to break the line below into 1000 lines of 3 letters each:
saahaalaasabaaboabsabyaceactaddadoadsadzaffaftagaageagoagsahaahiahsaidailaimainairaisaitalaalbaleallalpalsaltamaamiampamuanaandaneaniantanyapeapoappaptarbarcarearfarkarmarsartashaskaspassateattaukavaaveavoawaaweawlawnaxeayeaysazobaabadbagbahbalbambanbapbarbasbatbaybedbeebegbelbenbesbetbeybibbidbigbinbiobisbitbizboabobbodbogboobopbosbotbowboxboybrabrobrrbubbudbugbumbunburbusbutbuybyebyscabcadcamcancapcarcatcawcayceecelcepchicigciscobcodcogcolconcoocopcorcoscotcowcoxcoycozcrucrycubcudcuecumcupcurcutcwmdabdaddagdahdakdaldamdandapdawdaydebdeedefdeldendevdewdexdeydibdiddiedifdigdimdindipdisditdocdoedogdoldomdondordosdotdowdrydubdudduedugduhduidunduodupdyeeareateauebbecuedhedseekeeleffefsefteggegoekeeldelfelkellelmelsemeemsemuendengenseoneraereergernerrersessetaetheveeweeyefabfadfagfanfarfasfatfaxfayfedfeefehfemfenferfesfetfeufewfeyfezfibfidfiefigfilfinfirfitfixfizfluflyfobfoefogfohfonfopforfoufoxfoyfrofryfubfudfugfunfurgabgadgaegaggalgamgangapgargasgatgaygedgeegelgemgengetgeyghigibgidgiegiggingipgitgnugoagobgodgoogorgosgotgoxgoygulgumgungutguvguygymgyphadhaehaghahhajhamhaohaphashathawhayhehhemhenhepherheshethewhexheyhichidhiehimhinhiphishithmmhobhodhoehoghonhophoshothowhoyhubhuehughuhhumhunhuphuthypiceichickicyidsiffifsiggilkillimpinkinninsionireirkismitsivyjabjagjamjarjawjayjeejetjeujewjibjigjinjobjoejogjotjowjoyjugjunjusjutkabkaekafkaskatkaykeakefkegkenkepkexkeykhikidkifkinkipkirkiskitkoakobkoikopkorkoskuekyelablacladlaglamlaplarlaslatlavlawlaxlaylealedleelegleileklesletleulevlexleylezliblidlielinliplislitlobloglooloplotlowloxluglumluvluxlyemacmadmaemagmanmapmarmasmatmawmaxmaymedmegmelmemmenmetmewmhomibmicmidmigmilmimmirmismixmoamobmocmodmogmolmommonmoomopmormosmotmowmudmugmummunmusmutmycnabnaenagnahnamnannapnawnaynebneenegnetnewnibnilnimnipnitnixnobnodnognohnomnoonornosnotnownthnubnunnusnutoafoakoaroatobaobeobiocaodaoddodeodsoesoffoftohmohoohsoilokaokeoldoleomsoneonoonsoohootopeopsoptoraorborcoreorsortoseoudouroutovaoweowlownoxooxypacpadpahpalpampanpapparpaspatpawpaxp
aypeapecpedpeepegpehpenpepperpespetpewphiphtpiapicpiepigpinpippispitpiupixplypodpohpoipolpompoopoppotpowpoxproprypsipstpubpudpugpulpunpuppurpusputpyapyepyxqatqisquaradragrahrairajramranraprasratrawraxrayrebrecredreerefregreiremrepresretrevrexrhoriaribridrifrigrimrinriprobrocrodroeromrotrowrubruerugrumrunrutryaryesabsacsadsaesagsalsapsatsausawsaxsayseasecseesegseiselsensersetsewsexshasheshhshysibsicsimsinsipsirsissitsixskaskiskyslysobsodsolsomsonsopsossotsousowsoxsoyspaspysristysubsuesuksumsunsupsuqsyntabtadtaetagtajtamtantaotaptartastattautavtawtaxteatedteetegteltentettewthethothytictietiltintiptistittodtoetogtomtontootoptortottowtoytrytsktubtugtuituntuptuttuxtwatwotyeudoughukeuluummumpunsupoupsurburdurnurpuseutauteutsvacvanvarvasvatvauvavvawveevegvetvexviavidvievigvimvisvoevowvoxvugvumwabwadwaewagwanwapwarwaswatwawwaxwaywebwedweewenwetwhawhowhywigwinwiswitwizwoewogwokwonwoowopwoswotwowwrywudwyewynxisyagyahyakyamyapyaryawyayyeayehyenyepyesyetyewyidyinyipyobyodyokyomyonyouyowyukyumyupzagzapzaszaxzedzeezekzepzigzinzipzitzoazoozuzzzz
Please advise me on how to approach it.
Try the following find and replace, in regex mode:
Find: (...)(?=.)
Replace: $1\r\n
The pattern (...)(?=.) matches and captures any three letters at a time. Then, we replace with those three letters ($1) followed by a break (I used \r\n, the Windows line ending; use \n if you are on Linux). Note that the pattern also only matches if the three letters found are not the final three letters in the string. The positive lookahead (?=.) avoids adding an unwanted break at the end.
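The same find and replace can be tried in Python as a rough sketch. The short sample string is made up, and re.sub with \1 stands in for the editor's $1.

```python
import re

# Hypothetical short sample; the real line is much longer.
SAMPLE = "aahaalaasaba"

# Same idea as the editor version: three chars, only if more text follows.
broken = re.sub(r"(...)(?=.)", r"\1\n", SAMPLE)
print(broken)

# If a list is more convenient than a multi-line string:
chunks = re.findall(r".{3}", SAMPLE)
print(chunks)
```

Because of the lookahead, no newline is added after the final three letters.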
This regular expression,
.{3}\K
with a replacement of:
\n
might simply do that.
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
I have a small JSON file with some entries; here is a section:
"i":{
"normale":"3c",
"bold":"4b",
"doppio":"6c"},
"is":{
"normale":"2c",
"bold":"33",
"doppio":"66"},
I realized I have to add "\u25" in front of all the values, so I tried this command:
:%s:\("\)\(\d\d"\)\|\("\)\(\d\w"\):"\\u25\2
The idea is to search for either "dd" or "dw", and substitute the first double quote with "\u25 while keeping the rest. This is the result:
"i":{
"normale":"\u25,
"bold":"\u25,
"doppio":"\u25},
"is":{
"normale":"\u25,
"bold":"\u2533",
"doppio":"\u2566"},
If the matching string has only the two digits, the command works fine: the first double quote (the first group) is substituted and the second group is left as it was.
However, if the matching string has a digit and a character, it seems to ignore the second group, substituting the whole string. The two patterns are identical, except for \w, so it should work exactly the same. What's happening?
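Vim's \d matches only decimal digits; you'd need \x to match hex digits. As for why the digit-plus-letter case loses the rest of the value: groups are numbered across the whole pattern, so when the second alternative matches, its groups are \3 and \4, and \2 is empty.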
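The group numbering across an alternation can be seen in a small Python sketch. Python's re module is used here only for illustration; like Vim, it numbers groups by their opening parenthesis across the whole pattern.

```python
import re

# Same shape as the pattern in the question, in Python syntax.
ALTERNATION = re.compile(r'(")(\d\d")|(")(\d\w")')

for value in ['"33"', '"3c"']:
    match = ALTERNATION.search(value)
    # For "3c" only groups 3 and 4 are set, so group 2 has nothing to insert.
    print(value, match.groups())
```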
But it seems you want to replace all occurrences of :" with :"\u25.
You can use:
:%s/:"/:"\\u25/
Or, if you want to prepend \u25 to all occurrences of 2 hex digits,
:%s/\x\x/\\u25&/
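If the file is being processed by a script rather than in Vim, a comparable sketch in Python might look like this. The snippet and the name add_prefix are made up for illustration, and the lookarounds are an extra precaution (not part of the Vim commands above) so that only two hex digits sitting between :" and " are touched.

```python
import re

# Hypothetical fragment in the same shape as the file in the question.
SNIPPET = '"normale":"3c",\n"bold":"4b",\n"doppio":"66"},'

# Two hex digits that sit directly between :" and a closing quote.
VALUE_RE = re.compile(r'(?<=:")[0-9A-Fa-f]{2}(?=")')


def add_prefix(text):
    """Prepend a literal backslash-u25 to every matching value."""
    return VALUE_RE.sub(lambda m: "\\u25" + m.group(0), text)


print(add_prefix(SNIPPET))
```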
I have the following string in the code at multiple places,
m_cells->a[ Id ]
and I want to replace it with
c(Id)
where the string Id could be anything including numbers also.
A regular expression replace like below should do:
%s/m_cells->a\[\s\(\w\+\)\s\]/c(\1)/g
If you wish to apply the replacement operation on a number of files you could use the :bufdo command.
Full explanation of @BasBossink's answer (as a separate answer because this won't fit in a comment), because regexes are awesome but non-trivial and definitely worth learning:
In Command mode (ie. type : from Normal mode), s/search_term/replacement/ will replace the first occurrence of 'search_term' with 'replacement' on the current line.
The % before the s tells vim to perform the operation on all lines in the document. Any range specification is valid here, eg. 5,10 for lines 5-10.
The g after the last / performs the operation "globally" - all occurrences of 'search_term' on the line or lines, not just the first occurrence.
The "m_cells->a" part of the search term is a literal match. Then it gets interesting.
Many characters have special meaning in a regex, and if you want to use the character literally, without the special meaning, then you have to "escape" it, by putting a \ in front.
Thus \[ and \] match the literal '[' and ']' characters.
Then we have the opposite case: literal characters that we want to treat as special regex entities.
\s matches white*s*pace (space, tab, etc.).
\w matches "*w*ord" characters (letters, digits, and underscore _).
(. matches any character (except a newline). \d matches digits. There are more...)
If a character is not followed by a quantifier, then exactly one such character matches. Thus, \s will match one space or tab, but not fewer or more.
\+ is a quantifier, and means "one or more". (\? matches 0 or 1; * (with no backslash) matches any number: zero or more. Warning: matching on zero occurrences takes a little getting used to; when you're first learning regexes, you don't always get the results you expected. It's also possible to match on an arbitrary exact number or range of occurrences, but I won't get into that here.)
\( and \) work together to form a "capturing group". This means that we don't just want to match on these characters, we also want to remember them specially so that we can do something with them later. You can have any number of capturing groups, and they can be nested too. You can refer to them later by number, starting at 1 (not 0). Just start counting (escaped) left parentheses from the left to determine the number.
So here, we are matching a space followed by a group (which we will capture) of at least one "word" character followed by a space, within the square brackets.
The section between the second and third / is the replacement text.
The "c" is literal.
\1 means the first captured group, which in this case will be the "Id".
In summary, we are finding text that matches the given description, capturing part of it, and replacing the entire match with the replacement text that we have constructed.
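For comparison, the same substitution can be sketched in Python. The syntax differs slightly from Vim: in Python's re module, parentheses and + are special without a backslash. The sample line is made up.

```python
import re

# Literal prefix, escaped brackets, one whitespace char on each side,
# and a captured run of word characters in between.
CELL_RE = re.compile(r"m_cells->a\[\s(\w+)\s\]")

line = "x = m_cells->a[ Id ] + m_cells->a[ 42 ];"
# \1 in the replacement is the captured Id, as in the Vim command.
print(CELL_RE.sub(r"c(\1)", line))
```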
Perhaps a final suggestion: c after the final / (doesn't matter whether it comes before or after the 'g') enables *c*onfirmation: vim will highlight the characters to be replaced and will show the replacement text and ask whether you want to go ahead. Great for learning.
Yes, regexes are complicated, but super powerful and well worth learning. Once you have them internalized, they're actually fairly easy. I suggest that, as with learning vim itself, you start with the basics, get fluent in them, and then incrementally add new features to your repertoire.
Good luck and have fun.
I have a line in a source file: [12 13 15]. In vim, I type:
:%s/\([0-90-9]\) /\0, /g
wanting to add a comma after 12 and 13. It works, but not quite, as it inserts an extra space: [12 , 13 , 15].
How can I achieve the desired effect?
Use \1 in the replacement expression, not \0.
\1 is the text captured by the first \(...\). If there were any more pairs of escaped parens in your pattern, \2 would refer to the text captured by the pair starting at the second \(, \3 to the third \(, and so on.
\0 is the entire text matched by the whole pattern, whether in parentheses or not. In your case this includes the space at the end of your pattern.
Also note that [0-90-9] is the same as [0-9]: each [...] collection matches just one character. It happens to work anyway, because in your data ‘a digit followed by a space’ matches in the same places as ‘2 digits followed by a space’. (If you actually needed to only insert commas after 2 digits, you could write [0-9][0-9].)
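The difference between the whole match and the first group can be reproduced in Python as a quick sketch. There, \g<0> plays the role of Vim's \0 and \1 means the same thing in both.

```python
import re

line = "[12 13 15]"

# Whole match (digit plus space) is reinserted, so the space stays before the comma.
whole_match = re.sub(r"([0-9]) ", r"\g<0>, ", line)

# Only the captured digit is reinserted, so the original space is dropped.
first_group = re.sub(r"([0-9]) ", r"\1, ", line)

print(whole_match)
print(first_group)
```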
"I have a line in a source file:..."
You then type :%s/..., which runs the substitution on every line that matches. Or is that the only line in your file?
If it is the only line, you don't need a group or [0-9]; :%s/ \+/, /g will do the job.
The fine answers already point to interesting solutions, but here's another one,
making use of the \zs, which marks the start of the match. In this pattern:
/[0-9]\zs /
The searched text is /[0-9] /, but only the space counts as a match. Note
that you can use the class \d to simplify the digit character class, so the
following command shall work for your needs:
:s/\d\d\zs /, /g
(This matches only the space and replaces it with a comma and a space.)
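Outside Vim there is no \zs, but a lookbehind gives a similar effect. Here is a rough Python sketch, assuming the same sample line as in the question:

```python
import re

line = "[12 13 15]"

# (?<=\d\d) requires two digits before the space without including them
# in the match, so only the space itself is replaced.
result = re.sub(r"(?<=\d\d) ", ", ", line)
print(result)
```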
You said you have multiple lines and these changes are only to certain lines.
You can either visually select the lines to be changed or use the :global
command, which searches for lines matching a pattern and applies a command to
them. Now you'd need to build an expression that matches the lines to be changed,
as loosely as possible. If the lines that begin with optional
spaces, a [ and two digits are the only lines to be matched and no other
ones, then this would work for you:
:g/^\s*\[\d\d/s/\d\d\zs /, /g
Check :help pattern.txt for \zs, \ze and similar items, and :help :global.
Homework: use the help to understand \zs and see how this works:
:s/\d\d\zs\ze /,/g