re.split but leaving in the condition - python-3.x

I have an example text string text_var = 'ndTail7-40512-1' and I want to split the first time I see a number followed by a - BUT I want to keep the number. Currently, I have print(re.split('\d*(?=-)',text_var,1)) and my output is ['ndTail', '-40512-1']. But I want to keep that number which is the trigger so it should look like ['ndTail', '7-40512-1']. Any help?

We can try using re.findall here:
text_var = 'ndTail7-40512-1'
matches = re.findall(r'(.*?)(\d-.*$)', text_var)
print(matches[0])
This prints:
('ndTail', '7-40512-1')
Sometimes it can be easier to use re.findall rather than re.split.
The regex pattern used here says to:
(.*?) match AND capture all content up to, but including
(\d-.*$) the first digit which is followed by a hyphen;
match and capture this content all the way to the end of the input
Note that we are using re.findall which typically has the potential to return multiple matches. However, in this case, our pattern matches to the end of the input, so we are left with just a single tuple containing the two desired capture groups.

Related

Regex for specific permutations of a word

I am working on a wordle bot and I am trying to match words using regex. I am stuck at a problem where I need to look for specific permutations of a given word.
For example, if the word is "steal" these are all the permutations:
'tesla', 'stale', 'steal', 'taels', 'leats', 'setal', 'tales', 'slate', 'teals', 'stela', 'least', 'salet'.
I had some trouble creating a regex for this, but eventually stumbled on positive lookaheads which solved the issue. regex -
'(?=.*[s])(?=.*[l])(?=.*[a])(?=.*[t])(?=.*[e])'
But, if we are looking for specific permutations, how do we go about it?
For example words that look like 's[lt]a[lt]e'. The matching words are 'steal', 'stale', 'state'. But I want to limit the count of l and t in the matched word, which means the output should be 'steal' & 'stale'. 1 obvious solution is this regex r'slate|stale', but this is not a general solution. I am trying to arrive at a general solution for any scenario and the use of positive lookahead above seemed like a starting point. But I am unable to arrive at a solution.
Do we combine positive lookaheads with normal regex?
s(?=.*[lt])a(?=.*[lt])e (Did not work)
Or do we write nested lookaheads or something?
A few more regex that did not work -
s(?=.*[lt]a[tl]e)
s(?=.*[lt])(?=.*[a])(?=.*[lt])(?=.*[e])
I tried to look through the available posts on SO, but could not find anything that would help me understand this. Any help is appreciated.
You could append the regex which matches the permutations of interest to your existing regex. In your sample case, you would use:
(?=.*s)(?=.*l)(?=.*a)(?=.*t)(?=.*e)s[lt]a[lt]e
This will match only stale and slate; it won't match state because it fails the lookahead that requires an l in the word.
Note that you don't need the (?=.*s)(?=.*a)(?=.*e) in the above regex as they are required by the part that matches the permutations of interest. I've left them in to keep that part of the regex generic and not dependent on what follows it.
Demo on regex101
Note that to allow for duplicated characters you might want to change your lookaheads to something in this form:
(?=(?:[^s]*s){1}[^s]*)
You would change the quantifier on the group to match the number of occurrences of that character which are required.

Using flex to identify variable name without repeating characters

I'm not fully sure how to word my question, so sorry for the rough title.
I am trying to create a pattern that can identify variable names with the following restraints:
Must begin with a letter
First letter may be followed by any combination of letters, numbers, and hyphens
First letter may be followed with nothing
The variable name must not be entirely X's ([xX]+ is a seperate identifier in this grammar)
So for example, these would all be valid:
Avariable123
Bee-keeper
Y
E-3
But the following would not be valid:
XXXX
X
3variable
5
I am able to meet the first three requirements with my current identifier, but I am really struggling to change it so that it doesn't pick up variables that are entirely the letter X.
Here is what I have so far: [a-z][a-z0-9\-]* {return (NAME);}
Can anyone suggest a way of editing this to avoid variables that are made up of just the letter X?
The easiest way to handle that sort of requirement is to have one pattern which matches the exceptional string and another pattern, which comes afterwards in the file, which matches all the strings:
[xX]+ { /* matches all-x tokens */ }
[[:alpha:]][[:alnum:]-]* { /* handle identifiers */ }
This works because lex (and almost all lex derivatives) select the first match if two patterns match the same longest token.
Of course, you need to know what you want to do with the exceptional symbol. If you just want to accept it as some token type, there's no problem; you just do that. If, on the other hand, the intention was to break it into subtokens, perhaps individual letters, then you'll have to use yyless(), and you might want to switch to a new lexing state in order to avoid repeatedly matching the same long sequence of Xs. But maybe that doesn't matter in your case.
See the flex manual for more details and examples.

Get a value from the string with regex

I have this for example:
<#445288012218368010>
And I want to get from between <# > symbols the value.
I tried so:
string.replace(/^(?:\<\#)(?:.*)(?:\>)$/gim, '');
But then I don't get any result. It will delete/remove the whole string.
I want only this part: 445288012218368010 (it will be dynamic, so yeah it will be not the same numbers).
Anyway it is for the discord chat bot and I know that there is other methods for check the mentioned names but I want to do that in regex because which I am trying to do can't go the common method.
So yeah how can I get the value from between those symbols?
I need this in node.js regex.
You can use String#match which will return regular expression matches for the string (in this case the RegExp would be <#(\d+)> (the parenthesis around the \d+ make \d+ become its own group). This way you can use <string>.match(/<#(\d+)>/) to get the regular expression results and <string>.match(/<#(\d+)>/)[1] to get the first group of the regex (in this case the number).
You regex matches but you use a non capturing group (?:.*) so you get the full match and replace that with an empty string. Note that you could omit the first and the third non capturing group and use <# and > instead.
You could match what is between the brackets using a capturing group ([^>]+) or (\d+) and use replace and refer the first capturing group $1 in the replacement.
console.log("<#445288012218368010>".replace(/^<#([^>]+)>$/gim, '$1'));

Find the maximal input string matching a regular expression

Given a regular expression re and an input string str, I want to find the maximal substring of str, which starts at the minimal position, which matches re.
Special case:
re = Regex("a+|[ax](bc)*"); str = "yyabcbcb"
matching re with str should return the matching string "abcbc" (and not "a", as PCRE does). I also have in mind, that the result is as I want, if the order of the alternations is changed.
The options I found were:
POSIX extended RE - probably outdated, used by egrep ...
RE2 by Google - open source RE2 - C++ - also C-wrapper available
From my point of view, there are two problems with your question.
First is that changing the order of alternations the results are supposed to change.
For each single 'a' in the string, it can either match 'a+' or "ax*".
So it is ambiguous for matching 'a' to alternations in your regular expression.
Second, for finding the maximal substring, it requires the matching pattern of the longest match. As far as I know, only RE2 has provided such a feature, as mentioned by #Cosinus.
So my recommendation is that separating "a+|ax*" into two regexes, finding the maximal substring in each of them, and then comparing the positions of both substrings.
As to find the longest match, you can also refer to a previous regex post description here. The main idea is to search for substrings starting from string position 0 to len(str) and to keep track of the length and position when matched substrings are found.
P.S. Some languages provide regex functions similar to "findall()". Be careful of using them since the returns may be non-overlapping matches. And non-overlapping matches do not necessarily contain the longest matching substring.

Adding space in a specific position in a string of uppercase and lowercase letters

Dear stackoverflow users,
Many people encounter situations in which they need to modify strings. I have seen many
posts related to string modification. But, I have not come across solutions I am looking
for. I believe my post would be useful for some other R users who will face similar
challenges. I would like to seek some help from R users who are familiar with string
modification.
I have been trying to modify a string like the following.
x <- "Marcus HELLNERJohan OLSSONAnders SOEDERGRENDaniel RICHARDSSON"
There are four individuals in this string. Family names are in capital letters.
Three out of four family names stay in chunks with first names (e.g., HELLNERJohan).
I want to separate family names and first names adding space (e.g., HELLNER Johan).
I think I need to state something like "Select sequences of uppercase letters, and
add space between the last and second last uppercase letters, if there are lowercase
letters following."
The following post is probably somewhat relevant, but I have not been successful in writing codes yet.
Splitting String based on letters case
Thank you very much for your generous support.
This works by finding and capturing two consecutive sub-patterns, the first consisting of one upper case letter (the end of a family name), and the next consisting of an upper then a lower-case letter (taken to indicate the start of a first name). Everywhere these two groups are found, they are captured and replaced by themselves with a space inserted between (the "\\1 \\2" in the call below).
x <- "Marcus HELLNERJohan OLSSONAnders SOEDERGRENDaniel RICHARDSSON"
gsub("([[:upper:]])([[:upper:]][[:lower:]])", "\\1 \\2", x)
# "Marcus HELLNER Johan OLSSON Anders SOEDERGREN Daniel RICHARDSSON"
If you want to separate the vector into a vector of names, this splits the string using a regular expression with zero-width lookbehind and lookahead assertions.
strsplit(x, split = "(?<=[[:upper:]])(?=[[:upper:]][[:lower:]])",
perl = TRUE)[[1]]
# [1] "Marcus HELLNER" "Johan OLSSON" "Anders SOEDERGREN"
# [4] "Daniel RICHARDSSON"

Resources