Regex pattern is taking more than 4 digit number - python-3.x

import re
text = """State of California that the foregoing is true and correct. (For California sheriff or marshal use only) 1950-24-12 I certify that the foregoing is true and correct. Date: (SIGNATURE) SUBP-010 [Rev. January 1,2012] PROOF OF SERVICE OF DEPOSITION SUBPOENA FOR PRODUCTION OF BUSINESS RECORDS 055826-00-07 Page 2 of 2"""
pattern = re.findall("\d{2,4}[-]\d{1,2}[-]\d{1,2}",text)
print(pattern)
Required_output: 1950-24-12
The pattern also matches 5826-00-07, even though that is part of a number with more than 4 digits. Is there any way to exclude it?

What you want is called a negative lookbehind. It matches a pattern only when the text directly before the match does not match a given sequence. For example, (?<!something)abc will match any occurrence of "abc" that is not directly preceded by "something".
So in your case, you want to add (?<!\d) to the beginning of your regex so that it only matches when not preceded by a digit.
Also, [-] will only match the character - so you don't need the brackets. After this change, the new regex is (?<!\d)\d{2,4}-\d{1,2}-\d{1,2}.
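As a quick check, here is a minimal sketch that runs the corrected pattern against a shortened version of the text from the question (the variable names and the shortened sample are only illustrative):

```python
import re

# Shortened sample from the question: one valid date and one over-long number.
text = "use only) 1950-24-12 I certify ... BUSINESS RECORDS 055826-00-07 Page 2 of 2"

# (?<!\d) rejects any match that would start right after another digit,
# so the tail of 055826-00-07 can no longer be picked up.
pattern = re.compile(r"(?<!\d)\d{2,4}-\d{1,2}-\d{1,2}")
matches = pattern.findall(text)
print(matches)  # ['1950-24-12']
```

If trailing digits ever turn out to be a problem as well, a negative lookahead (?!\d) at the end of the pattern guards the other side in the same way.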

Related

Regex for specific permutations of a word

I am working on a wordle bot and I am trying to match words using regex. I am stuck at a problem where I need to look for specific permutations of a given word.
For example, if the word is "steal" these are all the permutations:
'tesla', 'stale', 'steal', 'taels', 'leats', 'setal', 'tales', 'slate', 'teals', 'stela', 'least', 'salet'.
I had some trouble creating a regex for this, but eventually stumbled on positive lookaheads which solved the issue. regex -
'(?=.*[s])(?=.*[l])(?=.*[a])(?=.*[t])(?=.*[e])'
But, if we are looking for specific permutations, how do we go about it?
For example, words that look like 's[lt]a[lt]e'. The matching words are 'slate', 'stale', 'state'. But I want to limit the count of l and t in the matched word, which means the output should be 'slate' & 'stale'. One obvious solution is the regex r'slate|stale', but this is not a general solution. I am trying to arrive at a general solution for any scenario, and the use of positive lookahead above seemed like a starting point. But I am unable to arrive at a solution.
Do we combine positive lookaheads with normal regex?
s(?=.*[lt])a(?=.*[lt])e (Did not work)
Or do we write nested lookaheads or something?
A few more regex that did not work -
s(?=.*[lt]a[tl]e)
s(?=.*[lt])(?=.*[a])(?=.*[lt])(?=.*[e])
I tried to look through the available posts on SO, but could not find anything that would help me understand this. Any help is appreciated.
You could append the regex which matches the permutations of interest to your existing regex. In your sample case, you would use:
(?=.*s)(?=.*l)(?=.*a)(?=.*t)(?=.*e)s[lt]a[lt]e
This will match only stale and slate; it won't match state because it fails the lookahead that requires an l in the word.
Note that you don't need the (?=.*s)(?=.*a)(?=.*e) in the above regex as they are required by the part that matches the permutations of interest. I've left them in to keep that part of the regex generic and not dependent on what follows it.
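A minimal Python sketch of the combined regex, assuming the candidate words are already available as a list (the list here is made up for illustration):

```python
import re

# The lookaheads require each letter somewhere in the word;
# the trailing s[lt]a[lt]e fixes the shape of the word.
pattern = re.compile(r"(?=.*s)(?=.*l)(?=.*a)(?=.*t)(?=.*e)s[lt]a[lt]e")

candidates = ["slate", "stale", "state", "steal"]  # hypothetical word list
hits = [w for w in candidates if pattern.fullmatch(w)]
print(hits)  # ['slate', 'stale']
```

Here 'state' is rejected by the lookahead for l, and 'steal' passes every lookahead but does not fit the s[lt]a[lt]e shape.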
Note that to allow for duplicated characters you might want to change your lookaheads to something in this form:
(?=(?:[^s]*s){1}[^s]*)
You would change the quantifier on the group to match the number of occurrences of that character which are required.
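One way this counting idea might look in Python is sketched below. The helper name exactly is hypothetical, and the $ inside the lookahead is an addition made here, not part of the form shown above: without it the lookahead only sets a lower bound on the number of occurrences, while with it the count has to be exact. The sketch also assumes ch is a plain letter that needs no escaping.

```python
import re

def exactly(ch: str, n: int) -> str:
    """Lookahead asserting that `ch` occurs exactly n times up to the end.

    Hypothetical helper; assumes `ch` is a plain letter.
    """
    return f"(?=(?:[^{ch}]*{ch}){{{n}}}[^{ch}]*$)"

# Exactly one l and exactly one t, in the shape s[lt]a[lt]e.
pattern = re.compile(exactly("l", 1) + exactly("t", 1) + "s[lt]a[lt]e")

candidates = ["slate", "stale", "state", "slale"]  # made-up inputs
hits = [w for w in candidates if pattern.fullmatch(w)]
print(hits)  # ['slate', 'stale']
```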

Delete certain numbers among other numbers in a list in Python

I want to delete special numbers from a list that contains also other numbers which should not be affected.
The list looks like this:
[1,
, 00:00:03,950, 00:00:06,840,
, effective, argumentation, central,
, 2,
, 00:00:06,840, 00:00:09,180,
, term, thinking, topic,
, 3,
, 00:00:09,180, 00:00:10,830,
, previously, section, course,
... and so on]
Now, I want to delete only the single numbers plus the comma afterwards (here: 1, 2, 3) but not the timestamps (or any part of the timestamps).
What should be considered, is that these numbers could theoretically increase to 10 or even more digits, there is no constraint.
For this task, I have tried the following regular expressions (among others):
result = re.sub(r"^\d{1,},$", "", data)
result = re.sub(r"^\d{1,},\n", "", data)
But nothing I can think of works for my task. Either a part of the timestamps is also deleted or the numbers in question are not deleted at all.
Can anyone please help?
Thank you very much!
Converting my comment to answer so that solution is easy to find for future visitors.
As you are dealing with multiline text, you cannot use the anchors ^ and $ to match at line boundaries without MULTILINE mode. However, for your case you can use a simple regex to remove all numbers followed by a comma:
re.sub(r"(^\W*|\s)\d+,\s*", r'\1', data)
RegEx Details:
(^\W*|\s): Matches the start position followed by 0+ non-word characters, or else a single whitespace character, and captures it in group #1
\d+,\s*: Match 1+ digits followed by a comma and 0 or more whitespaces
\1: Is replacement that puts back text captured in group #1
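A small runnable sketch of this substitution follows. It assumes that data is a single multiline string shaped like the list in the question; the exact whitespace of the real input is a guess.

```python
import re

# Assumed shape of the input: counters on their own lines, then timestamps, then words.
data = (
    "[1,\n"
    ", 00:00:03,950, 00:00:06,840,\n"
    ", effective, argumentation, central,\n"
    ", 2,\n"
    ", 00:00:06,840, 00:00:09,180,\n"
    ", term, thinking, topic]"
)

# A counter is a digit run that follows the start of the string or a whitespace
# character and is immediately followed by a comma. In a timestamp, the digits
# after whitespace are followed by ':' and the digits before a comma follow ':'
# or ',', so timestamps are left alone.
result = re.sub(r"(^\W*|\s)\d+,\s*", r"\1", data)
print(result)
```

The counters 1 and 2 disappear while every timestamp stays intact; some stray separators remain, because only the number, its comma and the following whitespace are removed.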

Find the maximal input string matching a regular expression

Given a regular expression re and an input string str, I want to find the maximal substring of str, which starts at the minimal position, which matches re.
Special case:
re = Regex("a+|[ax](bc)*"); str = "yyabcbcb"
matching re with str should return the matching string "abcbc" (and not "a", as PCRE does). I also have in mind, that the result is as I want, if the order of the alternations is changed.
The options I found were:
POSIX extended RE - probably outdated, used by egrep ...
RE2 by Google - open source RE2 - C++ - also C-wrapper available
From my point of view, there are two problems with your question.
The first is that, with an engine that tries alternatives in order, changing the order of the alternations changes the result.
Each single 'a' in the string can match either 'a+' or '[ax](bc)*'.
So it is ambiguous which alternation of your regular expression an 'a' should be matched against.
The second is that finding the maximal substring requires the engine to return the longest match. As far as I know, only RE2 provides such a feature, as mentioned by @Cosinus.
So my recommendation is to separate "a+|[ax](bc)*" into two regexes, find the maximal substring for each of them, and then compare the positions and lengths of both substrings.
To find the longest match, you can also refer to the description in a previous regex post. The main idea is to search for substrings starting at every string position from 0 to len(str) and to keep track of the length and position whenever a matching substring is found.
P.S. Some languages provide regex functions similar to "findall()". Be careful when using them, since they may return only non-overlapping matches, and non-overlapping matches do not necessarily contain the longest matching substring.
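The recommendation above can be sketched in Python roughly as follows. The function name leftmost_longest is hypothetical, and the sketch only compares the match that each top-level alternative produces on its own, so it approximates leftmost-longest behaviour rather than implementing full POSIX semantics.

```python
import re

def leftmost_longest(alternatives, s):
    """Return (start, text) for the longest match at the earliest position, or None.

    Hypothetical helper: each alternative is tried separately at every start
    position, and the longest of their matches wins.
    """
    compiled = [re.compile(a) for a in alternatives]
    for start in range(len(s) + 1):
        found = []
        for p in compiled:
            m = p.match(s, start)  # anchored at `start`, like a search from there
            if m is not None:
                found.append(m.group(0))
        if found:
            return start, max(found, key=len)
    return None

print(leftmost_longest(["a+", "[ax](bc)*"], "yyabcbcb"))  # (2, 'abcbc')
```

Because the alternatives are evaluated independently, the result does not depend on the order in which they are listed.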

XML schema restriction pattern for not allowing specific string

I need to write an XSD schema with a restriction on a field, to ensure that
the value of the field does not contain the substring FILENAME at any location.
For example, all of the following must be invalid:
FILENAME
ORIGINFILENAME
FILENAMETEST
123FILENAME456
None of these values should be valid.
In a regular expression language that supports negative lookahead, I could do this by writing /^((?!FILENAME).)*$ but the XSD pattern language does not support negative lookahead.
How can I implement an XSD pattern restriction with the same effect as /^((?!FILENAME).)*$ ?
I need to use pattern, because I don't have access to XSD 1.1 assertions, which are the other obvious possibility.
The question XSD restriction that negates a matching string covers a similar case, but in that case the forbidden string is forbidden only as a prefix, which makes checking the constraint easier. How can the solution there be extended to cover the case where we have to check all locations within the input string, and not just the beginning?
OK, the OP has persuaded me that while the other question mentioned has an overlapping topic, the fact that the forbidden string is forbidden at all locations, not just as a prefix, complicates things enough to require a separate answer, at least for the XSD 1.0 case. (I started to add this answer as an addendum to my answer to the other question, and it grew too large.)
There are two approaches one can use here.
First, in XSD 1.1, a simple assertion of the form
not(matches($v, 'FILENAME'))
ought to do the job.
Second, if one is forced to work with an XSD 1.0 processor, one needs a pattern that will match all and only strings that don't contain the forbidden substring (here 'FILENAME').
One way to do this is to ensure that the character 'F' never occurs in the input. That's too drastic, but it does do the job: strings not containing the first character of the forbidden string do not contain the forbidden string.
But what of strings that do contain an occurrence of 'F'? They are fine, as long as no 'F' is followed by the string 'ILENAME'.
Putting that last point more abstractly, we can say that any acceptable string (any string that doesn't contain the string 'FILENAME') can be divided into two parts:
a prefix which contains no occurrences of the character 'F'
zero or more occurrences of 'F' followed by a string that doesn't match 'ILENAME' and doesn't contain any 'F'.
The prefix is easy to match: [^F]*.
The strings that start with F but don't match 'FILENAME' are a bit more complicated; just as we don't want to outlaw all occurrences of 'F', we also don't want to outlaw 'FI', 'FIL', etc. -- but each occurrence of such a dangerous string must be followed either by the end of the string, or by a letter that doesn't match the next letter of the forbidden string, or by another 'F' which begins another region we need to test. So for each proper prefix of the forbidden string, we create a regular expression of the form
$prefix || '([^F' || next-character-in-forbidden-string || ']'
|| '[^F]*)?'
Then we join all of those regular expressions with or-bars.
The end result in this case is something like the following (I have inserted newlines here and there, to make it easier to read; before use, they will need to be taken back out):
[^F]*
((F([^FI][^F]*)?)
|(FI([^FL][^F]*)?)
|(FIL([^FE][^F]*)?)
|(FILE([^FN][^F]*)?)
|(FILEN([^FA][^F]*)?)
|(FILENA([^FM][^F]*)?)
|(FILENAM([^FE][^F]*)?))*
Two points to bear in mind:
XSD regular expressions are implicitly anchored; testing this with a non-anchored regular expression evaluator will not produce the correct results.
It may not be obvious at first why the alternatives in the choice all end with [^F]* instead of .*. Thinking about the string 'FEEFIFILENAME' may help. We have to check every occurrence of 'F' to make sure it's not followed by 'ILENAME'.
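The construction can also be generated mechanically. The following Python sketch builds the pattern for a given forbidden string and tests it with re.fullmatch, which stands in for the implicit anchoring of XSD patterns. The function name not_containing is hypothetical, and the sketch assumes that the first character of the forbidden string does not occur again later in it (true for 'FILENAME'); for other strings the construction would need more care.

```python
import re

def not_containing(forbidden: str) -> str:
    """Build a lookahead-free pattern for strings that do not contain `forbidden`.

    Hypothetical helper; assumes the first character does not recur in `forbidden`.
    """
    first = forbidden[0]
    if first in forbidden[1:]:
        raise ValueError("sketch assumes the first character does not recur")
    f = re.escape(first)
    alternatives = []
    # One alternative per proper prefix: the prefix, optionally followed by a
    # character that is neither the first character nor the next expected one,
    # and then any run of characters other than the first character.
    for i in range(1, len(forbidden)):
        prefix = re.escape(forbidden[:i])
        nxt = re.escape(forbidden[i])
        alternatives.append(f"({prefix}([^{f}{nxt}][^{f}]*)?)")
    if not alternatives:  # single-character forbidden string
        return f"[^{f}]*"
    return f"[^{f}]*(" + "|".join(alternatives) + ")*"

pattern = re.compile(not_containing("FILENAME"))
for value in ["FILENAME", "123FILENAME456", "FEEFIFILENAME", "FILE", "FEEFIFILENAM"]:
    print(value, bool(pattern.fullmatch(value)))
```

For 'FILENAME' the generated pattern is the one shown above with the newlines removed.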

Adding space in a specific position in a string of uppercase and lowercase letters

Dear stackoverflow users,
Many people encounter situations in which they need to modify strings. I have seen many
posts related to string modification. But, I have not come across solutions I am looking
for. I believe my post would be useful for some other R users who will face similar
challenges. I would like to seek some help from R users who are familiar with string
modification.
I have been trying to modify a string like the following.
x <- "Marcus HELLNERJohan OLSSONAnders SOEDERGRENDaniel RICHARDSSON"
There are four individuals in this string. Family names are in capital letters.
Three out of four family names stay in chunks with first names (e.g., HELLNERJohan).
I want to separate family names and first names adding space (e.g., HELLNER Johan).
I think I need to state something like "Select sequences of uppercase letters, and
add space between the last and second last uppercase letters, if there are lowercase
letters following."
The following post is probably somewhat relevant, but I have not been successful in writing codes yet.
Splitting String based on letters case
Thank you very much for your generous support.
This works by finding and capturing two consecutive sub-patterns, the first consisting of one upper case letter (the end of a family name), and the next consisting of an upper then a lower-case letter (taken to indicate the start of a first name). Everywhere these two groups are found, they are captured and replaced by themselves with a space inserted between (the "\\1 \\2" in the call below).
x <- "Marcus HELLNERJohan OLSSONAnders SOEDERGRENDaniel RICHARDSSON"
gsub("([[:upper:]])([[:upper:]][[:lower:]])", "\\1 \\2", x)
# "Marcus HELLNER Johan OLSSON Anders SOEDERGREN Daniel RICHARDSSON"
If you want to separate the vector into a vector of names, this splits the string using a regular expression with zero-width lookbehind and lookahead assertions.
strsplit(x, split = "(?<=[[:upper:]])(?=[[:upper:]][[:lower:]])",
perl = TRUE)[[1]]
# [1] "Marcus HELLNER" "Johan OLSSON" "Anders SOEDERGREN"
# [4] "Daniel RICHARDSSON"
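For readers working in Python rather than R, a rough equivalent of both calls might look like this. It is only a sketch: [A-Z] and [a-z] are used as an ASCII-only stand-in for [[:upper:]] and [[:lower:]], and splitting on a zero-width match requires Python 3.7 or later.

```python
import re

x = "Marcus HELLNERJohan OLSSONAnders SOEDERGRENDaniel RICHARDSSON"

# Insert a space between the last letter of a family name and the
# upper+lower pair that starts the next first name.
spaced = re.sub(r"([A-Z])([A-Z][a-z])", r"\1 \2", x)
print(spaced)

# Split at the same position with zero-width lookbehind and lookahead.
names = re.split(r"(?<=[A-Z])(?=[A-Z][a-z])", x)
print(names)
```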

Resources