Regular Expression for space separated words - python-3.x

I am trying to match email footers like
Thanks,
Regards
Thanks & Regards,
with the help of regular expression I am able to get first 2 cases,
({language_keyword_footer[0]}|{language_keyword_footer[1]}(.*?)(\S+(.*?)))
language keyword footer is the metadata that I have created to add more cases in future,
keywords = {
"en": {'header':["From:", "Subject:"],'footer':["Regards,", "Thanks,","Thanks & Regards,"]}}
The problem is when I use this approach it captures only Regards, and discards Thanks & Regards,
is there a way I can add it to the existing re and capture this space separated scenario as well, any help is appreciated

You need to sort the items by length in dedcending order as alternation patterns in an NFA regex engine follow "first matched, first served" principle (see the "Remember That The Regex Engine Is Eager" article). Also, do not forget to escape the strings used as literal pattern inside the regex.
You can use something like
pattern = r'({})(.*?)(\S+)'.format("|".join(map(re.escape, sorted(language_keyword_footer, key=len, reverse=True))))
The resulting pattern would look like (Thanks\ \&\ Regards,|Regards,|Thanks,)(.*?)(\S+) here, and will match Thanks & Regards,, Regards, or Thanks, (in Group 1), then any zero or more chars other than line break chars as few as possible (in Group 2), and then one or more non-whitespace chars (with (\S+)).
Note the .*? at the end of a regex pattern has no effect on the final result.

Related

Regex for specific permutations of a word

I am working on a wordle bot and I am trying to match words using regex. I am stuck at a problem where I need to look for specific permutations of a given word.
For example, if the word is "steal" these are all the permutations:
'tesla', 'stale', 'steal', 'taels', 'leats', 'setal', 'tales', 'slate', 'teals', 'stela', 'least', 'salet'.
I had some trouble creating a regex for this, but eventually stumbled on positive lookaheads which solved the issue. regex -
'(?=.*[s])(?=.*[l])(?=.*[a])(?=.*[t])(?=.*[e])'
But, if we are looking for specific permutations, how do we go about it?
For example words that look like 's[lt]a[lt]e'. The matching words are 'steal', 'stale', 'state'. But I want to limit the count of l and t in the matched word, which means the output should be 'steal' & 'stale'. 1 obvious solution is this regex r'slate|stale', but this is not a general solution. I am trying to arrive at a general solution for any scenario and the use of positive lookahead above seemed like a starting point. But I am unable to arrive at a solution.
Do we combine positive lookaheads with normal regex?
s(?=.*[lt])a(?=.*[lt])e (Did not work)
Or do we write nested lookaheads or something?
A few more regex that did not work -
s(?=.*[lt]a[tl]e)
s(?=.*[lt])(?=.*[a])(?=.*[lt])(?=.*[e])
I tried to look through the available posts on SO, but could not find anything that would help me understand this. Any help is appreciated.
You could append the regex which matches the permutations of interest to your existing regex. In your sample case, you would use:
(?=.*s)(?=.*l)(?=.*a)(?=.*t)(?=.*e)s[lt]a[lt]e
This will match only stale and slate; it won't match state because it fails the lookahead that requires an l in the word.
Note that you don't need the (?=.*s)(?=.*a)(?=.*e) in the above regex as they are required by the part that matches the permutations of interest. I've left them in to keep that part of the regex generic and not dependent on what follows it.
Demo on regex101
Note that to allow for duplicated characters you might want to change your lookaheads to something in this form:
(?=(?:[^s]*s){1}[^s]*)
You would change the quantifier on the group to match the number of occurrences of that character which are required.

How to use the split() method with some condition?

There is one condition where I have to split my string in the manner that all the alphabetic characters should stay as one unit and everything else should be separated like the example shown below.
Example:
Some_var='12/1/20 Balance Brought Forward 150,585.80'
output_var=['12/1/20','Balance Brought Forward','150,585.80']
Yes, you could use some regex to get over this.
Some_var = '12/1/20 Balance Brought Forward 150,585.80'
match = re.split(r"([0-9\s\\\/\.,-]+|[a-zA-Z\s\\\/\.,-]+)", Some_var)
print(match)
You will get some extra spaces but you can trim that and you are good to go.
split isn't gonna cut it. You might wanna look into Regular Expressions (abbreviated regex) to accomplish this.
Here's a link to the Python docs: re module
As for a pattern, you could try using something like this:
([0-9\s\\\/\.,-]+|[a-zA-Z\s\\\/\.,-]+)
then trim each part of the output.

BluePrism count number of times a character exists in a string

I have a calculation which will remove a blank space and replace with a full stop. This is correct for 90% of my cases. However, sometimes two blanks will appear in my value. For the second space I want to delete it. Is this possible?
I think it may be possible using a code stage, but I am not sure what the code would be.
My current calculation is Replace([Item Data.Name], " ", ".")
Example data John B Smith I want the result to be John.BSmith
For anything that'd like to do with the strings, there is a really powerful tool called Regular Expressions (regex). I encourage you to play with it, because it's a really powerful tool in the hands of RPA developer.
To replace the second space in any string with a "." you can use the following action.
Object: Utility - Strings
Action: Regex - Find and Replace
Input:
Regex Pattern: "(?<= .*) "
Text: "John B Smith"
Replacement: "."
The above action is not a standard Blueprism one, so it has to be added to your VBO. The action looks as follows:
The VB.net code for that action is as follows:
Dim R as New Regex(Regex_Pattern, RegexOptions.SingleLine)
Dim M as Match = R.Match(Text)
replacement_result = R.Replace(Text,Regex_Pattern,replacement_string)
There might be a need for some additional assemblies, so please see below a printscreen of references and namespaces used in my object:
I resolved this issue by using the Utility - Strings object and the split text action. I split my name by space. This outputted a collection which I was then able to loop through and add a full stop after the fist instance but then trim the other instances.
Please see screenshot
I think the simplest solution would be
Replace(Replace(Text," "," ")" ","."))
if you know that it will give one or two spaces
First replace the two white spaces to single and then again single white space to dot(.)

Lua Pattern for extracting/replacing value in / /

I have a string like hello /world today/
I need to replace /world today/ with /MY NEW STRING/
Reading the manual I have found
newString = string.match("hello /world today/","%b//")
which I can use with gsub to replace, but I wondered is there also an elegant way to return just the text between the /, I know I could just trim it, but I wondered if there was a pattern.
Try something like one of the following:
slashed_text = string.match("hello /world today/", "/([^/]*)/")
slashed_text = string.match("hello /world today/", "/(.-)/")
slashed_text = string.match("hello /world today/", "/(.*)/")
This works because string.match returns any captures from the pattern, or the entire matched text if there are no captures. The key then is to make sure that the pattern has the right amount of greediness, remembering that Lua patterns are not a complete regular expression language.
The first two should match the same texts. In the first, I've expressly required that the pattern match as many non-slashes as possible. The second (thanks lhf) matches the shortest span of any characters at all followed by a slash. The third is greedier, it matches the longest span of characters that can still be followed by a slash.
The %b// in the original question doesn't have any advantages over /.-/ since the the two delimiters are the same character.
Edit: Added a pattern suggested by lhf, and more explanations.

Delimiter to use within a query string value

I need to accept a list of file names in a query string. ie:
http://someSite/someApp/myUtil.ashx?files=file1.txt|file2.bmp|file3.doc
Do you have any recommendations on what delimiter to use?
Having query parameters multiple times is legal, and the only way to guarantee no parsing problems in all cases:
http://someSite/someApp/myUtil.ashx?file=file1.txt&file=file2.bmp&file=file3.doc
The semicolon ; must be URI encoded if part of a filename (turned to %3B), yet not if it is separating query parameters which is its reserved use.
See section 2.2 of this rfc:
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "#"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
If they're filenames, a good choice would be a character which is disallowed in filenames. Suggestions so far included , | & which are generally allowed in filenames and therefore might lead to ambiguities. / on the other hand is generally not allowed, not even on Windows. It is allowed in URIs, and it has no special meaning in query strings.
Example:
http://someSite/someApp/myUtil.ashx?files=file1.txt|file2.bmp|file3.doc is bad because it may refer to the valid file file1.txt|file2.bmp.
http://someSite/someApp/myUtil.ashx?files=file1.txt/file2.bmp/file3.doc unambiguously refers to 3 files.
I would recommend making each file its own query parameter, i.e.
myUtil.ashx?file1=file1.txt&file2=file2.bmp&file3=file3.doc
This way you can just use standard query parsing and loop
Do you need to list the filenames as a string?
Most languages accepts arrays in the querystring so you could write it like
http://someSite/someApp/myUtil.ashx?files[]=file1.txt&files[]=file2.bmp&files[]=file3.doc
If it doesn't, or you can't use for some other reason, you should stick to a delimiter that is either not allowed or unusual in a filename. Pipe (|) is a good one, otherwise you could urlencode an invisible character since they are quite easy to use in coding, but harder to actually include in a filename.
I usually use arrays when possible and pipe otherwise.
I've always used double pipes "||". I don't have any good evidence to back up why this is a good choice other than 10 years of web programming and it's never been an issue.
This is one common problem. How i handled it was: I created a method which accepted a list of strings, then found a character that was not in any of the strings. (I did this by a simple concatenation of the strings, then testing for various characters.) Once a character was found, concatenated all the strings together but also prepended the string with the separation character. So in the given question, one example wud be:
http://someSite/someApp/myUtil.ashx?files=|file1.txt|file2.bmp|file3.doc
and another wud be:
http://someSite/someApp/myUtil.ashx?files=,file1.txt,file2.bmp,file3.doc
But since i actually use a method that guarantees my separator character is not in the rest of the strings, it is safe. It was a bit of work to create the first time, but i've used it MANY times in various applications.
I think I would consider using commas or semicolons.
I would build on MSalters answer by saying, to generalize, the best delimiter is one that is invalid to the items in the list. For example, if your list is prices, a comma is a bad delimiter because it can be confused with the values. For that reason, as most these answers suggest, I think a good general purpose delimiter is probably "|" as it is rarely a valid value. "/" is maybe not the best delimiter generally as it is valid for paths sometimes.

Resources