Generate string data from regex - python-hypothesis

I would like to be able to take a regex and generate conforming data using the Python Hypothesis library. For example, given the regex
import re
regex = re.compile('[a-zA-Z]')
This would match any English alphabetic character. An example generator for this could be:
import hypothesis
import string
hypothesis.strategies.text(alphabet=string.ascii_letters)
But ideally I want to construct a string that will match any regex passed in.

There's a work-in-progress pull request for adding this feature. Nothing extant will let you do it easily, but looking at the PR might give you a good idea of how to translate any specific example you need.
Update: the from_regex strategy was added in Hypothesis 3.19.
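A minimal sketch of how the from_regex strategy might be used. It assumes a Hypothesis version recent enough to accept the fullmatch argument, which is newer than from_regex itself; the test function name is made up for illustration.

```python
import re

from hypothesis import given, strategies as st

LETTER = re.compile(r"[a-zA-Z]")


# fullmatch=True asks Hypothesis for strings that match the whole pattern;
# without it, the generated string only has to contain a match somewhere.
@given(st.from_regex(LETTER, fullmatch=True))
def check_generated_string_matches(s):
    assert LETTER.fullmatch(s) is not None


# Calling a @given-decorated function runs it against many generated inputs.
check_generated_string_matches()
```

The same call should accept any compiled pattern or pattern string, so the regex does not need to be translated into an alphabet by hand.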

Related

How to get a substring with Regex in Python

I am trying to formulate a regex to get the IDs from the two example strings below:
/drugs/2/drug-19904-5106/magnesium-oxide-tablet/details
/drugs/2/drug-19906/magnesium-moxide-tablet/details
In the first case, I should get 19904-5106 and in the second case 19906.
So far I have tried several. The closest I could get is [drugs/2/drug]-.*\d, but it returns g-19904-5106 and g-19906.
Any help getting rid of the "g-", please?
Thank you in advance.
When writing a regex, consider the patterns you see in your data so that you can target them precisely. For example, if you know that your desired IDs always appear in something resembling ABCD-1234-5678, where 1234-5678 is the ID you want, then you can use that. If you also know that your IDs are always digits, then you can refine the search even more.
For your example, using a regex string like
.+?-(\d+(?:-\d+)*)
should do the trick. In a python script that would look something like the following:
import re
match = re.search(r'.+?-(\d+(?:-\d+)*)', my_string)
if match:
    # group(1) is the captured ID, without the text before the hyphen
    my_id = match.group(1)
The pattern may vary depending on the depth and complexity of your examples, but that works for both of the ones you provided.
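As a quick check, here is the pattern above run against both strings from the question, followed by the original attempt to show where the stray "g-" comes from. The helper name extract_id is only for illustration.

```python
import re

ID_PATTERN = re.compile(r".+?-(\d+(?:-\d+)*)")

paths = [
    "/drugs/2/drug-19904-5106/magnesium-oxide-tablet/details",
    "/drugs/2/drug-19906/magnesium-moxide-tablet/details",
]


def extract_id(path):
    """Return the captured ID, or None when the pattern does not match."""
    match = ID_PATTERN.search(path)
    return match.group(1) if match else None


ids = [extract_id(p) for p in paths]
print(ids)  # ['19904-5106', '19906']

# The original attempt starts with a character class, so [drugs/2/drug]
# matches just one character (here the "g" right before the hyphen).
attempt = re.search(r"[drugs/2/drug]-.*\d", paths[0]).group(0)
print(attempt)  # g-19904-5106
```

Square brackets define a set of single characters, not a literal prefix, which is why the whole match begins with one letter from that set.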
This is the closest I could find: \d+|.\d+-.\d+

Regex for specific permutations of a word

I am working on a wordle bot and I am trying to match words using regex. I am stuck at a problem where I need to look for specific permutations of a given word.
For example, if the word is "steal" these are all the permutations:
'tesla', 'stale', 'steal', 'taels', 'leats', 'setal', 'tales', 'slate', 'teals', 'stela', 'least', 'salet'.
I had some trouble creating a regex for this, but eventually stumbled on positive lookaheads, which solved the issue. The regex:
'(?=.*[s])(?=.*[l])(?=.*[a])(?=.*[t])(?=.*[e])'
But, if we are looking for specific permutations, how do we go about it?
For example, words that look like 's[lt]a[lt]e'. The matching words are 'slate', 'stale', 'state'. But I want to limit the count of l and t in the matched word, which means the output should be 'slate' and 'stale'. One obvious solution is the regex r'slate|stale', but this is not a general solution. I am trying to arrive at a general solution for any scenario, and the positive lookaheads above seemed like a starting point, but I have been unable to get there.
Do we combine positive lookaheads with normal regex?
s(?=.*[lt])a(?=.*[lt])e (Did not work)
Or do we write nested lookaheads or something?
A few more regexes that did not work:
s(?=.*[lt]a[tl]e)
s(?=.*[lt])(?=.*[a])(?=.*[lt])(?=.*[e])
I tried to look through the available posts on SO, but could not find anything that would help me understand this. Any help is appreciated.
You could append the regex which matches the permutations of interest to your existing regex. In your sample case, you would use:
(?=.*s)(?=.*l)(?=.*a)(?=.*t)(?=.*e)s[lt]a[lt]e
This will match only stale and slate; it won't match state because it fails the lookahead that requires an l in the word.
Note that you don't need the (?=.*s)(?=.*a)(?=.*e) in the above regex as they are required by the part that matches the permutations of interest. I've left them in to keep that part of the regex generic and not dependent on what follows it.
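A short Python check of the combined pattern, using a hand-picked candidate list (the list is an assumption for illustration, not something from the question's word source):

```python
import re

# The lookaheads require each letter somewhere in the word;
# the trailing s[lt]a[lt]e fixes the positions.
PATTERN = re.compile(r"(?=.*s)(?=.*l)(?=.*a)(?=.*t)(?=.*e)s[lt]a[lt]e")

candidates = ["slate", "stale", "state", "steal"]
matches = [word for word in candidates if PATTERN.fullmatch(word)]
print(matches)  # ['slate', 'stale']
```

Here 'state' fails the lookahead for l, and 'steal' fails the positional part because its third letter is not a.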
Note that to allow for duplicated characters you might want to change your lookaheads to something in this form:
(?=(?:[^s]*s){1}[^s]*)
You would change the quantifier on the group to match the number of occurrences of that character which are required.
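One way to generate such counting lookaheads is sketched below. The helper count_lookahead and its exact flag are hypothetical additions: as written above, the lookahead enforces "at least n" occurrences, and the exact variant additionally anchors the trailing [^c]* to the end of the string with $.

```python
import re


def count_lookahead(letter, n, exact=False):
    """Build a lookahead requiring `letter` at least (or exactly) n times."""
    c = re.escape(letter)
    # With exact=True, only characters other than `letter` may follow the
    # n-th occurrence, all the way to the end of the string.
    tail = f"[^{c}]*$" if exact else ""
    return f"(?=(?:[^{c}]*{c}){{{n}}}{tail})"


words = ["slate", "stale", "state"]

two_t = re.compile(count_lookahead("t", 2) + "s[lt]a[lt]e")
one_t = re.compile(count_lookahead("t", 1, exact=True) + "s[lt]a[lt]e")

at_least_two_t = [w for w in words if two_t.fullmatch(w)]
exactly_one_t = [w for w in words if one_t.fullmatch(w)]
print(at_least_two_t)  # ['state']
print(exactly_one_t)   # ['slate', 'stale']
```

Concatenating one such lookahead per letter in front of the positional pattern gives a general recipe for limiting letter counts.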

Regex or IndexOf?

I have a long string "AB100123485;AB10064279293-IP-1-KNPO;AB473898487-MM41". I have to extract the integer value after "IP-", i.e. 1 (only). What is the most efficient way? I am using C#.
Thanks
The 'most-efficient' way depends on how consistent your string is in terms of length and appearance. You can surely do this with a regular expression as a quick solution if you just want to get the digit directly following IP-.
You can utilize the RegularExpressions API, passing in your regular expression and input string.
https://learn.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.match?view=netframework-4.8#System_Text_RegularExpressions_Regex_Match_System_String_System_String_
This pattern should get you started: IP-[0-9]. Refine it to your use case as needed.
For example:
using System.Text.RegularExpressions;
// Regex.Match is the static method; matched.Value is "IP-1" (prefix included)
Match matched = Regex.Match(
    "AB100123485;AB10064279293-IP-1-KNPO;AB473898487-MM41",
    "IP-[0-9]"
);
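To compare the two approaches from the title, here is a rough sketch in Python rather than C#, chosen only for brevity. A capture group returns just the digits, while the index-based version locates the marker and scans forward. In C# the analogous pieces would be Regex.Match with Groups[1] and String.IndexOf; the variable names here are made up.

```python
import re

text = "AB100123485;AB10064279293-IP-1-KNPO;AB473898487-MM41"

# Regex approach: the parentheses capture only the digits after "IP-".
match = re.search(r"IP-(\d+)", text)
via_regex = int(match.group(1)) if match else None

# Index approach: find the marker, then collect the digits that follow it.
marker = "IP-"
start = text.find(marker)
via_index = None
if start != -1:
    pos = start + len(marker)
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    if end > pos:
        via_index = int(text[pos:end])

print(via_regex, via_index)  # 1 1
```

Which one is faster depends on the input and the runtime, so it is worth measuring on real data before choosing.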

LevenshteinSim() Approximate string matching

I am using levenshteinSim() to do approximate string matching, and I am facing a problem. Here is what my data looks like:
string = "Mitchell"
stringvector = c("Ray Mitchell", "Mitchell Dough","Juila Mitch")
I want the algorithm to match only the second part of each element of stringvector, not the first half. How do I do it? I would really appreciate your help. Also, how do I use a weighting scheme?
Thanks
Kothavari
I believe you will need to preprocess the data to pull out just the second part of each string and run the algorithm on that.
Other people seem to do some preprocessing first.
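The preprocessing idea can be sketched as follows, in Python rather than R. Two assumptions are baked in: every name has at least two whitespace-separated parts, and the similarity is defined as 1 minus the edit distance divided by the length of the longer string. Both functions are hand-written stand-ins, not the R function itself.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed definition: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


target = "Mitchell"
names = ["Ray Mitchell", "Mitchell Dough", "Juila Mitch"]

# Preprocess: keep only the second whitespace-separated part of each name.
second_parts = [name.split()[1] for name in names]
scores = {part: similarity(target, part) for part in second_parts}
print(scores)
```

With this preprocessing, "Ray Mitchell" scores a perfect match on its second part, while "Mitchell Dough" is compared on "Dough" only.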

Select substring between two characters in Scala

I'm getting a garbled JSON string from an HTTP request, so I'm looking for a temporary solution to select the JSON string only.
The request.params() returns this:
[{"insured_initials":"Tt","insured_surname":"Test"}=, _=1329793147757,
callback=jQuery1707229194729661704_1329793018352
I would like everything from the start of the '{' to the end of the '}'.
I found lots of examples of doing similar things with other languages, but the purpose of this is not to only solve the problem, but also to learn Scala. Will someone please show me how to select that {....} part?
Regexps should do the trick:
"\\{.*\\}".r.findFirstIn("your json string here")
As Jens said, a regular expression usually suffices for this. However, the syntax is a bit different:
"""\{.*\}""".r
creates an object of scala.util.matching.Regex, which provides the typical query methods you may want to do on a regular expression.
In your case, you are simply interested in the first occurrence in a sequence, which is done via findFirstIn:
scala> """\{.*\}""".r.findFirstIn("""[{"insured_initials":"Tt","insured_surname":"Test"}=, _=1329793147757,callback=jQuery1707229194729661704_1329793018352""")
res1: Option[String] = Some({"insured_initials":"Tt","insured_surname":"Test"})
Note that it returns an Option type, which you can easily use in a match to find out whether the regexp was found successfully or not.
Edit: A final point to watch out for is that the regular expressions normally do not match over linebreaks, so if your JSON is not fully contained in the first line, you may want to think about eliminating the linebreaks first.
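The line-break caveat can be illustrated with a small Python sketch (Python is used here instead of Scala, and the input is the question's string with an artificial line break added inside the braces). Instead of stripping line breaks, a "dot matches newline" flag does the job; in Java-style regexes, which Scala uses, the inline flag (?s) plays the same role.

```python
import json
import re

# The garbled parameter string from the question, split across lines.
raw = ('[{"insured_initials":"Tt",\n'
       '"insured_surname":"Test"}=, _=1329793147757,\n'
       'callback=jQuery1707229194729661704_1329793018352')

# Without DOTALL, "." stops at the line break and the search finds nothing.
without_dotall = re.search(r"\{.*\}", raw)
print(without_dotall)  # None

# re.DOTALL lets "." match newlines, so the braces can span several lines.
match = re.search(r"\{.*\}", raw, re.DOTALL)
payload = json.loads(match.group(0)) if match else None
print(payload)  # {'insured_initials': 'Tt', 'insured_surname': 'Test'}
```

Note that the greedy .* runs to the last closing brace in the input, which is fine here because the string contains only one JSON object.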
