Groovy - characters loss with stream.getText - groovy

I have this Groovy script that I'm testing:
InputStream is = awsS3Stream.getObjectContent();
def lines = is.getText("UTF-8");
println "lines:"+lines;
Pattern pattern = ~/type\"\:\"[A-Z][a-z]*\"/;
Matcher matcher = pattern.matcher(lines);
...
I noticed that depending on the size of the awsS3Stream object, variable lines may not have all of the text - the end of it is missing. I was hoping that using StringBuffer instead of String would solve the issue, but it did not. I hope someone may know a Groovy based solution to it as I'm not terribly familiar with Groovy... much appreciate your time.
P.S The issues I'm seeing is not related to the pattern - I don't need pattern there to see that the variable lines doesn't always have all of the data.

Are you trying to match alphabetic strings with just one initial uppercase letter? If not, the problem is with your regexp. To match camel case strings with any number of capital letters, use this:
Pattern pattern = ~/type\"\:\"[A-Za-z]*\"/;

The issue was with the data going into s3, not how I retrieve it.

Related

How to get a substring with Regex in Python

I am trying to formnulate a regex to get the ids from the below two strings examples:
/drugs/2/drug-19904-5106/magnesium-oxide-tablet/details
/drugs/2/drug-19906/magnesium-moxide-tablet/details
In the first case, I should get 19904-5106 and in the second case 19906.
So far I tried several, the closes I could get is [drugs/2/drug]-.*\d but would return g-19904-5106 and g-19907.
Please any help to get ride of the "g-"?
Thank you in advance.
When writing a regex expression, consider the patterns you see so that you can align it correctly. For example, if you know that your desired IDs always appear in something resembling ABCD-1234-5678 where 1234-5678 is the ID you want, then you can use that. If you also know that your IDs are always digits, then you can refine the search even more
For your example, using a regex string like
.+?-(\d+(?:-\d+)*)
should do the trick. In a python script that would look something like the following:
match = re.search(r'.+?-(\d+(?:-\d+)*)', my_string)
if match:
my_id = match.group(1)
The pattern may vary depending on the depth and complexity of your examples, but that works for both of the ones you provided
This is the closest I could find: \d+|.\d+-.\d+

Regex for specific permutations of a word

I am working on a wordle bot and I am trying to match words using regex. I am stuck at a problem where I need to look for specific permutations of a given word.
For example, if the word is "steal" these are all the permutations:
'tesla', 'stale', 'steal', 'taels', 'leats', 'setal', 'tales', 'slate', 'teals', 'stela', 'least', 'salet'.
I had some trouble creating a regex for this, but eventually stumbled on positive lookaheads which solved the issue. regex -
'(?=.*[s])(?=.*[l])(?=.*[a])(?=.*[t])(?=.*[e])'
But, if we are looking for specific permutations, how do we go about it?
For example words that look like 's[lt]a[lt]e'. The matching words are 'steal', 'stale', 'state'. But I want to limit the count of l and t in the matched word, which means the output should be 'steal' & 'stale'. 1 obvious solution is this regex r'slate|stale', but this is not a general solution. I am trying to arrive at a general solution for any scenario and the use of positive lookahead above seemed like a starting point. But I am unable to arrive at a solution.
Do we combine positive lookaheads with normal regex?
s(?=.*[lt])a(?=.*[lt])e (Did not work)
Or do we write nested lookaheads or something?
A few more regex that did not work -
s(?=.*[lt]a[tl]e)
s(?=.*[lt])(?=.*[a])(?=.*[lt])(?=.*[e])
I tried to look through the available posts on SO, but could not find anything that would help me understand this. Any help is appreciated.
You could append the regex which matches the permutations of interest to your existing regex. In your sample case, you would use:
(?=.*s)(?=.*l)(?=.*a)(?=.*t)(?=.*e)s[lt]a[lt]e
This will match only stale and slate; it won't match state because it fails the lookahead that requires an l in the word.
Note that you don't need the (?=.*s)(?=.*a)(?=.*e) in the above regex as they are required by the part that matches the permutations of interest. I've left them in to keep that part of the regex generic and not dependent on what follows it.
Demo on regex101
Note that to allow for duplicated characters you might want to change your lookaheads to something in this form:
(?=(?:[^s]*s){1}[^s]*)
You would change the quantifier on the group to match the number of occurrences of that character which are required.

ANTLR4 - How to parse content between same string values

I'm trying to write an antlr4 parser rule that can match the content between some arbitrary string values that are same. So far I couldn't find a method to do it.
For example, in the below input, I need a rule to extract Hello and Bye. I'm not interested in extracting xyz though.
TEXT Hello TEXT
TEXT1 Bye TEXT1
TEXT5 xyz TEXT8
As it is very much similar to an XML element grammar, I tried an example for XML Parser given in ANTLR4 XML Grammar, but it parses an input like <ABC> ... </XYZ> without error which is not what I wanted.
I also tried using semantic predicates without much success.
Could anyone please help with a hint on how to match content that is embedded between same strings?
Thank you!
Satheesh
Not sure how this works out performance wise, because of many many checks the parser has to do, but you could try something like:
token:
start = IDENTIFIER WORD* end = IDENTIFIER { start == end }?
;
The part between the curly braces is a validating semantic predicate. The lexer tokens are self-explanatory, I believe.
The more I think about it, it might be better you just tokenize the input and write an owner parser that processes the input and acts accordingly. Depends of course on the complexity of the syntax.

Select substring between two characters in Scala

I'm getting a garbled JSON string from a HTTP request, so I'm looking for a temp solution to select the JSON string only.
The request.params() returns this:
[{"insured_initials":"Tt","insured_surname":"Test"}=, _=1329793147757,
callback=jQuery1707229194729661704_1329793018352
I would like everything from the start of the '{' to the end of the '}'.
I found lots of examples of doing similar things with other languages, but the purpose of this is not to only solve the problem, but also to learn Scala. Will someone please show me how to select that {....} part?
Regexps should do the trick:
"\\{.*\\}".r.findFirstIn("your json string here")
As Jens said, a regular expression usually suffices for this. However, the syntax is a bit different:
"""\{.*\}""".r
creates an object of scala.util.matching.Regex, which provides the typical query methods you may want to do on a regular expression.
In your case, you are simply interested in the first occurrence in a sequence, which is done via findFirstIn:
scala> """\{.*\}""".r.findFirstIn("""[{"insured_initials":"Tt","insured_surname":"Test"}=, _=1329793147757,callback=jQuery1707229194729661704_1329793018352""")
res1: Option[String] = Some({"insured_initials":"Tt","insured_surname":"Test"})
Note that it returns on Option type, which you can easily use in a match to find out if the regexp was found successfully or not.
Edit: A final point to watch out for is that the regular expressions normally do not match over linebreaks, so if your JSON is not fully contained in the first line, you may want to think about eliminating the linebreaks first.

parsing a string that ends

I have a huge string. I need to extract a substring from that that huge string. The conditions are the string starts with either "TECHNICAL" or "JUSTIFY" or "ALIGN" and ends with a number( any number from 1 to 10) followed by period and then followed by space. so for example, I have
string x = "This is a test, again I am testing TECHNICAL: I need to extract this substring starting with testing. 8. This is test again and again and again and again.";
so I need this
TECHNICAL: I need to extract this substring starting with testing.
I was wondering if someone has elegant solution for that.
I was trying to use the regular expression, but I guess I could not figure out the right expresion.
any help will be appreciated.
Thanks in advance.
Try this: #"((?:TECHNICAL|JUSTIFY|ALIGN).*?)(?:[1-9]|10)\. "

Resources