ANTLR4 - How to parse content between identical string values

I'm trying to write an ANTLR4 parser rule that matches the content between two arbitrary string values that are the same. So far I haven't found a way to do it.
For example, in the input below, I need a rule that extracts Hello and Bye. I'm not interested in extracting xyz, because its delimiters differ.
TEXT Hello TEXT
TEXT1 Bye TEXT1
TEXT5 xyz TEXT8
Since this is very similar to an XML element grammar, I tried the XML parser example from the ANTLR4 XML grammar, but it parses input like <ABC> ... </XYZ> without error, which is not what I want.
I also tried using semantic predicates, without much success.
Could anyone please give me a hint on how to match content that is embedded between identical strings?
Thank you!
Satheesh

I'm not sure how this works out performance-wise, because of the many checks the parser has to do, but you could try something like:
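// Java target assumed; the labels avoid ANTLR's built-in $start/$stop attributes.
token:
open = IDENTIFIER WORD* close = IDENTIFIER { $open.text.equals($close.text) }?
;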
The part between the curly braces is a validating semantic predicate. The lexer tokens are self-explanatory, I believe.
The more I think about it, the more it seems better to just tokenize the input and write your own parser that processes the tokens and acts accordingly. That depends, of course, on the complexity of the syntax.
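As a rough illustration of the "tokenize it and process the tokens yourself" idea, here is a minimal Python sketch. It assumes one delimited item per line and whitespace-separated tokens, neither of which is stated in the question; the function name is made up for this example.

```python
def extract_between_matching(text):
    """Return the content of every line whose first and last tokens are equal."""
    results = []
    for line in text.splitlines():
        tokens = line.split()
        # Need an opening token, a closing token, and at least one word in between.
        if len(tokens) >= 3 and tokens[0] == tokens[-1]:
            results.append(" ".join(tokens[1:-1]))
    return results


sample = "TEXT Hello TEXT\nTEXT1 Bye TEXT1\nTEXT5 xyz TEXT8"
print(extract_between_matching(sample))  # ['Hello', 'Bye']
```

The equality check here plays the same role as the semantic predicate in the grammar, but it runs once per line instead of inside the parser's prediction machinery.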

Related

Antlr Lexer and Parser for catching expressions within another expression

I need to extract pieces of text from a larger text. It is a very simple example, actually, but it gives me quite some pain.
Here is the sample text, it is an email template:
{!Account.Name}
Hi hi there {!Account.Id + 'cool'}.
Very interesting stuff - {!Contact.Description}
Now we get {!Contact.Description + Contact.Email__c}
So I need all the occurrences of text like Account.Name, but only those that are within the opening "{!" and closing "}" tags.
What is the simplest starting approach? Note that in the case of the last line, I need to get both occurrences, Contact.Description and Contact.Email__c.
Thanks a lot for any help!
I would just do a plain text search for {!...} blocks and parse their content with a simple expression parser. Don't try to come up with a parser that consumes the whole text and must be prepared to deal with any rubbish that appears outside the blocks (which could ultimately lead to security problems).
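A minimal Python sketch of that two-step approach follows. It first finds the {!...} blocks with a plain search and then pulls dotted names out of each block. The second step uses a regular expression as a stand-in for a real expression parser, and it assumes single-quoted string literals that never contain "}"; all names are hypothetical.

```python
import re

BLOCK = re.compile(r"\{!(.*?)\}")                          # text between "{!" and "}"
FIELD = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+")     # dotted names like Account.Name
STRING_LITERAL = re.compile(r"'[^']*'")                    # single-quoted literals


def merge_fields(template):
    """Return every dotted field reference found inside {!...} blocks, in order."""
    fields = []
    for block in BLOCK.findall(template):
        # Drop string literals so dotted text inside quotes is not reported.
        expression = STRING_LITERAL.sub("", block)
        fields.extend(FIELD.findall(expression))
    return fields


template = """{!Account.Name}
Hi hi there {!Account.Id + 'cool'}.
Very interesting stuff - {!Contact.Description}
Now we get {!Contact.Description + Contact.Email__c}"""
print(merge_fields(template))
```

Text outside the blocks is never inspected, which is the point of the suggestion above.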

Can JAPE match paragraph Annotation in LHS?

I'm working on a math word problem solver, and would like to pass whole problems to my GATE Embedded application using JAPE. I'm using GATE IDE to display the output, as well as run the pipeline of GATE components. Each problem will be in its own paragraph, and each document will have several problems on it.
Is there a way to match any paragraph using the JAPE left-hand side regex?
I see three options here (there may be more elegant solutions):
1) Use simple rule like:
Phase: find
Input: Token
Options: control = once
Rule: OneToken
(
{Token}
)
In the RHS you can get the document text and use a standard Java approach to split the plain text into paragraphs.
2) Use LHS (if you really want only LHS)
Rule: NewLine
(
({SpaceToken.string=="\n"}) |
({SpaceToken.string=="\r"}) |
({SpaceToken.string=="\n"}{SpaceToken.string=="\r"}) |
({SpaceToken.string=="\r"}{SpaceToken.string=="\n"})
):left
Build a NewLine annotation, then write a JAPE rule similar to 1) but with NewLine instead of Token. Take all NewLine annotations from outputAS and build your Paragraph annotations from them.
3) Sometimes the Original markups annotation set already contains suitable paragraph annotations. In that case you can use the Annotation Set Transfer PR to copy them into the Default annotation set.
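Options 1) and 2) both end with turning plain text into paragraph boundaries. The following is a minimal Python sketch of that step only, not GATE code. It assumes paragraphs are separated by blank lines (the question does not say how they are delimited) and returns character offsets, since offsets are what an annotation needs; the function name is hypothetical.

```python
def paragraph_spans(text):
    """Return (start, end) character offsets of each blank-line-separated paragraph."""
    spans = []
    start = None  # offset where the current paragraph began, or None between paragraphs
    end = 0
    offset = 0
    for line in text.splitlines(keepends=True):
        if line.strip():
            if start is None:
                start = offset
            # The paragraph ends just before this line's terminator.
            end = offset + len(line.rstrip("\r\n"))
        elif start is not None:
            spans.append((start, end))
            start = None
        offset += len(line)
    if start is not None:
        spans.append((start, end))
    return spans


doc = "A has 3 apples.\nHow many?\n\nB has 2 pears.\r\n"
for s, e in paragraph_spans(doc):
    print((s, e), repr(doc[s:e]))
```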
Why not just run the RegEx Sentence Splitter PR and use Split as the Input in your JAPE rules?

Groovy - characters loss with stream.getText

I have this Groovy script that I'm testing:
InputStream is = awsS3Stream.getObjectContent();
def lines = is.getText("UTF-8");
println "lines:"+lines;
Pattern pattern = ~/type\"\:\"[A-Z][a-z]*\"/;
Matcher matcher = pattern.matcher(lines);
...
I noticed that, depending on the size of the awsS3Stream object, the variable lines may not contain all of the text: the end of it is missing. I was hoping that using StringBuffer instead of String would solve the issue, but it did not. I hope someone knows a Groovy-based solution, as I'm not terribly familiar with Groovy. I much appreciate your time.
P.S. The issue I'm seeing is not related to the pattern. I don't need the pattern to see that the variable lines doesn't always have all of the data.
Are you trying to match alphabetic strings with just one initial uppercase letter? If not, the problem is with your regexp. To match camel case strings with any number of capital letters, use this:
Pattern pattern = ~/type\"\:\"[A-Za-z]*\"/;
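To make the difference between the two character classes concrete, here is a small Python sketch that runs both patterns over the same text. The sample data is invented for illustration and is not from the question.

```python
import re

# Original pattern: one uppercase letter followed only by lowercase letters.
one_capital = re.compile(r'type":"[A-Z][a-z]*"')
# Suggested pattern: any mix of uppercase and lowercase letters.
any_letters = re.compile(r'type":"[A-Za-z]*"')

data = '{"type":"Invoice"} {"type":"CreditNote"}'  # hypothetical sample

first = one_capital.findall(data)
second = any_letters.findall(data)
print(first)
print(second)
```

The first pattern skips "CreditNote" because of the second capital letter, while the second pattern matches both values.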
The issue was with the data going into s3, not how I retrieve it.

Parser skips lines

I want to write a simple parser for a subset of Jade, generating some XmlHtml for further processing.
The parser is quite simple, but as often with Parsec, a bit long. Since I don't know if I am allowed to make such long code posts, I have the full working example here.
I've dabbled with Parsec before, but rarely successfully. Right now, I don't quite understand why it seems to swallow following lines. For example, the jade input of
.foo.bar
| Foo
| Bar
| Baz
tested with parseTest tag txt, returns this:
Element {elementTag = "div", elementAttrs = [("class","foo bar")], elementChildren = [TextNode "Foo"]}
My parser seems to be able to deal with any kind of nesting, but never more than one line. What did I miss?
If Parsec cannot match the remaining input, it stops parsing at that point and simply ignores the rest, unless the top-level parser ends with eof, in which case the leftover input becomes a parse error. Here, the problem is that after having parsed a tag, you don't consume the whitespace at the beginning of the line before the next tag, so Parsec cannot parse the remaining input and bails. (There might also be other issues; I can't test the code right now.)
There are many ways of adding something that consumes the spaces. I am not familiar with Jade, so I cannot tell you which way is the "correct" one (I don't know how the indentation syntax works), but just adding whiteSpace somewhere at the end of tag should do it.
By the way, you should consider splitting up your parser into a Lexer and Parser. The Lexer produces a token stream like [Ident "bind", OpenParen, Ident "tag", Equals, StringLiteral "longname", ..., Indentation 1, ...] and the parser parses that token stream (Yes, Parsec can parse lists of anything). I think that it would make your job easier/less confusing.
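The lexer half of that split can be sketched in a few lines. The version below is in Python rather than Haskell, purely to show the shape of the token stream. The token names, the two-space indentation in the sample, and the handling of only ".class" and "| text" lines are all assumptions made for this example.

```python
def lex(source):
    """Turn Jade-like lines into a flat list of (kind, value) tokens."""
    tokens = []
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines carry no tokens
        body = line.lstrip(" ")
        # Record the indentation explicitly so the parser never sees raw whitespace.
        tokens.append(("Indentation", len(line) - len(body)))
        if body.startswith("|"):
            tokens.append(("Text", body[1:].strip()))
        else:
            tokens.extend(("Class", name) for name in body.split(".") if name)
    return tokens


for token in lex(".foo.bar\n  | Foo\n  | Bar\n"):
    print(token)
```

With whitespace reduced to Indentation tokens, the parser stage no longer has to remember to skip spaces after each tag, which is the mistake described above.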

Select substring between two characters in Scala

I'm getting a garbled JSON string from an HTTP request, so I'm looking for a temporary solution to select the JSON string only.
The request.params() returns this:
[{"insured_initials":"Tt","insured_surname":"Test"}=, _=1329793147757,
callback=jQuery1707229194729661704_1329793018352
I would like everything from the start of the '{' to the end of the '}'.
I found lots of examples of doing similar things with other languages, but the purpose of this is not to only solve the problem, but also to learn Scala. Will someone please show me how to select that {....} part?
Regexps should do the trick:
"\\{.*\\}".r.findFirstIn("your json string here")
As Jens said, a regular expression usually suffices for this. However, the syntax is a bit different:
"""\{.*\}""".r
creates an object of scala.util.matching.Regex, which provides the typical query methods you may want to do on a regular expression.
In your case, you are simply interested in the first occurrence in a sequence, which is done via findFirstIn:
scala> """\{.*\}""".r.findFirstIn("""[{"insured_initials":"Tt","insured_surname":"Test"}=, _=1329793147757,callback=jQuery1707229194729661704_1329793018352""")
res1: Option[String] = Some({"insured_initials":"Tt","insured_surname":"Test"})
Note that it returns an Option type, which you can easily use in a match to find out whether the regexp matched or not.
Edit: A final point to watch out for is that, by default, "." does not match line breaks, so if your JSON is not fully contained in one line, either eliminate the line breaks first or enable dot-all mode by prefixing the pattern with (?s).
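The greedy match and the line-break caveat can be seen side by side in the short sketch below. It is written in Python rather than Scala, only to illustrate the regex behaviour; re.DOTALL is Python's counterpart of the (?s) flag, and the function name is hypothetical.

```python
import re


def extract_braced(text):
    """Return everything from the first '{' to the last '}', or None if absent."""
    # Greedy ".*" runs to the last "}"; DOTALL lets "." cross line breaks.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    return match.group(0) if match else None


raw = ('[{"insured_initials":"Tt","insured_surname":"Test"}=, _=1329793147757,\n'
       'callback=jQuery1707229194729661704_1329793018352')
print(extract_braced(raw))

multiline = '{"a":\n1}'
print(re.search(r"\{.*\}", multiline))   # no match without DOTALL
print(extract_braced(multiline))
```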

Resources