I'm working on a math word problem solver and would like to pass whole problems to my GATE Embedded application using JAPE. I'm using the GATE IDE to display the output, as well as to run the pipeline of GATE components. Each problem will be in its own paragraph, and each document will contain several problems.
Is there a way to match any paragraph using the JAPE left-hand side regex?
I see three options here (there may be more elegant solutions):
1) Use a simple rule like:
Phase: find
Input: Token
Options: control = once
Rule:OneToken
(
{Token}
)
In the RHS you can get the document text and use a standard Java approach to split plain text into paragraphs.
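The RHS itself is Java, but the splitting logic is simple enough to sketch in any language. A minimal illustration in Python, splitting on blank lines and keeping the character offsets you would need for the Paragraph annotations:

# Sketch only: a paragraph is a maximal run of non-blank lines.
# In a JAPE RHS you would apply the same logic to the document's
# content string and create Paragraph annotations from the offsets.
import re

text = "Problem one text...\n\nProblem two text...\n\nProblem three."

for m in re.finditer(r'[^\n]+(?:\n[^\n]+)*', text):
    print(m.start(), m.end(), repr(m.group()))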
2) Use the LHS (if you really want only the LHS):
Rule: NewLine
(
({SpaceToken.string=="\n"}) |
({SpaceToken.string=="\r"}) |
({SpaceToken.string=="\n"}{SpaceToken.string=="\r"}) |
({SpaceToken.string=="\r"}{SpaceToken.string=="\n"})
):left
Build a NewLine annotation, then write a JAPE rule similar to 1), but with NewLine instead of Token. Take all NewLine annotations from the outputAS and build your Paragraph annotations.
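The final step, sketched in Python (this is not the GATE API; in the RHS you would read start and end offsets from the NewLine annotations and call outputAS.add for each paragraph span):

# Sketch, not GATE code: given sorted (start, end) offsets of NewLine
# annotations, derive the (start, end) spans for Paragraph annotations.
def paragraph_spans(doc_length, newlines):
    spans, prev_end = [], 0
    for nl_start, nl_end in newlines:
        if nl_start > prev_end:        # non-empty stretch before the break
            spans.append((prev_end, nl_start))
        prev_end = nl_end
    if prev_end < doc_length:          # trailing paragraph with no newline
        spans.append((prev_end, doc_length))
    return spans

print(paragraph_spans(45, [(20, 21)]))  # [(0, 20), (21, 45)]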
3) Sometimes the right paragraphs are already present in the Original markups annotation set. In this case you could use the Annotation Set Transfer PR to get them into the Default Annotation Set.
Why not just use the RegEx Sentence Splitter PR and use Split as the input in your JAPE rules?
Related
I've got a file format that looks a little like this:
blockA {
    uniqueName42 -> uniqueName aWord1 anotherWord "Some text"
    anotherUniqueName -> uniqueName23 aWord2
    blockB {
        thing -> anotherThing
    }
}
Lots more blocks with arbitrary nesting levels.
The lines with the arrow in them define relationships between two things. Each relationship has some optional metadata (multi-word quoted or single word unquoted).
The challenge I'm having is that, because there can be an arbitrary number of metadata items in a relationship, my parser is treating anotherUniqueName as a metadata item of the first relationship rather than as the start of the second relationship.
The parser is only recognising one relationshipDeclaration where a second should start at the StringLiteral anotherUniqueName.
The parser looks a bit like this:
block
: BLOCK LBRACE relationshipDeclaration* RBRACE
;
relationshipDeclaration
: StringLiteral? ARROW StringLiteral StringLiteral*
;
I'm hoping to avoid lexical modes because the fact that these relationships can appear almost anywhere in the file will leave me up to my eyes in NL+ :-(
Would appreciate any ideas on what options I have. Is there a way to look ahead, spot the '->', for example?
Thanks a million.
Your example certainly looks like the NL is what signals the end of a relationshipDeclaration.
If that's the case, then you'll need NLs to be tokens available to your parser rules so the parser can recognize the end.
As you've alluded to, you could potentially use -> to trigger a different Lexer Mode and generate different tokens for content between the -> and the NL and then use those tokens in your parse rule for relationshipDeclaration.
If it's as simple as your snippet indicates, then just capturing RD_StringLiteral tokens in that lexical mode would probably be easier to deal with than handling all the places you might need to allow for NL. This would be pretty simple as lexer modes go.
(BTW you can use x+ to get the same effect as x x*)
relationshipDeclaration
: StringLiteral? ARROW RD_StringLiteral+
;
I don't think there's a third option for dealing with this.
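If you want to prototype the idea before committing to lexer modes, the mode switch can be sketched outside ANTLR. A hypothetical hand-rolled tokenizer in Python (the token names are made up, not ANTLR's): seeing -> flips a mode that lasts until the end of the line, so everything after the arrow comes out as metadata tokens.

# Sketch of the mode-switch idea outside ANTLR: '->' switches the
# tokenizer into a relationship mode that ends at the newline.
import re

TOKEN_RE = re.compile(r'"[^"]*"|->|\{|\}|\S+')

def tokenize(line):
    in_relationship = False            # the "mode", flipped by '->'
    for raw in TOKEN_RE.findall(line):
        if raw == '->':
            in_relationship = True
            yield ('ARROW', raw)
        elif in_relationship:
            yield ('RD_STRING', raw)   # mode-specific metadata token
        else:
            yield ('STRING', raw)
    if in_relationship:                # the newline ends the declaration
        yield ('NL', '\n')

for tok in tokenize('uniqueName42 -> uniqueName aWord1 "Some text"'):
    print(tok)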
I am trying to parse a config, translating it to a structured form. This new form requires that comments within the original config be preserved. The parsing tool is PLY. I am running into an issue with my current approach, which I will describe in detail below, with links to code as well. The config file is going to contain multiple config blocks, each of which will be of the following format:
<optional comments>
start_of_line request_stmts(one or more)
indent reply_stmts (zero or more)
include_stmts (type 3)(zero or more)
An example config file looks like this.
While I am able to partially parse the config file with the grammar below, I fail to accommodate comments that exist within a block.
For example, a block like the following raises syntax errors, and any comments inside a config block fail to parse:
<optional comments>
start_of_line request_stmts(type 1)(one or more)
indent reply_stmts (type 2)(one or more)
<comments>
include_stmts (type 3)(one or more)(optional)
The parser.out mentions one shift/reduce conflict, which I think arises because, once the reply_stmts are parsed, a comments section that follows could mark the start of a new block or be comments within the sub-block. The current grammar's parsing result for the example file:
[['# test comment ', '# more of this', '# does this make sense'],
 'DEFAULT',
 [['x', '=', 'y']],
 [['y', '=', '1']],
 ['# Transmode', '# maybe something else', '# comment'],
 '/random/location/test.user']
As you might notice, the second config block completely misses the username, request_stmt, and reply_stmt sections.
What I have tried
I have tried moving the comments section around in the grammar, specifying it before specific blocks or in the statement grammar. In the code link pasted above, the comments section is specified in the overall statement grammar. Both approaches fail to parse comments within a config block.
username : comments username
| username
include_stmt : comments includes
| includes
I have two main questions:
Is there a mistake I am making in the implementation or my understanding of LR parsing, which, once solved, would let me achieve what I want?
Is there a better way to achieve the same goal than my current approach (PLY-fu, a different parser, a different grammar)?
P.S. I wasn't able to include the actual code in the question; it is mentioned in the comments.
You are correct that the problem is that when the parser sees a comment, it cannot know whether the comment belongs to the same section or whether the previous section is finished. In the former case, the parser needs to shift the comment, while in the latter case it needs to reduce the configuration section.
Since there could be any number of comments, the necessary lookahead could be arbitrarily large, in which case LR parsing wouldn't be possible. But a simple trick can reduce the lookahead to two tokens: just combine consecutive comments into a single token.
Any LR(k) grammar has an equivalent LR(1) grammar. In effect, the LR(1) grammar works by delaying all decisions for k-1 tokens, accumulating those tokens into the parser state. That's a massive increase in grammar size, but it's usually possible to achieve the same effect in other ways, and that's certainly the case here.
The basic idea is that any comment is (temporarily) accumulated into a list of comments. When a non-comment token is encountered, this temporary list is attached to that token.
This can be done either in the lexical scanner or in the parser actions, depending on your inclinations.
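As a concrete illustration, here is a minimal PLY lexer sketch of the accumulation trick (the token names and regexes are placeholders, not your actual grammar):

# Comments are buffered on the lexer object and attached to the next
# real token, so the parser only ever sees non-comment tokens.
import ply.lex as lex

tokens = ('WORD', 'COMMENT')

t_ignore = ' \t\n'

def t_COMMENT(t):
    r'\#[^\n]*'
    t.lexer.pending_comments.append(t.value)
    # no return value: the token is discarded, not passed to the parser

def t_WORD(t):
    r'[^\s\#]+'
    t.comments = t.lexer.pending_comments   # attach buffered comments
    t.lexer.pending_comments = []
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.pending_comments = []

lexer.input("# a comment\n# another\nvalue")
tok = lexer.token()
print(tok.value, tok.comments)   # value ['# a comment', '# another']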
Before attempting all that, you should make sure that retaining comments is really useful to your application. Comments are normally not relevant to the semantics of a program (or configuration file), and it would certainly be much simpler for the lexer to just drop comments into the bit-bucket. If your application will end up reformatting the input, then it will have to retain comments. But if it only needs to extract information from the configuration, putting a lot of effort into handling comments is hard to justify.
I need to get pieces of text out of a text. A very simple example actually, but it gives me quite some pain.
Here is the sample text, it is an email template:
{!Account.Name}
Hi hi there {!Account.Id + 'cool'}.
Very interesting stuff - {!Contact.Description}
Now we get {!Contact.Description + Contact.Email__c}
So I need all the occurrences of text like Account.Name, but only those which are within the opening "{!" and closing "}" tags.
What is the simplest approach to start with? Note that in the case of the last line, I need to get two occurrences: Contact.Description and Contact.Email__c.
Thanks a lot for any help!
I would just do a plain-text search for {...} blocks and parse their content with a simple expression parser. Don't try to come up with a parser that consumes all the text and must be prepared to deal with any rubbish that can come in outside of the blocks (which could ultimately lead to security problems).
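A minimal sketch of that two-step approach in Python, using the sample template (the dotted-name regex is an assumption about what counts as a field reference):

import re

template = """{!Account.Name}
Hi hi there {!Account.Id + 'cool'}.
Very interesting stuff - {!Contact.Description}
Now we get {!Contact.Description + Contact.Email__c}"""

fields = []
for block in re.findall(r'\{!(.*?)\}', template):
    # inside a block, keep only dotted names; quoted literals like
    # 'cool' and operators are skipped
    fields += re.findall(r'\b\w+(?:\.\w+)+\b', block)

print(fields)
# ['Account.Name', 'Account.Id', 'Contact.Description',
#  'Contact.Description', 'Contact.Email__c']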
I am facing an issue parsing the pattern below.
The log file will have the log importance in the form of == or <= or >= or << or >>.
I am trying the custom pattern below. Some of the log messages may not have this pattern, so I am using *:
(?<Importance>(=<>)*)
But the log messages are not parsing and give 'grokparsefailure'.
Kindly check and suggest if the above pattern is wrong. Thanks much.
The pattern below is working fine:
(?<Importance>[=<>]*)
The one which I used earlier, and was erroring, is:
(?<Importance>(=<>)*)
One thing to note: there is a better way to handle the "some do, some don't" aspect of your log data.
(?<Importance>(=<>)*)
That will match more than you want. To get the sense of 'sometimes':
((?<Importance>(=<>)*)|^)
This says, match these three characters and define the field Importance, or leave the field unset.
Second, you're matching specifically two characters, in combinations:
((?<Importance>(<|>|=){2})|^)
This should match two instances of any of the trio of characters you're looking for.
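To see the group-versus-character-class difference concretely, here it is checked in Python (Python spells named groups (?P<name>...), while grok's Oniguruma regexes accept (?<name>...); the repetition semantics are the same):

import re

# (=<>)* repeats the literal three-character sequence "=<>",
# while [=<>]{2} matches any two of those characters.
for sample in ['== message body', '<= message body', 'no marker']:
    group = re.match(r'(?P<Importance>(=<>)*)', sample)
    klass = re.match(r'(?P<Importance>[=<>]{2})', sample)
    print(sample, '->',
          repr(group.group('Importance')),
          repr(klass.group('Importance')) if klass else None)

# == message body -> '' '=='
# <= message body -> '' '<='
# no marker -> '' None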
I'm trying to write an ANTLR4 parser rule that can match the content between arbitrary string values that are the same. So far I haven't found a way to do it.
For example, in the input below, I need a rule to extract Hello and Bye. I'm not interested in extracting xyz, though.
TEXT Hello TEXT
TEXT1 Bye TEXT1
TEXT5 xyz TEXT8
As it is very similar to an XML element grammar, I tried the XML parser example given in the ANTLR4 XML Grammar, but it parses an input like <ABC> ... </XYZ> without error, which is not what I want.
I also tried using semantic predicates without much success.
Could anyone please help with a hint on how to match content that is embedded between the same strings?
Thank you!
Satheesh
Not sure how this works out performance-wise, because of the many checks the parser has to do, but you could try something like:
token
    : open=IDENTIFIER WORD* close=IDENTIFIER { $open.text.equals($close.text) }?
    ;
The part between the curly braces is a validating semantic predicate (shown for the Java target; the labels are named open and close because start can clash with ANTLR's predefined $start attribute). The lexer tokens are self-explanatory, I believe.
The more I think about it, the more it might be better to just tokenize the input and write your own parser that processes the tokens and acts accordingly. It depends, of course, on the complexity of the syntax.
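For input as regular as the example, the hand-rolled version can be very small. A sketch in Python, assuming one declaration per line as in the sample:

# Treat the first word as the delimiter and capture the content only
# when the same word closes the line.
def extract(line):
    words = line.split()
    if len(words) >= 3 and words[0] == words[-1]:
        return ' '.join(words[1:-1])
    return None

for line in ['TEXT Hello TEXT', 'TEXT1 Bye TEXT1', 'TEXT5 xyz TEXT8']:
    print(line, '->', extract(line))
# TEXT Hello TEXT -> Hello
# TEXT1 Bye TEXT1 -> Bye
# TEXT5 xyz TEXT8 -> None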