GATE JAPE rule priorities not respected - nlp

I have the following text:
1 hwb wert: 330 kWh
In the first step, the following mapping takes place:
330 kWh is mapped as: Lookup.majorType = "unit"
hwb wert is mapped as: Lookup.majorType = "keyword"
The JAPE Rules:
Phase: composedUnits
Input: Token Lookup
Options: control=appelt debug=true
Rule: TableRow
Priority:10
(
({Lookup.majorType == "keyword"})
({Token.kind == punctuation})[0,4]
({Lookup.majorType == "unit"})
)
Rule: ReversedTableRow
Priority: -2
(
({Token.kind == number})
({Lookup.majorType == "keyword"})
)
I can't understand why the ReversedTableRow-Rule is matched and not the TableRow.

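Appelt priorities only decide between rules that match the same region of text. Before priority is considered, the match that starts earlier wins, and among matches starting at the same point the longer one wins. In your text, ReversedTableRow starts matching at the number 1, one token before TableRow can start, so it fires first. Text consumed by a rule that has fired cannot be matched by another rule in the same phase, so TableRow never gets to match the keyword.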
From the documentation:
With the appelt style, only one rule can be fired for the same region
of text, according to a set of priority rules. Priority operates in
the following way.
From all the rules that match a region of the document starting at
some point X, the one which matches the longest region is fired.
If more than one rule matches the same region, the one with the
highest priority is fired.
If there is more than one rule with the same priority, the one
defined earlier in the grammar is fired.
...
Note also that depending on the control style, firing a rule may
‘consume’ that part of the text, making it unavailable to be matched
by other rules. This can be a problem for example if one rule uses
context to make it more specific, and that context is then missed by
later rules, having been consumed due to use of for example the
‘Brill’ control style.
The TableRow rule can win as the longer match with the following modification. Note that I added the :tableRow label, which does not include the optional leading number token.
(
({Token.kind == number})?
(
({Lookup.majorType == "keyword"})
({Token.kind == punctuation})[0,4]
({Lookup.majorType == "unit"})
):tableRow
)
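The selection order quoted above can be traced on the example with a small simulation. This is a minimal Python sketch, not GATE code: the Match tuple, the appelt_select function and the offsets are hypothetical, and the text is treated as four units (1 | hwb wert | : | 330 kWh).

```python
from typing import NamedTuple

class Match(NamedTuple):
    rule: str
    start: int      # index of the first unit covered
    end: int        # index one past the last unit covered
    priority: int
    order: int      # position of the rule in the grammar file

def appelt_select(matches, length):
    """Fire rules left to right: longest match, then priority, then grammar order."""
    fired = []
    pos = 0
    while pos < length:
        here = [m for m in matches if m.start == pos]
        if not here:
            pos += 1          # nothing starts here, move on
            continue
        best = min(here, key=lambda m: (-(m.end - m.start), -m.priority, m.order))
        fired.append(best.rule)
        pos = best.end        # the matched text is consumed
    return fired

# Original grammar: TableRow can only start at "hwb wert" (unit 1).
original = [Match("TableRow", 1, 4, 10, 0), Match("ReversedTableRow", 0, 2, -2, 1)]
# Modified grammar: the optional number lets TableRow start at unit 0.
modified = [Match("TableRow", 0, 4, 10, 0), Match("ReversedTableRow", 0, 2, -2, 1)]

print(appelt_select(original, 4))  # ['ReversedTableRow']
print(appelt_select(modified, 4))  # ['TableRow']
```

In the original grammar the priority of 10 is never consulted, because only ReversedTableRow matches at the starting position.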

Related

How to print conditional fields in PPFA code

How do I print a conditional field using PPFA code? When a value is an 'X', I'd like to print it. However, if the 'X' is not present, I'd like to print an image instead. Here is my code:
LAYOUT C'mylayout' BODY
POSITION .25 in ABSOLUTE .25 in
FONT TIMES
OVERLAY MYTEMPOVER 8.5 in 11.0 in;
FIELD START 1 LENGTH 60
POSITION 2.0 in 1.6 in;
The FIELD START 1 LENGTH 60 line prints the given text at that location. But based on the value, I want to print either the given text or an image. How would I do that?
Here is an answer from the AFP-L list:
I would create two PAGEFORMATs, one with a LAYOUT for the text and one with a LAYOUT for the image. With CONDITION you can jump between the PAGEFORMATs (where the copy group is always NULL).
If you work in a z/OS environment, be careful of JES blank truncation.
In short, that means:
if there is an X in the data, the condition is true
if there is nothing in the data, the condition does not work and is never true (nothing happens)
In this case you must create a condition that is always true. I call it a dummy condition.
PPFA sample syntax:
CONDITION TEST start 1 length 1
when eq 'X' NULL PAGEFORMAT PRTTXT
when ge x'00' NULL PAGEFORMAT PRTIMAGE;
You must copy this CONDITION into both PAGEFORMATs after the LAYOUT command.
Blank truncation is a difficult problem on z/OS.
In this sample, the PAGEFORMAT named PRTTXT contains all the formatting and printing directives when the condition is true, and the other called PRTIMAGE contains every directive needed to print the image.
HTH

How to Create JAPE Grammars Automatically?

I am having great trouble with JAPE grammars. I have a small token dictionary of words that need to be matched against 5 types of document.
There is one dictionary per type. For example, for Job, the dictionary for a person's job would contain { "Engineer" , "Doctor", "Manager"}. I need to read this dictionary and create JAPE rules from it. This is my first try:
Phase: Jobtitle
Input: Lookup
Options: control = appelt debug = true
Rule: Jobs
(
{Lookup.majorType == "Doctor"}
(
{Lookup.majorType == "Engineer"}
)?
)
:jobs
-->
:jobs.JobTitle = {rule = "Jobs"}
Is there any way to automatically create JAPE rules that only search documents for the tokens in a dictionary?
Why not use a standard gazetteer, where the last parameter in the .def file can set a custom annotation type like "Doctor" or "Engineer"?
Something like: keywords.lst:Doctor:Doctor::Doctor
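If the dictionaries already exist in some structured form, the gazetteer files could be generated rather than written by hand. The following is a minimal Python sketch under assumptions: the function name build_gazetteer and the file names are hypothetical, and the .def line layout (list file, major type, minor type, language, annotation type) simply mirrors the example line above.

```python
def build_gazetteer(dictionaries):
    """Return {file name: file content} for one .lst file per type plus lists.def."""
    files = {}
    def_lines = []
    for type_name, words in dictionaries.items():
        list_name = f"{type_name.lower()}.lst"
        # One gazetteer entry per line, sorted for a stable result.
        files[list_name] = "\n".join(sorted(words)) + "\n"
        # list file : majorType : minorType : language (empty) : annotation type
        def_lines.append(f"{list_name}:{type_name}:{type_name}::{type_name}")
    files["lists.def"] = "\n".join(def_lines) + "\n"
    return files

files = build_gazetteer({"Job": {"Engineer", "Doctor", "Manager"}})
print(files["lists.def"], end="")   # job.lst:Job:Job::Job
print(files["job.lst"], end="")
```

Writing each entry of the returned mapping to disk would give a directory that a gazetteer can be pointed at, with no JAPE rule needed for plain dictionary lookups.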

How to Interpret NLTK Brill Tagger Rules

For the generated Brill Tagger Rule:
Rule('016', 'CS', 'QL', [(Word([1, 2, 3]),'as')])
I know:
'CS' is subordinating conjunction
'QL' is qualifier
I guess:
[(Word([1, 2, 3]),'as')] is the condition of the rule. It means that the word 'as' appears in the first, second or third position before the target word. The target word is the word that is going to be tagged with a POS tag.
I do not know:
What is the meaning for '016'?
How to interpret the rule as a whole?
The documentation for the rules is here.
016 would be the templateid, i.e. the template that was used to create the rule.
You can also get a description for the rule:
from nltk.tbl import Rule
from nltk.tag.brill import Word
q = Rule('016', 'CS', 'QL', [(Word([1, 2, 3]),'as')])
q.format('verbose')
'CS -> QL if the Word of words i+1...i+3 is "as"'
In this case it is actually the words that come after the target word. (Indicated by i+1...)
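To make the reading of the rule concrete, here is a minimal plain-Python sketch of what applying it to a tagged sentence amounts to. It does not use NLTK, the function apply_rule is hypothetical, and the sample sentence and its tags are made up for illustration.

```python
def apply_rule(tagged, from_tag, to_tag, offsets, word):
    """Retag from_tag as to_tag where `word` occurs at any of the relative offsets."""
    words = [w for w, _ in tagged]
    result = []
    for i, (w, tag) in enumerate(tagged):
        # Words at positions i+1 ... i+3 for offsets [1, 2, 3], clipped to the sentence.
        window = [words[i + k] for k in offsets if 0 <= i + k < len(words)]
        if tag == from_tag and word in window:
            tag = to_tag
        result.append((w, tag))
    return result

sentence = [("as", "CS"), ("good", "JJ"), ("as", "CS"), ("gold", "NN")]
retagged = apply_rule(sentence, "CS", "QL", [1, 2, 3], "as")
print(retagged)
# [('as', 'QL'), ('good', 'JJ'), ('as', 'CS'), ('gold', 'NN')]
```

Only the first 'as' is retagged, because another 'as' follows it within three words; the second one has no 'as' after it.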

JAPE rule Sentence contains multiple cases

How can I check whether a sentence contains certain combinations? For example, consider this sentence:
John appointed as new CEO for google.
I need to write a rule to check whether the sentence contains < 'new' + 'Jobtitle' >.
How can I achieve this? I tried the following. I need to check whether there is a 'new' before the job title word.
Rule: CustomRules
(
{
Sentence contains {Lookup.majorType == "organization"},
Sentence contains {Lookup.majorType == "jobtitle"},
Sentence contains {Lookup.majorType == "person_first"}
}
)
One way to handle this is to turn it around. Focus on the sequence you need and then get the covering Sentence:
(
{Token.string == "new"}
{Lookup.majorType == "jobtitle"}
):newJT
You should also check the edge case where a new Sentence starts right after "new", like this:
new
CEO
You can use something like this:
{Token ... }
{!Sentence, Lookup.majorType ...}
And then get the sentence (if you really need it) in the Java RHS:
long end = newJTAnnots.lastNode().getOffset();
long start = newJTAnnots.firstNode().getOffset();
AnnotationSet sentences = inputAS.get("Sentence", start, end);
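The offset logic behind that lookup can be illustrated outside GATE. This is a minimal Python sketch, not the GATE API: the function covering_sentences is hypothetical and the sentence spans are written by hand as (start, end) character offsets.

```python
def covering_sentences(sentences, start, end):
    """Return the sentence spans that fully contain the range [start, end)."""
    return [(s, e) for s, e in sentences if s <= start and end <= e]

text = "John appointed as new CEO for google. He starts soon."
sentences = [(0, 37), (38, 53)]          # hand-made sentence offsets

match_start = text.index("new CEO")      # offset of the matched sequence
match_end = match_start + len("new CEO")

print(covering_sentences(sentences, match_start, match_end))  # [(0, 37)]
# A range that crosses the sentence boundary is covered by no sentence.
print(covering_sentences(sentences, 30, 40))                  # []
```

The second call corresponds to the edge case mentioned above, where "new" ends one sentence and the job title starts the next.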

preg_match optional elements in string

I have two strings which contain up to 3 elements:
1) anychar[price]{alphanum} e.g. a1\')[=00.00]{a1234}
2) anychar:anychar{alphanum} e.g. a1\'):a2\'){a1234}
...but the {} element is optional and may not always be there. I wrote the following patterns (respectively):
1) /(.+)\[(.+)\]\{*(\w+)*\}*/ - works as expected
2) /(.+)\:(.+)\{*(\w+)*\}*/ - works fine if the {} element is removed, but not with it.
The result array for 2 is as follows:
(
[0] => a1\'):a2\'){a123}
[1] => a1\')
[2] => a2\'){a123}
)
I've tried a few different permutations of the above but no dice. Any ideas?
First you should remove the * after {, } and (\w+).
'/(.+)\:(.+)\{(\w+)\}/'
Gives
array(4) {
[0]=>
string(18) "a1\'):a2\'){a1234}"
[1]=>
string(5) "a1\')"
[2]=>
string(5) "a2\')"
[3]=>
string(5) "a1234"
}
* means zero or more occurrences. Because .+ is greedy, the second group grabs as much of the string as it can, and since you made the whole third part optional (by using * everywhere), the regex still matches with everything in the second group and nothing in the third. That's why your pattern didn't work.
Now, to deal with the fact that the third part is optional, you can use a positive lookahead: the second group is only allowed to stop where another regex matches right after it. The final regex is this:
'/(.+)\:(.+(?=(?:(?<=[^}])$|\{(\w+)\})))/'
What I changed:
inside the second group, I added a positive lookahead of the form (?=regex). As said, this means it has to match. Lookaheads are zero-width: they consume nothing, so the text they check is not added to the second group and they create no entry of their own in the result.
inside that lookahead, I created two cases, so the .+ from the second group has to stop at a position where either case matches.
The first case is very basic: it means end of string not preceded by a }. This matches when the third part is not there.
The second case is the pattern for the third group. It is a capturing group, so it is returned in the results if present.
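The behaviour of the final pattern can be checked on both input shapes. This is a minimal sketch that uses Python's re module instead of PHP's preg_match, on the assumption that the two engines treat this pattern the same way; the helper split_parts is hypothetical.

```python
import re

# The pattern from the answer, unchanged apart from Python raw-string quoting.
PATTERN = re.compile(r"(.+)\:(.+(?=(?:(?<=[^}])$|\{(\w+)\})))")

def split_parts(s):
    """Return (first, second, third); third is None when the {} element is absent."""
    m = PATTERN.match(s)
    return m.groups() if m else None

with_braces = r"a1\'):a2\'){a1234}"
without_braces = r"a1\'):a2\')"

print(split_parts(with_braces))     # third element is 'a1234'
print(split_parts(without_braces))  # third element is None
```

In both cases the second group stops at a2\'), which is the result the question was after.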
