UIMA Ruta Create Label over multiple Fields - nlp

I am creating my own types, each of which should have a label. The label needs to include the whole matched string (for further processing).
For example, this would be my rule:
(W{REGEXP("myregex1")} W{REGEXP("myregex2")}) { -> CREATE(MyType, "label"=?)}
You can see the question mark behind the "label" part. Is it possible to transfer the matched string to that label?

This is normally done with something like the MATCHEDTEXT action and a STRING variable:
STRING mt;
(W{REGEXP("myregex1")} W{REGEXP("myregex2")}) { -> MATCHEDTEXT(mt), CREATE(MyType, "label"=mt)};
With UIMA Ruta 2.5.0 (upcoming release) you can also use the implicit coveredText feature of a local annotation variable (label):
m:(W{REGEXP("myregex1")} W{REGEXP("myregex2")}) { -> CREATE(MyType, "label"=m.ct)};
DISCLAIMER: I am a developer of UIMA Ruta

Related

If-Then-Else in Ruta

Is there something like if-then-else available in Ruta? I'd like to do something like:
if there's at least one term from catA, then label the document with "one"
else if there's at least one term from catB, then label the document with "two"
else label the document with "three".
All the best
Philipp
There is no language structure for if-then-else in UIMA Ruta (2.7.0).
You need to duplicate some parts of the rule in order to model the else part, e.g., something like the following:
Document{CONTAINS(CatA) -> One};
Document{-CONTAINS(CatA), CONTAINS(CatB) -> Two};
Document{-CONTAINS(CatA), -CONTAINS(CatB) -> Three};
You could also check if the previous rule has matched and depend on that.
What the rule should actually look like depends mainly on the type system and on how you want to model the information (features?).
DISCLAIMER: I am a developer of UIMA Ruta
I think you are asking about if-else-if in Ruta. This is possible using an "ONLYFIRST" block:
PACKAGE uima.ruta.example;
DECLARE CatA,CatB,CatC;
"CatA"->CatA;
"CatB"->CatB;
"CatC"->CatC;
DECLARE one,two,three;
ONLYFIRST Document{}{
Document{CONTAINS(CatA) -> one};
Document{CONTAINS(CatB) -> two};
Document{CONTAINS(CatC) -> three};
}

How to move part of the string after exact word to another field in logstash?

Let's imagine I have log file like the following:
My custom exception ST: java.lang.RuntimeException: Text of this dummy err.
My final goal is to put everything after ST: into a new field called ST and to remove the ST: prefix itself.
I'm trying to use the following pattern, but it doesn't work.
filter {
  grok {
    match => { "message" => "(?<newField>(?<=ST)(?s)(.*$))" }
  }
}
Grok is based on the Oniguruma regex library. To make . match any character, including line breaks, with an Oniguruma regex, you need to pass the (?m) inline modifier, not (?s) (as in PCRE and some other regex engines).
By placing the (?<=ST) positive lookbehind inside the named capturing group, you require ST to appear immediately before the current location, but in your input ST is followed by a colon and then a space, which would end up in the capture. It makes sense to just move ST: out of the named group:
"(?m)ST: (?<newField>.*)"
^^^^^^^^
The ST: and the space will be matched and consumed, and the newField group will hold the rest of the string.
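The difference between keeping the lookbehind inside the group and consuming the prefix outside it can be tried in any engine that supports both constructs. Below is a rough sketch in JavaScript (Node.js) rather than Oniguruma, so the modifiers differ: `[\s\S]` stands in for "dot matches everything", and the variable names are made up for illustration. It is an analogy, not a Logstash configuration.

```javascript
const line =
  "My custom exception ST: java.lang.RuntimeException: Text of this dummy err.";

// Lookbehind inside the group: the capture starts right after "ST",
// so the colon and the space end up in the captured text.
const withLookbehind = line.match(/(?<newField>(?<=ST)[\s\S]*$)/);

// Prefix matched outside the group: "ST: " is consumed and only the
// remainder of the line is captured.
const withPrefix = line.match(/ST: (?<newField>[\s\S]*)/);

console.log(JSON.stringify(withLookbehind.groups.newField));
console.log(JSON.stringify(withPrefix.groups.newField));
```

The first capture still begins with ": ", while the second begins directly with java.lang.RuntimeException, which is the behaviour the question asks for.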
You can use a specific regex like this:
^My custom exception ST: %{GREEDYDATA:ST}
Or a more generic one:
%{GREEDYDATA} \bST\b: %{GREEDYDATA:ST}
Always try to use the more specific regex.

Node.JS - if a string includes some strings with any characters in between

I am testing a string that contains an identifier for the type of device that submitted the string. The device type identifier will be something like "123456**FF789000AB", where each * denotes a position at which any character may appear. I run a series of functions to parse additional data and set variables based on the type of device submitting the data. Currently, I have the following statement:
if (payload[4].includes("02010612FF590080BC")) { function(topic, payload, intpl)};
The string tested in the includes() test will always start with 020106, but the next two characters could be anything. Is there a quick regex I could throw in the includes function, or should I organize the test in a different way?
To match the "020106**FF590080BC" pattern, where * can be anything, you can use RegExp.test() and the regular expression /020106..FF590080BC/:
if (/020106..FF590080BC/.test(payload[4])) { ... }
Also, if the pattern must match at the beginning of the string, anchor it with ^:
if (/^020106..FF590080BC/.test(payload[4])) { ... }
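The two checks above can be wrapped in a small helper and tried against sample payloads. This is only a minimal sketch: the function name `isTargetDevice`, its `mustStartWith` parameter, and the sample strings are hypothetical and not taken from the question.

```javascript
// Hypothetical helper: true when the field contains the device identifier
// 020106??FF590080BC, where each ? may be any single character.
function isTargetDevice(field, mustStartWith = false) {
  const pattern = mustStartWith
    ? /^020106..FF590080BC/ // identifier must be at the very start
    : /020106..FF590080BC/; // identifier may appear anywhere
  return pattern.test(field);
}

// Made-up sample values for illustration only.
console.log(isTargetDevice("02010612FF590080BC"));
console.log(isTargetDevice("XX020106ABFF590080BC"));
console.log(isTargetDevice("XX020106ABFF590080BC", true));
console.log(isTargetDevice("020106FF590080BC"));
```

Note that . does not match line terminators in JavaScript unless the s flag is set, which should not matter for an identifier like this one.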

How to Create JAPE Grammars Automatically?

I am having great trouble with JAPE grammars. I have a small token dictionary of the words that need to be matched against 5 types of document.
There is one dictionary per type. For example, for the type Job, the dictionary would contain { "Engineer", "Doctor", "Manager" }. I need to read this dictionary and create JAPE rules from it. This is my first try:
Phase: Jobtitle
Input: Lookup
Options: control = appelt debug = true
Rule: Jobs
(
{Lookup.majorType == "Doctor"}
(
{Lookup.majorType == "Engineer"}
)?
)
:jobs
-->
:jobs.JobTitle = {rule = "Jobs"}
Is there any way to automatically create JAPE rules that only search documents for the tokens in a dictionary?
Why not use a standard gazetteer, where the last parameter in the .def file can be a custom type like "Doctor" or "Engineer"?
Something like: keywords.lst:Doctor:Doctor::Doctor

Scala: split string via pattern matching

Is it possible to split a string into lexemes, somehow like this:
"user#domain.com" match {
case name :: "#" :: domain :: "." :: zone => doSmth(name, domain, zone)
}
In other words, in the same manner as lists...
Yes, you can do this with Scala's Regex functionality.
I found an email regex on this site, feel free to use another one if this doesn't suit you:
[-0-9a-zA-Z.+_]+#[-0-9a-zA-Z.+_]+\.[a-zA-Z]{2,4}
The first thing we have to do is add parentheses around groups:
([-0-9a-zA-Z.+_]+)#([-0-9a-zA-Z.+_]+)\.([a-zA-Z]{2,4})
With this we have three groups: the part before the #, between # and ., and finally the TLD.
Now we can create a Scala regex from it and then use Scala's pattern matching unapply to get the groups from the Regex bound to variables:
val Email = """([-0-9a-zA-Z.+_]+)#([-0-9a-zA-Z.+_]+)\.([a-zA-Z]{2,4})""".r
Email: scala.util.matching.Regex = ([-0-9a-zA-Z.+_]+)#([-0-9a-zA-Z.+_]+)\.([a-zA-Z]{2,4})
"user#domain.com" match {
case Email(name, domain, zone) =>
println(name)
println(domain)
println(zone)
}
// user
// domain
// com
Starting with Scala 2.13, it's possible to pattern match a String by unapplying a string interpolator:
val s"$user#$domain.$zone" = "user#domain.com"
// user: String = "user"
// domain: String = "domain"
// zone: String = "com"
If you are expecting malformed inputs, you can also use a match statement:
"user#domain.com" match {
case s"$user#$domain.$zone" => Some((user, domain, zone))
case _ => None
}
// Option[(String, String, String)] = Some(("user", "domain", "com"))
In general, regex is horribly inefficient, so I wouldn't advise it.
You CAN do it with Scala pattern matching by calling .toList on your string to turn it into a List[Char]. Your parts name, domain and zone will then also be List[Char]; to turn them back into Strings, use .mkString. I'm not sure how efficient this is, though.
I have benchmarked basic string operations (like substring, indexOf, etc.) against regex for various use cases, and regex is usually an order of magnitude or two slower. And of course regex is hideously unreadable.
UPDATE: The best thing to do is to use Parsers, either the native Scala ones or Parboiled2.
