I am having great trouble with JAPE grammars. I have a small token dictionary of words that need to be matched, with five types of documents.
One dictionary per type: for the Job type, for example, the dictionary would contain { "Engineer", "Doctor", "Manager" }. I need to read this dictionary and create JAPE rules from it. This is my first try:
Phase: Jobtitle
Input: Lookup
Options: control = appelt debug = true

Rule: Jobs
(
  {Lookup.majorType == "Doctor"}
  (
    {Lookup.majorType == "Engineer"}
  )?
):jobs
-->
:jobs.JobTitle = {rule = "Jobs"}
Is there any way to automatically create JAPE rules whose only job is to match tokens from a dictionary against documents?
Why not use a standard gazetteer, where the last parameter in the .def file can carry a custom type like "Doctor" or "Engineer"?
Something like: keywords.lst:Doctor:Doctor::Doctor
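For illustration, a minimal sketch of that setup (the file names and the majorType/annotationType values here are assumptions). A lists.def file with one line per dictionary, where the columns are list:majorType:minorType:language:annotationType:

jobs.lst:Job:::JobTitle

and a jobs.lst file with one entry per line:

Engineer
Doctor
Manager

With the optional fifth column set, matches are annotated directly as JobTitle (with majorType "Job"), so no JAPE rule is needed just to tag dictionary hits.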
If I use the following yamldecode expression, I get the following output:
output "product" {
value = distinct(yamldecode(file("resource/lab/abc.yaml"))["account_list"]["resource_tags"][*]["TAG:product"])
}
Output:
+ product = [
+ "fargate",
+ "CRM",
]
I want "fargate" to be removed from my output; the expected output is this:
+ product = [
+ "CRM"
]
Please let me know how I can do this.
output "product" {
value = compact([for x in distinct(yamldecode(file("resource/lab/abc.yaml"))["account_list"]["resource_tags"][*]["TAG:product"]) : x == "fargate" ? "" : x])
}
I get this output:
product = [
  "CRM",
]
The compact function solved the problem.
The Terraform language is based on functional principles rather than imperative principles and so it does not support directly modifying an existing data structure. Instead, the best way to think about goals like this is how to define a new value which differs from your existing value in some particular way.
For collection and structural types, the most common way to derive a new value with different elements is to use a for expression, which describes a rule for creating a new collection based on the elements of an existing collection. For each element in the source collection we can decide whether to use it in the result at all, and then if we do decide to use it we can also describe how to calculate a new element value based on the input element value.
In your case you seem to want to create a new list with fewer elements than an existing list. (Your question title mentions maps, but the question text shows a tuple which is behaving as a list.)
To produce a new collection with fewer elements we use the optional if clause of a for expression. For example:
[for x in var.example : x if x != "fargate"]
In the above expression, x refers to each element of var.example in turn. Terraform first evaluates the if clause, and expects it to return true if the element should be used. If so, it will then evaluate the expression immediately after the colon, which in this case is just x and so the result element is identical to the input element.
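For example, in terraform console (using a literal list in place of var.example):

> [for x in ["fargate", "CRM"] : x if x != "fargate"]
[
  "CRM",
]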
Your example expression also includes some unrelated work to decode a YAML string and then traverse to the list inside it. That expression replaces var.example in the above expression, as follows:
output "products" {
value = toset([
for x in yamldecode(file("${path.module}/resource/lab/abc.yaml"))["account_list"]["resource_tags"][*]["TAG:product"]: x
if x != "fargate"
])
}
I made some other small changes to the above compared to your example:
I named the output value "products" instead of "product", because it's returning a collection and it's the Terraform language idiom to use plural names for collection values.
I wrapped the expression in toset instead of distinct, because from your description it seems like this is an unordered collection of unique strings rather than an ordered sequence of strings. Returning a set rather than a list is more appropriate here because it communicates to the caller of the module that they may not rely on the order of these items, and therefore allows the order to change in future without it being a breaking change to your module.
I added path.module to the front of the file path so that this will look for the file in the current module directory. If your output value is currently in a root module then this doesn't really matter, because the module directory will always be the current working directory, but it's good practice to include this so that it's clear that the file belongs to the module.
I am trying to perform a smart dynamic lookup with strings in Python for an NLP-like task. I have a large number of similarly structured sentences that I would like to parse, tokenizing parts of each sentence. For example, I first parse a string such as "bob goes to the grocery store".
I am taking this string in and splitting it into words, and my goal is to look up matching words in a keyword list. Let's say I have a list of single keywords such as "store" and a list of keyword phrases such as "grocery store".
sample = 'bob goes to the grocery store'
keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery store', 'computer store', 'coffee shop']
for word in sample.split():
    # do dynamic length lookups
Now the issue is this: sometimes my sentences might be simply "bob goes to the store" instead of "bob goes to the grocery store".
I want to find the keyword "store" for sure, but if there is a descriptive word such as "grocery" or "computer" before "store", I would like to capture that as well. That is why I have the keyphrases list. I am trying to figure out a way to capture at least a keyword, and then, if there are neighbouring words that might form a "phrase" with it, capture those too.
Maybe an alternative is to have some sort of adjective list instead of a phrase list of multiple words?
How could I go about doing these sort of variable length lookups where I look at more than just a single word if one is captured, or is there an entirely different method I should be considering?
Here is how you can use a nested for loop and a formatted string:
sample = 'bob goes to the grocery store'
keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery', 'computer', 'coffee']
for kw in keywords:
    for kp in keyphrases:
        if f"{kp} {kw}" in sample:
            # Do something
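If you also want to capture a bare keyword when no qualifying word is present, one option is a longest-match-first scan over the tokens: try the two-word phrase first, then fall back to the single keyword. A rough sketch using the lists from the question (the helper name is just for illustration):

sample = 'bob goes to the grocery store'
keywords = {'store', 'restaurant', 'shop', 'office'}
keyphrases = {'grocery store', 'computer store', 'coffee shop'}

def find_matches(sentence, keywords, keyphrases):
    words = sentence.split()
    matches = []
    i = 0
    while i < len(words):
        # Try the two-word phrase first so "grocery store" wins over "store".
        if i + 1 < len(words) and f"{words[i]} {words[i + 1]}" in keyphrases:
            matches.append(f"{words[i]} {words[i + 1]}")
            i += 2
        elif words[i] in keywords:
            matches.append(words[i])
            i += 1
        else:
            i += 1
    return matches

print(find_matches(sample, keywords, keyphrases))                      # ['grocery store']
print(find_matches('bob goes to the store', keywords, keyphrases))     # ['store']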
I am new to NLP and need guidance on how I can solve this problem.
I am working on a filtering task where I need to brand data in a database as either correct or incorrect. I am given a structured data set, with columns and rows.
However, the filtering conditions are given to me in a text file.
An example filtering text file could be the following:
Values in the column ID which are bigger than 99
Values in the column Cash which are smaller than 10000
Values in the column EndDate that are smaller than values in StartDate
Values in the column Name that contain numeric characters
Any value that matches those conditions should be branded as bad.
However, I want to extract those conditions and append them to the program that I've made so far.
For instance, for the conditions above, I would like to produce
`if ID>99`
`if Cash<10000`
`if EndDate < StartDate`
`if Name LIKE %[1-9]%`
How can I achieve the above result using Stanford NLP (or any other NLP library)?
This doesn't look like a machine learning problem; it's a simple parser. You have a simple syntax, from which you can easily extract the salient features:
column name
relationship
target value or target column
The resulting "action rule" is simply removing the "syntactic sugar" words and converting the relationship -- and possibly the target value -- to its symbolic form.
Enumerate all of your critical words for each position in a lexicon. Then use basic string manipulation operators in your chosen implementation language to find the three needed fields.
EXAMPLE
Given the data above, your lexicons might be like this:
column_trigger = "Values in the column"
relation_dict = {
"are bigger than" : ">",
"are smaller than" : "<",
"contain" : "LIKE",
...
}
value_desc = {
"numeric characters" : "%[1-9]%",
...
}
From here, use these items in standard parsing. If you're not familiar with that, please look up the basics of a simple sentence grammar in your favourite programming language, with rules such as
SENTENCE => SUBJ VERB OBJ
Does that get you going?
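As a rough illustration only, here is a minimal Python sketch of that idea using the lexicons above (the handling of column references and the exact trigger phrasings are assumptions based on your examples):

column_trigger = "Values in the column "
relation_dict = {
    "are bigger than": ">",
    "are smaller than": "<",
    "contain": "LIKE",
}
value_desc = {
    "numeric characters": "%[1-9]%",
}

def parse_condition(sentence):
    # 1. Column name: the first word after the trigger phrase.
    rest = sentence[len(column_trigger):]
    column, _, tail = rest.partition(" ")
    # 2. Relationship: the first lexicon phrase found in the remainder.
    for phrase, symbol in relation_dict.items():
        if phrase in tail:
            target = tail.split(phrase, 1)[1].strip()
            break
    else:
        raise ValueError(f"no known relation in: {sentence!r}")
    # 3. Target: a described value, a reference to another column, or a literal.
    if target in value_desc:
        target = value_desc[target]
    elif target.startswith("values in "):
        target = target[len("values in "):]
    return f"if {column} {symbol} {target}"

conditions = [
    "Values in the column ID which are bigger than 99",
    "Values in the column Cash which are smaller than 10000",
    "Values in the column EndDate that are smaller than values in StartDate",
    "Values in the column Name that contain numeric characters",
]
for c in conditions:
    print(parse_condition(c))
# if ID > 99
# if Cash < 10000
# if EndDate < StartDate
# if Name LIKE %[1-9]%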
I have an elasticsearch index that contains various member documents. Each member document contains a membership object, along with various fields associated with / describing individual membership. For example:
{"membership": {"join_date": "2015-01-01", "status": "A"}}
Membership status can be 'A' (active) or 'I' (inactive); both are Unicode string values. I'm interested in providing a slight boost to the score of documents that contain an active membership status.
In my groovy script, along with other custom boosters on various numeric fields, I have added the following:
String status = doc['membership.status'].value;
float status_boost = 0.0;
if (status=='A') {status_boost = 2.0} else {status_boost=0.0};
return _score + status_boost
For some reason associated with how strings operate via groovy, the check (status=='A') does not work. I've attempted (status.toString()=='A'), (status.toString()=="A"), (status.equals('A')), plus a number of other variations.
How should I go about troubleshooting this (in a productive, efficient manner)? I don't have a stand-alone installation of groovy, but when I pull the response data in python the status is very much so either a Unicode 'A' or 'I' with no additional spacing or characters.
@VineetMohan is most likely right about the value being indexed as 'a' rather than 'A'.
You can check how the values are indexed by spitting them back out as script fields:
$ curl -XGET localhost:9200/test/_search -d '
{
"script_fields": {
"status": {
"script": "doc[\"membership.status\"].values"
}
}
}
'
From there, you should get an indication of what you're actually working with. More than likely, based on the name and your usage, you will want to reindex (recreate) your data so that membership.status is mapped as a not_analyzed string. If done, then you won't need to worry about lowercasing at all.
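For example, with a 1.x-style mapping (the index and type names here are assumptions):

$ curl -XPUT localhost:9200/test -d '
{
  "mappings": {
    "member": {
      "properties": {
        "membership": {
          "properties": {
            "status": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}
'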
In the meantime, you can probably get by with:
return _score + (doc['membership.status'].value == 'a' ? 2 : 0)
As a big aside, you should not be using dynamic scripting. Use stored scripts in production to avoid security issues.
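For example, with file scripts (one non-dynamic option in the Groovy-era 1.x releases; the file name here is an assumption), you save the script on each node under config/scripts:

$ cat config/scripts/status_boost.groovy
_score + (doc['membership.status'].value == 'a' ? 2 : 0)

and then reference it in the search request with "script_file": "status_boost" instead of passing the script body inline.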
How can I check whether a sentence contains a given combination? For example, consider the sentence:
John appointed as new CEO for google.
I need to write a rule to check whether the sentence contains < 'new' + 'Jobtitle' >.
How can I achieve this? I tried the following, but I still need to check that 'new' comes before the job-title word.
Rule: CustomRules
(
{
Sentence contains {Lookup.majorType == "organization"},
Sentence contains {Lookup.majorType == "jobtitle"},
Sentence contains {Lookup.majorType == "person_first"}
}
)
One way to handle this is to invert the problem: focus on the sequence you need, and then get the covering Sentence:
(
  {Token.string == "new"}
  {Lookup.majorType == "jobtitle"}
):newJT
You should also check the edge case where a Sentence boundary falls between "new" and the job title, i.e. "new" ends one sentence and the title starts the next:
new
CEO
To exclude that case, you can use something like this (the Lookup must not coincide with the start of a Sentence):
{Token.string == "new"}
{!Sentence, Lookup.majorType == "jobtitle"}
And then get the sentence (if you really need it) in the Java RHS:
// Span covered by the :newJT binding, then the Sentence annotations overlapping it.
long start = newJTAnnots.firstNode().getOffset();
long end = newJTAnnots.lastNode().getOffset();
AnnotationSet sentences = inputAS.get("Sentence", start, end);
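Putting the fragments together, a complete rule might look like this (a sketch only; the phase name, Input line, and output annotation are assumptions):

Phase: NewJobTitle
Input: Token Lookup Sentence
Options: control = appelt

Rule: NewJobTitle
(
  {Token.string == "new"}
  {!Sentence, Lookup.majorType == "jobtitle"}
):newJT
-->
:newJT.NewJobTitle = {rule = "NewJobTitle"}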