Ignore certain words in GATE NLP NER

I want to add an ignore.lst to the GATE ANNIE gazetteer so that the words in that list are not reported as named entities. I see a stop.lst inside the ANNIE gazetteer. What is stop.lst used for? I created ignore.lst and added it to list.def. How can I make GATE not annotate names contained in ignore.lst?

I don't think it will work using gazetteers alone. I think the option is to write a post-processing "cleanup" JAPE rule that deletes every NE annotation containing a word from your list.
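The filtering logic such a cleanup step would apply can be sketched outside GATE. This is a plain Python illustration, not JAPE and not the GATE API; the function name, the (text, label) tuple shape, and the sample data are all hypothetical.

```python
def drop_ignored_entities(entities, ignore_words):
    """Return only the entities whose text contains no ignored word."""
    # Compare case-insensitively, since gazetteer lists are often lowercase.
    ignored = {word.lower() for word in ignore_words}
    kept = []
    for text, label in entities:
        tokens = text.lower().split()
        if not any(token in ignored for token in tokens):
            kept.append((text, label))
    return kept


# Hypothetical NER output as (text, label) pairs.
entities = [
    ("John Smith", "Person"),
    ("Acme Support", "Organization"),
    ("London", "Location"),
]
ignore_words = ["support"]  # stands in for the contents of ignore.lst

print(drop_ignored_entities(entities, ignore_words))
```

In GATE itself the same check would live in the JAPE rule, matching NE annotations that contain a Lookup produced from the ignore list.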

Related

DialogFlow: import training phrases from a document rather than input manually?

DialogFlow: how to import training phrases from a document rather than input manually?
If there are a lot of them, you could use the API to add them programmatically. You need to add your training phrases to the Intent through https://cloud.google.com/dialogflow/es/docs/reference/rest/v2/projects.agent.intents
Another option, one-off and hacky, is to export the agent from the agent settings, edit the exported files (look for the "user says" files), add your training phrases there, and restore the agent.
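For the API route, here is a minimal sketch of turning the lines of a document into the training-phrase part of an Intent request body. It only builds the JSON and sends nothing. The field names (displayName, trainingPhrases, type, parts, text) reflect my reading of the v2 Intent resource and should be checked against the reference linked above; the function names and the "opening-hours" intent are made up for illustration.

```python
import json


def build_training_phrases(lines):
    """Turn raw text lines into a list of training-phrase objects."""
    phrases = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines in the source document
        phrases.append({"type": "EXAMPLE", "parts": [{"text": text}]})
    return phrases


def build_intent_body(display_name, lines):
    """Assemble a request body for creating or updating an intent (assumed shape)."""
    return {
        "displayName": display_name,
        "trainingPhrases": build_training_phrases(lines),
    }


# In practice the lines would come from open("phrases.txt"); inlined here.
sample_lines = ["What are your opening hours?\n", "\n", "When do you open?\n"]
body = build_intent_body("opening-hours", sample_lines)
print(json.dumps(body, indent=2))
```

The resulting body would then be sent with an authenticated request to the intents endpoint, or the same phrases could be passed through an official client library.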

Approach for Text mining on a file and assigning category

I need help deciding on an algorithmic approach. The text is read line by line, and each line contains the description of an incident ticket. On reading each row, it should assign a category to that incident using a set of keyword associations that has already been decided. For example, if the description contains words like password(s), it should be assigned the category "password issue".
Kindly help
You can try bag-of-words, or document vectors.
If there are spelling errors, you'll need fuzzy-matching techniques.
You’ll want to clean stop words beforehand as well.
Good luck.
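As a rough sketch of the keyword approach with stop-word removal and fuzzy matching, using only the Python standard library: the category table, the stop-word list, and the 0.8 similarity cutoff are assumptions to adapt to your own data. Only the "password issue" category comes from the question.

```python
import difflib
import re

# Hypothetical keyword table; the second category is made up.
CATEGORY_KEYWORDS = {
    "password issue": ["password", "passwords"],
    "network issue": ["network", "vpn", "wifi"],
}
STOP_WORDS = {"the", "a", "an", "is", "my", "to", "i", "and", "not"}


def tokenize(line):
    """Lowercase the line, split into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", line.lower())
    return [word for word in words if word not in STOP_WORDS]


def categorize(line, cutoff=0.8):
    """Return the first category with a keyword close to any token in the line."""
    tokens = tokenize(line)
    for category, keywords in CATEGORY_KEYWORDS.items():
        for token in tokens:
            # get_close_matches tolerates small spelling errors.
            if difflib.get_close_matches(token, keywords, n=1, cutoff=cutoff):
                return category
    return "uncategorized"


for ticket in ["User forgot the pasword", "VPN is not connecting"]:
    print(ticket, "->", categorize(ticket))
```

A lower cutoff catches more misspellings but also produces more false matches, so it is worth tuning on a sample of real tickets.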

Can different entities in dialogflow have the same value?

I am trying to understand how to structure intents when entities contain the same strings as values.
I imagine that this will become a mess as functionality is added.
What is the correct approach to handle this "mixed" content?
example:
Entity 1: content
word document(s)
html page(s)
video(s)
Entity 2: content-specifier
video(s)
image(s)
car(s)
Example 1: show me all [html pages] with [videos]
the expectation is to have
#content => "html page"
#content-specifier => "video"
Example 2: show me all [videos] with [cars]
the expectation is to have
#content => "video"
#content-specifier => "car"
I believe that at the beginning you're going to get a lot of false-positive matches. After training and adding a lot of training phrases it should be OK, though. To help it match better, you can also use a template.
3 things to note here:
Make sure you correct, in your intent, any values it misclassified.
Go to the training option frequently and make sure that the entities are correctly recognized. Make any necessary changes.
Create a template for the user input. To do so, in the training phrases of your intent, click on the quotes icon. It will change to an "at" symbol (@). Then add the expected format of your user's input.

Google prediction API - Training data syntax for multi classification

I am trying to harness the power of the Google Prediction API to classify my data. Each item in my DB can have multiple categories assigned to it.
For example: "My Nexus phone is rebooting constantly" could be assigned both #Android and #troubleshooting tags.
I would like to upload my training data to Google, but I'm not sure how to apply both tags to the same content. In the example I've found, the syntax provides one category for each piece of content, like so:
"Android" ,"My Nexus phone is rebooting constantly"
What is the right syntax for multi-classification training data?
Unless I'm misunderstanding something from your question, I think the answer to it is in the docs here.
Namely, the section about text strings explains that when you submit a text string, the system actually cuts it into multiple strings, using whitespace as the delimiter. They give the example of "Godzilla vs Mothra" being split into "Godzilla", "vs", and "Mothra". So in your case, you could just use "Android troubleshooting". The system will separate it into "Android" and "troubleshooting".
From the docs:
Each line can only have one label assigned, but you can apply multiple labels to one example by repeating an example and applying different labels to each one. For example:
"excited", "OMG! Just had a fabulous day!"
"annoying", "OMG! Just had a fabulous day!"
If you send a tweet to this model, you might get a classification something like this: "excited":0.6, "annoying":0.2.
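A small sketch of producing that repeated-example layout from multi-label data, using only the standard library. The helper names are hypothetical, and the output format simply mirrors the quoted "label", "text" lines above.

```python
import csv
import io


def expand_multilabel(examples):
    """Emit one (label, text) row per label, repeating the text each time."""
    rows = []
    for text, labels in examples:
        for label in labels:
            rows.append((label, text))
    return rows


def to_csv(rows):
    """Serialize rows as fully quoted CSV, one training example per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


examples = [
    ("My Nexus phone is rebooting constantly", ["Android", "troubleshooting"]),
]
print(to_csv(expand_multilabel(examples)), end="")
```

Each item with N labels becomes N training lines, which is the approach the quoted documentation describes.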

SentenceSplitter in GATE

I am trying to detect Sentences using GATE and more specifically using either ANNIE SentenceSplitter or RegexSentenceSplitter.
RegexSentenceSplitter seems to be working very well, however the only problem is that a new sentence annotation is being created at the beginning of each new page of the document. (The documents analysed are PDFs).
Is it possible to change this behavior of the RegexSentenceSplitter?
You can try using a conditional corpus pipeline. It lets you run a PR (here, the RegexSentenceSplitter) or skip it according to the value of a feature on the document.
More details here: https://gate.ac.uk/sale/tao/splitch3.html#x6-480003.8.2
