PyLighter annotates characters, not whole words - keras

I have annotated a corpus of bank transfer reasons. In the returned CSV, each labeled word is annotated character by character, like this:
BIO HEALTH - ['B-company', 'I-company', 'I-company', 'I-company', 'I-company', 'I-company', 'I-company', 'I-company', 'I-company', 'I-company']
From other projects, I see that annotations look like this:
BIO HEALTH - ['U-company']
I understand this is another annotation format, BILOU.
My model does not work properly with PyLighter's IOB2 format, or maybe I am transforming the data in the wrong way. Is there a way to convert to BILOU, or can someone give me a short tutorial on how to work with annotated data accurately?
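Assuming the per-character labels line up with the characters of the text (spaces included), one way to collapse them to word level and then convert IOB2 to BILOU is a small sketch like this (function names are my own, not part of PyLighter):

```python
def char_labels_to_tokens(text, char_labels):
    """Collapse per-character labels to one label per whitespace-split word,
    taking the label of each word's first character."""
    tokens, labels = [], []
    i = 0
    for word in text.split(" "):
        tokens.append(word)
        labels.append(char_labels[i])
        i += len(word) + 1  # skip the word and the following space
    return tokens, labels

def iob2_to_bilou(labels):
    """Convert token-level IOB2 tags to BILOU:
    B- stays B- if the entity continues, else becomes U-;
    I- stays I- if the entity continues, else becomes L-."""
    bilou = []
    for i, tag in enumerate(labels):
        nxt = labels[i + 1] if i + 1 < len(labels) else "O"
        if tag == "O":
            bilou.append("O")
        elif tag.startswith("B-"):
            bilou.append(tag if nxt == "I-" + tag[2:] else "U-" + tag[2:])
        elif tag.startswith("I-"):
            bilou.append(tag if nxt == "I-" + tag[2:] else "L-" + tag[2:])
    return bilou
```

For the example above, `char_labels_to_tokens("BIO HEALTH", ...)` yields `["B-company", "I-company"]` at word level, which `iob2_to_bilou` turns into `["B-company", "L-company"]`; a single-token entity `["B-company"]` becomes `["U-company"]`.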

How to support 'double' and 'triple' in Dialogflow digit-string entities?

In Australia it is totally normal for a voice-assistant user to speak digit strings with 'double' and 'triple'. (The same is true in the UK, where they also sometimes use "treble".)
So "8845" is said "double eight four five".
"6663" will often be said as "triple six three".
Dialogflow doesn't seem to support this for any of the system digit-string entities that aim to understand a user speaking a string of digits.
So, anyone know how to support "double" and "triple" in digit strings in Dialogflow?
Do I have to 'roll my own'?
To handle these cases, you can create a dev mapping entity (let's call it "number-extra"):
reference value    synonyms
88                 double eight
666                triple six
Since there are only 10 "double" and 10 "triple" variants (one for each digit), you can simply create a mapping for each one (11, 22, 33, etc.).
You also need a composite entity (let's call it "number"):
#numbers-extra
#sys.number
Both entities should return strings, so there will be no inconsistencies in the composite entity and the reference values should be easy to handle on the backend.
You should also add training phrases that use these entities, e.g. "My address is triple six three Main Street" and annotate the entities accordingly. This gives your model more information about how these entities are used and will improve accuracy.
This suggestion can be generalized for other sys entities as well. Missing city? Create an entity for cities and combine it with #sys.geo-city in a composite entity. Missing given-name? Same procedure.
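Since both entities return strings, mapping the spoken forms back to a digit string on the backend is straightforward; a minimal sketch (function and dictionary names are illustrative, not part of Dialogflow):

```python
# Map spoken digit words, including "double"/"triple"/"treble",
# back to the digit string the application actually needs.
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def spoken_to_digits(phrase):
    out = []
    repeat = 1  # how many copies of the next digit to emit
    for word in phrase.lower().split():
        if word == "double":
            repeat = 2
        elif word in ("triple", "treble"):
            repeat = 3
        elif word in DIGITS:
            out.append(DIGITS[word] * repeat)
            repeat = 1
    return "".join(out)
```

For example, `spoken_to_digits("double eight four five")` returns `"8845"`, and `"triple six three"` returns `"6663"`.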
You can use SSML and some logic to accomplish this.
Parse "468826661" to be four six double eight two triple six one and then just send it like that in a <speak></speak> tag.
Here are the docs for that.
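For the text-to-speech direction, grouping repeated digits before wrapping the result in a `<speak>` tag can be sketched like this (a simple greedy grouping I wrote for illustration, not an official Dialogflow feature):

```python
import re

SPOKEN = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def digits_to_ssml(number):
    """Turn '468826661' into SSML reading 'four six double eight two triple six one'."""
    parts = []
    # find maximal runs of the same digit, e.g. '88' or '666'
    for match in re.finditer(r"(\d)\1*", number):
        run = match.group(0)
        word = SPOKEN[run[0]]
        n = len(run)
        while n > 0:  # greedily emit triples, then doubles, then singles
            if n >= 3:
                parts.append("triple " + word)
                n -= 3
            elif n == 2:
                parts.append("double " + word)
                n -= 2
            else:
                parts.append(word)
                n -= 1
    return "<speak>" + " ".join(parts) + "</speak>"
```

So `digits_to_ssml("468826661")` produces `<speak>four six double eight two triple six one</speak>`, matching the example above.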

How to conduct entity co-referencing and negation detection with OpenNLP?

Is there a way to handle negations in OpenNLP?
For instance
"He is NOT a dangerous person."
With POS tagging we would usually detect the adjective "dangerous" and assign a negative sentiment. Is there any way to account for the negation?
More specifically (with OpenNLP): how could entity co-referencing/relation extraction be done?

Extracting <subject, predicate, object> triplet from unstructured text

I need to extract simple triplets from unstructured text. Usually they are of the form noun-verb-noun, so I have tried POS tagging and then extracting nouns and verbs from the neighbourhood.
However, this leads to a lot of special cases and gives low accuracy.
Will Syntactic/semantic parsing help in this scenario?
Will ontology based information extraction be more useful?
I expect that syntactic parsing would be the best fit for your scenario. Some trivial template-matching method with POS tags might work, where you find verbs preceded and followed by a single noun, and take the former to be the subject and the latter the object. However, it sounds like you've already tried something like that -- unless your neighbourhood extraction ignores word order (which would be a bit silly: you'd be guessing which noun was the subject and which was the object, and that's assuming exactly two nouns in each sentence).
Since you're looking for {s, v, o} triplets, chances are you won't need semantic or ontological information. That would be useful if you wanted more information, e.g. agent-patient relations or deeper knowledge extraction.
{s,v,o} is shallow syntactic information, and given that syntactic parsing is considerably more robust and accessible than semantic parsing, that might be your best bet. Syntactic parsing will be sensitive to simple word re-orderings, e.g. "The hamburger was eaten by John." => {John, eat, hamburger}; you'd also be able to specifically handle intransitive and ditransitive verbs, which might be issues for a more naive approach.
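The trivial template-matching baseline described above can be sketched over (token, POS-tag) pairs from any tagger (the function name is my own). Note that, as predicted, it simply finds nothing for the passive "The hamburger was eaten by John" rather than recovering the triplet:

```python
def extract_svo(tagged):
    """Naive template matcher over (token, POS-tag) pairs:
    a verb with a single noun immediately before and after it
    is taken as a {subject, verb, object} triplet."""
    triplets = []
    for i in range(1, len(tagged) - 1):
        (prev_w, prev_t), (w, t), (next_w, next_t) = tagged[i - 1], tagged[i], tagged[i + 1]
        # Penn Treebank tags: VB* = verb, NN* = noun
        if t.startswith("VB") and prev_t.startswith("NN") and next_t.startswith("NN"):
            triplets.append((prev_w, w, next_w))
    return triplets
```

`extract_svo([("John", "NNP"), ("eats", "VBZ"), ("hamburgers", "NNS")])` yields `[("John", "eats", "hamburgers")]`, while the passive sentence yields nothing, which is exactly the kind of gap a syntactic parse would close.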

Utility to generate performance report of a NLP based text annotator

I am trying to build a quality test framework for my text annotator. I wrote my annotators using GATE.
I do have gold-standard (human-annotated) data for every input document.
Here is the list of GATE resources for quality assurance: GATE Embedded API for the measures.
So far, I am able to get performance metrics containing FP, TP, FN, precision, recall and F-scores using methods in AnnotationDiffer.
Now, I want to dive deeper. I would like to look at individual FPs and FNs on a per-document basis,
i.e. I want to analyze each FP and FN so that I can fix my annotator accordingly.
I didn't see any method in GATE's classes such as AnnotationDiffer that returns a List<Annotation> of FPs or FNs. They just return counts of FP and FN:
int fp = annotationDiffer.getFalsePositivesStrict();
int fn = annotationDiffer.getMissing();
Before I go ahead and create my own utility to get a List<Annotation> of FPs and FNs (plus a couple of surrounding sentences) and build an HTML report per input document for analysis, I wanted to check whether something like that already exists.
I figured out how to get the FP and FN annotations:
List<AnnotationDiffer.Pairing> differ = annotationDiffer.calculateDiff(goldAnnotSet, systemAnnotSet);
for (Annotation fnAnnotation : annotationDiffer.missingAnnotations) {
    System.out.println("FN=>" + fnAnnotation);
}
for (Annotation fpAnnotation : annotationDiffer.spuriousAnnotations) {
    System.out.println("FP=>" + fpAnnotation);
}
Based on the offsets of the FN and FP annotations, I can easily get the surrounding sentences and create a nice HTML report.
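Pulling the surrounding context out of the document text by character offsets (as stored on GATE annotations) can be sketched like this; the function names and the margin size are my own choices, not part of GATE:

```python
import html

def context_window(text, start, end, margin=80):
    """Return (left context, error span, right context) around a
    character-offset span, for an error-analysis report."""
    lo = max(0, start - margin)
    hi = min(len(text), end + margin)
    return text[lo:start], text[start:end], text[end:hi]

def report_row(label, text, start, end):
    """One HTML table row highlighting a FP/FN span in its context."""
    left, span, right = context_window(text, start, end)
    return ("<tr><td>%s</td><td>%s<b>%s</b>%s</td></tr>"
            % (label, html.escape(left), html.escape(span), html.escape(right)))
```

Rows like these, one per FP/FN, concatenated into a table per input document, give a quick visual scan of where the annotator goes wrong.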

Mapping interchangeably terms such as Weight to Mass for QAnswering NLP

I've been working on a question-answering engine in C#. I have implemented the features of most modern systems and am achieving good results. Despite the aid of WordNet, one problem I haven't been able to solve yet is mapping the user's input to the correct term.
For example
changing Weight -> Mass
changing Tall -> Height
My question is about the existence of some sort of resource that can aid me in this task of changing the terms to the correct terms.
Thank You
Looking at all the synsets in WordNet for both Mass and Weight, I can see that there is no shared synset, and thus no meaning in common. Words that actually do have the same meaning can be matched by means of their synset labels, as I'm sure you've realized.
In my own natural language engine (http://nlp.abodit.com) I allow users to use any synset label in the grammar they define but I would still create two separate grammar rules in this case, one recognizing questions about mass and one recognizing questions about weight.
However, there are also files for WordNet that give you class relationships between synsets. For example, if you type 'define mass' into my demo page you'll see:
4. wn30:synset-mass-noun-1
the property of a body that causes it to have weight in a gravitational field
--type--> wn30:synset-fundamental_quantity-noun-1
--type--> wn30:synset-physical_property-noun-1
ITokenText, IToken, INoun, Singular
And if you do the same for 'weight' you'll also see that it too has a class relationship to 'physical property'.
In my system you can write a rule that recognizes a question about a 'physical property' and perhaps a named object and then try to figure out which physical property they are likely to be asking about. And, perhaps, if you can't match maybe just tell them all about the physical properties of the object.
The method signature in my system would be something like ...
... QuestionAboutPhysicalProperties (... IPhysicalProperty prop,
INamedObject obj, ...)
... and in code I would look at the properties of obj and try to find one called 'prop'.
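The class-relationship lookup behind this can be sketched with a toy stand-in for WordNet's hypernym ("--type-->") links, hard-coding only the relations quoted above; a real system would load them from WordNet itself:

```python
# Toy stand-in for WordNet 3.0 hypernym links, containing just the
# relations shown in the 'define mass' / 'define weight' output above.
HYPERNYMS = {
    "synset-mass-noun-1": {"synset-fundamental_quantity-noun-1",
                           "synset-physical_property-noun-1"},
    "synset-weight-noun-1": {"synset-physical_property-noun-1"},
}

def shared_classes(a, b):
    """Classes (hypernyms) that two synsets have in common."""
    return HYPERNYMS.get(a, set()) & HYPERNYMS.get(b, set())
```

Here `shared_classes("synset-mass-noun-1", "synset-weight-noun-1")` yields the 'physical property' class, which is what lets a single grammar rule recognize questions about either term.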
The only way that I know how to do this effectively requires having a large corpus of user query sessions and a happiness measure on sessions, and then finding correlations between substituting word x for word y (possibly given some context z) that improves user happiness.
Here is a reasonable paper on generating query substitutions.
And here is a new paper on generating synonyms from anchor text, which doesn't require a query log.