First a little bit of context: I'm trying to identify street addresses in a corpus of documents and we decided that the obvious solution for this would be to use an NLP (Apache OpenNLP in this case) tool to achieve this and so far everything looks great although we still need to train the model with a lot of documents, but that's not really an issue. We improved the solution by adding a extra step for address validation by using the USAddress parser from Datamade. My biggest issue is the fact that the addresses by themselves are nothing without a location next to them, sometimes the location is specified in the text and we will assume that this happens quite often.
Here comes my question: Is there someway to use coreference to associate the entities in the text? Or better yet is there a way to annotate arbitrary words in the text and identify them as being one entity?
I've been looking at the Apache OpenNLP documentation but...it's pretty thin and I think it still needs some work.
If you want to use coreference for this problem, you can have a look at this blog
But a simpler solution would be using a sentence detector+ RegEx or a location NER+ sentence detector(presuming addresses are in a single line)
I think the US addresses can be identified using a Regular Expression and once the regex matches, you can use opennlp's sentence detector to print the whole address line.
Similarly you can use NER model provided by opennlp to find locations and print the sentence you want.
Hope this helps!
edit
this Github Repo made it simple for us. Check it out!
OpenNLP does not provide a coreference resolution module. You have to use either Stanford or Illinois or Berkeley system to accomplish the task. They may not work out of the box, you may have to do some parameter tuning or supervised training to achieve reasonable performance.
#edit
Thanks #Alaye for pointing out that OpenNLP does have a coref module, for more details see his answer.
Thanks
Ok, several months later! It wasn't Coref what I was after... what I as actually looking for was Relation Extraction (Information Extraction). I used MITIE (BinaryRelation) and that did the trick, I trained my own model using Brat annotation tool and I got an F1 score of 0.81. Pretty neat...
Related
Has anyone used CoreNLP from stanford for sentiment analysis in Spark?
It is not working as desired or may be I need to do some work which I am not aware of.
Following is the example.
1). I look forward to interacting with kids of states governed by the congress. - POSITIVE
2). I look forward to interacting with CM of states governed by the congress. - NEGETIVE (CM is chief minister)
Please note the change in one word here. kids -> CM
Statement 2 is not negetive but coreNLP tagged it as negetive.
is there anything I need to do to make it work as desired? Any alteration required? Please let me know if I need to plug-in any custom code.
Whoever has knowledge on this, please suggest something.
Also, suggest if there is any other better alternate to coreNLP.
Thanks.
Gaurav
The sentiment model relies on embeddings for all of the words in the statement. If you change one word you can radically alter the score it will get. In fact there is a whole area of machine learning research which is analyzing situations like this where you can make changes that don't affect a human's judgment/perception but radically alter the learned model's choices. For instance in vision problems, there are ways to alter the image that are imperceptible to humans but fool the trained model.
It is important to remember that while a neural network can do impressive things, it is not a perfect replication of the human mind. Numerous examples like this have been presented to me over the last year, so there clearly are some deficiencies.
The Goal
My goal is to create an API that can verify how grammatically correct a sentence is. I am using a Markov Chain to generate a bunch of lines and I want to rank them by how much sense they make.
I want to be able to have some input like:
[
"This sentence is totally great!",
"Not great so sentence this one.",
"From on in where is are for pig."
]
and then get some output like:
[
0.71,
0.30,
-0.43,
]
Where I'm currently at
I've looked at using the Stanford Parser but I don't think there's a way to use your own corpus.
Currently, I am using a Microsoft joint probability cognitive service, which also doesn't allow a custom corpus and seems pretty rudimentary.
Direct questions
Is this a solved problem?
What is this kind of problem/research called? (So I know how to google around for it)
What methods are there for accomplishing something like this?
I don't know why you are unable to use your own corpus with stanford parser, but you can always use OpenNLP.
Here's what I would do:
create a corpus of parsed sentences, with some correct sentences. I can stop here and do what you're doing. Or else I'll do 2.
create a word2vec model and see how close(cosine similarity) is the parsed input sentence. Hopefully, you'll get good results.
you can have a quick start here for using OpenNLP.
Hope this helps!
The easiest way to solve your need is to create a language model from your corpus and then evaluate your test sentences against it which will give you some sort of score. You can look at the results and see if you need a more sophisticated approach.
I would start with a character language model, probably up to 6-grams for 10k sentences or so, longer if you have lots of data, shorter with less. You will have to play with it. You can also try token language models in the 2-4 token range. Tutorial for LingPipe is here .
Not exactly your use case but it least it is concrete.
It is easy to build and sensitive to your corpus.
You could try and treat the problem as a spell check problem where instead of spelling correction you are doing grammar correction with the backing language model informing whether the edits have improved things. The more edits required to get a good estimate will translate to a lower score for the original's grammaticality. The closest code to that is our tutorial here.
But it will need a lot of modification to apply to the grammar use case.
There are many ways to approach your problem if the language model approach doesn't work but I would start there.
Breck
I'm embarking on a project for a non-profit organization to help process and classify 1000's of reports annually from their field workers / contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on the approach to solve our problem.
I'll highlight the current process, and our challenges and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers increased the volume of reports being generated has grown and our editors are now becoming the bottle-neck.
Solution: We would like to automate the 1st step of our process i.e., checking the document for compliance to the organizational best practice template
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords. i.e., to check for compliance we use the following logic:
1 To check "states purpose" = we do a regex to match 'purpose', 'intent'
2 To check "identifies audience" = we do a regex to match with 'identifies', 'is for'
3 To check "highlights relevance" = we do a regex to match with 'able to', 'allows', 'enables'
The current approach of RegEx seems very primitive and limited so I wanted to ask the community if there is a better way to solving this problem using something like NLTK, CoreNLP.
Thanks in advance.
Interesting problem, i believe its a thorough research problem! In natural language processing, there are few techniques that learn and extract template from text and then can use them as gold annotation to identify whether a document follows the template structure. Researchers used this kind of system for automatic question answering (extract templates from question and then answer them). But in your case its more difficult as you need to learn the structure from a report. In the light of Natural Language Processing, this is more hard to address your problem (no simple NLP task matches with your problem definition) and you may not need any fancy model (complex) to resolve your problem.
You can start by simple document matching and computing a similarity score. If you have large collection of positive examples (well formatted and specified reports), you can construct a dictionary based on tf-idf weights. Then you can check the presence of the dictionary tokens. You can also think of this problem as a binary classification problem. There are good machine learning classifiers such as svm, logistic regression which works good for text data. You can use python and scikit-learn to build programs quickly and they are pretty easy to use. For text pre-processing, you can use NLTK.
Since the reports will be generated by field workers and there are few questions that will be answered by the reports (you mentioned about 3 specific components), i guess simple keyword matching techniques will be a good start for your research. You can gradually move to different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
If you don't provide yourself hints about the format of the document this is an open problem.
What you can do thought, is ask people writing report to conform to some format for the document like having 3 parts each of which have a pre-defined title like so
1. Purpose
Explains the purpose of the document in several paragraph.
2. Topic / Problem
This address the foobar problem also known as lorem ipsum feeling text.
3. Take away
What can the reader do after reading it?
You parse this document from .doc format for instance and extract the three parts. Then you can go through spell checking, grammar and text complexity algorithm. And finally you can extract for instance Named Entities (cf. Named Entity Recognition) and low TF-IDF words.
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise; CoreNLP and OpenNLP, the libraries that I'm most familiar with, have some tools that can help you with part of the task. For example; if your Regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally" etc., and you wouldn't necessarily know if the word is a verb, a noun, an adjective or an adverb. POS taggers and the parsers in these libraries would be able to tell you the type (POS) of the word and maybe you only care about the verbs that start with "inten", or more strictly, the verbs spoken by the 3rd person singular.
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger for example; you would also know that "she" is a personal pronoun and "took" is a past tense verb.
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.
I have a collection of "articles", each 1 to 10 sentences long, written in a noisy, informal english (i.e. social media style).
I need to extract some information from each article, where available, like date and time. I also need to understand what the article is talking about and who is the main "actor".
Example, given the sentence: "Everybody's presence is required tomorrow morning starting from 10.30 to discuss the company's financial forecast.", I need to extract:
the date/time => "10.30 tomorrow morning".
the topic => "company's financial forecast".
the actor => "Everybody".
As far as I know, the date and time could be extracted without using NLP techniques but I haven't found anything as good as Natty (http://natty.joestelmach.com/) in Python.
My understanding on how to proceed after reading some chapters of the NLTK book and watching some videos of the NLP courses on Coursera is the following:
Use part of the data to create an annotated corpus. I can't use off-the-shelf corpus because of the informal nature of the text (e.g. spelling errors, uninformative capitalization, word abbreviations, etc...).
Manually (sigh...) annotate each article with tags from the Penn TreeBank tagset. Is there any way to automate this step and just check/fix the results ?
Train a POS tagger on the annotated article. I've found the NLTK-trainer project that seems promising (http://nltk-trainer.readthedocs.org/en/latest/train_tagger.html).
Chunking/Chinking, which means I'll have to manually annotate the corpus again (...) using the IOB notation. Unfortunately according to this bug report n-gram chunkers are broken: https://github.com/nltk/nltk/issues/367. This seems like a major issue, and makes me wonder whether I should keep using NLTK given that it's more than a year old.
At this point, if I have done everything correctly, I assume I'll find actor, topic and datetime in the chunks. Correct ?
Could I (temporarily) skip 1,2 and 3 and produce a working, but possibly with a high error rate, implementation ? Which corpus should I use ?
I was also thinking of a pre-process step to correct common spelling mistakes or shortcuts like "yess", "c u" and other abominations. Anything already existing I can take advantage of ?
THE question, in a nutshell, is: is my approach at solving this problem correct ? If not, what am I doing wrong ?
Could I (temporarily) skip 1,2 and 3 and produce a working, but
possibly with a high error rate, implementation ? Which corpus should
I use ?
I was also thinking of a pre-process step to correct common spelling
mistakes or shortcuts like "yess", "c u" and other abominations.
Anything already existing I can take advantage of ?
I would suggest you first have a go at processing standard language text. The pre-processing you refer to is an NLP task in its own right, known as normalization. Here is a resource for Twitter normalization: http://www.ark.cs.cmu.edu/TweetNLP/ , additionally, you can use spell checking, sentence boundary detection, ...
THE question, in a nutshell, is: is my approach at solving this
problem correct ? If not, what am I doing wrong ?
If you make abstraction of normalization, I think your approach is valid. With regard to automating the annotation process: you can bootstrap the process by using off-the-shelf components first, after which you correct, retrain, and so on, ... during different iterations. To get acceptable results, you will need to do your steps 2, 3, and 4 a couple of times.
If you are interested in understanding the problem and being able to optimize existing solutions, I would suggest you focus on tools that allow you to develop your own models. If you prioritize getting results over being able to develop your own models, I would recommend looking into existing open source text engineering frameworks such as Gate (https://gate.ac.uk/) UIMA (http://uima.apache.org/) and DKPro (which extends UIMA) (https://code.google.com/p/dkpro-core-asl/). All three frameworks wrap existing components, so you have a wide range of possible solutions.
I'd suggesting giving a try to NER and Temporal Normalizer.
Here is what I see for your example sentence:
You can try the demo here:
http://deagol.cs.illinois.edu:8080/
We've been working with the NLTK library in a recent project where we're
mainly interested in the named entities part.
In general we're getting good results using the NEChunkParser class.
However, we're trying to find a way to provide our own terms to the
parser, without success.
For example, we have a test document where my name (Shay) appears in
several places. The library finds me as GPE while I'd like it to find
me as PERSON...
Is there a way to provide some kind of a custom file/
code so the parser will be able to interpret the named entity as I
want it to?
Thanks!
The easy solution is to compile a list of entities that you know are misclassified, then filter the NEChunkParser output in a postprocessing module and replace these entities' tags with the tags you want them to have.
The proper solution is to retrain the NE tagger. If you look at the source code for NLTK, you'll see that the NEChunkParser is based on a MaxEnt classifier, i.e. a machine learning algorithm. You'll have to compile and annotate a corpus (dataset) that is representative for the kind of data you want to work with, then retrain the NE tagger on this corpus. (This is hard, time-consuming and potentially expensive.)