Extract person names from unstructured text - nlp

I have a collection of bills and invoices, so there is no context in the text (I mean they don't tell a story).
I want to extract people's names from those bills.
I tried OpenNLP, but the quality of the trained model is not good because I don't have context.
So the first question is: can I train a model that contains only people's names, without context? If that is possible, can you point me to a good article on how to build that new model? (Most of the articles I read didn't explain the steps I should take to build a new model.)
I have a name database with more than 100,000 person names (first name, last name). If NER systems don't work in my case (because there is no context), what is the best way to search for those candidates? (I mean, do I search for every first name combined with all the other last names?)
Thanks.

Regarding "context", I guess you mean that you don't have entire sentences, i.e. no previous / next tokens, in which case you face quite a non-standard NER task. I am not aware of available software or training data for this particular problem. If you find none, you'll have to build your own corpus for training and/or evaluation purposes.
Your database of names will probably help greatly, depending on what proportion of the names on the bills are actually present in the database. You'll also probably have to rely on the character-level morphology of names, expressed as patterns (see for instance the patterns in [1]). Once you have a training set with features (presence in the database, morphology, other information on the bill) and solutions (the actual names in annotated bills), using standard machine learning such as an SVM will be quite straightforward (if you are not familiar with this, just ask).
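As a rough illustration of the features mentioned above, here is a minimal Python sketch. The `KNOWN_NAMES` set is a hypothetical stand-in for the name database, and the "word shape" encoding is just one common way of capturing character-level morphology; neither is prescribed by [1].

```python
import re

# Hypothetical stand-in for the asker's database of 100,000+ names.
KNOWN_NAMES = {"john", "smith", "maria"}

def word_shape(token):
    """Collapse a token to a character-class pattern, e.g. 'Smith' -> 'Xx'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Squeeze runs of the same class so 'Smith' and 'Jo' share one shape.
    return re.sub(r"(.)\1+", r"\1", shape)

def features(token):
    """Build a feature dict for one token taken from a bill."""
    return {
        "in_database": token.lower().strip(".,") in KNOWN_NAMES,
        "shape": word_shape(token),
        "has_digit": any(ch.isdigit() for ch in token),
        "length": len(token),
    }
```

Feature dicts like these, paired with the annotated answers, are the kind of input a standard classifier would be trained on.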
Some other suggestions:
You can most probably also use other information on the bill: company name, positions, tax mentions, etc.
You may also proceed in a selective manner: if every bill should mention (exactly?) one person's name, you can exclude all other text (e.g. amounts, tax names, positions, etc.), or assume in a dedicated model that, among all the text in a bill, only one span should be guessed as a name.
[1] Ranking algorithms for named-entity extraction: Boosting and the voted perceptron (Michael Collins, 2002)

I'd start with some regular expressions, then possibly augment that with a dictionary-based approach (i.e., big list of names).
No matter what you do, it won't be perfect, so be sure to keep that in mind.
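A minimal sketch of the regex-plus-dictionary idea in Python, under some assumptions: `FIRST_NAMES` and `LAST_NAMES` are hypothetical stand-ins for the database of names, and a name is assumed to be a first name directly followed by a last name on the same line. Looking up adjacent token pairs in two sets avoids searching for every first name combined with every last name.

```python
import re

# Hypothetical stand-ins for the first-name and last-name columns of the database.
FIRST_NAMES = {"john", "maria", "ahmed"}
LAST_NAMES = {"smith", "garcia", "hassan"}

# A token is a run of letters, optionally containing apostrophes or hyphens.
TOKEN = re.compile(r"[A-Za-z][A-Za-z'-]*")

def find_names(text):
    """Return (first, last) pairs of adjacent tokens found in the name lists."""
    hits = []
    for line in text.splitlines():
        tokens = TOKEN.findall(line)
        # Slide over adjacent token pairs; each lookup is a constant-time set test.
        for first, last in zip(tokens, tokens[1:]):
            if first.lower() in FIRST_NAMES and last.lower() in LAST_NAMES:
                hits.append((first, last))
    return hits
```

This will miss middle names, initials and "Last, First" layouts, and it will accept any dictionary word that happens to be a name, so it is only a starting point.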

Related

Techniques other than RegEx to discover 'intent' in sentences

I'm embarking on a project for a non-profit organization to help process and classify thousands of reports annually from their field workers / contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on the approach to solve our problem.
I'll highlight the current process, and our challenges and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers has increased, the volume of reports being generated has grown, and our editors are now becoming the bottleneck.
Solution: We would like to automate the first step of our process, i.e., checking each document for compliance with the organizational best-practice template.
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords. i.e., to check for compliance we use the following logic:
1. To check "states purpose", we do a regex match on 'purpose', 'intent'
2. To check "identifies audience", we do a regex match on 'identifies', 'is for'
3. To check "highlights relevance", we do a regex match on 'able to', 'allows', 'enables'
The current RegEx approach seems very primitive and limited, so I wanted to ask the community if there is a better way to solve this problem using something like NLTK or CoreNLP.
Thanks in advance.
Interesting problem; I believe it's a genuine research problem! In natural language processing, there are a few techniques that learn and extract templates from text and can then use them as gold annotations to check whether a document follows the template structure. Researchers have used this kind of system for automatic question answering (extract templates from questions and then answer them). In your case it's more difficult, as you need to learn the structure from a report. From an NLP point of view your problem is hard to frame (no simple NLP task matches your problem definition), but you may not need a fancy, complex model to solve it.
You can start with simple document matching and compute a similarity score. If you have a large collection of positive examples (well-formatted, well-specified reports), you can construct a dictionary based on tf-idf weights and then check for the presence of the dictionary tokens. You can also treat this as a binary classification problem. There are good machine learning classifiers, such as SVMs and logistic regression, that work well on text data. You can use Python and scikit-learn to build programs quickly, and they are pretty easy to use. For text pre-processing, you can use NLTK.
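To make the similarity-score idea concrete, here is a small sketch in plain Python (standard library only, so that it stays self-contained; in practice scikit-learn would do this for you). The tokenizer and the smoothed idf formula are assumptions made for illustration, not the only reasonable choices.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * smoothed idf."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0
```

A new report can then be vectorized together with a set of known-good reports and flagged for a curator when its best similarity falls below a threshold you choose by inspecting real data.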
Since the reports will be written by field workers and only a few questions need to be answered by each report (you mentioned 3 specific components), I guess simple keyword matching techniques will be a good start for your research. You can gradually move in different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
Unless you give yourself hints about the format of the document, this is an open problem.
What you can do, though, is ask the people writing reports to conform to a fixed format, for instance three parts, each with a predefined title, like so:
1. Purpose
Explains the purpose of the document in several paragraphs.
2. Topic / Problem
This addresses the foobar problem, also known as lorem ipsum feeling text.
3. Take away
What can the reader do after reading it?
You parse this document (from .doc format, for instance) and extract the three parts. Then you can run spell-checking, grammar-checking and text-complexity algorithms on each part. Finally, you can extract, for instance, named entities (cf. Named Entity Recognition) and low TF-IDF words.
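Assuming the report has already been converted to plain text and uses exactly the three numbered titles from the template above, the extraction step could look roughly like this in Python. The `REQUIRED` list and both function names are made up for this sketch.

```python
import re

# Section titles taken from the template above.
REQUIRED = ["Purpose", "Topic / Problem", "Take away"]

# A heading line looks like "1. Purpose".
HEADING = re.compile(r"^\s*\d+\.\s*(.+?)\s*$")

def split_sections(text):
    """Map each recognized section title to the text that follows it."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = HEADING.match(line)
        if match and match.group(1) in REQUIRED:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}

def missing_sections(text):
    """List the required sections that are absent or empty."""
    sections = split_sections(text)
    return [title for title in REQUIRED if not sections.get(title)]
```

An empty result from `missing_sections` means the report can move on to the spell-checking and grammar steps; otherwise it can be sent back to its author automatically.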
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise, CoreNLP and OpenNLP, the libraries that I'm most familiar with, have some tools that can help you with part of the task. For example, if your regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally", etc., and you wouldn't necessarily know whether the word is a verb, a noun, an adjective or an adverb. The POS taggers and parsers in these libraries can tell you the part of speech (POS) of the word; maybe you only care about the verbs that start with "inten", or, more strictly, verbs in the third-person singular.
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger, for example, you would also know that "she" is a personal pronoun and "took" is a past-tense verb.
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.

How to determine if a piece of text mentions a product

I'm new to natural language processing, so I apologize if my question is unclear. I have read a book or two on the subject and done general research on various libraries to figure out how I should be doing this, but I'm not yet confident that I know what to do.
I'm playing with an idea for an application, and part of it is trying to find product mentions in unstructured text (e.g. tweets, Facebook posts, emails, websites, etc.) in real time. I won't go into what the products are, but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe I could use a string similarity check to catch these.
Product names that are also English words or things would get caught, like mustang the horse versus mustang the car.
Needing to keep a list of alternative names for products (e.g. "coke" for "coca-cola", etc.)
I don't really know where to start with this, but any help would be appreciated. I've already looked at NLTK and SciKit and didn't really glean how to do this from there. If you know of examples or papers that explain this, links would be helpful. I'm not tied to any language at this point: Java preferably, but Python and Scala are acceptable.
The answer that you chose is not really answering your question.
The best approach you can take is to use a Named Entity Recognizer (NER) and a POS tagger (grab NNP/NNPS, i.e. proper nouns). Its database might be missing some newer brands like Lyft (Uber's rival), but without developing your own prop database, the Stanford tagger will solve half of your immediate needs.
If you have time, I would build a dictionary that has every brand name and simply extract them from the tweet strings.
http://www.namedevelopment.com/brand-names.html
If you know how to crawl, it's not a hard problem to solve.
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they nevertheless are still independent tasks.
Misspellings
In order to deal with potential misspellings of words, you need to associate these possible misspellings to their canonical (i.e. correct) form.
Phonetic similarity: A common cause of "misspellings" is opacity in the relationship between a word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically, so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results which you would have to address because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
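The hand-made mapping and the string similarity check can be combined in a few lines. The sketch below is in Python and uses the standard library's `difflib`; the `ALIASES` and `PRODUCTS` contents and the 0.8 cutoff are assumptions chosen for illustration, and phonetic indexing is not shown.

```python
import difflib

# Hand-made alias -> canonical form mappings (illustrative entries only).
ALIASES = {"coke": "coca-cola", "coca cola": "coca-cola"}

# Canonical product names, assumed to come from the known product list.
PRODUCTS = ["coca-cola", "pepsi", "snickers", "mustang"]

def canonical_product(token, cutoff=0.8):
    """Map a token to a canonical product name, or None if nothing is close."""
    word = token.lower().lstrip("#")
    # Exceptions first: aliases that no similarity measure would find.
    if word in ALIASES:
        return ALIASES[word]
    # Then fuzzy matching, which tolerates small misspellings.
    matches = difflib.get_close_matches(word, PRODUCTS, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

The cutoff is where the noise mentioned above shows up: distinct words such as chic and chick are similar enough to clear 0.8, so it needs tuning against real text.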
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
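A full distributional model is beyond a short example, but the underlying idea (let the neighbouring words decide the sense) can be sketched with hand-picked cue words. This is a much cruder technique than distributional semantics, and the `SENSE_CUES` sets below are invented for illustration; in a real system they would be learned from text.

```python
import re

# Invented cue words per sense; a real system would learn these from a corpus.
SENSE_CUES = {
    "car": {"car", "bought", "drive", "engine", "ford", "dealer"},
    "horse": {"horse", "wild", "herd", "ride", "grazing", "ranch"},
}

def guess_sense(text, target="mustang", window=5):
    """Pick the sense whose cue words overlap most with the nearby words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if target not in tokens:
        return None
    i = tokens.index(target)
    # Context = up to `window` words on each side of the first occurrence.
    context = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    scores = {sense: len(context & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # Refuse to guess when no cue word is present or the senses tie.
    if scores[best] == 0 or list(scores.values()).count(scores[best]) > 1:
        return None
    return best
```

Returning `None` when there is no evidence matters here, because it lets the caller report a low confidence instead of a wrong product match.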
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. For simple meanings referring to generic entities like this, they can often be considered to be variant spellings in the same way that "common misspellings" are and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. "I'm in the throne room just now!" to "The throne room of Buckingham Palace is beautiful."
Conclusion
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.

What features do NLP practitioners use to pick out English names?

I am trying named entity recognition for the first time. I'm looking for features that will pick out English names. I am using the methods outlined in the Coursera NLP course (week three) and the NLTK book. In other words: I am defining features, identifying features of words and then running those words/features through a classifier that I train on labeled data.
What features are used to pick out English names?
I can imagine that you'd look for two capital words in a row, or a capital word and then an initial and then a capital word. (ex. John Smith or James P. Smith).
But what other features are used for NER?
Some common features:
Word lists for common names (John, Adam, etc.)
Casing
Contains symbols or numeric characters (names generally don't)
Person prefixes (Mr., Mrs., etc.)
Person postfixes (Jr., Sr., etc.)
Single-letter abbreviations (e.g. J. Smith)
Analysis of surrounding words (you may find some words have a high probability of appearing near names)
Named entities previously recognized (often it is easy to identify an NE in some parts of the corpus based on context, but very hard in other parts; if previously identified, this is an excellent hint towards NER)
Depending on what language you are working with, there may be more language-specific features as well. Frankly, you can turn up a wealth of information with a simple Google query; I'm really not sure why you haven't turned there. Some starting points, however:
Google
A survey of named entity recognition and classification
Named entity recognition without gazetteers
I did something similar back in school using machine learning. I suppose that you will use a supervised algorithm and classify every single word independently, not words in combination. In that case I would choose some features for the word itself like the ones you mentioned (whether the word begins with a capital letter, whether the word is an abbreviation), but I would add some more features, such as whether the previous or next word also starts with a capital letter, or whether they are abbreviations. This way you can add some context and overcome the problems related to your basic independence assumption.
If you want, have a look here. In the machine learning section you can find some more information and examples (the problem is slightly different, but the method should be similar).
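A sketch of what such a per-word feature function might look like in Python, combining features of the word itself with features of its neighbours. The `TITLES` and `SUFFIXES` sets and the feature names are illustrative assumptions; the resulting dict is the kind of input an NLTK-style classifier accepts.

```python
# Small illustrative lists; real ones would be longer.
TITLES = {"mr.", "mrs.", "ms.", "dr."}
SUFFIXES = {"jr.", "sr."}

def token_features(tokens, i):
    """Features for tokens[i], including its left and right neighbours."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else ""
    next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        # Features of the word itself.
        "is_capitalized": word[:1].isupper(),
        "is_initial": len(word) == 2 and word[0].isupper() and word[1] == ".",
        "has_digit_or_symbol": not all(ch.isalpha() or ch in ".'-" for ch in word),
        # Context features that soften the independence assumption.
        "prev_is_title": prev_word.lower() in TITLES,
        "prev_is_capitalized": prev_word[:1].isupper(),
        "next_is_capitalized": next_word[:1].isupper(),
        "next_is_suffix": next_word.lower() in SUFFIXES,
    }
```

For the tokens of "Mr. James P. Smith Jr.", the word "James" gets both `prev_is_title` and `next_is_capitalized` set, which is exactly the kind of context a single-word classifier would otherwise miss.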
Whatever features you choose it is important that you use some measure to evaluate their relevance and possibly reduce them to the useful ones to avoid over-fitting. One of the measures you can use to evaluate them is the gain ratio but there are many more. Here you can find some basic information about feature extraction.
Hope it helps!

Focused Named Entity Recognition (NER)?

I want to recognize named entities in a specific field (e.g. baseball). I know there are tools available like StanfordNER, LingPipe, AlchemyAPI and I have done a little testing with them. But I want them to be field specific, as I mentioned earlier. How is this possible?
One approach may be to
Use a general (non-domain specific) tool to detect people's names
Use a subject classifier to filter out texts that are not in the domain
If the total size of the data set is sufficient and the accuracy of the extractor and classifier good enough, you can use the result to obtain a list of people's names that are closely related to the domain in question (e.g. by restricting the results to those that are mentioned significantly more often in domain-specific texts than in other texts).
In the case of baseball, this should be a fairly good way of getting a list of people related to baseball. It would, however, not be a good way to obtain a list of baseball players only. For the latter it would be necessary to analyse the precise context in which the names are mentioned and the things said about them; but perhaps that is not required.
Edit: By subject classifier I mean the same as what other people might refer to simply as categorization, document classification, domain classification, or similar. Examples of ready-to-use tools include the classifier in Python-NLTK (see here for examples) and the one in LingPipe (see here).
Have a look at smile-ner.appspot.com, which covers 250+ categories. In particular, it covers a lot of persons/teams/clubs in sports. It may be useful for your purpose.

How to find references to dates in natural text?

What I want to do is to parse raw natural text and find all the phrases that describe dates.
I've got a fairly big corpus with all the references to dates marked up:
I met him <date>yesterday</date>.
Roger Zelazny was born <date>in 1937</date>
He'll have a hell of a hangover <date>tomorrow morning</date>
I don't want to interpret the date phrases, just locate them. The fact that they're dates is irrelevant (in real life they're not even dates but I don't want to bore you with the details), basically it's just an open-ended set of possible values. The grammar of the values themselves can be approximated as context-free, however it's quite complicated to build manually and with increasing complexity it gets increasingly hard to avoid false positives.
I know this is a bit of a long shot so I'm not expecting an out-of-the-box solution to exist out there, but what technology or research can I potentially use?
One of the generic approaches used in academia and in industry is based on Conditional Random Fields. Basically, a CRF is a probabilistic model: you first train it on your marked-up data, and then it can label certain types of entities in a given text.
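Sequence labellers such as CRFs are usually trained on one label per token rather than on inline markup, so a first practical step is converting the marked-up corpus. Below is a minimal Python sketch that turns the `<date>` markup from the question into token/label pairs using the common B/I/O scheme. It assumes whitespace tokenization and non-nested tags, and the `B-DATE`/`I-DATE` label names are a convention chosen here, not something a particular tool requires.

```python
import re

# Matches one marked-up phrase; assumes tags are never nested.
TAGGED = re.compile(r"<date>(.*?)</date>")

def to_bio(sentence):
    """Turn '<date>...</date>' markup into (token, label) pairs."""
    pairs = []
    pos = 0
    for match in TAGGED.finditer(sentence):
        # Text before the tagged phrase is outside any entity.
        pairs += [(tok, "O") for tok in sentence[pos:match.start()].split()]
        # First token of the phrase begins the entity, the rest continue it.
        inside = match.group(1).split()
        pairs += [(tok, "B-DATE" if k == 0 else "I-DATE")
                  for k, tok in enumerate(inside)]
        pos = match.end()
    pairs += [(tok, "O") for tok in sentence[pos:].split()]
    return pairs
```

The resulting pairs still need to be written out in whatever column format the chosen trainer expects.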
You can even try one of the systems from Stanford Natural Language Processing Group: Stanford Named Entity Recognizer
When you download the tool, note there are several models, you need the last one:
Included with the Stanford NER are a 4 class model trained for CoNLL,
a 7 class model trained for MUC, and a 3 class model trained on both
data sets for the intersection of those class sets.
3 class Location, Person, Organization
4 class Location, Person, Organization, Misc
7 class Time, Location, Organization, Person, Money, Percent, Date
Update. You can actually try that tool online here. Select the muc.7class.distsim.crf.ser.gz classifier and try some text with dates. It doesn't seem to recognize "yesterday", but it recognizes "20th century", for example. In the end, this is a matter of CRF training.
Keep in mind that CRFs are rather slow to train and require human-annotated data, so doing it yourself is not easy. Read the answers to this for another example of how people often do it in practice, which has not much in common with current academic research.
