I'm learning about natural language processing and I'm wondering if someone could point me in the right direction.
Say I have a bunch of contracts, and they have something like:
Joe's Farm, hereafter known as the seller,
and Bob's supermarket, hereafter known as the buyer, blah blah..
I'd like to be able to identify which party is the buyer and seller in this sentence. From what I have read, it should be theoretically possible to:
1. Give the AI a lot of sample sentences and tell it "this is the buyer/seller".
2. After training, it should be able to analyze a new sentence.
I have tried some entity extraction (tokenizing the sentence and identifying the party names) but I don't know how to tell it "this party is the buyer".
One workaround is to identify segments of the sentence and search if that has the word "buyer" in it... which probably works in most cases, but I want to try to do this in an "AI" way.
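The keyword workaround described above can be sketched with a regular expression. This is only a minimal illustration: the pattern assumes the exact wording "hereafter known as the buyer/seller" from the example sentence, and the names `ROLE_PATTERN` and `find_parties` are made up for this sketch.

```python
import re

# Assumption: every party is introduced as "<name>, hereafter known as the <role>".
# Other phrasings ("hereinafter", "referred to as", ...) will not match.
ROLE_PATTERN = re.compile(
    r"(?P<name>[^,\n]+),\s*hereafter known as the (?P<role>buyer|seller)",
    re.IGNORECASE,
)

def find_parties(text):
    """Return a dict mapping each role ('buyer' or 'seller') to a party name."""
    parties = {}
    for match in ROLE_PATTERN.finditer(text):
        # Drop a leading "and" that joins the second party to the first.
        name = re.sub(r"^\s*and\s+", "", match.group("name"), flags=re.IGNORECASE)
        parties[match.group("role").lower()] = name.strip()
    return parties

sample = ("Joe's Farm, hereafter known as the seller,\n"
          "and Bob's supermarket, hereafter known as the buyer, blah blah..")
print(find_parties(sample))
```

A rule like this breaks as soon as the wording changes, which is the motivation for the learned approaches discussed in the answers.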
Can anyone point me in the right direction on what to research?
Looks like a coreference resolution problem to me.
Stanford CoreNLP may be a good starting point. It comes with a deterministic, a statistical, and a neural system, as well as pre-trained models.
How you solve that problem will depend on several factors, most importantly:
Do all contracts have the same format?
After specifying the seller and the buyer as you showed in your examples, do their real names appear again in the text, or are they only referred to as seller and buyer?
With that in mind, let's assume that the contracts have various formats and that the proper/real names of the seller and the buyer appear only once in the text somewhere in the introduction of the contract. This second assumption simplifies the problem (and is more likely to be the case in real-world contracts).
I would tackle the problem in three steps:
Teach the program how to identify the introduction of the contract (i.e. the paragraph in which the contract says who is who; kinda like your example sentences.)
Split the introduction into two parts: the part/sentence(s) where the seller is defined and the part/sentence(s) where the buyer is defined.
Finally, look into the seller part to find who the seller is and the buyer part to find who the buyer is.
To solve the 1st step, a small training dataset would be necessary. If not available, you could manually identify introductions of several contracts and use them as your training dataset. From here, Naive Bayes would probably be the simplest way to identify if a section of the contract is an introduction or not (you can randomly divide the contract into multiple chunks). Naive Bayes relies only on the frequency of tokens (not the ordering). You can read more here.
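To make the first step concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The training chunks and the labels "intro"/"other" are invented for illustration; in practice a library implementation and a real annotated dataset would be the better choice.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Tiny multinomial Naive Bayes; uses token frequencies only, not order."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.token_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.token_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.token_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)           # class prior
            for tok in tokenize(text):
                if tok in self.vocab:                       # skip unseen tokens
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training chunks, labeled by hand.
chunks = [
    "This agreement is made between Joe's Farm hereafter known as the seller "
    "and Bob's supermarket hereafter known as the buyer",
    "The parties to this contract are Acme Ltd hereafter known as the seller "
    "and Zed Inc hereafter known as the buyer",
    "Payment shall be made within thirty days of delivery",
    "The goods shall be delivered to the warehouse of the buyer",
]
labels = ["intro", "intro", "other", "other"]

model = NaiveBayes().fit(chunks, labels)
print(model.predict("Alice Corp, hereafter known as the seller"))
```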
To solve the 2nd step, I would pretty much repeat what's done in the first step: use a dataset to "classify" sections of the introduction as the seller part and the buyer part. This step will probably require more accuracy than the first one, though, so I suggest using an n-gram language model. That looks at the frequency of tokens, but also at their ordering and succession. You can read more here. For the n-gram length, you want something in between: not too short (a 1-gram model is essentially what Naive Bayes uses) and not too long (more than about 6 tokens), to avoid too much overlap between the seller and buyer sentences.
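As a small illustration of what an n-gram captures that single-token counts do not, the sketch below extracts n-grams from a token list. The helper name `ngrams` and the example phrases are assumptions made for this example, not part of any particular library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, preserving their order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

seller_part = "hereafter known as the seller".split()
buyer_part = "hereafter known as the buyer".split()

seller_trigrams = Counter(ngrams(seller_part, 3))
buyer_trigrams = Counter(ngrams(buyer_part, 3))

# Trigrams that occur in the seller part but not in the buyer part.
print(set(seller_trigrams) - set(buyer_trigrams))
```

Only the trigram ending in "seller" separates the two parts here, which shows why n-grams that are too short or too generic overlap heavily between the two classes.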
For the 3rd and last step, I cannot think of a straightforward way, but I would remove the stop words (i.e. frequent English words) first. Then, I would try to find rare words in close proximity to the target terms (buyer and seller). Since we are assuming that the real names only appear once in the contract, that could be a rule that helps you identify them.
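One possible reading of this last step is sketched below: drop stop words, then keep the words that occur only once in the text and sit just before the first mention of the target term. The stop-word list, the window size, and the choice to look only at preceding words are all assumptions made for this sketch.

```python
import re
from collections import Counter

# Assumption: a tiny hand-made stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "as", "to", "a", "of", "in"}

def rare_words_near(text, target, window=4):
    """Words that occur exactly once and appear within `window` tokens
    before the first occurrence of `target` (after stop-word removal)."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower())
              if t not in STOP_WORDS]
    counts = Counter(tokens)
    if target not in tokens:
        return []
    pos = tokens.index(target)
    return [t for t in tokens[max(0, pos - window):pos] if counts[t] == 1]

contract = ("Joe's Farm, hereafter known as the seller, and Bob's supermarket, "
            "hereafter known as the buyer. The seller agrees to deliver goods "
            "to the buyer.")
print(rare_words_near(contract, "seller"))
print(rare_words_near(contract, "buyer"))
```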
There are probably many other things you can try depending on the size/availability of training datasets, but this should give you a start.
There are some event description texts.
I want to extract the entrance fee of the events.
Sometimes the entrance fee is conditional.
What I want to achieve is to extract the entrance fee and its conditions (if available). It's fine to retrieve the whole phrase or sentence that states the entrance fee and its conditions.
Note I: The texts are in German language.
Note II: Often the sentences are not complete, as they are mainly event flyers or advertisements.
What would be the category of this problem in NLP? Is it Named Entity Recognition, and could it be solved by training a custom model with Apache OpenNLP?
Or would it be easier to detect the pattern via the usual keywords of the use case (entrance, $, but, only till, [number]am/pm, ...)?
Please shed some light on this.
Input Examples:
- "If you enter the club before 10pm, the entrance is for free. Afterwards it is 6$."
- "Join our party tonight at 11pm till 5am. The entrance fee is 8$. But for girls and students it's half price."
This is broadly a structure learning problem. You might have to combine Named Entity Recognition/tagging with coreference resolution. Read some papers on these, as well as related GitHub code, and take it from there. Here is a good discussion of the state-of-the-art tools for these at the moment: https://www.reddit.com/r/MachineLearning/comments/3dz3fl/dl_architectures_for_entity_recognition_and_other/
Hope that helps.
You might try Stanford's CoreNLP for the named entity extraction part. It should be able to help you pick out the money values, and there is also a link to models trained for German language as well (https://nlp.stanford.edu/software/CRF-NER.shtml).
Given that it's fine to extract the entire sentence that contains the information, I'd suggest taking a binary sentence classification approach. You could probably get quite far just by using n-grams and some named entity information as features. That would mean that you'd want to build a pipeline that automatically segments your documents into sentence-like chunks. You could try a sentence segmentation tool (also provided by Stanford CoreNLP) as a first go: https://stanfordnlp.github.io/CoreNLP/. Since this would form the basis for all further work, you'd want to ensure that the results are at least decent. Perhaps the structure of the document itself gives you enough information to segment it without even using a sentence segmentation tool.
After you have this pipeline in place, you'd want to annotate the sentences extracted from a large set of documents as relevant or non-relevant to make it a binary classification task. Then train a model based on that dataset. Finally, when you apply it to unseen data, first use the sentence segmentation approach, and then classify each sentence.
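The shape of that pipeline can be sketched as follows. This is a rule-based stand-in, not the trained classifier described above: the regex sentence splitter, the `MONEY` and `KEYWORDS` patterns, and the function names are all assumptions for illustration. For German text the keyword list would need German terms (for example "Eintritt"), and the two feature values would normally be fed to a trained model rather than combined with `any`.

```python
import re

# Assumed patterns: an amount followed by a currency marker, and a few fee words.
MONEY = re.compile(r"\d+(?:[.,]\d+)?\s*(?:\$|€|euro)", re.IGNORECASE)
KEYWORDS = re.compile(r"\b(entrance|fee|free|price)\b", re.IGNORECASE)

def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def features(sentence):
    """Binary features a sentence classifier could use."""
    return {
        "has_money": bool(MONEY.search(sentence)),
        "has_keyword": bool(KEYWORDS.search(sentence)),
    }

def fee_sentences(text):
    """Baseline 'classifier': keep a sentence if any feature fires."""
    return [s for s in split_sentences(text) if any(features(s).values())]

flyer = ("Join our party tonight at 11pm till 5am. The entrance fee is 8$. "
         "But for girls and students it's half price.")
print(fee_sentences(flyer))
```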
I'm embarking on a project for a non-profit organization to help process and classify thousands of reports annually from their field workers / contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on the approach to solve our problem.
I'll highlight the current process, and our challenges and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers has increased, the volume of reports being generated has grown, and our editors are now becoming the bottleneck.
Solution: We would like to automate the 1st step of our process, i.e., checking the document for compliance with the organizational best-practice template.
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords. i.e., to check for compliance we use the following logic:
1 To check "states purpose" = we do a regex to match 'purpose', 'intent'
2 To check "identifies audience" = we do a regex to match with 'identifies', 'is for'
3 To check "highlights relevance" = we do a regex to match with 'able to', 'allows', 'enables'
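The keyword logic above can be sketched as below, using exactly the keywords listed; the dictionary and function names are made up for this illustration. Running it on the example of a good report shows the limitation: "intended" does not contain "intent", and the report never says "is for", so only the relevance check fires.

```python
import re

# Keyword lists taken from the three checks described above.
CHECKS = {
    "purpose": re.compile(r"purpose|intent", re.IGNORECASE),
    "audience": re.compile(r"identifies|is for", re.IGNORECASE),
    "relevance": re.compile(r"able to|allows|enables", re.IGNORECASE),
}

def check_compliance(report):
    """Return, per component, whether any of its keywords occurs in the report."""
    return {name: bool(pattern.search(report)) for name, pattern in CHECKS.items()}

good_report = (
    "This document introduces techniques for successfully applying best "
    "practices across developing countries. This study is intended to help "
    "low-income farmers identify a set of best practices for pricing "
    "agricultural products in places where there is no price transparency. "
    "By implementing these processes, farmers will be able to get better "
    "prices for their produce and raise their household incomes."
)
print(check_compliance(good_report))
```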
The current RegEx approach seems very primitive and limited, so I wanted to ask the community whether there is a better way to solve this problem using something like NLTK or CoreNLP.
Thanks in advance.
Interesting problem; I believe it's a real research problem! In natural language processing, there are a few techniques that learn and extract templates from text and can then use them as gold annotations to identify whether a document follows the template structure. Researchers have used this kind of system for automatic question answering (extracting templates from questions and then answering them). In your case it's more difficult, as you need to learn the structure from a report. From an NLP point of view your problem is hard to address, since no simple NLP task matches your problem definition, but you may not need any fancy (complex) model to solve it.
You can start with simple document matching and computing a similarity score. If you have a large collection of positive examples (well-formatted, well-specified reports), you can construct a dictionary based on tf-idf weights. Then you can check for the presence of the dictionary tokens. You can also think of this as a binary classification problem. There are good machine learning classifiers, such as SVMs and logistic regression, that work well for text data. You can use Python and scikit-learn to build programs quickly, and they are pretty easy to use. For text pre-processing, you can use NLTK.
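A rough sketch of the tf-idf dictionary idea, in plain Python. It rests on an assumption that is not in the answer: document frequencies are computed over the good reports plus some unrelated background texts, so that words typical of good reports score high. All example texts and function names are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_dictionary(good_reports, background, top_k=2):
    """Top-k tokens of the good reports, weighted by tf * idf."""
    docs = [tokenize(d) for d in good_reports + background]
    df = Counter(tok for doc in docs for tok in set(doc))          # document frequency
    tf = Counter(tok for d in good_reports for tok in tokenize(d))  # term frequency
    scores = {tok: tf[tok] * math.log(len(docs) / df[tok]) for tok in tf}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:top_k]]

def coverage(report, dictionary):
    """Fraction of dictionary tokens that are present in the report."""
    tokens = set(tokenize(report))
    return sum(tok in tokens for tok in dictionary) / len(dictionary)

GOOD = [
    "This study is intended to help farmers get better prices",
    "This guide is intended to help teachers plan lessons",
    "This report is intended to help nurses track patients",
]
BACKGROUND = [
    "This is the weather report for today",
    "We went to the market to buy food",
    "The market is closed today",
]

DICTIONARY = tfidf_dictionary(GOOD, BACKGROUND)
print(DICTIONARY)
print(coverage("This note is intended to help drivers", DICTIONARY))
```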
Since the reports will be written by field workers and only a few questions need to be answered by each report (you mentioned 3 specific components), I guess simple keyword matching techniques will be a good start for your research. You can gradually move in different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
If you don't give yourself any hints about the format of the document, this is an open problem.
What you can do, though, is ask the people writing reports to conform to some format for the document, such as having 3 parts, each with a pre-defined title, like so:
1. Purpose
Explains the purpose of the document in several paragraphs.
2. Topic / Problem
This addresses the foobar problem, also known as lorem ipsum feeling text.
3. Take away
What can the reader do after reading it?
You parse this document from, for instance, .doc format and extract the three parts. Then you can run spell checking, grammar checking, and text complexity algorithms. Finally, you can extract, for instance, named entities (cf. Named Entity Recognition) and low TF-IDF words.
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise, CoreNLP and OpenNLP, the libraries that I'm most familiar with, have some tools that can help you with part of the task. For example, if your regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally", etc., and you wouldn't necessarily know whether the word is a verb, a noun, an adjective, or an adverb. The POS taggers and parsers in these libraries would be able to tell you the type (POS) of the word, and maybe you only care about the verbs that start with "inten" or, more strictly, the verbs in the third person singular.
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger, for example, you would also know that "she" is a personal pronoun and "took" is a past tense verb.
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.
I have a collection of bills and invoices, so there is no context in the text (I mean they don't tell a story).
I want to extract people's names from those bills.
I tried OpenNLP, but the quality of the trained model is not good because I don't have context.
So the first question is: can I train a model that contains only people's names, without context? If that is possible, can you point me to a good article on how to build such a model? (Most of the articles I read didn't explain the steps I should take to build a new model.)
I have a database with more than 100,000 person names (first name, last name), so if NER systems don't work in my case (because there is no context), what is the best way to search for those candidates? (I mean, do I search for every first name combined with every last name?)
Thanks.
Regarding "context", I guess you mean that you don't have entire sentences, i.e. no previous/next tokens, in which case you face quite a non-standard NER problem. I am not aware of available software or training data for this particular problem; if you find none, you'll have to build your own corpus for training and/or evaluation purposes.
Your database of names will probably help greatly, depending of course on what proportion of the names in the bills are actually present in the database. You'll probably also have to rely on the character-level morphology of names, expressed as patterns (see for instance the patterns in [1]). Once you have a training set with features (presence in the database, morphology, other information from the bill) and solutions (the actual names in annotated bills), using a standard machine-learning method such as an SVM will be quite straightforward (if you are not familiar with this, just ask).
Some other suggestions:
You can most probably also use other information from the bill: company name, positions, tax mentions, etc.
You may also proceed in a selective manner: if every bill should mention (exactly?) one person name, you may exclude all other text (e.g. amounts, tax names, positions, etc.) or assume in a dedicated model that, among all the text in a bill, only one item should be guessed as a name.
[1] Ranking algorithms for named-entity extraction: Boosting and the voted perceptron (Michael Collins, 2002)
I'd start with some regular expressions, then possibly augment that with a dictionary-based approach (i.e., big list of names).
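A dictionary-based lookup along these lines could look like the sketch below. Rather than trying every first name against every last name, it scans adjacent token pairs in the bill and checks each half against a set. The two small name sets stand in for the 100,000-entry database and are purely illustrative.

```python
import re

# Stand-ins for the real database; sets give constant-time membership tests.
FIRST_NAMES = {"john", "maria"}
LAST_NAMES = {"smith", "garcia"}

def find_names(text):
    """Return 'First Last' pairs where both tokens are in the name lists."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [
        f"{first} {last}"
        for first, last in zip(tokens, tokens[1:])
        if first.lower() in FIRST_NAMES and last.lower() in LAST_NAMES
    ]

bill = "Invoice 1042 Bill to: Maria Garcia Total 120.00 EUR"
print(find_names(bill))
```

The cost grows with the length of the bill rather than with the number of possible name combinations. As noted above, it will still miss names that are not in the lists and may flag ordinary words that happen to be names.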
No matter what you do, it won't be perfect, so be sure to keep that in mind.
Help by editing my question title and tags is greatly appreciated!
Sometimes one participant in my corpus of "conversations" will refer to another participant using a nickname, usually an abbreviation or misspelling, but hereafter I'll just say "nicknames". Let's say I'm willing to manually tell my software whether or not I think various possible nicknames are in fact nicknames, but I want software to come up with a list of possible matches between the handles that identify people and the potential nicknames. How would I go about doing that?
Background on me and then my corpus: I have no experience doing natural language processing, but I'm a competent data analyst with R. My data is produced by 70 teams, each forecasting the likelihood of 100 distinct events occurring some time in the future. The result is that I have 70 x 100 = 7000 text files, containing the stream of forecasts participants make and the comments they include with their forecasts. I'll paste a very short snip of one of these text files below; this one had to do with whether the Malian government would enter talks with the MNLA:
02/12/2013 20:10: past_returns answered Yes: (50%)
I hadn't done a lot of research when I put in my previous
placeholder... I'm bumping up a lot due to DougL's forecast
02/12/2013 19:31: DougL answered Yes: (60%)
Weak President Traore wants talks if MNLA drops territorial claims.
Mali's military may not want talks. France wants talks. MNLA sugggests
it just needs autonomy. But in 7 weeks?
02/12/2013 10:59: past_returns answered No: (75%)
placeholder forecast...
http://www.irinnews.org/Report/97456/What-s-the-way-forward-for-Mali
My initial thoughts: Obviously I can start by providing the names I'm looking to match things up with... in the above example they would be past_returns and DougL (though there is no use of nicknames in the above). I wouldn't think it'd be that hard to get a computer to guess at minor misspellings (though I wouldn't personally know where to start). I can imagine that other tricks could be used, like assuming that a string is more likely to be a nickname if it is used much more by one team than by other teams. A nickname is more likely to refer to someone who spoke recently than to someone who spoke long ago, or not at all, regarding this question. And nicknames should be used in sentences in a manner similar to the way the full name/screenname is typically used in the corpus. But I'm interested to hear about simple approaches, as well as ones that try to consider more sophisticated techniques.
This could get about as complicated as you want to make it. From the semi-linguistic side of things, research topics would include Levenshtein Distance (for detecting minor misspellings of known names/nicknames) and Named Entity Recognition (for the task of detecting names/nicknames in the first place). Actually, NER's worth reading about, but existing systems might not help you much in your domain of forum handles and nicknames.
The first rough idea that comes to mind is that you could run a tokenized version of your corpus against an English dictionary (perhaps a dataset compiled from Wiktionary or something like WordNet) to find tokens that are not ordinary words and are therefore candidates for names, then filter those through some heuristics (do they start with the same letters as known full names? Do they have a low Levenshtein distance from known names? Are they used more than once?).
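The Levenshtein heuristic can be sketched in a few lines of plain Python. The handles come from the excerpt in the question; the threshold `max_dist=2` and the function `closest_handle` are assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

def closest_handle(token, handles, max_dist=2):
    """Return the handle nearest to token, or None if none is within max_dist."""
    best = min(handles, key=lambda h: levenshtein(token.lower(), h.lower()))
    if levenshtein(token.lower(), best.lower()) <= max_dist:
        return best
    return None

handles = ["past_returns", "DougL"]
print(closest_handle("Doug", handles))
print(closest_handle("Mali", handles))
```

Candidates produced this way would still go through the manual confirmation step mentioned in the question.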
You could also try some clustering or supervised ML algorithms against the non-word tokens. That might reveal some non-"word" tokens that often occur in the same threads as a given username; again, heuristics could help rule out some false positives.
Good luck; sounds like a fun problem - hope I mentioned at least one thing you hadn't already thought of.
I'm trying to analyze a set of phrases, and I don't know exactly how "natural language processing" can help me, or whether someone can share their knowledge with me.
The objective is to extract streets and locations. Often this kind of information is not presented to the reader in a structured way, and it's hard to find a way of parsing it. I have two main objectives.
First the extraction of the streets itself. As far as I know, NLP libraries can help me tokenize a phrase and perform an analysis that will get nouns (for example). But where does a street begin and where does it end? I assume that I will need to compare that analysis with a streets database, but I don't know which method is optimal.
Also, I would like to deduce the level of severity, for example, in car accidents. I'm assuming that the only way is to establish some heuristic by the present words in the phrase (for example, if the word "deceased" appears, +100). Am I correct?
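The keyword heuristic described here can be sketched as a weighted word list. Only the "deceased" weight of 100 comes from the description above; the other words, their weights, and the function name are made up for illustration.

```python
import re

# Assumed weights; only "deceased": 100 is taken from the description above.
SEVERITY_WEIGHTS = {"deceased": 100, "injured": 40, "collision": 10}

def severity_score(phrase):
    """Sum the weights of all known severity words found in the phrase."""
    tokens = re.findall(r"[a-z]+", phrase.lower())
    return sum(SEVERITY_WEIGHTS.get(tok, 0) for tok in tokens)

print(severity_score("Collision on Smith Street, one person deceased"))
print(severity_score("Road works on Smith Street"))
```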
Thanks a lot as always! :)
The first part of what you want to do ("First the extraction of the streets itself. [...] But where a street begins and where does it end?") is a subfield of NLP called Named Entity Recognition. There are many libraries available which can do this. I like NLTK for Python myself. Depending on your choice I assume that a streetname database would be useful for training the recognizer, but you might be able to get reasonable results with the default corpus. Read the documentation for your NLP library for that.
The second part, recognizing accident severity, can be treated as an independent problem at first. You could take the raw words or their part of speech tags as features, and train a classifier on it (SVM, HMM, KNN, your choice). You would need a fairly large, correctly labelled training set for that; from your description I'm not certain you have that?
"I'm assuming that the only way is to establish some heuristic by the present words in the phrase" is very vague, and could mean a lot of things. Based on the next sentence, it kind of sounds like you think scanning for a predefined list of keywords is the only way to go. In that case, no; see the paragraph above.
Once you have both parts working, you can combine them and count the number of accidents and their severity per street. Using some geocoding library you could even generalize to neighborhoods or cities. Another challenge is the detection of synonyms ("Smith Str" vs "John Smith Street") and homonyms ("Smith Street" in London vs "Smith Street" in Leeds).