I'm trying to analyze a set of phrases, and I don't know exactly how "natural language processing" can help me; perhaps someone can share their knowledge with me.
The objective is to extract streets and locations. Often this kind of information is not presented to the reader in a structured way, and it's hard to find a way of parsing it. I have two main objectives.
First, the extraction of the streets themselves. As far as I know, NLP libraries can help me tokenize a phrase and perform an analysis that picks out nouns (for example). But where does a street name begin and where does it end? I assume that I will need to compare that analysis with a database of streets, but I don't know which method is optimal.
Also, I would like to deduce the level of severity, for example, of car accidents. I'm assuming that the only way is to establish some heuristic based on the words present in the phrase (for example, if the word "deceased" appears, add 100). Am I correct?
Thanks a lot as always! :)
The first part of what you want to do ("First, the extraction of the streets themselves. [...] But where does a street name begin and where does it end?") is a subfield of NLP called Named Entity Recognition. There are many libraries available which can do this; I like NLTK for Python myself. Depending on your choice, I assume that a street-name database would be useful for training the recognizer, but you might be able to get reasonable results with the default corpus. Read your NLP library's documentation for the details.
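Here is a minimal sketch of that NER step with NLTK's off-the-shelf chunker (the example sentence is invented; note that the default model emits coarse labels such as GPE or PERSON, not street names, so a street gazetteer or a retrained model would still be needed):

# Minimal NER sketch with NLTK's pre-trained chunker.
# Requires the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words
# resources (nltk.download(...)).
import nltk

sentence = "There was an accident on Baker Street near Regent's Park last night."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

for subtree in tree.subtrees():
    if subtree.label() != "S":  # skip the root of the parse tree
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), entity)  # coarse entity candidates, e.g. GPE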
The second part, recognizing accident severity, can be treated as an independent problem at first. You could take the raw words or their part-of-speech tags as features and train a classifier on them (SVM, HMM, k-NN, your choice). You would need a fairly large, correctly labelled training set for that; from your description I'm not certain you have one.
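As a rough illustration of the bag-of-words variant, here is a sketch using scikit-learn (my choice of library; the two labelled phrases are invented, and in practice you would need a much larger training set):

# Sketch: raw words as features + a linear SVM (scikit-learn).
# The tiny training set below is invented and only shows the shape of the data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

phrases = ["Two deceased after a head-on collision on Main Street",
           "Minor fender bender, no injuries reported"]
labels = ["severe", "minor"]

model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(phrases, labels)
print(model.predict(["One person deceased at the scene"]))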
"I'm assuming that the only way is to stablish some heuristic by the present words in the phrase " is very vague, and could mean a lot of things. Based on the next sentence it kind of sounds like you think scanning for a predefined list of keywords is the only way to go. In that case, no, see the paragraph above.
Once you have both parts working, you can combine them and count the number of accidents and their severity per street. Using some geocoding library you could even generalize to neighborhoods or cities. Another challenge is the detection of synonyms ("Smith Str" vs "John Smith Street") and homonyms ("Smith Street" in London vs "Smith Street" in Leeds).
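For the geocoding step, a library such as geopy (my suggestion, not something fixed by the approach above) can resolve a street-plus-city string into coordinates, which also helps keep homonyms apart:

# Sketch of the geocoding idea with geopy's Nominatim (OpenStreetMap) geocoder.
# Results and rate limits vary; treat this as illustrative only.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="street-accident-analysis")
for query in ["Smith Street, London", "Smith Street, Leeds"]:
    location = geolocator.geocode(query)
    if location is not None:
        print(query, "->", location.latitude, location.longitude)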
There are some event description texts.
I want to extract the entrance fee of the events.
Sometimes the entrance fee is conditional.
What I want to achieve is to extract the entrance fee and its conditions (if available). It's fine to retrieve the whole phrase or sentence that states the entrance fee plus its conditions.
Note I: The texts are in German.
Note II: Often the sentences are not complete, as the texts are mainly event flyers or advertisements.
What category of NLP problem would this be? Is it Named Entity Recognition, and could it be solved by training my own model with Apache OpenNLP?
Or, I thought, maybe it would be easier to detect the pattern via the usual keywords for this use case (entrance, $, but, only till, [number]am/pm, ...).
Please shed some light on this.
Input Examples:
- "If you enter the club before 10pm, the entrance is for free. Afterwards it is 6$."
- "Join our party tonight at 11pm till 5am. The entrance fee is 8$. But for girls and students it's half price."
This is broadly a structure learning problem. You might have to combine Named Entity Recognition/tagging with coreference resolution. Read some papers on these as well as related GitHub code and take it from there. Here is a good discussion of the current state-of-the-art tools for these: https://www.reddit.com/r/MachineLearning/comments/3dz3fl/dl_architectures_for_entity_recognition_and_other/
Hope that helps.
You might try Stanford's CoreNLP for the named entity extraction part. It should be able to help you pick out the money values, and there is also a link to models trained for German (https://nlp.stanford.edu/software/CRF-NER.shtml).
Given that it's fine to extract the entire sentence that contains the information, I'd suggest taking a binary sentence classification approach. You could probably get quite far just by using n-grams and some named entity information as features. That would mean you'd want to build a pipeline that automatically segments your documents into sentence-like chunks. You could try a sentence segmentation tool (also provided by Stanford CoreNLP) as a first go (https://stanfordnlp.github.io/CoreNLP/). Since this would form the basis for all further work, you'd want to ensure that the results are at least decent. Perhaps the structure of the document itself gives you enough information to segment it without even using a sentence segmentation tool.
After you have this pipeline in place, you'd want to annotate the sentences extracted from a large set of documents as relevant or non-relevant to make it a binary classification task. Then train a model based on that dataset. Finally, when you apply it to unseen data, first use the sentence segmentation approach, and then classify each sentence.
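A minimal sketch of that pipeline, with NLTK's sentence splitter standing in for the CoreNLP segmenter mentioned above and word n-grams as features (the two labelled sentences are invented placeholders for your annotated data):

# Sketch: sentence segmentation + binary sentence classification.
# NLTK's sent_tokenize stands in for the CoreNLP segmenter; the labelled
# sentences below are invented placeholders for a real annotated set.
from nltk import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_sentences = ["The entrance fee is 8$.", "Join our party tonight at 11pm."]
train_labels = [1, 0]  # 1 = states the entrance fee, 0 = does not

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_sentences, train_labels)

document = "Join our party tonight at 11pm till 5am. The entrance fee is 8$."
for sentence in sent_tokenize(document):
    print(clf.predict([sentence])[0], sentence)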
I'm embarking on a project for a non-profit organization to help process and classify thousands of reports annually from their field workers and contractors around the world. I'm relatively new to NLP and as such wanted to seek the group's guidance on an approach to solving our problem.
I'll outline the current process and our challenges, and I would love your help on the best way to solve this.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers has increased, the volume of reports being generated has grown, and our editors are now becoming the bottleneck.
Solution: We would like to automate the first step of our process, i.e., checking each document for compliance with the organization's best-practice template.
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use regex and check for keywords, i.e., to check for compliance we use the following logic (simplified sketch below):
1. To check "states purpose", we do a regex match for 'purpose', 'intent'
2. To check "identifies audience", we do a regex match for 'identifies', 'is for'
3. To check "highlights relevance", we do a regex match for 'able to', 'allows', 'enables'
The current regex approach seems very primitive and limited, so I wanted to ask the community if there is a better way to solve this problem using something like NLTK or CoreNLP.
Thanks in advance.
Interesting problem; I believe it's a genuine research problem! In natural language processing, there are a few techniques that learn and extract templates from text and can then use them as gold annotations to identify whether a document follows the template structure. Researchers have used this kind of system for automatic question answering (extract templates from questions and then answer them). But your case is more difficult, as you need to learn the structure from a report. From an NLP point of view your problem is hard to pin down (no simple NLP task matches your problem definition), and yet you may not need any fancy (complex) model to solve it.
You can start with simple document matching and computing a similarity score. If you have a large collection of positive examples (well-formatted, compliant reports), you can construct a dictionary based on tf-idf weights and then check for the presence of the dictionary tokens. You can also think of this as a binary classification problem. There are good machine learning classifiers such as SVM and logistic regression that work well for text data. You can use Python and scikit-learn to build programs quickly, and they are pretty easy to use. For text pre-processing, you can use NLTK.
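A minimal sketch of the tf-idf idea with scikit-learn (the example reports and the use of cosine similarity as the score are my placeholders):

# Sketch: build a tf-idf representation from known-good reports, then score a new
# report by cosine similarity against them. Example texts are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

good_reports = ["This document introduces techniques for applying best practices.",
                "This study is intended to help low-income farmers price products."]

vectorizer = TfidfVectorizer(stop_words="english")
good_matrix = vectorizer.fit_transform(good_reports)

new_report = ["Farmers will be able to get better prices for their produce."]
score = cosine_similarity(vectorizer.transform(new_report), good_matrix).max()
print(score)  # a low score may flag a report that misses the template's language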
Since the reports will be generated by field workers and there are only a few questions the reports need to answer (you mentioned 3 specific components), I think simple keyword-matching techniques will be a good start. You can gradually move in different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
If you don't give yourself hints about the format of the document, this is an open problem.
What you can do, though, is ask the people writing reports to conform to some document format, like having 3 parts, each with a pre-defined title, like so:
1. Purpose
Explains the purpose of the document in several paragraphs.
2. Topic / Problem
This addresses the foobar problem, also known as lorem ipsum filler text.
3. Take away
What can the reader do after reading it?
You parse this document (from .doc format, for instance) and extract the three parts. Then you can run spell checking, grammar checking, and text complexity algorithms. Finally, you can extract, for instance, named entities (cf. Named Entity Recognition) and low-TF-IDF words.
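A rough sketch of the section-splitting step, assuming the document has already been converted to plain text and the titles appear on their own numbered lines as in the outline above:

# Rough sketch: split a plain-text report into its three pre-titled parts.
# Assumes the .doc file has already been converted to plain text.
import re

SECTION_PATTERN = r"^\s*\d+\.\s+(Purpose|Topic / Problem|Take away)\s*$"

def split_sections(text):
    sections, current = {}, None
    for line in text.splitlines():
        match = re.match(SECTION_PATTERN, line, flags=re.IGNORECASE)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}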
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise: CoreNLP and OpenNLP, the libraries I'm most familiar with, have some tools that can help you with part of the task. For example, if your regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally", etc., and you wouldn't necessarily know whether the word is a verb, a noun, an adjective, or an adverb. The POS taggers and parsers in these libraries can tell you the part of speech of the word, and maybe you only care about verbs that start with "inten", or, more strictly, verbs in the third-person singular.
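A small sketch of that POS filtering, using NLTK for brevity (CoreNLP and OpenNLP provide equivalent taggers):

# Sketch: POS-tag a sentence and keep only verbs starting with the prefix "inten".
import nltk

sentence = "This study is intended to help low-income farmers."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

verbs = [word for word, tag in tagged
         if tag.startswith("VB") and word.lower().startswith("inten")]
print(verbs)  # e.g. ['intended']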
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger, for example, you would also know that "she" is a personal pronoun and "took" is a past-tense verb.
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.
This is a case of me wanting to search for something online but not knowing what it's called.
I have a collection of job descriptions in text files, some only a sentence or two long, most a paragraph or two. I want to write a script that, given a set of rules, will notify me when it finds a job description I would want.
For example, let's say I am looking for a job in PHP programming, but not a full-time position and not a designing position. So my "rule book" could be:
want: PHP
want: web programming
want: telecommuting
do not want: designing
do not want: full-time position
What is a method I could use to sort these files into a "pass" (descriptions that match what I'm looking for) and a "fail" (descriptions that are not relevant)? Some ideas I was considering:
Count the occurrences of the phrases in the text file that are also in my "rule book", and reject those that contain words that I do not want. This doesn't always work, though, because what if a description says "web designing not required"? Then my algorithm would say "that contains the word designing, so it is not relevant" when it really was relevant!
When searching the text for phrases that I do and do not want, count phrases within a certain Levenshtein distance as the same phrase. For example, designing and design should be treated the same way, as well as misspellings of words, such as programing.
I have a large collection of descriptions that I have looked through manually. Is there a way I could "teach" the program "these are examples of good descriptions, these are examples of bad ones"?
Does anyone know what this "filtering process" is called, and/or have any advice or methods on how I can accomplish this?
You basically have a text classification or document classification problem. This is a specific case of binary classification, which is itself a specific case of supervised learning. It's a well-studied problem and there are many tools to do it. Basically, you give a set of good documents and bad documents to a learning (training) process, which finds words that correlate strongly with positive and negative documents and outputs a function capable of classifying unseen documents as positive or not. Naive Bayes is the simplest learning algorithm for this kind of task, and it will do a decent job. There are fancier algorithms like logistic regression and support vector machines which will probably do somewhat better, but they are more complicated.
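A minimal Naive Bayes sketch with scikit-learn (my choice of library; the two labelled descriptions are invented, and in practice you would train on your manually sorted collection):

# Minimal Naive Bayes text classifier sketch (scikit-learn; toy data is invented).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

descriptions = ["PHP web programming, telecommuting welcome",
                "Full-time web designer needed on site"]
labels = ["pass", "fail"]  # your manually sorted good/bad examples

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(descriptions, labels)
print(clf.predict(["Remote PHP developer for small web projects"]))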
To determine which word variants are actually equivalent to each other, you want to do some kind of stemming. The Porter stemmer is a common choice here.
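For example, with NLTK's implementation of the Porter stemmer (one option among several), variant forms collapse to a shared stem:

# Porter stemmer example with NLTK: variant forms collapse to one stem.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["design", "designing", "designed"]])
# -> ['design', 'design', 'design']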
I have a list of several dozen product attributes that people are concerned with, like
Financing
Manufacturing quality
Durability
Sales experience
and several million free-text statements from customers about the product, e.g.
"The financing was easy but the housing is flimsy."
I would like to score each free text statement in terms of how strongly it relates to each of the attributes, and whether that is a positive or negative association.
In the given example, there would be a strong positive association to Financing and a strong negative association to Manufacturing quality.
It feels like this type of problem is probably in the realm of Natural Language Processing (NLP). However, I spent several hours reading up on things like OpenNLP and NLTK and found there's so much domain-specific terminology that I cannot figure out where to focus to solve this specific problem.
So my three-part question:
Is NLP the correct route to solve this class of problem?
What aspect of NLP should I focus on learning for this specific problem?
Are there alternatives I have not considered?
A resource you might find handy is SentiWordNet (http://sentiwordnet.isti.cnr.it/), which is like a dictionary that has a sentiment grade for words. It will tell you to what degree it thinks a word is positive, negative, or objective.
You can then combine that with some NLTK code that looks through your sentences for the words you want to associate the sentiment with. You would write a script to pull out meaningful chunks of text surrounding the words you are looking at, at maybe the sentence or clause level. Then another pass can run through the surrounding words and grab their sentiment scores from SentiWordNet.
I have some old code that did this and can put it on GitHub if you'd like, but you'd still need to make your own request for SentiWordNet.
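In the meantime, here is a small sketch of the SentiWordNet lookup via the interface bundled with NLTK (word-sense disambiguation is skipped and the first sense is taken naively, so treat the scores as rough):

# Sketch: look up SentiWordNet scores for each word in a statement.
# Requires nltk.download('punkt'), nltk.download('wordnet'), nltk.download('sentiwordnet').
import nltk
from nltk.corpus import sentiwordnet as swn

sentence = "The financing was easy but the housing is flimsy."
for word in nltk.word_tokenize(sentence):
    synsets = list(swn.senti_synsets(word))
    if synsets:
        s = synsets[0]  # crude: first sense only; real code should disambiguate
        print(word, s.pos_score(), s.neg_score(), s.obj_score())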
I guess your problem is more about association than just classification. Moving forward with this assumption:
Is NLP the correct route to solve this class of problem?
Yes.
What aspect of NLP should I focus on learning for this specific problem?
Part of speech tagging
Sentiment analysis
Maximum entropy
Are there alternatives I have not considered?
An in-depth study of automata theory with respect to NLP will help you a lot; it helped me a lot in grasping implementations like OpenNLP.
Yes, this is an NLP problem known as sentiment analysis. Sentiment analysis is an active research area with different approaches and a task where a lot of other NLP methods have to work together, so it is certainly not the easiest field to get started with in NLP.
A more or less recent survey of the academic research in the field can be found in Pang & Lee (2008).
I need to do an experiment and I am new to NLP. I have read books that explain the theoretical issues, but when it comes to the practical side I have found it hard to find a guide. So, if you know anything about NLP, especially the practical issues, please tell me and point me to the right path, because I feel lost (useful books, useful tools, and useful websites).
What I am trying to do is take a text and find specific words in it, for example animals such as dogs, cats, etc.; then I need to extract that word and the 2 words on each side.
For example
I was watching TV with my lovely cat last night.
the extracted text will be
(my lovely cat last night)
This will be my training example for the machine learning tool.
Q1: There will be around 100 training examples similar to what I explained above. I used a tokenizer to extract words, but how can I extract specific words (for our example, all types of animals) along with the 2 words on each side? Do I need to use tags, for example, or what would you suggest?
Q2: If I have these training examples, how can I prepare an appropriate dataset to give to the machine learning tool for training? What should I write in this dataset to specify the animal, do I need to provide other features, and how should I arrange it all in a dataset?
Anything you can tell me might help a lot; please do not hesitate to share what you know.
What you are attempting to do is sometimes known as "Ontology Acquisition" or "Automated Ontology", and is a pretty difficult problem. Most approaches come down to "words that are similar tend to be used in similar contexts." The problem with this is that while there are algorithms that successfully extract semantically meaningful relationships from data such as yours, going from "here are a bunch of terms that statistically share a common distribution with your seed terms" to "your seed terms are animal names, here are some other animal names" is challenging. For example, training on cat, dog, snake, bird might end up telling you that "mammal", "dachshund", "creature", and "biped" are used in similar contexts, which, depending on your requirements, may not be exactly what you need.
Below is a link to a research paper that implemented exactly what you are trying to do. The authors describe their approach to data representation and the algorithms used, and they perform with at least some level of success on the animal-name problem. In addition, tracking down their references may be a fruitful exercise.
http://www.cl.cam.ac.uk/~ah433/cluk.pdf
Let me begin by saying that, as a self-taught engineer when I started working in NLP several years ago, I completely understand your frustration. I would suggest that you read the NLTK book, which is a wonderful introduction to applied NLP. In particular, read Chapters 3-7, which deal with processing raw text data to extract information and use it for tagging. The book is available online.
With regards to your specific question:
I think it might be much easier to create a small list of animals and then extract sentences that contain these animal names from a corpus. Wikipedia sentences are one obvious source. You can build your corpus using this method because you already know the names of the animals in each sentence.
# Runnable Python sketch of the idea above (NLTK for tokenization).
# get_wikipedia_sentences() is an assumed helper that returns a list of sentences.
import nltk

animals = {"dog", "dogs", "cat", "cats", "pig", "pigs",
           "cow", "cows", "lion", "lions", "lioness", "lionesses"}

corpus = []
for sent in get_wikipedia_sentences():
    tokens = nltk.word_tokenize(sent.lower())
    if any(token in animals for token in tokens):
        corpus.append(sent)  # keep sentences mentioning an animal; ignore the rest
You can then train your algorithm on these sentences and use the trained model to extract new animal names. There are caveats with this approach, since your "training data" is artificially collected, but it will be a good first experience nonetheless.