I am looking for an algorithm or method that would help identify general phrases from a corpus of text that has a particular dialect (it is from a specific domain but for my case is a dialect of the English language) -- for example the following fragment could be from a larger corpus related to the World or Warcraft or perhaps MMORPHs.
players control a character avatar within a game world in third person or first person view, exploring the landscape, fighting various monsters, completing quests, and interacting with non-player characters (NPCs) or other players. Also similar to other MMORPGs, World of Warcraft requires the player to pay for a subscription, either by buying prepaid game cards for a selected amount of playing time, or by using a credit or debit card to pay on a regular basis
As output from the above I would like to identify the following general phrases:
first person
World of Warcraft
prepaid game cards
debit card
Notes:
There is a previous questions similar to mine here and here but for clarification mine has the following differences:
a. I am trying to use an existing toolkit such as NLTK, OpenNLP, etc.
b. I am not interested in identifying other Parts of Speech in the sentence
c. I can use human intervention where the algorithm presents the identified noun phrases to a human expert and the human expert can then confirm or reject the findings however we do not have resources for training a model of language on hand-annotated data
Nltk has built in part of speech tagging that has proven pretty good at identifying unknown words. That said, you seem to misunderstand what a noun is and you should probably solidify your understanding of both parts of speech, and your question.
For instance, in first person first is an adjective. You could automatically assume that associated adjectives are a part of that phrase.
Alternately, if you're looking to identify general phrases my suggestion would be to implement a simple Markov Chain model and then look for especially high transition probabilities.
If you're looking for a Markov Chain implementation in Python I would point you towards this gist that I wrote up back in the day: https://gist.github.com/Slater-Victoroff/6227656
If you want to get much more advanced than that, you're going to quickly descend into dissertation territory. I hope that helps.
P.S. Nltk includes a huge number of pre-annotated corpuses that might work for your purposes.
It appears you are trying to do noun phrase extraction. The TextBlob Python library includes two noun phrase extraction implementations out of the box.
The simplest way to get started is to use the default FastNPExtractor which is based of Shlomi Babluki's algorithm described here.
from text.blob import TextBlob
text = '''
players control a character avatar within a game world in third person or first
person view, exploring the landscape, fighting various monsters, completing quests,
and interacting with non-player characters (NPCs) or other players. Also similar
to other MMORPGs, World of Warcraft requires the player to pay for a
subscription, either by buying prepaid game cards for a selected amount of
playing time, or by using a credit or debit card to pay on a regular basis
'''
blob = TextBlob(text)
print(blob.noun_phrases) # ['players control', 'character avatar' ...]
Swapping out for the other implementation (an NLTK-based chunker) is quite easy.
from text.np_extractors import ConllExtractor
blob = TextBlob(text, np_extractor=ConllExtractor())
print(blob.noun_phrases) # ['character avatar', 'game world' ...]
If neither of these suffice, you can create your own noun phrase extractor class. I recommend looking at the TextBlob np_extractor module source for examples. To gain a better understanding of noun phrase chunking, check out the NLTK book, Chapter 7.
Related
There are some event description texts.
I want to extract the entrance fee of the events.
Sometimes the entrance fee is conditional.
What I want to achieve is to extract the entrance fee and it's conditions(if available). It's fine to retrieve the whole phrase or sentence which tells the entrance fee + it's conditions.
Note I: The texts are in German language.
Note II: Often the sentences are not complete, as they are mainly event flyers or advertisements.
What would be the category of this problem in NLP? Is it Named Entity Recognition and could be solved by training an own model with Apache openNLP?
Or I thought maybe easier would be to detect the pattern via the usual keywords in the use-case(entrance, $, but, only till, [number]am/pm, ...).
Please shed some light on me.
Input Examples:
- "If you enter the club before 10pm, the entrance is for free. Afterwards it is 6$."
- "Join our party tonight at 11pm till 5am. The entrance fee is 8$. But for girls and students it's half price."
This is broadly a structure learning problem. You might have to combine Named-Entity-Recognition/Tagging with Coreference Resolution. Read some papers on these as well as related github code and take it from there. Here is good discussion of state of the art tools for these at the moment https://www.reddit.com/r/MachineLearning/comments/3dz3fl/dl_architectures_for_entity_recognition_and_other/
Hope that helps.
You might try Stanford's CoreNLP for the named entity extraction part. It should be able to help you pick out the money values, and there is also a link to models trained for German language as well (https://nlp.stanford.edu/software/CRF-NER.shtml).
Given that it's fine to extract the entire sentence that contains the information, I'd suggest taking a binary sentence classification approach. You could probably get quite far just by using ngrams and some named entity information as features. That would mean that you'd need you'd want to build a pipeline that would automatically segment your documents into sentence-like chunks. You could try a sentence segmentation tool (also provided by Stanford CoreNLP) as a first go https://stanfordnlp.github.io/CoreNLP/. Since this would form the basis for all further work, you'd want to ensure that the results are at least decent. Perhaps the structure of the document itself gives you enough information to segment it without even using a sentence segmentation tool.
After you have this pipeline in place, you'd want to annotate the sentences extracted from a large set of documents as relevant or non-relevant to make it a binary classification task. Then train a model based on that dataset. Finally, when you apply it to unseen data, first use the sentence segmentation approach, and then classify each sentence.
I'm embarking on a project for a non-profit organization to help process and classify 1000's of reports annually from their field workers / contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on the approach to solve our problem.
I'll highlight the current process, and our challenges and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers increased the volume of reports being generated has grown and our editors are now becoming the bottle-neck.
Solution: We would like to automate the 1st step of our process i.e., checking the document for compliance to the organizational best practice template
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords. i.e., to check for compliance we use the following logic:
1 To check "states purpose" = we do a regex to match 'purpose', 'intent'
2 To check "identifies audience" = we do a regex to match with 'identifies', 'is for'
3 To check "highlights relevance" = we do a regex to match with 'able to', 'allows', 'enables'
The current approach of RegEx seems very primitive and limited so I wanted to ask the community if there is a better way to solving this problem using something like NLTK, CoreNLP.
Thanks in advance.
Interesting problem, i believe its a thorough research problem! In natural language processing, there are few techniques that learn and extract template from text and then can use them as gold annotation to identify whether a document follows the template structure. Researchers used this kind of system for automatic question answering (extract templates from question and then answer them). But in your case its more difficult as you need to learn the structure from a report. In the light of Natural Language Processing, this is more hard to address your problem (no simple NLP task matches with your problem definition) and you may not need any fancy model (complex) to resolve your problem.
You can start by simple document matching and computing a similarity score. If you have large collection of positive examples (well formatted and specified reports), you can construct a dictionary based on tf-idf weights. Then you can check the presence of the dictionary tokens. You can also think of this problem as a binary classification problem. There are good machine learning classifiers such as svm, logistic regression which works good for text data. You can use python and scikit-learn to build programs quickly and they are pretty easy to use. For text pre-processing, you can use NLTK.
Since the reports will be generated by field workers and there are few questions that will be answered by the reports (you mentioned about 3 specific components), i guess simple keyword matching techniques will be a good start for your research. You can gradually move to different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
If you don't provide yourself hints about the format of the document this is an open problem.
What you can do thought, is ask people writing report to conform to some format for the document like having 3 parts each of which have a pre-defined title like so
1. Purpose
Explains the purpose of the document in several paragraph.
2. Topic / Problem
This address the foobar problem also known as lorem ipsum feeling text.
3. Take away
What can the reader do after reading it?
You parse this document from .doc format for instance and extract the three parts. Then you can go through spell checking, grammar and text complexity algorithm. And finally you can extract for instance Named Entities (cf. Named Entity Recognition) and low TF-IDF words.
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise; CoreNLP and OpenNLP, the libraries that I'm most familiar with, have some tools that can help you with part of the task. For example; if your Regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally" etc., and you wouldn't necessarily know if the word is a verb, a noun, an adjective or an adverb. POS taggers and the parsers in these libraries would be able to tell you the type (POS) of the word and maybe you only care about the verbs that start with "inten", or more strictly, the verbs spoken by the 3rd person singular.
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger for example; you would also know that "she" is a personal pronoun and "took" is a past tense verb.
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.
I created classifier to classy the class of nouns,adjectives, Named entities in given sentence. I have used large Wikipedia dataset for classification.
Like :
Where Abraham Lincoln was born?
So classifier will give this short of result - word - class
Where - question
Abraham Lincoln - Person, Movie, Book (because classifier find Abraham Lincoln in all there categories)
born - time
When Titanic was released?
when - question
Titanic - Song, movie, Vehicle, Game (Titanic
classified in all these categories)
Is there any way to identify exact context for word?
Please see :
Word sense disambiguation would not help here. Because there might not be near by word in sentence which can help
Lesk algorithm with wordnet or sysnet also does not help. Because it for suppose word Bank lesk algo will behave like this
======== TESTING simple_lesk ===========
TESTING simple_lesk() ...
Context: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities
TESTING simple_lesk() with POS ...
Context: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)
Here for word bank it suggested as financial institute and slopping land. While in my case I am already getting such prediction like Titanic then it can be movie or game.
I want to know is there any other approach apart from Lesk algo, baseline algo, traditional word sense disambiguation which can help me to identify which class is correct for particular keyword?
Titanic -
Thanks for using the pywsd examples. With regards to wsd, there are many other variants and i'm coding them by myself during my free time. So if you want to see it improve do join me in coding the open source tool =)
Meanwhile, you will find the following technologies more relevant to your task, such as:
Knowledge base population (http://www.nist.gov/tac/2014/KBP/) where tokens/segments of text are assigned an entity and the task is to link them or to solve a simplified question and answer task.
Knowledge representation (http://groups.csail.mit.edu/medg/ftp/psz/k-rep.html)
Knowledge extraction (https://en.wikipedia.org/wiki/Knowledge_extraction)
The above technologies usually includes several sub-tasks such as:
Wikification (http://nlp.cs.rpi.edu/kbp/2014/elreading.html)
Entity linking
Slot filling (http://surdeanu.info/kbp2014/def.php)
Essentially you're asking for a tool that is an NP-complete AI system for language/text processing, so I don't really think such a tool exists as of yet. Maybe it's IBM Watson.
if you're looking for the field to look into, the field is out there but if you're looking at tools, most probably wikification tools are closest to what you might need. (http://nlp.cs.rpi.edu/paper/WikificationProposal.pdf)
Given "violence" as input would it be possible to come up with how violence construed by a person (e.g. physical violence, a book, an album, a musical group ..) as mentioned below in Ref #1.
Assuming if the user meant an Album, what would be the best way to look for violence as an album from a set of tweets.
Is there a way to infer this via any of the NLP API(s) say OpenNLP.
Ref #1
violence/N1 - intentional harmful physical action.
violence/N2 - the property of being wild or turbulent.
Violence/N6 - a book from Neil L. Whitehead; nonfiction
Violence/N7 - an album by The Last Resort
Violence/N8 - Violence is the third album by the Washington-based Alternative metal music group Nothingface.
Violence/N9 - a musical group which produced the albums Eternal Nightmare and Nothing to Gain
Violence/N10 - a song by Aesthetic Perfection, Angel Witch, Arsenic, Beth Torbert, Brigada Flores Magon, etc on the albums A Natural Disaster, Adult Themes for Voice, I Bificus, Retribution, S.D.E., etc
Violence/N11 - an album by Bombardier, Dark Quarterer and Invisible Limits
Violence/N12 - a song by CharlElie Couture, EsprieM, Fraebbblarnir, Ian Hunter, Implant, etc on the albums All the Young Dudes, Broke, No Regrets, Power of Limits, Repercussions, etc
Violence/N18 - Violence: The Roleplaying Game of Egregious and Repulsive Bloodshed is a short, 32-page roleplaying game written by Greg Costikyan under the pseudonym "Designer X" and published by Hogshead Publishing as part of its New Style line of games.
Violence/N42 - Violence (1947) is an American drama film noir directed by Jack Bernhard.
Pure automatic inference is a little to hard in general for this problem.
Instead we might use :
Resources like WordNet, or a semantics dictionary.
For languages other than English you can look at eurowordnet (non free) dataset.
To get more meaning (i.e. for the album sense) we process some well managed resource like Wikipedia. Wikipedia as a lot of meta information that would be very useful for this kind of processing.
The reliability of the process is achieve just by combining the maximum number of data source and processing them correctly, with specialized programs.
As a last resort you may try hand processing/annotating. Long and costly, but useful in enterprise context where you need only a small part of a language.
No free lunch here.
If you're working on English NLP in python, then you can try the wordnet API as such:
from nltk.corpus import wordnet as wn
query = 'violence'
for ss in wn.synsets(query):
print query, str(ss.offset).zfill(8)+'-'+ss.pos, ss.definition
If you're working on other human languages, maybe you can take a look at the open wordnets available from http://casta-net.jp/~kuribayashi/multi/
NOTE: the reason for str(ss.offset).zfill(8)+'-'+ss.pos, it's because it is used as the unique id for each sense of a specific word. And this id is consistent across the open wordnets for every language. the first 8 digits gives the id and the character after the dash is the Part-of-Speech of the sense.
Check this out: Twitter Filtering Demo from Idilia. It does exactly what you want by first analyzing a piece of text to discover the meaning of its words and then filtering the texts that contain the sense that you are looking for. It's available as an API.
Disclaimer: I work for Idilia.
You can extract all contexts "violence" occurs in (context can be a whole document, or a window of say 50 words), then convert them into features (using say bag of words), then cluster these features. As clustering is unsupervised, you won't have names for the clusters, but you can label them with some typical context.
Then you need to see which cluster "violence" in the query belongs to. Either based on other words in the query which act as a context or by asking explicitly (Do you mean violence as in "...." or as in "....")
This will be incredibly difficult due to the fact that the proper noun uses of the word 'Violence' will be incredibly infrequent as a proportion of all words and their frequency distribution is likely highly skewed in some way. We run into these problems almost any time we want to do some form of Named Entity Disambiguation.
No tool I'm aware of will do this for you, so you will be building your own classifier. Using Wikipedia as a training resource as Mr K suggested is probably your best bet.
I'm trying to make an analysis of a set of phrases, and I don't know exactly how "natural language processing" can help me, or if someone can share his knowledge with me.
The objective is to extract streets and localizations. Often this kind of information is not presented to the reader in a structured way, and It's hard to find a way of parsing it. I have two main objectives.
First the extraction of the streets itself. As far as I know NLP libraries can help me to tokenize a phrase and perform an analysis which will get nouns (for example). But where a street begins and where does it ends?. I assume that I will need to compare that analysis with a streets database, but I don't know wich is the optimal method.
Also, I would like to deduct the level of severity , for example, in car accidents. I'm assuming that the only way is to stablish some heuristic by the present words in the phrase (for example, if deceased word appears + 100). Am I correct?
Thanks a lot as always! :)
The first part of what you want to do ("First the extraction of the streets itself. [...] But where a street begins and where does it end?") is a subfield of NLP called Named Entity Recognition. There are many libraries available which can do this. I like NLTK for Python myself. Depending on your choice I assume that a streetname database would be useful for training the recognizer, but you might be able to get reasonable results with the default corpus. Read the documentation for your NLP library for that.
The second part, recognizing accident severity, can be treated as an independent problem at first. You could take the raw words or their part of speech tags as features, and train a classifier on it (SVM, HMM, KNN, your choice). You would need a fairly large, correctly labelled training set for that; from your description I'm not certain you have that?
"I'm assuming that the only way is to stablish some heuristic by the present words in the phrase " is very vague, and could mean a lot of things. Based on the next sentence it kind of sounds like you think scanning for a predefined list of keywords is the only way to go. In that case, no, see the paragraph above.
Once you have both parts working, you can combine them and count the number of accidents and their severity per street. Using some geocoding library you could even generalize to neighborhoods or cities. Another challenge is the detection of synonyms ("Smith Str" vs "John Smith Street") and homonyms ("Smith Street" in London vs "Smith Street" in Leeds).