How to extract artist's name from plain text?

I'm new to NLP.
I want to extract music artists' names from plain text like what is posted on social media.
The text looks like this (this is just a sample, not real):
Today bandcamp is waiving fees again! CHANGE, TAYLOR SWIFT and POP
SMOKE will be using all funds collected through bandcamp to donate to
Anti Repression Committee. No Justice No Peace.
In this case, I want to extract the strings "CHANGE", "TAYLOR SWIFT", and "POP SMOKE".
I already tried NLTK and spaCy, but they didn't work as desired.
Is there any other way I can achieve this?
Thanks in advance.

If you have a lot of upper-case data like in your example, you might want to pass the data through a truecaser first. There's one available in the Stanford NLP package. After that, spaCy might have a better shot at picking the names out. On this text:
Today bandcamp is waiving fees again! Change, Taylor Swift, and Pop Smoke will be using all funds collected through bandcamp to donate to Anti Repression Committee. No Justice No Peace.
en_core_web_sm will pick out Taylor Swift and Pop Smoke as entities. Change / CHANGE is going to be tough for any model to pick out.
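As a rough sketch, assuming the text has already been truecased, running spaCy's small English model over it looks like this (entity labels and coverage will vary by model version):
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
text = ("Today bandcamp is waiving fees again! Change, Taylor Swift, and "
        "Pop Smoke will be using all funds collected through bandcamp to "
        "donate to Anti Repression Committee. No Justice No Peace.")
for ent in nlp(text).ents:
    print(ent.text, ent.label_)
# Expect entities like "Taylor Swift" and "Pop Smoke";
# "Change" will likely be missed, as noted above.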

Related

Improving Identified OCR text accuracy?

Let's say I have a text returned from an OCR saying "Hllo Wrld". I need to convert this output into a user-friendly one, saying "Hello World". How can I get this job done?
I am a final-year Software Engineering student from Sri Lanka. I am going to implement an OCR to convert Braille text to Sinhala text for my final year project. Sinhala is a low-resource language used in Sri Lanka. I went through previous research papers and found that researchers had developed a system that could translate Braille into Sinhala text, but that final text is not user-friendly to the end user (as mentioned in the example above). What I am going to do is convert that identified text into a meaningful one. I think I should go with NLP technologies to get this job done. I would be truly thankful if you could guide me in the right direction.
Thank you.
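One simple direction for the correction step is dictionary lookup with fuzzy matching; the vocabulary below is a toy English stand-in, purely illustrative, since a real Sinhala system would need a proper lexicon:
import difflib

VOCAB = ["hello", "world", "help", "word"]  # illustrative vocabulary only

def correct(token):
    # Pick the closest dictionary word by similarity ratio, if any.
    matches = difflib.get_close_matches(token.lower(), VOCAB, n=1, cutoff=0.6)
    return matches[0] if matches else token

print(" ".join(correct(t) for t in "Hllo Wrld".split()))  # hello world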

NLP Entity Recognition Inquiry

I am working on an NLP Chatbot project. The Chatbot will need to process requests like the following:
"I want to go to Penn Station from Back Bay Station" and "I want to go from Back Bay Station to Penn Station"
In each case, I want to extract the source train station as "Back Bay Station" and the destination as "Penn Station." However, because of the sentence re-ordering, I am not sure how to do this.
Any advice, including examples, would be much appreciated.
There are two ways:
1. Heuristics: look for words like 'to' and 'from' immediately before the entities. You might have to spend some time creating a library of these prepositions and subordinating conjunctions, but that will do the job.
2. Use more sophisticated deep parsers that can do this job for you. You might still have to fall back to heuristics here as well, but you can get much more information this way. I am suggesting this option because I don't know how wide your problem statement is; if it is just about 'to' and 'from', then stick to option 1.
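A minimal sketch of option 1, assuming the station names can be matched against a known list (the list and regex here are illustrative, not a general solution):
import re

STATIONS = ["Penn Station", "Back Bay Station"]  # hypothetical known stations

def extract_route(text):
    # Classify each known station by the preposition immediately before it.
    source = destination = None
    for station in STATIONS:
        match = re.search(r"\b(to|from)\s+" + re.escape(station), text, re.IGNORECASE)
        if match:
            if match.group(1).lower() == "from":
                source = station
            else:
                destination = station
    return source, destination

print(extract_route("I want to go to Penn Station from Back Bay Station"))
print(extract_route("I want to go from Back Bay Station to Penn Station"))
# Both print: ('Back Bay Station', 'Penn Station')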

Identifying the context of a word in a sentence

I created a classifier to classify the class of nouns, adjectives, and named entities in a given sentence. I used a large Wikipedia dataset for classification.
For example:
Where was Abraham Lincoln born?
The classifier will give this sort of result (word - class):
Where - question
Abraham Lincoln - Person, Movie, Book (because the classifier finds Abraham Lincoln in all three categories)
born - time
When was Titanic released?
when - question
Titanic - Song, Movie, Vehicle, Game (Titanic is classified in all these categories)
Is there any way to identify the exact class for a particular word?
Please note:
Word sense disambiguation would not help here, because there might not be a nearby word in the sentence that can help.
The Lesk algorithm with WordNet or synsets also does not help. For example, for the word "bank" the Lesk algorithm behaves like this:
======== TESTING simple_lesk ===========
TESTING simple_lesk() ...
Context: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities
TESTING simple_lesk() with POS ...
Context: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)
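For reference, a minimal sketch of the pywsd call that yields this kind of output (assuming pywsd and its NLTK WordNet data are installed):
from pywsd.lesk import simple_lesk

# Disambiguate "bank" in a money context; returns a WordNet Synset.
sense = simple_lesk("I went to the bank to deposit my money", "bank")
print(sense, sense.definition())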
Here, for the word "bank", it suggested a financial institution and sloping land. But in my case I am already getting such predictions, e.g. Titanic could be a movie or a game.
I want to know: is there any other approach, apart from the Lesk algorithm, baseline algorithms, and traditional word sense disambiguation, which can help me identify which class is correct for a particular keyword?
Thanks for using the pywsd examples. With regard to WSD, there are many other variants, and I'm coding them myself in my free time. So if you want to see it improve, do join me in coding the open-source tool =)
Meanwhile, you will find the following technologies more relevant to your task, such as:
Knowledge base population (http://www.nist.gov/tac/2014/KBP/) where tokens/segments of text are assigned an entity and the task is to link them or to solve a simplified question and answer task.
Knowledge representation (http://groups.csail.mit.edu/medg/ftp/psz/k-rep.html)
Knowledge extraction (https://en.wikipedia.org/wiki/Knowledge_extraction)
The above technologies usually include several sub-tasks, such as:
Wikification (http://nlp.cs.rpi.edu/kbp/2014/elreading.html)
Entity linking
Slot filling (http://surdeanu.info/kbp2014/def.php)
Essentially you're asking for a tool that is an AI-complete system for language/text processing, so I don't really think such a tool exists as of yet. Maybe it's IBM Watson.
If you're looking for a field to look into, the field is out there; but if you're looking at tools, wikification tools are most probably closest to what you need. (http://nlp.cs.rpi.edu/paper/WikificationProposal.pdf)

identifying general phrases in a particular dialect

I am looking for an algorithm or method that would help identify general phrases from a corpus of text in a particular dialect (it is from a specific domain, but for my case it is a dialect of the English language). For example, the following fragment could be from a larger corpus related to World of Warcraft, or perhaps MMORPGs in general:
players control a character avatar within a game world in third person or first person view, exploring the landscape, fighting various monsters, completing quests, and interacting with non-player characters (NPCs) or other players. Also similar to other MMORPGs, World of Warcraft requires the player to pay for a subscription, either by buying prepaid game cards for a selected amount of playing time, or by using a credit or debit card to pay on a regular basis
As output from the above I would like to identify the following general phrases:
first person
World of Warcraft
prepaid game cards
debit card
Notes:
There are previous questions similar to mine here and here, but for clarification mine has the following differences:
a. I am trying to use an existing toolkit such as NLTK, OpenNLP, etc.
b. I am not interested in identifying other Parts of Speech in the sentence
c. I can use human intervention, where the algorithm presents the identified noun phrases to a human expert who can then confirm or reject the findings; however, we do not have resources for training a model on hand-annotated data
NLTK has built-in part-of-speech tagging that has proven pretty good at identifying unknown words. That said, you seem to misunderstand what a noun is; you should probably solidify your understanding of parts of speech, and of your question.
For instance, in "first person", "first" is an adjective. You could automatically assume that associated adjectives are part of that phrase.
Alternately, if you're looking to identify general phrases my suggestion would be to implement a simple Markov Chain model and then look for especially high transition probabilities.
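As a minimal illustration of that idea (the toy corpus and threshold are arbitrary assumptions; a real corpus would be far larger):
from collections import Counter

def candidate_phrases(tokens, threshold=0.5):
    # Estimate P(next | current) from bigram and unigram counts, and
    # keep repeated pairs with a high transition probability.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return [(a, b) for (a, b), n in bigrams.items()
            if n > 1 and n / unigrams[a] >= threshold]

tokens = "world of warcraft is fun and world of warcraft is popular".split()
print(candidate_phrases(tokens))
# [('world', 'of'), ('of', 'warcraft'), ('warcraft', 'is')]
# Note the noisy ('warcraft', 'is') pair: on real data you would tune the
# threshold and minimum count on a much larger corpus.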
If you're looking for a Markov Chain implementation in Python I would point you towards this gist that I wrote up back in the day: https://gist.github.com/Slater-Victoroff/6227656
If you want to get much more advanced than that, you're going to quickly descend into dissertation territory. I hope that helps.
P.S. NLTK includes a huge number of pre-annotated corpora that might work for your purposes.
It appears you are trying to do noun phrase extraction. The TextBlob Python library includes two noun phrase extraction implementations out of the box.
The simplest way to get started is to use the default FastNPExtractor, which is based on Shlomi Babluki's algorithm described here.
from textblob import TextBlob
text = '''
players control a character avatar within a game world in third person or first
person view, exploring the landscape, fighting various monsters, completing quests,
and interacting with non-player characters (NPCs) or other players. Also similar
to other MMORPGs, World of Warcraft requires the player to pay for a
subscription, either by buying prepaid game cards for a selected amount of
playing time, or by using a credit or debit card to pay on a regular basis
'''
blob = TextBlob(text)
print(blob.noun_phrases) # ['players control', 'character avatar' ...]
Swapping out for the other implementation (an NLTK-based chunker) is quite easy.
from textblob.np_extractors import ConllExtractor
blob = TextBlob(text, np_extractor=ConllExtractor())
print(blob.noun_phrases) # ['character avatar', 'game world' ...]
If neither of these suffices, you can create your own noun phrase extractor class. I recommend looking at the TextBlob np_extractor module source for examples. To gain a better understanding of noun phrase chunking, check out the NLTK book, Chapter 7.
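As a rough sketch of rolling your own, here is a minimal NLTK regexp-based NP chunker (the grammar is a simplistic assumption, not TextBlob's own):
import nltk

# Toy NP grammar: optional determiner, any adjectives, then one or more nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

tagged = nltk.pos_tag(nltk.word_tokenize(
    "players control a character avatar within a game world"))
for subtree in chunker.parse(tagged).subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))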

Street recognition, deduction of severity

I'm trying to analyze a set of phrases, and I don't know exactly how natural language processing can help me, or whether someone can share their knowledge with me.
The objective is to extract streets and locations. Often this kind of information is not presented to the reader in a structured way, and it's hard to find a way of parsing it. I have two main objectives.
First, the extraction of the streets themselves. As far as I know, NLP libraries can help me tokenize a phrase and perform an analysis which will get nouns (for example). But where does a street name begin and where does it end? I assume that I will need to compare that analysis against a street database, but I don't know which method is optimal.
Second, I would like to deduce the level of severity, for example in car accidents. I'm assuming that the only way is to establish some heuristic based on the words present in the phrase (for example, if the word "deceased" appears, +100). Am I correct?
Thanks a lot as always! :)
The first part of what you want to do ("the extraction of the streets themselves [...] where does a street name begin and where does it end?") is a subfield of NLP called Named Entity Recognition. There are many libraries available which can do this. I like NLTK for Python myself. Depending on your choice, I assume that a street-name database would be useful for training the recognizer, but you might be able to get reasonable results with the default corpus. Read the documentation for your NLP library for that.
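A minimal sketch with NLTK's off-the-shelf chunker; the sentence is illustrative, and the default model knows PERSON/ORGANIZATION/GPE-style labels rather than streets, so good street coverage would need custom training:
import nltk

# Assumes the standard NLTK data packages (punkt, averaged_perceptron_tagger,
# maxent_ne_chunker, words) have been downloaded.
sentence = "The accident happened on Baker Street in London."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = nltk.ne_chunk(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() != "S"):
    print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))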
The second part, recognizing accident severity, can be treated as an independent problem at first. You could take the raw words or their part-of-speech tags as features and train a classifier on them (SVM, HMM, KNN, your choice). You would need a fairly large, correctly labelled training set for that; from your description I'm not certain you have that?
"I'm assuming that the only way is to stablish some heuristic by the present words in the phrase " is very vague, and could mean a lot of things. Based on the next sentence it kind of sounds like you think scanning for a predefined list of keywords is the only way to go. In that case, no, see the paragraph above.
Once you have both parts working, you can combine them and count the number of accidents and their severity per street. Using some geocoding library you could even generalize to neighborhoods or cities. Another challenge is the detection of synonyms ("Smith Str" vs "John Smith Street") and homonyms ("Smith Street" in London vs "Smith Street" in Leeds).
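On the synonym side, a minimal sketch of abbreviation normalization before comparison (the abbreviation table is a small illustrative assumption; homonyms would still need location context, e.g. from geocoding):
ABBREVIATIONS = {"str": "street", "st": "street", "rd": "road", "ave": "avenue"}

def normalize(name):
    # Lower-case, strip periods, and expand known abbreviations.
    words = name.lower().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

print(normalize("Smith Str.") == normalize("Smith Street"))  # True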
