How to find if a word in a sentence is pointing to a city

I live in San Francisco
I work in San Jose
I was born in New York
Is there a way to find that "San Francisco" is a city in the above sentences?

The task of recognising (possibly multi-word) expressions that refer to entities of specific types (locations, but also organisations, dates, etc.) is called named-entity recognition (NER).
For a simple task such as yours, existing freely available tools and models are sufficient. You could try the Stanford Named Entity Recognizer, which is free software. Try analysing your sentence using their online demo.
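If you prefer to experiment in Python, here is a minimal sketch of the same idea using spaCy, another freely available NER library (this is an alternative to, not part of, the Stanford tool):

```python
# A minimal NER sketch using spaCy, assuming the small English model
# is installed: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

for sentence in ["I live in San Francisco",
                 "I work in San Jose",
                 "I was born in New York"]:
    doc = nlp(sentence)
    # GPE (geo-political entity) is spaCy's label for cities, states, countries
    print([(ent.text, ent.label_) for ent in doc.ents])
# Expected output along the lines of:
# [('San Francisco', 'GPE')]
# [('San Jose', 'GPE')]
# [('New York', 'GPE')]
```

Stanford NER would report the same spans with a LOCATION label instead of GPE.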

Related

Is there a way to have a reference term in addition to a label with Doccano?

Hi, I would like to know if we can have something like the following example in Doccano:
Let's say we have a sentence like this: "MS is an IT company". I want to label some words in this sentence, for example MS (Microsoft). MS should be labelled as a Company (so imagine that I have an entity named Company), but I also want to record that MS stands for Microsoft.
Is there a way to do that with Doccano?
Thanks
Doccano supports:
Sequence labelling, good for named entity recognition (NER)
Text classification, good e.g. for sentiment analysis
Sequence-to-sequence, good for machine translation
What you're describing sounds a little like Entity Linking.
You can see from Doccano's roadmap in its docs that Entity Linking is part of the plans, but not yet available.
For now, I suggest framing this as a NER problem and having different entities for MS (Microsoft) and MS (other). If you have too many entities to choose from, the labelling could become complicated, but then you could break up the dataset into smaller entity-focussed datasets. For example, you could take only the documents containing MS and label each mention as one of its few senses.
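To make the suggestion concrete, here is a sketch of a pre-annotated import record for a Doccano sequence-labelling project. The field names are an assumption on my part (Doccano's import format has varied across versions, "label" vs "labels"), so verify against your instance's documentation:

```python
# A hypothetical sketch of a pre-annotated JSONL record for Doccano
# sequence labelling, using separate labels to keep MS (Microsoft)
# apart from other senses of MS. Check your Doccano version's import
# docs for the exact field name ("label" vs "labels").
import json

record = {
    "text": "MS is an IT company",
    "label": [[0, 2, "COMPANY-MICROSOFT"]],  # [start offset, end offset, label]
}
with open("doccano_import.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```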

How do semantic text comparison APIs work?

I am currently doing a project where we are trying to gauge explanatory answers submitted by users against a correct answer. I have come across APIs like Dandelion and ParallelDots, both of which can check how semantically close two texts are.
These APIs are giving me favorable responses for questions like:
What is the distinction between debtor and creditor?
Answer 1: A debtor is a person or enterprise that owes money to another party. A creditor is a person, bank, or other enterprise that has lent money or extended credit to another party.
Answer 2: A debtor has a debt or legal obligation to pay an amount to another person or entity, from whom goods were purchased or services were obtained. A creditor may be a bank, supplier
Dandelion gave me a score of 81% and ParallelDots gave me 4.8/5 for the same answer. This is quite expected.
However, before I prepare a demo and plan to eventually use them in production, I am interested in understanding to some extent how these APIs are generating these scores.
Is it a tf-idf-based vector product over the stemmed tokens and their POS tags?
PS: Not an expert in NLP
This question is very broad: semantic sentence similarity is an open problem in NLP, and there are many ways of performing this task, all of them far from perfect at the current stage. As an example, just consider that:
Trump is the president of the United States
and
Trump has never been the president of the United States
have a semantic similarity of 5 according to ParallelDots. Whether that counts as similar depends on your definition, but the point is that, depending on what you intend to do with the similarity, such a score may not be suitable if you have specific requirements.
Anyway, as for the implementation, there is no single "standard" way of performing this task, and there is a plethora of features that can be used: tf-idf (or equivalent), the syntactic structure of the sentence (i.e. constituency or dependency parse trees), mentions of entities extracted from the text, etc., or, following the latest trends, a deep neural network that doesn't need any explicit features.
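To illustrate just the tf-idf feature mentioned above, and why it fails on the Trump example, here is a minimal bag-of-words baseline. This is only a sketch, not how Dandelion or ParallelDots, which are proprietary, actually compute their scores:

```python
# A minimal tf-idf cosine-similarity baseline, illustrating one of the
# features mentioned above. Sketch only; the commercial APIs' methods
# are not public.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = "Trump is the president of the United States"
b = "Trump has never been the president of the United States"

vectors = TfidfVectorizer().fit_transform([a, b])
print(cosine_similarity(vectors[0], vectors[1])[0, 0])
# The score is high despite the opposite meanings: bag-of-words
# features like tf-idf are blind to negation.
```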

How to extract the meaning of sentences for sentiment analysis using NLP

"I had safe journey" ,assume this is a feedback for a driver ,provided by a passenger. I need to extract theses information from this sentence..
"I had safe journey" ->
SUBJECT= "driving"
SENTIMENT= "positive"
I tried the NLP "Extracting Information from Text" method, but I don't know how to recognise entities in these kinds of sentences. How am I supposed to do that?
To categorize the entities of a sentence, or a sentence as a whole, you first need a defined set of classes/categories/groups.
For example, to categorize "journey" as travelling/driving, you should train your system/algorithm to identify the patterns of sentences that fall under the driving/journey category.
This training involves machine learning concepts; text categorization is what you should be searching for.
Here is a reference (to just give you an idea) and you can find many more over the web.
Good Luck!
Note: below are some links to Coursera, which offers courses on NLP:
Link 1
Link 2
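To make the training idea concrete, here is a minimal supervised text-categorization sketch with scikit-learn. The training sentences and category names below are made up for illustration; a real system needs many labelled feedback sentences per category:

```python
# A minimal text-categorization sketch: tf-idf features plus a linear
# classifier. The tiny training set is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["I had a safe journey", "The ride felt very dangerous",
               "The driver was friendly", "The driver was rude to us"]
subjects = ["driving", "driving", "behaviour", "behaviour"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, subjects)
print(model.predict(["It was a smooth and safe trip"]))
```

The same pipeline, trained on a second label column ("positive"/"negative"), would give you the SENTIMENT field.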

Identifying the context of a word in a sentence

I created a classifier to determine the class of nouns, adjectives, and named entities in a given sentence. I used a large Wikipedia dataset for classification.
For example:
Where was Abraham Lincoln born?
The classifier will give results of this form (word - class):
Where - question
Abraham Lincoln - Person, Movie, Book (because the classifier finds Abraham Lincoln in all three categories)
born - time
When was Titanic released?
when - question
Titanic - Song, Movie, Vehicle, Game (Titanic is classified in all these categories)
Is there any way to identify the exact context for a word?
Please note:
Word sense disambiguation would not help here, because there may be no nearby words in the sentence that can help.
The Lesk algorithm with WordNet or synsets does not help either, because for, say, the word "bank" the Lesk algorithm behaves like this:
```
======== TESTING simple_lesk ===========
TESTING simple_lesk() ...
Context: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities
TESTING simple_lesk() with POS ...
Context: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)
```
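For reference, output like the above comes from pywsd's simple_lesk; a minimal sketch of the call, assuming pywsd and NLTK's WordNet data are installed:

```python
# A minimal sketch of the pywsd call behind the output above,
# assuming pywsd (pip install pywsd) and NLTK's WordNet data are available.
from pywsd.lesk import simple_lesk

sense = simple_lesk("I went to the bank to deposit my money", "bank")
print(sense)               # e.g. Synset('depository_financial_institution.n.01')
print(sense.definition())  # its WordNet gloss
```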
Here, for the word "bank", it suggested a financial institution and sloping land. In my case I am already getting such predictions, e.g. Titanic could be a movie or a game.
I want to know: is there any other approach, apart from the Lesk algorithm, baseline algorithms, and traditional word sense disambiguation, that can help me identify which class is correct for a particular keyword?
Titanic -
Thanks for using the pywsd examples. With regard to WSD, there are many other variants, and I'm coding them myself in my free time, so if you want to see the tool improve, do join me in coding the open-source tool =)
Meanwhile, you will find the following technologies more relevant to your task:
Knowledge base population (http://www.nist.gov/tac/2014/KBP/) where tokens/segments of text are assigned an entity and the task is to link them or to solve a simplified question and answer task.
Knowledge representation (http://groups.csail.mit.edu/medg/ftp/psz/k-rep.html)
Knowledge extraction (https://en.wikipedia.org/wiki/Knowledge_extraction)
The above technologies usually include several sub-tasks, such as:
Wikification (http://nlp.cs.rpi.edu/kbp/2014/elreading.html)
Entity linking
Slot filling (http://surdeanu.info/kbp2014/def.php)
Essentially you're asking for a tool that is an AI-complete system for language/text processing, so I don't really think such a tool exists as of yet. Maybe it's IBM Watson.
If you're looking for a field to look into, the fields are out there; but if you're looking for tools, wikification tools are most probably the closest to what you might need. (http://nlp.cs.rpi.edu/paper/WikificationProposal.pdf)
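To give a feel for what such tools do, here is a toy sketch of the wikification idea: score each candidate class for an ambiguous mention by its word overlap with the sentence. The candidate descriptions below are hand-written placeholders; a real wikifier would use Wikipedia abstracts and far richer features:

```python
# A toy illustration of wikification-style disambiguation: rank candidate
# classes by word overlap between the sentence and a short description of
# each candidate. Descriptions are hand-written placeholders standing in
# for Wikipedia abstracts.
import re

def disambiguate(sentence, candidates):
    context = set(re.findall(r"\w+", sentence.lower()))
    scores = {label: len(context & set(re.findall(r"\w+", desc.lower())))
              for label, desc in candidates.items()}
    return max(scores, key=scores.get), scores

candidates = {
    "Movie": "a 1997 movie released in cinemas, directed by James Cameron",
    "Song": "a song released as a single from an album",
    "Vehicle": "a British passenger ship that sank in 1912",
    "Game": "a video game for consoles and PC",
}
print(disambiguate("When was the Titanic movie released?", candidates))
# ('Movie', {'Movie': 2, 'Song': 1, 'Vehicle': 0, 'Game': 0})
```

Note that this only works when the sentence carries discriminative context (here the word "movie"); the original question "When Titanic was released?" would still produce a tie, which is exactly why the problem is hard.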

What features do NLP practitioners use to pick out English names?

I am trying named entity recognition for the first time. I'm looking for features that will pick out English names. I am using the methods outlined in the Coursera NLP course (week three) and the NLTK book. In other words: I am defining features, identifying the features of words, and then running those words/features through a classifier that I train on labelled data.
What features are used to pick out English names?
I can imagine that you'd look for two capitalized words in a row, or a capitalized word, then an initial, then another capitalized word (e.g. John Smith or James P. Smith).
But what other features are used for NER?
Some common features:
Word lists of common names (John, Adam, etc.)
Casing
Contains symbols or numeric characters (names generally don't)
Person prefixes (Mr., Mrs., etc.)
Person postfixes (Jr., Sr., etc.)
Single-letter abbreviations (i.e., (J.) Smith)
Analysis of surrounding words (you may find some words have a high probability of appearing near names)
Named entities previously recognized (often it is easy to identify a NE in some parts of the corpus based on context, but very hard in others; if previously identified, this is an excellent hint for NER)
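To make these features concrete, here is a sketch of a per-token feature extractor in the style NLTK classifiers consume. The word lists are tiny placeholders; real systems use gazetteers with thousands of entries:

```python
# A sketch of the listed features as a feature dict for one token.
COMMON_NAMES = {"john", "adam", "james", "mary"}   # placeholder gazetteer
PREFIXES = {"mr.", "mrs.", "ms.", "dr."}
POSTFIXES = {"jr.", "sr.", "iii"}

def name_features(tokens, i):
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<s>"
    next_word = tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>"
    return {
        "in_name_list": word.lower() in COMMON_NAMES,
        "is_capitalized": word[:1].isupper(),
        "has_symbol_or_digit": any(not c.isalpha() for c in word),
        "is_initial": len(word) == 2 and word[0].isupper() and word[1] == ".",
        "prev_is_prefix": prev_word in PREFIXES,    # Mr., Mrs., ...
        "next_is_postfix": next_word in POSTFIXES,  # Jr., Sr., ...
        "prev_word": prev_word,  # surrounding-word features
        "next_word": next_word,
    }

tokens = "Mr. John Smith met Dr. James P. Smith".split()
print(name_features(tokens, 1))  # features for "John"
```

Each returned dict, paired with a gold label, can be fed to e.g. nltk.NaiveBayesClassifier.train, as in the NLTK book.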
Depending on what language you are working with, there may be more language-specific features as well. Frankly, you can turn up a wealth of information with a simple Google query; I'm really not sure why you haven't looked there. Some starting points, however:
Google
A survey of named entity recognition and classification
Named entity recognition without gazetteers
I did something similar back in school using machine learning. I suppose that you will use a supervised algorithm and classify every single word independently, not words in combination. In that case I would choose some features for the word itself, like the ones you mentioned (whether the word begins with a capital letter, whether it is an abbreviation), but I would add some more, such as whether the previous or next words also start with a capital letter, or whether they are abbreviations. This way you can add some context and overcome the problems related to your basic independence assumption.
If you want, have a look here. In the machine learning section you can find some more information and examples (the problem is slightly different, but the method should be similar).
Whatever features you choose, it is important that you use some measure to evaluate their relevance, and possibly reduce them to the useful ones to avoid over-fitting. One of the measures you can use is the gain ratio, but there are many more. Here you can find some basic information about feature extraction.
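Gain ratio itself is not available in scikit-learn, but mutual information, a close relative from the same information-theoretic family, is, and serves the same purpose of scoring features by relevance to the label. A sketch with a hypothetical feature matrix:

```python
# Score hypothetical binary NER features by mutual information with the label.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Columns: [is_capitalized, prev_is_prefix]; one row per token (made up).
X = np.array([[1, 1], [1, 0], [0, 0], [0, 0], [1, 1], [0, 1]])
y = np.array([1, 1, 0, 0, 1, 0])  # 1 = token is part of a name

scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print(dict(zip(["is_capitalized", "prev_is_prefix"], scores)))
# Higher score = more informative feature; low scorers are candidates
# for removal to reduce over-fitting.
```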
Hope it helps!
