I am working on a project using NLP and I am trying to do something like this:
Does anyone know what this kind of task is called?
This is a semantic lookup, where you look at specific terms in the text and find superordinates (hypernyms) for them:
Chlorpromazine is a medicine or a drug; a diagnosis is a type of procedure; and a patient is a living being. You could do this with an ontology or a thesaurus; some dictionaries also have information about superordinates, often in the definition ("a horse is a mammal with four legs").
This is not named entity recognition. In NER you identify something by its name, and return a specific identifier. So for Chlorpromazine you would not say that it was a medicine/drug, but you would get a serial number or reference for it which is distinct from any other drug.
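The lookup described above can be sketched in a few lines. This is a minimal illustration, not from the thread: the taxonomy dict is a hand-coded stand-in for a real ontology or thesaurus such as WordNet or UMLS, and all entries are invented for the example.

```python
# Minimal superordinate (hypernym) lookup; the taxonomy below is a
# hand-coded stand-in for a real ontology or thesaurus.
HYPERNYMS = {
    "chlorpromazine": "drug",
    "diagnosis": "procedure",
    "patient": "living being",
    "horse": "mammal",
}

def superordinate(term):
    """Return the superordinate of a term, or None if it is unknown."""
    return HYPERNYMS.get(term.lower())

print(superordinate("Chlorpromazine"))
```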
Hi, I would like to know if we can do something like the following example in Doccano:
So let's say we have a sentence like this: "MS is an IT company". I want to label some words in this sentence, for example MS (Microsoft). MS should be labelled as a Company (so imagine that I have an entity named Company), but I also want to say that MS stands for Microsoft.
Is there a way to do that with Doccano?
Thanks
Doccano supports:
- Sequence Labelling, good for Named Entity Recognition (NER)
- Text Classification, good e.g. for Sentiment Analysis
- Sequence to Sequence, good for Machine Translation
What you're describing sounds a little like Entity Linking.
You can see from Doccano's roadmap in its docs that Entity Linking is part of the plans, but not yet available.
For now, I suggest framing this as a NER problem, and having different entities for MS (Microsoft) and MS (other). If you have too many entities to choose from, the labelling could become complicated, but then you could break the dataset up into smaller entity-focussed datasets. For example, you could take only the documents with MS in them and label the mentions as one of the few candidate meanings.
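If you go the composite-label route, one pre-annotated record could look something like this. A minimal sketch, assuming Doccano's JSONL sequence-labelling import format; the exact key name ("label" vs "labels") depends on your Doccano version, and Company_Microsoft is just the suggested composite entity name.

```python
import json

# One pre-annotated record in a Doccano-style JSONL sequence-labelling
# format; a composite label works around the missing Entity Linking feature.
# Note: the exact key ("label" vs "labels") depends on the Doccano version.
text = "MS is an IT company"
start = text.index("MS")
record = {
    "text": text,
    "label": [[start, start + len("MS"), "Company_Microsoft"]],
}
print(json.dumps(record))
```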
I have a task which is half matcher and half entity extraction. I want to label some words that, in some contexts, refer to some label. Named entity extraction would be the way to go, but these words do not necessarily share structure (they can be verbs, nouns, etc.). I could simply use a dictionary, but I would like to use context to label them. I am having trouble finding a solution to this problem. Can NER be used for this, or is this a completely different task?
To give some examples, consider that I am interested in the following category: "customer acceptance". Now, these are two possible sentences: "this is a fair amount of data!" and "this condition is not fair". I want my word extractor to find only the second 'fair'.
In other words, it is like a dictionary that takes context into account.
I have the two sentences below.
Sentence1: A man is moving.
Sentence2: A truck is moving.
My problem statement is to assign a question word to the above two sentences. I need to assign WHO for the first sentence and WHAT for the second sentence.
What I have done is write an algorithm which assigns both WHO and WHAT to both sentences. The reason is that man and truck are both nouns: the POS is NOUN and the tag is NN for both words. My logic is: if the relation is nsubj and the token is a NOUN/NN, return the question words WHO and WHAT.
Is there a way to get only WHO when the sentence involves a human being, and WHAT when it involves a dog or a non-living thing?
Is this something that can be accomplished using spaCy which I am missing very obviously?
There isn't a one-liner or anything for this in spaCy, to my knowledge. Linguistically, they are both nouns. What you need is more information about these nouns: what categories they are in. You want to assign WHO when the category is person, and WHAT when the category is object. The go-to source of categories like that is WordNet. There is a spaCy/WordNet integration library, repo. You'll want to traverse up the tree of hypernyms. The shared root of the tree for both words is "physical entity", but maybe if you end up at "animal" you assign WHO, otherwise you assign WHAT?
You can explore the WordNet hierarchies here.
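The hypernym traversal could look like the sketch below. To keep it self-contained it uses a tiny hand-coded hypernym map as a stand-in for WordNet (with NLTK you would walk synset.hypernyms() instead); the hierarchy entries are invented for illustration, and it assigns WHO only for persons rather than all animals.

```python
# Hand-coded hypernym map standing in for WordNet hypernym chains
# (with NLTK you would walk synset.hypernyms() instead); entries invented.
HYPERNYM = {
    "man": "person",
    "person": "animal",
    "dog": "animal",
    "animal": "organism",
    "truck": "vehicle",
    "vehicle": "object",
}

def hypernym_chain(word):
    """Walk up the hierarchy from a word to its root."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def question_word(noun):
    # Assign WHO only when the noun is (a kind of) person;
    # animals and inanimate objects get WHAT.
    return "WHO" if "person" in [noun] + hypernym_chain(noun) else "WHAT"

print(question_word("man"), question_word("truck"))
```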
I'm new to natural language processing, so I apologize if my question is unclear. I have read a book or two on the subject and done general research of various libraries to figure out how I should be doing this, but I'm not confident yet that I know what to do.
I'm playing with an idea for an application, and part of it is trying to find product mentions in unstructured text (e.g. tweets, Facebook posts, emails, websites, etc.) in real time. I won't go into what the products are, but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe I could use a string similarity check to catch these.
Product names that are also English words or things would get caught, like mustang the horse versus mustang the car.
Needing to keep a list of alternative names for products (e.g. "coke" for "coca-cola", etc.)
I don't really know where to start with this, but any help would be appreciated. I've already looked at NLTK and SciKit and didn't really glean how to do this from there. If you know of examples or papers that explain this, links would be helpful. I'm not tied to any language at this point: Java preferably, but Python and Scala are acceptable.
The answer that you chose is not really answering your question.
The best approach you can take is using Named Entity Recognition (NER) and a POS tagger (grab NNP/NNPS, i.e. proper nouns). The database there might be missing some new brands like Lyft (Uber's rival), but without developing your own proprietary database, the Stanford tagger will solve half of your immediate needs.
If you have time, I would build a dictionary that has every brand name and simply extract matches from the tweet strings.
http://www.namedevelopment.com/brand-names.html
If you know how to crawl, it's not a hard problem to solve.
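A minimal sketch of that dictionary approach, with an alias table mirroring the question's examples (the canonical names are illustrative only; a real table would come from your crawled product database):

```python
import re

# Map each alias (lowercased) to a canonical product name; the entries
# below just mirror the question's examples and are illustrative only.
ALIASES = {
    "snickers": "Snickers",
    "coke": "Coca-Cola",
    "coca-cola": "Coca-Cola",
    "pepsi": "Pepsi",
    "mustang": "Mustang",
}

def find_products(text):
    """Return the set of canonical products mentioned in the text."""
    tokens = re.findall(r"[a-z0-9#'-]+", text.lower())
    return {ALIASES[t.lstrip("#")] for t in tokens if t.lstrip("#") in ALIASES}

print(find_products("12-oz. bottles of Coke and Pepsi on sale"))
```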
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they nevertheless are still independent tasks.
Misspellings
In order to deal with potential misspellings of words, you need to associate these possible misspellings to their canonical (i.e. correct) form.
Phonetic similarity: A common reason for "misspellings" is opacity in the relationship between the word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically, so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results which you would have to address because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
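The layers above can be combined into a small canonicalizer. A sketch using only the standard library: difflib.get_close_matches stands in for a proper edit-distance or phonetic index, and the word lists are invented for illustration.

```python
import difflib

# Canonical vocabulary plus a hand-made exception layer for misspellings
# that pure form similarity misses (both lists invented for illustration).
CANONICAL = ["innovation", "chocolate", "mustang"]
EXCEPTIONS = {"innovashun": "innovation"}

def canonicalize(word, cutoff=0.8):
    """Map a possibly misspelled word to its canonical form."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # difflib ratio as a cheap stand-in for edit distance / phonetic keys
    matches = difflib.get_close_matches(word, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(canonicalize("chocolatte"))
```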
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
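A toy sketch of the distributional idea: represent each target word by the counts of the words around it, then compare contexts with cosine similarity. The four-sentence corpus is invented; real systems use far larger corpora plus dimensionality reduction or embeddings.

```python
import math
from collections import Counter

# Invented toy corpus; real distributional models need far more text.
CORPUS = [
    "the wild mustang galloped across the field",
    "the wild horse galloped across the plain",
    "the new mustang car accelerated down the road",
    "the new ford car accelerated down the highway",
]

def context_vector(target, window=2):
    """Count the words appearing within `window` tokens of the target."""
    vec = Counter()
    for sentence in CORPUS:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == target:
                for c in words[max(0, i - window):i] + words[i + 1:i + 1 + window]:
                    vec[c] += 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

print(cosine(context_vector("mustang"), context_vector("horse")))
```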
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. For simple meanings referring to generic entities like this, they can often be considered to be variant spellings in the same way that "common misspellings" are and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. I'm in the throne room just now! to The throne room of the Buckingham Palace is beautiful.
Conclusion
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.
I'm trying to make an analysis of a set of phrases, and I don't know exactly how "natural language processing" can help me, or if someone can share their knowledge with me.
The objective is to extract streets and locations. Often this kind of information is not presented to the reader in a structured way, and it's hard to find a way of parsing it. I have two main objectives.
First, the extraction of the streets themselves. As far as I know, NLP libraries can help me tokenize a phrase and perform an analysis which will get nouns (for example). But where does a street begin and where does it end? I assume that I will need to compare that analysis with a streets database, but I don't know which is the optimal method.
Also, I would like to deduce the level of severity, for example, in car accidents. I'm assuming that the only way is to establish some heuristic based on the words present in the phrase (for example, if the word deceased appears, +100). Am I correct?
Thanks a lot as always! :)
The first part of what you want to do ("First the extraction of the streets itself. [...] But where a street begins and where does it end?") is a subfield of NLP called Named Entity Recognition. There are many libraries available which can do this. I like NLTK for Python myself. Depending on your choice I assume that a streetname database would be useful for training the recognizer, but you might be able to get reasonable results with the default corpus. Read the documentation for your NLP library for that.
The second part, recognizing accident severity, can be treated as an independent problem at first. You could take the raw words or their part of speech tags as features, and train a classifier on it (SVM, HMM, KNN, your choice). You would need a fairly large, correctly labelled training set for that; from your description I'm not certain you have that?
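As a sketch of the classifier idea, here is a hand-rolled bag-of-words scorer in place of a real SVM/KNN implementation, with an invented four-example training set far smaller than what you'd actually need:

```python
from collections import Counter

# Tiny invented labelled set; a real classifier needs far more examples.
TRAIN = [
    ("minor collision no injuries reported", "low"),
    ("fender bender both drivers unharmed", "low"),
    ("driver deceased after head-on crash", "high"),
    ("two killed and one critically injured", "high"),
]

def train(data):
    """Aggregate word counts per severity class."""
    counts = {}
    for text, label in data:
        counts.setdefault(label, Counter()).update(text.split())
    return counts

def classify(model, text):
    # Score each class by how many of its training-word occurrences
    # overlap with the input; pick the highest-scoring class.
    words = text.split()
    return max(model, key=lambda label: sum(model[label][w] for w in words))

model = train(TRAIN)
print(classify(model, "one driver deceased at the scene"))
```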
"I'm assuming that the only way is to stablish some heuristic by the present words in the phrase " is very vague, and could mean a lot of things. Based on the next sentence it kind of sounds like you think scanning for a predefined list of keywords is the only way to go. In that case, no, see the paragraph above.
Once you have both parts working, you can combine them and count the number of accidents and their severity per street. Using some geocoding library you could even generalize to neighborhoods or cities. Another challenge is the detection of synonyms ("Smith Str" vs "John Smith Street") and homonyms ("Smith Street" in London vs "Smith Street" in Leeds).
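For the synonym part, a minimal normalization sketch (the abbreviation table is illustrative; homonyms would additionally need a city or region from your geocoding step):

```python
# Illustrative abbreviation table; a real one would cover your locale's
# street-type abbreviations.
ABBREVIATIONS = {"str": "street", "st": "street", "ave": "avenue", "rd": "road"}

def normalize(name):
    """Lowercase, strip punctuation, and expand street-type abbreviations."""
    words = [w.strip(".").lower() for w in name.split()]
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def same_street(a, b):
    # Treat one name as a variant of the other if its words are a subset,
    # so "Smith Str" matches "John Smith Street".
    na, nb = set(normalize(a).split()), set(normalize(b).split())
    return na <= nb or nb <= na

print(same_street("Smith Str", "John Smith Street"))
```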