I know you can create categories for words using IBM Watson NLP classification. But would it be possible to train or create categories based on quantities of a word? For example, one pack of cigarettes would be the bad category, 2 packs the harmful category, and 3 packs the lethal category. Thank you in advance.
In case you are interested, you can still use NLC to classify text that has no explicit numbers in it.
For example:
Text: "I never smoke"
Class: healthy
Text: "I smoke sometimes"
Class: not_so_healthy
Text: "I smoke a lot of cigarettes everyday"
Class: harmful
Text: "I never stop smoking"
Class: lethal
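As a rough sketch of how those examples would be fed to NLC (assuming the ibm-watson Python SDK; the API key, service URL, and classifier metadata below are placeholders to adapt), the training data is just a CSV of text and class:

I never smoke,healthy
I smoke sometimes,not_so_healthy
I smoke a lot of cigarettes everyday,harmful
I never stop smoking,lethal

# pip install ibm-watson  # pip install in terminal
from ibm_watson import NaturalLanguageClassifierV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

nlc = NaturalLanguageClassifierV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
nlc.set_service_url("YOUR_SERVICE_URL")

# train a classifier from the CSV above (saved as smoking_training.csv)
with open("smoking_training.csv", "rb") as training_data:
    classifier = nlc.create_classifier(
        training_metadata='{"name": "smoking", "language": "en"}',
        training_data=training_data,
    ).get_result()

# once training has finished, classify new text
result = nlc.classify(classifier["classifier_id"], "I smoke two packs a day").get_result()
print(result["top_class"])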
Watson NLU doesn't provide slot or word categories the way NLU solutions from e.g. Nuance/Vocon or Inferret do. That means you can classify complete utterances as a class/intent, as the older answer below illustrates, but you cannot group words of similar paradigmatic meaning into word classes - e.g. class "vehicle" = car, truck, bicycle.
Related
I want to do topic modelling, but in my case one article may contain many topics:
I have an article (Word file) that contains several topics, and each topic is associated with a company (see the example below).
I have a text as input :
"IBM is an international company specializing in all that is IT, on the other hand Facebook is a social network and Google is a search engine. IBM invented a very powerful computer."
Knowing we have labeled topics: "Products and services", "Communications", "Products and services", ...
I want to have as output:
IBM : Products and services
Facebook : Communications
Google : Products and services
So, I think we can do this by splitting the text: associate the parts of the text that talk about each company, for example:
IBM : ['IBM is an international company specializing in all that is IT', 'IBM invented a very powerful computer.']
Facebook : ['Facebook is a social network']
Google : ['Google is a search engine']
then, for each company, perform topic modelling on its parts of the text ...
OUTPUT:
IBM : Products and services
Facebook : Communications
Google : Products and services
Could you help me with how to split and match the parts of the text to each company, e.g. how to determine which parts talk about Facebook?
It seems like you have two separate problems: (1) Data preparation/cleaning, i.e. splitting your text into the right units for analysis; (2) classifying the different units of text into "topics".
1. Data Preparation
An 'easy' way of doing this would be to split your text into sentences and use sentences as your unit of analysis. spaCy is good for this, for example (see e.g. this answer here). Your example is more difficult since you want to split sentences even further, so you would have to come up with custom logic for splitting your text according to specific patterns, e.g. using regular expressions. I don't think there is a standard way of doing this, and it depends very much on your data.
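For the sentence split itself, a rough spaCy sketch (assuming the en_core_web_sm model is installed) that also groups sentences by the company they mention could look like this:

# pip install spacy && python -m spacy download en_core_web_sm  # in terminal
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("IBM is an international company specializing in all that is IT, "
        "on the other hand Facebook is a social network and Google is a search engine. "
        "IBM invented a very powerful computer.")
companies = ["IBM", "Facebook", "Google"]

parts = {company: [] for company in companies}
for sent in nlp(text).sents:
    for company in companies:
        if company in sent.text:
            parts[company].append(sent.text.strip())
print(parts)
# note: the first sentence mentions all three companies, which is exactly why the
# answer suggests custom splitting logic (e.g. regular expressions) on top of this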
2. Topic classification
If I understand correctly, you already have the labels ("topics" like ["Products and services", "Communications"]) which you want to attribute to different texts.
In this case, topic modeling is probably not the right tool, because topic modeling is mostly used when you want to discover new topics and don't know the topics/labels yet. And in any case, a topic model would only return the most frequent/exclusive words associated with a topic, not a neat abstract topic label like "Products and services". You also need enough text for a topic model to produce meaningful output.
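As a quick illustration of that point, here is a minimal gensim sketch (assuming gensim is installed): an LDA topic model only returns weighted word lists per topic, never a label like "Products and services", and on this little text the topics are not meaningful anyway.

# pip install gensim  # pip install in terminal
from gensim import corpora
from gensim.models import LdaModel

docs = [
    "IBM is an international company specializing in all that is IT".lower().split(),
    "Facebook is a social network".lower().split(),
    "Google is a search engine".lower().split(),
    "IBM invented a very powerful computer".lower().split(),
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)
print(lda.print_topics())
# prints something like [(0, '0.07*"is" + 0.05*"ibm" + ...'), (1, ...)] - just weighted
# words per topic, no abstract labels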
A more elegant solution is zero-shot classification. This basically means that you take a general machine learning model that has been pre-trained by someone else in a very general way for text classification and you simply apply it to your specific use case for "topic classification" without having to train/fine-tune it. The Transformers library has a very easy to use implementation of this.
# pip install transformers==3.1.0 # pip install in terminal
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
sequence1 = "IBM is an international company specializing in all that is IT"
sequence2 = "Facebook is a social network."
sequence3 = "Google is a search engine. "
candidate_labels = ["Products and services", "Communications"]
classifier(sequence1, candidate_labels)
# output: {'labels': ['Products and services', 'Communications'], 'scores': [0.8678683042526245, 0.1321316659450531]}
classifier(sequence2, candidate_labels)
# output: {'labels': ['Communications', 'Products and services'], 'scores': [0.525628387928009, 0.47437164187431335]}
classifier(sequence3, candidate_labels)
# output: {'labels': ['Products and services', 'Communications'], 'scores': [0.5514479279518127, 0.44855210185050964]}
=> It classifies all texts correctly based on your example and labels. The label ("topic") with the highest score is the one which the model thinks fits the text best. Note that you have to think hard about which labels are the most suitable. In your example, I wouldn't even be sure as a human which one fits better, and the model is also not very sure. With this zero-shot classification approach you can choose the topic labels that you find most adequate.
Here is an interactive web application to see what it does without coding. Here is a Jupyter notebook which demonstrates how to use it in Python. You can just copy-paste code from the notebook.
I am using a "list" entity. However, I am not achieving my expected result.
Here is what I have for LUIS intent:
getAnimal
I want to get a cat [animal].
Here is what I have with LUIS entities:
List Entities [animal]
cat: russian blue, persian cat, british shorthair
dog: bulldog, german shepard, beagle
rabbit: holland lop, american fuzzy lop, florida white
Here is what I have with LUIS Phrase lists:
Phrase lists [animal_phrase]
cat, russian blue, persian cat, british shorthair, dog, bulldog, german shepard, beagle, etc
Desired:
When user enters "I want to get a beagle." It will be match with "getAnimal" intent.
Actual:
When user enters "I want to get a beagle." It will be match with "None" intent.
Please help. Your help will be appreciated.
So using a phrase list is a good way to start; however, you need to make sure you provide enough data for LUIS to be able to learn the intents as well as the entities separately from the phrase list. Most likely you need to add more utterances.
Additionally, if your end goal is to have LUIS recognize the getAnimal intent, I would do away with the list entity and instead use a simple entity to take advantage of LUIS's machine learning, in combination with a phrase list to boost the signal of what an animal may look like.
As the documentation on phrase lists states,
Features help LUIS recognize both intents and entities, but features are not intents or entities themselves. Instead, features might provide examples of related terms.
(Features, in machine learning, are distinguishing traits or attributes of the data that your system observes, and are what you add to a group/class when using a phrase list.)
Start by
1. Creating a simple entity called Animal
2. Add more utterances to your getAnimal intent.
Following best practices outlined here, you should include at least 15 utterances per intent. Make sure to include plenty of examples of the Animal entity.
3. Be mindful to include variation in your utterances that is valuable to LUIS's learning (different word order, tense, grammatical correctness, length of utterances, and the entities themselves). I highly recommend reading this StackOverflow answer I wrote on how to build your app properly and get accurate entity detection if you want more elaboration.
(Screenshot omitted: in the example utterances, the highlighted words are tokens labeled with the simple Animal entity.)
4. Use a phrase list.
Be sure to include values that are not just 1 word long, but also 2, 3, and 4 words long, as different animal names may be that long (e.g. cavalier king charles spaniel, irish setter, english springer spaniel, etc.). I also included 40 animal breed names. Don't be shy about adding the Related Values suggested to you into your phrase list.
Train your app so it picks up your changes, and prosper!
With these changes, "I want a beagle" reaches the proper intent. LUIS will even be able to detect animals that were never entered into the app when extracting entities.
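If you want to check this programmatically rather than in the portal, a rough sketch of a call to the LUIS v3 prediction endpoint could look like the following; the endpoint, app ID, and key are placeholders, and the exact URL shape should be verified against the current Azure docs:

# pip install requests  # pip install in terminal
import requests

ENDPOINT = "https://YOUR_RESOURCE_NAME.cognitiveservices.azure.com"  # placeholder
APP_ID = "YOUR_APP_ID"            # placeholder
PREDICTION_KEY = "YOUR_KEY"       # placeholder

url = f"{ENDPOINT}/luis/prediction/v3.0/apps/{APP_ID}/slots/production/predict"
params = {
    "query": "I want to get a beagle.",
    "subscription-key": PREDICTION_KEY,
    "show-all-intents": "true",
}
response = requests.get(url, params=params).json()
print(response["prediction"]["topIntent"])   # expect: getAnimal
print(response["prediction"]["entities"])    # expect the Animal entity with "beagle"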
I'm currently working on a project where I'm taking emails and stripping out the message bodies using the email package, and then I want to categorize them using labels like sports, politics, technology, etc. I've successfully stripped the message bodies out of my emails, and I'm looking to start classifying.
To make multiple labels like sports, technology, politics, and entertainment, I need some set of words for each one to do the labelling. For example, the Sports label will have the label data: Football, Soccer, Hockey, ...
Where can I find online label data to help me?
You can use DMOZ.
Be aware that there are different kinds of text. For example, one of the most common words in email text will be Hi or Hello, but in wiki text Hi and Hello will not be common words.
What you're trying to do is called topic modeling:
https://en.wikipedia.org/wiki/Topic_model
The list of topics is very dependent on your training dataset and the ultimate purpose for which you're building this.
A good place to start can be here:
https://nlp.stanford.edu/software/tmt/tmt-0.4/
You can look at their topics, but you can probably also use it to assign some initial topics to your data and just work on top of their topics.
You can use the BBC dataset.
It has labeled news articles which can help.
For feature extraction, remove stopwords, do stemming, use n-grams with tf-idf, and then choose the best features.
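A sketch of that recipe with scikit-learn (stopword removal, n-grams with tf-idf, feature selection, then a classifier; a stemmer could be plugged into the vectorizer's tokenizer if needed):

# pip install scikit-learn  # pip install in terminal
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy training data; in practice these would be your email bodies and labels
texts = ["the team won the football match", "new phone with a faster processor",
         "the senate passed the new bill", "the goalkeeper saved a penalty"]
labels = ["sports", "technology", "politics", "sports"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("select", SelectKBest(chi2, k=15)),   # keep only the most informative features
    ("clf", MultinomialNB()),
])
pipeline.fit(texts, labels)
print(pipeline.predict(["who won the football match"]))  # -> ['sports'] (toy example)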
I created a classifier to classify the class of nouns, adjectives, and named entities in a given sentence. I used a large Wikipedia dataset for classification.
Like:
Where Abraham Lincoln was born?
So the classifier will give this sort of result (word - class):
Where - question
Abraham Lincoln - Person, Movie, Book (because the classifier finds Abraham Lincoln in all these categories)
born - time
When Titanic was released?
when - question
Titanic - Song, Movie, Vehicle, Game (Titanic is classified in all these categories)
Is there any way to identify exact context for word?
Please see:
Word sense disambiguation would not help here, because there might not be a nearby word in the sentence which can help.
The Lesk algorithm with WordNet or synsets also does not help, because for, say, the word Bank, the Lesk algorithm will behave like this:
======== TESTING simple_lesk ===========
TESTING simple_lesk() ...
Context: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities
TESTING simple_lesk() with POS ...
Context: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)
Here, for the word bank, it suggests a financial institution and sloping land. But in my case I am already getting such predictions, e.g. Titanic can be a movie or a game.
I want to know whether there is any other approach, apart from the Lesk algorithm, baseline algorithms, and traditional word sense disambiguation, which can help me identify which class is correct for a particular keyword.
Titanic -
Thanks for using the pywsd examples. With regards to WSD, there are many other variants, and I'm coding them myself during my free time. So if you want to see it improve, do join me in coding the open-source tool =)
Meanwhile, you will find the following technologies more relevant to your task:
Knowledge base population (http://www.nist.gov/tac/2014/KBP/) where tokens/segments of text are assigned an entity and the task is to link them or to solve a simplified question and answer task.
Knowledge representation (http://groups.csail.mit.edu/medg/ftp/psz/k-rep.html)
Knowledge extraction (https://en.wikipedia.org/wiki/Knowledge_extraction)
The above technologies usually include several sub-tasks such as:
Wikification (http://nlp.cs.rpi.edu/kbp/2014/elreading.html)
Entity linking
Slot filling (http://surdeanu.info/kbp2014/def.php)
Essentially you're asking for a tool that is an NP-complete AI system for language/text processing, so I don't really think such a tool exists as of yet. Maybe it's IBM Watson.
If you're looking for a field to look into, the field is out there, but if you're looking at tools, wikification tools are most probably closest to what you might need. (http://nlp.cs.rpi.edu/paper/WikificationProposal.pdf)
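The core idea behind wikification/entity linking can still be sketched very simply: score each candidate class for a mention by how much its description overlaps with the words around the mention. This is a toy simplification, with made-up candidate descriptions, just to show the direction:

# a toy entity-linking sketch: score candidate classes by context overlap
import re

candidate_descriptions = {
    "Song":    "music single album recorded artist charts",
    "Movie":   "film movie cinema released directed box office cast",
    "Vehicle": "ship ocean liner vessel sank maiden voyage",
    "Game":    "video game console players gameplay publisher",
}

def rank_candidates(context, descriptions):
    context_words = set(re.findall(r"[a-z]+", context.lower()))
    scores = {}
    for label, description in descriptions.items():
        scores[label] = len(context_words & set(description.split()))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(rank_candidates("When Titanic was released?", candidate_descriptions))
# -> [('Movie', 1), ('Song', 0), ('Vehicle', 0), ('Game', 0)]
# a real wikifier/entity linker uses much richer context and knowledge-base
# features than this word overlap, but the ranking idea is the same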
I'm trying to make an analysis of a set of phrases, and I don't know exactly how "natural language processing" can help me, or if someone can share his knowledge with me.
The objective is to extract streets and locations. Often this kind of information is not presented to the reader in a structured way, and it's hard to find a way of parsing it. I have two main objectives.
First, the extraction of the streets themselves. As far as I know, NLP libraries can help me tokenize a phrase and perform an analysis which will extract nouns (for example). But where does a street name begin and where does it end? I assume that I will need to compare that analysis with a street database, but I don't know which method is optimal.
Also, I would like to deduce the level of severity, for example, in car accidents. I'm assuming that the only way is to establish some heuristic based on the words present in the phrase (for example, if the word "deceased" appears, +100). Am I correct?
Thanks a lot as always! :)
The first part of what you want to do ("First, the extraction of the streets themselves. [...] But where does a street name begin and where does it end?") is a subfield of NLP called Named Entity Recognition. There are many libraries available which can do this. I like NLTK for Python myself. Depending on your choice, I assume that a street-name database would be useful for training the recognizer, but you might be able to get reasonable results with the default corpus. Read the documentation for your NLP library for that.
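A minimal NLTK sketch of that Named Entity Recognition step (using the default English chunker; a street-name gazetteer or a custom-trained model would still be needed for street names specifically):

# pip install nltk  # pip install in terminal
import nltk

# one-time downloads for the tokenizer, POS tagger, and NE chunker
for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)

sentence = "There was a car accident on Baker Street in London yesterday."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

# collect the named entities the default chunker finds
for subtree in tree:
    if hasattr(subtree, "label"):
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), "->", entity)
# it will usually tag "London" as a location; "Baker Street" may be mislabeled or
# missed entirely, which is why a street database helps for training/validation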
The second part, recognizing accident severity, can be treated as an independent problem at first. You could take the raw words or their part of speech tags as features, and train a classifier on it (SVM, HMM, KNN, your choice). You would need a fairly large, correctly labelled training set for that; from your description I'm not certain you have that?
"I'm assuming that the only way is to stablish some heuristic by the present words in the phrase " is very vague, and could mean a lot of things. Based on the next sentence it kind of sounds like you think scanning for a predefined list of keywords is the only way to go. In that case, no, see the paragraph above.
Once you have both parts working, you can combine them and count the number of accidents and their severity per street. Using some geocoding library you could even generalize to neighborhoods or cities. Another challenge is the detection of synonyms ("Smith Str" vs "John Smith Street") and homonyms ("Smith Street" in London vs "Smith Street" in Leeds).
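For the synonym detection, a simple starting point is normalising common abbreviations and fuzzy-matching against your street database; this sketch uses only the standard library (difflib), though a geocoding service will usually do a better job:

# a toy sketch for matching street-name variants against a street database
import difflib

streets = ["John Smith Street", "Smith Square", "High Street"]  # hypothetical database

ABBREVIATIONS = {"str": "street", "st": "street", "sq": "square"}

def normalise(name):
    words = name.lower().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def best_match(mention, candidates, cutoff=0.6):
    normalised = [normalise(c) for c in candidates]
    matches = difflib.get_close_matches(normalise(mention), normalised, n=1, cutoff=cutoff)
    if not matches:
        return None
    # map the normalised winner back to the original spelling
    return candidates[normalised.index(matches[0])]

print(best_match("Smith Str", streets))  # -> John Smith Street
# homonyms ("Smith Street" in London vs Leeds) still need extra context, e.g. the
# city mentioned nearby or the geocoder's disambiguation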