When we focus on only one domain (say, weather) and use an LSTM model with a softmax classifier (which picks the sub-intent with the highest score) to identify sub-intents within weather, how should we handle non-weather queries for which we want to say we don't have an answer? The problem is that there are too many outside domains, and I don't know whether it is feasible to generate data for all of them.
There is no really good way to do this.
In practice these are common approaches:
Build a class of examples of stuff you want to ignore. For a chatbot this might be greetings ("hello", "hi!", "how are you") or obscenities.
Create a confidence threshold and give an uncertain reply if all intents are below the threshold.
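A minimal sketch of the threshold approach, assuming the softmax scores from the LSTM sub-intent classifier are already available (the intent names, scores, and threshold value here are made up for illustration):

```python
# Hypothetical softmax output from the LSTM sub-intent classifier.
intent_scores = {"weather_forecast": 0.41, "weather_current": 0.35, "weather_alerts": 0.24}

CONFIDENCE_THRESHOLD = 0.70  # tune on held-out data

def pick_intent(scores, threshold=CONFIDENCE_THRESHOLD):
    best_intent = max(scores, key=scores.get)
    if scores[best_intent] < threshold:
        # No sub-intent is confident enough: treat the query as out-of-domain.
        return None
    return best_intent

intent = pick_intent(intent_scores)
if intent is None:
    print("Sorry, I can only answer weather questions.")
else:
    print("Handling sub-intent:", intent)
```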
Related
I'm investigating various NLP algorithms and tools to solve the following problem; NLP newbie here, so pardon my question if it's too basic.
Let's say I have a messaging app where users can send text messages to one or more people. When the user types a message, I want the app to suggest who the potential recipients of the message might be.
If user "A" sends a lot of text messages regarding "cats" to user "B" and some messages to user "C" and sends a lot of messages regarding "politics" to user "D", then next time user types the message about "cats" then the app should suggest "B" and "C" instead of "D".
So I'm doing some research on topic modeling and word embeddings, and I see that LDA and Word2Vec are the two most likely algorithms I could use.
Wanted to pick your brain on which one you think is more suitable for this scenario.
One idea I have is: extract topics from the previous messages using LDA, and rank the recipients based on the number of times a topic has been discussed with them (i.e., messages sent) in the past. If I have this mapping from topic to a sorted list of users (ranked by frequency), then when the user types a new message I can run topic extraction on it again, predict what the message is about, look up the mapping to see who the possible recipients are, and show them to the user.
Is this a good approach? Or is Word2Vec (or doc2vec, or lda2vec) better suited to this problem, where we can find similar messages using vector representations of words, a.k.a. word embeddings? Do we really need to extract topics from the messages to predict the recipients, or is that unnecessary here? Are there any other algorithms or techniques you think would work best?
What are your thoughts and suggestions?
Thanks for the help.
Since you are purely looking at topic extraction from previous messages, in my opinion LDA would be the better choice. LDA describes the statistical relationships of word occurrences; the semantics of the words are mostly ignored (if you need semantics, you might want to rethink). I would also suggest having a look at a hybrid approach. I have not tried it myself, but it looks quite interesting:
lda2vec: a new hybrid approach
Also, if you happen to try it out, would love to know your findings.
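For concreteness, here is a minimal sketch of the LDA pipeline described in the question, using gensim (the library choice, the toy message history, and the topic-to-recipient bookkeeping are all assumptions for illustration):

```python
from collections import Counter, defaultdict
from gensim import corpora, models

# Toy history of (recipient, message) pairs; real data would come from the app.
history = [
    ("B", "my cat knocked over the plant again"),
    ("B", "look at this cat video"),
    ("C", "should we adopt another cat"),
    ("D", "did you watch the election debate"),
]

tokenized = [msg.lower().split() for _, msg in history]
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# Count how often each recipient received a message whose dominant topic is t.
topic_to_recipients = defaultdict(Counter)
for (recipient, _), bow in zip(history, corpus):
    top_topic = max(lda.get_document_topics(bow), key=lambda x: x[1])[0]
    topic_to_recipients[top_topic][recipient] += 1

def suggest_recipients(new_message, top_n=2):
    bow = dictionary.doc2bow(new_message.lower().split())
    top_topic = max(lda.get_document_topics(bow), key=lambda x: x[1])[0]
    return [user for user, _ in topic_to_recipients[top_topic].most_common(top_n)]

print(suggest_recipients("another cat picture for you"))  # likely ["B", "C"]
```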
I think you're looking for recommender systems (Netflix movie suggestions, Amazon purchase recommendations, etc.) or network analysis (Facebook friend recommendations), which can use topic modeling as an attribute. I'll try to break them down:
Network Analysis:
FB friends are nodes of a network whose edges are friendship relationships. Typical computations include betweenness centrality, shortest paths between nodes (stored as lists of edges), and closeness centrality, which is based on the sum of shortest-path distances from a node to all other nodes.
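A small sketch of those measures with networkx (the toy friendship graph is made up):

```python
import networkx as nx

# Toy friendship graph; nodes are users, edges are friendships.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("D", "E")])

betweenness = nx.betweenness_centrality(G)  # how often a node sits on shortest paths
closeness = nx.closeness_centrality(G)      # based on shortest-path distances to all others
path = nx.shortest_path(G, source="A", target="E")

print(betweenness)
print(closeness)
print(path)  # e.g. ['A', 'B', 'D', 'E']
```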
Recommender Systems:
Recommend what is popular, look at similar users and suggest things the user might be interested in, and compute cosine similarity by measuring the angle between vectors (vectors pointing in similar directions are more similar).
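And a tiny sketch of the cosine-similarity part with scikit-learn (the interest vectors are placeholders):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical interest vectors (e.g. per-topic weights) for three users.
user_vectors = np.array([
    [0.9, 0.1, 0.0],   # user B: mostly "cats"
    [0.6, 0.3, 0.1],   # user C: some "cats"
    [0.0, 0.1, 0.9],   # user D: mostly "politics"
])
new_message_vector = np.array([[0.8, 0.2, 0.0]])  # a "cats"-like message

scores = cosine_similarity(new_message_vector, user_vectors)[0]
print(scores)  # highest similarity for the first two rows (B and C)
```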
LDA:
A topic modeler for text data. It returns topics of interest and might be used as a nested component within the algorithms above.
Word2Vec:
Word2Vec maps each word to a number and then to a dense vector. The pre-processing looks like this: assign each word an integer id (word -> #, say 324) and count how often it occurs, so a sentence such as
This is a sentence is.
becomes
[(1,1), (2,2), (3,1), (4,1), (2,2)]
i.e. a (word id, count) pair per token, where "is" has id 2 and shows up twice. The same id-and-count encoding is also what LDA takes as input. Word2Vec itself is a neural net that you will probably use as a pre-processing step.
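gensim's Dictionary/doc2bow (the library choice is an assumption here; any tokenizer plus counter works) produces a very similar (word id, count) encoding, with one entry per unique word rather than per token:

```python
from gensim import corpora

sentence = "This is a sentence is."
tokens = [t.strip(".").lower() for t in sentence.split()]

dictionary = corpora.Dictionary([tokens])  # assigns each word an integer id
bow = dictionary.doc2bow(tokens)           # [(word_id, count), ...]

print(dictionary.token2id)  # e.g. {'a': 0, 'is': 1, 'sentence': 2, 'this': 3}
print(bow)                  # e.g. [(0, 1), (1, 2), (2, 1), (3, 1)]
```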
I hope this helps :)
I'm repurposing word2vec for products and users on my website. I would like to say that a user is NEGATIVELY associated with a product if he has visited the page for < 5 seconds, and POSITIVELY if he spent > 30 seconds on the page. Is there a way to specify this in word2vec? Or is there some other tool that enables this?
Although your question is not well defined, I think you want to store the relation of a user to a product, which has nothing to do with word2vec. word2vec essentially gives you a mapping from strings to vectors in a continuous domain. For your problem, you should add a separate new feature for the user-product relationship (NEGATIVE or POSITIVE) alongside the word2vec features, and you can let the model retrain the word embeddings according to this new POSITIVE/NEGATIVE feature while solving your particular task. This way the model will adjust the word embeddings and capture some of the desired effect of the POSITIVE/NEGATIVE feature.
Please elaborate so that I can answer your question better.
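To make that suggestion concrete, here is a minimal sketch (the function names, thresholds, and embedding dimensions are assumptions for illustration, not part of any library API):

```python
import numpy as np

def dwell_time_label(seconds):
    """Map page dwell time to the POSITIVE/NEGATIVE relation described above."""
    if seconds < 5:
        return -1.0   # NEGATIVE association
    if seconds > 30:
        return +1.0   # POSITIVE association
    return 0.0        # neutral / unknown

def build_features(product_embedding, seconds_on_page):
    """Concatenate the word2vec-style product embedding with the relation feature."""
    relation = np.array([dwell_time_label(seconds_on_page)])
    return np.concatenate([product_embedding, relation])

# Hypothetical 4-dimensional product embedding (real ones are usually 100-300 dims).
product_vec = np.array([0.12, -0.40, 0.33, 0.05])
x = build_features(product_vec, seconds_on_page=42)
print(x)  # embedding dims followed by the +1/-1/0 relation feature
```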
I am developing an app that uses wit.ai as a service. Right now, I am having problems training it. In my app I have 3 intents:
to call
to text
to send picture
Here are my training examples:
Call this number 072839485 and text this number 0623744758 and send picture to this number 0834952849.
Call this number 072839485, 0834952849 and 0623744758
In my first training example, I labeled the sentence with all 3 intents, and labeled 072839485 as phone_number with the role to_call_phone_number, 0623744758 as phone_number with the role to_text_phone_number, and 0834952849 as phone_number with the role to_send_pic_phone_number.
In my second training example, I labeled all three numbers as phone_number with the to_call_phone_number role.
After a lot of training, wit still outputs the wrong labels. For a sentence like this:
Call this number 072637464, 07263485 and 0273847584
wit says 072637464 is to_call_phone_number, but 07263485 and 0273847584 are to_send_pic_phone_number.
Am I not training it correctly? Can someone give me some suggestions on best practices for training wit?
There aren't many best practices out there for wit.ai training at the moment, but with this particular example in mind I would recommend the following:
Pay attention to the type of entity in addition to just the value. If you choose free-text or keyword, you'll get different responses from the wit engine. For example: in your training, if the number is a keyword, wit will associate that particular number with the intent/role rather than with its position in the sentence. This is probably the reason your training isn't working correctly.
One good practice would be to train your bot with specific examples first, which give the bot more information (such as the user providing the keyword 'photograph' along with a number), and then with general examples that apply to more cases (such as your second example).
Think about the user's perspective and what would seem natural to them, and work with those training examples first. Generate a list of possible training examples, label them from general to specific, and then train intents/roles/entities based on those examples, rather than thinking about intents and roles first.
I am now working on a document recommendation program and I am kinda stuck here.
For each document, I have a score assigned according to the user's actions. Then, when a new document comes in, I need to predict how much the user will like it and re-rank all the documents according to their scores. My solution is to use a threshold to divide those scores into "recommend" and "not recommend". Then Naive Bayes or another classification model can either give me a label or return the probability of that label (I am using the NLTK package to do the text analytics).
Am I on the right track? My question is: when I get that probability, how can I convert it into the score I use for the ranking? Or should I use logistic regression in scikit-learn instead?
Thanks!
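For reference, the probability-as-score idea from the question looks roughly like this with NLTK (the feature extractor and training data are placeholders):

```python
import nltk

def doc_features(text):
    # Toy feature extractor: presence of each word. Replace with your own.
    return {word: True for word in text.lower().split()}

# Hypothetical labelled documents derived from the user's action scores.
train = [
    (doc_features("great article about machine learning"), "recommend"),
    (doc_features("boring press release nothing new"), "not_recommend"),
]

classifier = nltk.NaiveBayesClassifier.train(train)

new_doc = "new article about deep learning"
prob = classifier.prob_classify(doc_features(new_doc)).prob("recommend")
print(prob)  # this probability could be used directly as the ranking score
```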
It sounds like you are trying to force a ranking problem into a classification problem. What you really want to do is learn how to rank the documents given a "query".
I would suggest trying out something like the SVM-Rank algorithm. It takes as input a set of "recommended" and "not recommended" vectors and then learns how to rank them so that the recommended ones come first. There is also a simple Python tool in dlib you can use to do this. See here for an example: http://dlib.net/svm_rank.py.html
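A minimal sketch adapted from that dlib example (the feature vectors here are toy 2-D placeholders; real ones would be your document features):

```python
import dlib

# Put "recommend" documents in relevant and "not recommend" ones in nonrelevant.
data = dlib.ranking_pair()
data.relevant.append(dlib.vector([1, 2]))
data.relevant.append(dlib.vector([0, 1]))
data.nonrelevant.append(dlib.vector([0, 0]))
data.nonrelevant.append(dlib.vector([1, 0]))

trainer = dlib.svm_rank_trainer()
trainer.c = 10  # regularization strength, tune on held-out data
ranker = trainer.train(data)

# Higher output means the document should be ranked earlier.
print(ranker(dlib.vector([1, 2])))
print(ranker(dlib.vector([0, 0])))
```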
Objective: a node.js function that can be passed a news article (title, text, tags, etc.) and will return a category for that article ("Technology", "Fashion", "Food", etc.)
I'm not picky about exactly what categories are returned, as long as the list of possible results is finite and reasonable (10-50).
There are Web APIs that do this (e.g., Alchemy), but I'd prefer not to incur the extra cost (both in terms of external HTTP requests and also $$) if possible.
I've had a look at the node module "natural". I'm a bit new to NLP, but it seems like maybe I could achieve this by training a BayesClassifier on a reasonable word list. Does this seem like a good/logical approach? Can you think of anything better?
I don't know if you are still looking for an answer, but let me put my two cents for anyone who happens to come back to this question.
Having worked in NLP, I would suggest you look into the following approach to solve the problem.
Don't look for a single-package solution. There are great packages out there, no doubt, for lots of things. But when it comes to active research areas like NLP, ML and optimization, the tools tend to be at least 3 or 4 iterations behind what's in academia.
Coming to the core problem. What you want to achieve is text classification.
The simplest way to achieve this would be an SVM multiclass classifier.
Simplest, yes, but also with very, very (note the double stress) reasonable classification accuracy, runtime performance and ease of use.
The thing you would need to work on is the feature set used to represent your news article/text/tags. You could use a bag-of-words model, add named entities as additional features, and use article location/time as features (though for simple category classification these might not give you much improvement).
The bottom line is: SVMs work great, they have multiple implementations, and at runtime you don't really need much ML machinery.
Feature engineering, on the other hand, is very task specific. But given some basic set of features and good labelled data, you can train a very decent classifier.
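You want node.js, but as a sketch of the bag-of-words plus multiclass-SVM pipeline itself, here it is in Python with scikit-learn (the categories and articles are made up; LinearSVC trains one binary classifier per category, i.e. one-vs-rest, internally):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labelled articles; a real system needs many examples per category.
articles = [
    "new smartphone launches with faster chip",
    "designer unveils spring collection in paris",
    "restaurant serves experimental tasting menu",
    "startup releases open source database",
]
categories = ["Technology", "Fashion", "Food", "Technology"]

# Bag-of-words (tf-idf) features feeding a linear SVM classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(articles, categories)

print(model.predict(["restaurant adds a tasting menu"]))  # likely ['Food']
```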
Here are some resources for you:
http://svmlight.joachims.org/
SVM multiclass is what you would be interested in.
And here is a tutorial by the SVM zen himself!
http://www.cs.cornell.edu/People/tj/publications/joachims_98a.pdf
I don't know about the stability of this, but from the code it is a binary SVM classifier, which means that if you have a known set of N tags you want to classify the text into, you will have to train N binary SVM classifiers, one for each of the N category tags.
Hope this helps.