I have created an entity for size, and it is not working as expected. Below is my size entity
and my price entity
I have an intent for search. When I say "search camera in 200 dollar", it gives me the response below
But when I do the same search with the size entity, like "search camera in 23 megapixel", it does not recognize "23 megapixel" as a size. Please help me resolve this issue.
I have noticed that when I train my agent with the user-says phrase "i want camera for 20 megapixel" and then test it in the console with the phrase "i need camera in 20 megapixel", it only recognizes camera [I have a separate entity for camera].
But when I do the same search with the phrase "i need camera in 200 dollars"
it works. It seems it does not recognize my size entity because of the variation in wording.
As I tested more, I found that my agent only recognizes the size entity when my phrase matches one of the phrases I used as training phrases.
I would like to build an NLP classification model.
My input is a paragraph or a sentence. Ideally, my output is a score or probability (between 0 and 1).
I have defined specific entities ex-ante, each entity belongs to a single group.
Based on business insights, we know that the output to predict does not depend on the entities by themselves, but depends on their groups. For example, the phrase “Max barks” would return 1 because “Max” belongs to the group “Dogs”, but “Kitty barks” would return 0 (because Kitty is not a dog). If “Max” was a cat, the phrase would return 0.
One way to do so would be to generate all the sentences with all the permutations of dogs and cats (in my example) but that is very cumbersome!
Another way would be to replace each entity with the name of its group (for example, “Max” would become a placeholder for the group “Dogs”), but that looks weird to me!
I don't have any other idea how to tackle this problem.
Could you please help me, ideally with code?
Thanks a lot.
If I understand your question correctly, you want to classify text into "dog activities" vs. "non-dog activities". The text refers to dogs, cats (and maybe other animals) by name, but you know which name belongs to which species.
In that case I would suggest introducing a named entity token that replaces each animal's name with its species. In your example, "Max barks" could be replaced with "%DOG% barks" and "Kitty barks" with "%CAT% barks".
This would give the model a strong signal to pick up and train on correctly.
Otherwise, you could go with your approach of generating all of the potential examples for dogs and cats, where each name is only loosely linked to one group or the other through the label of the training / testing example. Even though it is a bit cumbersome, it can be more practical than introducing another step into the processing pipeline - Named Entity Recognition - which translates the names of the animals to their species. Such a step would be necessary both during training and during inference.
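The token replacement suggested above could be sketched roughly as follows. This is only a minimal illustration: the `ENTITY_TO_GROUP` table and the `replace_entities` helper are hypothetical names, and the sketch assumes the names can be matched as whole words with a simple lookup rather than a full NER step.

```python
import re

# Hypothetical lookup table: entity name -> group.
# In practice this would come from your own predefined entities.
ENTITY_TO_GROUP = {"Max": "DOG", "Rex": "DOG", "Kitty": "CAT"}

# One regex that matches any known name as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in ENTITY_TO_GROUP) + r")\b"
)


def replace_entities(text: str) -> str:
    """Replace each known entity name with a group token such as %DOG%."""
    return _PATTERN.sub(lambda m: "%" + ENTITY_TO_GROUP[m.group(1)] + "%", text)


print(replace_entities("Max barks"))
print(replace_entities("Kitty barks"))
```

The same function would have to be applied to the text both before training and before inference, so that the model always sees the group tokens instead of the names.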
What is the right approach for multi-label text information extraction/classification
Having texts that describe a caregiver/patient visit : (made-up example)
Mr *** visits the clinic on 02/2/2018 complaining about pain in the
lower back for several days, No pathological findings in the x-ray or
in the blood tests. I suggest Mr *** 5 resting days.
Now, that text can be as long as a paragraph, while the only information I care about is "lower back pain" and "resting days". I have 300-400 different labels, but the number of labeled samples is only around 1000-1500 in total. When I label a text I also mark the relevant words that make up the label; here that would be ['pain','lower','back'].
When I just look up those words (or the words of the other 300-400 labels) in other texts, I manage to label a larger number of texts. But if the words are written in a different pattern, such as "Ache in the lower back" or "lowerback pain", and I never added that pattern to the look-up table for "lower back pain", I won't find it.
Because a paragraph can be large while the only information I need is just 3-4 words, DL/ML models do not manage to learn with that amount of data and such a high number of labels. I am wondering if there is a way to use the look-up table as a feature in the training phase, or whether I should try other approaches.
I have used TensorFlow object detection for quite a while now. I am more of a user; I don't really know how it works. I am wondering whether it is possible to train it to recognize both that an object is something and that it is not something. For example, I want to detect cracks on tiles. Can I use object detection so that, when I show it an image of a tile, it tells me if there is a crack (and also shows the location), or tells me that there is no crack on the tile?
I have tried training with pictures with and without defects, using 2 classes (1 for defect and 1 for no defect). But when a picture has a defect, the results keep showing both classes in the same picture. Is there a way to show only the defect?
Basically, I would like to do defect checking. This is a simplified case with 1 defect type, but the actual case will have a few defect types.
Thank you.
In case you're only expecting input images of tiles, either with defects or not, you don't need a class for no defect.
The API adds a background class for everything which is not the other classes.
So you simply need to define one class, defect, and tiles on which nothing is detected are not defective.
So in your training set - simply give bounding boxes of defects, and no bounding box in case of no defect, and then your model should learn to detect the defects as mentioned above.
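At inference time, the "defect or no defect" decision then comes down to whether any detection survives a score cut-off. The following is a minimal sketch under stated assumptions: it presumes the detector's output has already been converted to plain lists of scores and boxes, the `find_defects` and `has_defect` helpers are hypothetical, and the 0.5 threshold is an arbitrary value that would need tuning on your own data.

```python
from typing import List, Tuple

# Assumed box layout: (ymin, xmin, ymax, xmax) in normalized coordinates.
Box = Tuple[float, float, float, float]


def find_defects(
    scores: List[float], boxes: List[Box], threshold: float = 0.5
) -> List[Tuple[float, Box]]:
    """Keep only the detections whose score is at least `threshold`."""
    return [(s, b) for s, b in zip(scores, boxes) if s >= threshold]


def has_defect(scores: List[float], boxes: List[Box], threshold: float = 0.5) -> bool:
    """A tile counts as defective if at least one detection passes the threshold."""
    return len(find_defects(scores, boxes, threshold)) > 0


# Hypothetical detector output for one tile image.
scores = [0.91, 0.12]
boxes = [(0.10, 0.20, 0.30, 0.40), (0.50, 0.50, 0.60, 0.60)]
print(has_defect(scores, boxes))
print(find_defects(scores, boxes))
```

With several defect types, each type would be its own class, and a tile with no detection above the threshold for any class would be treated as clean.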
Thanks for taking the time to read my question!
So I am running an experiment to see if I can predict whether an individual has been diagnosed with depression (or at least says they have been) based on the words (or tokens) they use in their tweets. I found 139 users who at some point tweeted "I have been diagnosed with depression" or some variant of this phrase in an earnest context (i.e. not joking or sarcastic; human native speakers of the tweet's language judged whether each tweet was genuine).
I then collected the entire public timeline of each of these users, giving me a "depressed user tweet corpus" of about 17000 tweets.
Next I created a database of about 4000 random "control" users, and from their timelines created a "control tweet corpus" of about 800,000 tweets.
Then I combined them both into a big dataframe, which looks like this:
,class,tweet
0,depressed,tweet text .. *
1,depressed,tweet text.
2,depressed,# tweet text
3,depressed,저 tweet text
4,depressed,# tweet text😚
5,depressed,# tweet text😍
6,depressed,# tweet text ?
7,depressed,# tweet text ?
8,depressed,tweet text *
9,depressed,# tweet text ?
10,depressed,# tweet text
11,depressed,tweet text *
12,depressed,#tweet text
13,depressed,
14,depressed,tweet text !
15,depressed,tweet text
16,depressed,tweet text. .
17,depressed,tweet text
...
150595,control,#tweet text?
150596,control,"# tweet text."
150597,control,# tweet text.
150598,control,"# tweet text. *"
150599,control,"#tweet text?"
150600,control,"# tweet text?"
150601,control,# tweet text?
150602,control,# tweet text.
150603,control,#tweet text~
150604,control,# tweet text.
Then I trained a multinomial naive bayes classifier using an object from the CountVectorizer() class imported from the sklearn library:
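from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Turn each tweet into a vector of raw token counts
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(tweet_corpus['tweet'].values)
# Fit a multinomial naive Bayes model on the counts and the class labels
classifier = MultinomialNB()
targets = tweet_corpus['class'].values
classifier.fit(counts, targets)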
MultinomialNB(alpha=1.0, class_prior=None, fit_prior= True)
Unfortunately, after running a 6-fold cross validation test, the results suck and I am trying to figure out why.
Total tweets classified: 613952
Score: 0.0
Confusion matrix:
[[596070 743]
[ 17139 0]]
So, I didn't properly predict a single depressed person's tweet! My initial thought is that I have not properly normalized the counts of the control group, and therefore even tokens which appear more frequently among the depressed user corpus are over represented in the control tweet corpus due to its much larger size. I was under the impression that .fit() did this already, so maybe I am on the wrong track here? If not, any suggestions on the most efficient way to normalize the data between two groups of disparate size?
You should use a re-sampling technique to deal with the unbalanced classes. There are many ways to do that "by hand" in Python, but I recommend imbalanced-learn, which collects the re-sampling techniques commonly used on datasets with strong between-class imbalance.
If you are using Anaconda, you can use:
conda install -c glemaitre imbalanced-learn
or simply:
pip install -U imbalanced-learn
This library is compatible with scikit-learn. Your dataset looks very interesting, is it public? Hope this helps.
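To make the "by hand" option concrete, here is a minimal sketch of random under-sampling using only the standard library. It is not imbalanced-learn's implementation; the `random_undersample` helper and the toy data are hypothetical. It simply keeps as many samples from every class as the rarest class has.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class keeps as many items as the rarest class."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # Size of the smallest class decides how many items each class keeps.
    n_keep = min(len(items) for items in by_class.values())

    rng = random.Random(seed)  # fixed seed so the result is reproducible
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        for sample in rng.sample(items, n_keep):
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: 2 "depressed" tweets against 10 "control" tweets.
tweets = ["d1", "d2"] + ["c%d" % i for i in range(10)]
classes = ["depressed"] * 2 + ["control"] * 10
balanced_tweets, balanced_classes = random_undersample(tweets, classes)
print(balanced_classes.count("depressed"), balanced_classes.count("control"))
```

When cross-validating, it is usually safer to re-sample only the training portion of each fold and leave the held-out portion with its original class balance, so that the reported score reflects the real distribution.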
I recently installed PredictionIO.
What I'd like to achieve is: I'd like to categorize content on the words included in the text. But how can I import data like raw Tweets to PredictionIO? Is it possible to let PredictionIO run over the content and find strong words and to sort them in categories?
What I'd like to get is something like this: Query for Boston Red Sox --> keywords that should appear would be: baseball, Boston, sports, ...
So I'll add on a little to what Thomas said. He's right, it all depends whether or not you have labels associated to your tweets. If your data is labeled then this will be a Text Classification problem. Look at this for more detailed info:
If you're instead looking to cluster, or group, a set of unlabeled observations then, as Thomas said, your best bet is to incorporate LDA into the works. Look at the latter documentation for more information, but basically once you run the LDA model you'll obtain an object of type DistributedLDAModel which has a method topicDistributions which gives you, for each tweet, a vector where each component is associated to a topic, and the component entry gives you the probability that the tweet belongs to that topic. You can cluster by assigning each tweet the topic with highest probability.
You also have access to a matrix of size MxN, where M is the number of words in your vocabulary, and N is the number of topics, or clusters, you wish to discover in your data. You can roughly interpret the ij th entry of this Topics Matrix as the probability that the word i appears in a document given that the document belongs to topic j. Another rule you could use for clustering is to treat each word vector associated to your tweets as a vector of counts. Then, you can interpret the ij entry of the product of your word matrix (tweets as rows, words as columns) and the Topics Matrix returned by LDA as the probability that tweet i belongs to topic j (this follows under certain assumptions, feel free to ask if you want more details). Again now you assign tweet i to the topic associated to the largest numerical value in row i of the resulting matrix. You can even use this clustering rule for assigning topics to incoming observations once you have used your original set of tweets for topic discovery!
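The second clustering rule above (multiply the tweet word counts by the Topics Matrix, then take the largest entry in each row) can be sketched in plain Python as follows. This is only an illustration with hypothetical toy numbers and a hypothetical `assign_topics` helper; it does not call PredictionIO or MLlib, and it assumes the Topics Matrix has already been extracted as a list of rows.

```python
def assign_topics(word_counts, topics_matrix):
    """Assign each tweet to its highest-scoring topic.

    word_counts:   one row per tweet, one column per vocabulary word (M columns).
    topics_matrix: M rows (words) by N columns (topics).
    Returns a list with the index of the chosen topic for each tweet.
    """
    n_topics = len(topics_matrix[0])
    assignments = []
    for row in word_counts:
        # Entry j of the product row: sum over words i of count_i * topics_matrix[i][j].
        scores = [
            sum(count * topics_matrix[i][j] for i, count in enumerate(row))
            for j in range(n_topics)
        ]
        assignments.append(max(range(n_topics), key=scores.__getitem__))
    return assignments


# Toy vocabulary: ["baseball", "election"]; two topics (0 and 1).
topics_matrix = [
    [0.9, 0.1],  # "baseball" is much more likely under topic 0
    [0.1, 0.9],  # "election" is much more likely under topic 1
]
word_counts = [
    [2, 0],  # tweet mentioning "baseball" twice
    [0, 3],  # tweet mentioning "election" three times
]
print(assign_topics(word_counts, topics_matrix))
```

The same function can be reused for tweets that arrive after topic discovery, as long as they are converted to count vectors over the same vocabulary.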
Now, for data processing, you can still use the Text Classification reference for transforming your tweets into word count vectors via the DataSource and Preparator components. As for importing your data, if you have the tweets saved locally in a file, you can use PredictionIO's Python SDK to import them. An example is also given in the classification reference.
Feel free to ask any questions if anything isn't clear, and good luck!
So, it really depends on whether you have labelled data.
For example:
Baseball :: "I love Boston Red Sox #GoRedSox"
Sports :: "Woohoo! I love sports #winning"
Boston :: "Baseball time at Fenway Park. Red Sox FTW!"
...
Then you would be able to train a model to classify tweets against these keywords. You might be interested in the templates for MLlib Naive Bayes and Decision Trees.
If you don't have labelled data (really, who wants to label tweets manually?), you might be able to use approaches such as Topic Modeling (e.g., LDA).
I don't think there is a template for LDA, but since this is an active open source project, it wouldn't surprise me if someone has already implemented one, so it might be a good idea to ask on the PredictionIO user or dev forums.