I am building a classifier that predicts the damage level of a vehicle (e.g. high, low, medium, good). I referred to this GitHub repository:
https://github.com/raviranjan0309/Car-Damage-Detector
There is a retrained_label.txt file in models/tf_files which consists of four classes:
not,
car,
high,
low
I do not want these four classes; I want my TensorFlow model to predict one of the following:
Good,
High Damage,
Low Damage,
Medium Damage
Is this possible?
Do I need to retrain the model for these classes?
If so, how?
Thanks
The file you mentioned only has four words in it, and to be honest it is difficult to understand why they are in that file.
Normally, for any TensorFlow-related analysis, you have to retrain the model before it can predict new labels.
If you are new to ML/DL and TensorFlow, I would suggest looking into the excellent tutorials on Titanic predictors, where you use a simple dataset to predict one of two outcomes: survive or die (many examples, but here's one: https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8).
You can then use similar code with a different dataset (in this case, I guess, a car dataset) to have it predict one of four possible damage outcomes. The only problem, of course, is getting that dataset.
Without at least 1,000 or so data points with car information where the damage is already labeled, it would be quite challenging.
So just to summarize:
1) Yes, you have to retrain, and you probably need a different dataset too.
2) You may be able to create a dataset with damage info based on what you already have.
3) Once the training/testing sets are ready, you can retrain using simple ML techniques.
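As a rough illustration of why editing the label file alone is not enough: assuming the file is just an index-to-name lookup for the model's output slots (which is common for retrained image classifiers, but is an assumption about this repository), swapping the names only renames the outputs. The scores below are made up.

```python
# Hypothetical illustration: a labels file only names the model's output slots.
old_labels = ["not", "car", "high", "low"]  # as listed in the question
new_labels = ["Good", "High Damage", "Low Damage", "Medium Damage"]

def predict_label(scores, labels):
    """Return the label whose output slot has the highest score."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best_index]

# Made-up scores from a network that was trained on the OLD classes.
scores = [0.05, 0.10, 0.70, 0.15]

print(predict_label(scores, old_labels))  # slot 2 under the old names
print(predict_label(scores, new_labels))  # same slot, merely renamed
```

The network still fires on whatever it learned for slot 2; only retraining on images labeled with the new classes changes what each slot means.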
I am training a CNN model for a regression problem. Afterwards, I bin the predicted labels into 4 classes and check some accuracy metrics. In the confusion matrix, the accuracy of classes 2 and 3 is around 54%, while the accuracy of classes 1 and 4 is above 90%. Labels are between 0 and 100, and the classes are 1: 0-45, 2: 45-60, 3: 60-70, 4: 70-100. I do not know where the problem comes from. Is it because of the distribution of labels in the training set, and what is the solution? Regards...
I attached the plot in the following link.
Training set target distribution
It's not a good idea to create classes that way. When some classes get a narrower window of values (here class 2 covers 15 values while class 1 covers 45), it is intrinsically harder for your model to land in class 2, and the best thing the model can learn during training is to avoid class 2 as much as possible.
You can confirm this by looking at the false negatives for classes 2 and 3; if there are too many, this is likely the cause.
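A minimal sketch of that check, using the class edges from the question. The boundary handling (a value exactly on an edge goes to the higher class) and the sample numbers are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Class edges from the question: 1: 0-45, 2: 45-60, 3: 60-70, 4: 70-100.
# Assumption: a value exactly on an edge goes to the higher class.
EDGES = [45, 60, 70]

def to_class(value):
    """Map a regression output in [0, 100] to a class in 1..4."""
    return bisect_right(EDGES, value) + 1

def false_negative_rate(y_true, y_pred):
    """Per-class share of true members that were assigned to another class."""
    true_cls = [to_class(v) for v in y_true]
    pred_cls = [to_class(v) for v in y_pred]
    totals = Counter(true_cls)
    misses = Counter(t for t, p in zip(true_cls, pred_cls) if t != p)
    return {c: misses[c] / totals[c] for c in sorted(totals)}

# Made-up targets and predictions, each off by only a few points.
y_true = [10, 30, 44, 50, 58, 62, 68, 80, 95]
y_pred = [14, 26, 41, 44, 61, 59, 72, 77, 90]
print(false_negative_rate(y_true, y_pred))
```

In this toy data every prediction is within six points of its target, yet all misses fall in the narrow classes 2 and 3.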
The best thing to do would be to divide your output space into equal portions and trust your model to learn which classes are less frequent, without trying to force that proportion yourself.
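Equal portions could be sketched like this; the function name and the choice of four classes over 0-100 are just for illustration.

```python
def equal_width_class(value, n_classes=4, low=0.0, high=100.0):
    """Map a value in [low, high] to one of n_classes equally wide classes (1-based)."""
    width = (high - low) / n_classes
    index = int((value - low) // width)
    return min(index, n_classes - 1) + 1  # keep value == high in the top class

print([equal_width_class(v) for v in (0, 25, 50, 75, 100)])
```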
If you still don't get good results, you have to improve your model in other ways; using data augmentation to get a more uniform distribution of training samples may help.
If this doesn't sound convincing to you, have a look at this paper:
https://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf
In end-to-end models for autonomous driving, neural networks have to predict classes indicating the steering angle. The distribution of these values is highly imbalanced as most of the time the car is going straight. Despite this, the best models do not discriminate against some classes to adapt to data distribution.
Good luck!
What approach should I take when I want my CNN multi-class network to output something like [0.1, 0.1] when the image doesn't belong to any class? Using softmax and categorical_crossentropy for multi-class gives me output that sums to 1, so that is still not what I want.
I'm new to neural networks, so sorry for the silly question, and thanks in advance for any help.
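To make the constraint concrete, here is a small numeric sketch with made-up logits. It contrasts softmax with independent sigmoid outputs, which is one commonly used alternative (an assumption added here, not something proposed in the thread) when all scores should be allowed to be low at once.

```python
import math

def softmax(logits):
    """Normalized exponentials: the outputs always sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Independent per-class scores: each lies in (0, 1), with no sum constraint."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

logits = [-2.2, -2.2]  # made-up logits for an image matching neither class
print(softmax(logits))  # forced to share a total of 1
print(sigmoid(logits))  # both can be small at the same time
```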
I think you will want to look into Bayesian learning. First, let's talk about uncertainty.
For example, given several pictures of dog breeds as training data—when a user uploads a photo of his dog—the hypothetical website should return a prediction with rather high confidence. But what should happen if a user uploads a photo of a cat and asks the website to decide on a dog breed?
The above is an example of out of distribution test data. The model has been trained on photos of dogs of different breeds, and has (hopefully) learnt to distinguish between them well. But the model has never seen a cat before, and a photo of a cat would lie outside of the data distribution the model was trained on. This illustrative example can be extended to more serious settings, such as MRI scans with structures a diagnostics system has never observed before, or scenes an autonomous car steering system has never been trained on.
A possible desired behaviour of a model in such cases would be to return a prediction (attempting to extrapolate far away from our observed data), but return an answer with the added information that the point lies outside of the data distribution. We want our model to possess some quantity conveying a high level of uncertainty with such inputs (alternatively, conveying low confidence).
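One such quantity is the entropy of the averaged class probabilities over several stochastic predictions for the same input. The sketch below assumes you already have those probability vectors from somewhere (for example repeated forward passes with dropout left on, or an ensemble; both are assumptions here), and the numbers are made up.

```python
import math

def predictive_entropy(prob_samples):
    """Entropy (in nats) of the mean class distribution over several predictions."""
    n_classes = len(prob_samples[0])
    mean = [sum(p[k] for p in prob_samples) / len(prob_samples)
            for k in range(n_classes)]
    return -sum(m * math.log(m) for m in mean if m > 0)

# Made-up probability vectors over three dog breeds, three predictions each.
dog_photo = [[0.90, 0.05, 0.05], [0.85, 0.10, 0.05], [0.92, 0.04, 0.04]]
cat_photo = [[0.70, 0.20, 0.10], [0.10, 0.75, 0.15], [0.20, 0.10, 0.70]]

print(predictive_entropy(dog_photo))  # predictions agree: low entropy
print(predictive_entropy(cat_photo))  # predictions disagree: high entropy
```

A threshold on this value could then be used to report "none of the known classes" instead of forcing a breed.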
Then I think you could skim this paper, where they also apply it to a classification task and generate uncertainty for classes (dog, cat...). From there you can extend your findings to an application using this paper, and I think you will find what you want.
I have been given the task of classifying news text into one of the following 5 categories: Business, Sports, Entertainment, Tech and Politics.
About the data I am using:
It consists of text data labeled as one of the 5 types of news statement (BBC news data).
I am currently using NLP with the nltk module to calculate the frequency distribution of every word in the training data with respect to each category (except the stopwords).
Then I classify the new data by calculating the sum of weights of all the words with respect to each of those 5 categories. The class with the most weight is returned as the output.
Here's the actual code.
This algorithm does predict new data accurately, but I am interested in other simple algorithms that I can implement to achieve better results. I have used the Naive Bayes algorithm to classify data into two classes (spam or not spam, etc.) and would like to know how to implement it for multiclass classification, if it is a feasible solution.
Thank you.
In classification, and especially in text classification, choosing the right machine learning algorithm often comes after selecting the right features. Features are domain dependent and require knowledge about the data, but good features lead to better systems more quickly than tuning or selecting algorithms and parameters.
In your case you can go with word embeddings, as already said, but you can also design your own custom features that you think will help discriminate between classes (whatever the number of classes is). For instance, how do you think a spam e-mail is often presented? A lot of mistakes, syntactic inversions, bad translation, punctuation, slang words... A lot of possibilities! Try to think about your case with sport, business, news, etc.
You should try some new ways of creating/combining features and then choose the best algorithm. Also, have a look at weighting methods other than term frequencies, such as tf-idf.
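For reference, a bare-bones tf-idf weighting could look like the sketch below. It uses length-normalized term frequency and idf = log(N / document frequency); libraries often use smoothed variants, so treat the exact formula and the toy documents as assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf is the term count divided by document length;
    idf is log(number of documents / number of documents containing the term).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n_docs / doc_freq[t])
                        for t, c in counts.items()})
    return weights

# Made-up, already tokenized documents.
docs = [
    "the match ended in a draw".split(),
    "the shares fell after the report".split(),
    "the new phone ships in march".split(),
]
w = tf_idf(docs)
print(w[0])
```

A word that occurs in every document ("the") gets weight zero, while a word unique to one document ("match") gets the highest weight there.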
Since you're dealing with words, I would propose word embeddings, which give more insight into the relationships/meanings of words w.r.t. your dataset and thus much better classification.
If you are looking for other implementations of classification, you can check my sample code here; these models from scikit-learn can easily handle multiple classes. Take a look at the scikit-learn documentation here.
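On the multiclass question itself: Naive Bayes needs no change for more than two classes, since it just picks the class with the highest posterior score. Below is a dependency-free sketch with add-one smoothing; the class name TinyNaiveBayes and the one-line training documents are made up, and in practice a library implementation would be the sensible choice.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing, for any number of classes."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {t for doc in docs for t in doc}
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        class_counts = Counter(labels)
        self.log_prior = {c: math.log(class_counts[c] / len(labels))
                          for c in self.classes}
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        def score(c):
            denom = self.totals[c] + len(self.vocab)
            # Words never seen in training are skipped.
            return self.log_prior[c] + sum(
                math.log((self.word_counts[c][t] + 1) / denom)
                for t in doc if t in self.vocab)
        return max(self.classes, key=score)

# Made-up training data, one tiny document per category.
train = [
    ("shares profit market bank", "business"),
    ("match goal team league", "sport"),
    ("film actor album award", "entertainment"),
    ("software phone chip internet", "tech"),
    ("election minister vote party", "politics"),
]
model = TinyNaiveBayes().fit([text.split() for text, _ in train],
                             [label for _, label in train])
print(model.predict("the team scored a late goal".split()))
print(model.predict("minister wins the vote".split()))
```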
If you want an easy-to-use framework around these classifiers, you can check out my rasa-nlu; it uses the spacy_sklearn model, and sample implementation code is here. All you have to do is prepare the dataset in the given format and train the model.
If you want more intelligence, you can check out my Keras implementation here; it uses a CNN for text classification.
Hope this helps.
I would like to use scikit-learn's svm.SVC() estimator to perform classification tasks on multi-dimensional time series - that is, on time series where the points in the series take values in R^d, where d > 1.
The issue with doing this is that svm.SVC() will only take ndarray objects of dimension at most 2, whereas the dimension of such a dataset would be 3. Specifically, the shape of a given dataset would be (n_samples, n_features, d).
Is there a workaround available? One simple solution would be to reshape the dataset so that it is 2-dimensional; however, I imagine this would lead to the classifier not learning from the dataset properly.
Without any further knowledge about the data, reshaping is the best you can do. Feature engineering is a very manual art that depends heavily on domain knowledge.
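The reshape itself is a one-liner in NumPy. The array sizes below are made up; the flattened array has the 2-D shape that svm.SVC().fit expects.

```python
import numpy as np

# Made-up data: 6 series, 50 time points each, every point in R^3.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50, 3))

# Flatten each series into one row: shape (n_samples, n_features * d).
X_flat = X.reshape(len(X), -1)
print(X_flat.shape)
```

Each row keeps the time order: the first d entries are the first time point, the next d entries the second, and so on.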
As a rule of thumb: if you don't really know anything about the data, throw in the raw data and see if it works. If you have an idea which properties of the data may be beneficial for classification, try to work them into a feature.
Say we want to classify swiping patterns on a touch screen. This closely resembles your data: We acquired many time series of such patterns by recording the 2D position every few milliseconds.
In the raw data, each time series is characterized by n_timepoints * 2 features. We can use that directly for classification. If we have additional knowledge we can use that to create additional/alternative features.
Let's assume we want to distinguish between zig-zag and wavy patterns. In that case smoothness (however that is defined) may be a very informative feature that we can add as a further column to the raw data.
On the other hand, if we want to distinguish between slow and fast patterns, the instantaneous velocity may be a good feature. However, the velocity can be computed as a simple difference along the time axis. Even linear classifiers can model this easily so it may turn out that such features, although good in principle, do not improve classification of raw data.
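The "simple difference along the time axis" could be sketched as follows; the sampling interval dt and the swipe coordinates are made up.

```python
def velocities(series, dt=1.0):
    """Finite-difference velocity between consecutive points of a multi-dimensional series."""
    return [[(b - a) / dt for a, b in zip(p, q)]
            for p, q in zip(series, series[1:])]

# Made-up 2D swipe sampled every 0.5 time units.
swipe = [(0.0, 0.0), (1.0, 0.5), (3.0, 1.0), (6.0, 1.0)]
print(velocities(swipe, dt=0.5))
```

Every output value is a fixed linear combination of two raw inputs, which is why a linear classifier on the raw data can already represent it.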
If you have lots and lots and lots and lots of data (say an internet full of good examples), deep learning neural networks can automatically learn features to some extent, but let's say this is rather advanced. In the end, most practical applications come down to trial and error. See what features you can come up with and try them out in practice. And beware the overfitting gremlin.
My machine learning goal is to search for potential risks (will cost more money) and opportunities (will save money) from a Project Requirements document.
My idea is to classify sentences from the data into one of these categories: Risk, Opportunity and Irrelevant (no risk, no opportunity; the default category).
I will use a multinomial Naive Bayes classifier for this with tf-idf.
Now I need data for my training set and test set. The way I will do this is to label every sentence from the requirement documents with 1 of the 3 categories. Is this a good approach?
Or should I only label sentences which are obviously a risk/opportunity/irrelevant?
Also, is the Irrelevant category a good idea?
I believe the three-class approach is a good one. This is similar to sentiment analysis, where you typically have positive, negative and neutral documents (or sentences). The neutral class comprises the vast majority of the instances, so your classification problem will be unbalanced. That is not necessarily an issue, but for difficult problems like this one, a Naive Bayes classifier might simply put everything in the neutral/irrelevant bucket, since the prior for neutral will be quite high.
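A toy calculation shows how the prior can win. All numbers are made up: the sentence is assumed to be five times more likely under Risk than under Irrelevant, yet Irrelevant still gets the highest posterior.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: normalize prior * likelihood over the classes."""
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(joint.values())
    return {c: joint[c] / total for c in joint}

# Made-up class priors reflecting a heavily unbalanced dataset.
priors = {"Risk": 0.05, "Opportunity": 0.05, "Irrelevant": 0.90}
# Made-up likelihoods of one sentence under each class.
likelihoods = {"Risk": 0.010, "Opportunity": 0.002, "Irrelevant": 0.002}

post = posterior(priors, likelihoods)
print(post)
print(max(post, key=post.get))
```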
Your sampling (labeling) should be representative of reality. Don't try to create a dataset of 1000 risk, 1000 opportunity, 1000 irrelevant. Instead, take a sample of, say, 10000 requirements and assign the proper label to each, even if it means having many more 'Irrelevant' than 'Risk', for instance.
Text classification models require many instances, since the search space is vast. I wonder if you have considered the fact that to get reliable results (say over 90%), you may need to manually label thousands of instances.
Even if you have thousands of training instances, your problem looks particularly difficult, unless there are some obvious keywords to trigger 'risk' or 'opportunity' that I'm not aware of. Ask yourself: would this be easy for a human to judge? If you asked 3 judges to classify your requirements, would they all come up with the same answer? If not, then you might need tens of thousands of training documents, and the classification accuracy may still be disappointing.
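The three-judge question can be checked on a small pilot sample before labeling thousands of sentences. This sketch measures the share of items on which all judges agree; the judges' labels are made up, and chance-corrected measures exist if you need something stricter.

```python
def unanimous_agreement(*judges):
    """Share of items on which every judge assigned the same label."""
    items = list(zip(*judges))
    return sum(len(set(labels)) == 1 for labels in items) / len(items)

# Made-up labels from three judges for the same five sentences.
judge_a = ["Risk", "Irrelevant", "Irrelevant", "Opportunity", "Irrelevant"]
judge_b = ["Risk", "Irrelevant", "Risk", "Opportunity", "Irrelevant"]
judge_c = ["Irrelevant", "Irrelevant", "Irrelevant", "Opportunity", "Irrelevant"]

print(unanimous_agreement(judge_a, judge_b, judge_c))
```

If humans only agree on a modest share of sentences, a classifier trained on their labels is unlikely to do much better.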