Default value in Svm prediction Scikitlearn - python-3.x

I am using scikitlearn for svm classification.
I need a classifier that returns default value when a given test item doesn't match any of the training-set items, i.e. when the distance is very high. Is that possible?
For Example
Let's say my training-set is
X= [[0.5,0.5,2],[4, 4,16],[16, 16,64]]
and labels
y=[0,1,2]
then I run training
clf = svm.SVC()
clf.fit(X, y)
then I run prediction
clf.predict([-100,-100,-200])
Now as we can see the test-item [-100,-100,-200] is too far away from any of the training-items, in this case the prediction will yield [2] which is this item [16, 16,64], is there anyway to make it return anything else (not from training-set)?

I think you can create a label for those big values, and added into your training set.
X= [[0.5,0.5,2],[4, 4,16],[16, 16,64],[-100,-100,200]]
Y=[0,1,2,100]
and give a try.
Since SVM is supervised learning, which means the 'OUTPUT' have to be specified. If you are not certain about the 'OUTPUT', do some non supervised clustering (kmeans for example), and have a rough idea how many possible 'OUTPUT' you will expect.

Related

SkLearn SVM - How to get multiple predictions ordered by probability?

I am doing some text classification.
Let's say I have 10 categories and 100 "samples", where each sample is a sentence of text. I have split my samples into 80:20 (training, testing) and trained the SVM classifier:
text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words=('english'),ngram_range=(1,2))), ('tfidf', TfidfTransformer()),
('clf-svm', SGDClassifier(loss='hinge', penalty='l2', random_state=42, learning_rate='adaptive', eta0=0.9))])
# Fit training data to SVM classifier, predict with testing data and print accuracy
text_clf_svm = text_clf_svm.fit(training_data, training_sub_categories)
Now when it comes to predicting, I do not want just a single category to be predicted. I want to see, for example, a list of the "top 5" categories for a given unseen sample as well as their associated probabilities:
top_5_category_predictions = text_clf_svm.predict(a_single_unseen_sample)
Since text_clf_svm.predict returns a value which represents the index of the categories available, I want to see something like this as output:
[(4,0.70),(1,0.20),(7,0.04),(9,0.06)]
Anyone know how to achieve this?
This is something I had used a while back for a similar problem:
probs = clf.predict_proba(X_test)
# Sort desc and only extract the top-n
top_n_category_predictions = np.argsort(probs)[:,:-n-1:-1]
This will give you the top n categories for each sample.
If you also want to see the probabilities corresponding to these categories, then you can do:
top_n_probs = np.sort(probs)[:,:-n-1:-1]
Note: Here X_test is of shape (n_samples, n_features). So make sure you use your single_unseen_sample in the same format.

How does TfidfVectorizer compute scores on test data

In scikit-learn TfidfVectorizer allows us to fit over training data, and later use the same vectorizer to transform over our test data.
The output of the transformation over the train data is a matrix that represents a tf-idf score for each word for a given document.
However, how does the fitted vectorizer compute the score for new inputs? I have guessed that either:
The score of a word in a new document computed by some aggregation of the scores of the same word over documents in the training set.
The new document is 'added' to the existing corpus and new scores are calculated.
I have tried deducing the operation from scikit-learn's source code but could not quite figure it out. Is it one of the options I've previously mentioned or something else entirely?
Please assist.
It is definitely the former: each word's idf (inverse document-frequency) is calculated based on the training documents only. This makes sense because these values are precisely the ones that are calculated when you call fit on your vectorizer. If the second option you describe was true, we would essentially refit a vectorizer each time, and we would also cause information leak as idf's from the test set would be used during model evaluation.
Beyond these purely conceptual explanations, you can also run the following code to convince yourself:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
x_train = ["We love apples", "We really love bananas"]
vect.fit(x_train)
print(vect.get_feature_names())
>>> ['apples', 'bananas', 'love', 'really', 'we']
x_test = ["We really love pears"]
vectorized = vect.transform(x_test)
print(vectorized.toarray())
>>> array([[0. , 0. , 0.50154891, 0.70490949, 0.50154891]])
Following the reasoning of how the fit methodology works, you can recalculate these tfidf values yourself:
"apples" and "bananas" obviously have a tfidf score of 0 because they do not appear in x_test. "pears", on the other hand, does not exist in x_train and so will not even appear in the vectorization. Hence, only "love", "really" and "we" will have a tfidf score.
Scikit-learn implements tfidf as log((1+n)/(1+df) + 1) * f where n is the number of documents in the training set (2 for us), df the number of documents in which the word appears in the training set only, and f the frequency count of the word in the test set. Hence:
tfidf_love = (np.log((1+2)/(1+2))+1)*1
tfidf_really = (np.log((1+2)/(1+1))+1)*1
tfidf_we = (np.log((1+2)/(1+2))+1)*1
You then need to scale these tfidf scores by the L2 distance of your document:
tfidf_non_scaled = np.array([tfidf_love,tfidf_really,tfidf_we])
tfidf_list = tfidf_non_scaled/sum(tfidf_non_scaled**2)**0.5
print(tfidf_list)
>>> [0.50154891 0.70490949 0.50154891]
You can see that indeed, we are getting the same values, which confirms the way scikit-learn implemented this methodology.

How to learn the embeddings in Pytorch and retrieve it later

I am building a recommendation system where I predict the best item for each user given their purchase history of items. I have userIDs and itemIDs and how much itemID was purchased by userID. I have Millions of users and thousands of products. Not all products are purchased(there are some products that no one has bought them yet). Since the users and items are big I don't want to use one-hot vectors. I am using pytorch and I want to create and train the embeddings so that I can make the predictions for each user-item pair. I followed this tutorial https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html. If it's an accurate assumption that the embedding layer is being trained, then do I retrieve the learned weights through model.parameters() method or should I use the embedding.data.weight option?
model.parameters() returns all the parameters of your model, including the embeddings.
So all these parameters of your model are handed over to the optimizer (line below) and will be trained later when calling optimizer.step() - so yes your embeddings are trained along with all other parameters of the network.(you can also freeze certain layers by setting i.e. embedding.weight.requires_grad = False, but this is not the case here).
# summing it up:
# this line specifies which parameters are trained with the optimizer
# model.parameters() just returns all parameters
# embedding class weights are also parameters and will thus be trained
optimizer = optim.SGD(model.parameters(), lr=0.001)
You can see that your embedding weights are also of type Parameter by doing so:
import torch
embedding_maxtrix = torch.nn.Embedding(10, 10)
print(type(embedding_maxtrix.weight))
This will output the type of the weights, which is Parameter:
<class 'torch.nn.parameter.Parameter'>
I'm not entirely sure what mean by retrieve. Do you mean getting a single vector, or do you want just the whole matrix to save it, or do something else?
embedding_maxtrix = torch.nn.Embedding(5, 5)
# this will get you a single embedding vector
print('Getting a single vector:\n', embedding_maxtrix(torch.LongTensor([0])))
# of course you can do the same for a seqeunce
print('Getting vectors for a sequence:\n', embedding_maxtrix(torch.LongTensor([1, 2, 3])))
# this will give the the whole embedding matrix
print('Getting weights:\n', embedding_maxtrix.weight.data)
Output:
Getting a single vector:
tensor([[-0.0144, -0.6245, 1.3611, -1.0753, 0.5020]], grad_fn=<EmbeddingBackward>)
Getting vectors for a sequence:
tensor([[ 0.9277, -0.1879, -1.4999, 0.2895, 0.8367],
[-0.1167, -2.2139, 1.6918, -0.3483, 0.3508],
[ 2.3763, -1.3408, -0.9531, 2.2081, -1.5502]],
grad_fn=<EmbeddingBackward>)
Getting weights:
tensor([[-0.0144, -0.6245, 1.3611, -1.0753, 0.5020],
[ 0.9277, -0.1879, -1.4999, 0.2895, 0.8367],
[-0.1167, -2.2139, 1.6918, -0.3483, 0.3508],
[ 2.3763, -1.3408, -0.9531, 2.2081, -1.5502],
[-0.5829, -0.1918, -0.8079, 0.6922, -0.2627]])
I hope this answers your question, you can also take a look at the documentation, there you can find some useful examples as well.
https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding

How to use cross_val_predict to predict probabilities for a new dataset?

I am using sklearn's cross_val_predict for training like so:
myprobs_train = cross_val_predict(LogisticRegression(),X = x_old, y=y_old, method='predict_proba', cv=10)
I am happy with the returned probabilities, and would like now to score up a brand-new dataset. I tried:
myprobs_test = cross_val_predict(LogisticRegression(), X =x_new, y= None, method='predict_proba',cv=10)
but this did not work, it's complaining about y having zero shape. Does it mean there's no way to apply the trained and cross-validated model from cross_val_predict on new data? Or am I just using it wrong?
Thank you!
You are looking at a wrong method. Cross validation methods do not return a trained model; they return values that evaluate the performance of a model (logistic regression in your case). Your goal is to fit some data and then generate prediction for new data. The relevant methods are fit and predict of the LogisticRegression class. Here is the basic structure:
logreg = linear_model.LogisticRegression()
logreg.fit(x_old, y_old)
predictions = logreg.predict(x_new)
I have the same concern as #user3490622. If we can only use cross_val_predict on training and testing sets, why y (target) is None as the default value? (sklearn page)
To partially achieve the desired results of multiple predicted probability, one could use the fit then predict approach repeatedly to mimic the cross-validation.

How to apply random forest properly?

I am new to machine learning and python. Now I am trying to apply random forest to predict binary results of a target. In my data I have 24 predictors (1000 observations) where one of them is categorical(gender) and all the others numerical. Among numerical ones, there are two types of values which are volume of money in euros (very skewed and scaled) and numbers (number of transactions from an atm). I have transformed the big scale features and did the imputation. Last, I have checked correlation and collinearity and based on that removed some features (as a result I had 24 features.) Now when I implement RF it is always perfect in the training set while the ratios not so good according to crossvalidation. And even applying it in the test set it gives very very low recall values. How should I remedy this?
def classification_model(model, data, predictors, outcome):
# Fit the model:
model.fit(data[predictors], data[outcome])
# Make predictions on training set:
predictions = model.predict(data[predictors])
# Print accuracy
accuracy = metrics.accuracy_score(predictions, data[outcome])
print("Accuracy : %s" % "{0:.3%}".format(accuracy))
# Perform k-fold cross-validation with 5 folds
kf = KFold(data.shape[0], n_folds=5)
error = []
for train, test in kf:
# Filter training data
train_predictors = (data[predictors].iloc[train, :])
# The target we're using to train the algorithm.
train_target = data[outcome].iloc[train]
# Training the algorithm using the predictors and target.
model.fit(train_predictors, train_target)
# Record error from each cross-validation run
error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))
print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))
# Fit the model again so that it can be refered outside the function:
model.fit(data[predictors], data[outcome])
outcome_var = 'Sold'
model = RandomForestClassifier(n_estimators=20)
predictor_var = train.drop('Sold', axis=1).columns.values
classification_model(model,train,predictor_var,outcome_var)
#Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)
outcome_var = 'Sold'
model = RandomForestClassifier(n_estimators=20, max_depth=20, oob_score = True)
predictor_var = ['fet1','fet2','fet3','fet4']
classification_model(model,train,predictor_var,outcome_var)
In Random Forest it is very easy to overfit. To resolve this you need to do parameter search a little more rigorously to know the best parameter to use. [Here](http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html
) is the link on how to do this: (from the scikit doc).
It is overfitting and you need to search for the best parameter that will work work on the model. The link provides implementation for Grid and Randomized search for hyper parameter estimation.
And it will also be fun to go through this MIT Artificial Intelligence lecture to get get deep theoretical orientation: https://www.youtube.com/watch?v=UHBmv7qCey4&t=318s.
Hope this helps!

Resources