I'm really new to ML. I trained a model on my dataset and saved it with pickle. The training dataset has text and a value. I'm trying to get predictions for a new dataset, which has only text.
However, when I try to predict values for the new data with the trained model, I get this error:
ValueError: Number of features of the model must match the input. Model n_features is 17804 and input n_features is 24635
You can see my code below. What should I do at this point?
with open('trained.pickle', 'rb') as read_pickle:
    loaded = pickle.load(read_pickle)
dataset2 = pandas.read_csv('/root/Desktop/predict.csv', encoding='cp1252')
X2_train = dataset2['text']
train_tfIdf = vectorizer_tfidf.fit_transform(X2_train.values.astype('U'))
x = loaded.predict(train_tfIdf)
print(x)
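fit_transform fits the vectorizer to the data and then transforms it, which you don't want at prediction time. It retrains the TF-IDF vectorizer on the new text, so it learns a different vocabulary and the number of features (24635) no longer matches what the model was trained on (17804). For prediction, call only the transform method, on the same vectorizer object that was fitted on the training data. That means the fitted vectorizer has to be saved along with the model.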
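A minimal sketch of that workflow. The question does not show the vectorizer or the model, so TfidfVectorizer and LogisticRegression are assumed stand-ins here, and the texts and labels are made up:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data; the texts and labels are made up for illustration.
train_texts = ["good movie", "great film", "bad movie", "awful film"]
train_labels = [1, 1, 0, 0]
new_texts = ["good film with a surprising ending", "awful"]

# Training time: fit the vectorizer once, on the training text only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
model = LogisticRegression().fit(X_train, train_labels)

# Persist the fitted vectorizer together with the model.
blob = pickle.dumps({"vectorizer": vectorizer, "model": model})

# Prediction time: reload both and call transform(), not fit_transform().
loaded = pickle.loads(blob)
X_new = loaded["vectorizer"].transform(new_texts)
predictions = loaded["model"].predict(X_new)

print(X_train.shape[1], X_new.shape[1])  # both use the training vocabulary
print(predictions)
```

Words in the new text that were not in the training vocabulary are simply ignored by transform, which is why the column count stays fixed.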
I'm trying to train a network using data augmentation.
First I use ImageDataGenerator with validation_split=0.2.
train_generator = ImageDataGenerator(
    rotation_range=90,
    zoom_range=0.15,
    width_shift_range=0.2,
    height_shift_range=0.2,
    fill_mode="nearest",
    validation_split=0.2
)
Then I tried to create an augmented training set and a non-augmented validation set.
I have to use flow instead of flow_from_directory.
train_augm = train_generator.flow([data_train, ebv_train], z_train, batch_size=128,subset='training')
valid_augm = train_generator.flow([data_train, ebv_train], z_train, batch_size=1,subset='validation')
I get this error message:
ValueError: Training and validation subsets have different number of classes after the split. If your numpy arrays are sorted by the label, you might want to shuffle them.
What am I doing wrong?
The model.fit code is something like this:
training_history = model.fit(
    train_augm,
    steps_per_epoch=len(data_train) // 128,
    epochs=10,
    validation_data=valid_augm
)
The error means that the classes found in the training subset are not the same as the classes found in the validation subset. validation_split cuts the arrays at a fixed position without shuffling, so if you haven't shuffled the data, shuffle it first. If you still get the error, I assume some classes have very few samples. Reshuffling may help, but you can hit the same error again. In that case, either add more data for those classes or split the data into training and validation sets manually.
For a random split, take a look at scikit-learn's train_test_split function.
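A minimal sketch of the manual split, assuming random arrays as stand-ins for data_train, ebv_train and z_train (the shapes are made up). train_test_split accepts several arrays at once and splits them with the same shuffled indices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for the arrays in the question; the shapes are assumptions.
data_train = rng.random((100, 8, 8, 3))  # images
ebv_train = rng.random((100, 1))         # second model input
z_train = rng.random(100)                # target

# Shuffle and split all three arrays consistently, 80% / 20%.
(data_tr, data_val,
 ebv_tr, ebv_val,
 z_tr, z_val) = train_test_split(
    data_train, ebv_train, z_train,
    test_size=0.2, shuffle=True, random_state=0,
)

print(data_tr.shape, data_val.shape)
```

The training part can then go through the augmenting generator's flow without the subset argument, while the validation arrays can be passed to model.fit unaugmented, for example as validation_data=([data_val, ebv_val], z_val). If z_train is a continuous target rather than a class label, the two subsets will almost never contain exactly the same set of values, so the manual split is probably the more robust route in that case.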
I have a multilabel prediction with a scikit-learn pipeline. It is working properly in terms of internal testing and getting metrics for each of the label predictions. However, I'm having trouble getting the right structure for data output. When I run code on unseen/external data, it apparently runs through predictions for each of the labels but replaces the values in the same column. So I only get one column of predictions.
This data set involves more than 20 labels (categories), and it's part of an NLP model. Each of the labels is binarized (0 or 1). I am new and really appreciate the help. Thank you!
Here are three parts to the code: (1) pipeline, (2) for loop for test/validation data with fit/predict, and (3) attempts at coding the predict function for external data.
1) Pipeline:
SVC_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=LemmaTokenizer(), min_df=8)),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=6)),
])
2) For loop:
for category in categories:
    print('processing {}'.format(category))
    # train
    SVC_pipeline.fit(X_train, train[category])
    # test
    prediction = SVC_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
3) Predict external data:
doctext = sampdf['doc_text']
pred = SVC_pipeline.predict(doctext)
Also tried this:
for category in categories:
    print('... Processing {}'.format(category))
    svcpredict = SVC_pipeline.predict(testthis)
np.savetxt("/Users/.../Dropbox/.../svcpredicts.csv", svcpredict)
I also tried a few other variations, but they all had the same result. The loop ran through all labels and gave me different metrics for each category, but the output only contained one column of predictions.
Thanks!
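One likely explanation, sketched under assumptions: the loop in part 2 refits the same SVC_pipeline object for every category, so once the loop finishes it only holds the model for the last category, and every later predict call returns that one label. A possible fix is to keep one fitted copy of the pipeline per category and collect the predictions column by column. In the sketch below the toy data, the "travel" and "cheap" labels and the default tokenizer are made up; LemmaTokenizer and min_df=8 are left out so the example runs on its own.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the training frame and the external documents.
train = pd.DataFrame({
    "doc_text": ["cheap flights to paris", "paris hotel deals",
                 "python pandas tutorial", "learn python fast",
                 "cheap python course", "hotel in rome"],
    "travel": [1, 1, 0, 0, 0, 1],
    "cheap": [1, 0, 0, 0, 1, 0],
})
categories = ["travel", "cheap"]
sampdf = pd.DataFrame({"doc_text": ["cheap hotel in paris", "python course"]})

base_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])

# Fit an independent copy of the pipeline for every label and keep it.
fitted = {}
for category in categories:
    fitted[category] = clone(base_pipeline).fit(train["doc_text"], train[category])

# Predict each label with its own pipeline: one output column per label.
predictions = pd.DataFrame(
    {category: fitted[category].predict(sampdf["doc_text"]) for category in categories},
    index=sampdf.index,
)
print(predictions)
```

The resulting DataFrame can be written out with predictions.to_csv(...), which keeps the label names as column headers.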
I am building a recommendation system in which I predict the best item for each user given their purchase history. I have userIDs, itemIDs, and how much of each itemID was purchased by each userID. I have millions of users and thousands of products. Not all products have been purchased (some have not been bought by anyone yet). Since there are so many users and items, I don't want to use one-hot vectors. I am using PyTorch, and I want to create and train embeddings so that I can make predictions for each user-item pair. I followed this tutorial: https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html. If it's accurate to assume that the embedding layer is being trained, do I retrieve the learned weights through the model.parameters() method, or should I use the embedding.data.weight option?
model.parameters() returns all the parameters of your model, including the embeddings.
All of these parameters are handed over to the optimizer (see the optimizer line below) and are updated later when optimizer.step() is called. So yes, your embeddings are trained along with all the other parameters of the network. (You can also freeze certain layers, e.g. by setting embedding.weight.requires_grad = False, but that is not the case here.)
# summing it up:
# this line specifies which parameters are trained with the optimizer
# model.parameters() just returns all parameters
# embedding class weights are also parameters and will thus be trained
optimizer = optim.SGD(model.parameters(), lr=0.001)
You can check that your embedding weights are also of type Parameter like this:
import torch
embedding_matrix = torch.nn.Embedding(10, 10)
print(type(embedding_matrix.weight))
This will output the type of the weights, which is Parameter:
<class 'torch.nn.parameter.Parameter'>
I'm not entirely sure what you mean by "retrieve". Do you want a single vector, the whole matrix so you can save it, or something else?
embedding_matrix = torch.nn.Embedding(5, 5)
# this will get you a single embedding vector
print('Getting a single vector:\n', embedding_matrix(torch.LongTensor([0])))
# of course you can do the same for a sequence
print('Getting vectors for a sequence:\n', embedding_matrix(torch.LongTensor([1, 2, 3])))
# this will give you the whole embedding matrix
print('Getting weights:\n', embedding_matrix.weight.data)
Output:
Getting a single vector:
tensor([[-0.0144, -0.6245, 1.3611, -1.0753, 0.5020]], grad_fn=<EmbeddingBackward>)
Getting vectors for a sequence:
tensor([[ 0.9277, -0.1879, -1.4999, 0.2895, 0.8367],
[-0.1167, -2.2139, 1.6918, -0.3483, 0.3508],
[ 2.3763, -1.3408, -0.9531, 2.2081, -1.5502]],
grad_fn=<EmbeddingBackward>)
Getting weights:
tensor([[-0.0144, -0.6245, 1.3611, -1.0753, 0.5020],
[ 0.9277, -0.1879, -1.4999, 0.2895, 0.8367],
[-0.1167, -2.2139, 1.6918, -0.3483, 0.3508],
[ 2.3763, -1.3408, -0.9531, 2.2081, -1.5502],
[-0.5829, -0.1918, -0.8079, 0.6922, -0.2627]])
I hope this answers your question. You can also take a look at the documentation, where you can find some useful examples as well:
https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding
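To tie this back to the user-item setting in the question, here is a minimal sketch. DotModel, the sizes and the toy purchase counts are hypothetical and only for illustration. It runs one SGD step, shows that a trainable embedding changes while a frozen one does not, and then reads the learned matrix from the weight attribute:

```python
import torch
from torch import nn, optim

torch.manual_seed(0)


class DotModel(nn.Module):
    """Hypothetical recommender: score = dot(user vector, item vector)."""

    def __init__(self, n_users, n_items, dim):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)

    def forward(self, users, items):
        return (self.user_emb(users) * self.item_emb(items)).sum(dim=1)


model = DotModel(n_users=4, n_items=3, dim=2)

# Freeze the item embedding only to demonstrate requires_grad = False.
model.item_emb.weight.requires_grad = False
optimizer = optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.1
)

# Toy (user, item, purchase count) triples.
users = torch.tensor([0, 1, 2])
items = torch.tensor([0, 1, 2])
counts = torch.tensor([3.0, 0.0, 1.0])

user_before = model.user_emb.weight.detach().clone()
item_before = model.item_emb.weight.detach().clone()

# One training step.
loss = nn.functional.mse_loss(model(users, items), counts)
optimizer.zero_grad()
loss.backward()
optimizer.step()

user_changed = not torch.equal(user_before, model.user_emb.weight.detach())
item_changed = not torch.equal(item_before, model.item_emb.weight.detach())
print(user_changed, item_changed)

# The learned matrix: one row per user, ready to save or inspect.
learned_user_vectors = model.user_emb.weight.detach()
print(learned_user_vectors.shape)
```

Using weight.detach() (or weight.data, as in the answer above) gives the matrix without attaching it to the autograd graph.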
I am using sklearn's cross_val_predict for training like so:
myprobs_train = cross_val_predict(LogisticRegression(),X = x_old, y=y_old, method='predict_proba', cv=10)
I am happy with the returned probabilities, and would like now to score up a brand-new dataset. I tried:
myprobs_test = cross_val_predict(LogisticRegression(), X =x_new, y= None, method='predict_proba',cv=10)
but this did not work, it's complaining about y having zero shape. Does it mean there's no way to apply the trained and cross-validated model from cross_val_predict on new data? Or am I just using it wrong?
Thank you!
You are looking at the wrong method. Cross-validation functions do not return a trained model; cross_val_predict returns out-of-fold predictions, which are meant for evaluating how a model (logistic regression in your case) performs. Your goal is to fit a model on some data and then generate predictions for new data. The relevant methods are fit and predict (or predict_proba, for probabilities) of the LogisticRegression class. Here is the basic structure:
from sklearn import linear_model

logreg = linear_model.LogisticRegression()
logreg.fit(x_old, y_old)
predictions = logreg.predict(x_new)  # use predict_proba(x_new) for probabilities
I have the same concern as @user3490622. If cross_val_predict can only be used on labelled training and testing sets, why does y (the target) default to None? (sklearn page)
To partially achieve the desired result of several predicted probabilities for the new data, one could repeat the fit-then-predict approach on different subsets of the training data to mimic cross-validation.
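A rough sketch of that idea, assuming synthetic data in place of x_old, y_old and x_new. A model is fitted on each of ten training folds, each model scores the new data, and the probabilities are averaged. Averaging is one possible way to combine them, not something cross_val_predict does itself:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-ins for x_old, y_old and x_new from the question.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_old, y_old, x_new = X[:150], y[:150], X[150:]

fold_probs = []
splitter = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, _ in splitter.split(x_old):
    # Fit on nine tenths of the labelled data, then score the new data.
    model = LogisticRegression().fit(x_old[train_idx], y_old[train_idx])
    fold_probs.append(model.predict_proba(x_new))

# Average the ten sets of probabilities into one estimate per new row.
myprobs_test = np.mean(fold_probs, axis=0)
print(myprobs_test.shape)
```

In most cases a single model fitted on all of x_old will give very similar probabilities with less work, so this loop is mainly useful when the spread between folds is of interest.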
I am planning to use an SGDClassifier in production. The idea is to train the classifier on some training data, use cPickle to dump it to a .pkl file, and reuse it later in a script. However, there are certain high-cardinality fields that are categorical in nature and are translated to a one-hot matrix representation, which creates around 5000 features. The input that I get for predict will have only one of these features set, and all the rest will be zeroes. It will of course also include the other numerical features. From the docs, it appears that the predict function expects an array of arrays as input. Is there any way I can transform my input to the format expected by the predict function without having to store the fields every time I train the model?
Update
So, let us say my input contains 3 fields:
{
rate: 10, // numeric
flagged: 0, //binary
host: 'somehost.com' // keeping this categorical
}
host can have around 5000 different values. Now I loaded the file to a pandas dataframe, used the get_dummies function to transform the host field to around 5000 new fields which are binary fields.
Then I trained my model and stored it using cPickle.
Now, when I need to use the predict function, for the input, I only have 3 fields (shown above). However, as per my understanding the predict endpoint will expect an array of vectors and each vector is supposed to have those 5000 fields.
For the entry that I need to predict, I know only one field for that entry which will be the value of host itself.
For example, if my input is
{
rate: 5,
flagged: 1,
host: 'new_host.com'
}
I know that the fields expected by the predict should be:
{
rate: 5,
flagged: 1,
new_host: 1
}
But if I translate it to vector format, I don't know which index to place the new_host field. Also, I don't know in advance what the other hosts are (unless I store them somewhere during the training phase).
I hope I am making some sense. Let me know if I am doing it the wrong way.
"I don't know which index to place the new_host field"
A good approach that has worked for me is to build a pipeline which you then use for training and prediction. This way you do not have to concern yourself with the column index of whatever output is produced by your transformation:
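# in training
# caveat: LabelBinarizer is meant for targets; as a Pipeline step its
# fit_transform(X, y) call raises a TypeError, so in practice a feature
# encoder such as OneHotEncoder is needed in its place
pipl = Pipeline(steps=[('binarizer', LabelBinarizer()),
                       ('clf', SGDClassifier())])
model = pipl.fit(X, Y)
pickle.dump(model, mf)  # mf: file object opened with mode 'wb'
# in production
model = pickle.load(mf)  # mf: the same file, opened with mode 'rb'
y = model.predict(X)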
As X, Y inputs you need to pass array-like objects. Make sure the input has the same structure for both training and prediction, e.g.
X = [[data.get('rate'), data.get('flagged'), data.get('host')]]
Y = [[y-cols]] # your example doesn't specify what is Y in your data
More flexible: Pandas DataFrame + Pipeline
What also works nicely is to use a Pandas DataFrame in combination with sklearn-pandas as it allows you to use different transformations on different column names. E.g.
df = pd.DataFrame.from_dict(data)
mapper = DataFrameMapper([
    ('host', sklearn.preprocessing.LabelBinarizer()),
    # StandardScaler expects 2-D input, so select the column with a list
    (['rate'], sklearn.preprocessing.StandardScaler())
])
pipl = Pipeline(steps=[('mapper', mapper),
                       ('clf', SGDClassifier())])
X = df[x-cols]
y = df[y-col(s)]
pipl.fit(X, y)
Note that x-cols and y-col(s) are placeholders for the lists of feature and target columns respectively.
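The same idea can also be sketched with scikit-learn alone. This is an alternative to sklearn-pandas, not what the answer above uses: a ColumnTransformer with OneHotEncoder(handle_unknown="ignore") handles the host column inside the pipeline. The column names follow the question; the values, the target and the step names are made up for illustration.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data; the values and the target are made up.
train = pd.DataFrame({
    "rate": [10, 5, 7, 1],
    "flagged": [0, 1, 0, 1],
    "host": ["somehost.com", "other.com", "somehost.com", "third.com"],
})
target = [0, 1, 0, 1]

preprocess = ColumnTransformer(
    [
        # Hosts not seen during fit are encoded as an all-zero block.
        ("host", OneHotEncoder(handle_unknown="ignore"), ["host"]),
        ("rate", StandardScaler(), ["rate"]),
    ],
    remainder="passthrough",  # keep 'flagged' unchanged
)
pipl = Pipeline(steps=[("prep", preprocess),
                       ("clf", SGDClassifier(random_state=0))])
pipl.fit(train, target)

# The pickled pipeline carries the fitted encoder, so the list of hosts
# never has to be stored separately.
restored = pickle.loads(pickle.dumps(pipl))

# At prediction time the raw three-field record is enough.
new_row = pd.DataFrame([{"rate": 5, "flagged": 1, "host": "new_host.com"}])
n_columns = restored.named_steps["prep"].transform(new_row).shape[1]
prediction = restored.predict(new_row)
print(n_columns, prediction)
```

Because the encoder is fitted once and pickled with the classifier, the column index of each host is fixed at training time and the caller never has to know it.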
You should use a scikit-learn transformer instead of get_dummies. In this case, LabelBinarizer makes sense. Seeing as LabelBinarizer doesn't work in a pipeline, this is one way to do what you want:
binarizer = LabelBinarizer()
# fitting LabelBinarizer means it remembers all the categories it has seen
one_hot_data = binarizer.fit_transform(X_train[:, categorical_col])
# replace string column with one-hot representation
X_train = np.concatenate([np.delete(X_train, categorical_col, axis=1),
                          one_hot_data], axis=1)
clf = SGDClassifier()
clf.fit(X_train, y)
# f: file object opened with mode 'wb'
pickle.dump({'clf': clf, 'binarizer': binarizer}, f)
Then, at prediction time:
estimators = pickle.load(f)
clf = estimators['clf']
binarizer = estimators['binarizer']
one_hot_data = binarizer.transform(X_test[:, categorical_col])
X_test = np.concatenate([np.delete(X_test, categorical_col, axis=1),
                         one_hot_data], axis=1)
clf.predict(X_test)