How to properly deal with NaNs in Tensorflow model

I am currently training a Tensorflow model which has various values and features filled with NaN. For example:
feature = [np.nan, 'foo', 'foo', np.nan, 'bar', 'foo']
Tensorflow doesn't deal with NaN values, so I replaced them with 0:
feature = [0, 'foo', 'foo', 0, 'bar', 'foo']
But of course Tensorflow doesn't deal with mixed-type tensors either. What I really want is for the model to ignore these inputs when training the neural network.
But since I'm working with tf.feature_columns, I don't have the freedom to feed these inputs directly into the model, because I need to explicitly state whether they are strings or ints when using the tf.categorical and tf.numeric_column methods.
Any suggestions for working with mixed types in feature columns? I would much prefer to stick with tf.feature_columns if possible.

Susmit mostly answered the question in the comments; for completeness: to "ignore" a value, you can build the vocabulary from the data with the NaNs excluded, so that the "<UNK>" lookup returns -1.
import pandas as pd  # pd.isnull handles object arrays that mix strings and NaN

feature = np.asarray(feature, dtype=object)  # object dtype allows the string assignment below
na_mask = pd.isnull(feature)
vocab = np.unique(feature[~na_mask])
feature[na_mask] = "<UNK>"
...
tf.feature_column.categorical_column_with_vocabulary_list("feature", vocab)
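Because "<UNK>" is deliberately left out of vocab, the lookup falls back to the default default_value=-1. If you then wrap the column, for example in an indicator column, id -1 encodes as an all-zeros vector, so those rows contribute nothing for this feature. A minimal follow-up sketch, assuming the default arguments:
feature_col = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list("feature", vocab))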

Related

Python Surprise package gives different predictions for predict method vs manual compute using latent factors

I am using the surprise package for matrix factorization. Below is the code from the tutorial:
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
# Load the movielens-100k dataset (download it if needed),
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)
algo.predict(str(196), str(301), r_ui=4)
Out:
Prediction(uid='196', iid='301', r_ui=4, est=3.0740854315737174, details={'was_impossible': False})
However, when I use the SVD equation from its documentation and source code to manually compute the r_hat (r prediction):
algo.trainset.global_mean + algo.bi[301] + algo.bu[196] + np.dot(algo.qi[301], algo.pu[196])
Out:
2.817335384596893
The predictions do not match at all. Am I doing anything wrong or missing something?
I managed to figure it out. There's a difference between raw users/items and inner users/items. The former refers to the actual names of the users and items (e.g., user = John or a number like 10; item = Avengers or a number like 20), while the latter appear to be label-encoded values assigned to the original users/items.
The trainset has 4 hidden attributes, _inner2raw_id_items, _inner2raw_id_users, _raw2inner_id_items, _raw2inner_id_users, which are dicts containing the conversion from one to the other.
If we call trainset._raw2inner_id_users and trainset._raw2inner_id_items, we get:
_raw2inner_id_users
{'196': 0,
'186': 1,
'22': 2, ...}
_raw2inner_id_items
{'242': 0,
'302': 1,
'377': 2, ...
'301': 404, ...}
Therefore, when we call:
algo.predict(str(196), str(301))
Out:
# different from original post as the prediction changes from run to run
Prediction(uid='196', iid='301', r_ui=None, est=3.2072618383879736, details={'was_impossible': False})
We are actually referring to inner user 0 (raw '196') and inner item 404 (raw '301'). So when we do the manual computation using the latent factors, biases, and global mean according to the SVD equation, we should use these inner ids instead:
algo.trainset.global_mean + algo.bi[404] + algo.bu[0] + np.dot(algo.qi[404], algo.pu[0])
Output:
3.2072618383879736
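For reference, the Trainset also exposes public converters, to_inner_uid and to_inner_iid, so the manual computation can be written without reaching into the private dicts or hard-coding inner ids (a small sketch using the same fitted algo):
inner_uid = algo.trainset.to_inner_uid('196')  # -> 0
inner_iid = algo.trainset.to_inner_iid('301')  # -> 404
algo.trainset.global_mean + algo.bu[inner_uid] + algo.bi[inner_iid] + np.dot(algo.qi[inner_iid], algo.pu[inner_uid])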

Word vocabulary generated by Word2vec and Glove models are different for the same corpus

I'm using CONLL2003 dataset to generate word embeddings using Word2vec and Glove.
The number of words returned by word2vecmodel.wv.vocab is much smaller than the number of entries in glove.dictionary.
Here is the code:
Word2Vec:
word2vecmodel = Word2Vec(result, size=100, window=5, sg=1)
X = word2vecmodel[word2vecmodel.wv.vocab]
w2vwords = list(word2vecmodel.wv.vocab)
Output: len(w2vwords) = 4653
Glove:
from glove import Corpus
from glove import Glove
import numpy as np
corpus = Corpus()
nparray = []
allwords = []
no_clusters=500
corpus.fit(result, window=5)
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
Output: len(glove.dictionary) = 22833
The input is a list of sentences. For example:
result[1:5] =
[['Peter', 'Blackburn'],
 ['BRUSSELS', '1996-08-22'],
 ['The', 'European', 'Commission', 'said', 'Thursday', 'disagreed', 'German',
  'advice', 'consumers', 'shun', 'British', 'lamb', 'scientists', 'determine',
  'whether', 'mad', 'cow', 'disease', 'transmitted', 'sheep', '.'],
 ['Germany', "'s", 'representative', 'European', 'Union', "'s", 'veterinary',
  'committee', 'Werner', 'Zwingmann', 'said', 'Wednesday', 'consumers', 'buy',
  'sheepmeat', 'countries', 'Britain', 'scientific', 'advice', 'clearer', '.']]
There are 13517 sentences in total in the result list.
Can someone please explain why the vocabularies for which embeddings are created are so drastically different in size?
You haven't mentioned which Word2Vec implementation you're using, but I'll assume you're using the popular Gensim library.
Like the original word2vec.c code released by Google, Gensim Word2Vec uses a default min_count parameter of 5, meaning that any words appearing fewer than 5 times are ignored.
The word2vec algorithm needs many varied examples of a word's usage in different contexts to generate strong word-vectors. When words are rare, they fail to get very good word-vectors themselves: the few examples only show a few uses that may be idiosyncratic compared to what a larger sampling would show, and can't be subtly balanced against many other word representations in the manner that's best.
But further, given that in typical word-distributions there are many such low-frequency words, altogether they also tend to make the word-vectors for other more-frequent words worse. The lower-frequency words are, comparatively, 'interference' that absorbs training state/effort to the detriment of other more-important words. (At best, you can offset this effect a bit by using more training epochs.)
So, discarding low-frequency words is usually the right approach. If you really need vectors for those words, obtaining more data so that those words are no longer rare is the best approach.
You can also use a lower min_count, including as low as min_count=1 to retain all words. But often discarding such rare words is better for whatever end-purpose for which the word-vectors will be used.
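For example, keeping every word is a one-line change (a sketch; min_count is a standard Gensim Word2Vec parameter):
word2vecmodel = Word2Vec(result, size=100, window=5, sg=1, min_count=1)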

How to learn the embeddings in Pytorch and retrieve it later

I am building a recommendation system where I predict the best item for each user given their purchase history of items. I have userIDs and itemIDs and how much each itemID was purchased by each userID. I have millions of users and thousands of products. Not all products are purchased (there are some products that no one has bought yet). Since the numbers of users and items are big, I don't want to use one-hot vectors. I am using pytorch and I want to create and train the embeddings so that I can make the predictions for each user-item pair. I followed this tutorial https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html. If it's an accurate assumption that the embedding layer is being trained, then do I retrieve the learned weights through the model.parameters() method, or should I use embedding.weight.data?
model.parameters() returns all the parameters of your model, including the embeddings.
All these parameters of your model are handed over to the optimizer (line below) and will be updated when optimizer.step() is called, so yes, your embeddings are trained along with all the other parameters of the network. (You could also freeze certain layers by setting e.g. embedding.weight.requires_grad = False, but that is not the case here.)
# summing it up:
# this line specifies which parameters are trained with the optimizer
# model.parameters() just returns all parameters
# embedding class weights are also parameters and will thus be trained
optimizer = optim.SGD(model.parameters(), lr=0.001)
You can see that your embedding weights are also of type Parameter by doing so:
import torch
embedding_matrix = torch.nn.Embedding(10, 10)
print(type(embedding_matrix.weight))
This will output the type of the weights, which is Parameter:
<class 'torch.nn.parameter.Parameter'>
I'm not entirely sure what you mean by retrieve. Do you mean getting a single vector, or do you want the whole matrix so you can save it, or something else?
embedding_matrix = torch.nn.Embedding(5, 5)
# this will get you a single embedding vector
print('Getting a single vector:\n', embedding_matrix(torch.LongTensor([0])))
# of course you can do the same for a sequence
print('Getting vectors for a sequence:\n', embedding_matrix(torch.LongTensor([1, 2, 3])))
# this will give you the whole embedding matrix
print('Getting weights:\n', embedding_matrix.weight.data)
Output:
Getting a single vector:
tensor([[-0.0144, -0.6245, 1.3611, -1.0753, 0.5020]], grad_fn=<EmbeddingBackward>)
Getting vectors for a sequence:
tensor([[ 0.9277, -0.1879, -1.4999, 0.2895, 0.8367],
[-0.1167, -2.2139, 1.6918, -0.3483, 0.3508],
[ 2.3763, -1.3408, -0.9531, 2.2081, -1.5502]],
grad_fn=<EmbeddingBackward>)
Getting weights:
tensor([[-0.0144, -0.6245, 1.3611, -1.0753, 0.5020],
[ 0.9277, -0.1879, -1.4999, 0.2895, 0.8367],
[-0.1167, -2.2139, 1.6918, -0.3483, 0.3508],
[ 2.3763, -1.3408, -0.9531, 2.2081, -1.5502],
[-0.5829, -0.1918, -0.8079, 0.6922, -0.2627]])
I hope this answers your question. You can also take a look at the documentation, where you can find some useful examples as well.
https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding
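As for the "retrieve it later" part of the question title, one common pattern is to save the learned weight matrix and rebuild an embedding layer from it; a minimal sketch using standard PyTorch APIs (torch.save, torch.load, Embedding.from_pretrained; the filename is just an example):
# save the trained weights to disk
torch.save(embedding_matrix.weight.data, 'embeddings.pt')
# ... later: reload them and build a new embedding layer
weights = torch.load('embeddings.pt')
embedding_restored = torch.nn.Embedding.from_pretrained(weights)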

Feeding list of tf.truncated_normal() or list of dictionaries into a Tensorflow Model

I am new to tensorflow and I am trying to learn how to use the tool efficiently.
I expand on the question below but here is the tldr:
I am wondering what is the best way to feed the following weights and biases into my model with feed_dict:
def generate_initial_population(my_population_size):
    my_weights = []
    my_biases = []
    for _ in range(my_population_size):
        my_weights.append({
            'h1': tf.Variable(tf.truncated_normal([n_inputs, n_hidden_1])),
            'h2': tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2])),
            'out': tf.Variable(tf.truncated_normal([n_hidden_2, n_class]))
        })
        my_biases.append({
            'b1': tf.Variable(tf.truncated_normal([n_hidden_1])),
            'b2': tf.Variable(tf.truncated_normal([n_hidden_2])),
            'out': tf.Variable(tf.truncated_normal([n_class]))
        })
    return my_weights, my_biases
weights, biases = generate_initial_population(population_size)
I cannot simply use feed_dict={weights_ph: weights} because it will generate errors, and I do not know how to deal with this problem efficiently.
Examining the code at the end might help with understanding what I am talking about.
I am wondering if there is any way I could feed a list containing tf.truncated_normals to my model.
I get the error ValueError: setting an array element with a sequence., I believe because it tries to convert the list to an np.array and has issues with the dimensions.
I have found an easy workaround where I figure out the values of all the tensors first with session.run and then feed those into my model.
I am just not sure whether this is the right solution, since I would be inclined to believe it is slower because you have to run the session twice.
This solution also doesn't work if my original list does not have a regular shape, like [[1, [1, 2]]], or when my truncated_normals do not have the same shapes.
I was thinking I would just feed my oddly shaped list into my model and then use tf.gather to get the specific indices I want to work on.
Since I cannot do that, is my solution the proper way to deal with this: simply computing the truncated_normals first, feeding the values into the model, and then reshaping the list inside the model if needed?
I am also having a very similar problem because I wanted to feed a list of dictionaries into the model as well. Is the proper way of dealing with that to extract the data from the dictionaries and then feed in each value from each key separately?
I am trying to learn, and I couldn't find this information elsewhere.
Here is a code snippet I designed to fail, to illustrate what I mean:
import tensorflow as tf

list_ph = tf.placeholder(dtype=tf.float32)
index_ph = tf.placeholder(dtype=tf.int32)

def model(my_list, index):
    value = tf.gather(my_list, index, axis=0)
    return value

my_model = model(list_ph, index_ph)

with tf.Session() as sess:
    var_list = []
    truncated_normal = tf.Variable(tf.truncated_normal(shape=[5, 3]))
    for i in range(4):
        var_list.append(truncated_normal)
    # for i in range(4):
    #     var_list.append({i: i * 2})
    sess.run(tf.global_variables_initializer())
    # will work, but will not work for dictionaries
    val = sess.run(var_list)
    # will not work, but will work if you feed val
    var = sess.run(my_model, feed_dict={list_ph: var_list, index_ph: 1})
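    # For completeness, the workaround described above, sketched out: materialize
    # the values first (val from the line above), then feed plain NumPy arrays
    var = sess.run(my_model, feed_dict={list_ph: val, index_ph: 1})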

scikit-learn: Is there a way to provide an object as an input to predict function of a classifier?

I am planning to use an SGDClassifier in production. The idea is to train the classifier on some training data, use cPickle to dump it to a .pkl file and reuse it later in a script. However, there are certain high-cardinality fields which are categorical in nature and are translated to a one-hot matrix representation, which creates around 5000 features. Now the input that I get for predict will only have one of these features set, and the rest will all be zeroes. It will also, of course, include the other numerical features. From the docs, it appears that the predict function expects an array of arrays as input. Is there any way I can transform my input to the format expected by the predict function without having to store the fields every time I train the model?
Update
So, let us say my input contains 3 fields:
{
rate: 10, // numeric
flagged: 0, //binary
host: 'somehost.com' // keeping this categorical
}
host can have around 5000 different values. I loaded the data into a pandas dataframe and used the get_dummies function to transform the host field into around 5000 new binary fields.
Then I trained my model and stored it using cPickle.
Now, when I need to use the predict function, for the input, I only have 3 fields (shown above). However, as per my understanding the predict endpoint will expect an array of vectors and each vector is supposed to have those 5000 fields.
For the entry that I need to predict, I know only one field for that entry which will be the value of host itself.
For example, if my input is
{
rate: 5,
flagged: 1,
host: 'new_host.com'
}
I know that the fields expected by the predict should be:
{
rate: 5,
flagged: 1,
new_host: 1
}
But if I translate it to vector format, I don't know at which index to place the new_host field. Also, I don't know in advance what the other hosts are (unless I store them somewhere during the training phase).
I hope I am making some sense. Let me know if I am doing it the wrong way.
I don't know which index to place the new_host field
A good approach that has worked for me is to build a pipeline which you then use for training and prediction. This way you do not have to concern yourself with the column index of whatever output is produced by your transformation:
# in training
pipl = Pipeline(steps=[('binarizer', LabelBinarizer()),
                       ('clf', SGDClassifier())])
model = pipl.fit(X, Y)
pickle.dump(model, mf)
# in production
model = pickle.load(mf)
y = model.predict(X)
As X, Y inputs you need to pass an array-like object. Make sure the input is the same structure for both training and test, e.g.
X = [[data.get('rate'), data.get('flagged'), data.get('host')]]
Y = [[y-cols]] # your example doesn't specify what Y is in your data
More flexible: Pandas DataFrame + Pipeline
What also works nicely is to use a Pandas DataFrame in combination with sklearn-pandas as it allows you to use different transformations on different column names. E.g.
df = pd.DataFrame.from_dict(data)
mapper = DataFrameMapper([
    ('host', sklearn.preprocessing.LabelBinarizer()),
    ('rate', sklearn.preprocessing.StandardScaler())
])
pipl = Pipeline(steps=[('mapper', mapper),
                       ('clf', SGDClassifier())])
X = df[x-cols]
y = df[y-col(s)]
pipl.fit(X, y)
Note that x-cols and y-col(s) are the list of the feature and target columns respectively.
You should use a scikit-learn transformer instead of get_dummies. In this case, LabelBinarizer makes sense. Seeing as LabelBinarizer doesn't work in a pipeline, this is one way to do what you want:
binarizer = LabelBinarizer()
# fitting the LabelBinarizer means it remembers all the values it has seen
one_hot_data = binarizer.fit_transform(X_train[:, categorical_col])
# replace the string column with its one-hot representation
X_train = np.concatenate([np.delete(X_train, categorical_col, axis=1),
                          one_hot_data], axis=1)
clf = SGDClassifier()
clf.fit(X_train, y)
pickle.dump({'clf': clf, 'binarizer': binarizer}, f)
then at prediction time:
estimators = pickle.load(f)
clf = estimators['clf']
binarizer = estimators['binarizer']
one_hot_data = binarizer.transform(X_test[:, categorical_col])
X_test = np.concatenate([np.delete(X_test, categorical_col, axis=1),
                         one_hot_data], axis=1)
clf.predict(X_test)
