How to use cuda for dictionary inputs? - pytorch

How can I use PyTorch CUDA for dictionary inputs like the one below?
train_dataset[0]
>> {'ages': '40대',
'cust': 'M000034966',
'sequence_item': ['PD0816', 'PD0796', 'PD0777', 'PD1161'],
'sequence_timestamp': tensor([1.6108e+09, 1.6108e+09, 1.6108e+09, 1.6108e+09]),
'sex': '여성',
'target': tensor(0.),
'target_item': 'PD1468',
'target_timestamp': tensor(1.6108e+09)}
For categorical features like sex, I encode and embed them inside my custom model. In the dictionary they are plain strings, not tensors, so at what point should I call .cuda()?

How about doing it like this?
It seems that each value in the dictionary can be moved to the GPU individually, as in the sketch below.
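A minimal sketch of that idea, assuming the sample dictionary shown above; the helper name batch_to_device is made up here. Only tensor values are moved, while strings such as 'sex' and 'ages' stay on the CPU and are encoded/embedded inside the model:
import torch

def batch_to_device(batch, device):
    # Move tensor values to the device; leave non-tensor values (strings) untouched.
    return {
        key: value.to(device) if torch.is_tensor(value) else value
        for key, value in batch.items()
    }

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
sample = batch_to_device(train_dataset[0], device)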

Related

How to properly deal with NaNs in Tensorflow model

I am currently training a Tensorflow model which has various values and features filled with NaN. For example:
feature = [np.nan, 'foo', 'foo', np.nan, 'bar', 'foo']
Tensorflow doesn't deal with NaN values, so I replaced them with 0:
feature = [0, 'foo', 'foo', 0, 'bar', 'foo']
But of course Tensorflow doesn't deal with mixed tensors. What I really want to do is have the model ignore these inputs when training a neural network model.
But since I'm working with tf.feature_columns, I don't have the freedom to feed these inputs directly in the model because I need to explicitly state if they are strings or ints when using tf.categorical and tf.numeric_column methods.
Any suggestions for working with types of feature columns? I would much prefer to stick with tf.feature_columns if possible.
Susmit mostly answered the question in the comments; for completeness: to "ignore" the missing values you can build the vocabulary from the data without the NaNs and replace the NaNs with an out-of-vocabulary token, so the "<UNK>" lookup returns -1.
import numpy as np
import pandas as pd

feature = np.array([np.nan, 'foo', 'foo', np.nan, 'bar', 'foo'], dtype=object)
na_mask = pd.isna(feature)                # np.isnan fails on mixed/object arrays; pd.isna handles them
vocab = np.unique(feature[~na_mask])      # vocabulary built without the NaNs
feature[na_mask] = "<UNK>"                # token outside the vocabulary, lookup returns -1
...
tf.feature_column.categorical_column_with_vocabulary_list("feature", vocab)

How to learn the embeddings in Pytorch and retrieve it later

I am building a recommendation system where I predict the best item for each user given their purchase history. I have userIDs, itemIDs, and how much of each item was purchased by each user. There are millions of users and thousands of products, and not all products have been purchased (some have not been bought by anyone yet). Since the numbers of users and items are large, I don't want to use one-hot vectors. I am using PyTorch and want to create and train the embeddings so that I can make predictions for each user-item pair. I followed this tutorial: https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html. Assuming the embedding layer is indeed being trained, should I retrieve the learned weights through the model.parameters() method, or via embedding.weight.data?
model.parameters() returns all the parameters of your model, including the embeddings.
So all these parameters of your model are handed over to the optimizer (line below) and will be updated when optimizer.step() is called, so yes, your embeddings are trained along with all the other parameters of the network. (You can also freeze certain layers by setting e.g. embedding.weight.requires_grad = False, but that is not the case here; a small sketch of this follows the code below.)
# summing it up:
# this line specifies which parameters are trained with the optimizer
# model.parameters() just returns all parameters
# embedding class weights are also parameters and will thus be trained
optimizer = optim.SGD(model.parameters(), lr=0.001)
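As a side note, a minimal sketch of the freezing aside mentioned above, using a made-up two-layer model; only parameters that still require gradients are handed to the optimizer:
import torch
import torch.optim as optim

model = torch.nn.Sequential(
    torch.nn.Embedding(10, 4),   # embedding table we want to keep fixed
    torch.nn.Linear(4, 1),       # layer that should still be trained
)
model[0].weight.requires_grad = False                           # freeze the embedding weights
trainable = [p for p in model.parameters() if p.requires_grad]  # filter out frozen parameters
optimizer = optim.SGD(trainable, lr=0.001)                      # only the Linear layer is updated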
You can see that your embedding weights are also of type Parameter, like so:
import torch
embedding_matrix = torch.nn.Embedding(10, 10)
print(type(embedding_matrix.weight))
This will output the type of the weights, which is Parameter:
<class 'torch.nn.parameter.Parameter'>
I'm not entirely sure what you mean by retrieve. Do you mean getting a single vector, do you want the whole matrix so you can save it, or something else?
embedding_matrix = torch.nn.Embedding(5, 5)
# this will get you a single embedding vector
print('Getting a single vector:\n', embedding_matrix(torch.LongTensor([0])))
# of course you can do the same for a sequence
print('Getting vectors for a sequence:\n', embedding_matrix(torch.LongTensor([1, 2, 3])))
# this will give you the whole embedding matrix
print('Getting weights:\n', embedding_matrix.weight.data)
Output:
Getting a single vector:
tensor([[-0.0144, -0.6245, 1.3611, -1.0753, 0.5020]], grad_fn=<EmbeddingBackward>)
Getting vectors for a sequence:
tensor([[ 0.9277, -0.1879, -1.4999,  0.2895,  0.8367],
        [-0.1167, -2.2139,  1.6918, -0.3483,  0.3508],
        [ 2.3763, -1.3408, -0.9531,  2.2081, -1.5502]],
       grad_fn=<EmbeddingBackward>)
Getting weights:
tensor([[-0.0144, -0.6245,  1.3611, -1.0753,  0.5020],
        [ 0.9277, -0.1879, -1.4999,  0.2895,  0.8367],
        [-0.1167, -2.2139,  1.6918, -0.3483,  0.3508],
        [ 2.3763, -1.3408, -0.9531,  2.2081, -1.5502],
        [-0.5829, -0.1918, -0.8079,  0.6922, -0.2627]])
I hope this answers your question; you can also take a look at the documentation, where you will find some useful examples as well.
https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding
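If the goal is to retrieve and persist the learned matrix after training, here is a minimal sketch; the class name Recommender and the attribute item_embedding are hypothetical stand-ins for however the real model is defined:
import torch

class Recommender(torch.nn.Module):
    # toy stand-in for the real model, with one item-embedding table
    def __init__(self, n_items=1000, dim=16):
        super().__init__()
        self.item_embedding = torch.nn.Embedding(n_items, dim)

model = Recommender()
# ... training with optim.SGD(model.parameters(), ...) would happen here ...
weights = model.item_embedding.weight.detach().cpu().numpy()  # the whole learned matrix
vector_for_item_3 = weights[3]                                # embedding row for one item id
torch.save(model.state_dict(), "recommender.pt")              # persist the weights for later reuse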

how to provide a extra target argument to input_fn of tf.estimator

As you know, in order to use tf.estimator one needs to implement a model function, plus an input function that builds a pipeline yielding batches of (features, labels) pairs; the model function's signature is therefore:
def model_fn(features, labels, mode, params, config):
These features and labels are the values returned from the input_fn. Assume that features -> X and labels -> y. My problem is that I have two kinds of labels (targets and labels):
Features = X : [None, 2048]
Labels = targets: [None, 2048]
labels: [None, 1]
In order to provide targets and labels as separate arguments instead of just one label argument, what would be the alternative?
Note: I tried to concatenate targets and labels and then slice them where needed, but that created additional problems when running the model. So I am wondering whether there is a better way to do this.
Thank you.
In your input_fn, you can simply return a dictionary instead of a tensor as labels. That is, your input function likely returns an iterator over a tuple (features, labels). Both features and labels can either be a single tensor or a dict. This dict should map from strings to tensors.
You can prepare the dataset as one returning three elements (features, targets, labels), and then include a mapping to pack the targets into a dict (there might be better ways but this works):
data = ...  # prepare dataset of 3-tuples

def pack_in_dict(features, targets, labels):
    return features, {"targets": targets, "labels": labels}

data = data.map(pack_in_dict)
Now, if one of the elements is a dict (say, labels), then the corresponding input to model_fn will also be a dict. You can then simply use labels["targets"] and labels["labels"] in your model_fn.
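A minimal end-to-end sketch of that idea, with made-up in-memory arrays standing in for the real data (shapes follow the question):
import numpy as np
import tensorflow as tf

X = np.random.rand(100, 2048).astype(np.float32)        # features
targets = np.random.rand(100, 2048).astype(np.float32)  # first label tensor
labels = np.random.rand(100, 1).astype(np.float32)      # second label tensor

def input_fn():
    data = tf.data.Dataset.from_tensor_slices((X, targets, labels))
    def pack_in_dict(features, targets, labels):
        return features, {"targets": targets, "labels": labels}
    return data.map(pack_in_dict).batch(32)

# inside model_fn, labels is then a dict: labels["targets"] and labels["labels"]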

keras: how to predict classes in order?

I'm trying to predict image classes in keras (binary classification). The model accuracy is fine, but it seems that ImageDataGenerator shuffles the input images, so I was not able to match the predicted class with the original images.
datagen = ImageDataGenerator(rescale=1./255)
generator = datagen.flow_from_directory(
    pred_data_dir,
    target_size=(img_width, img_height),
    batch_size=32,
    class_mode=None,
    shuffle=False,
    save_to_dir='images/aug')
print(model.predict_generator(generator, nb_input))
For example, if I have a1.jpg, a2.jpg,..., a9.jpg under pred_data_dir, I expect to get an array like
[class for a1.jpg, class for a2.jpg, ... class for a9.jpg]
from model.predict_generator(), but actually I got something like
[class for a3.jpg, class for a8.jpg, ... class for a2.jpg]
How can I resolve the issue?
Look at the source code of flow_from_directory. In my case, I had to rename all images. They were named 1.jpg .. 1000.jpg, but to be in order, they had to be named 0001.jpg .. 1000.jpg. The sorting is important here.
flow_from_directory uses sorted(os.listdir(directory)), thus the sorting is not always intuitive.
The flow_from_directory() method returns a DirectoryIterator object with a filenames member that lists all the files. Since that member is used for subsequent batch generation and iteration, you should be able to use it to match your filenames to predictions.
For your example, generator.filenames should give you a parallel list like ['a3.jpg', 'a8.jpg', ..., 'a2.jpg'].
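A minimal sketch of that matching step, assuming the model and generator from the question; the 0.5 threshold is an assumption for the binary case:
import numpy as np

probs = model.predict_generator(generator, steps=len(generator))
classes = (probs > 0.5).astype(int).ravel()          # threshold the sigmoid outputs
for filename, predicted_class in zip(generator.filenames, classes):
    print(filename, predicted_class)                 # filenames and predictions stay aligned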

Scikit-Learn Linear Regression how to get coefficient's respective features?

I'm trying to perform feature selection by evaluating my regression's coefficient outputs and selecting the features with the highest-magnitude coefficients. The problem is, I don't know how to get the respective features, since only coefficients are returned from the coef_ attribute. The documentation says:
Estimated coefficients for the linear regression problem. If multiple
targets are passed during the fit (y 2D), this is a 2D array of
shape (n_targets, n_features), while if only one target is passed,
this is a 1D array of length n_features.
I am passing into my regression.fit(A,B), where A is a 2-D array, with tfidf value for each feature in a document. Example format:
"feature1" "feature2"
"Doc1" .44 .22
"Doc2" .11 .6
"Doc3" .22 .2
B are my target values for the data, which are just numbers 1-100 associated with each document:
"Doc1" 50
"Doc2" 11
"Doc3" 99
Using regression.coef_, I get a list of coefficients, but not their corresponding features! How can I get the features? I'm guessing I need to modify the structure of my B targets, but I don't know how.
What I found to work was:
X = your independent variables
coefficients = pd.concat([pd.DataFrame(X.columns), pd.DataFrame(np.transpose(logistic.coef_))], axis=1)
The assumption you stated, that the order of regression.coef_ matches the column order of the training set, has held true in my experience (it works with the underlying data and also checks out against correlations between X and y).
You can do that by creating a data frame:
cdf = pd.DataFrame(regression.coef_, X.columns, columns=['Coefficients'])
print(cdf)
coefficients = pd.DataFrame({"Feature":X.columns,"Coefficients":np.transpose(logistic.coef_)})
I suppose you are working on a feature selection task. Using regression.coef_ does give you the coefficients in the same order as your features, i.e. regression.coef_[0] corresponds to "feature1" and regression.coef_[1] corresponds to "feature2". That should be what you want, as the sketch below shows.
I would also recommend the tree-based models from sklearn, which can likewise be used for feature selection; see scikit-learn's documentation on feature selection for details.
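A minimal sketch of that positional correspondence, using the toy numbers from the question; feature_names is just a hand-built list here (with a vectorizer you would typically take the names from get_feature_names_out()):
import numpy as np
from sklearn.linear_model import LinearRegression

A = np.array([[0.44, 0.22],
              [0.11, 0.60],
              [0.22, 0.20]])            # tfidf values, one row per document
B = np.array([50, 11, 99])              # target value for each document
feature_names = ["feature1", "feature2"]

regression = LinearRegression().fit(A, B)
for name, coef in zip(feature_names, regression.coef_):
    print(name, coef)                   # coef_ follows the column order of A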
Coefficients and features in zip
print(list(zip(X_train.columns.tolist(),logreg.coef_[0])))
Coefficients and features in DataFrame
pd.DataFrame({"Feature":X_train.columns.tolist(),"Coefficients":logreg.coef_[0]})
This is the easiest and most intuitive way:
pd.DataFrame(logisticRegr.coef_, columns=x_train.columns)
or the same but transposing index and columns
pd.DataFrame(logisticRegr.coef_, columns=x_train.columns).T
Suppose your training-data X variable is df_X; you can then zip it into a dictionary and feed that into a pandas DataFrame to get the mapping:
pd.DataFrame(dict(zip(df_X.columns,model.coef_[0])),index=[0]).T
Try putting them in a Series with the data's column names as the index:
coeffs = pd.Series(model.coef_[0], index=X.columns.values)
coeffs.sort_values(ascending=False)
