Sparse matrices in python tensorflow text classification - python-3.x

I've been trying to implement a text classification routine using the tensorflow package in python. I already had a successful perceptron version working in the scikit-learn environment, but scikit-learn does not have multilayer neural networks (except for some mythical version 0.18 that I can't seem to find/install anywhere).
I thought it was best to try something simpler in tensorflow first, to learn how the package works and what it can and cannot do, so I went with nearest neighbors. So far so good, except I just can't find a way to feed a sparse version of the vocabulary matrix (bag-of-words vectorizations of the texts) to a placeholder in tensorflow (in scikit-learn this is no problem at all). Converting the vocabulary matrix to a dense matrix solves the problem but severely slows down the algorithm and clogs up RAM.
Is there any way around this? From what I found on the web it seems tensorflow has very limited support for sparse objects (only certain operations will accept them as input), but I hope I'm wrong.
P.S. Yes, I read this thread and it did not solve my problem. And yes, I know I could stick with the scikit-learn perceptron or choose another package, but that's a decision I'll make based on the answers I get here.

With TensorFlow 1.0.1, I can do this:
import numpy as np
import tensorflow as tf
from scipy import sparse

a = sparse.csr_matrix([[0, 1, 2], [5, 0, 0], [0, 0, 5],
                       [10, 1, 0], [0, 0, 4]])
# w = np.arange(6, dtype=np.float32).reshape([3, 2])
a = a.tocoo()

a_ = tf.sparse_placeholder('float32')
w_ = tf.Variable(tf.random_normal([3, 2], stddev=1.0))
a_sum = tf.sparse_reduce_sum(a_, 1)
a_mul = tf.sparse_tensor_dense_matmul(a_, w_)

# Initializing the variables
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    # SparseTensorValue wants int64 indices/shape; zip() must be materialized in Python 3.
    indices = np.array(list(zip(a.row, a.col)), dtype=np.int64)
    values = np.array(a.data, dtype=np.float32)
    shape = np.array(a.shape, dtype=np.int64)
    print(sess.run(a_mul, feed_dict={a_: tf.SparseTensorValue(indices, values, shape)}))
    w = sess.run(w_)
    print(np.dot(a.todense(), w))
You can find this pattern on the API page for tf.sparse_placeholder. After the first layer, the remaining layers of the neural network will work with dense matrices anyway.
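For comparison, a sketch of the same pattern in TensorFlow 2.x, assuming eager execution (placeholders and sessions are gone there, so the sparse tensor is built and multiplied directly):

import numpy as np
import tensorflow as tf  # assumes TF 2.x
from scipy import sparse

a = sparse.csr_matrix([[0, 1, 2], [5, 0, 0], [0, 0, 5],
                       [10, 1, 0], [0, 0, 4]]).tocoo()

# Build a tf.sparse.SparseTensor straight from the COO triplets.
a_tf = tf.sparse.SparseTensor(
    indices=np.column_stack([a.row, a.col]).astype(np.int64),
    values=a.data.astype(np.float32),
    dense_shape=a.shape)

w = tf.random.normal([3, 2], stddev=1.0)
print(tf.sparse.sparse_dense_matmul(a_tf, w))  # same product as above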

Related

scikit-learn: Why do we need the target data to have the same size as the training data

I am pretty new to ML and I am studying the k-nearest neighbors classifier using the Python documentation.
I am somewhat confused about the training part. Let's say my training data is some points in 1D
training = [[1], [4], [3]]
and I want to use a k-nearest neighbors classifier to label them into two "teams":
labels = [[0], [1]]
Why doesn't that make sense?
I get an error that the target values size does not match the input.
If I put instead labels = [[0], [1], [1]] or labels = [[0], [0], [1]]
it runs fine.
Also a side-note question: does the permutation of the labels matter?
It looks like you're trying to cluster your data, which is done by sklearn.neighbors.NearestNeighbors(). sklearn.neighbors.KNeighborsClassifier() is a supervised model: it requires the actual class of every observation for training, and can then predict the class of previously unseen data.
However, NearestNeighbors() does not let you limit the number of clusters, IIRC; you should probably try something like sklearn.cluster.KMeans(n_clusters=2) instead, as in the sketch below.
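A minimal sketch of both routes, using the 1D points from the question (the KMeans labels and the example labels here are illustrative, not the only valid ones):

from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

training = [[1], [4], [3]]

# Unsupervised: let KMeans split the points into two "teams" on its own.
kmeans = KMeans(n_clusters=2, random_state=0).fit(training)
print(kmeans.labels_)  # one label per training point, e.g. [0, 1, 1]

# Supervised: KNeighborsClassifier needs one known label per sample.
labels = [0, 1, 1]  # same length as training, as the error demands
knn = KNeighborsClassifier(n_neighbors=1).fit(training, labels)
print(knn.predict([[2]]))  # classify a previously unseen point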

Is there a way to call Macro-Precision in Hugging Face Trainer?

I'm currently running tests on the DEFT-2015 dataset using Hugging Face models, and I would like to compare my results to previous work.
I checked the list_metrics method from the datasets library, but I did not see Macro Precision, which was the metric used at the time by the researchers.
Do you have any pointers on how I could tackle this?
The Hugging Face library (version 4.20.0) seems to delegate these metric calls to scikit-learn behind the curtain.
If you just use the following (without scikit-learn installed):
from datasets import load_metric
metric = load_metric("precision")
precision = metric.compute(predictions=[...],references=[...])
it will throw an error saying that scikit-learn is not installed.
Why this intro? Because, in fact, you can use the datasets metrics to compute your metric however you want (exactly as scikit-learn does).
You just need to add the 'average' parameter:
from datasets import load_metric
metric = load_metric("precision")
precision = metric.compute(predictions=[0, 0, 0, 0, 1, 1, 2, 2],
                           references=[0, 0, 0, 0, 1, 1, 1, 2],
                           average='macro')
print(precision)  # {'precision': 0.8333333333333334}
The snippet above prints {'precision': 0.8333333333333334}: the per-class precisions are 1.0, 1.0 and 0.5, and their unweighted mean (1 + 1 + 0.5) / 3 ≈ 0.83 is exactly the macro precision you are searching for.
Conclusion: use the average parameter to choose how your metric is computed (micro/macro/weighted).
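As a sanity check, the same number can be reproduced with scikit-learn directly, which is what the metric wraps under the hood:

from sklearn.metrics import precision_score

predictions = [0, 0, 0, 0, 1, 1, 2, 2]
references = [0, 0, 0, 0, 1, 1, 1, 2]

# Per-class precision: 1.0 for class 0, 1.0 for class 1, 0.5 for class 2.
print(precision_score(references, predictions, average='macro'))
# 0.8333333333333334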

How to convert signal data set of 400 samples with 5000 data points into a tensor of [400, 1, 5000] in pytorch?

I have 400 sensor recordings, each of length 5000. I want to convert them into a tensor of shape [400, 1, 5000], i.e. [batch_size, input_channels, signal_length], to train a 1D CNN with PyTorch's nn.Conv1d.
This operation is often referred to as unsqueezing a dimension. There are multiple ways of achieving this, either with an explicit reshape, or with slicing tricks.
Using torch.Tensor.unsqueeze, either out-of-place:
>>> x.unsqueeze(dim=1) # won't affect x
Or in-place with torch.Tensor.unsqueeze_:
>>> x.unsqueeze_(dim=1) # will mutate x
Using indexing:
>>> x[:, None] # will insert a singleton at dim=1
Reshaping the tensor with torch.Tensor.reshape:
>>> x.reshape(len(x), 1, -1)
This is not the recommended method as it doesn't generalize. In my opinion, you should not use reshape or view if you are not actually reshaping the tensor.
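A minimal end-to-end sketch with the shapes from the question (the random data here is just a stand-in for the real recordings):

import torch

x = torch.randn(400, 5000)  # 400 recordings, 5000 samples each
x = x.unsqueeze(dim=1)      # insert the channel dimension
print(x.shape)              # torch.Size([400, 1, 5000])

# Now shaped [batch, channels, length], as nn.Conv1d expects:
conv = torch.nn.Conv1d(in_channels=1, out_channels=8, kernel_size=5)
print(conv(x).shape)        # torch.Size([400, 8, 4996])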

Computing matrix derivatives with torch.autograd.grad (PyTorch)

I am trying to compute matrix derivatives in PyTorch using torch.autograd.grad, but I am running into a few issues. Here is a minimal working example to reproduce the error.
theta = torch.tensor(np.random.uniform(low=-np.pi, high=np.pi), requires_grad=True)
rot_mat = torch.tensor([[torch.cos(theta), torch.sin(theta), 0],
                        [-torch.sin(theta), torch.cos(theta), 0]],
                       dtype=torch.float, requires_grad=True)
torch.autograd.grad(outputs=rot_mat, inputs=theta,
                    grad_outputs=torch.ones_like(rot_mat),
                    create_graph=True, retain_graph=True)
This code results in the error "One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior."
I tried using allow_unused=True but the gradients are returned as None. I am not sure what is causing the graph to be disconnected here.
PyTorch builds the autograd graph only when PyTorch functions are used.
Constructing rot_mat from a Python 2D list disconnects the graph: torch.tensor() copies the values of torch.cos(theta) etc. into a brand-new leaf tensor. So build the rotation matrix with torch functions instead, and then just use backward() to compute the gradients. Here's sample code:
import torch
import numpy as np
theta = torch.tensor(np.random.uniform(low=-np.pi, high=np.pi), requires_grad=True)
# create required values and convert it to torch 1d tensor
cos_t = torch.cos(theta).view(1)
sin_t = torch.sin(theta).view(1)
msin_t = -sin_t
zero = torch.zeros(1)
# create rotation matrix using only pytorch functions
rot_1d = torch.cat((cos_t, sin_t, zero, msin_t, cos_t, zero))
rot_mat = rot_1d.view((2, 3))
# Autograd
rot_mat.backward(torch.ones_like(rot_mat))
# gradient
print(theta.grad)
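If you specifically want torch.autograd.grad, as in the original snippet, the same connected rot_mat works there too. A sketch, meant to replace the rot_mat.backward(...) call above (backward() consumes the graph unless retain_graph=True is passed):

grads = torch.autograd.grad(outputs=rot_mat, inputs=theta,
                            grad_outputs=torch.ones_like(rot_mat),
                            create_graph=True, retain_graph=True)
print(grads[0])  # the same gradient that theta.grad would hold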

Problems using poly kernel in GridSearchCV and SVM classifier

I am trying to do a grid search using a SVM classifier.
Consider my data and target, which have been parsed from a file into numpy arrays.
I then preprocess them.
# Transform the data to have zero mean and unit variance.
zeroMeanUnitVarianceScaler = preprocessing.StandardScaler().fit(data)
# transform() is not in-place, so its return value must be kept.
scaledData = zeroMeanUnitVarianceScaler.transform(data)

# Map the target labels to {-1, 1}.
scaledTarget = np.empty(len(target), dtype=int)
for i in range(len(target)):
    if target[i] == 'Malignant':
        scaledTarget[i] = 1
    if target[i] == 'Benign':
        scaledTarget[i] = -1
I now try to set up my grid and fit the scaled data to targets.
# Generate parameters for the parameter grid.
CValues = np.logspace(-3, 3, 7)
GammaValues = np.logspace(-3, 3, 7)
kernelValues = ('poly', 'sigmoid')
# kernelValues = ('linear', 'rbf', 'sigmoid')
degreeValues = np.array([0, 1, 2, 3, 4])
coef0Values = np.logspace(-3, 3, 7)

# Generate the parameter grid.
paramGrid = dict(C=CValues, gamma=GammaValues, kernel=kernelValues,
                 coef0=coef0Values)

# Create and train a SVM classifier using the parameter grid and a
# stratified shuffle split.
stratifiedShuffleSplit = StratifiedShuffleSplit(n_splits=10, test_size=0.25,
                                                train_size=None, random_state=0)
clf = GridSearchCV(estimator=svm.SVC(), param_grid=paramGrid,
                   cv=stratifiedShuffleSplit, n_jobs=1)
clf.fit(scaledData, scaledTarget)
If I uncomment the line kernelValues = ('linear', 'rbf', 'sigmoid'), the code runs in approximately 50 seconds on my 16 GB i7-4950 3.6 GHz machine running Windows 10.
However, if I run the code as is, with 'poly' as a possible kernel value, the code hangs forever. For example, I ran it overnight yesterday and it had not returned anything when I got back to the office today.
Interestingly enough, if I create an SVM classifier with a poly kernel directly, it returns a result immediately:
clf = svm.SVC(kernel='poly', degree=2)
clf.fit(data, target)
It only hangs when I run the grid-search code above. I have not tried other cv methods to see if that changes anything.
Is this a bug in scikit-learn? Am I doing things properly? On a side note, is my method of doing grid search / cross-validation with GridSearchCV and StratifiedShuffleSplit sensible? It seems to me the most brute-force (i.e., time-consuming) but robust method.
Thank you!
