Why is tuning an SVM on singular value decomposition (SVD) features extremely slow?

I have a large dataset of more than 300,000 variables and about 5,000 rows. By applying singular value decomposition with RSpectra, I retrieved 300 singular values. Running svm with hyperparameter tuning on these 300 variables has become incredibly slow: it took more than 17 hours on a machine with 24 GB of RAM. The same algorithm ran much faster when I used a document-feature matrix (dfm) of 60,000 variables and 5,000 rows.
library(doMC)
library(e1071)

registerDoMC(cores = 5)  # note: e1071::tune() does not use this foreach backend, so the grid search runs sequentially
set.seed(123)            # for reproducibility

start_time <- Sys.time()
svm_tuned_upsample <- tune(svm,
                           train.x = train_svd_df[, -1],
                           train.y = as.factor(train_svd_df$Include),
                           kernel = "radial",
                           type = "C-classification",
                           parallel = TRUE,   # has no effect: svm() itself is not parallelised
                           ranges = list(cost  = c(0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1, 5, 6, 7, 8, 10, 15),
                                         gamma = c(0.0009, 0.001, 0.002, 0.003, 0.0035, 0.004, 0.0045, 0.005)),
                           tunecontrol = tune.control(sampling = "cross", cross = 10))  # 10-fold CV; was incorrectly passed as validation.x
Sys.time() - start_time
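For a sense of the workload: the grid above is 14 cost values x 8 gamma values x 10 folds = 1,120 separate RBF-SVM fits on a dense 5,000 x 300 matrix. A minimal scikit-learn sketch of an equivalent search, with synthetic stand-in data, is shown below; it only illustrates the size of the job and is not the original R code.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 5000 x 300 matrix of SVD components and binary labels
rng = np.random.default_rng(123)
X_svd = rng.normal(size=(5000, 300))
y = rng.integers(0, 2, size=5000)

param_grid = {
    "C":     [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1, 5, 6, 7, 8, 10, 15],
    "gamma": [0.0009, 0.001, 0.002, 0.003, 0.0035, 0.004, 0.0045, 0.005],
}
# 14 * 8 * 10 = 1120 RBF-SVM fits, each on roughly 4,500 rows of dense data
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10, n_jobs=5)
# search.fit(X_svd, y)  # left commented out: this is exactly the long-running job in question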

Related

Ignore padding class (0) during multi-class classification

I have a problem where, given a set of tokens, I predict another token. For this task I use an embedding layer with vocab-size + 1 as input_size. The +1 is because the sequences are padded with zeros. E.g., given a vocab-size of 10,000 and max_sequence_len=6, x_train looks like:
array([[    0,     0,     0,    11,    22,     4],
       [   29,     6,    12,    29,  1576,    29],
       ...,
       [    0,     0,    67,  8947,  7274,  7019],
       [    0,     0,     0,    15, 10000,    50]])
y_train consists of integers between 1 and 10000; in other words, this becomes a multi-class classification problem with 10000 classes.
My problem: when I specify the output size of the output layer, I would like to use 10000, but then the model predicts the classes 0-9999. Another approach is to set the output size to 10001, but then the model can predict the 0-class (padding), which is unwanted.
Since y_train is mapped from 1 to 10000, I could remap it to 0-9999, but since the labels share their mapping with the input tokens, this seems like an unnecessary workaround.
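(For reference, a minimal sketch of that dismissed remap, with made-up example values; it is just a shift by one in each direction.)
import numpy as np

# Made-up example values, only to illustrate the remap dismissed above
y_train = np.array([11, 22, 4, 10000])   # labels in 1..10000, shared with the input vocabulary
y_zero_based = y_train - 1               # 0..9999, matches a 10000-unit output layer
pred_idx = np.array([10, 21, 3, 9999])   # hypothetical predicted class indices 0..9999
pred_tokens = pred_idx + 1               # shift back to the shared 1..10000 token ids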
EDIT:
I realize, as @Andrey pointed out in the comments, that I could allow for 10001 classes and simply add the padding token to the vocabulary, although I am never interested in the network predicting 0's.
How can I tell the model to predict the labels 1-10000 while having 10000 classes, not 10001?
I would use the following approach:
import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(10,), dtype=tf.int32)   # sequences of 10 token ids, matching the demo data below
x = tf.keras.layers.Embedding(10001, 512)(inputs)             # input covers the full vocab incl. padding [10001]
x = tf.keras.layers.Dense(10000, activation='softmax')(x)     # trainable weights cover the reduced vocab [10000]
z = tf.zeros(tf.shape(x)[:-1])[..., tf.newaxis]               # a constant zero column ...
x = tf.concat([z, x], axis=-1)                                # ... prepended as class 0, so padding is never predicted
model = tf.keras.Model(inputs=inputs, outputs=x)

demo_inputs = tf.random.uniform([10, 10], 0, 10001, dtype=tf.int32)  # tokens 0..10000 (0 = padding)
demo_labels = tf.random.uniform([10, 10], 1, 10001, dtype=tf.int32)  # labels 1..10000, as in the question
model.compile(loss='sparse_categorical_crossentropy')
model.fit(demo_inputs, demo_labels)
pred = model.predict(demo_inputs)  # position 0 of every prediction is the constant 0 (the minimum value)
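A short usage note on pred above: because position 0 carries a constant zero probability while every softmax output is strictly positive, an argmax over the last axis can never return the padding class.
import tensorflow as tf

# pred has shape (batch, 10, 10001); class 0 always has probability exactly 0,
# so the argmax below only ever yields token ids in 1..10000.
predicted_tokens = tf.argmax(pred, axis=-1)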

Getting Warning: "ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations." in ElasticNetCV sklearn

I am trying to run elastic net regression on a particular dataset. This is my code:
elastic = ElasticNetCV(l1_ratio=[0.001, 0.005, 0.01, 0.03, 0.07, 0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1],
                       alphas=[0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006,
                               0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6],
                       max_iter=50000, cv=10, tol=0.001)
elastic.fit(x_train, y_train)
alpha = elastic.alpha_
ratio = elastic.l1_ratio_   # was elastic.alpha_, which would have printed alpha twice
print('best alpha:', alpha)
print('best l1 ratio:', ratio)
I am receiving the following warnings and am unable to finish execution properly.
/opt/conda/lib/python3.7/site-packages/sklearn/linear_model/_coordinate_descent.py:527: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1.1304511543456042, tolerance: 0.16652297337822466
tol, rng, random, positive)
/opt/conda/lib/python3.7/site-packages/sklearn/linear_model/_coordinate_descent.py:527: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1.3276808651494596, tolerance: 0.17237036604178843
tol, rng, random, positive)
/opt/conda/lib/python3.7/site-packages/sklearn/linear_model/_coordinate_descent.py:527: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1.1082449562463794, tolerance: 0.17237036604178843
tol, rng, random, positive)
/opt/conda/lib/python3.7/site-packages/sklearn/linear_model/_coordinate_descent.py:527: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1.1717336811567476, tolerance: 0.16652297337822466
tol, rng, random, positive)
/opt/conda/lib/python3.7/site-packages/sklearn/linear_model/_coordinate_descent.py:527: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1.147435308235048, tolerance: 0.17237036604178843
tol, rng, random, positive)
I have tried increasing the tolerance value; however, the problem does not go away.
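A hedged sketch of the two adjustments this warning usually points to, raising max_iter further and putting the features on a comparable scale (the StandardScaler step and the trimmed grids are assumptions, not part of the original post; x_train and y_train are the question's data):
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardized features usually help coordinate descent converge; max_iter is raised as the warning suggests.
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.001, 0.01, 0.1, 0.5, 0.9, 0.95, 1],
                 alphas=[0.0001, 0.001, 0.01, 0.1, 1, 6],
                 max_iter=200000, cv=10, tol=0.001),
)
model.fit(x_train, y_train)
print('best alpha:', model[-1].alpha_)       # the fitted ElasticNetCV step
print('best l1 ratio:', model[-1].l1_ratio_)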

How to adjust the batch data by the amount of labels in PyTorch

I have made n-grams / doc-ids for document classification:
import torch
from torch.utils.data import TensorDataset, DataLoader

def create_dataset(tok_docs, vocab, n):
    n_grams = []
    document_ids = []
    for i, doc in enumerate(tok_docs):
        # use j for the sliding window so the document index i is not shadowed
        for n_gram in [doc[0][j:j + n] for j in range(len(doc[0]) - n + 1)]:
            n_grams.append(n_gram)
            document_ids.append(i)
    return n_grams, document_ids

def create_pytorch_datasets(n_grams, doc_ids):
    n_grams_tensor = torch.tensor(n_grams)
    doc_ids_tensor = torch.tensor(doc_ids)   # was "troch.tensor"
    full_dataset = TensorDataset(n_grams_tensor, doc_ids_tensor)
    return full_dataset
create_dataset returns a pair of (n_grams, document_ids) like below:
n_grams, doc_ids = create_dataset( ... )
train_data = create_pytorch_datasets(n_grams, doc_ids)
>>> train_data[0:100]
(tensor([[2076,  517,   54, 3647, 1182, 7086],
         [ 517,   54, 3647, 1182, 7086, 1149],
         ...
        ]),
 tensor([0, 0, 0, 0, 0, ..., 3, 3, 3]))
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
The first tensor holds the n-grams and the second one the doc_ids.
But, as you know, the amount of training data per label changes with the length of the documents.
If one document is very long, there are many pairs carrying its label in the training data.
I think this can cause overfitting, because the classification model tends to assign inputs to the long documents.
So I want to draw input batches with a uniform distribution over the labels (doc_ids). How can I fix this in the code above?
P.S.: if train_data looks like below, I want to sample each example with the following probability:
(n_gram, doc_id)        sampling probability
([1, 2, 3, 4], 1) ====> 0.33
([1, 3, 5, 7], 2) ====> 0.33
([2, 3, 4, 5], 3) ====> 0.33 * 0.25
([3, 5, 2, 5], 3) ====> 0.33 * 0.25
([6, 3, 4, 5], 3) ====> 0.33 * 0.25
([2, 3, 1, 5], 3) ====> 0.33 * 0.25
In PyTorch you can pass a sampler or a batch_sampler to the DataLoader to change how datapoints are sampled.
docs on the dataloader:
https://pytorch.org/docs/stable/data.html#data-loading-order-and-sampler
documentation on the sampler: https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler
For instance, you can use WeightedRandomSampler to assign a weight to every datapoint; the weight can be the inverse of the document length.
I would make the following modifications to the code:
from torch.utils.data import DataLoader, WeightedRandomSampler

def create_dataset(tok_docs, vocab, n):
    n_grams = []
    document_ids = []
    weights = []  # << list of weights for sampling
    for i, doc in enumerate(tok_docs):
        for n_gram in [doc[0][j:j + n] for j in range(len(doc[0]) - n + 1)]:
            n_grams.append(n_gram)
            document_ids.append(i)
            weights.append(1 / len(doc[0]))  # << n-grams of long documents are sampled less often
    return n_grams, document_ids, weights

sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)  # << create the sampler (one draw per datapoint per epoch; was num_samples=1)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=False, sampler=sampler)  # << shuffle must stay False when a sampler is given
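As a quick check against the probabilities in the question's P.S. (a toy example, not from the original answer): one n-gram each for documents 1 and 2 and four n-grams for document 3 give per-n-gram weights that normalize to 1/3, 1/3 and (1/3) * (1/4), i.e. the 0.33 and 0.33 * 0.25 from the table.
import torch
from torch.utils.data import WeightedRandomSampler

# Toy weights: docs 1 and 2 contribute one n-gram each, doc 3 contributes four
weights = [1.0, 1.0, 0.25, 0.25, 0.25, 0.25]
probs = torch.tensor(weights) / sum(weights)
print(probs)  # tensor([0.3333, 0.3333, 0.0833, 0.0833, 0.0833, 0.0833]); 0.0833 = (1/3) * (1/4)
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
print(list(sampler))  # six indices per epoch, drawn roughly uniformly per document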

What n_estimators and max_features mean in RandomForestRegressor

I was reading about fine-tuning the model using GridSearchCV and came across the parameter grid shown below:
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
Here I am not getting the concepts of n_estimators and max_features. Does n_estimators mean the number of records taken from the data, and max_features the number of attributes to be selected from the data?
Going further, I got this result:
>> grid_search.best_params_
{'max_features': 8, 'n_estimators': 30}
The thing is, I am not getting what this result actually wants to say.
After reading the documentation for RandomForestRegressor you can see that n_estimators is the number of trees to be used in the forest. Since random forest is an ensemble method built from multiple decision trees, this parameter controls how many trees are used in the process.
max_features, on the other hand, determines the maximum number of features to consider while looking for a split. For more information on max_features, read this answer.
n_estimators: the number of trees you want to build before taking the maximum vote or the average of the predictions (the forest fits each tree and then aggregates them into the final answer). A higher number of trees generally gives better performance but makes your code slower.
max_features: The number of features to consider when looking for the best split.
>> grid_search.best_params_ : {'max_features': 8, 'n_estimators': 30}
This means these are the best hyperparameters for your model among the candidate values n_estimators {3, 10, 30} and max_features {2, 4, 6, 8}.
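As a short illustration (reusing the housing_prepared / housing_labels data from the question), the selected parameters could be applied like this; the already-refit model is also available directly as grid_search.best_estimator_:
from sklearn.ensemble import RandomForestRegressor

# Refit a forest with the parameters the grid search selected
best_forest = RandomForestRegressor(n_estimators=30,   # 30 decision trees in the ensemble
                                    max_features=8,    # at most 8 features considered per split
                                    random_state=42)
best_forest.fit(housing_prepared, housing_labels)

# Equivalent shortcut: the best model, already refit on the full data by GridSearchCV
# best_forest = grid_search.best_estimator_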

What is the replacement for the softmax layer when more than one output can be activated?

For example, I have a CNN which tries to predict numbers from the MNIST dataset (code written using Keras). It has 10 outputs, which form a softmax layer. Only one of the outputs can be true (one output for each digit from 0 to 9):
Real: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
Predicted: [0.02, 0.9, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
The sum of the predicted values is equal to 1.0 due to the definition of softmax.
Let's say I have a task where I need to classify some objects that can fall in several categories:
Real: [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
So I need to normalize in some other way. I need a function which gives values in the range [0, 1] and whose sum can be larger than 1.
I need something like this:
Predicted: [0.1, 0.9, 0.05, 0.9, 0.01, 0.8, 0.1, 0.01, 0.2, 0.9]
Each number is probability that object falls in given category. After that I can use some threshold like 0.5 to distinguish categories in which given object falls.
The following questions appear:
So which activation function can be used for this?
Maybe this function already exists in Keras?
Maybe you can propose some other way to predict in this case?
Your problem is one of multi-label classification, and in the context of Keras it is discussed, for example, here: https://github.com/fchollet/keras/issues/741
In short, the suggested solution in Keras is to replace the softmax layer with a sigmoid layer and to use binary_crossentropy as the cost function.
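For reference, this is exactly the difference between the two activations: softmax couples the outputs so they always sum to 1, while an element-wise sigmoid scores each category independently in (0, 1). A tiny sketch with made-up logits:
import numpy as np

logits = np.array([-2.0, 3.0, -1.5, 2.8, -3.0, 2.5, -0.5, -4.0, -1.0, 3.1])  # made-up raw scores for 10 categories
softmax = np.exp(logits) / np.exp(logits).sum()   # coupled: always sums to 1, suits single-label problems
sigmoid = 1 / (1 + np.exp(-logits))               # independent: each value in (0, 1), sum unconstrained
print(softmax.sum())                              # 1.0
print((sigmoid >= 0.5).astype(int))               # [0 1 0 1 0 1 0 0 0 1]: several categories "on" at once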
An example adapted from that thread (updated to the current Keras API):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import f1_score

# Build a classifier optimized for maximizing f1_score (uses class_weights)
clf = Sequential()
clf.add(Dropout(0.3, input_shape=(xt.shape[1],)))
clf.add(Dense(1600, activation='relu'))
clf.add(Dropout(0.6))
clf.add(Dense(1200, activation='relu'))
clf.add(Dropout(0.6))
clf.add(Dense(800, activation='relu'))
clf.add(Dropout(0.6))
clf.add(Dense(yt.shape[1], activation='sigmoid'))  # sigmoid: one independent probability per label
clf.compile(optimizer=Adam(), loss='binary_crossentropy')
clf.fit(xt, yt, batch_size=64, epochs=300, validation_data=(xs, ys), class_weight=W, verbose=0)

preds = clf.predict(xs)
preds[preds >= 0.5] = 1   # threshold each label independently at 0.5
preds[preds < 0.5] = 0
print(f1_score(ys, preds, average='macro'))
