Fewer features, longer model training time - scikit-learn

I use machine learning algorithms in malware analysis. When I vary the input features, I get strange training times. For example:
4 features (A, B, C, D): training time is 3 seconds.
3 features (A, B, C): training time is 5 seconds.
2 features (A, B): training time is 8 seconds.
1 feature (A): training time is 4 seconds.
This happens with both MLP and Random Forest. In my opinion, training should be faster with fewer features, but the result is completely different.
With KNN, the results look like this:
Using 6, 5, 4, or 3 features (A, B, C, D, E, F), model testing time is about 1.1 seconds, almost the same.
2 features (A, B): model testing time is 3 seconds.
1 feature (A): model testing time is 5 seconds.
My dataset has 17K records and I use 10-fold cross-validation. The features are sorted by entropy: feature A has the highest entropy and feature F the lowest. I run the tests on Google Colab with sklearn. I have tried several times on different dates, and the trend is the same.
My dataset has 79 features in total; this behaviour only appears when I use a few of them.
Thanks to anyone who replies. I have no idea what is going on.

It does seem at first glance that having fewer features should result in lower training times. However, depending on which algorithm is being used, this may not be the case. During training, an objective (loss) function is minimized by the algorithm. Taking the case of the MLP neural network, if you change the features (especially depending on whether they are informative or not), you change the optimization problem (the "error surface") over which the minimization occurs, and the minima may become harder to find, resulting in more steps and longer training before the convergence criteria are satisfied.
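To see this effect concretely, here is a minimal sketch (using a synthetic dataset, not the asker's malware data; sizes and feature counts are illustrative) that reports fit time and iteration count for an MLP as informative features are dropped:

import time
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: 17k rows, 6 features, 4 of them informative.
# shuffle=False keeps the informative features in the first columns.
X, y = make_classification(n_samples=17000, n_features=6, n_informative=4,
                           n_redundant=0, shuffle=False, random_state=0)

for n_features in (4, 3, 2, 1):
    clf = MLPClassifier(max_iter=500, random_state=0)
    start = time.perf_counter()
    clf.fit(X[:, :n_features], y)
    elapsed = time.perf_counter() - start
    # With less informative input, more iterations may be needed to satisfy
    # the convergence criteria, so the elapsed time can grow.
    print(n_features, "features:", round(elapsed, 2), "s,", clf.n_iter_, "iterations")

Whether the time actually grows depends on the data; the point is that fit time is driven by the number of optimizer steps, not only by the width of the input.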

Related

Can the increase in training loss lead to better accuracy?

I'm working on a competition on Kaggle. First, I trained a Longformer base with the competition dataset and achieved a quite good result on the leaderboard. Due to the CUDA memory limit and time limit, I could only train 2 epochs with a batch size of 1. The loss started at about 2.5 and gradually decreased to 0.6 by the end of my training.
I then continued training for 2 more epochs using those saved weights. This time I used a slightly larger learning rate (the one from the Longformer paper) and added the validation data to the training data (meaning I no longer split the dataset 90/10). I did this to try to achieve a better result.
However, this time the loss started at about 0.4 and steadily increased to 1.6 about halfway through the first epoch. I stopped because I didn't want to waste computational resources.
Should I have waited longer? Could it eventually have led to a better test result? I think the model may have been slightly overfitting at first.
Your model got fitted to the original training data the first time you trained it. When you added the validation data to the training set the second time around, the distribution of your training data must have changed significantly. Thus, the loss increased in your second training session since your model was unfamiliar with this new distribution.
Should you have waited longer? Yes, the loss would eventually have decreased (although not necessarily to a value lower than the original training loss).
Could it have led to a better test result? Probably. It depends on whether your validation data contains patterns that are:
Not already present in your training data
Similar to those your model will encounter in deployment
In fact, it's possible for an increase in training loss to coincide with an increase in training accuracy. Accuracy is not perfectly (negatively) correlated with any loss function, simply because a loss function is a continuous function of the model outputs whereas accuracy is a discrete function of them. For example, a model whose predictions are low-confidence but always correct is 100% accurate, whereas a model whose predictions are high-confidence but occasionally wrong can produce a lower loss value yet less than 100% accuracy.
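A toy numerical illustration of this (the probabilities below are made up, not from any real model):

import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Model A: barely on the right side of 0.5 every time -> 100% accurate, higher loss.
p_a = np.array([0.55, 0.55, 0.55, 0.55, 0.45, 0.45, 0.45, 0.45])
# Model B: very confident, but wrong on one example -> lower loss, 87.5% accuracy.
p_b = np.array([0.99, 0.99, 0.99, 0.01, 0.01, 0.01, 0.01, 0.01])

for name, p in (("A", p_a), ("B", p_b)):
    preds = (p > 0.5).astype(int)
    print(name, "accuracy:", accuracy_score(y_true, preds),
          "log loss:", round(log_loss(y_true, p), 3))

Model A is 100% accurate with a cross-entropy of about 0.60, while model B is only 87.5% accurate with a slightly lower cross-entropy of about 0.58, so the two metrics can disagree.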

RandomForestRegressor: questions about output, parameters and execution time

I have used the following code to run and evaluate a RandomForestRegressor model for my data:
My dataset has 36 features and 1 label, with around 31 million rows. The features are continuous and the labels are binary.
I have the following questions:
When I use np.unique(Y_Pred) it returns array([0. , 0.5, 1. ]). Why am I getting 0.5 as an output? Is there a parameter I can change in the model to fix it? I don't know whether to count it as a 1 or a 0. For now I've counted it as a 1 (hence Y_Pred > 0.45 in my code).
The documentation says the most important parameters to adjust are n_estimators and max_features. What is a reasonable number for n_estimators? I started at 2 because of how long it took to run on my TPU Google Colab session (43 minutes per tree, or 86 minutes total). Should I bother increasing the number of trees to improve accuracy? Are there any other parameters I can change to improve speed? All of my features are reasonably important, so I don't want to start dropping them.
Is there anything I am doing wrong that is making it slow, or anything I can do to make it faster?
Any help would be greatly appreciated.
Since your labels are binary, you should use RandomForestClassifier, so that you get 0 or 1 directly from the model.
You could also play with the max_samples parameter to reduce the number of data points used for each tree in the random forest. Since you have 31 million records, it makes sense to subsample them for each tree.
Limiting max_depth can also greatly reduce training time. You need to find the sweet spot that balances computation time and model performance.
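A rough sketch of these suggestions put together (the parameter values are illustrative starting points, not tuned for this dataset; max_samples requires scikit-learn 0.22 or newer):

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=50,       # more, cheaper trees instead of 2 expensive ones
    max_samples=100_000,   # each tree is fit on a subsample of the 31M rows
    max_depth=20,          # cap tree depth to bound per-tree fit time
    n_jobs=-1,             # build trees on all CPU cores in parallel
    random_state=0,
)
# model.fit(X_train, Y_train)
# Y_pred = model.predict(X_test)   # predictions are now strictly 0 or 1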

How to split the training data and test data for LSTM for time series prediction in Tensorflow

I recently learned about LSTMs for time series prediction from
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb
In his tutorial, he says: Instead of training the Recurrent Neural Network on the complete sequences of almost 300k observations, we will use the following function to create a batch of shorter sub-sequences picked at random from the training data.
import numpy as np

# num_x_signals, num_y_signals, num_train, x_train_scaled and y_train_scaled
# are defined earlier in the tutorial notebook.

def batch_generator(batch_size, sequence_length):
    """
    Generator function for creating random batches of training-data.
    """
    # Infinite loop.
    while True:
        # Allocate a new array for the batch of input-signals.
        x_shape = (batch_size, sequence_length, num_x_signals)
        x_batch = np.zeros(shape=x_shape, dtype=np.float16)

        # Allocate a new array for the batch of output-signals.
        y_shape = (batch_size, sequence_length, num_y_signals)
        y_batch = np.zeros(shape=y_shape, dtype=np.float16)

        # Fill the batch with random sequences of data.
        for i in range(batch_size):
            # Get a random start-index.
            # This points somewhere into the training-data.
            idx = np.random.randint(num_train - sequence_length)

            # Copy the sequences of data starting at this index.
            x_batch[i] = x_train_scaled[idx:idx+sequence_length]
            y_batch[i] = y_train_scaled[idx:idx+sequence_length]

        yield (x_batch, y_batch)
He uses this to create batches of sub-sequences for training.
My question is: can we first randomly shuffle x_train_scaled and y_train_scaled, and then sample batches using the batch_generator above?
My motivation for this question is that, for time series prediction, we want to train on the past and predict the future. So is it legitimate to shuffle the training samples?
In the tutorial, the author picks a piece of contiguous samples, such as
x_batch[i] = x_train_scaled[idx:idx+sequence_length]
y_batch[i] = y_train_scaled[idx:idx+sequence_length]
Can we pick x_batch and y_batch that are not contiguous? For example, x_batch[0] is picked at 10:00 am and x_batch[1] is picked at 9:00 am on the same day.
In summary, my two questions are:
(1) Can we first randomly shuffle x_train_scaled and y_train_scaled, and then sample batches using the batch_generator above?
(2) When we train an LSTM, do we need to consider the influence of time order? What parameters does the LSTM learn?
Thanks
(1) We cannot. Imagine trying to predict the weather for tomorrow. Would you want a sequence of temperature values for the last 10 hours, or random temperature values from the last 5 years?
Your dataset is a long sequence of values at a 1-hour interval. Your LSTM takes in a sequence of samples that are chronologically connected. For example, with sequence_length = 10 it can take the data from 2018-03-01 09:00:00 to 2018-03-01 19:00:00 as input. If you shuffle the dataset before generating batches of such sequences, you will train your LSTM to predict based on a sequence of random samples drawn from your whole dataset.
(2) Yes, we need to consider temporal ordering for time series. You can find ways to test your time series LSTM in Python here: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
The train/test data must be split in a way that respects the temporal ordering, so that the model is never trained on data from the future and is only tested on data from the future.
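For example, scikit-learn's TimeSeriesSplit (a different tool than the tutorial uses; the arrays below are placeholders) produces folds where every test index comes after every training index:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(1000).reshape(-1, 1)   # stand-in for time-ordered features
y = np.arange(1000)                  # stand-in for time-ordered targets

for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Every training index precedes every test index, so the model is never
    # trained on data from the future relative to its test window.
    assert train_idx.max() < test_idx.min()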
It depends a lot on the dataset. For example, the weather on a random day in the dataset is highly related to the weather of the surrounding days. So, in this case, you should try a stateful LSTM (i.e., an LSTM that uses the previous records as input to the next one) and train in order.
However, if your records (or a transformation of them) are independent of each other but depend on some notion of time, such as the inter-arrival time of the items in a record or a subset of these records, there should be noticeable differences when using shuffling. In some cases it will improve the robustness of the model; in other cases it will not generalize. Noticing these differences is part of evaluating the model.
In the end, the question is: is the "time series" as given really a time series (i.e., do records really depend on their neighbors), or is there some transformation that can break this dependency while preserving the structure of the problem? For this question, there is only one way to get the answer: explore the dataset.
As for authoritative references, I will have to let you down. I learned this from a seasoned researcher in the field, but, according to him, he learned it through a lot of experimentation and failures. As he told me: these aren't rules, they are guidelines; try all the solutions that fit your budget; improve on the best ones; try again.

Gensim doc2vec file stream training worse performance

Recently I switched to gensim 3.6, and the main reason was the optimized training process, which streams the training data directly from a file, thus avoiding GIL performance penalties.
This is how I used to train my doc2vec:
from multiprocessing import cpu_count
from gensim.models.doc2vec import Doc2Vec

training_iterations = 20
d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025, min_alpha=0.00025, dm=0)
d2v.build_vocab(corpus)

for epoch in range(training_iterations):
    d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.iter)
    d2v.alpha -= 0.0002
    d2v.min_alpha = d2v.alpha
And it classified documents quite well; the only drawback is that, while training, the CPUs are utilized at about 70%.
So the new way:
from gensim.utils import save_as_line_sentence

corpus_fname = "spped.data"
save_as_line_sentence(corpus, corpus_fname)

# Choose the number of cores you want to use (let's use all; models scale linearly now!)
num_cores = cpu_count()

# Train the model using all cores
d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores, dm=0, vector_size=200, epochs=50)
Now all CPUs are utilized at 100%, but the model is performing very poorly.
According to the documentation, I should not call the train method myself either; I should only set the epoch count (not manage iterations), and the min_alpha and alpha values should not be touched.
The configuration of both Doc2Vec models looks the same to me, so is there an issue with my new setup or configuration, or is there something wrong with the new version of gensim?
P.S. I am using the same corpus in both cases. I also tried an epoch count of 100, as well as smaller numbers like 5-20, but I had no luck.
EDIT: The first model was doing 20 iterations of 5 epochs each, while the second was doing 50 epochs, so having the second model run 100 epochs made it perform even better, since I was no longer managing the alpha by myself.
About the second issue that popped up: when providing a file with line documents, the doc ids did not always correspond to the lines. I didn't manage to figure out what could be causing this; it seems to work fine for a small corpus. If I find out what I am doing wrong I will update this answer.
The final configuration for a corpus of size 4GB looks like this:
d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025, min_alpha=0.00025, dm=0)
d2v.build_vocab(corpus)
d2v.train(corpus, total_examples=d2v.corpus_count, epochs=100)
Most users should not be calling train() more than once in their own loop, trying to manage the alpha and iterations themselves. It is too easy to do it wrong.
Specifically, your code that calls train() in a loop is doing it wrong. Whatever online source or tutorial you modeled this code on, you should stop consulting it, as it is misleading or outdated. (The notebooks bundled with gensim are better examples on which to base any code.)
Even more specifically: your looping code is actually doing 100 passes over the data: 20 outer loops, each with the default d2v.iter of 5 passes per train() call. Your first train() call smoothly decays the effective alpha from 0.025 to 0.00025, a 100x reduction. But your next train() call then uses a fixed alpha of 0.0248 for 5 passes, then 0.0246, and so on, until your last loop does 5 passes at alpha = 0.0212, still about 85% of the starting value. That is, the lowest alpha is reached at the end of your very first train() call, after which the effective learning rate jumps back up and never decays properly again.
Make the two options exactly the same, except for the way the corpus is specified: corpus_file instead of an iterable corpus.
You should get similar results from both corpus forms. (If you have a reproducible test case where the same corpus gets very different-quality results, and there isn't some other error, that could be worth reporting to gensim as a bug.)
If the results for both aren't as good as when you were (wrongly) managing train() and alpha yourself, it is likely because you aren't doing a comparable amount of total training.
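A sketch of what "exactly the same except for the corpus" could look like (corpus stands for the same iterable used in the question; this assumes gensim 3.6+, where corpus_file and save_as_line_sentence are available):

from multiprocessing import cpu_count
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import save_as_line_sentence

# Option A: iterable corpus, one build_vocab() and one train() call, no manual alpha.
d2v_a = Doc2Vec(vector_size=200, dm=0, workers=cpu_count(), epochs=100)
d2v_a.build_vocab(corpus)
d2v_a.train(corpus, total_examples=d2v_a.corpus_count, epochs=d2v_a.epochs)

# Option B: the same settings, but streaming from a file for better thread scaling.
save_as_line_sentence(corpus, "corpus.txt")
d2v_b = Doc2Vec(corpus_file="corpus.txt", vector_size=200, dm=0,
                workers=cpu_count(), epochs=100)

With vector_size, dm, and the total number of epochs matched, any remaining quality gap would point at something else, such as the doc-id/line mismatch mentioned in the edit above.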

How to efficiently implement training multiple related time series in Keras?

I have 5 time series that I want a neural network to predict. The time series are related to each other. Each time series consists of numbers between 0 and 100. I want to predict the next number for each time series. I already have a model to train one time series using a GRU and that works reasonably well. I have tried two strategies:
I normalized the numbers and made the problem a regression problem. The best validation accuracy so far is 0.38.
I one-hot-encoded the time series, and this works significantly better (an added accuracy of 0.15) but it costs 100 times as much memory.
For 5 time series, I tried 5 independent models but in that case the relationship between the 5 time series was lost. I am wondering what an efficient strategy to proceed might be. I can think of two myself but I might be missing something:
I can stack the input so that I have a five-hot-encoded input instead of 5 separate one-hot-encoded inputs. Can this be done?
I can create 5 models and merge them. I am not sure what to do with the output. Should I split the model, with one output for each time series?
Is there a strategy I have overlooked? Memory is a problem. With thousands of time series, with sample lengths of 100, the data uses a lot of memory and processing time. I Googled around but could not find an efficient strategy. Could someone suggest how to implement this problem efficiently in Keras?
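One way to keep the relationship between the series without the one-hot memory blow-up is a single shared encoder with one small head per series, along the lines of the second idea above. A rough sketch (layer sizes, window length, and names are placeholders, not a tested architecture):

from tensorflow.keras import layers, Model

timesteps = 100   # length of each input window (assumption)
n_series = 5      # the 5 related time series, stacked as input channels

inputs = layers.Input(shape=(timesteps, n_series))
encoded = layers.GRU(64)(inputs)                  # shared representation of all series

# One regression head per series, each predicting the next (normalized) value.
outputs = [layers.Dense(1, name=f"series_{i}")(encoded) for i in range(n_series)]

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
# model.fit(X, [y0, y1, y2, y3, y4], ...)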

Resources