I'm using libsvm-3.21 for epsilon-svr. I have a training data with so many non zeros (sparse format). When I use svm-scale to scale the features into the range [0, 1], I'm getting this warning
WARNING: original #nonzeros 503981
> new #nonzeros 6450944
If feature values are non-negative and sparse, use -l 0 rather than the default -l -1
Should I ignore this warning, does this affect my predictions ?
Sparse inputs can be processed more efficiently. So if you can scale your data in a way as to preserve the zeros training your model might be quite a bit faster.
And a model that takes less time to train might end up giving you better results since you have more time optimising parameters.
Related
I am working on a time-series prediction problem using GradientBoostingRegressor, and I think I'm seeing significant overfitting, as evidenced by a significantly better RMSE for training than for prediction. In order to examine this, I'm trying to use sklearn.model_selection.cross_validate, but I'm having problems understanding the result.
First: I was calculating RMSE by fitting to all my training data, then "predicting" the training data outputs using the fitted model and comparing those with the training outputs (the same ones I used for fitting). The RMSE that I observe is the same order of magnitude the predicted values and, more important, it's in the same ballpark as the RMSE I get when I submit my predicted results to Kaggle (although the latter is lower, reflecting overfitting).
Second, I use the same training data, but apply sklearn.model_selection.cross_validate as follows:
cross_validate( predictor, features, targets, cv = 5, scoring = "neg_mean_squared_error" )
I figure the neg_mean_squared_error should be the square of my RMSE. Accounting for that, I still find that the error reported by cross_validate is one or two orders of magnitude smaller than the RMSE I was calculating as described above.
In addition, when I modify my GradientBoostingRegressor max_depth from 3 to 2, which I would expect reduces overfitting and thus should improve the CV error, I find that the opposite is the case.
I'm keenly interested to use Cross Validation so I don't have to validate my hyperparameter choices by using up Kaggle submissions, but given what I've observed, I'm not clear that the results will be understandable or useful.
Can someone explain how I should be using Cross Validation to get meaningful results?
I think there is a conceptual problem here.
If you want to compute the error of a prediction you should not use the training data. As the name says theese type of data are used only in training, for evaluating accuracy scores you ahve to use data that the model has never seen.
About cross-validation I can tell that it's an approach to find the best training/testing set. The process is as follows: you divide your data into n groups and you do various iterating changing the testing group you pick. If you have n groups you will do n iteration and each time the training and testing set will be different. It's more understamdable in the image below.
Basically what you should do it's kile this:
Train the model using months from 0 to 30 (for example)
See the predictions made with months from 31 to 35 as input.
If the input has to be the same lenght divide feature in half (should be 17 months).
I hope I understood correctly, othewise comment.
I have a time series forecasting case with ten features (inputs), and only one output. I'm using 22 timesteps (history of features) for one step ahead prediction using LSTM. Also, I apply MinMaxScaler for input normalization, but I don't normalize the output. The output contains some rare jumps (such as 20, 50, or more than 100), but the other values are between 0 and ~5 (all values are positive). In this case, it's important to forecast both normal and outlier outputs correctly so I dont want to miss the jumps in my forecasting model. I think if I use MinMaxScaler for output, most of the values will be something near the zero but the others (outliers) will be near one.
What is the best way to normalize the output? Should I leave it without normalization?
What is the best LSTM structure to handle this issue? (currently, I'm using LSTM with relu and Dense layer with relu as the last layer so I the output will be a positive value). I think I should select activation functions correctly for this case.
I think first of all, you should decide on a metric to measure performance. For example, do you want to use MAE or MSE? Or some other metric you decide based on the task at hand. For example, you may tolerate greater error for the "rare jumps", but not for the normal cases, or vice versa. Once you are decided on the error metric, ideally, you should set that as the cost function that the LSTM network would be minimizing.
Now the goal would be to minimize the desired error metric you set. If this was a convex problem, the scaling of the output will not matter. But we now that this is not the case with the complex deep learning architectures. What this means is that while minimizing the cost function with gradient decent, it might get stuck in a local minimum with a very delayed convergence. In this case, normalizing the output might help. How?
Assume that your output has a mean value of 5. With last layers parameters initialized around zero and a bias value of zero (i.e. the linear transformation of relu), the network needs to learn that the bias should be around 5. Depending on the complexity of the network this could take some epochs. However, if you normalize the data, or initialize the bias at 5, then your network starts with a good estimate of the bias and thus converges faster.
Now back to your questions:
I would at least make the output zero mean and use Dense layer with linear output.
The architecture you have seems fine, you can try stacking 2-4 LSTM layers if you think your input has complex time dependencies.
Feel free to update the OP with the the code and the performance you get and we can discuss what else can be improved.
I start with a basic Logistic Regression, using all defaults hyper-parameters. And I get a score of 0.8855
Question Next I run a RandomSearch to find the best hyper-parameters; According to the RandomSearch C=10 with Max_iterations=110 gives the score of 0.89
I run the logistic with these hyper parameters but get a much better accuracy, 0.91 !
Why am I not getting exactly the same number?
You will definitely not get the same accuracy when you run it again in your train set, this is because when you do k-fold cross validation to check the performance of a particular set of hyper parameters you will divide the entire data into k sets and use k-1 sets for training and validate it on the left over one set. And you repeat this process k times and each time you take a different set of data for validating. And finally you compute the average of all the k iterations and report your accuracy which is what you got in random_result.best_score_, the figure below explains the process
And now after getting the best set of hyperparameters you will fit it on the entire training data i.e. set 1, set 2 and set 3, so now it is prone to have some variations since the data has changed and you are evaluating on the entire train data. So what you observe is totally normal and the usual behavior.
I am trying to build a random forest on a slightly large data set - half million rows and 20K columns (dense matrix).
I have tried modifying the hyperparameters such as:
n_jobs = -1 or iterating over max depth. However it's either getting stopped because of a memory issue (I have a 320GB server) or the accuracy is very low (when i use a lower max_depth)
Is there a way where I can still use all the features and build the model without any memory issue or not loosing on accuracy?
In my opinion (don't know exactly your case and dataset) you should focus on extract information from your dataset, especially if you have 20k of columns. I assume some of them will not give much variance or will be redundant, so you can make you dataset slightly smaller and more robust to potential overfit.
Also, you should try to use some dimensionality reduction methods which will allows you to make your dataset smaller retaining the most of the variance.
sample code for pca
pca gist
PCA for example (did not mean to offend you if you already know this methods)
pca wiki
I want to feed images to a Keras CNN. The program randomly feeds either an image downloaded from the net, or an image of random pixel values. How do I set batch size and epoch number? My training data is essentially infinite.
Even if your dataset is infinite, you have to set both batch size and number of epochs.
For batch size, you can use the largest batch size that fits into your GPU/CPU RAM, by just trial and error. For example you can try power of two batch sizes like 32, 64, 128, 256.
For number of epochs, this is a parameter that always has to be tuned for the specific problem. You can use a validation set to then train until the validation loss is maximized, or the training loss is almost constant (it converges). Make sure to use a different part of the dataset to decide when to stop training. Then you can report final metrics on another different set (the test set).
It is because implementations are vectorised for faster & efficient execution. When the data is large, all the data cannot fit the memory & hence we use batch size to still get some vectorisation.
In my opinion, one should use a batch size as large as your machine can handle.