I wonder how / whether it is possible to run scikit-learn model / XGBoost training on a large dataset.
If I use a dataframe that contains several gigabytes of data, the machine crashes during training.
Can you assist me please?
The scikit-learn documentation has an in-depth discussion of different strategies for scaling models to bigger data.
Strategies include:
Streaming instances
Extracting features
Incremental learning (see also the partial_fit entry in the glossary)
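For example, incremental learning lets you stream a dataset that never fits in memory and update the model chunk by chunk with partial_fit. A minimal sketch (the file name, column names, and chunk size are placeholders, not from your setup):

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()   # any estimator that exposes partial_fit works
classes = [0, 1]        # every class label must be declared up front

# Read the CSV in chunks so only one chunk is ever held in memory.
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    X = chunk.drop(columns=["target"])
    y = chunk["target"]
    clf.partial_fit(X, y, classes=classes)  # update the model on this chunk only
```

XGBoost does not implement partial_fit, but it has its own external-memory training mode, documented separately in the XGBoost docs.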
I'm using a Windows 10 machine. Libraries: Keras with TensorFlow 2.0. Embeddings: GloVe (100 dimensions).
I am trying to implement an LSTM architecture for multi-label text classification.
I am using different types of fine-tuning to achieve better results but with no luck so far.
The main problem, I believe, is the imbalanced class distribution of my dataset, but after a lot of trial and error I couldn't implement a stratified k-fold split in Keras.
I am also experimenting with dropout layers, batch sizes, number of layers, learning rates, clip values, and validation splits, but I get only a minimal boost, or sometimes even worse performance.
For metrics, I use mainly ROC and F1.
I also followed a suggestion from a Stack Overflow member who said to delete some of my examples to balance the dataset, but if I do that I will be left with a very low number of examples.
What would you suggest to me?
If someone could provide code based on my implementation for a stratified k-fold split, I would be grateful, because I have checked all the online resources but can't implement it.
Any tips or suggestions will be really helpful.
Metrics Plots
Dataset form + Embeddings form + train-test-split form
Dataset's labels distribution
My LSTM implementation
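To make the question concrete, here is roughly the kind of fold loop I am trying to get working, sketched with dummy data and a toy model instead of my real GloVe/LSTM pipeline. I am not even sure that stratifying on a single derived label is the right approach for multi-label data, which is part of what I am asking:

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold

# Dummy data standing in for my padded sequences and multi-hot label matrix.
X = np.random.randint(0, 5000, size=(1000, 100))        # token ids
Y = (np.random.rand(1000, 5) > 0.7).astype("float32")   # 5 labels, multi-hot

def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(5000, 100),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(5, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Crude single-label key for stratification; proper multi-label stratification
# would need something like the iterative-stratification package.
strat_key = Y.argmax(axis=1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr, va) in enumerate(skf.split(X, strat_key)):
    model = build_model()   # rebuild so folds don't share weights
    model.fit(X[tr], Y[tr], validation_data=(X[va], Y[va]),
              epochs=3, batch_size=32, verbose=0)
```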
I am looking at the torch.multiprocessing module and trying to use it to speed up my code. I just want to repeatedly train several neural networks on the same dataset, so I don't want to share any training parameters among them except the initial dataset. I am wondering whether torch.multiprocessing suits this case. I saw many examples, but they are all about distributed learning.
A quick code example is appreciated!
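Here is roughly what I have in mind, sketched with a toy model and random data; I don't know whether this is the idiomatic way to use torch.multiprocessing for independent trainings, which is exactly what I am asking:

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def train(rank, X, y):
    # Each worker builds and optimizes its own model; nothing is shared
    # between workers except the read-only dataset tensors.
    model = nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    print(f"worker {rank}: final loss {loss.item():.4f}")

if __name__ == "__main__":
    mp.set_start_method("spawn")   # required for CUDA, works on Windows and Linux
    X = torch.randn(1000, 10)
    y = torch.randn(1000, 1)
    X.share_memory_()              # put the dataset in shared memory once
    y.share_memory_()
    workers = [mp.Process(target=train, args=(r, X, y)) for r in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```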
I am trying to replicate the hyperparameter tuning example reported at this link, but I want to use scikit-learn / XGBoost instead of TensorFlow in my training application.
I am able to run multiple trials in a single job, one for each hyperparameter combination. However, the Training output object returned by ML-Engine does not include the finalMetric field, which reports the metric information (see the differences in the picture below).
What I get with the example of the link above:
Training output object with TensorFlow training app
What I get running my training application with XGBoost:
Training output object with XGBoost training app
Is there a way for XGBoost to return training metrics to ML-Engine?
It seems that this process is automated for TensorFlow, as specified in the documentation:
How Cloud ML Engine gets your metric
You may notice that there are no instructions in this documentation
for passing your hyperparameter metric to the Cloud ML Engine training
service. That's because the service monitors TensorFlow summary events
generated by your training application and retrieves the metric.
Is there a similar mechanism for XGBoost?
Now, I can always dump each trial's metric results to a file at the end of the trial and then analyze them manually to select the best parameters. But, by doing so, am I losing the automated mechanism offered by Cloud ML Engine, especially concerning the "ALGORITHM_UNSPECIFIED" hyperparameter search algorithm?
i.e.,
ALGORITHM_UNSPECIFIED: [...] applies Bayesian optimization to search
the space of possible hyperparameter values, resulting in the most
effective technique for your set of hyperparameters.
Hyperparameter tuning support for XGBoost was implemented in a different way. We created the cloudml-hypertune Python package to help do it. We're still working on the public documentation for it. In the meantime, you can refer to this staging sample to learn how to use it.
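In short, the training application computes the evaluation metric itself and reports it through the package; a minimal sketch of the typical usage (the metric tag and value below are placeholders, and the tag must match the hyperparameterMetricTag set in the tuning job configuration):

```python
import hypertune  # pip install cloudml-hypertune

# ... train the XGBoost model, then compute a validation metric, e.g. AUC ...
validation_auc = 0.93  # placeholder value

hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag="roc_auc",  # must match the job's hyperparameterMetricTag
    metric_value=validation_auc,
    global_step=1,
)
```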
Sara Robinson over at Google put together a good post on how to do this. Rather than regurgitate it and claim it as my own, I'll post it here for anyone else who comes across this question:
https://sararobinson.dev/2019/09/12/hyperparameter-tuning-xgboost.html
I have seen that we can use scikit-learn with PySpark to work on a single partition on a single worker.
But what if we want to train on a dataset that is distributed, and the regression algorithm needs to consider the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it doesn't allow running the algorithm on the entire dataset but only on that particular partition. Please correct me if I'm wrong.
And how good is spark-sklearn at solving this problem?
As described in the documentation, spark-sklearn does address your requirements:
train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the multicore implementation included by default
in scikit-learn.
convert Spark's Dataframes seamlessly into numpy ndarrays or sparse matrices.
So, to specifically answer your questions:
But what if we want to train on a dataset that is distributed, and the regression algorithm needs to consider the entire dataset?
Since scikit-learn is not integrated with RDDs, I assume it doesn't allow running the algorithm on the entire dataset but only on that particular partition.
In spark-sklearn, Spark is used as a replacement for the joblib library as a multiprocessing framework. So going from execution on a single machine to execution on multiple machines is handled seamlessly by Spark for you. In other words, as stated in the Auto-scaling scikit-learn with Spark article:
no change is required in the code between the single-machine case and the cluster case.
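As an illustration, spark-sklearn's GridSearchCV is meant as a drop-in for scikit-learn's, with the SparkContext passed as an extra first argument. A minimal sketch, assuming a running SparkContext named sc (e.g. from a pyspark shell):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV  # pip install spark-sklearn

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [4, 8, None]}

# Same usage as scikit-learn's GridSearchCV, except the SparkContext comes
# first; each parameter combination is then trained as a separate Spark task.
gs = GridSearchCV(sc, RandomForestClassifier(), param_grid)
gs.fit(X, y)
print(gs.best_params_)
```

Note that each individual model still has to fit on a single machine; what gets distributed across the cluster is the search over parameter combinations.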
I am looking to implement a multi-label, multi-output classification algorithm with Spark, but I am surprised that there isn't any model in the Spark machine learning libraries that can do this.
How can I do this with Spark?
Otherwise, scikit-learn's logistic regression supports multi-label classification in input/output, but doesn't handle huge training data.
To view the scikit-learn code, please click on the following link:
https://gist.github.com/mkbouaziz/5bdb463c99ba9da317a1495d4635d0fc
Also, Spark has Logistic Regression, which supports multilabel classification according to the API documentation. See also this.
The problem you have with scikit-learn and the huge amount of training data will disappear with Spark, given an appropriate Spark configuration.
Another approach is to use a binary classifier for each of the labels your problem has, and obtain multilabel output by running relevant/irrelevant predictions for each label. You can easily do that in Spark using any binary classifier.
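A minimal sketch of that binary-relevance idea with PySpark ML (the column names and toy data here are placeholders): train one LogisticRegression per label column on the same features, then combine the per-label predictions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Toy multi-label dataset: a feature vector plus one 0/1 column per label.
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 1.0, 0.0),
     (Vectors.dense([2.0, 1.0]), 0.0, 1.0),
     (Vectors.dense([2.0, 1.3]), 1.0, 1.0),
     (Vectors.dense([0.0, 1.2]), 0.0, 0.0)],
    ["features", "label_a", "label_b"])

# One independent binary model per label (binary relevance).
models = {label: LogisticRegression(featuresCol="features", labelCol=label).fit(df)
          for label in ["label_a", "label_b"]}

# A row is tagged with every label whose model predicts "relevant" (1.0).
for label, model in models.items():
    model.transform(df).select("features", "prediction").show()
```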
Indirectly, what might also help is multilabel classification with nearest neighbors, which is also state of the art. There are some nearest-neighbor Spark extensions, such as Spark KNN or Spark KNN graphs, for instance.