Using Kernel K-Means in Scikit - scikit-learn

I am working with a very large dataset (1.5 million rows) and thought about using an SVR.
Since there is so much data, I thought about switching to a linear SVM and using the Nystroem
method to construct a kernel from uniformly sampled data.
However, I would rather construct the kernel via Kernel K-Means, but I have not found an official
implementation yet.
This link provides an unofficial method, but it results in a very large model since it is serialized:
https://tslearn.readthedocs.io/en/stable/gen_modules/clustering/tslearn.clustering.KernelKMeans.html
Does anyone have a clue where to look for this, or how to implement it in code for an arbitrary dataset?
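For the Nystroem + linear SVM fallback described above, a minimal scikit-learn sketch could look like the following (the RBF kernel, gamma, n_components and the names X_train/y_train are placeholder assumptions, not taken from the question):

    from sklearn.kernel_approximation import Nystroem
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVR

    # Approximate an RBF kernel from a subsample of the rows, then fit a
    # linear SVR in the resulting feature space.
    model = make_pipeline(
        Nystroem(kernel="rbf", gamma=0.1, n_components=300, random_state=0),
        LinearSVR(C=1.0, max_iter=10_000),
    )
    model.fit(X_train, y_train)  # X_train, y_train: your (possibly subsampled) data

A Kernel K-Means based feature map would take the place of the Nystroem step; as noted above, scikit-learn itself does not ship one.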

Related

Loading a model checkpoint in a smaller amount of memory

I have a question that I can't find any answers to online. I have trained a model whose checkpoint file is about 20 GB. Since I do not have enough RAM on my system (or on Colaboratory/Kaggle either, the limit being 16 GB), I can't use my model for predictions.
I know that the model has to be loaded into memory for inference to work. However, is there a workaround or a method that can:
save some memory and allow the model to be loaded into 16 GB of RAM (for CPU), or into the memory of a TPU/GPU, and
work with any framework (since I will be working with both): TensorFlow + Keras, or PyTorch (which I am using right now)?
Is such a method even possible in either of these libraries? One of my tentative solutions was to load it in chunks, essentially maintaining a buffer for the model weights and biases and performing calculations accordingly, though I haven't found any implementations of that.
I would also like to add that I wouldn't mind the performance slowdown, since it is to be expected with low-specification hardware. As long as it doesn't take more than two weeks :) I can definitely wait that long...
You can try the following:
split the model into two parts
load the weights into both parts separately by calling model.load_weights(..., by_name=True)
call the first model with your input
call the second model with the output of the first model
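A rough sketch of that idea for a Keras model saved to an HDF5 checkpoint (the architecture, layer names and file name are placeholders; by_name=True only copies weights into layers whose names match the checkpoint):

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    # Part 1: rebuild the first half of the network with the original layer names.
    inp1 = keras.Input(shape=(784,))
    h1 = layers.Dense(512, activation="relu", name="dense_1")(inp1)
    part1 = keras.Model(inp1, h1)

    # Part 2: rebuild the second half, taking part 1's output shape as input.
    inp2 = keras.Input(shape=(512,))
    out = layers.Dense(10, activation="softmax", name="dense_2")(inp2)
    part2 = keras.Model(inp2, out)

    # Load only the matching layers from the full checkpoint into each part.
    part1.load_weights("full_model.h5", by_name=True)
    part2.load_weights("full_model.h5", by_name=True)

    # Two-stage inference (in practice you would free part1 before building part2).
    x = np.random.rand(1, 784).astype("float32")
    intermediate = part1.predict(x)
    predictions = part2.predict(intermediate)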

Use Keras with multiple GPUs without a Model object (not for training)

I have a bunch of tensor operations (matmul, transpose, etc..) I would like to run on a large dataset.
Since they are still matrix operations, and since I am using Keras generators to load the data batches, it would make sense to use GPUs to compute them.
Now, I've searched for a while and I can't seem to find the correct way to use Keras to do parallel GPU operations, using generators, outside of the standard Model object interface.
Does anyone know how to do it? Thanks!
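One hedged sketch of running such operations on GPUs with plain TensorFlow ops and explicit device placement, driven by a generator (the device names, the two-GPU round-robin and my_generator are assumptions, not an official Keras recipe):

    import tensorflow as tf

    @tf.function
    def batch_op(x):
        # Any sequence of tensor operations (matmul, transpose, ...).
        return tf.matmul(x, tf.transpose(x))

    results = []
    for i, batch in enumerate(my_generator):      # my_generator: your Keras-style generator
        with tf.device(f"/GPU:{i % 2}"):          # alternate batches between two GPUs
            results.append(batch_op(tf.convert_to_tensor(batch)))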

How do models in sklearn handle large data sets in Python?

Now I have a 10 GB data set to train a model in sklearn, but my computer only has 8 GB of memory, so do I have other ways to go besides an incremental classifier?
I think sklearn can be used for larger data if the technique is right. If your chosen algorithms support partial_fit or an online learning approach, then you're on track (see the sketch after this answer). The chunk size may influence your success.
This link may be useful (Working with big data in python and numpy, not enough ram, how to save partial results on the disc?).
Another thing you can do is to randomly pick whether or not to keep each row in your CSV file, and save the result to a .npy file so it loads quicker. That way you get a sample of your data that will let you start playing with all algorithms, and deal with the bigger-data issue along the way (or not at all! Sometimes a sample with a good approach is good enough, depending on what you want).
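A minimal sketch of the partial_fit / chunked route mentioned above, assuming a CSV with a "label" column (the classifier, chunk size, file and column names are placeholders):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier()                        # supports incremental learning via partial_fit
    classes = np.array([0, 1])                   # every class label must be declared up front

    for chunk in pd.read_csv("big_data.csv", chunksize=100_000):
        X = chunk.drop(columns=["label"]).to_numpy()
        y = chunk["label"].to_numpy()
        clf.partial_fit(X, y, classes=classes)   # one pass over one chunk at a time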

What kinds of optimization are used in PyTorch methods?

I'm using PyTorch to implement an intense sequence of matrix operations, using methods such as torch.mm or torch.dot. I was wondering whether PyTorch uses multithreading or other optimization mechanisms to speed up the process. I am not utilizing a GPU. I would appreciate it if you could tell me how fast these methods are and whether I need to take any action to help the process.
PyTorch uses an efficient BLAS implementation and multithreading (OpenMP, if I'm not wrong) to parallelize such operations across multiple cores. Some performance loss comes from Python itself: since it is an interpreted language, no significant compiler-like optimization can be done. You can use the jit module to speed up the "wrapper" code around the matrix multiplies, but for anything more than very small matrices this cost is probably negligible.
One big improvement you may be able to get manually, but which PyTorch does not apply automatically, is to properly order the matrix multiplies. As you probably know, depending on the matrix shapes, a multiplication ABCD may perform very differently when computed as A(B(CD)) than when computed as (AB)(CD), etc.
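A small PyTorch illustration of the ordering point (the shapes are made up to exaggerate the effect; torch.linalg.multi_dot, available in recent PyTorch versions, chooses an ordering for you):

    import torch

    A = torch.randn(1000, 10)
    B = torch.randn(10, 1000)
    C = torch.randn(1000, 10)

    left  = (A @ B) @ C                        # builds a 1000 x 1000 intermediate (~2e7 mult-adds)
    right = A @ (B @ C)                        # builds a 10 x 10 intermediate (~2e5 mult-adds)
    auto  = torch.linalg.multi_dot([A, B, C])  # lets PyTorch pick an efficient ordering

    print(torch.allclose(left, right, atol=1e-3))  # same result up to floating-point rounding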

SVM results in RapidMiner much worse than in KNIME

I'm comparing various classification algorithms for a project using KNIME. I was very happy with the results I got for Support Vector Machines (LibSVM).
I then wanted to try hierarchical classification and installed the RapidMiner plugin for KNIME. To get things to work, I first tested the SVM implementation without hierarchies.
Comparing the results of the KNIME LibSVM implementation and the RapidMiner LibSVM implementation, I noticed that the RapidMiner implementation yielded worse results.
The KNIME implementation in fact produced an error rate of approximately 2.4%, while the RapidMiner one produced an error rate of approximately 61%.
Why is that? Am I doing something wrong?
I use C-SVC SVMs with a linear kernel, cost 1.0, epsilon 0.001 and an 80 MB cache for both implementations.
The documents are Wikipedia article texts, preprocessed, transformed to a binary document vector and labeled with some kind of type.
I hope you can help me.
You do not need to include the Row IDs in this case (on the Row ID tab, click the button so that it shows "Do not use" in case it shows "Use" and the text field is not disabled), and you should not perform Nominal to... transformations on them. After that, you should get similar results in both cases.
