How to reduce the size of matrix - python-3.x

I have a large matrix of size 6000x10000. I want to reduce its size, for example to 3000x10000, but the PCA decomposition method I tried isn't working for me.
How can I do this without data loss?
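For what it's worth, a minimal sketch of one way to produce the requested 3000x10000 shape with scikit-learn (the matrix here is a random stand-in; note this is lossless only if the matrix has rank at most 3000, otherwise the lowest-variance components are discarded):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(6000, 10000)  # stand-in for the real matrix

# Transpose so the 6000 rows act as features, keep the top 3000
# principal components, then transpose back to get (3000, 10000).
pca = PCA(n_components=3000)
X_reduced = pca.fit_transform(X.T).T
print(X_reduced.shape)  # (3000, 10000)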

Related

Multi-Class imbalance with a 3D array

I have created a dataset for human activity recognition from accelerometer data (x, y, z) and heart rate (bpm). I have segmented the data into 2.5-second windows (20 Hz, equal to 50 samples), but there is a class imbalance that I would like to correct with a method such as SMOTE. The problem is that I have not found a way to do this without corrupting the samples.
The shape is (Y, 50, 4), where Y is of arbitrary length.
That means every newly generated sample has to keep the same (50, 4) segment shape.
The dataset will be used in training a CNN-LSTM model.
I can't do the over- and under-sampling by using reshape(-1, 4) beforehand and then reshaping back to the original shape; this would corrupt the segments of length 50.
Any idea how this can be done, preferably with implemented libraries such as scikit-learn or imbalanced-learn?
Or is the best approach to use class_weight when training the model in Keras?
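For reference, a minimal sketch of the class_weight route (the labels here are a hypothetical stand-in; it assumes one integer activity label per (50, 4) segment):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.random.randint(0, 5, size=1000)  # stand-in: one label per segment

# weight each class inversely to its frequency
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y), y=y)
class_weights = dict(enumerate(weights))  # assumes labels are 0..n_classes-1

# model.fit(X, y, epochs=..., class_weight=class_weights)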

What are some ways to speed up data loading on large sparse arrays (~1 million x 1 million, density ~0.0001) in PyTorch?

I am working on a binary classification problem. I have ~1.5 million data points, and the dimensionality of the feature space is 1 million. This dataset is stored as a sparse array with a density of ~0.0001. For this post, I'll limit the scope to assume that the model is a shallow feedforward neural network, and that the dimensionality has already been optimized (so it cannot be reduced below 1 million). Naive approaches to creating mini-batches out of this data to feed into the network take a lot of time (for example, a basic approach of creating a TensorDataset (map style) from a torch.sparse.FloatTensor representation of the input array and wrapping a DataLoader around it means ~20 s to get a mini-batch of 32 to the network, versus ~0.1 s to perform the actual training). I am looking for ways to speed this up.
What I've tried
I first figured that reading from such a large sparse array on every iteration of the DataLoader was computationally intensive, so I broke this sparse array down into smaller sparse arrays.
To have the DataLoader read from these multiple sparse arrays iteratively, I replaced the map-style dataset inside the DataLoader with an IterableDataset, and streamed the smaller sparse arrays into it like so:
import torch
from itertools import chain
from scipy import sparse

class SparseIterDataset(torch.utils.data.IterableDataset):
    def __init__(self, fpaths):
        super().__init__()
        self.fpaths = fpaths

    def read_from_file(self, fpath):
        # load one sparse shard, densify it, and yield one sample at a time
        data = sparse.load_npz(fpath).toarray()
        for d in data:
            yield torch.Tensor(d)

    def get_stream(self, fpaths):
        # chain the per-file generators into a single stream of samples
        return chain.from_iterable(map(self.read_from_file, fpaths))

    def __iter__(self):
        return self.get_stream(self.fpaths)
With this approach, I was able to bring the time down from the naive base case of ~20 s to ~0.2 s per mini-batch of 32. However, given that my dataset has ~1.5 million samples, this still implies a lot of time spent even making one pass through the dataset. (As a comparison, even though it's slightly apples to oranges, running logistic regression in scikit-learn on the original sparse array takes about ~6 s per iteration through the whole dataset. With PyTorch, with the approach I just outlined, it would take ~3000 s just to load all the mini-batches in an epoch.)
One thing which I am aware of but have yet to try is multiprocess data loading by setting the num_workers argument on the DataLoader. I believe this has its own catches in the case of iterable-style datasets, though (a sketch of the standard sharding workaround is below). Plus, even a 10x speedup would still mean ~300 s per epoch just loading mini-batches. I feel I'm being inordinately slow! Are there any other approaches/improvements/best practices that you could suggest?
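For reference, a sketch of that sharding workaround, adapted from the worker-splitting example in the PyTorch IterableDataset docs (it reuses the class above and assumes fpaths is the same file list):

import math
import torch

def worker_init_fn(worker_id):
    # Each worker gets its own copy of the dataset; without sharding,
    # every worker would yield every sample. Restrict each copy to a
    # disjoint slice of the file list.
    info = torch.utils.data.get_worker_info()
    per_worker = math.ceil(len(info.dataset.fpaths) / info.num_workers)
    start = info.id * per_worker
    info.dataset.fpaths = info.dataset.fpaths[start:start + per_worker]

loader = torch.utils.data.DataLoader(SparseIterDataset(fpaths), batch_size=32,
                                     num_workers=4, worker_init_fn=worker_init_fn)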
Your dataset in un-sparsified form would be 1.5M x 1M x 1 byte = 1.5 TB as uint8, or 1.5M x 1M x 4 bytes = 6 TB as float32. Simply reading 6 TB from memory to the CPU could take 5-10 minutes on a modern machine (depending on the architecture), and transfer speeds from CPU to GPU would be a bit slower than that (an NVIDIA V100 on PCIe has 32 GB/s theoretical bandwidth).
Approaches:
Benchmark everything individually, e.g. in Jupyter:
data = sparse.load_npz(fpath)   # load the sparse array once
%timeit sparse.load_npz(fpath)
dense = data.toarray()          # un-sparsify for comparison
%timeit data.toarray()
%timeit torch.tensor(dense)     # probably about the same as the line above
Also print out the shapes and datatypes of everything to make sure they are as expected. I haven't tried running your code, but I am pretty sure that (a) sparse.load_npz is extremely fast and unlikely to be a bottleneck, and (b) torch.tensor(dense) produces a dense tensor and is also quite slow here.
Use torch.sparse. I think torch sparse tensors can be used as regular tensors in most cases. You'd have to do some data prep to convert from scipy.sparse to torch.sparse:
A sparse tensor is represented as a pair of dense tensors: a tensor of values and a 2D tensor of indices. A sparse tensor can be constructed by providing these two tensors, as well as the size of the sparse tensor.
You mention torch.sparse.FloatTensor, but I'm pretty sure you're not making sparse tensors in your code: there is no reason to expect one to be constructed simply by passing a scipy.sparse array to a regular tensor constructor, since that's not how they're usually made.
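For illustration, a minimal sketch of one way to do the conversion (the matrix is a random stand-in; sizes and dtypes are assumptions):

import numpy as np
import torch
from scipy import sparse

# COO format gives direct access to the row/column indices
mat = sparse.random(1000, 1000, density=1e-4, format="coo", dtype=np.float32)

indices = torch.tensor(np.vstack([mat.row, mat.col]), dtype=torch.long)
values = torch.tensor(mat.data)
t = torch.sparse_coo_tensor(indices, values, size=mat.shape)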
If you figure out a good way to do this, I recommend you post it as a project or gist on GitHub; it would be quite useful.
If torch.sparse doesn't work out, think of other ways to either convert the data to dense only on the GPU, or avoid converting it entirely.
See also:
https://towardsdatascience.com/sparse-matrices-in-pytorch-be8ecaccae6
https://github.com/rusty1s/pytorch_sparse

If Keras is forcing me to use a large batch size for prediction, can I simply fill in a bunch of fake values and only look at the predictions I need

...or is there a way to circumvent this?
In stateful LSTMs, I have to define a batch size, and Keras forces me to use the same batch size for training as for prediction, but I find that my modeling problem depends a lot on having larger batch sizes to achieve good performance.
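One commonly suggested workaround (a sketch with made-up layer sizes, not necessarily the only approach) is to train with the large batch size, then rebuild an identical model with batch size 1 and copy the trained weights across:

from tensorflow import keras

def build_model(batch_size, timesteps, features):
    # a stateful LSTM needs the batch size baked into the input shape
    return keras.Sequential([
        keras.layers.LSTM(32, stateful=True,
                          batch_input_shape=(batch_size, timesteps, features)),
        keras.layers.Dense(1),
    ])

train_model = build_model(batch_size=64, timesteps=10, features=3)
# ... compile and fit train_model as usual ...

# rebuild with batch size 1 for prediction and copy the learned weights over
pred_model = build_model(batch_size=1, timesteps=10, features=3)
pred_model.set_weights(train_model.get_weights())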

Fitting a random forest model on a large dataset - a few million rows and a few thousand columns

I am trying to build a random forest on a fairly large dataset: half a million rows and 20K columns (dense matrix).
I have tried modifying hyperparameters such as n_jobs = -1 and iterating over max_depth. However, it either gets stopped because of a memory issue (I have a 320 GB server) or the accuracy is very low (when I use a lower max_depth).
Is there a way I can still use all the features and build the model without memory issues and without losing accuracy?
In my opinion (I don't know your exact case and dataset), you should focus on extracting information from your dataset, especially since you have 20k columns. I assume some of them contribute little variance or are redundant, so you can make your dataset smaller and more robust to potential overfitting.
You should also try dimensionality reduction methods, which will allow you to shrink the dataset while retaining most of the variance.
PCA, for example (did not mean to offend you if you already know these methods); a minimal sketch follows.
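A minimal sketch of the PCA route in scikit-learn (the data is a random stand-in and the 95% variance threshold is an illustrative choice):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 500)  # stand-in for the real feature matrix

# keep just enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())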

Comparing parallel k-means batch vs mini-batch speed

I am trying to cluster 250k vectors of 1000 dimensions using k-means. The machine that I am working on has 80 dual-cores.
Just confirming: has anyone compared the run time of the default batch parallel version of k-means against the mini-batch version? The example comparison page in the sklearn documentation doesn't provide much info, as the dataset used is quite small.
Any help is much appreciated.
Conventional wisdom holds that Mini-Batch K-Means should be faster and more efficient for datasets larger than 10,000 samples. Since you have 250,000 samples, you should probably use mini-batch unless you want to test it out on your own.
Note that the example you referenced can very easily be changed to a 5000, 10,000 or 20,000 point example by changing n_samples in this line:
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)
I agree that this won't necessarily scale the same way for 1000-dimensional vectors, but since you are constructing the example, are using either k-means or mini-batch k-means, and it only takes a second to switch between them, you should just do a scaling study with your 1000-dimensional vectors for 5k, 10k, 15k, and 20k samples.
Theoretically, there is no reason why Mini-Batch K-Means should underperform K-Means due to vector dimensionality, and we know it does better for larger sample sizes, so I would go with mini-batch off the cuff, i.e., bias for action over research. A sketch of such a scaling study is below.
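A sketch of such a scaling study (all sizes and parameters are illustrative; timings depend on your machine):

import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

centers = 8
for n_samples in (5000, 10000, 15000, 20000):
    X, _ = make_blobs(n_samples=n_samples, n_features=1000,
                      centers=centers, cluster_std=0.7)
    for Model in (KMeans, MiniBatchKMeans):
        model = Model(n_clusters=centers, n_init=3)
        start = time.perf_counter()
        model.fit(X)
        print(Model.__name__, n_samples,
              round(time.perf_counter() - start, 2), round(model.inertia_, 1))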
