How do I find distribution of a very large dataset - statistics

I have an airline.csv file which has around 1 million entries. What can I do to find the distribution of such a large dataset?

There are a lot of probability distributions in statistics, so you need to select a distribution according to the predictions that need to be made using the existing dataset.
https://en.wikipedia.org/wiki/Distribution_fitting
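For a first look at the fitting step, a minimal sketch might look like the following; it assumes the quantity of interest is a numeric column, here hypothetically called "ArrDelay", in airline.csv:

import pandas as pd
from scipy import stats

# One million rows fit comfortably in memory; read only the column of interest.
# "ArrDelay" is a hypothetical column name -- substitute your own.
data = pd.read_csv("airline.csv", usecols=["ArrDelay"])["ArrDelay"].dropna().values

# Fit a few candidate distributions by maximum likelihood and compare them
# with a Kolmogorov-Smirnov statistic (smaller = better fit).
candidates = {"normal": stats.norm, "Student t": stats.t, "Laplace": stats.laplace}
for name, dist in candidates.items():
    params = dist.fit(data)
    ks_stat, _ = stats.kstest(data, dist.cdf, args=params)
    print(f"{name:10s} params={tuple(round(p, 3) for p in params)}  KS={ks_stat:.4f}")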

Related

Fitting a random forest model on a large dataset - few million rows and few thousands columns

I am trying to build a random forest on a slightly large data set - half million rows and 20K columns (dense matrix).
I have tried modifying the hyperparameters such as:
n_jobs = -1 or iterating over max_depth. However, it either gets stopped because of a memory issue (I have a 320GB server) or the accuracy is very low (when I use a lower max_depth).
Is there a way I can still use all the features and build the model without running into memory issues or losing accuracy?
In my opinion (I don't know your exact case and dataset) you should focus on extracting information from your dataset, especially since you have 20k columns. I assume some of them will not give much variance or will be redundant, so you can make your dataset somewhat smaller and more robust to potential overfitting.
Also, you should try some dimensionality reduction methods, which will allow you to make your dataset smaller while retaining most of the variance.
PCA, for example (see the linked sample code gist and the PCA Wikipedia page); I did not mean to offend you if you already know these methods.
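If it helps, here is a minimal sketch of that idea using sklearn's IncrementalPCA, which processes the matrix in chunks so the full 500k x 20k array never has to be decomposed at once (the array below is just a small random stand-in):

import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 500))        # stand-in for the real, much larger matrix

# Reduce to 100 components, processing 1,000 rows at a time.
ipca = IncrementalPCA(n_components=100, batch_size=1_000)
X_reduced = ipca.fit_transform(X)

print(X_reduced.shape)                        # (10000, 100)
print(ipca.explained_variance_ratio_.sum())   # fraction of variance retained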

How does the model in sklearn handle large data sets in python?

I have a 10GB dataset to train a model on in sklearn, but my computer only has 8GB of memory, so do I have other ways to go besides an incremental classifier?
I think sklearn can be used for larger data if the technique is right. If your chosen algorithms support partial_fit or an online learning approach, then you're on track. The chunk size may influence your success.
This link may be useful: Working with big data in python and numpy, not enough ram, how to save partial results on the disc?
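As a rough illustration (not your exact setup), an out-of-core loop with partial_fit could look like this; the file name, label column, and chunk size are assumptions:

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()                     # supports partial_fit; fits a linear SVM by default
classes = np.array([0, 1])                # all class labels must be declared up front

# Stream the 10GB file in chunks that fit in memory.
for chunk in pd.read_csv("train.csv", chunksize=100_000):
    y = chunk.pop("label").values
    X = chunk.values
    clf.partial_fit(X, y, classes=classes)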
Another thing you can do is randomly pick whether or not to keep each row of your csv file... and save the result to a .npy file so it loads quicker. That way you get a sample of your data that will let you start playing with it with all the algorithms... and deal with the bigger-data issue along the way (or not at all! Sometimes a sample with a good approach is good enough, depending on what you want).
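A rough sketch of that sampling idea (file names and the keep fraction are assumptions):

import numpy as np
import pandas as pd

keep_fraction = 0.1
rng = np.random.default_rng(42)

# Stream the csv and randomly keep ~10% of the rows.
sampled = []
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    mask = rng.random(len(chunk)) < keep_fraction
    sampled.append(chunk[mask])

sample = pd.concat(sampled)
np.save("data_sample.npy", sample.to_numpy())   # reloads much faster with np.load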

scikit-learn SVM with a lot of samples / mini batch possible?

According to http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html I read:
"The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples."
I currently have 350,000 samples and 4,500 classes, and these numbers will grow further to 1-2 million samples and 10k+ classes.
My problem is that I am running out of memory. Everything works as it should when I use just 200,000 samples with fewer than 1,000 classes.
Is there a built-in way to use something like mini-batches with SVM, or another way to achieve it? I saw that MiniBatchKMeans exists, but I don't think it's for SVM?
Any input welcome!
I mentioned this problem in my answer to this question.
You can split your large dataset into batches that can be safely consumed by an SVM algorithm, then find support vectors for each batch separately, and then build a resulting SVM model on a dataset consisting of all the support vectors found in all the batches.
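A minimal sketch of that batch-then-merge idea on synthetic data (the batch count and kernel are assumptions, not tuned values):

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)

# Fit an SVM per batch and keep only each batch's support vectors.
sv_X, sv_y = [], []
for X_batch, y_batch in zip(np.array_split(X, 10), np.array_split(y, 10)):
    svc = SVC(kernel="rbf").fit(X_batch, y_batch)
    sv_X.append(X_batch[svc.support_])
    sv_y.append(y_batch[svc.support_])

# Train the final model on the pooled support vectors only.
final_model = SVC(kernel="rbf").fit(np.vstack(sv_X), np.hstack(sv_y))
print(final_model.score(X, y))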
Also, if there is no need for kernels in your case, you can use sklearn's SGDClassifier, which implements stochastic gradient descent. It fits a linear SVM by default.

Using multiple training files in libsvm

I am trying to train a binary classifier using libsvm.
My data quantity is very large, and I need to know whether there is any way I can divide the input data into different files and pass them to the train function.
So basically I know this:
svm-train train_file
I wonder if there's a way to do:
svm-train train_file1 train_file2 train_file3.....
Does anyone know any way to do this?
From the libsvm FAQ:
For large problems, please specify enough cache size (i.e., -m). You may train only a subset of the data. You can use the program subset.py in the directory "tools" to obtain a random subset.
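If you would rather not use subset.py, a rough Python equivalent (reservoir sampling, so the whole file never has to be loaded) might look like this; the file names and subset size are assumptions:

import random

subset_size = 100_000
random.seed(0)
reservoir = []

# Single pass over the libsvm-format file, keeping a uniform random subset.
with open("train.txt") as f:
    for i, line in enumerate(f):
        if i < subset_size:
            reservoir.append(line)
        else:
            j = random.randint(0, i)       # replace an earlier line with decreasing probability
            if j < subset_size:
                reservoir[j] = line

with open("train_subset.txt", "w") as out:
    out.writelines(reservoir)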

Comparing parallel k-means batch vs mini-batch speed

I am trying to cluster 250k vectors with 1000 dimensions using k-means. The machine that I am working on has 80 dual-core processors.
Just confirming: has anyone compared the run-time of the default parallel batch k-means against the mini-batch version? The example comparison page in the sklearn documentation doesn't provide much info, as the dataset is quite small.
I'd much appreciate your help.
Regards,
Conventional wisdom holds that Mini-Batch K-Means should be faster and more efficient for greater than 10,000 samples. Since you have 250,000 samples, you should probably use mini-batch if you don't want to test it out on your own.
Note that the example you referenced can very easily be changed to a 5000, 10,000 or 20,000 point example by changing n_samples in this line:
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)
I agree that this won't necessarily scale the same way for 1000-dimensional vectors, but since you are constructing the example yourself and are using either k-means or mini-batch k-means, and it only takes a second to switch between them... you should just do a scaling study with your 1000-dimensional vectors for 5k, 10k, 15k, and 20k samples.
Theoretically, there is no reason why Mini-Batch K-Means should underperform K-Means because of vector dimensionality, and we know that it does better for larger sample sizes, so I would go with mini-batch off the cuff, i.e. bias for action over research.
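A rough sketch of that scaling study (the cluster count and sample sizes are assumptions; scale them up toward your 250k vectors):

import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

for n_samples in (5_000, 10_000, 20_000):
    X, _ = make_blobs(n_samples=n_samples, n_features=1000, centers=50, random_state=0)
    for Model in (KMeans, MiniBatchKMeans):
        start = time.time()
        Model(n_clusters=50, random_state=0).fit(X)       # same k for both variants
        print(f"{Model.__name__:16s} n={n_samples:6d}  {time.time() - start:.1f}s")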
