LIBSVM with large data samples - svm

I am currently looking to use libsvm (or an alternative, if one is suggested; opencv also looks like a viable option) to train an SVM. My training data sets are rather large: around 50 binary 128MB files. It appears that to use libsvm I must convert the data to its proper format; however, I was wondering whether it is possible to do the training on the raw binary data itself? Thanks in advance.

No, you cannot use your raw binary (image) data for training or for testing.
To use libsvm you have to convert your binary data files into the libsvm data format.
See this stackoverflow post for the details of the libsvm data format.
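The libsvm format is one example per line: a label followed by space-separated index:value pairs with 1-based, increasing indices. A minimal conversion sketch, assuming each binary file holds fixed-length records of float32 features followed by a float32 label (that record layout is an assumption, not something stated in the question):

```python
# Minimal sketch: convert fixed-length binary records to libsvm text format.
# Assumed (hypothetical) layout: N float32 features followed by one float32
# label per record; adjust N_FEATURES and the dtype to your actual files.
import numpy as np

N_FEATURES = 128  # assumption about the record layout

def binary_to_libsvm(bin_path, out_path, n_features=N_FEATURES):
    records = np.fromfile(bin_path, dtype=np.float32).reshape(-1, n_features + 1)
    with open(out_path, "w") as out:
        for row in records:
            features, label = row[:-1], row[-1]
            # libsvm uses 1-based indices; zero-valued features may be omitted
            pairs = " ".join(f"{i + 1}:{v}" for i, v in enumerate(features) if v != 0)
            out.write(f"{int(label)} {pairs}\n")

# Example: binary_to_libsvm("train_part01.bin", "train_part01.libsvm")
```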

Related

Implementing Computer Vison on AutoML to classify dementia using MRI images in .NII file format

I am using the .NII file format, which represents a neuroimaging dataset. I need to use AutoML to label the dataset of images, which is nearly 2GB in size per patient. The main issue is using AutoML to label the dataset of images with the .NII file extension and classify whether the patient has dementia or not.
Requirement: Forget about the problem domain of the implementation, like dementia. I would like to know the procedure for using AutoML for computer vision applications through ML Studio with a .NII file format dataset.
Any help would be appreciated.
Using .nii or other file formats in Azure AutoML is a challenging task. Unfortunately, the AutoML image input is accepted only in JSON format. Kindly check the document.
Regarding the requirement of the .nii dataset format, there are various file format converters available, such as "Medical Image Convertor". This software is commercial and can be used free for 10 days. Convert the .nii files into JPG and proceed with the general documentation referenced at the top of the answer.
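For the conversion step itself, here is a minimal sketch using the nibabel and Pillow packages (my suggestion, not part of the original answer; the commercial converter mentioned above works too). It saves each axial slice of a .nii volume as a JPG; the file names are placeholders:

```python
# Minimal sketch: export the axial slices of a .nii volume as JPG images.
# Assumes the nibabel and Pillow packages; file names are hypothetical.
import numpy as np
import nibabel as nib
from PIL import Image

volume = nib.load("patient_001.nii").get_fdata()   # 3-D array (x, y, z)

for z in range(volume.shape[2]):
    slice_2d = volume[:, :, z]
    # normalize to 0-255 so each slice can be stored as an 8-bit image
    lo, hi = slice_2d.min(), slice_2d.max()
    scaled = np.zeros_like(slice_2d) if hi == lo else (slice_2d - lo) / (hi - lo) * 255
    Image.fromarray(scaled.astype(np.uint8)).save(f"patient_001_slice_{z:03d}.jpg")
```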

How does the model in sklearn handle large data sets in python?

I have a 10GB dataset to train a model on in sklearn, but my computer only has 8GB of memory. Do I have any options other than an incremental classifier?
I think sklearn can be used for larger data if the technique is right. If your chosen algorithms support partial_fit or an online learning approach then you're on track (a minimal sketch follows below). The chunk size may influence your success.
This link may be useful: Working with big data in python and numpy, not enough ram, how to save partial results on the disc?
Another thing you can do is randomly pick whether or not to keep each row in your csv file, and save the result to a .npy file so it loads quicker. That way you get a sampling of your data that allows you to start playing with it with all algorithms, and to deal with the bigger-data issue along the way (or not at all! Sometimes a sample with a good approach is good enough, depending on what you want).
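A minimal sketch of the partial_fit approach mentioned above, assuming the 10GB dataset is a CSV whose last column is the label (the file name and column layout are hypothetical):

```python
# Minimal sketch: out-of-core training with scikit-learn's partial_fit.
# Assumes a CSV whose last column is the label; adapt to your data.
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()          # linear model trained with SGD; supports partial_fit
classes = np.array([0, 1])     # all classes must be given on the first call

for chunk in pd.read_csv("big_data.csv", chunksize=100_000):
    X = chunk.iloc[:, :-1].to_numpy()
    y = chunk.iloc[:, -1].to_numpy()
    clf.partial_fit(X, y, classes=classes)
```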

Convert Text Data into SVMFile format for Spam Classification?

How can I convert text data into the LibSVM file format for training a spam classification model?
Are SVM files already labeled?
The SVM format is neither required nor that useful. It is used in the Apache Spark ML examples only because it can be mapped directly to the required format.
Are SVM files already labeled?
Not necessarily, but Spark can only read the labeled variant.
In practice you should use the org.apache.spark.ml.feature tools to extract relevant features from your data.
You can follow the documentation as well as a number of questions on SO.
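A minimal PySpark sketch of that feature-extraction step (the column names and the tiny in-memory dataset are placeholders of mine, not from the answer):

```python
# Minimal sketch: turn raw text into labeled feature vectors with
# pyspark.ml.feature, ready for any Spark ML classifier.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.getOrCreate()

# Placeholder data: (label, text) pairs, 1.0 = spam, 0.0 = ham
data = spark.createDataFrame(
    [(1.0, "win a free prize now"), (0.0, "see you at the meeting tomorrow")],
    ["label", "text"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1 << 18),
    IDF(inputCol="rawFeatures", outputCol="features"),
])

features = pipeline.fit(data).transform(data).select("label", "features")
features.show(truncate=False)
```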

Using multiple training files in libsvm

I am trying to train a binary classifier using libsvm.
My data quantity is very large and I need to know whether there is any way I can divide the input data into different files and pass them to the train function.
So basically I know this:
svm-train train_file
I wonder if there's a way to do:
svm-train train_file1 train_file2 train_file3 ...
Does anyone know a way to do this?
From the libsvm FAQ:
For large problems, please specify enough cache size (i.e., -m). You may train only a subset of the data. You can use the program subset.py in the directory "tools" to obtain a random subset.
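svm-train accepts a single training file, so one workaround is simply to concatenate the parts (they are plain libsvm-format text) before training, or to sample them first with tools/subset.py as the FAQ suggests. A minimal sketch of the concatenation route; the file names are placeholders and svm-train is assumed to be on the PATH:

```python
# Minimal sketch: merge several libsvm-format files into one and train on it.
import subprocess

parts = ["train_file1", "train_file2", "train_file3"]

with open("train_all", "w") as merged:
    for part in parts:
        with open(part) as f:
            merged.write(f.read())

# Use a larger kernel cache (-m, in MB) for large problems, as the FAQ advises.
subprocess.run(["svm-train", "-m", "1000", "train_all", "model_file"], check=True)
```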

Text Classification using Naive bayes

Do guide me along if I am not posting in the right section.
I have some text files for my training data which are unformatted word documents. They all contain ASCII characters only.
I would like to train a model on the text files using data mining methods.
The text files have about 300 words each on average.
Is there any software that is recommended for me to start with?
My initial idea is to use all the words in one of the files as training data and the remaining files as test data, in order to perform cross-fold validation.
However, while I have tools such as Weka, it does not seem to satisfy my needs, as converting to CSV files does not seem feasible in my case because the text is spread across separate files.
I am trying to perform cross-validation in such a way that all the words in the training data are considered as features.
You need to use Weka's StringToWordVector filter to convert your text files to ARFF files. After that you can use Weka's classification algorithms. Watch the following video to learn the basics.
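If Weka turns out to be awkward for many separate plain-text files, the same pipeline (bag-of-words features per file, Naive Bayes, cross-validation) can also be sketched in Python with scikit-learn; this is an alternative of mine, not the answer's Weka workflow, and the directory layout with one subfolder per class is an assumption:

```python
# Minimal sketch: bag-of-words Naive Bayes with cross-validation, as an
# alternative to Weka's StringToWordVector + classifier workflow.
# Assumes a directory layout of corpus/<class_name>/<file>.txt (hypothetical).
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

docs = load_files("corpus", encoding="ascii")   # one subfolder per class label

model = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(model, docs.data, docs.target, cv=5)
print("mean accuracy:", scores.mean())
```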
