Using multiple training files in libsvm - svm

I am trying to train a binary classifier using libsvm.
My data set is very large, and I would like to know whether I can split the input data across several files and pass them all to the training function.
So basically I know this:
svm-train train_file
I wonder if there is a way to do:
svm-train train_file1 train_file2 train_file3 ...
Does anyone know a way to do this?

From the libsvm FAQ:
For large problems, please specify enough cache size (i.e., -m). You may train only a subset of the data. You can use the program subset.py in the directory "tools" to obtain a random subset.
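svm-train itself only accepts a single training file, so the usual workaround is to concatenate your files (the libsvm format is line-based) and, if the result is still too large, draw a random subset with tools/subset.py. A minimal sketch of the concatenation step, with placeholder file names:

# Concatenate several LIBSVM-format training files into one.
# The format is line-based ("label index:value ..."), so files can simply
# be appended; the file names below are placeholders.
input_files = ["train_file1", "train_file2", "train_file3"]

with open("train_combined", "w") as out:
    for path in input_files:
        with open(path) as f:
            for line in f:
                out.write(line)

The combined file can then be given to svm-train as usual, or passed to tools/subset.py to sample a manageable subset (check the script's usage message for its exact arguments).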

Related

BERT for relation extraction

I am working with BERT for relation extraction from a binary-classification TSV file. It is my first time using BERT, so there are some points I need to understand better.
How can I get an output such that, given some test data, it shows the classification results, i.e. whether each example is classified correctly or not?
How does BERT extract features from the sentences, and is there a way to know which features are chosen?
I trained once with additional hidden layers and once without them, and the accuracy without the extra hidden layer was higher than with it. Is there a reason for that?
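On the first point only, a rough sketch: run the fine-tuned classifier over the test TSV and compare each prediction with the gold label. This assumes a model fine-tuned with the Hugging Face transformers library; the model directory and the two-column (sentence, label) TSV layout are placeholders, not details from the question.

# Rough sketch: evaluate a fine-tuned BERT sequence classifier on a test TSV.
# Assumes the Hugging Face "transformers" library; the model directory and
# the two-column (sentence, label) TSV layout are placeholders.
import csv
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("path/to/fine-tuned-model")
model = AutoModelForSequenceClassification.from_pretrained("path/to/fine-tuned-model")
model.eval()

with open("test.tsv") as f:
    rows = list(csv.reader(f, delimiter="\t"))

for sentence, label in rows:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    pred = int(torch.argmax(logits, dim=-1))
    print(sentence, "gold:", label, "pred:", pred, "correct:", pred == int(label))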

In the face recognition training part, a trainner.yml file is generated. Is it useful for finding the confusion matrix?

I was working on face recognition code and wanted to compute the performance metrics (accuracy, precision, etc.) of my algorithm. I used the Haar cascade detector and the LBPH face recognizer. When I searched online, I could only find sources where existing datasets are used and the metrics are computed on them. I want to use the data obtained from training my model (trained from the images folder). A file named "trainner.yml" is generated automatically after running the script.
Is the data in the trainner.yml file my dataset? What is my dataset now, and how can I compute the confusion matrix?
Thanks
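trainner.yml stores the trained LBPH model (its computed histograms), not your dataset; the dataset is still the images folder you trained from. To get a confusion matrix you need to hold out labeled test images, run the recognizer on them, and compare predictions with the true labels. A hedged sketch, assuming OpenCV's contrib module and scikit-learn, with a placeholder test/<label>/ folder layout:

# Rough sketch: confusion matrix for an LBPH face recognizer.
# Assumes OpenCV's contrib module (cv2.face) and scikit-learn; the
# test/<label>/<image>.jpg folder layout is a placeholder.
import os
import cv2
from sklearn.metrics import accuracy_score, confusion_matrix

recognizer = cv2.face.LBPHFaceRecognizer_create()
recognizer.read("trainner.yml")  # the trained model, not the dataset

y_true, y_pred = [], []
for label in os.listdir("test"):
    for name in os.listdir(os.path.join("test", label)):
        # Images should ideally be face crops produced by the same Haar
        # cascade detector used when building the training set.
        img = cv2.imread(os.path.join("test", label, name), cv2.IMREAD_GRAYSCALE)
        pred_label, confidence = recognizer.predict(img)
        y_true.append(int(label))
        y_pred.append(pred_label)

print(confusion_matrix(y_true, y_pred))
print("accuracy:", accuracy_score(y_true, y_pred))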

If Keras is forcing me to use a large batch size for prediction, can I simply fill in a bunch of fake values and only look at the predictions I need

...or is there a way to circumvent this?
With stateful LSTMs, I have to define a batch size, and Keras forces me to use the same batch size for prediction as for training; however, my modeling problem depends a lot on larger batch sizes to get good performance.
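One workaround along the lines suggested in the title is to pad the prediction inputs up to the training batch size with dummy rows and discard the corresponding outputs (another common approach is to rebuild an identical model with batch_size=1 and copy the trained weights into it). A hedged sketch of the padding idea; model, batch_size, and x_real are placeholders:

# Rough sketch: pad a small prediction batch up to the stateful model's
# fixed batch size and keep only the rows that are real.
# `model`, `batch_size`, and `x_real` are placeholders.
import numpy as np

def predict_padded(model, x_real, batch_size):
    n_real = x_real.shape[0]
    pad_rows = batch_size - n_real
    # Any dummy values work here; their predictions are discarded below.
    x_padded = np.concatenate(
        [x_real, np.zeros((pad_rows,) + x_real.shape[1:], dtype=x_real.dtype)],
        axis=0,
    )
    preds = model.predict(x_padded, batch_size=batch_size)
    return preds[:n_real]  # drop the predictions for the dummy rows

Because the batch elements of a stateful LSTM keep independent states, the dummy slots do not contaminate the real sequences, but the recurrent states still need to be reset wherever the training procedure expects it.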

LIBSVM with large data samples

I am currently looking to use libsvm (or an alternative, if one is suggested; OpenCV also looks like a viable option) to train an SVM. My training data sets are rather large: around 50 binary 128 MB files. It appears that to use libsvm I must convert the data into the proper format; however, I was wondering whether it is possible to train on the raw binary data itself? Thanks in advance.
No, you cannot use your raw binary (image) data for training or for testing.
In order to use libsvm you have to convert your binary data files into this format.
See this stackoverflow post for the details of the libsvm data-format.
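As a sketch of the conversion step: once the raw binary files are decoded into feature vectors (how to do that depends on their layout, so X and y below are random placeholders), scikit-learn's dump_svmlight_file writes them in libsvm's sparse text format:

# Rough sketch: write feature vectors and labels in LIBSVM's text format.
# Decoding the raw 128 MB binary files into vectors depends on their layout,
# so X and y here are random placeholders.
import numpy as np
from sklearn.datasets import dump_svmlight_file

X = np.random.rand(100, 64)             # placeholder feature matrix
y = np.random.randint(0, 2, size=100)   # placeholder binary labels

# Produces lines of the form "label index:value index:value ..."
dump_svmlight_file(X, y, "train.libsvm", zero_based=False)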

Best practice for training on large scale datasets like ImageNet using Theano/Lasagne?

I found that all of the Theano/Lasagne examples deal with small data sets like MNIST and CIFAR-10, which can be loaded into memory completely.
My question is how to write efficient code for training on large-scale datasets.
Specifically, what is the best way to prepare mini-batches (including real-time data augmentation) in order to keep the GPU busy?
Maybe something like Caffe's ImageDataLayer?
For example, I have a big txt file which contains all the image paths and labels.
Some example code would be appreciated.
Thank you very much!
In case the data doesn't fit into memory, a good approach is to prepare the minibatches and store them in an HDF5 file, which is then used at training time.
However, this does not suffice when doing data augmentation, as that is done on the fly. Because of Python's global interpreter lock, images cannot be loaded and preprocessed while the GPU is busy.
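A minimal sketch of the HDF5 approach with h5py; the shapes, dtypes, and file names are placeholders:

# Rough sketch of the HDF5 approach with h5py: store the preprocessed data
# once, then read one minibatch (slice) at a time during training.
# Shapes, dtypes, and file names are placeholders.
import h5py
import numpy as np

# one-off preparation
with h5py.File("train.h5", "w") as f:
    X = f.create_dataset("X", shape=(1000, 3, 64, 64), dtype="float32")
    y = f.create_dataset("y", shape=(1000,), dtype="int32")
    # fill X and y in chunks here, e.g. from the image paths in the txt file

# at training time
batch_size = 128
with h5py.File("train.h5", "r") as f:
    n = f["y"].shape[0]
    for start in range(0, n, batch_size):
        xb = f["X"][start:start + batch_size]  # only this slice is loaded
        yb = f["y"][start:start + batch_size]
        # train_fn(xb, yb)  # placeholder for the Theano/Lasagne update call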
The best way around this, that I know of, is the Fuel library.
Fuel loads and preprocesses the minibatches in a different python process and then streams them to the training process over a TCP socket:
http://fuel.readthedocs.org/en/latest/server.html#data-processing-server
It additionally provides some functions to preprocess the data, such as scaling and mean subtraction:
http://fuel.readthedocs.org/en/latest/overview.html#transformers-apply-some-transformation-on-the-fly
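If you would rather not add a dependency, the same idea can be sketched with Python's multiprocessing module (this is not Fuel's API; load_fn and image_list are placeholders): a worker process loads and augments minibatches and pushes them through a queue, so preprocessing overlaps with the GPU work in the main process.

# Rough sketch of the same idea without Fuel: a worker process loads and
# augments minibatches and pushes them through a queue, so preprocessing
# overlaps with the GPU work in the main process.
# `load_fn` (your loading/augmentation function) and `image_list` are placeholders.
from multiprocessing import Process, Queue

def producer(queue, image_list, batch_size, load_fn):
    for start in range(0, len(image_list), batch_size):
        queue.put(load_fn(image_list[start:start + batch_size]))
    queue.put(None)  # signal the end of the epoch

def minibatches(image_list, batch_size, load_fn, max_prefetch=4):
    queue = Queue(maxsize=max_prefetch)
    worker = Process(target=producer, args=(queue, image_list, batch_size, load_fn))
    worker.start()
    while True:
        batch = queue.get()
        if batch is None:
            break
        yield batch
    worker.join()

# usage (placeholder names):
# for xb, yb in minibatches(image_list, 128, load_and_augment):
#     train_fn(xb, yb)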
Hope this helps.
Michael
