Convert text data into LibSVM file format for spam classification? - apache-spark

How can I convert text data into LibSVM file format to train a model for spam classification?
Are SVM files already labeled?

The LibSVM format is neither required nor particularly useful. It is used in the Apache Spark ML examples only because it can be mapped directly to the required format.
Are SVM files already labeled?
Not necessarily, but Spark can read only the labeled variant.
In practice you should use org.apache.spark.ml.feature tools to extract relevant features from your data.
You can follow the documentation as well as a number of questions on SO.
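For illustration, here is a minimal PySpark sketch of such a feature-extraction pipeline; the input file, column names, and the choice of Naive Bayes are assumptions for the example, not part of this answer.

# A rough sketch, assuming a CSV with a numeric "label" column (1.0 = spam,
# 0.0 = ham) and a "text" column; names and classifier choice are illustrative.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import NaiveBayes

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("messages.csv", header=True, inferSchema=True)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),         # split raw text
    HashingTF(inputCol="words", outputCol="tf"),           # hashed term counts
    IDF(inputCol="tf", outputCol="features"),              # reweight by IDF
    NaiveBayes(labelCol="label", featuresCol="features"),  # spam classifier
])
model = pipeline.fit(df)  # no LibSVM file is needed at any point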

Related

Implementing Computer Vision on AutoML to classify dementia using MRI images in .NII file format

I am using the .NII file format, which represents a neuroimaging dataset. I need to use AutoML to label the dataset of images, which is nearly 2 GB in size per patient. The main issue is using AutoML to label a dataset of images with the .NII file extension and classify whether the patient has dementia or not.
Requirement: setting aside the specific problem domain (dementia), I would like to know the procedure for using AutoML for computer vision applications through ML Studio with a dataset of .NII-format images.
Any help would be appreciated.
Using .nii or other specialized file formats in Azure AutoML is challenging. Unfortunately, AutoML accepts image input in JSON format only; kindly check the documentation.
Regarding the requirement for the .nii dataset format, there are file format converters available, such as "Medical Image Convertor". This software is commercial and can be used free for 10 days. Convert the .nii files into JPG and proceed with the general documentation referenced at the start of this answer.
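If you prefer to do the conversion programmatically instead of with commercial software, a rough Python sketch along these lines is possible, assuming the nibabel and Pillow packages; the file names and the slicing axis are illustrative.

import nibabel as nib
import numpy as np
from PIL import Image

vol = nib.load("patient01.nii").get_fdata()  # 3-D intensity volume
for k in range(vol.shape[2]):                # export each axial slice
    sl = vol[:, :, k]
    rng = sl.max() - sl.min()
    # normalize to 0-255 so the slice can be saved as an 8-bit JPG
    sl8 = (255 * (sl - sl.min()) / (rng if rng else 1)).astype(np.uint8)
    Image.fromarray(sl8).save(f"patient01_slice{k:03d}.jpg")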

LIBSVM with large data samples

I am currently looking to use libsvm (or an alternative if one is suggested; OpenCV also looks like a viable option) to train an SVM. My training data sets are rather large: around 50 binary 128 MB files. It appears that to use libsvm I must convert the data to its expected format; however, I was wondering if it is possible to train on the raw binary data itself? Thanks in advance.
No, you cannot use your raw binary (image) data for training or for testing.
In order to use libsvm you have to convert your binary data files into the libsvm data format.
See this Stack Overflow post for the details of the libsvm data format.
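For reference, the conversion itself is mechanical; here is a rough Python sketch that writes dense feature vectors in the libsvm format (the sample data is made up).

# Hypothetical samples: (label, dense feature vector) pairs.
samples = [(1, [0.5, 0.0, 0.25]),
           (-1, [0.1, 0.2, 0.0])]

with open("train.libsvm", "w") as f:
    for label, features in samples:
        # libsvm uses 1-based feature indices; zero-valued features may be omitted
        pairs = " ".join(f"{i + 1}:{v}" for i, v in enumerate(features) if v != 0)
        f.write(f"{label} {pairs}\n")
# train.libsvm now contains lines such as: 1 1:0.5 3:0.25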

Using topic modeling Java toolkit

I'm working on text classification and I want to use Topic models (LDA).
My corpus consists of at least 24,000 Persian news documents. Each document in the corpus is a set of (keyword, weight) pairs extracted from the news.
I have looked at two Java toolkits: MALLET and LingPipe.
I've read the MALLET tutorial on importing data, and it takes data as plain text, not the format that I have. Is there any way I could convert it?
I also read a little about LingPipe; the tutorial example used arrays of integers. Is that convenient for large data?
I need to know which implementation of LDA is better for me. Is there any other implementation (in Java) that suits my data?
From the keyword-weight file you can create an artificial text containing each keyword repeated in proportion to its weight, in random order. Run MALLET on the generated texts to retrieve the topics.
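A rough Python sketch of that idea, assuming the weights are roughly proportional to term frequency; the scale factor and sample data are made up.

import random

doc = [("economy", 0.5), ("election", 0.3), ("tehran", 0.2)]  # hypothetical pairs
scale = 100  # pseudo-tokens to emit per document (an arbitrary assumption)

tokens = []
for word, weight in doc:
    tokens.extend([word] * max(1, round(weight * scale)))
random.shuffle(tokens)
print(" ".join(tokens))  # one plain-text line per document, ready for MALLET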

Text classification using Naive Bayes

Do guide me along if I am not posting in the right section.
I have some text files for my training data; they are unformatted Word documents and contain only ASCII characters.
I would like to train a model on the text files using data mining methods.
The text files contain about 300 words each on average.
Is there any software recommended for getting started?
My initial idea is to use all the words in one portion of the files as training data and the remaining files as test data, in order to perform cross-fold validation.
However, tools such as Weka do not seem to satisfy my needs, since converting to CSV files is not feasible in my case because the text files are separate.
I am trying to perform cross-validation in such a way that all the words in the training data are treated as features.
You need to use Weka's StringToWordVector filter and convert your text files to ARFF files. After that you can use Weka's classification algorithms. An introductory Weka video tutorial can help with the basics.
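If the obstacle is getting the separate text files into a single ARFF file, a rough Python sketch like the following can do it; the spam/ and ham/ directory layout and all names are assumptions, and Weka's own TextDirectoryLoader offers similar functionality.

import os

def texts_to_arff(root, out_path):
    # one string attribute holding the document text, plus a nominal class label
    with open(out_path, "w") as out:
        out.write("@relation spam_corpus\n")
        out.write("@attribute text string\n")
        out.write("@attribute class {spam,ham}\n")
        out.write("@data\n")
        for label in ("spam", "ham"):
            folder = os.path.join(root, label)
            for name in sorted(os.listdir(folder)):
                with open(os.path.join(folder, name)) as f:
                    # flatten newlines and escape quotes for the ARFF string value
                    text = f.read().replace("\n", " ").replace("'", r"\'")
                out.write(f"'{text}',{label}\n")

texts_to_arff("corpus", "corpus.arff")  # then apply StringToWordVector in Weka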

Convert one-document-per-line to Blei's lda-c/dtm format for topic modeling?

I am doing Latent Dirichlet analyses for some research and keep running into a problem. Most LDA software requires documents to be in doclines format, meaning a CSV or other delimited file in which each line represents the entirety of a document. However, Blei's lda-c and dynamic topic model software requires the data to be in the format:
[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]
where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document. Note that [term_1] is an integer which indexes the term; it is not a string.
Does anyone know of a utility that will let me quickly convert to this format? Thank you.
If you are working with R, the lda package contains a function lexicalize that will convert raw text into the lda-c format necessary for the lda package.
example <- c("I am the very model of a modern major general",
"I have a major headache")
corpus <- lexicalize(example, lower=TRUE)
Similarly, the topicmodels package has a function dtm2ldaformat that will convert a document term matrix to the lda format. You can convert a plain text document into a document term matrix using the tm package, also in R.
So with these existing functions there's a lot of flexibility in getting text into R for topic modelling.
The Mallet package from the University of Massachusetts Amherst is another option.
http://mallet.cs.umass.edu/
http://mallet.cs.umass.edu/topics.php
And here is an excellent step-by-step demo on how to use Mallet:
http://programminghistorian.org/lessons/topic-modeling-and-mallet
You can use Mallet with plain text files as the input source.
Gensim offers an implementation of Blei's corpus format (the BleiCorpus class). You could build a quick corpus from your CSV file in Python and then save it in lda-c format with gensim. It should not be too hard.
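A rough sketch of that approach, assuming one whitespace-tokenized document per line in docs.txt (the file name is made up):

from gensim import corpora

with open("docs.txt") as f:
    texts = [line.split() for line in f]

dictionary = corpora.Dictionary(texts)        # term -> integer id
bow = [dictionary.doc2bow(t) for t in texts]  # (id, count) pairs per document
# writes corpus.lda-c in Blei's format, plus a matching vocabulary file
corpora.BleiCorpus.serialize("corpus.lda-c", bow, id2word=dictionary)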
For Python, the lda package provides a function for this (it may not have been available at the time of the question): lda.utils.dtm2ldac. The documentation is at https://pythonhosted.org/lda/api.html#module-lda.utils
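A rough usage sketch, with a made-up document-term count matrix:

import numpy as np
import lda.utils

# rows are documents, columns are terms; values are raw counts
dtm = np.array([[2, 0, 1],
                [0, 3, 1]])
for line in lda.utils.dtm2ldac(dtm):
    print(line)  # one lda-c line per document, e.g. "2 0:2 2:1"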
