I am trying to classify different concepts in a text using n-grams. My data typically consists of six columns:
1) The word that needs classification
2) The classification
3) The first word to the left of 1)
4) The second word to the left of 1)
5) The first word to the right of 1)
6) The second word to the right of 1)
When I try to use an SVM in RapidMiner, I get the error that it cannot handle polynominal values. I know that this can be done, because I have read it in several papers. I set the second column to 'label' and have tried setting the rest to 'text' or 'real', but it seems to have no effect. What am I doing wrong?
You have to use the Support Vector Machine (LibSVM) Operator.
In contrast to the classic SVM, which only supports two-class problems, the LibSVM implementation (http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf) supports multi-class classification as well as regression.
One approach could be to create attributes whose names are the words themselves and whose values are the distance from the word of interest. Of course, every possible word would need to be represented as an attribute, so the input data would be large.
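A minimal sketch of that idea (the helper name and example sentence are mine, not from the thread):

    def context_features(tokens, index, window=2):
        """Map each word within `window` positions of tokens[index]
        to its signed distance from the word of interest."""
        features = {}
        for offset in range(-window, window + 1):
            pos = index + offset
            if offset != 0 and 0 <= pos < len(tokens):
                features[tokens[pos]] = offset
        return features

    tokens = "the quick brown fox jumps".split()
    print(context_features(tokens, 2))
    # {'the': -2, 'quick': -1, 'fox': 1, 'jumps': 2}

In practice you would expand these sparse feature dictionaries over the full vocabulary, e.g. with scikit-learn's DictVectorizer.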
Related
I am trying to build an RNN model for text classification and I am currently building my dataset.
I am trying to automate some of the work, using an API that returns some information for each text I send to it.
So basically:
For each text in my dataframe, I have a df['label'] entry that contains a one- to three-word string.
I have a vocabulary list (my future classes), and for each df['label'] item I want to assign one of the vocabulary list items, depending on which is closest in meaning.
So I need to measure how close in meaning each of the labels is to each item in my vocabulary list.
Any help?
I'm trying to create an RNN that would predict the next word, given the previous word. But I'm struggling with modeling this into a dataset, specifically, how to indicate the next word to be predicted as a 'label'.
I could use a one-hot encoded vector for each word in the vocabulary, but a) it'll have tens of thousands of dimensions, given the large vocabulary, and b) I'll lose all the other information contained in the word vector. Perhaps that information would be useful in calculating the error, i.e. how far off the predictions were from the actual word.
What should I do? Should I just use the one hot encoded vector?
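To make the dimensionality concern concrete: a one-hot label for a vocabulary of size V is a length-V vector with a single 1. A toy illustration (the vocabulary here is mine):

    import numpy as np

    vocab = ["the", "cat", "sat"]  # real vocabularies run to tens of thousands
    word_to_index = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        """Length-len(vocab) vector with a 1 at the word's index."""
        vec = np.zeros(len(vocab))
        vec[word_to_index[word]] = 1.0
        return vec

    print(one_hot("cat"))  # [0. 1. 0.]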
I am working with certain programs in Python 3.4. I want to use the WAG matrix for phylogeny inference, but I am confused about the formula it implements.
For example, in a phylogenetics study, when a sequence file is used to generate a distance-based matrix, a formula called "p-distance" is applied, and on the basis of this formula and some standard values for the sequence data, a matrix is generated that is later used to construct a tree. When a character-based method of tree construction is used, "WAG" is one of the matrices used for likelihood tree construction. What I want to ask is: if one wants to implement this matrix, what is its formula basis?
I want to write code for this implementation. But first I need to understand the logic used by the WAG matrix.
I have an aligned protein sequence file and I need to generate a "WAG" matrix from it. The thing is that I have been studying the literature on the WAG matrix, but I could not work out how it performs its calculation. Does it have a specific formula? (For example, "p-distance" is a formula used by distance matrices.) I want to give an aligned protein sequence file as input and have a matrix generated as output.
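For what it's worth, the p-distance mentioned above has a very simple basis: it is the proportion of aligned sites at which two sequences differ. A minimal sketch (the function name and example sequences are my own illustration; the WAG matrix itself was estimated by maximum likelihood from a large database of alignments and has no comparably simple closed formula):

    def p_distance(seq_a, seq_b):
        """Proportion of aligned positions at which two sequences differ.

        Assumes the sequences are already aligned to equal length;
        gap handling is omitted for brevity.
        """
        if len(seq_a) != len(seq_b):
            raise ValueError("sequences must be aligned to the same length")
        differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
        return differences / len(seq_a)

    # Two short, made-up aligned protein fragments: 1 mismatch over 5 sites
    print(p_distance("MKTAY", "MKSAY"))  # 0.2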
I have 45000 text records in my dataframe. I want to convert those 45000 records into word vectors so that I can train a classifier on them. I am not tokenizing the sentences; I just split each entry into a list of words.
After training a word2vec model with 300 features, the model's vocabulary contains only 26000 entries. How can I preserve all of my 45000 records?
In the classifier model, I need all 45000 records so that they can match the 45000 output labels.
If you are splitting each entry into a list of words, that's essentially 'tokenization'.
Word2Vec just learns vectors for each word, not for each text example ('record'), so there's nothing to 'preserve': no vectors for the 45,000 records are ever created. But if there are 26,000 unique words among the records (after applying min_count), you will have 26,000 word vectors at the end.
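A minimal sketch of what is happening (gensim 4.x API; the placeholder data is mine):

    from gensim.models import Word2Vec

    # texts: each record already split into a list of words
    texts = [["fast", "red", "car"], ["slow", "red", "truck"]]  # placeholders

    model = Word2Vec(sentences=texts, vector_size=300, min_count=1)

    # One vector per *unique* surviving word, not per record
    print(len(model.wv))          # number of unique words: 5 here, 26,000 in your case
    print(model.wv["red"].shape)  # (300,)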
Gensim's Doc2Vec (the 'Paragraph Vector' algorithm) can create a vector for each text example, so you may want to try that.
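A minimal Doc2Vec sketch (again gensim 4.x; tagging each record with its index is my own choice for illustration):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    texts = [["fast", "red", "car"], ["slow", "red", "truck"]]  # placeholders

    # Tag each record with its index so its vector can be looked up later
    docs = [TaggedDocument(words, [i]) for i, words in enumerate(texts)]

    model = Doc2Vec(documents=docs, vector_size=300, min_count=1, epochs=20)

    # This time there is one vector per record
    print(model.dv[0].shape)  # (300,)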
If you only have word vectors, one simplistic way to create a vector for a larger text is to add all the individual word vectors together. Further options include choosing between the unit-normed word vectors and the raw word vectors of varying magnitudes; whether to unit-norm the resulting sum; and whether to weight the words by some other importance factor (such as TF-IDF).
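A sketch of that summation/averaging idea, assuming the `model` and `texts` from the Word2Vec sketch above (out-of-vocabulary words are simply skipped):

    import numpy as np

    def text_vector(words, wv):
        """Average the vectors of all in-vocabulary words in one text."""
        vecs = [wv[w] for w in words if w in wv]
        if not vecs:
            return np.zeros(wv.vector_size)
        return np.mean(vecs, axis=0)

    # One 300-dimensional vector per record, ready for a classifier
    record_vectors = np.array([text_vector(words, model.wv) for words in texts])
    print(record_vectors.shape)  # (number_of_records, 300)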
Note that unless your documents are very long, this is quite a small training set for either Word2Vec or Doc2Vec.
What is a general guideline for handling missing categorical feature values when using a Random Forest regressor (or any ensemble learner, for that matter)? I know that scikit-learn has imputation functionality (such as a 'mean' strategy, or proximity-based filling) for missing numerical values. But how does one handle a missing categorical value, like industry (oil, computer, auto, None) or major (bachelors, masters, doctoral, None)?
Any suggestions are appreciated.
Breiman and Cutler, the inventors of Random Forest, suggest two possible strategies (see http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1):
Random forests has two ways of replacing missing values. The first way is fast. If the mth variable is not categorical, the method computes the median of all values of this variable in class j, then it uses this value to replace all missing values of the mth variable in class j. If the mth variable is categorical, the replacement is the most frequent non-missing value in class j. These replacement values are called fills.

The second way of replacing missing values is computationally more expensive but has given better performance than the first, even with large amounts of missing data. It replaces missing values only in the training set. It begins by doing a rough and inaccurate filling in of the missing values. Then it does a forest run and computes proximities.
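A minimal pandas sketch of the first ('fill') strategy; the DataFrame and column names are my own toy example:

    import pandas as pd

    df = pd.DataFrame({
        "label":    ["a", "a", "b", "b", "b"],
        "age":      [10.0, None, 30.0, None, 50.0],       # numeric feature
        "industry": ["oil", None, "auto", "auto", None],  # categorical feature
    })

    # Per-class median fill for the numeric column ...
    df["age"] = df.groupby("label")["age"].transform(
        lambda s: s.fillna(s.median()))
    # ... and per-class most-frequent-value fill for the categorical column
    df["industry"] = df.groupby("label")["industry"].transform(
        lambda s: s.fillna(s.mode().iloc[0]))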
Alternatively, leaving your label variable aside for a minute, you could train a classifier on the rows that have non-null values for the categorical variable in question, using all of your other features. Then use this classifier to predict values for that categorical variable in the rows where it is missing (your 'test set' for this sub-problem). Armed with a more complete data set, you can return to the task of predicting your original label variable.
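A sketch of that idea with scikit-learn (toy data; assumes the remaining features are already numeric):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.DataFrame({
        "age":      [25.0, 30.0, 45.0, 50.0, 28.0],
        "industry": ["oil", "oil", "auto", "auto", None],
    })

    known = df[df["industry"].notna()]
    unknown = df[df["industry"].isna()]

    # Train on rows where the categorical value is present ...
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(known[["age"]], known["industry"])

    # ... then fill the missing values with the classifier's predictions
    df.loc[df["industry"].isna(), "industry"] = clf.predict(unknown[["age"]])
    print(df)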