I would like to use libsvm for a keypoint detection algorithm. Each keypoint has 36 features, but each sample of an Object has a different count of keypoints...
my input array would look like:
Object 1: (K1_F1,...K1_F36,K2_F1,...K2_F36, ... , K12_F1,...K12_F36)
Object 1: (K1_F1,...K1_F36,K2_F1,...K2_F36, ... , K15_F1,...K15_F36)
Object 2: (K1_F1,...K1_F36,K2_F1,...K2_F36, ... , K16_F1,...K16_F36)
Object 2: (K1_F1,...K1_F36,K2_F1,...K2_F36, ... , K9_F1,...K9_F36)
Is it even possible to train with different count of keypoints?
In short: no, it is not possible. SVM requires constant shape data representation. There are two ways for approaching such a problem:
Create some conversion into constant size representation, one of the most common are clustering methods, bag of words representations and other compression-based approaches.
Find a suitable kernel function, which for two sets of keypoints returns a valid scalar product value in some space and feed it to the SVM.
Related
I am currently working on a project that uses Linear Discriminant Analysis to transform some high-dimensional feature set into a scalar value according to some binary labels.
So I train LDA on the data and the labels and then use either transform(X) or decision_function(X) to project the data into a one-dimensional space.
I would like to understand the difference between these two functions. My intuition would be that the decision_function(X) would be transform(X) + bias, but this is not the case.
Also, I found that those two functions give a different AUC score, and thus indicate that it is not a monotonic transformation as I would have thought.
In the documentation, it states that the transform(X) projects the data to maximize class separation, but I would have expected decision_function(X) to do this.
I hope someone could help me understand the difference between these two.
LDA projects your multivariate data onto a 1D space. The projection is based on a linear combination of all your attributes (columns in X). The weights of each attribute are determined by maximizing the class separation. Subsequently, a threshold value in 1D space is determined which gives the best classification results. transform(X) gives you the value of each observation in this 1D space x' = transform(X). decision_function(X) gives you the log-likelihood of an attribute being a positive class log(P(y=1|x')).
I have a number of datasets where I have an array of x, y, z coordinates of the endpoints of segments. First and second point represents a segment, so does third, fourth and so on...
The above data represents just a part of dataset... The entire dataset is a lot bigger.
I am required to train my machine with several datasets like this, so that it can predict the category of any unknown dataset further... The test dataset will also be the same as the above.
I need help with the approach. Which algorithm or approach can I use here to classify any unknown dataset into these known categories?
Its an unsupervised learning problem. If you know roughly in how many classes your data should be split use K-Means (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
Otherwise, a combination of TSNE (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) and Kmeans usually works well. Basically transform data using TSNE and run Kmeans in transformed data.
How do I choose the number of the max_features parameter in TfidfVectorizer module? Should I use the maximum number of elements in the data?
The description of the parameter does not give me a clear vision of how to choose the value for it:
max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
This parameter is absolutely optional and should be calibrated according to the rational thinking and the data structure.
Sometimes it is not effective to transform the whole vocabulary, as the data may have some exceptionally rare words, which, if passed to TfidfVectorizer().fit(), will add unwanted dimensions to inputs in the future. One of the appropriate techniques in this case, for instance, would be to print out word frequences accross documents and then set a certain threshold for them. Imagine you have set a threshold of 50, and your data corpus consists of 100 words. After looking at the word frequences 20 words occur less than 50 times. Thus, you set max_features=80 and you are good to go.
If max_features is set to None, then the whole corpus is considered during the TF-IDF transformation. Otherwise, if you pass, say, 5 to max_features, that would mean creating a feature matrix out of the most 5 frequent words accross text documents.
Quick example
Assume you work with hardware-related documents. Your raw data is the following:
from sklearn.feature_extraction.text import TfidfVectorizer
data = ['gpu processor cpu performance',
'gpu performance ram computer',
'cpu computer ram processor jeans']
You see the word jeans in the third document is hardly related and occures only once in the whole dataset. The best way to omit the word, of course, would be to use stop_words parameter, but imagine if there are plenty of such words; or words that are related to the topic but occur scarcely. In the second case, the max_features parameter might help. If you proceed with max_features=None, then it will create a 3x7 sparse matrix, while the best-case scenario would be 3x6 matrix:
tf = TfidfVectorizer(max_features=None).fit(data)
tf.vocabulary_.__len__() # returns 7 as we passed 7 words
tf.fit_transform(data) # returns 3x7 sparse matrix
tf = TfidfVectorizer(max_features=6).fit(data) # excluding 'jeans'
tf.vocabulary_ # prints out every words except 'jeans'
tf.vocabulary_.__len__() # returns 6
tf.fit_transform(data) # returns 3x6 sparse matrix
This question refers to best practices in Theano. Here is what I am trying to do:
I am building a neural network for an SMT system. In this context, I conceptually represent sentences as variable-length lists of words, and words as fixed-length lists of integers. Ideally, I would like to represent my corpus as a 3D tensor (first dimension = sentences in corpus, second dimension = words in sentence, third dimension = integer features in words). The difficulty is that sentences have variable length and, to my knowledge, tensors in Theano have the strict requirement that all lengths in one dimension must be the same.
Solutions I have thought of include:
Use padding with dummy words so that sentences become equally sized. But this means that whenever I iterate over a sentence, I need to include special code to discard the padding.
Represent the corpus as a vector of matrices. However, this makes it hard to work with certain functions. For instance, if I want to add up the representations of all the words in a sentence, I can't simply use *corpus.sum(axis=1)*. I would have to loop over sentences, do *sentence.sum(axis=0)*, and then gather the results into another tensor.
My question is: which of these alternatives are preferred, or is there a better one?
The first option is probably the best option in most cases. It's what I do though it does mean passing around a separate vector of sentence lengths and masking certain results to eliminate the padding region when needed.
In general, if you want to perform a consistent operation to all sentences then you'll usually get much better speed applying that operation to a single 3D tensor than sequentially to a series of matrices. This is especially true for operations running on a GPU.
If you're using scan operations the speed differences will become even more magnified. You'll be better off scanning over a 3D tensor and operating on a per-word matrix in your step function that covers all (or a minibatch of) sentences. If needed, you may need to know which rows of that matrix are real data and which are padding. As an aside, I find that setting the first dimension of a 3D tensor to be the temporal/sequence position dimension helps when using scan, which always scans over the first dimension.
Often, using the value zero as your padding value will result in the padding have no impact on your operations.
The other option, looping over the sentences, would mean mixing Theano and Python code which can make some computations difficult or impossible. For example, getting the gradient of a cost function with respect to some parameters over a all (or batch) of your sentences may not be possible if the data is stored in lots of separate matrices.
I'm using the SVM classifier in the machine learning scikit-learn package for python.
My features are integers. When I call the fit function, I get the user warning "Scaler assumes floating point values as input, got int32", the SVM returns its prediction, I calculate the confusion matrix (I have 2 classes) and the prediction accuracy.
I've tried to avoid the user warning, so I saved the features as floats. Indeed, the warning disappeared, but I got a completely different confusion matrix and prediction accuracy (surprisingly much less accurate)
Does someone know why it happens? What is preferable, should I send the features as float or integers?
Thanks!
You should convert them as floats but the way to do it depends on what the integer features actually represent.
What is the meaning of your integers? Are they category membership indicators (for instance: 1 == sport, 2 == business, 3 == media, 4 == people...) or numerical measures with an order relationship (3 is larger than 2 that is in turn is larger than 1). You cannot say that "people" is larger than "media" for instance. It is meaningless and would confuse the machine learning algorithm to give it this assumption.
Categorical features should hence be transformed to explode each feature as several boolean features (with value 0.0 or 1.0) for each possible category. Have a look at the DictVectorizer class in scikit-learn to better understand what I mean by categorical features.
If there are numerical values just convert them as floats and maybe use the Scaler to have them loosely in the range [-1, 1]. If they span several order of magnitudes (e.g. counts of word occurrences) then taking the logarithm of the counts might yield better results. More documentation on feature preprocessing and examples in this section of the documentation: http://scikit-learn.org/stable/modules/preprocessing.html
Edit: also read this guide that has many more details for features representation and preprocessing: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf