I want to do multi-label text classification on a big data set, and it seems that big-data machine learning tools such as Apache Mahout or Spark MLlib do not currently support it. Has anyone done multi-label classification on big data sets before? Are there any plans to integrate multi-label classification into either Mahout or Spark in the near future?
This paper addresses the nature of the benefits you would receive from multioutput forecasting... namely:
The ability to account for multiple independent input parameters when making a prediction, rather than having to continuously update your metrics for each nth-index prediction you are trying to make within a given forecast.
Computational speed is increased.
Based on your need, I would recommend down-sampling to a smaller group for your current problem, and then creating multiple models around bespoke groups within your dataset if performance does not match what you are looking for.
I am still encountering this challenge myself (4 years since your post...).
Here is a list of helpful articles that I have collected while trying to address this:
Long-term forecasting with machine learning models
Sorry ARIMA, but I’m Going Bayesian
Multiple-Output Modelling for Multi-Step-Ahead Time Series Forecasting
Can we first transform the labels into a class, and then, after prediction, transform it back to the original labels? For example, I have 3 labels to predict, [y1, y2, y3]. If [y1, y2, y3] = [1, 0, 1], then I give it label = 101 (binary) = 5. During prediction, I can predict the probability of y1 in the following way:
p(y1=1) = p(100) + p(101) + p(110) + p(111). In this way a multi-label problem becomes a multi-class problem.
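A minimal sketch of this label-powerset idea, using toy data and scikit-learn's LogisticRegression (the data and the choice of classifier are illustrative, not from the original post):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy multi-label data: 3 binary labels per sample (illustrative only).
X = np.random.rand(200, 10)
Y = (np.random.rand(200, 3) > 0.5).astype(int)

# Label-powerset encoding: treat each label combination as one class,
# e.g. [1, 0, 1] -> binary 101 -> class 5.
powerset = Y[:, 0] * 4 + Y[:, 1] * 2 + Y[:, 2]

clf = LogisticRegression(max_iter=1000).fit(X, powerset)
proba = clf.predict_proba(X)   # shape (n_samples, n_classes_seen)
classes = clf.classes_

# Recover the marginal probability of y1 = 1 by summing over all
# combinations whose first bit is set: p(100) + p(101) + p(110) + p(111).
mask_y1 = (classes & 4).astype(bool)
p_y1 = proba[:, mask_y1].sum(axis=1)
```

Note that the number of powerset classes grows exponentially with the number of labels, so this only stays practical for a handful of labels.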
I have a dataset of my own, and the dataset contains two classes, let's say 0 and 1. Besides, there is a large portion of nodes whose class is unlabeled. My goal is to predict these unlabeled nodes using a GCN, but I am confused about how to deal with these unlabeled nodes in PyTorch Geometric.
As far as I can think, I could label the nodes with 3 classes: 0, 1, and unknown. But if I do it that way, it means I am trying to classify the dataset into three classes, which is not what I want, since unknown is not a real class.
Another way to deal with these nodes is to ignore them and simply run PyG on the labeled nodes. But in that case, it seems that the unlabeled nodes (which do have features) are useless in the dataset?
That very much depends on your use case and the data!
Case 1 - Graph Autoencoder
For this case let's assume the task is to find similar tweets. A way of doing this is to train a Graph Autoencoder (see example).
This approach is completely unsupervised and thus does not need any data to be labeled.
The resulting model should be able to generate an embedding for each node (in this case each tweet) so that the distance between similar tweets is lower than between non-similar ones (measured, e.g., by cosine distance).
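A minimal sketch of such a Graph Autoencoder in PyTorch Geometric, assuming `data` is a PyG `Data` object with node features and edges (hyperparameters are illustrative):

```python
import torch
from torch_geometric.nn import GCNConv, GAE

class Encoder(torch.nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, 2 * out_channels)
        self.conv2 = GCNConv(2 * out_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

# data.x: node features, data.edge_index: connectivity -- no labels needed.
model = GAE(Encoder(in_channels=data.num_features, out_channels=16))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

model.train()
for epoch in range(100):
    optimizer.zero_grad()
    z = model.encode(data.x, data.edge_index)    # node embeddings
    loss = model.recon_loss(z, data.edge_index)  # reconstruct observed edges
    loss.backward()
    optimizer.step()
```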
Case 2 - Semi-Supervised GCN
Another case would be to classify tweets as advertisement vs. non-advertisement. Since the idea behind GCNs is to train in a semi-supervised manner it would be no problem to only have labels for some of the tweets.
In order to tell PyG which nodes have labels and should be used for training, you can define a train_mask. All nodes with missing labels will still technically need a y-value, which can be set to -1.
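A minimal sketch of what that could look like, assuming `data` is a PyG `Data` object where unlabeled nodes carry the placeholder label -1 (architecture and hyperparameters are illustrative):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

# Only nodes with a real label contribute to the loss.
data.train_mask = data.y != -1

class GCN(torch.nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_channels, 16)
        self.conv2 = GCNConv(16, num_classes)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

model = GCN(data.num_features, num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)          # predictions for all nodes
    loss = F.cross_entropy(out[data.train_mask],  # loss only on labeled nodes
                           data.y[data.train_mask])
    loss.backward()
    optimizer.step()
```

The unlabeled nodes still participate in message passing through `edge_index`, so their features are not wasted; they simply never enter the loss.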
I am trying a multi-task regression model. However, the ground-truth labels of the different tasks are on different scales. Therefore, I wonder whether it is necessary to normalize the targets; otherwise, the MSE of some large-scale tasks will be much larger than the others. The figure below shows part of my overall targets. You can see that columns like ASA_m2_c have much higher values than some others.
First, I have already tried some weighted-loss techniques to balance how much the model concentrates on each task during gradient backpropagation. The results show it didn't perform well.
Secondly, I have seen plenty of discussion about normalizing the input data, but hardly anything specifically about normalizing the labels. This is partly because most people's problems are single-task classification. I do know PyTorch provides a convenient way to normalize vision datasets via transforms.Normalize, but that still operates on the inputs rather than the labels.
Similar questions: https://forums.fast.ai/t/normalizing-your-dataset/49799
https://discuss.pytorch.org/t/ground-truth-label-normalization/26981/19
PyTorch - How should you normalize individual instances
Moreover, I think it might be helpful to provide some details of my model architecture. The input is first fed into a feature extractor and then several generators use the shared output representation from that extractor to predict different targets.
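As a concrete illustration of what normalizing the targets could look like, here is a minimal sketch of per-task standardization with scikit-learn's StandardScaler and an inverse transform after prediction (the data and scales are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Y_train: (n_samples, n_tasks) regression targets on very different scales.
Y_train = np.random.rand(100, 3) * np.array([1000.0, 1.0, 50.0])

scaler = StandardScaler()
Y_train_scaled = scaler.fit_transform(Y_train)  # each task ~ zero mean, unit variance

# ... train the multi-task model on Y_train_scaled so that every task
# contributes comparably to the MSE ...

# After prediction, map the outputs back to the original scales for reporting.
Y_pred_scaled = Y_train_scaled                  # placeholder for the model's output
Y_pred = scaler.inverse_transform(Y_pred_scaled)
```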
I've been working on a Multi-Task Learning problem where one head has an output of ~500 and another between 0 and 1.
I've tried Uncertainty Weighting, but in vain. So I'd be grateful if you could give me a little clue about your studies (if there is any progress).
Thanks.
I would like to use scikit-learn's svm.SVC() estimator to perform classification tasks on multi-dimensional time series - that is, on time series where the points in the series take values in R^d, where d > 1.
The issue with doing this is that svm.SVC() will only take ndarray objects of dimension at most 2, whereas the dimension of such a dataset would be 3. Specifically, the shape of a given dataset would be (n_samples, n_features, d).
Is there a workaround available? One simple solution would be to reshape the dataset so that it is 2-dimensional; however, I imagine this would lead to the classifier not learning from the dataset properly.
Without any further knowledge about the data, reshaping is the best you can do. Feature engineering is a very manual art that depends heavily on domain knowledge.
As a rule of thumb: if you don't really know anything about the data throw in the raw data and see if it works. If you have an idea what properties of the data may be beneficial for classification, try to work it in a feature.
Say we want to classify swiping patterns on a touch screen. This closely resembles your data: We acquired many time series of such patterns by recording the 2D position every few milliseconds.
In the raw data, each time series is characterized by n_timepoints * 2 features. We can use that directly for classification. If we have additional knowledge we can use that to create additional/alternative features.
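For instance, a minimal sketch of flattening the 3D array and feeding the raw features to svm.SVC (shapes and data are toy stand-ins for the question's setup):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data shaped like the question: (n_samples, n_timepoints, d) with d = 2.
n_samples, n_timepoints, d = 100, 50, 2
X = np.random.rand(n_samples, n_timepoints, d)
y = np.random.randint(0, 2, size=n_samples)

# Flatten each series into a single row of n_timepoints * d raw features.
X_flat = X.reshape(n_samples, -1)   # shape (100, 100)

clf = SVC(kernel="rbf").fit(X_flat, y)
```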
Let's assume we want to distinguish between zig-zag and wavy patterns. In that case smoothness (however that is defined) may be a very informative feature that we can add as a further column to the raw data.
On the other hand, if we want to distinguish between slow and fast patterns, the instantaneous velocity may be a good feature. However, the velocity can be computed as a simple difference along the time axis. Even linear classifiers can model this easily so it may turn out that such features, although good in principle, do not improve classification of raw data.
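A sketch of adding such an engineered feature, here the mean instantaneous speed of each swipe, as an extra column next to the raw data (again with toy data):

```python
import numpy as np

# Toy swipe data: (n_samples, n_timepoints, 2) positions, as in the example above.
X = np.random.rand(100, 50, 2)

velocity = np.diff(X, axis=1)                   # finite differences along time
speed = np.linalg.norm(velocity, axis=2)        # instantaneous speed per time step
mean_speed = speed.mean(axis=1, keepdims=True)  # one summary feature per sample

# Append the engineered feature to the flattened raw data.
X_features = np.hstack([X.reshape(len(X), -1), mean_speed])
```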
If you have lots and lots and lots and lots of data (say, an internet full of good examples), Deep Learning neural networks can automatically learn features to some extent, but let's say this is rather advanced. In the end, most practical applications come down to trial and error. See what features you can come up with and try them out in practice. And beware the overfitting gremlin.
I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
set the feature weights for each feature when "training" the KNN learner.
learn what the optimal weight values are with or without pre-processing the data.
On a related note, I understand that generally KNN does not require training, but since sklearn implements it using KD-trees, the tree must be generated from the training data. However, this sounds like it's turning KNN into a binary-tree problem. Is that the case?
Thanks.
kNN is simply based on a distance function. When you say "feature two is more important than others", it usually means that a difference in feature two is worth, say, 10x a difference in the other coords. A simple way to achieve this is by multiplying coord #2 by its weight. So you put into the tree not the original coords but the coords multiplied by their respective weights.
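A minimal sketch of this with scikit-learn's KNeighborsClassifier (the data and the weight values are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(200, 20)            # 20 features, as in the question
y = np.random.randint(0, 2, size=200)

weights = np.ones(20)
weights[1] = 10.0                      # make feature #2 count 10x in the distance

knn = KNeighborsClassifier(n_neighbors=5).fit(X * weights, y)
pred = knn.predict(X[:5] * weights)    # apply the same scaling at query time
```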
In case your features are combinations of the coords, you might need to apply an appropriate matrix transform to your coords before applying the weights; see PCA (principal component analysis). PCA is likely to help you with question 2.
The answer to question two is called "metric learning" and is currently not implemented in scikit-learn. Using the popular Mahalanobis distance amounts to rescaling the data with StandardScaler. Ideally you would want your metric to take the labels into account.
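For reference, the rescaling mentioned above could look like this (a sketch with made-up data; the pipeline simply standardizes every feature before the distance computation):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(200, 20)            # illustrative data, as above
y = np.random.randint(0, 2, size=200)

# Scaling every feature to unit variance before kNN gives the
# diagonal-Mahalanobis effect described above.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)
```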
I'm trying to classify some EEG data using a logistic regression model (this seems to give the best classification of my data). The data I have is from a multichannel EEG setup, so in essence I have a matrix of 63 x 116 x 50 (that is, channels x time points x number of trials; there are two trial types with 50 trials each). I have reshaped this into one long vector per trial.
What I would like to do after the classification is to see which features were the most useful in classifying the trials. How can I do that, and is it possible to test the significance of these features? E.g., to say that the classification was driven mainly by N features, and these are features x to z. So I could, for instance, say that channel 10 at time points 90-95 was significant or important for the classification.
So is this possible or am I asking the wrong question?
Any comments or paper references are much appreciated.
Scikit-learn includes quite a few methods for feature ranking, among them:
Univariate feature selection (http://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html)
Recursive feature elimination (http://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_digits.html)
Randomized Logistic Regression/stability selection (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RandomizedLogisticRegression.html)
(see more at http://scikit-learn.org/stable/modules/feature_selection.html)
Among those, I definitely recommend giving Randomized Logistic Regression a shot. In my experience, it consistently outperforms other methods and is very stable.
Paper on this: http://arxiv.org/pdf/0809.2932v2.pdf
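To make this concrete for the EEG setup in the question, here is a minimal sketch using recursive feature elimination (one of the methods listed above) and mapping the selected flat indices back to (channel, time point) pairs; the data below is a random stand-in and the number of selected features is arbitrary:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the EEG data: 100 trials, 63 channels x 116 time points, flattened.
n_trials, n_channels, n_times = 100, 63, 116
X = np.random.randn(n_trials, n_channels * n_times)
y = np.random.randint(0, 2, size=n_trials)

# Recursive feature elimination keeps the 50 most useful features.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=50, step=0.1)
selector.fit(X, y)

# Map the selected flat indices back to (channel, time point) pairs.
channel_idx, time_idx = np.unravel_index(np.where(selector.support_)[0],
                                         (n_channels, n_times))
```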
Edit:
I have written a series of blog posts on different feature selection methods and their pros and cons, which are probably useful for answering this question in more detail:
http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/
http://blog.datadive.net/selecting-good-features-part-ii-linear-models-and-regularization/
http://blog.datadive.net/selecting-good-features-part-iii-random-forests/
http://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/