I have thousands of records, each belonging to a unique class. I think this means that, whichever ML algorithm I choose, I will not be able to do a train/test split, because the test split cannot be modelled: with only one record per class, there is no way the model could have trained on anything that appears in the test split.
I can't justify introducing unseen records for prediction if my model cannot be validated.
Each record is similar to a generic accounting record, with about 20 features. I thought about using KNN, but KNN would not be able to do a train/test split either. How do I go about mitigating this? Is KNN the wrong model to use here?
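For illustration, a 1-NN "classifier" over such data degenerates into a nearest-record lookup, with nothing held out to validate against. A toy sketch with made-up data:

```python
# With one record per class, k-NN (k=1) reduces to finding the single most
# similar stored record; every record is its own class, so nothing can be
# held out for testing without making that class unlearnable.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# 5 records = 5 classes, 20 features each (synthetic stand-in data)
records = np.random.default_rng(0).normal(size=(5, 20))
nn = NearestNeighbors(n_neighbors=1).fit(records)

# A new record very close to record 2 simply retrieves record 2
query = records[2] + 0.01
dist, idx = nn.kneighbors([query])
print(int(idx[0][0]))  # index of the most similar stored record
```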
Related
I'm gathering training data for multilabel classification. Some of the data fed into this project will not have enough information to assign it to one of the labels. If I train the model with data that belongs to no label, will it avoid labelling new data that is unclear? Do I need to train it with an "Unclear" label or should I just leave this type of data unlabelled?
I can't seem to find the answer to this question in the spaCy docs.
Assuming you really want multilabel classification, i.e. an instance can have zero or multiple classes, then it's fine to have some data without any label. If the model performs correctly, it should also predict no label for similar instances. Be careful, however: "no label" doesn't mean "unclear" to the model; it means that none of the possible classes apply (the classes are considered independently).
Note that in the case of multiclass classification, i.e. an instance always has exactly one class, it is impossible to assign no label to an instance. But it would also be suboptimal to create an "unclear" class, because in multiclass classification the model predicts the most likely class relative to the others, and semantically "no label" is not a regular label comparable to the others.
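As a concrete sketch of the multilabel case, here is a toy scikit-learn pipeline (the texts and label names are made up) in which an empty label set becomes an all-zeros indicator row, which the model can also predict:

```python
# Multilabel setup: each class gets its own independent binary classifier,
# so "no label" is simply the case where every binary classifier says no.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["invoice overdue payment", "goal scored final match",
         "payment for match tickets", "lorem ipsum dolor"]
labels = [{"finance"}, {"sports"}, {"finance", "sports"}, set()]  # last: no label

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # 0/1 column per class; empty set -> all-zeros row

clf = make_pipeline(TfidfVectorizer(),
                    OneVsRestClassifier(LogisticRegression()))
clf.fit(texts, Y)

pred = clf.predict(["quarterly invoice payment"])
print(dict(zip(mlb.classes_, pred[0])))  # per-class 0/1 decisions
```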
Technically this is not a programming question (for future reference, better ask such questions on https://datascience.stackexchange.com/ or https://stats.stackexchange.com/).
I am trying to solve a multi-class single-label document classification problem assigning a single class to a document. Documents are domain-specific technical documents, with technical terms:
Train: I have 19 classes with a single document in each class.
Target: I have 77 documents without labels I want to classify to the 19 known classes.
Documents have between 60-3000 tokens after pre-processing.
My entire corpus (19 + 77 documents) has 65k terms (uni/bi/tri-grams), with 4.5k terms in common between train and target.
Currently, I am vectorizing documents using a tf-idf vectorizer and reducing dimensions to common terms. Then doing a cosine similarity between train and target.
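For reference, a minimal sketch of that pipeline with toy documents (the exact dimension reduction to common terms is simplified here to fitting the vocabulary on the train set):

```python
# tf-idf over the train documents, then cosine similarity from each target
# document to each of the labelled train documents; the most similar train
# document's class is taken as the prediction.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_docs = ["debit credit ledger entry",     # 19 single-document classes
              "goal match player score"]       # in the real setup
target_docs = ["ledger entry for debit account"]  # 77 in the real setup

vec = TfidfVectorizer(ngram_range=(1, 3))
X_train = vec.fit_transform(train_docs)   # vocabulary limited to train terms
X_target = vec.transform(target_docs)

sims = cosine_similarity(X_target, X_train)   # shape (n_target, n_train)
predicted_class = sims.argmax(axis=1)         # most similar train document
```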
I am wondering if there is a better way. I cannot use sklearn classifiers because there is only a single document per class in train. Any ideas on a possible improvement/direction? Especially:
Does it make sense to use word-embeddings/doc2vec given the small corpus?
Does it make sense to generate synthetic train data from the terms in the training set?
Any other ideas?
Thanks in advance!
Good to see that you've considered the usual strategies - generating synthetic data, pretrained word embeddings - for a semi-supervised text classification scenario. Unfortunately, since you only have one training example per class, no matter how good your feature extraction or how effective your data generation, the classifier you train will almost certainly not generalize. You need more (real) labelled data.
I have a doc2vec model trained on documents with labels. I'm trying to continue training my model with model.train(). The new data comes with new labels as well, but, when I train it on more documents, the new labels aren't being recorded... Does anyone know what my problem might be?
Gensim's Doc2Vec only learns its set of tags at the same time it learns the corpus vocabulary of unique words – during the first call to .build_vocab() on the original corpus.
When you train with additional examples that have either words or tags that aren't already known to the model, those words or tags are simply ignored.
(The .build_vocab(…, update=True) option that Word2Vec offers to expand its vocabulary has never worked reliably for Doc2Vec – it doesn't add new tags, and it has a longstanding crashing bug – so it isn't supported on Doc2Vec.)
Note that if it is your aim to create document-vectors that assist in some downstream-classification task, you may not want to supply your known-labels as tags, or at least not as a document's only tag.
The tags you supply to Doc2Vec are the units for which it learns vectors. If you have a million text examples but only 5 different labels, and you feed those million examples into training each with only its label as a tag, the model learns only 5 doc-vectors. Essentially, it's as if you were training on only 5 mega-documents (passed in in chunks) – 'summarizing' each label down to a single point in vector-space, when it might be far more useful to think of a label as covering an irregularly-shaped "point cloud".
So, you might instead want to use document-IDs rather than labels. (Or, labels and document-IDs.) Then, use the many varied vectors from all individual documents – rather than single vectors per label – to train some downstream classifier or clusterer.
And in that case, the arrival of documents with new labels might not require a full Doc2Vec-retraining. Instead, if the new documents still get useful vectors from inference on the older Doc2Vec model, those per-doc vectors may reflect enough about the new label's documents that downstream classifiers can learn to recognize them.
Ultimately, though, if you acquire much more training data, reflecting new vocabulary & word-senses, the safest approach is to retrain a Doc2Vec model from scratch using all the data. Simple incremental training, even if it had official support, risks pulling the words/tags that appear in new data arbitrarily out of comparable alignment with words/tags that were only trained in the original dataset. It is the interleaved co-training, alongside all other examples equally, that pushes-and-pulls all vectors in a model into useful relative arrangements.
So I have been playing around with Azure ML lately, and I have one dataset with multiple values I want to predict. They all use different algorithms, and when I try to train multiple models within one experiment, it says "train model can only predict one value"; there are not enough input ports on the Train Model module to take in multiple values even if I were to use the same algorithm for each measure. I tried launching the column selector and making rules, but I get the same error as mentioned. How do I predict multiple values and later put the predicted columns together for the web service output, so I don't have to have multiple APIs?
What you would want to do is to train each model and save them as already trained models.
So create a new experiment, train your models, and save them by right-clicking on each model; they will show up in the left nav bar in the Studio. You can then drag your models into the canvas and have them score predictions, and finally merge them into the same output – as I have done in my example – through the "Add Columns" module. I made this example for Ronaldo (Real Madrid CF player) on how he will perform in a match after a training day. You can see my demo at http://ronaldoinform.azurewebsites.net
For a more detailed explanation of how to save the models and train for multiple values, you can check out the answer by Raymond Langaeian (MSFT) in the comment section at this link:
https://azure.microsoft.com/en-us/documentation/articles/machine-learning-convert-training-experiment-to-scoring-experiment/
You have to train a model for each variable you're going to predict, then add all the predicted columns together to get a single output for the web service.
The algorithms available in Azure ML can only predict a single variable at a time, based on the inputs they're given.
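Outside Azure ML, the same one-model-per-target pattern can be sketched with scikit-learn (synthetic data, purely illustrative):

```python
# Train one model per target column, then "add columns": stack each model's
# predictions side by side into a single output table.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # shared input features
Y = X @ rng.normal(size=(3, 2)) + rng.normal(scale=0.1, size=(100, 2))  # 2 targets

# One trained model per target variable
models = [LinearRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

# Combine the per-model prediction columns into one output
preds = np.column_stack([m.predict(X) for m in models])
print(preds.shape)  # one row per record, one column per predicted variable
```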
I am trying to solve a timeline detection problem using text classification. As a newbie I am confused about how to go about this. Is this a classification problem? That is, can I use the years (timelines) as outcomes and solve it as a classification problem?
You should be able to solve this as a classification problem as you suggest. An option could be to find or build a corpus consisting of texts tagged with the period in which they're set, and train a classification algorithm on this data set.
Another option could be to train a word space model on such a data set, and generate vectors for different periods of time (e.g. the 50s, 60s etc.). You could then create a document vector for the text you wish to classify, and find which of these time vectors yields the best match.
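A toy sketch of the first option – periods as class labels for an ordinary text classifier (the texts and decade labels are made up):

```python
# Treat each time period as a class label and train a standard text
# classifier on texts tagged with the period in which they're set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "rock and roll jukebox diner",
    "moon landing counterculture protest",
    "disco funk platform shoes",
]
periods = ["1950s", "1960s", "1970s"]  # one label per training text

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, periods)

pred = clf.predict(["a jukebox in the diner"])[0]  # predicted period label
print(pred)
```

In practice you would need a far larger tagged corpus per period for this to work at all well.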
Might not work, but it could be interesting to see what results you get.
Hope this helps!