Multi-class text classification with one training example per class - nlp

I am trying to solve a multi-class single-label document classification problem assigning a single class to a document. Documents are domain-specific technical documents, with technical terms:
Train: I have 19 classes with a single document in each class.
Target: I have 77 documents without labels I want to classify to the 19 known classes.
Documents have between 60-3000 tokens after pre-processing.
My entire corpus (19+77 documents) has 65k terms (uni/bi/tri-grams), with 4.5k terms in common between train and target.
Currently, I am vectorizing documents using a tf-idf vectorizer and reducing dimensions to the common terms. I then compute the cosine similarity between each train and target document.
I am wondering if there is a better way? I cannot use sklearn classifiers due to a single document in each class in train. Any ideas on a possible improvement/direction? Especially:
Does it make sense to use word-embeddings/doc2vec given the small corpus?
Does it make sense to generate synthetic train data from the terms in the training set?
Any other ideas?
Thanks in advance!

Good to see that you've considered the usual strategies - generating synthetic data, pretrained word embeddings - for a semisupervised text classification scenario. Unfortunately, since you only have one training example per class, no matter how good your feature extraction or how effective your data generation, the classifier you train will almost certainly not generalize. You need more (real) labelled data.
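For reference, the baseline described in the question (tf-idf restricted to the terms shared by train and target, then nearest class by cosine similarity) can be sketched roughly as below. This is a minimal standard-library sketch, not the asker's actual code: the function names, the smoothed idf formula and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Return one sparse tf-idf vector (dict term -> weight) per tokenised document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each vocabulary term.
    df = Counter(t for doc in docs for t in set(doc) if t in vocabulary)
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in vocabulary)
        # Smoothed idf (an assumption), so very common terms keep a non-zero weight.
        vectors.append({t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(train, targets):
    """train: {label: tokens}, targets: list of token lists.

    Returns one (label, similarity) pair per target document.
    """
    labels = list(train)
    train_docs = [train[label] for label in labels]
    # Keep only the terms that occur on both the train and the target side.
    vocabulary = ({t for d in train_docs for t in d}
                  & {t for d in targets for t in d})
    vectors = tfidf_vectors(train_docs + targets, vocabulary)
    train_vecs, target_vecs = vectors[:len(labels)], vectors[len(labels):]
    results = []
    for vec in target_vecs:
        score, label = max((cosine(vec, tv), lab)
                           for tv, lab in zip(train_vecs, labels))
        results.append((label, score))
    return results


# Hypothetical toy data: one tokenised training document per class.
train = {"pumps": "centrifugal pump impeller seal".split(),
         "valves": "gate valve stem seal".split()}
targets = ["pump impeller wear".split(), "valve stem leak".split()]
for label, score in classify(train, targets):
    print(label, round(score, 3))
```

Returning the similarity alongside the label makes it possible to flag low-confidence assignments for manual review, which is one cheap way to obtain the additional real labels mentioned above.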

Related

Multilingual free-text-items Text Classification for improving a recommender system

To improve the recommender system for Buyer Material Groups (BMG), our company wants to train a model using customers' historical spend data. The model should be trained on historical "Short text descriptions" to predict the appropriate BMG. The dataset has more than 500,000 rows and the text descriptions are multilingual (up to 40 characters).
Question 1: Can I use supervised learning given that the descriptions are in multiple languages? If yes, are classic approaches like multinomial naive Bayes or SVM suitable?
Question 2: If the first model does not perform well and I want to improve it by using unsupervised multilingual embeddings to build a classifier, how can I train this classifier on the numerical labels later?
If you have other ideas or approaches, please feel free to share them :). (It is a simple text classification problem.)
Can I use supervised learning given that the descriptions are in multiple languages?
Yes, this is not a problem except it makes your data more sparse. If you actually only have 40 characters (is that not 40 words?) per item, you may not have enough data. Also the main challenge for supervised learning will be whether you have labels for the data.
If yes, are classic approaches like multinomial naive Bayes or SVM suitable?
They will work as well as they always have, though these days building a vector representation is probably a better choice.
If I want to improve the first model in case it is not performing well, and use unsupervised multilingual embeddings to build a classifier, how can I train this classifier on the numerical labels later?
Assuming the numerical labels are labels on the original data, you can add them as tokens like LABEL001 and the model can learn representations of them if you want to make an unsupervised recommender.
Honestly these days I wouldn't start with Naive Bayes or classical models, I'd go straight to word vectors as a first test for clustering. Using fasttext or word2vec is pretty straightforward. The main problem is that if you really only have 40 characters per item, that just might not be enough data to cluster usefully.
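For comparison, the classic supervised baseline mentioned in the first question could look roughly like the sketch below: multinomial naive Bayes over character n-grams, which avoids language-specific tokenisation for very short multilingual strings. It uses only the standard library. The class name, the n-gram range, the add-one smoothing and the example descriptions and labels are assumptions for illustration, not part of the original question or answer.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=2, n_max=4):
    """Character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class CharNgramNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over character n-grams."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_ngrams(text))
        self.vocabulary = {g for counts in self.feature_counts.values()
                           for g in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            total = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for gram in char_ngrams(text):
                if gram in self.vocabulary:  # skip n-grams unseen in training
                    score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical short descriptions in two languages with made-up group labels.
texts = ["Schraube M8 verzinkt", "hex bolt M10 steel",
         "Druckerpapier A4 weiss", "copy paper A4 white"]
labels = ["fasteners", "fasteners", "office", "office"]
model = CharNgramNaiveBayes().fit(texts, labels)
print(model.predict("Schrauben M6"))
```

With 500,000 labelled rows, a baseline of this kind is cheap to train and gives a reference score against which an embedding-based model can be judged.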

NLP - Best document embedding library

Good day, fellow humans (?).
I have a methodological question, and I am confused after doing a lot of research in a very short time.
The question arises from the following problem: I need to apply semi-supervised or unsupervised clustering to documents. I have ~300 documents labeled with multiple labels and approximately 3400 unlabeled documents. The number of unlabeled documents could grow to ~10'000 in the next few days.
The main idea is to apply semi-supervised clustering based on the labels at hand. The alternative is to go fully unsupervised with soft clustering.
We thought of creating embeddings for the whole documents, but here lies the confusion: which library is the best for such a task?
I guess the utmost importance needs to lie in the context of the whole document. As far as I know, BERT and FastText provide context-dependent word embedding, but not whole document embedding. On the other hand, Gensim's Doc2Vec is context-agnostic, right?
I think I saw a way to train sentence embeddings with BERT, via the HuggingFace API, and was wondering whether it could be useful to consider the whole document as a single sentence.
Do you have any suggestion? I'm probably exposing my utter ignorance and confusion on the matter, but my brain is melted.
Thank you very much for your time.
Viva!
Edit to answer @gojomo:
My documents are on average ~180 words. The original task was that of multi-label text classification, i.e. each document can have from 1 to N labels, with the number of labels now being N=18. They are highly imbalanced.
Having only 330 labeled documents so far due to several issues, we asked the documents' provider to also supply unlabeled data, which should reach the order of 10k.
I used FastText in classification mode, but the result is obviously atrocious. I also ran a K-NN on Doc2Vec document embeddings, but the result is obviously still atrocious.
I was going to use biomedical BERT-based models (like BioBERT and SciBERT) to produce NER tags (trained on domain-specific datasets) for the documents and later apply a classifier.
Now that we have unlabeled documents at our disposal, we wanted to venture into semi-supervised classification or unsupervised clustering, just to explore possibilities. I have to say that this is just a master's thesis.

Using cosine similarity for classifying documents

I have a set of files for five different categories, and most of them are not labelled correctly. The objective is to predict the correct category of a file whenever it is uploaded. I used cosine similarity along with tf-idf, predicting the class of the document with which the cosine similarity is highest. As of now I am getting good results, but I am really not sure how well this will work down the road. Also, why isn't cosine similarity used to build document classifiers instead of machine learning models when the categories of the files are labelled correctly? I would really appreciate your feedback on my approach as well as your answer to the question.
Cosine similarity measures the cosine of the angle between two n-dimensional vectors. These vectors are mostly produced by embeddings, that is, by pretrained models that produce word embeddings or fixed-size vectors. If you are using something like Doc2Vec, you get a vector for the whole document, and these vectors can be categorized using cosine similarity.
In your case, you should try an LSTM text classifier using embedding layers. 1D convolution layers can also be useful.
As for TF-IDF, it is useful for text classification that depends on certain words in the corpus. Words with a higher term frequency and a lower document frequency get a higher TF-IDF score, and the model learns to classify texts based on such scores.
In many cases, RNNs work well for classifying texts, and pretrained embeddings make the model more efficient.
Last but not least, you can give naive Bayes text classification a try. It has been very useful in spam classification.
Tip:
You can combine the above methods into a single text classification system, following a process like this:
1. Generate embeddings from Doc2Vec.
2. Compare the similarity of the input with the other texts and thereby determine its class.
3. Use the embedding in an LSTM network to produce class probabilities.
4. Apply Bayes text classification.
Steps 2, 3 and 4 give three predictions. If the majority prediction is CLASS1, the output of the system is CLASS1.
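The final voting step can be sketched as below. The three input labels stand in for the outputs of the similarity, LSTM and Bayes models, which are not shown. The function name and the choice to return None on a tie are assumptions, since the answer does not say how ties should be resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label, or None if the input is empty or tied."""
    ranked = Counter(predictions).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no strict winner, e.g. three different predictions
    return ranked[0][0]


# Hypothetical outputs of the similarity, LSTM and Bayes classifiers.
print(majority_vote(["CLASS1", "CLASS2", "CLASS1"]))
```

With three voters and more than two classes, a three-way disagreement is possible, so the caller needs a fallback such as trusting the model with the best validation score.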

I want to classify some sentences on the basis of their semantic meaning. How can I use Doc2Vec in this? Or is there a better approach than this?

I want to apply doc2vec to various reviews which we extracted from a source, and I want to classify these reviews into different classes defined by the user. How can I do this?
I consider this an interesting question. I will give you some approaches depending on the number of observations/reviews.
You can apply LSA: run SVD on the document-term matrix (DTM, built from either incidence or TF-IDF vectors). You will get three matrices as output: U, S and V. The V transpose is the sentence embedding.
Use these embeddings as input to your classification model.
I recommend using LSA when your corpus is large.
Resources: link
In a similar way, instead of using LSA, you can use pre-trained embeddings such as GloVe. These give you word embeddings; to create document vectors, use the inverse weighted frequency method. Use these document vectors for classification.
Resources: link
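The second approach (document vectors built from pre-trained word embeddings) might look like the sketch below. It assumes that the "inverse weighted frequency method" means weighting each word vector by its inverse document frequency, which the answer does not spell out. The tiny two-dimensional embedding table is a made-up stand-in for real GloVe vectors, and the function names are hypothetical.

```python
import math
from collections import Counter


def idf_weights(tokenised_docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(tokenised_docs)
    df = Counter(t for doc in tokenised_docs for t in set(doc))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def document_vector(tokens, embeddings, idf):
    """IDF-weighted mean of the word vectors; words without a vector are skipped."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token in embeddings:
            weight = idf.get(token, 1.0)
            for i, x in enumerate(embeddings[token]):
                total[i] += weight * x
            weight_sum += weight
    # An all-zero vector is returned when no token has an embedding.
    return [x / weight_sum for x in total] if weight_sum else total


# Made-up 2-d vectors standing in for pre-trained GloVe embeddings.
embeddings = {"good": [1.0, 0.0], "bad": [-1.0, 0.0], "movie": [0.0, 1.0]}
reviews = [["good", "movie"], ["bad", "movie"]]
idf = idf_weights(reviews)
vectors = [document_vector(review, embeddings, idf) for review in reviews]
print(vectors)
```

The resulting fixed-size vectors can then be passed to any standard classifier. The weighting pulls each document vector towards its rarer and usually more informative words.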

Feature Construction for Text Classification using Autoencoders

Autoencoders can be used to reduce dimensionality in feature vectors - as far as I understand. In text classification a feature vector is normally constructed via a dictionary - which tends to be extremely large. I have no experience in using autoencoders, so my questions are:
Could autoencoders be used to reduce dimensionality in text classification? (Why? / Why not?)
Has anyone already done this? A source would be nice, if so.
Existing work uses autoencoders to build models at the sentence level. Basically, after training the model with an autoencoder, you can get a vector for a sentence. Since any document consists of sentences, you can get a set of vectors for the document and use them for document classification. In my experience with various vector representations (e.g. those generated by autoencoders), doing so might give worse results than classification with bag of words.
