Best approach for multi-label multi-class texts in 2022? - nlp

I have more than 200k labeled texts. All of them are labeled with 2 main classes, every main class has 200 sub-classes, and some texts belong to up to 3 classes (1 main class and 2 sub-classes).
I'm going to give my model a new text, and my model should:
first) narrow down the candidate classes with a few initial questions,
second) guess the topic of the new text.
(It is a sort of mix of topic modeling, multi-label multi-class classification, and a question-answering method.)
What is the best approach to do that with current NLP models?
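
A common baseline for this setup (a sketch under stated assumptions, not a definitive recipe) is to fine-tune a pretrained transformer with a sigmoid head, so each text can receive a main class and several sub-classes at once; a hierarchical variant would first predict the main class and then route to a per-class sub-classifier. A minimal flat version, assuming Hugging Face transformers and an illustrative label count:

```python
# Minimal sketch: flat multi-label classification over main + sub-classes.
# Model name, label count, and threshold are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_LABELS = 402  # assumed: 2 main classes + 2 * 200 sub-classes

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # trains with BCEWithLogitsLoss
)

def predict_labels(text, threshold=0.5):
    """Return indices of every label whose sigmoid score exceeds the threshold."""
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits).squeeze(0)
    return (probs > threshold).nonzero(as_tuple=True)[0].tolist()
```

With 200k labeled texts this is well within the range where fine-tuning usually works, and the "filter by a few questions first" step can be implemented by masking out the sub-class logits that the answers rule out.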

Related

BERT multilingual fine-tuning for multi-label classification

I'm trying to do multi-label classification of French email sentences with categories such as commitment, proposition, meeting, request, subjective, etc.
The first problem I faced is that I don't have labeled sentences; rather, I have French emails as my dataset. Based on this, I found the BC3 dataset (English emails), which has sentences annotated with some of the labels listed above. So I came up with this approach: first fine-tune a multilingual BERT on this BC3 dataset for the multi-label classification task, and then do zero-shot transfer learning with the fine-tuned model (or simply use it in inference) on the sentences of my French emails. What do you think about this approach?
So I started by preprocessing the BC3 dataset and obtained 848 sentences, each of them with occurrence annotations for each category: the last 5 columns give the number of times each annotator assigned a given label to the sentence.
Are those 848 samples enough to fine-tune a multilingual BERT model?
I then tried to fine-tune on these category labels.
With one epoch and BATCH_SIZE = 4, the loss function didn't converge; rather, it oscillated between 0.79 and 0.34.
What advice would you give to solve this kind of problem?
Thanks.
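
One thing worth checking first: multi-label fine-tuning needs multi-hot float targets with a sigmoid/BCE loss, not a single class index with softmax. A sketch of turning BC3-style annotator counts into such targets (the column names, file path, and 2-annotator threshold are my assumptions):

```python
# Sketch: build multi-hot targets from per-annotator label counts.
import numpy as np
import pandas as pd
import torch

CATEGORIES = ["commitment", "proposition", "meeting", "request", "subjective"]

df = pd.read_csv("bc3_sentences.csv")  # hypothetical preprocessed file
# A label is "on" when at least 2 annotators used it (assumed threshold).
targets = torch.from_numpy((df[CATEGORIES].to_numpy() >= 2).astype(np.float32))

loss_fn = torch.nn.BCEWithLogitsLoss()  # expects float multi-hot targets
```

Also note that 848 sentences at batch size 4 is only about 212 updates per epoch, so a loss jumping between 0.79 and 0.34 within one epoch is largely per-batch noise: train for several epochs with a small learning rate (e.g. 2e-5) and judge convergence on a validation metric such as macro-F1 instead.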

NLP Classification on a dataset

I am trying to learn NLP. I understand the basic concepts, from text preprocessing to tf-idf and word embeddings. How do I apply this learning? I have a dataset with two columns: Answer and Gender. I want to use NLP to transform the Answer column into vectors and then use supervised machine learning to train a model that predicts whether a certain answer was given by a male or a female.
I don't know how to proceed after I have pre-processed the text.
You can download datasets that are available in MATLAB format.
All of them are divided into train and test splits.
Check my GitHub.
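
To the original question of how to proceed after preprocessing: the standard next step is to fit a vectorizer and a supervised classifier together in a pipeline. A minimal sketch, assuming the columns are named `Answer` and `Gender` and live in a hypothetical CSV:

```python
# Sketch: tf-idf features + logistic regression on the Gender label.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("answers.csv")  # hypothetical dataset file
X_train, X_test, y_train, y_test = train_test_split(
    df["Answer"], df["Gender"], test_size=0.2, random_state=42
)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```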

How to prevent an NER model from overfitting on entity position with spaCy

I'm building a custom NER model to detect brands in product titles, using a good-sized dataset of 98k products with their corresponding titles; the train split contains around 84k records, the validation split 10k, and the test split 3k.
The only problem with the dataset is that 89% of all product titles have the brand as their first word.
When training the NER model from scratch, it gives a good F1 score of 85% after just a few epochs (batch size = 32). However, when testing the model I noticed the following:
The model is strongly biased toward predicting the first word of the title as a brand.
The model is very good at detecting brands when they occur as the first words, but is quite weak for titles that have the brand in the middle or at the end.
I had the idea to solve this by resampling the dataset: removing the brand from the first position of some titles and putting it at the end or in the middle (a sketch of this follows below).
However, I would like to know if there is a technique in NLP that keeps the model from giving too much importance to the entity's position in the text. I used a dropout of 0.6, but with no success.
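
As far as I know there is no spaCy switch that makes the model position-invariant, so the resampling idea from the question is the usual route. A sketch of relocating a leading brand while keeping the character offsets consistent with spaCy's `(start, end, label)` annotation format (the function name and the single-occurrence assumption are mine):

```python
# Sketch: move a title-initial brand to a random later position and
# recompute its character offsets for the NER training data.
import random

def move_leading_brand(title, start, end, label="BRAND"):
    if start != 0:
        return title, (start, end, label)   # brand is not the first word
    brand = title[start:end]
    words = title[end:].strip().split()
    if not words:                           # title is only the brand
        return title, (start, end, label)
    pos = random.randint(1, len(words))     # word index to insert the brand at
    new_title = " ".join(words[:pos] + [brand] + words[pos:])
    new_start = new_title.index(brand)      # assumes the brand occurs once
    return new_title, (new_start, new_start + len(brand), label)

title, ent = move_leading_brand("Acme cordless drill 18V", 0, 4)
print(title, ent)  # e.g. cordless Acme drill 18V (9, 13, 'BRAND')
```

Applying this to a fraction of the 89% brand-first titles should flatten the positional prior far more effectively than dropout, which regularizes weights rather than the data distribution.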

Multi-class text classification with one training example per class

I am trying to solve a multi-class, single-label document classification problem: assigning a single class to each document. The documents are domain-specific technical documents with technical terms:
Train: I have 19 classes, with a single document in each class.
Target: I have 77 documents without labels that I want to classify into the 19 known classes.
Documents have between 60 and 3000 tokens after pre-processing.
My entire corpus (19 + 77 documents) has 65k terms (uni-/bi-/tri-grams), with 4.5k terms in common between train and target.
Currently, I am vectorizing the documents with a tf-idf vectorizer, reducing the dimensions to the common terms, and then computing cosine similarity between train and target.
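
For concreteness, that pipeline might look roughly like this (the document lists are placeholders, and this variant fits one vocabulary over the whole corpus instead of explicitly dropping non-shared terms):

```python
# Sketch: tf-idf over uni/bi/tri-grams, then nearest class by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_docs = [f"placeholder text for class {i}" for i in range(19)]    # 1 per class
target_docs = [f"placeholder target document {j}" for j in range(77)]  # unlabeled

vec = TfidfVectorizer(ngram_range=(1, 3))
X = vec.fit_transform(train_docs + target_docs)
X_train, X_target = X[:19], X[19:]

sims = cosine_similarity(X_target, X_train)  # shape (77, 19)
predicted = sims.argmax(axis=1)              # index of the closest class
```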
I am wondering if there is a better way. I cannot use sklearn classifiers because there is only a single document per class in train. Any ideas on a possible improvement/direction? Especially:
Does it make sense to use word-embeddings/doc2vec given the small corpus?
Does it make sense to generate synthetic train data from the terms in the training set?
Any other ideas?
Thanks in advance!
Good to see that you've considered the usual strategies (generating synthetic data, pretrained word embeddings) for a semi-supervised text classification scenario. Unfortunately, since you only have one training example per class, no matter how good your feature extraction or how effective your data generation, the classifier you train will almost certainly not generalize. You need more (real) labelled data.

Scikit-learn LDA for topic extraction: perplexity and score

Hello all!
As part of a project, I need to build a text classifier with the labeled data I have. A data point is composed of a single sentence and one of 3 categories for that sentence. I have extracted 5 topics from this dataset with LDA.
What I want to try is to use these topics to determine which class an unseen sentence belongs to. I am thinking about training a supervised model on 5 features that give a sentence's topic distribution over those 5 topics.
The problem is that I cannot get a separate likelihood for each topic given a sentence. I am confused about what the perplexity and score of an LDA model indicate; they each seem to return a single float value.
Also, I am aware of supervised versions of LDA. I want to know if my approach makes sense at all.
Thanks in advance!
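
On the specific confusion: in scikit-learn, `score()` and `perplexity()` are corpus-level goodness-of-fit numbers, while the per-topic quantities wanted here come from `transform()`, which returns each document's distribution over the topics. A toy sketch (the sentences are made up):

```python
# Sketch: per-sentence topic distributions vs. corpus-level fit scores.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the board approved the annual budget",
    "kickoff meeting scheduled for monday morning",
    "please review the attached project proposal",
]  # toy corpus

X = CountVectorizer().fit_transform(sentences)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

topic_dist = lda.transform(X)  # shape (n_sentences, 5): one weight per topic
print(topic_dist[0])           # the 5-dimensional feature vector for sentence 0

print(lda.score(X))            # single float: approximate log-likelihood bound
print(lda.perplexity(X))       # single float: lower means a better fit
```

Those 5 columns are exactly the indicators described above and can be fed to any supervised classifier; whether 5 unsupervised topics separate 3 classes well is an empirical question, so it is worth comparing against a plain tf-idf baseline.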
