BERT multilingual fine-tuning for multilabel classification - nlp

I'm trying to do multilabel classification of French email sentences, with categories such as commitment, proposition, meeting, request, subjective, etc.
The first problem I faced is that I don't have labeled sentences; rather, I have French emails as my dataset. Because of this, I found the BC3 dataset (English emails), which has sentences annotated with some of the labels listed above. So I came up with this approach: first fine-tune a multilingual BERT on the BC3 dataset for the multilabel classification task, and then do zero-shot transfer learning with the fine-tuned model (or simply use it for inference) on the sentences of my French emails. What do you think about this approach?
So I started by preprocessing the BC3 dataset and obtained 848 sentences, each with its annotation counts for every category. In the image below, the last 5 columns represent the number of times each annotator labeled a sentence with a specific label.
Are those 848 samples enough to fine-tune a BERT multilingual model?
I tried to fine-tune by representing the categories as in the image below.
With one epoch and BATCH_SIZE = 4, the loss did not converge; rather, it oscillated between 0.79 and 0.34.
What advice would you give to solve this kind of problem?
Thanks.
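For concreteness, here is a minimal sketch of the setup described above, using Hugging Face Transformers; the checkpoint name, label list, example sentences, and hyperparameters are assumptions, not details taken from the post:

```python
# Sketch: fine-tune multilingual BERT for multilabel classification, then run
# inference on French sentences. Labels and examples below are hypothetical.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["commitment", "proposition", "meeting", "request", "subjective"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # uses BCEWithLogitsLoss internally
)

# One BC3-style English training sentence with a multi-hot label vector.
enc = tokenizer("Let's schedule a call next Tuesday.", return_tensors="pt")
labels = torch.tensor([[0.0, 0.0, 1.0, 1.0, 0.0]])  # meeting + request
loss = model(**enc, labels=labels).loss             # plug into a training loop / Trainer

# Zero-shot transfer: score a French sentence with the fine-tuned model.
with torch.no_grad():
    fr = tokenizer("Pouvez-vous m'envoyer le rapport avant vendredi ?",
                   return_tensors="pt")
    probs = torch.sigmoid(model(**fr).logits)        # one probability per label
```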

Related

How to get the word on which the text classification has been made?

I am doing multi-label text classification using a pre-trained BERT model. Here is an example of the prediction made for one sentence:
[image: pred_image]
I want to get the words in the sentence on which the prediction was based, like this one: [image: right_one]
If anyone has any idea, please enlighten me.
Multi-Label Text Classification (first image) and Token Classification (second image) are two different tasks, for each of which the model needs to be specifically trained.
The first one returns a probability for each label considering the entire sentence. The second returns such a prediction for each single word in the sentence, while usually considering the rest of the sentence as context.
So you cannot really take the output of a Text Classifier and use it for Token Classification, because the information you get is not detailed enough.
What you can and should do is train a Token Classification model, although you obviously will need token-level-annotated data to do so.
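For what it's worth, here is a minimal sketch of how token-level prediction looks with Hugging Face Transformers (the checkpoint name and number of labels are placeholders; you would fine-tune on your own token-annotated data first):

```python
# Sketch: token classification returns one prediction per token, unlike the
# sentence-level output of a text classifier. Checkpoint and labels are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=3)

enc = tokenizer("Please send the report by Friday.", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                     # shape: (1, num_tokens, num_labels)

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for token, label_id in zip(tokens, logits.argmax(dim=-1)[0]):
    print(token, int(label_id))                      # one predicted label per token
```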

NLP Classification on a dataset

I am trying to learn NLP. I understand the basic concepts, from text preprocessing to tf-idf and word embeddings. How do I apply this learning? I have a dataset with two columns: Answer and Gender. I want to use NLP to transform the Answer column into vectors and then use supervised machine learning to train a model that predicts whether a certain type of answer was given by a male or a female.
I don't know how to proceed after I have pre-processed the text.
You can download datasets which are available in Matlab format.
All of them are divided into train and test datasets.
Check my GitHub.
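In case it helps, the usual next step after preprocessing is to turn the text into features and fit a classifier; here is a minimal sketch with scikit-learn (the file name, column names, and choice of logistic regression are assumptions):

```python
# Sketch: tf-idf features + a supervised classifier for the Answer -> Gender task.
# File and column names and the classifier choice are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("answers.csv")                      # expects "Answer" and "Gender" columns
X_train, X_test, y_train, y_test = train_test_split(
    df["Answer"], df["Gender"], test_size=0.2, random_state=42)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)
print("Accuracy:", clf.score(X_test, y_test))
```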

Multi-class text classification with one training example per class

I am trying to solve a multi-class, single-label document classification problem, assigning a single class to each document. The documents are domain-specific technical documents containing technical terms:
Train: I have 19 classes with a single document in each class.
Target: I have 77 documents without labels I want to classify to the 19 known classes.
Documents have between 60-3000 tokens after pre-processing.
My entire corpus (19+77 documents) has 65k terms (uni/bi/tri-grams), with 4.5k terms in common between train and target.
Currently, I am vectorizing documents using a tf-idf vectorizer and reducing dimensions to common terms. Then doing a cosine similarity between train and target.
I am wondering if there is a better way? I cannot use sklearn classifiers due to a single document in each class in train. Any ideas on a possible improvement/direction? Especially:
Does it make sense to use word-embeddings/doc2vec given the small corpus?
Does it make sense to generate synthetic train data from the terms in the training set?
Any other ideas?
Thanks in advance!
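For reference, a minimal sketch of the current approach (tf-idf vectors over a shared vocabulary, then cosine similarity against the single labelled document per class); the document lists below are placeholders:

```python
# Sketch: tf-idf + cosine-similarity nearest class, one labelled document per class.
# The document lists are placeholders to be replaced with the real corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_docs = ["single labelled document for class 0",
              "single labelled document for class 1"]    # ... 19 documents in total
train_labels = [0, 1]                                     # ... the 19 known classes
target_docs = ["an unlabelled document to classify"]      # ... 77 documents in total

# Fit on the whole corpus so train and target share one vocabulary; terms that
# never occur on the other side contribute nothing to the dot product.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
vectorizer.fit(train_docs + target_docs)
X_train = vectorizer.transform(train_docs)
X_target = vectorizer.transform(target_docs)

# Assign each target document the label of its most similar training document.
sims = cosine_similarity(X_target, X_train)               # shape: (n_target, n_train)
predicted = [train_labels[i] for i in sims.argmax(axis=1)]
```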
Good to see that you've considered the usual strategies - generating synthetic data, pretrained word embeddings - for a semisupervised text classification scenario. Unfortunately, since you only have one training example per class, no matter how good your feature extraction or how effective your data generation, the classifier you train will almost certainly not generalize. You need more (real) labelled data.

Multilingual free-text-items Text Classification for improving a recommender system

To improve the recommender system for Buyer Material Groups (BMG), our company wants to train a model on customers' historical spend data. The model should be trained on historical "Short text descriptions" to predict the appropriate BMG. The dataset has more than 500,000 rows, and the text descriptions are multilingual (up to 40 characters).
1. Question: Can I use supervised learning if I consider the fact that the descriptions are in multiple languages? If yes, are classic approaches like multinomial Naive Bayes or SVM suitable?
2. Question: If I want to improve the first model in case it is not performing well, and use unsupervised multilingual embeddings to build a classifier, how can I train this classifier on the numerical labels later?
If you have other ideas or approaches, please feel free to share them :). (It is a matter of a simple text classification problem.)
Can I use supervised learning if I consider the fact that the descriptions are in multiple languages?
Yes, this is not a problem except it makes your data more sparse. If you actually only have 40 characters (is that not 40 words?) per item, you may not have enough data. Also the main challenge for supervised learning will be whether you have labels for the data.
If Yes, are classic approaches like multinomial naive bayes or SVM suitable?
They will work as well as they always have, though these days building a vector representation is probably a better choice.
If I want to improve the first model in case it is not performing well, and use unsupervised multilingual embeddings to build a classifier, how can I train this classifier on the numerical labels later?
Assuming the numerical labels are labels on the original data, you can add them as tokens like LABEL001 and the model can learn representations of them if you want to make an unsupervised recommender.
Honestly, these days I wouldn't start with Naive Bayes or classical models; I'd go straight to word vectors as a first test for clustering. Using fastText or word2vec is pretty straightforward. The main problem is that if you really only have 40 characters per item, that just might not be enough data to cluster usefully.
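A minimal sketch of that word-vector route with Gensim's fastText (the file name, vector size, and the k-means clustering step are assumptions):

```python
# Sketch: train fastText vectors on the short descriptions, average them per item,
# then cluster the item vectors. File name and parameters are assumptions.
import numpy as np
from gensim.models import FastText
from sklearn.cluster import KMeans

with open("descriptions.txt", encoding="utf-8") as f:    # one description per line
    sentences = [line.lower().split() for line in f if line.strip()]

model = FastText(sentences, vector_size=100, window=3, min_count=2, epochs=10)

def item_vector(tokens):
    # Average the word vectors of an item's tokens into one item vector.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([item_vector(s) for s in sentences])
clusters = KMeans(n_clusters=50, random_state=0).fit_predict(X)
```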

Creating input data for BERT modelling - multiclass text classification

I'm trying to build a Keras model to classify text into 45 different classes. I'm a little confused about preparing my data for input in the format required by Google's BERT model.
Some blog posts insert data as a tf dataset with input_ids, segment ids, and mask ids, as in this guide, but then some only go with input_ids and masks, as in this guide.
Also in the second guide, it notes that the segment mask and attention mask inputs are optional.
Can anyone explain whether or not those two are required for a multiclass classification task?
If it helps, each row of my data can consist of any number of sentences within a reasonably sized paragraph. I want to be able to classify each paragraph/input to a single label.
I can't seem to find many guides/blogs about using BERT with Keras (TensorFlow 2) for a multiclass problem; indeed, many of them are for multi-label problems.
I guess it is too late to answer, but I had the same question. I went through the Hugging Face code and found that if attention_mask and token_type_ids (the segment ids) are None, then by default the model attends to all tokens and all segments are given id 0.
If you want to check it out, you can find the code here
Let me know if this clarifies it or you think otherwise.
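A quick way to sanity-check this for an unpadded input is to call the model with and without those inputs and compare the outputs (small sketch; the checkpoint name is arbitrary, and note that the attention mask does matter once you add padding):

```python
# Sketch: for an unpadded input, omitting attention_mask and token_type_ids gives
# the same output as passing an all-ones mask and all-zero segment ids explicitly.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
enc = tokenizer("A single unpadded sentence.", return_tensors="pt")

with torch.no_grad():
    out_default = model(input_ids=enc["input_ids"]).last_hidden_state
    out_explicit = model(
        input_ids=enc["input_ids"],
        attention_mask=torch.ones_like(enc["input_ids"]),
        token_type_ids=torch.zeros_like(enc["input_ids"]),
    ).last_hidden_state

print(torch.allclose(out_default, out_explicit))         # True
```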
