How to prevent a NER model from overfitting to entity position with spaCy - nlp

I'm building a custom NER model to detect brands in product titles, using a reasonably sized dataset of 98k products with their corresponding titles: the training split contains around 84k records, the validation split 10k, and the test split 3k.
The only problem with the dataset is that in 89% of all product titles the brand appears as the first word(s).
When training the NER model from scratch, it reaches a good F1 score of 85% after just a few epochs (batch size = 32). However, when testing the model I noticed the following:
The model is strongly biased toward predicting the first word of the title as a brand.
It is very good at detecting brands when they occur as the first words, but quite weak for titles where the brand appears in the middle or at the end.
My idea for solving this is to resample the dataset: remove the brand from the start of some titles and move it to the middle or the end (a sketch of this augmentation follows below).
However, I would like to know whether there is an NLP technique that keeps the model from attaching too much importance to the entity's position in the text. I used a dropout of 0.6, but with no success.
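A minimal sketch of that resampling idea, assuming the training data is in spaCy's (text, entities) character-offset format; move_brand is a hypothetical helper that only handles a single leading brand span (any other entities would need their offsets shifted as well):

    def move_brand(text, entities, label="BRAND"):
        """Move a leading brand span to the end of the title,
        recomputing the entity's character offsets."""
        brand_spans = [(s, e, l) for (s, e, l) in entities if l == label]
        if not brand_spans or brand_spans[0][0] != 0:
            return text, entities  # brand is not at the start; leave as-is
        start, end, _ = brand_spans[0]
        brand = text[start:end]
        rest = text[end:].lstrip()
        new_start = len(rest) + 1  # brand now follows rest plus one space
        return f"{rest} {brand}", [(new_start, new_start + len(brand), label)]

    # example in spaCy's offset format
    text, ents = move_brand("Nike Air Zoom running shoes", [(0, 4, "BRAND")])
    print(text, ents)  # Air Zoom running shoes Nike [(23, 27, 'BRAND')]

Applying this to a random subset of the brand-first titles should rebalance the position distribution without discarding any data.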

Related

BERT multilingual fine-tuning for multilabel classification

I'm trying to do multilabel classification of French email sentences, with categories such as commitment, proposition, meeting, request, subjective, etc.
The first problem I faced is that I don't have labeled sentences; rather, I have French emails as my dataset. Based on this, I found the BC3 dataset (English emails), which has sentences annotated with some of the labels listed above. So I came up with this approach: first fine-tune a multilingual BERT on this BC3 dataset on the multilabel classification task, and then do zero-shot transfer learning with the fine-tuned model (or simply use it for inference) on the sentences of my French emails. What do you think about this approach?
So I started by preprocessing the BC3 dataset and obtained 848 sentences, each of them with occurrence annotations for each category. In the image below, the last 5 columns represent the number of times each annotator labeled a sentence with a specific label.
Are those 848 samples enough to fine-tune a multilingual BERT model?
I tried to fine-tune by representing each category as in the image below.
With one epoch and BATCH_SIZE = 4, the loss didn't converge; rather, it oscillated between 0.79 and 0.34.
What advice would you give for solving this kind of problem?
Thanks.
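For reference, a minimal sketch of the fine-tuning setup described above, using Hugging Face Transformers; the toy batch and its targets are illustrative, and the label names are taken from the question:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    labels = ["commitment", "proposition", "meeting", "request", "subjective"]

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased",
        num_labels=len(labels),
        problem_type="multi_label_classification",  # uses BCEWithLogitsLoss
    )

    # one training step on a toy batch; targets are multi-hot float vectors
    batch = tokenizer(["Let's schedule a call next week."],
                      return_tensors="pt", padding=True, truncation=True)
    targets = torch.tensor([[0., 0., 1., 1., 0.]])  # meeting + request

    loss = model(**batch, labels=targets).loss
    loss.backward()  # plug into an optimizer loop with a small learning rate

Note that with BATCH_SIZE = 4 the per-step loss will naturally oscillate; tracking the average loss per epoch on a validation split, over several epochs, is more informative than a single epoch.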

Document classification using pretrained models like BERT

I am looking for methods to classify documents. For example, I have a bunch of documents with text and I want to label each document as belonging to sports, food, politics, etc.
Can I use BERT for this (for documents with more than 500 words), or are there other models that do this task efficiently?
BERT has a maximum sequence length of 512 tokens (note that this is usually much less than 500 words), so you cannot input a whole document to BERT at once. If you still want to use the model for this task, I would suggest that you
split up each document into chunks that are processable by BERT (e.g. 512 tokens or less)
classify all document chunks individually
classify the whole document according to the most frequently predicted label of the chunks, i.e. take a majority vote
In this case, the only modification you have to make is to add a fully connected layer on top of BERT.
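A minimal sketch of that chunk-and-vote approach, assuming a classifier already fine-tuned for the target labels (the checkpoint path is a placeholder):

    from collections import Counter

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "path/to/finetuned-doc-classifier")  # placeholder checkpoint

    def classify_document(text):
        # split the document into overlapping chunks of at most 512 tokens
        enc = tokenizer(text, truncation=True, max_length=512, stride=64,
                        return_overflowing_tokens=True,
                        padding="max_length", return_tensors="pt")
        with torch.no_grad():
            logits = model(input_ids=enc["input_ids"],
                           attention_mask=enc["attention_mask"]).logits
        chunk_preds = logits.argmax(dim=-1).tolist()  # one label per chunk
        return Counter(chunk_preds).most_common(1)[0][0]  # majority vote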
This approach might be quite expensive, though. An alternative is to represent the text documents as bag-of-words vectors and then train a classifier on the data. If you are not familiar with BOW, the Wikipedia entry on it is a good starting point. It can serve as a feature vector for all kinds of classifiers; I would suggest you try an SVM or kNN.
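And a minimal bag-of-words baseline in scikit-learn (TfidfVectorizer is a BOW representation with tf-idf weighting; the toy data is illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # toy data; a real training set needs many documents per class
    texts = ["the match went into extra time",
             "parliament passed the new bill",
             "the recipe needs two cups of flour"]
    labels = ["sports", "politics", "food"]

    clf = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
    clf.fit(texts, labels)
    print(clf.predict(["extra time decided the match"]))  # likely ['sports']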

Training Doc2vec with new data

I have a doc2vec model trained on documents with labels. I'm trying to continue training my model with model.train(). The new data comes with new labels as well, but, when I train it on more documents, the new labels aren't being recorded... Does anyone know what my problem might be?
Gensim's Doc2Vec only learns its set of tags at the same time it learns the corpus vocabulary of unique words – during the first call to .build_vocab() on the original corpus.
When you train with additional examples that have either words or tags that aren't already known to the model, those words or tags are simply ignored.
(The .build_vocab(…, update=True) option that's available on Word2Vec to expand its vocabulary has never been fully ported to Doc2Vec: it doesn't handle new tags, and it is affected by a longstanding crashing bug. So it's not supported on Doc2Vec.)
Note that if it is your aim to create document-vectors that assist in some downstream-classification task, you may not want to supply your known-labels as tags, or at least not as a document's only tag.
The tags you supply to Doc2Vec are the units for which it learns vectors. If you have a million text examples but only 5 different labels, and you feed those million examples into training each with only the label as a tag, the model is only learning 5 doc-vectors. It is, essentially, like you're training on only 5 mega-documents (passed in, in chunks) – and thus 'summarizing' each label down to a single point in vector-space, when it might be far more useful to think of a label as covering an irregularly-shaped "point cloud".
So, you might instead want to use document-IDs rather than labels. (Or, labels and document-IDs.) Then, use the many varied vectors from all individual documents – rather than single vectors per label – to train some downstream classifier or clusterer.
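A minimal sketch of that doc-ID-as-tag pattern with Gensim (4.x API), followed by a downstream classifier trained on the per-document vectors; the toy corpus is illustrative:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.linear_model import LogisticRegression

    docs = [("doc_0", "fast red running shoes", "sports"),
            ("doc_1", "slow cooker beef stew recipe", "food")]

    corpus = [TaggedDocument(words=text.split(), tags=[doc_id])
              for doc_id, text, _ in docs]

    model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
    model.build_vocab(corpus)  # tags are fixed from this point on
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

    # the downstream classifier sees one vector per document, not per label
    X = [model.dv[doc_id] for doc_id, _, _ in docs]
    y = [label for _, _, label in docs]
    clf = LogisticRegression().fit(X, y)

    # new documents need no Doc2Vec retraining: infer a vector, then classify
    vec = model.infer_vector("grilled chicken salad recipe".split())
    print(clf.predict([vec]))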
And in that case, the arrival of documents with new labels might not require a full Doc2Vec-retraining. Instead, if the new documents still get useful vectors from inference on the older Doc2Vec model, those per-doc vectors may reflect enough about the new label's documents that downstream classifiers can learn to recognize them.
Ultimately, though, if you acquire much more training data, reflecting new vocabulary & word-senses, the safest approach is to retrain the Doc2Vec model from scratch, using all data. Simple incremental training, even if it had official support, risks pulling the words/tags that appear in new data arbitrarily out of comparable alignment with words/tags that were only trained in the original dataset. It is the interleaved co-training, alongside all other examples equally, which pushes-and-pulls all vectors in a model into useful relative arrangements.

How to use a CNN model to classify objects detected by YOLO

Let me start by saying that I have 2 pre-trained models (in HDF5 files):
The first model is a YOLO-based model, trained on dataset A, which is used to locate humans in images (note that a training image for this model may contain many people).
The second model is a CNN model which is used to detect the gender of a person (male or female) from an image that contains only one person.
Suppose that I only want to use these 2 models and do not want to re-train or modify anything in the dataset. How could I locate the female persons in a picture from dataset A?
A possible solution that I think could work:
First use the first model for detection, that is, to create bounding boxes around the persons in the image.
Crop each bounding box into its own image, and feed those images to the second model to see whether the person is female or male.
However, this solution is slow. Is there any way to speed it up, or to perform this task differently?
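One common speedup is to batch all the crops and run the gender CNN once per image instead of once per person. A minimal sketch, assuming Keras models in the HDF5 files and a hypothetical detect_people() helper, since decoding YOLO's raw output into boxes depends on the specific variant:

    import numpy as np
    import cv2
    from tensorflow.keras.models import load_model

    yolo = load_model("yolo_person.h5")        # placeholder file names
    gender_cnn = load_model("gender_cnn.h5")

    def detect_people(image):
        """Hypothetical helper: run `yolo` and decode its raw output
        into (x, y, w, h) boxes -- decoding is YOLO-variant-specific."""
        raise NotImplementedError

    def genders_in_image(image, input_size=(64, 64)):
        boxes = detect_people(image)
        if not boxes:
            return []
        # stack every crop into ONE tensor so the gender CNN runs a
        # single forward pass instead of one predict() call per person
        crops = np.stack([cv2.resize(image[y:y + h, x:x + w], input_size)
                          for (x, y, w, h) in boxes]).astype("float32") / 255.0
        probs = gender_cnn.predict(crops)
        # assumes a single sigmoid output where 1 = female
        return ["female" if p > 0.5 else "male" for p in probs[:, 0]]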

Assign document to a category using document similarity

I'm developing an NLP project in Python.
I'm getting "conversations" from social networks. A conversation is made up of post_text + comment_text + reply_text (with comment_text and reply_text optional).
I also have a list of categories ("arguments"), and I want to connect each conversation to an argument (or get a weight for each argument).
For each category, I get its summary from Wikipedia, using the wikipedia Python package. So these summaries represent my training documents (right?).
Now, I've written down some steps to follow, but maybe I'm wrong.
Each training document must be transformed into a vector space model. I have to remove stopwords and common words, which gives me the vocabulary.
Each conversation must be transformed into a vector space model too, with each token assigned its vocabulary index. I can store all the vector space models in a matrix.
Now I have to apply tf-idf (for example) to all matrix rows.
For tf-idf, do I have to calculate tf and idf and then normalize the matrix?
Each row then represents the tf-idf vector of one conversation. Now I have to compute cosine similarity (for example) between each conversation and one training document, and iterate to get the similarity between every conversation and every training document.
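A minimal sketch of those steps with scikit-learn, where TfidfVectorizer handles the stopword removal, vocabulary indexing, and (by default) the L2 normalization asked about above; the toy data is illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    summaries = ["football is a team sport played between two teams ...",
                 "politics is the set of activities associated with ..."]
    categories = ["sport", "politics"]
    conversations = ["did you watch the football match last night?"]

    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(summaries)    # training documents
    conv_matrix = vectorizer.transform(conversations)   # same vocabulary index

    sims = cosine_similarity(conv_matrix, doc_matrix)   # (n_conv, n_categories)
    for conv, row in zip(conversations, sims):
        print(conv, "->", categories[row.argmax()], row.max())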
What do you think about these steps? Is there any guide/how-to/book I should read to understand this problem better?
Instead of getting summaries from Wikipedia and matching by similarity, you can train a classifier that, given a summary, predicts the document category. You can start with the simplest bag-of-words representation of the Wikipedia summaries for classification, then analyse the results and accuracy. After that you can move on to a more sophisticated approach such as word2vec or doc2vec for the word representation, and then train a classifier.
After building the classification model, you assign a category to a test document by classifying it with that model.
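A minimal sketch of that classifier route, assuming one Wikipedia summary per category; a single example per class is very little data, so treat this purely as a starting point:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    summaries = ["football is a team sport played between two teams ...",
                 "politics is the set of activities associated with ..."]
    categories = ["sport", "politics"]

    clf = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
    clf.fit(summaries, categories)

    print(clf.predict(["did you watch the football match last night?"]))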
