I am looking for methods to classify documents. For example, I have a bunch of documents containing text, and I want to label each document according to whether it belongs to sports, food, politics, etc.
Can I use BERT for this (for documents with more than 500 words), or are there other models that do this task efficiently?
BERT has a maximum sequence length of 512 tokens (note that this usually corresponds to much fewer than 500 words, since words are often split into several subword tokens), so you cannot input a whole document to BERT at once. If you still want to use the model for this task, I would suggest that you:
split up each document into chunks that are processable by BERT (e.g. 512 tokens or less)
classify all document chunks individually
classify the whole document according to the most frequently predicted label of the chunks, i.e. take a majority vote
In this case, the only modification you have to make is to add a fully connected layer on top of BERT.
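For illustration, here is a minimal sketch of that chunk-and-vote idea using the Hugging Face transformers library; the checkpoint name is a placeholder for a BERT model you have already fine-tuned on your labels, and the chunking here is done crudely by word count:

from collections import Counter
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "my-finetuned-bert" is a placeholder for your own fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("my-finetuned-bert")
model.eval()

def classify_document(text, words_per_chunk=200):
    words = text.split()
    chunks = [" ".join(words[i:i + words_per_chunk])
              for i in range(0, len(words), words_per_chunk)]
    votes = []
    for chunk in chunks:
        inputs = tokenizer(chunk, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        votes.append(int(logits.argmax(dim=-1)))
    # the document label is the majority vote over the chunk predictions
    return Counter(votes).most_common(1)[0][0]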
This approach might be quite expensive, though. An alternative is to represent the text documents as bag-of-words vectors and then train a classifier on that data. If you are not familiar with BOW, the Wikipedia entry on it is a good starting point. It can serve as a feature vector for all kinds of classifiers; I would suggest you try an SVM or kNN.
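As a rough sketch of that alternative (with toy placeholder data; in practice use your own labelled documents), a scikit-learn pipeline of a count vectorizer and a linear SVM could look like this:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# toy placeholder data; replace with your labelled documents
train_texts = ["the match ended two to one",
               "the soup needs more salt",
               "the senate passed the bill"]
train_labels = ["sports", "food", "politics"]

clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)
predicted = clf.predict(["parliament debated the new law"])  # predicted label for a new document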
I have a doc2vec model trained on documents with labels. I'm trying to continue training my model with model.train(). The new data comes with new labels as well, but when I train it on more documents, the new labels aren't being recorded... Does anyone know what my problem might be?
Gensim's Doc2Vec only learns its set of tags at the same time it learns the corpus vocabulary of unique words – during the first call to .build_vocab() on the original corpus.
When you train with additional examples that have either words or tags that aren't already known to the model, those words or tags are simply ignored.
(The .build_vocab(…, update=True) option that Word2Vec offers for expanding its vocabulary has never been fully carried over to Doc2Vec: it doesn't handle new tags, and it has a longstanding crashing bug. So it's not supported on Doc2Vec.)
Note that if it is your aim to create document-vectors that assist in some downstream-classification task, you may not want to supply your known-labels as tags, or at least not as a document's only tag.
The tags you supply to Doc2Vec are the units for which it learns vectors. If you have a million text examples but only 5 different labels, and you feed those million examples into training with only the label as each one's tag, the model is only learning 5 doc-vectors. It is, essentially, like you're training on only 5 mega-documents (passed in as chunks) – and thus 'summarizing' each label down to a single point in vector-space, when it might be far more useful to think of a label as covering an irregularly-shaped "point cloud".
So, you might instead want to use document-IDs rather than labels. (Or, labels and document-IDs.) Then, use the many varied vectors from all individual documents – rather than single vectors per label – to train some downstream classifier or clusterer.
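A minimal sketch of that setup (gensim 4 API with toy placeholder data; in older gensim versions model.dv is model.docvecs):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# toy placeholder data: tokenized documents plus their known labels
texts = [["game", "ended", "in", "a", "draw"],
         ["the", "soup", "was", "too", "salty"]]
labels = ["sports", "food"]

# one tag per document (its ID), so the model learns a vector per document
corpus = [TaggedDocument(words=tokens, tags=[str(i)]) for i, tokens in enumerate(texts)]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# the per-document vectors then feed a downstream classifier
X = [model.dv[str(i)] for i in range(len(texts))]
clf = LogisticRegression(max_iter=1000).fit(X, labels)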
And in that case, the arrival of documents with new labels might not require a full Doc2Vec-retraining. Instead, if the new documents still get useful vectors from inference on the older Doc2Vec model, those per-doc vectors may reflect enough about the new label's documents that downstream classifiers can learn to recognize them.
Ultimately, though, if you acquire much more training data, reflecting new vocabulary & word-senses, the safest approach is to retrain a Doc2Vec model from scratch, using all data. Simple incremental training, even if it had official support, risks pulling those words/tags that appear in new data arbitrarily out of comparable alignment with words/tags that were only trained in the original dataset. It is the interleaved co-training, alongside all other examples equally, which pushes-and-pulls all vectors in a model into useful relative arrangements.
So I am doing a project on document similarity, and right now my only features are the embeddings from Doc2Vec. Since that is not showing good results, even after hyperparameter optimization and using word embeddings before the doc embeddings... What other features can I add to get better results?
My dataset is 150 documents, 500-700 words each, with 10 topics (labels), each document having one topic. Documents are labeled on a document level, and that labeling is currently used only for evaluation purposes.
Edit: The following answers gojomo's questions and elaborates on my comment on his answer:
The evaluation of the model is done on the training set. I am checking whether the label of a document is the same as that of the most similar document from the model. For this I first get the document vector using the model's infer_vector method and then use most_similar to get the most similar document. The current results I am getting are 40-50% accuracy. A satisfactory score would be at least 65%.
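Roughly, the evaluation loop looks like this (variable names are placeholders: model is the trained Doc2Vec model, tokenized_docs maps a document tag to its tokens, and labels maps a tag to its topic; gensim 4 API assumed):

correct = 0
for doc_id, tokens in tokenized_docs.items():
    vec = model.infer_vector(tokens)
    nearest_id, _ = model.dv.most_similar([vec], topn=1)[0]  # most similar training document
    if labels[nearest_id] == labels[doc_id]:
        correct += 1
accuracy = correct / len(tokenized_docs)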
Due to the purpose of this research and its further use case, I am unable to get a larger dataset. That is why a professor recommended (as this is a university project) that I add some additional features to the Doc2Vec document embeddings. As I had no idea what he meant, I am asking the Stack Overflow community.
The end goal of the model is to cluster the documents, with the labels again being used only for evaluation purposes.
If I don't get good results with this model, I will try out the simpler ones mentioned by @Adnan S and @gojomo, such as TF-IDF, Word Mover's Distance, and bag of words; I just presumed I would get better results using Doc2Vec.
You should try creating TF-IDF vectors with 2- and 3-grams to generate a vector representation for each document. You will have to fit the vocabulary on all 150 documents. Once you have the TF-IDF vector for each document, you can use cosine similarity between any two of them.
Here is a blog article with more details, and the documentation page for sklearn.
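A minimal sketch with scikit-learn (the texts below are placeholders; in practice documents would be your 150 raw documents):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["first document text ...", "second document text ..."]  # your 150 raw documents
vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
tfidf = vectorizer.fit_transform(documents)      # sparse matrix, one row per document
similarities = cosine_similarity(tfidf)          # (n_docs, n_docs) pairwise cosine similarities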
How are you evaluating the results as not good, and how will you know when your results are adequate/good?
Note that just 150 docs of 400-700 words each is a tiny, tiny dataset: typical datasets used in published Doc2Vec results include tens of thousands to millions of documents, of hundreds to thousands of words each.
It will be hard for any of the Word2Vec/Doc2Vec/etc-style algorithms to do much with so little data. (The gensim Doc2Vec implementation includes a similar toy dataset, of 300 docs of 200-300 words each, as part of its unit-testing framework, and to eke out even vaguely useful results, it must up the number of training epochs, and shrink the vector size, significantly.)
So if intending to use Doc2Vec-like algorithms, your top priority should be finding more training data. Even if, in the end, only ~150 docs are significant, collecting more documents that use similar domain language can help improve the model.
It's unclear what you mean when you say there are 10 topics, and 1 topic per document. Are those human-assigned categories, and are those included as part of the training texts or tags passed to the Doc2Vec algorithm? (It might be reasonable to include it, depending on what your end-goals and document-similarity evaluations consist of.)
Are these topics the same as the labelling you also mention, and are you ultimately trying to predict the topics, or just using the topics as a check of the similarity-results?
As @adnan-s suggests in the other answer, it may also be worth trying simpler count-based 'bag of words' document representations, potentially over word n-grams or even character n-grams, or with TF-IDF weighting.
If you have adequate word-vectors, as trained from your data or from other compatible sources, the "Word Mover's Distance" measure can be another interesting way to compute pairwise similarities. (However, it may be too expensive to calculate between many-hundred-word texts - working much faster on shorter texts.)
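For example, a hedged sketch with gensim's wmdistance (the pretrained vectors named here are just one option, and recent gensim versions need the POT package installed for WMD):

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # any compatible word vectors work
doc1_tokens = ["the", "president", "greets", "the", "press"]
doc2_tokens = ["obama", "speaks", "to", "the", "media"]
distance = wv.wmdistance(doc1_tokens, doc2_tokens)   # lower distance = more similar texts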
As others have already suggested, your training set of 150 documents probably isn't big enough to create good representations. You could, however, try to use a pre-trained model and infer the vectors of your documents.
Here is a link where you can download a (1.4 GB) DBOW model trained on English Wikipedia pages, with 300-dimensional document vectors. I obtained the link from the jhlau/doc2vec GitHub repository. After downloading the model you can use it as follows:
from gensim.models import Doc2Vec
# load the downloaded model
model_path = "enwiki_dbow/doc2vec.bin"
model = Doc2Vec.load(model_path)
# infer vector for your document
doc_vector = model.infer_vector(doc_words)
Where doc_words is a list of words in your document.
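For example, if you don't already have a tokenizer, doc_words could be produced with gensim's simple_preprocess (this preprocessing choice is just an example, not something the pre-trained model requires):

from gensim.utils import simple_preprocess

doc_words = simple_preprocess("Your raw document text goes here.")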
This, however, may not work well if your documents are very domain-specific. But you can still give it a try.
Good day, fellow humans (?).
I have a methodological question, muddled by a lot of research squeezed into a tiny amount of time.
The question arises from the following problem(s): I need to apply semi-supervised or unsupervised clustering to documents. I have ~300 documents with multi-label classifications and approximately 3,400 unclassified documents. The number of unlabeled documents could grow to ~10,000 in the coming days.
The main idea is to apply semi-supervised clustering based on the labels at hand. Alternatively, to go fully unsupervised for soft clustering.
We thought of creating embeddings for the whole documents, but here lies the confusion: which library is the best for such a task?
I guess the utmost importance needs to lie in the context of the whole document. As far as I know, BERT and FastText provide context-dependent word embeddings, but not whole-document embeddings. On the other hand, Gensim's Doc2Vec is context-agnostic, right?
I think I saw a way to train sentence embeddings with BERT, via the HuggingFace API, and was wondering whether it could be useful to consider the whole document as a single sentence.
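For instance, I imagine something like the following with the sentence-transformers package (the model name is just an example, and long documents would be truncated to the model's maximum sequence length):

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
documents = ["first document text ...", "second document text ..."]  # placeholder documents
doc_embeddings = encoder.encode(documents)   # one dense vector per document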
Do you have any suggestion? I'm probably exposing my utter ignorance and confusion on the matter, but my brain is melted.
Thank you very much for your time.
Viva!
Edit to answer @gojomo:
My documents are on average ~180 words. The original task was that of multi-label text classification, i.e. each document can have from 1 to N labels, with the number of labels now being N=18. They are highly imbalanced.
Having only 330 labeled documents so far due to several issues, we asked the documents' provider to also give us unlabeled data, which should reach the order of 10k.
I used FastText's classification mode, but the result is obviously atrocious. I also ran a k-NN with Doc2Vec document embeddings, but the result is obviously still atrocious.
I was going to use biomedical BERT-based models (like BioBERT and SciBERT), trained on domain-specific datasets, to produce NER tagging on the documents and later apply a classifier.
Now that we have unlabeled documents at our disposal, we wanted to venture into semi-supervised classification or unsupervised clustering, just to explore possibilities. I have to say that this is just a master's thesis.
I am in a situation where I don't have any pre-trained word embeddings for my domain (Vietnamese food reviews), so I thought of building embeddings from both a general and a domain-specific corpus.
The point here is: can I use my training, test, and validation datasets (already preprocessed) as a source for creating my own word embeddings? If not, I hope you can share your experience.
Based on my intuition and some experiments, a wider corpus appears to be better, but I'd like to know if there is relevant research or other relevant results.
can I use my training, test, and validation datasets (already preprocessed) as a source for creating my own word embeddings
Sure. Embeddings are not your features for your machine learning model; they are the "computational representation" of your data. In short, they are words represented in a vector space. With embeddings, your data is less sparse. Using word embeddings can be considered part of the pre-processing step of NLP.
Usually (that is, with the most widely used technique, word2vec), the representation of a word in the vector space is defined by its surroundings (the words it commonly co-occurs with).
Therefore, to create embeddings, the larger the corpus the better, since a larger corpus can better place a word vector in the vector space (and hence compare it to other, similar words).
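As a minimal sketch of training your own embeddings with gensim (the corpus below is a toy placeholder; in practice, use your tokenized reviews):

from gensim.models import Word2Vec

# toy placeholder corpus: each review as a list of tokens
sentences = [["the", "pho", "was", "delicious"],
             ["the", "banh", "mi", "was", "too", "salty"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=20)
vector = model.wv["pho"]   # embedding for one word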
I'm developing an NLP project in Python.
I'm collecting "conversations" from social networks. A conversation is made up of post_text + comment_text + reply_text (with comment_text and reply_text being optional).
I also have a list of categories (arguments), and I want to "connect" each conversation to an argument (or get a weight for each argument).
For each category, I get its summary from Wikipedia using the wikipedia Python package. So these summaries represent my training documents (right?).
Now, I've written down some steps to follow, but maybe I'm wrong.
Each training document must be transformed into a vector space representation. I have to remove stopwords and common words, which leaves me with a vocabulary list.
Each conversation must be transformed into the same vector space, with each token assigned to its vocabulary index. I can store all these vector representations in a matrix.
Now I have to apply TF-IDF (for example) to all matrix rows. For TF-IDF, do I have to calculate tf and idf and then normalize the matrix?
So each row represents the TF-IDF vector of a conversation. Now I have to compute cosine similarity (for example) to get the similarity between each conversation and one training document, and iterate to get the similarity between the conversations and every training document.
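To make the steps concrete, here is a rough sketch of what I have in mind with scikit-learn (the texts are placeholders for my Wikipedia summaries and conversations):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

category_summaries = ["summary of category one ...", "summary of category two ..."]  # Wikipedia summaries
conversations = ["post text plus comment text plus reply text ..."]                  # my conversations

vectorizer = TfidfVectorizer(stop_words="english")
summary_matrix = vectorizer.fit_transform(category_summaries)    # vocabulary and idf from training docs
conversation_matrix = vectorizer.transform(conversations)        # same vocabulary indices
scores = cosine_similarity(conversation_matrix, summary_matrix)  # (n_conversations, n_categories)
best_category = scores.argmax(axis=1)                            # most similar category per conversation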
What do you think about these steps? Is there any guide/how-to/book I should read to understand this problem better?
Instead of getting the summary from Wikipedia and matching by similarity, you can train a classifier that, given a summary, predicts which category a document belongs to. You can start with the simplest bag-of-words representation of the Wikipedia summaries for classification, then analyse the results and accuracy. After that, you can move on to more sophisticated approaches such as word2vec or doc2vec representations, and then train a classifier on those.
After building the classification model, you assign a category to each test document by classifying it with that model.
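A hedged sketch of that idea (placeholder data; the Wikipedia summaries act as the training documents and the conversations as the documents to classify):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# placeholder data: one Wikipedia summary per category
category_summaries = ["summary of the sports category ...", "summary of the food category ..."]
category_names = ["sports", "food"]

clf = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
clf.fit(category_summaries, category_names)                           # bag-of-words classifier
predicted = clf.predict(["a conversation about a football match"])    # predicted category per conversation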