I have some training sentences that are generally warnings. My goal is to predict whether an incoming sentence is a warning message or not. I have gone through Sentiment Analysis Using Doc2Vec, but as far as I understand it does not consider a newly arriving sentence when predicting whether it is positive or negative.
In my experience, the output vector that gensim's Doc2Vec produces for each sentence depends on the other sentences as well, which means we cannot directly use the model to generate a vector for a newly arriving sentence. Can anyone help me with this? Thanks.
One way to generate new vectors is the infer_vector() function, which produces a vector for an unseen sentence based on a trained model. Since the model is frozen during inference, the new vector is computed from the trained model without changing the existing sentence vectors.
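For illustration, here is a minimal sketch of training a small Doc2Vec model and then inferring a vector for an unseen sentence; the toy corpus, parameters, and sentences below are made up for the example.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy training corpus: each sentence becomes a TaggedDocument with a unique tag.
train_sentences = [
    "warning disk space is running low",
    "error connection to server lost",
    "meeting scheduled for tomorrow at noon",
]
corpus = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(train_sentences)]

# Train a small Doc2Vec model (hyperparameters are only illustrative).
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for a new, unseen sentence; the trained model is not modified.
new_vector = model.infer_vector("warning cpu temperature is too high".split())
print(new_vector.shape)  # (50,)

The inferred vector can then be fed to any classifier (e.g. logistic regression) trained on the vectors of your labelled warning / non-warning sentences.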
I am doing multi-label text classification using a pre-trained BERT model. Here is an example of the prediction made for one sentence:
[screenshot: predicted probability per label]
I want to get the words in the sentence on which the prediction was based, like this: [screenshot: relevant words highlighted in the sentence]
If anyone has any idea, please enlighten me.
Multi-Label Text Classification (first image) and Token Classification (second image) are two different tasks, and the model needs to be specifically trained for each of them.
The first one returns a probability for each label considering the entire sentence. The second returns such predictions for each individual word in the sentence, usually while considering the rest of the sentence as context.
So you cannot really take the output of a Text Classifier and use it for Token Classification, because the information you get is not detailed enough.
What you can and should do is train a Token Classification model, although you obviously will need token-level-annotated data to do so.
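As a rough illustration of the difference between the two tasks, the sketch below calls both pipeline types from the Transformers library; the model names are just publicly available examples standing in for models fine-tuned on your own data.

from transformers import pipeline

# Token classification: one prediction per token (an off-the-shelf NER model
# is used here only as a stand-in for a custom token-level task).
token_clf = pipeline("token-classification", model="dslim/bert-base-NER")
print(token_clf("Send the warning report to Alice in Berlin."))

# Text classification: one set of label scores for the whole sentence.
text_clf = pipeline("text-classification",
                    model="distilbert-base-uncased-finetuned-sst-2-english")
print(text_clf("Send the warning report to Alice in Berlin.", top_k=None))  # scores for all labels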
I would like to do some supervised binary classification tasks with sentences, and have been using spaCy because of its ease of use. I used spaCy to convert the text into vectors, and then fed the vectors to a machine learning model (e.g. XGBoost) to perform the classification. However, the results have not been very satisfactory.
In spaCy, it is easy to load a model (e.g. BERT / Roberta / XLNet) to convert words / sentences into nlp objects. However, directly calling the vector of the object will default to an average of the token vectors.
Here are two questions:
1) Can we do better than simply getting the average of token vectors, like having context/order-aware sentence vectors using spaCy? For example, can we extract the sentence embedding from the previous layer of the BERT transformer instead of the final token vectors in spaCy?
2) Would it be better to directly use spaCy to train the downstream binary classification task? For example, here discusses how to add a text classifier to a spaCy model. Or is it generally better to apply more powerful machine learning models like XGBoost?
Thanks in advance!
I found this being discussed on the page below; maybe it helps.
"Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much."
https://github.com/huggingface/transformers/issues/1950
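As a sketch of that suggestion, one way to get a context-aware sentence vector outside of spaCy is to take the hidden state at the [CLS] position of the last layer from a Hugging Face model; the model name and the [CLS] pooling choice below are just one common option.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The results have not been very satisfactory."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, seq_len, hidden); position 0 is [CLS].
cls_vector = outputs.last_hidden_state[:, 0, :]
print(cls_vector.shape)  # torch.Size([1, 768])

Such vectors can then be fed to XGBoost (or any other classifier) in place of spaCy's averaged token vectors.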
Referring to the documentation of the awesome Transformers library from Huggingface, I came across the add_tokens function.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Add new whole-word tokens and resize the embedding matrix accordingly.
num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
model.resize_token_embeddings(len(tokenizer))
I tried the above by adding previously absent words to the default vocabulary. However, keeping all else constant, I noticed a decrease in accuracy of the fine-tuned classifier making use of this updated tokenizer. I was able to replicate similar behavior even when just 10% of the previously absent words were added.
My questions
Am I missing something?
Instead of whole words, is the add_tokens function expecting masked tokens, for example : '##ah', '##red', '##ik', '##si', etc.? If yes, is there a procedure to generate such masked tokens?
Any help would be appreciated.
Thanks in advance.
If you add tokens to the tokenizer, you indeed make the tokenizer tokenize the text differently, but this is not the tokenization BERT was trained with, so you are basically adding noise to the input. The word embeddings are not trained and the rest of the network never saw them in context. You would need a lot of data to teach BERT to deal with the newly added words.
There are also ways to compute an embedding for a single new word so that it does not hurt BERT, as in this paper, but it seems pretty complicated and should not make much of a difference.
BERT uses a word-piece-based vocabulary, so it should not really matter if the words are present in the vocabulary as a single token or get split into multiple wordpieces. The model probably saw the split word during pre-training and will know what to do with it.
Regarding the ##-prefixed tokens, those are tokens that can only appear as a continuation (suffix) of another wordpiece. E.g., walrus gets split into ['wal', '##rus'], and you need both of these wordpieces to be in the vocabulary, but not ##wal or rus.
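To make the wordpiece behaviour concrete, here is a small sketch of how the tokenizer splits an out-of-vocabulary word before and after add_tokens; the split follows the walrus example above and may differ slightly between tokenizer versions.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A word absent from the vocabulary is split into wordpieces,
# e.g. 'walrus' -> ['wal', '##rus'] as described above.
print(tokenizer.tokenize("walrus"))

# After adding it as a whole word, it is kept as a single token,
# but its embedding starts out untrained (see the caveats above).
tokenizer.add_tokens(["walrus"])
print(tokenizer.tokenize("walrus"))  # ['walrus']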
I have been clustering a certain corpus, grouping sentences together by computing their tf-idf and keeping pairs whose similarity weights from the gensim model exceed a certain threshold value.
# DocSim is a custom helper class used in this project (not a gensim API):
# get_tf_idf() builds the tf-idf weight dictionary over the corpus, and
# calculate_similarity() scores each target document against the source document.
tfidf_dic = DocSim.get_tf_idf()
ds = DocSim(model, stopwords=stopwords, tfidf_dict=tfidf_dic)
sim_scores = ds.calculate_similarity(source_doc, target_docs)
The problem is that despite putting high threshold values, sentences of similar topics but opposite polarities get clustered together as such:
Here is an example of the similarity weights obtained between "don't like it" & "i like it"
Are there any other methods, libraries or alternative models that can differentiate the polarities effectively by assigning them very low similarities or opposite vectors?
This is so that the outputs "i like it" and "don't like it" end up in separate clusters.
PS: Pardon me if there are any conceptual errors as I am rather new to NLP. Thank you in advance!
The problem is in how you represent your documents. Tf-idf is good for representing long documents where keywords play a more important role. Here, it is probably the idf part of tf-idf that disregards the polarity because negative particles like "no" or "not" will appear in most documents and they will always receive a low weight.
I would recommend trying some neural embeddings that might capture the polarity. If you want to keep using Gensim, you can try doc2vec but you would need quite a lot of training data for that. If you don't have much data to estimate the representation, I would use some pre-trained embeddings.
Even averaging word embeddings can work reasonably well (you can load FastText embeddings in Gensim). Alternatively, if you want a stronger model, you can try BERT or another large pre-trained model from the Transformers package.
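As a rough sketch of the averaged-embeddings idea, the snippet below loads pre-trained FastText vectors through gensim's downloader (the vector set name is just one available option, and the download is large) and compares two sentences by the cosine of their averaged word vectors. This alone still won't fully separate negated sentences, but it captures more than tf-idf does.

import numpy as np
import gensim.downloader as api

# Pre-trained FastText subword vectors (example choice from gensim-data).
wv = api.load("fasttext-wiki-news-subwords-300")

def sentence_vector(sentence):
    # Average the vectors of all in-vocabulary tokens.
    words = [w for w in sentence.lower().split() if w in wv]
    if not words:
        return np.zeros(wv.vector_size)
    return np.mean([wv[w] for w in words], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(sentence_vector("i like it"), sentence_vector("don't like it")))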
Unfortunately, simple text representations based merely on the sets-of-words don't distinguish such grammar-driven reversals-of-meaning very well.
The method needs to be sensitive to meaningful phrases, and the hierarchical, grammar-driven inter-word dependencies, to model that.
Deeper neural networks using convolutional/recurrent techniques do better, or methods which tree-model sentence-structure.
For ideas see for example...
"Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank"
...or a more recent summary presentation...
"Representations for Language: From Word Embeddings to Sentence Meanings"
Autoencoders can be used to reduce dimensionality in feature vectors, as far as I understand. In text classification a feature vector is normally constructed via a dictionary, which tends to be extremely large. I have no experience in using autoencoders, so my questions are:
Could autoencoders be used to reduce dimensionality in text classification? (Why? / Why not?)
Has anyone already done this? A source would be nice, if so.
Existing works use autoencoders to create models at the sentence level. Basically, after training an autoencoder, you can get a vector for each sentence. Since a document consists of sentences, you can get a set of vectors for the document and then do document classification. In my experience with various vector representations (e.g. those generated by autoencoders), doing so can give worse results than classification with bag-of-words.
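For reference, here is a minimal, toy-sized sketch of the idea with scikit-learn and Keras; the dataset, layer sizes, and classifier choice are placeholders. A tf-idf bag-of-words vector is compressed by a small autoencoder and the bottleneck codes are then used as the reduced features for classification.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from tensorflow import keras

texts = ["warning: disk almost full", "system running normally",
         "warning: high memory usage", "daily report generated"]
labels = np.array([1, 0, 1, 0])

# High-dimensional sparse bag-of-words / tf-idf features.
X = TfidfVectorizer().fit_transform(texts).toarray()

# Simple autoencoder: input -> small bottleneck -> reconstruction of the input.
input_dim, code_dim = X.shape[1], 8
inputs = keras.Input(shape=(input_dim,))
code = keras.layers.Dense(code_dim, activation="relu")(inputs)
outputs = keras.layers.Dense(input_dim, activation="linear")(code)
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=50, verbose=0)

# Use the encoder's bottleneck output as the reduced feature vector.
encoder = keras.Model(inputs, code)
X_reduced = encoder.predict(X, verbose=0)

clf = LogisticRegression().fit(X_reduced, labels)
print(clf.predict(X_reduced))

Whether this beats plain bag-of-words depends heavily on the data, which matches the experience described above.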