I am able to generate similarity between two sentences loading spacy's core_lg model trained on Glove vectors.
Now, I want to update some domain specific sentences and assign same vectors to both sentences so that they are treated as the same.
So, how do I add vectors like these on top of the model I am using?
Is there any other way to approach this problem?
Related
I would like to do some supervised binary classification tasks with sentences, and have been using spaCy because of its ease of use. I used spaCy to convert the text into vectors, and then fed the vectors to a machine learning model (e.g. XGBoost) to perform the classfication. However, the results have not been very satisfactory.
In spaCy, it is easy to load a model (e.g. BERT / Roberta / XLNet) to convert words / sentences to nlp objects. Directly calling the vector of the object will however will default to an average of the token vectors.
Here are two questions:
1) Can we do better than simply getting the average of token vectors, like having context/order-aware sentence vectors using spaCy? For example, can we extract the sentence embedding from the previous layer of the BERT transformer instead of the final token vectors in spaCy?
2) Would it be better to directly use spaCy to train the downstream binary classification task? For example, here discusses how to add a text classifier to a spaCy model. Or is it generally better to apply more powerful machine learning models like XGBoost?
Thanks in advance!
I found this being discussed in the page below. Maybe it helps.
"Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much."
https://github.com/huggingface/transformers/issues/1950
I have a set of files for five different categories and most of them are not labelled correctly.Objective is to predict the correct category of the file whenever the same is uploaded.I used cosine similarity along with tf -idf to predict the class of the document with which cosine similarity is the maximum as of now i am getting good results but really not sure how well this will work down the road. Also why isnt cosine similarity used in building document classifiers instead of machine learning models when the categories of files are labelled correctly?Would really appreciate your feedback on my approach as well as your answer to the question.
Cosine similarity is used for calculating the angle between two n-dimensional vectors. These vectors are mostly produced by Embeddings. They are pretrained models which produce word embeddings or fixed size vectors.
Cosine similarity is mostly used with vectors produced by word
embeddings. If you are using something like Doc2Vec, then you get a
vector for the whole document. These vectors could be categorized by
using cosine similarity.
In your case, you should try a LSTM text classifier using Embedding layers. 1D Convolution layers can also be useful.
Also, referring to TF-IDF, it is useful for text classification which is dependent on certain words in the corpus. The words with higher term frequency and less document frequency have a higher TF-IDF score. The model learns to classify texts based on such scores.
In most cases, RNNs are the best to classify texts. The use of pretrained embeddings makes the model efficient.
Also, not the least, you can give Bayes text classification a try. It has been super useful in spam classification.
Tip:
You can implement the above methods with each other, creating a text classification system. Following the process like,
Generate embeddings from Doc2Vec.
Comparing the similarity of the input with other texts and thereby determine its class.
Using the embedding in a LSTM network to produce class probabilities.
Apply Bayes text classification.
The steps 2 , 3 , 4 give three predictions. If the majority prediction was CLASS1, then we can make the output of the system as CLASS1!.
I am trying to build a Fake news classifier and I am quite new in this field. I have a column "title_1_en" which has the title for fake news and another column called "title_2_en". There are 3 target labels; "agreed", "disagreed", and "unrelated" if the title of the news in column "title_2_en" agrees, disagrees or is unrelated to that in the first column.
I have tried calculating basic cosine similarity between the two titles after converting the words of the sentences into vectors. This has resulted in the the cosine similarity score but this needs a lot of improvement as synonyms and semantic relationship has not been considered at all.
def L2(vector):
norm_value = np.linalg.norm(vector)
return norm_value
def Cosine(fr1, fr2):
cos = np.dot(fr1, fr2)/(L2(fr1)*L2(fr2))
return cos
The most important thing here is how you convert the two sentences into vectors. There are multiple ways to do that and the most naive way is:
Convert each and every word into a vector - this can be done using standard pre-trained vectors such as word2vec or GloVe.
Now every sentence is just a bag of word vectors. This needs to be converted into a single vector, ie., mapping a full sentence text to a vector. There are many ways to do this too. For a start, just take the average of the bag of vectors in the sentence.
Compute cosine similarity between the two sentence vectors.
Spacy's similarity is a good place to start which does the averaging technique. From the docs:
By default, spaCy uses an average-of-vectors algorithm, using
pre-trained vectors if available (e.g. the en_core_web_lg model). If
not, the doc.tensor attribute is used, which is produced by the
tagger, parser and entity recognizer. This is how the en_core_web_sm
model provides similarities. Usually the .tensor-based similarities
will be more structural, while the word vector similarities will be
more topical. You can also customize the .similarity() method, to
provide your own similarity function, which can be trained using
supervised techniques.
so I wanna ask you which is the best tool used to prepare my text to deep learning?
What is the difference between Word2Vec, Glove, Keras, LSA...
You should use a pre-trained embedding to represent the sentence into a vector or a matrix. There are a lot of sources where you can find pre-trained embeddings that use different dataset (for instance all the Wikipedia) to train their models. These models can have different length, but normally each word is represented with 100 or 300 dimensions.
Pre-trained embeddings
Pre-trained embeddings 2
I want to implement doc2vec on various reviews which we extracted from a source.And I want to classify these reviews into different classes defined by the user. How can I do this?
I consider this as one of the interesting question. I will be giving you some approaches depending on size of observations/reviews.
You can apply LSA (SVD on DTM (either incidence or TF-IDF vectors) you will be getting three vectors as outputs -- USV. The V transpose is the sentence embedding).
Use this embeddings as input to your model for classification.
I recommend to use LSA when your corpus size is large.
Resources: link
In the similar way instead of using LSA, You can use pre trained embeddings say glove, here you will be getting word embeddings for creating document vectors use inverse weighted frequency method. Use this document vectors for classification.
Resources: link