Assign document to a category using document similarity - nlp

I'm developing an NLP project in Python.
I'm collecting "conversations" from social networks. A conversation is made up of post_text + comment_text + reply_text (with comment_text and reply_text optional).
I also have a list of categories (arguments), and I want to "connect" each conversation to an argument (or get a weight for each argument).
For each category, I fetch its summary from Wikipedia using the wikipedia Python package. So these summaries represent my training documents (right?).
Now, I've written down some steps to follow, but maybe I'm wrong.
Each training document must be transformed into the vector space model. I have to remove stopwords and common words, which gives me a vocabulary.
Each conversation must be transformed into the vector space model as well, with each token mapped to its vocabulary index. I can store all the resulting vectors in a matrix.
Now, I have to apply tf-idf (for example) to all matrix rows.
For tf-idf, do I have to calculate tf and idf and then normalize the matrix?
So, each row represents the tf-idf vector of one conversation. Now, I have to compute cosine similarity (for example) between each conversation and one training document, and iterate to get the similarity between every conversation and every training document.
What do you think about these steps? Is there any guide/how-to/book I should read to understand this problem better?
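For reference, here is a minimal sketch of the pipeline described in these steps, assuming scikit-learn (TfidfVectorizer and cosine_similarity are one possible choice of tools, not something given above), with made-up summaries and conversations:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical data: one Wikipedia summary per category and a few conversations
category_summaries = {
    "sport": "Sport includes all forms of competitive physical activity ...",
    "politics": "Politics is the set of activities associated with governance ...",
}
conversations = [
    "post_text comment_text reply_text about yesterday's match ...",
    "another conversation about the new election law ...",
]

# build the vocabulary and tf-idf weights from the training documents (the summaries)
vectorizer = TfidfVectorizer(stop_words="english")
summary_matrix = vectorizer.fit_transform(category_summaries.values())

# project the conversations into the same vector space
conversation_matrix = vectorizer.transform(conversations)

# similarities[i, j] = cosine similarity between conversation i and category j
similarities = cosine_similarity(conversation_matrix, summary_matrix)
for conversation, row in zip(conversations, similarities):
    best = max(zip(category_summaries.keys(), row), key=lambda pair: pair[1])
    print(conversation[:40], "->", best)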

Instead of getting summaries from Wikipedia and matching by similarity, you can train a classifier that, given a document, predicts which category it belongs to. You can start with the simplest bag-of-words representation of the Wikipedia summaries for classification, then analyse the results and accuracy. After that you can move on to more sophisticated approaches like word2vec or doc2vec representations and train a classifier on those.
Once the classification model is built, you assign a category to a test document by classifying it with that model.
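A rough sketch of that baseline, assuming scikit-learn (the bag-of-words vectorizer and the Naive Bayes classifier are my choices; any classifier would do), with made-up summaries:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical training data: Wikipedia summaries labelled with their category
summaries = [
    "Sport includes all forms of competitive physical activity ...",
    "Politics is the set of activities associated with governance ...",
]
labels = ["sport", "politics"]

# bag-of-words features + Naive Bayes classifier
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(summaries, labels)

# assign a category to a new conversation by classifying it
print(model.predict(["the match ended with a late goal ..."]))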

How to Prepare Training data for NLP Bag of words model?

I have a machine learning problem: I have a set of words, e.g., Diameter, Item Number, Phone Number, etc.
When the user gives the input "Dia", the model should predict the nearest word, "Diameter".
If the user gives the input "Part Number", the model should predict "Item Number".
How should I prepare training data for this? In this case, are the feature and the label the same? Any help? (Bag of words? Hashing?)
You don't need to train a machine learning model for this problem. Fuzzy matching is the way to go. It is based on a similarity measure (string distance) between two strings: you compute the distance between the input and every word in the vocabulary and select the closest one.
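A minimal sketch using the standard library's difflib (one possible string-similarity measure; a Levenshtein-based library would work just as well):

import difflib

vocabulary = ["Diameter", "Item Number", "Phone Number"]

def closest_word(query, vocab=vocabulary):
    # difflib ranks candidates by the ratio of matching character subsequences
    matches = difflib.get_close_matches(query, vocab, n=1, cutoff=0.0)
    return matches[0] if matches else None

print(closest_word("Dia"))          # -> Diameter
print(closest_word("Part Number"))  # -> the nearest vocabulary entry by that ratio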
You can also try to train an ML model (not recommended). In case the vocabulary is fixed, you can create the features based on the similarity distance to the vocabulary ([list of string metrics][1]). Otherwise, you can still create features based on letter counts.
The training set can be generated by augmenting the vocabulary: subsample part of the correct phrase, flip letters, remove some letters, add an extra letter, etc.
[1]: https://en.wikipedia.org/wiki/String_metric

Document classification using pretrained models like BERT

I am looking for methods to classify documents. For example, I have a bunch of text documents and I want to label each one according to whether it belongs to sports, food, politics, etc.
Can I use BERT for this (for documents with more than 500 words), or are there other models that do this task efficiently?
BERT has a maximum sequence length of 512 tokens (note that 512 tokens usually correspond to fewer than 500 words, since words are split into subword tokens), so you cannot input a whole document to BERT at once. If you still want to use the model for this task, I would suggest that you
split up each document into chunks that are processable by BERT (e.g. 512 tokens or less)
classify all document chunks individually
classify the whole document according to the most frequently predicted label of the chunks, i.e. take a majority vote
In this case, the only modification you have to make is to add a fully connected layer on top of BERT.
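A rough sketch of this chunk-and-vote idea, assuming a BERT model already fine-tuned on your topic labels is available through the transformers pipeline (the model name below is only a placeholder):

from collections import Counter
from transformers import pipeline

# placeholder: any sequence-classification model fine-tuned on your topic labels
classifier = pipeline("text-classification", model="your-finetuned-bert-topic-model")

def classify_document(text, words_per_chunk=300):
    words = text.split()
    chunks = [" ".join(words[i:i + words_per_chunk])
              for i in range(0, len(words), words_per_chunk)]
    # classify every chunk individually, truncating anything still longer than 512 tokens
    labels = [classifier(chunk, truncation=True)[0]["label"] for chunk in chunks]
    # majority vote over the chunk predictions
    return Counter(labels).most_common(1)[0][0]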
This approach might be quite expensive, though. An alternative is to represent the text documents as bag-of-words vectors and then train a classifier on the data. If you are not familiar with BOW, the Wikipedia entry on it is a good starting point. A BOW vector can serve as a feature vector for all kinds of classifiers; I would suggest you try SVM or kNN.
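A minimal sketch of that cheaper baseline with scikit-learn (the texts, labels, and the choice of a linear SVM are mine):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["the team won the final match", "the parliament passed a new budget"]
labels = ["sports", "politics"]

# bag-of-words features + linear SVM
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["elections will be held next year"]))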

Sentence similarity using word2vec

Basically what I want is to know how similar a specific sentence/document is to my training corpus.
I think I might have half an idea of how to approach this but I'm not too sure.
So my idea is to calculate an average vector for the document and then somehow compute the similarity from that. I just don't know how I would calculate the similarity.
So say I have a training corpus filled with text about dogs. If I then check how similar the sentence "The airplane has 100 seats." is to the training corpus, I would want a low similarity score.
This is a semantic textual similarity problem. You can have a look at state-of-the-art models here: https://nlpprogress.com/english/semantic_textual_similarity.html
Usually you would pass your document through an encoder to create a representation (an embedding of the document), then do the same with the sentence (usually using the same encoder). The vectors can be fed into further layers for additional processing. A similarity metric like cosine can then be applied to the two embeddings, or a joint final representation can be used for classification.
You can use pretrained language models in the encoding step and fine-tune them for your use case.
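A minimal sketch of the encoder-plus-cosine idea using the sentence-transformers library (the model name is just one common choice, and the toy corpus is made up):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# embed the training corpus (here, a toy corpus about dogs) and the query sentence
corpus = ["Dogs are loyal companions.", "Most dogs enjoy long walks and playing fetch."]
corpus_embeddings = model.encode(corpus)
query_embedding = model.encode("The airplane has 100 seats.")

# cosine similarity of the query against every corpus sentence; low values mean off-topic
scores = util.cos_sim(query_embedding, corpus_embeddings)
print(scores.max().item())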

How to train a model that will result in the similarity score between two news titles?

I am trying to build a fake news classifier and I am quite new to this field. I have a column "title_1_en" which has the title of the fake news and another column called "title_2_en". There are 3 target labels: "agreed", "disagreed", and "unrelated", depending on whether the title in column "title_2_en" agrees with, disagrees with, or is unrelated to the one in the first column.
I have tried calculating basic cosine similarity between the two titles after converting the words of the sentences into vectors. This produced a cosine similarity score, but it needs a lot of improvement, as synonyms and semantic relationships have not been considered at all.
import numpy as np

def L2(vector):
    # Euclidean (L2) norm of a vector
    return np.linalg.norm(vector)

def Cosine(fr1, fr2):
    # cosine similarity: dot product divided by the product of the norms
    return np.dot(fr1, fr2) / (L2(fr1) * L2(fr2))
The most important thing here is how you convert the two sentences into vectors. There are multiple ways to do that and the most naive way is:
Convert each and every word into a vector - this can be done using standard pre-trained vectors such as word2vec or GloVe.
Now every sentence is just a bag of word vectors. This needs to be converted into a single vector, i.e., mapping the full sentence text to one vector. There are many ways to do this too. For a start, just take the average of the word vectors in the sentence.
Compute cosine similarity between the two sentence vectors.
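A small sketch of that averaging approach, assuming gensim's downloadable GloVe vectors (any set of pre-trained word vectors would do):

import numpy as np
import gensim.downloader as api

# pre-trained 100-dimensional GloVe vectors (downloaded on first use)
word_vectors = api.load("glove-wiki-gigaword-100")

def sentence_vector(sentence):
    # average the vectors of the words that are in the vocabulary
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

v1 = sentence_vector("president signs new trade deal")
v2 = sentence_vector("leader approves trade agreement")
print(cosine(v1, v2))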
spaCy's similarity is a good place to start; it uses this averaging technique. From the docs:
By default, spaCy uses an average-of-vectors algorithm, using pre-trained vectors if available (e.g. the en_core_web_lg model). If not, the doc.tensor attribute is used, which is produced by the tagger, parser and entity recognizer. This is how the en_core_web_sm model provides similarities. Usually the .tensor-based similarities will be more structural, while the word vector similarities will be more topical. You can also customize the .similarity() method, to provide your own similarity function, which can be trained using supervised techniques.
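For instance, a quick usage sketch (assuming the en_core_web_lg model has been downloaded):

import spacy

# en_core_web_lg ships with pre-trained word vectors
nlp = spacy.load("en_core_web_lg")

doc1 = nlp("President signs new trade deal")
doc2 = nlp("Leader approves trade agreement")

# Doc.similarity() averages the word vectors and returns the cosine similarity
print(doc1.similarity(doc2))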

I want to classify some sentences on the basis of their semantic meaning. How can I use Doc2Vec for this? Or is there a better approach?

I want to apply doc2vec to various reviews that we extracted from a source, and I want to classify these reviews into different classes defined by the user. How can I do this?
I consider this an interesting question. I will give you some approaches depending on the size of your set of observations/reviews.
You can apply LSA: run SVD on the document-term matrix (built from either incidence or TF-IDF vectors). You get three matrices as output, U, S and V; the V transpose gives the sentence embeddings.
Use these embeddings as input to your classification model.
I recommend using LSA when your corpus is large.
Resources: link
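A short sketch of the LSA route with scikit-learn (TruncatedSVD on TF-IDF vectors is one common way to compute it; the reviews are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

reviews = ["great product, works as described",
           "terrible support, the item broke after a week",
           "fast shipping and solid build quality"]

# TF-IDF document-term matrix followed by truncated SVD = latent semantic analysis
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
embeddings = lsa.fit_transform(reviews)   # one dense vector per review

# these embeddings can now be fed to any classifier
print(embeddings.shape)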
In a similar way, instead of LSA you can use pre-trained embeddings such as GloVe: here you get word embeddings, and you create document vectors with the inverse weighted frequency method. Use these document vectors for classification.
Resources: link
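A rough sketch of the inverse-weighted-frequency idea (the weighting scheme below, a / (a + p(w)), follows smooth inverse frequency; treating the method this way is my interpretation of the answer):

import numpy as np
from collections import Counter
import gensim.downloader as api

word_vectors = api.load("glove-wiki-gigaword-100")   # pre-trained GloVe vectors

reviews = ["great product works as described",
           "terrible support the item broke after a week"]

# estimate word frequencies from the corpus itself
counts = Counter(w for r in reviews for w in r.lower().split())
total = sum(counts.values())

def doc_vector(text, a=1e-3):
    words = [w for w in text.lower().split() if w in word_vectors]
    # rarer words get a larger weight: a / (a + p(w))
    weights = [a / (a + counts[w] / total) for w in words]
    return np.average([word_vectors[w] for w in words], axis=0, weights=weights)

doc_embeddings = [doc_vector(r) for r in reviews]
print(doc_embeddings[0][:5])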
