SBERT's Interpretability: can we better understand the final results?

I'm familiar with SBERT and its pre-trained models, and they are amazing! But at the same time, I want to understand how the results are calculated, and I can't find anything more specific on their website.
For example, I have a document and I want to find other documents that are similar to it. I used 2 documents containing 200-250 words each (I changed model.max_seq_length to 350 so the model can handle longer texts), and in the end the cosine similarity is 0.79. Is that all we can see? Is there a way to extract the main phrases/keywords that made the model return this high similarity value?
Thanks in advance!

Have you tried a simple word-count comparison between the two documents and other random documents? Or tf-idf, if the two documents are part of a bigger corpus?
Another thing you can do is look inside the stored_embeddings matrix (see the code below and here), in which SBERT encodes your sentences (e.g. for 20,000 documents you'll get a 20,000 x 384 matrix), after saving it to a pickle file like:
from sentence_transformers import SentenceTransformer
import pickle

# model and sentences are assumed to be defined already, e.g.:
# model = SentenceTransformer('all-MiniLM-L6-v2')  # a 384-dimensional model
embeddings = model.encode(sentences)

# save sentences and embeddings to a pickle file
with open('embeddings.pkl', 'wb') as fOut:
    pickle.dump({'sentences': sentences, 'embeddings': embeddings}, fOut, protocol=pickle.HIGHEST_PROTOCOL)

# load them back later
with open('embeddings.pkl', 'rb') as fIn:
    stored_data = pickle.load(fIn)
    stored_embeddings = stored_data['embeddings']
The stored_embeddings variable can be handled as a NumPy matrix and can therefore be indexed to access single elements. You can go through the 384 dimensions column by column (for a big matrix, avoid enumerating row by row, it will take forever) and compare the values the two documents take in one particular dimension, for example to see which dimension has the highest values or variance.
I'm not saying that what you find will be interpretable, but at least you can try and see what you get.
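For example, here is a minimal sketch of such a per-dimension comparison, assuming the two documents are the first two rows of stored_embeddings (the row indices and the number of top dimensions shown are illustrative):
import numpy as np

doc_a = stored_embeddings[0]   # embedding of the first document
doc_b = stored_embeddings[1]   # embedding of the second document

# element-wise products: each entry is one dimension's contribution to the dot product
per_dim = doc_a * doc_b
top_dims = np.argsort(-per_dim)[:10]

print("Dimensions contributing most to the similarity:", top_dims)
print("Document A values in those dimensions:", doc_a[top_dims])
print("Document B values in those dimensions:", doc_b[top_dims])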

Related

finding semantic similarity between 2 statements

I am currently working on a small application in Python that has search functionality (currently using difflib), but I want to build semantic search that returns the top 5 or 10 results from my database based on user-inputted text, similar to how the Google search engine works. I found some solutions here.
The problem is that, in one of those solutions, the two statements below are treated as semantically different. I don't care about that distinction; it makes things harder than I need. I would also prefer the solution to be a pretrained neural network model or library that I can use easily.
Pete and Rob have found a dog near the station.
Pete and Rob have never found a dog near the station
I also found some solutions using gensim and GloVe embeddings, but they find similarity between words, not sentences.
What I want
Suppose my db has the statement display classes, and user inputs such as show, showed, displayed, displayed class, show types, etc. should all be treated as the same. And if the two statements above are also treated as the same, I don't mind. displayed and displayed class already match with difflib.
Points to be noted
Matches come from a fixed set of statements, but user-inputted statements can differ
Must work for whole statements, not just single words
I think that is not a gensim embedding but a word2vec embedding. Whatever it is:
You need tensorflow_hub
The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.
I believe what you need here is Text Classification or Semantic Similarity, because you want to find the top 5 or 10 nearest statements for a given user statement.
It is easy to use, but the model is ≈ 1 GB in size. It works with words, sentences, phrases or short paragraphs. The input is variable-length English text and the output is a 512-dimensional vector. You can find more information about it here.
Code
import tensorflow_hub as hub
import numpy as np
# Load model. It will download first time.
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/5"
model = hub.load(module_url)
# data[0] is the reference statement; the rest are candidates to compare against
data = ["display classes", "show", "showed" ,"displayed class", "show types"]
# find high-dimensional vectors.
vecs = model(data)
# similarity of the reference statement to every statement (inner product)
dists = np.inner(vecs[0], vecs)
print(dists)
Output
array([0.9999999 , 0.5633253 , 0.46475542, 0.85303843, 0.61701006],dtype=float32)
Conclusion
The first value, 0.9999999, is the similarity of display classes with itself; the second, 0.5633253, is the similarity between display classes and show; and the last, 0.61701006, is the similarity between display classes and show types.
Using this, you can compute the similarity between a given input and every statement in your db, then rank the statements by that score.
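As a minimal sketch of that ranking step (reusing model, data and vecs from the code above; the query string and top-5 cutoff are illustrative):
query = "showed"
scores = np.inner(model([query])[0], vecs)   # similarity of the query to every db statement
top_k = np.argsort(-scores)[:5]              # indices of the most similar statements
for i in top_k:
    print(data[i], float(scores[i]))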
You can use wordnet for finding synonyms and then use these synonyms for finding similar statements.
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')
def get_syn_list(gword):
    syn_list = []
    try:
        syn_list.extend(wn.synsets(gword, pos=wn.NOUN))
        syn_list.extend(wn.synsets(gword, pos=wn.VERB))
        syn_list.extend(wn.synsets(gword, pos=wn.ADJ))
        syn_list.extend(wn.synsets(gword, pos=wn.ADV))
    except Exception:
        print("Something went wrong while looking up synsets")
    syn_words = []
    for syn in syn_list:
        syn_words.append(syn.lemmas()[0].name())
    return syn_words
Now split each of your db statements into words and collect synonyms for them, like this:
stat = ["display classes"]

syn_dict = {}
for i in stat:
    tmp = []
    for x in i.split(" "):
        tmp.extend(get_syn_list(x))
    syn_dict[i] = set(tmp)
Now that you have the synonyms, just compare them with the inputted text. Use a lemmatizer before comparing words, so that displayed becomes display.
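For example, a minimal sketch of that comparison step, assuming the syn_dict built above and NLTK's WordNetLemmatizer (the matches helper is a hypothetical name):
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def matches(user_input, syn_dict):
    # lemmatize each input token as a verb, e.g. "displayed" -> "display"
    tokens = {lemmatizer.lemmatize(t, pos='v') for t in user_input.lower().split()}
    # return db statements whose synonym sets overlap with the input tokens
    return [stmt for stmt, syns in syn_dict.items() if tokens & {s.lower() for s in syns}]

print(matches("displayed class", syn_dict))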
Hey, you can use spaCy.
This answer is from https://medium.com/better-programming/the-beginners-guide-to-similarity-matching-using-spacy-782fc2922f7c
import spacy
nlp = spacy.load("en_core_web_lg")
doc1 = nlp("display classes")
doc2 = nlp("show types")
print(doc1.similarity(doc2))
Output
0.6277548513279427
Edit
Run the following command to download the model:
!python -m spacy download en_core_web_lg

Efficient way to get best matching pairs given a similarity-outputting neural network?

I am trying to come up with a neural network that scores pairs of short texts (for example, a Stack Exchange title and body). Following the Deep Learning Cookbook's example, the network would look basically like this:
So we have our two inputs (title and body), we embed them, then calculate the cosine similarity between the embeddings. The inputs of the model are [title, body]; the output is [sim].
Now I'd like to find the closest matching body for a given title. I am wondering if there's a more efficient way of doing this that doesn't involve iterating over every possible (title, body) pair and calculating the corresponding similarity, because for very large datasets that is just not feasible.
Any help is much appreciated!
It is indeed not very efficient to iterate over every possible data pair. Instead, you could use your model to extract the embeddings of all your titles and text bodies and save them in a database (or simply a .npy file). That is, you don't use your model to output a similarity score; instead you use it to output an embedding (from your embedding layer).
At inference time you can then use a library for efficient similarity search such as faiss. Given a title, you simply look up its embedding and search the whole space of body embeddings to see which ones get the highest score. I have used this approach myself and was able to search 1M vectors in just 100 ms.
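A minimal sketch of that lookup, assuming the body embeddings are already in a float32 NumPy array body_vecs and the query embedding is title_vec (names and shapes are illustrative):
import faiss

d = body_vecs.shape[1]                # embedding dimensionality
faiss.normalize_L2(body_vecs)         # normalize so inner product equals cosine similarity
index = faiss.IndexFlatIP(d)          # exact inner-product index
index.add(body_vecs)

query = title_vec.reshape(1, -1).astype('float32')
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # scores and indices of the 5 closest bodies
print(ids[0], scores[0])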

adding and accessing auxiliary tf.Dataset attributes with Keras

I use a tf.py_func call to parse data (features, labels and sample_weights) from file to a tf.Dataset:
dataset = tf.data.Dataset.from_tensor_slices((records, labels, sample_weights))
dataset = dataset.map(
    lambda filename, label, sample_weight: tuple(tf.py_func(
        self._my_parse_function,
        [filename, label, sample_weight],
        [tf.float32, label.dtype, tf.float32])))
The data is variable-length 1-D sequences, so I also pad the sequences to a fixed length in my_parse_function.
I use tensorflow.python.keras.models.Sequential.fit(...) to train the data (which now accepts datasets as input, including datasets with sample_weights) and tensorflow.python.keras.models.Sequential.predict to predict outputs.
Once I have predictions, I would like to do some post-processing to make sense of the outputs. For example, I'd like to truncate the padded data to the actual sequence length. Also, I'd like to know for sure which file the data came from, since I am not sure that ordering is guaranteed with dataset iterators, especially if batching is used (I do batch the dataset as well) or multi-GPU or multi-worker setups are involved (I hope to try those scenarios). Even if order were 'guaranteed', this is a decent sanity check.
This information, the filename (i.e., a string) and sequence length (i.e., an integer), is not currently conveniently accessible, so I'd like to add these two attributes to the dataset elements and be able to retrieve them during/after the call to predict.
What is the best approach to do this?
Thanks
As a workaround, I store this auxiliary information in a 'global' dictionary in my_parse_fn, so it is stored (and re-stored) on every iteration through the tf.Dataset. This is OK for now since there are only about 1000 examples in the training set, so storing 1000 strings and integers is not a problem. But if this auxiliary information were larger, or the training set were larger, this approach would not be very scalable. In my case, the input data for each training example is quite large, about 50 MB, which is why reading the tf.Dataset from file (i.e., on every epoch) is important.
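A minimal sketch of that workaround (the np.load reader and MAX_LEN are illustrative stand-ins for my actual parsing code):
import numpy as np

aux_info = {}   # 'global' dict: filename -> original sequence length
MAX_LEN = 1000  # illustrative fixed length

def my_parse_fn(filename, label, sample_weight):
    sequence = np.load(filename.decode())          # e.g. one variable-length 1-D array per file
    aux_info[filename.decode()] = len(sequence)    # remember the true length before padding
    padded = np.zeros(MAX_LEN, dtype=np.float32)
    padded[:len(sequence)] = sequence[:MAX_LEN]    # pad (or truncate) to the fixed length
    return padded, label, sample_weight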
I still think it would be helpful to be able to extend a tf.Dataset with this information more conveniently. I also noticed that when I added a field to a tf.Dataset, such as dataset.tag = 'training', dataset.tag = 'validation' or dataset.tag = 'test' to identify the set, the field did not survive the iterations of training.
So again, in this case, I'm wondering how a tf.Dataset can be extended.
On the other question, it looks like the order of tf.Dataset elements is respected through iterations, so predictions from tensorflow.python.keras.models.Sequential.predict(...) are ordered as the file IDs were presented to my_parse_fn (at least batching respects this ordering, though I still don't know whether a multi-GPU scenario would as well).
Thanks for any insights.

Dataset for RNN-LSTM as Spell checker in python

I have a dataset of more than 5 million records with many noisy features (words) in it, so I thought of doing spell correction and abbreviation handling.
When I googled for spell-correction packages in Python, I found packages like autocorrect, textblob, hunspell, etc., and Peter Norvig's method.
Below is a sample of my dataset:
Id description
1 switvch for air conditioner..............
2 control tfrmr...........
3 coling pad.................
4 DRLG machine
5 hair smothing kit...............
I tried the spell-correction functions from the above packages using this code:
dataset['description']=dataset['description'].apply(lambda x: list(set([spellcorrection_function(item) for item in x])))
For the entire dataset it took more than 12 hours to complete the spell correction, and it also introduced noise (for about 20% of the total words, which are important).
For example, in the last row, "smothing" was corrected to "something", but it should be "smoothing" ("something" makes no sense in this context).
Going further
When I looked at the dataset, I noticed that a word's spelling is not always wrong; there are also correctly spelled instances somewhere in the dataset. So I tokenized the entire dataset, split correct words and wrong words using a dictionary, applied the Jaro-Winkler similarity measure between all pairs of words, and selected the pairs with a similarity value of 0.93 or more:
Wrong word    Correct word    Similarity score
switvch       switch          0.98
coling        cooling         0.98
smothing      smoothing       0.99
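That pairwise scoring step could look something like this (a sketch assuming the jellyfish library, whose recent versions expose jaro_winkler_similarity; the word lists are illustrative):
import jellyfish

wrong_words = ["switvch", "coling", "smothing"]
correct_words = ["switch", "cooling", "smoothing", "machine"]

similar_word_dictionary = {}
for w in wrong_words:
    for c in correct_words:
        score = jellyfish.jaro_winkler_similarity(w, c)
        if score >= 0.93:                      # keep only highly similar pairs
            similar_word_dictionary[w] = c
print(similar_word_dictionary)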
I got more than 50k pairs of similar words, which I put in a dictionary with the wrong word as key and the correct word as value.
I also kept words with their abbreviations (~3k pairs) in a dictionary:
key value
tfrmr transformer
drlg drilling
I search and replace the key-value pairs using this code:
dataset['description'] = dataset['description'].replace(similar_word_dictionary, regex=True)
dataset['description'] = dataset['description'].replace(abbreviation_dictionary, regex=True)
This code took more than a day to complete for only 10% of my entire dataset, which I find inefficient.
Along with the Python packages, I also found Deep Spelling, which is a very efficient way of doing spelling correction. It has a very clear explanation of an RNN-LSTM as a spell checker.
As I don't know much about RNNs and LSTMs, I only have a very basic understanding of the above link.
Question
I am confused about how to set up the training set for the RNN for my problem. Should I:
consider the correct words (without any spelling mistakes) in the entire dataset as the training set and the entire description column of my dataset as the test set,
or use the pairs of similar words and the abbreviation list as the training set and the description column as the test set (where the model finds the wrong word in a description and corrects it),
or is there some other way? Could someone please tell me how I can approach this further?
Could you give some more information about the model you are building?
It makes sense to use a character level sequence to sequence model, similar to the one you would use for translation. There are already some approaches trying to do the same (1, 2, 3).
Maybe draw on them for some inspiration?
Now, with regard to the dataset, it seems that the one you are trying to use mostly contains errors? If you don't have the correct version of each phrase, I don't think you can use this dataset.
A simple approach would be to take an existing dataset and introduce random noise into it. The Deep Spelling blog talks about how you can do that with an existing text corpus. Also, a recommendation from me would be to use small-ish standalone sentences as the training set. A good place to find those is machine translation datasets (like the Tatoeba project), using only the English phrases. Out of those you can create pairs of (input_phrase, target_phrase), where the input_phrase is potentially noisy (but not always).
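A minimal sketch of that noise injection (the noise operations and the probability p are illustrative choices):
import random
import string

def add_noise(sentence, p=0.1):
    out = []
    for c in sentence:
        r = random.random()
        if r < p / 3:
            continue                                           # drop this character
        elif r < 2 * p / 3:
            out.append(random.choice(string.ascii_lowercase))  # substitute a random character
        elif r < p:
            out.append(c)
            out.append(random.choice(string.ascii_lowercase))  # insert an extra character
        else:
            out.append(c)
    return ''.join(out)

clean = "switch for air conditioner"
pairs = [(add_noise(clean), clean) for _ in range(3)]   # (input_phrase, target_phrase) pairs
print(pairs)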
With regard to performance, 12 hours of training for one pass over a 5M-record dataset sounds about right for a home PC. You can use a GPU or a cloud solution (1, 2) for faster training.
Now, for false-positive corrections, the dictionary you have created could indeed be handy: if a word exists in this dictionary, don't accept a "correction" of it from the model.
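That filter can be as simple as the following sketch (known_words stands in for your dictionary of trusted words; the helper name is hypothetical):
def accept_correction(original, corrected, known_words):
    # keep the original token if it is already a trusted word; otherwise take the model's suggestion
    return original if original in known_words else corrected

known_words = {"machine", "switch"}
print(accept_correction("machine", "machin", known_words))    # -> machine (correction rejected)
print(accept_correction("switvch", "switch", known_words))    # -> switch (correction accepted)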

reducing word2vec dimension from Google News Vector Dataset

I loaded Google's News vectors (300-dimensional) dataset. Each word is represented by a 300-dimensional vector. I want to use this in my neural network for classification, but 300 dimensions per word seems too big. How can I reduce the vectors from 300 to, say, 100 dimensions without compromising quality?
tl;dr Use a dimensionality reduction technique like PCA or t-SNE.
This is not a trivial operation that you are attempting. In order to understand why, you must understand what these word vectors are.
Word embeddings are vectors that attempt to encode information about what a word means, how it can be used, and more. What makes them interesting is that they manage to store all of this information as a collection of floating point numbers, which is nice for interacting with models that process words. Rather than pass a word to a model by itself, without any indication of what it means, how to use it, etc, we can pass the model a word vector with the intention of providing extra information about how natural language works.
As I hope I have made clear, word embeddings are pretty neat. Constructing them is an area of active research, though there are a couple of ways to do it that produce interesting results. It's not incredibly important to this question to understand all of the different ways, though I suggest you check them out. Instead, what you really need to know is that each of the values in the 300-dimensional vector associated with a word was "optimized" in some sense to capture a different aspect of the meaning and use of that word. Put another way, each of the 300 values corresponds to some abstract feature of the word. Removing any combination of these values at random will yield a vector that may be lacking significant information about the word, and may no longer serve as a good representation of that word.
So, picking the top 100 values of the vector is no good. We need a more principled way to reduce the dimensionality. What you really want is to compress these values such that as much information as possible about the word is retained in the resulting, smaller vector. This is where a dimensionality reduction technique like Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE) comes into play. I won't describe in detail how these methods work, but essentially they aim to capture the essence of a collection of information while reducing the size of the vector describing said information. As an example, PCA does this by constructing a new vector from the old one, where the entries in the new vector correspond to combinations of the main "components" of the old vector, i.e. those components which account for most of the variety in the old data.
To summarize, you should run a dimensionality reduction algorithm like PCA or t-SNE on your word vectors. There are a number of Python libraries that implement both (e.g. scikit-learn has a PCA implementation). Be warned, however, that the dimensionality of these word vectors is already relatively low. To see why, consider the task of naively representing a word via its one-hot encoding (a one at one position and zeros everywhere else). If your vocabulary is as big as the Google word2vec model's, then each word is suddenly associated with a vector containing hundreds of thousands of entries! As you can see, the dimensionality has already been reduced significantly to 300, and any reduction that makes the vectors significantly smaller is likely to lose a good deal of information.
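As a rough sketch of what this could look like with gensim's KeyedVectors and scikit-learn's PCA (the file path and the 50,000-word vocabulary subset are illustrative; key_to_index is the gensim 4.x API):
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

word_vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

vocab = list(word_vectors.key_to_index)[:50000]   # a manageable subset of the vocabulary
matrix = word_vectors[vocab]                      # shape (50000, 300)

pca = PCA(n_components=100)
reduced = pca.fit_transform(matrix)               # shape (50000, 100)
print(reduced.shape, pca.explained_variance_ratio_.sum())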
@narasimman I suggest that you simply keep the first 100 numbers in the output vector of the word2vec model. The output is of type numpy.ndarray, so you can do something like:
>>> from gensim.models import KeyedVectors
>>> word_vectors = KeyedVectors.load_word2vec_format('modelConfig/GoogleNews-vectors-negative300.bin', binary=True)
>>> type(word_vectors["hello"])
<type 'numpy.ndarray'>
>>> word_vectors["hello"][:10]
array([-0.05419922, 0.01708984, -0.00527954, 0.33203125, -0.25 ,
-0.01397705, -0.15039062, -0.265625 , 0.01647949, 0.3828125 ], dtype=float32)
>>> word_vectors["hello"][:2]
array([-0.05419922, 0.01708984], dtype=float32)
I don't think that this will screw up the result if you do it to all the words (not sure though!)
