How to determine which topic each test tweet is addressing? - nlp

I implemented LDA in Python and modeled topics for a set of tweets, and now I'm trying to map each tweet to a topic to see which topic it belongs to, but I couldn't find any help online.
I found that this can be done for NMF, but I couldn't find any functions or specific options in Python for this case. I'm using gensim to generate the topics with LDA.

Using your trained model to get the most-associated topics for any particular text is covered in the Gensim docs for the LdaModel class, under 'Usage Examples' - https://radimrehurek.com/gensim/models/ldamodel.html#usage-examples - but the treatment there may be a little unintuitive, because:
it calls any text you might be analyzing an 'unseen' document, but the same process works for documents that were part of the training corpus; and
like many other Gensim classes, the analysis is done via Python's indexed-lookup idiom: bracketed [ ] access (the shortcut for calling .__getitem__()), using the right representation of the text as if it were a lookup key, even though here it is really an argument to the model's analysis rather than something that strictly looks up a stored response.
So this part of its examples is what you need to follow:
Query the model using new, unseen documents
>>> # Create a new corpus, made of previously unseen documents.
>>> other_texts = [
...     ['computer', 'time', 'graph'],
...     ['survey', 'response', 'eps'],
...     ['human', 'system', 'computer']
... ]
>>> other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]
>>>
>>> unseen_doc = other_corpus[0]
>>> vector = lda[unseen_doc] # get topic probability distribution for a document
Just remember:
the other_texts & unseen_doc in this example can also be repeats from the training corpus, to ask the trained model what it thinks those docs' topics should be; and
you need to convert any text to a bag-of-words representation using the exact same dictionary as was used for the training documents, so that the word indexes and frequency weightings are correct for the bag-of-words representations used for the lookups.
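For example, here is a rough sketch of applying that to your own tweets (my own illustration, not from the Gensim docs; tokenized_tweets, dictionary and lda are placeholders for your tokenized tweets, training dictionary and trained model):
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Placeholder data & model: substitute your own tokenized tweets and trained LDA model.
tokenized_tweets = [['game', 'match', 'win'], ['election', 'vote', 'poll']]
dictionary = Dictionary(tokenized_tweets)
lda = LdaModel(corpus=[dictionary.doc2bow(t) for t in tokenized_tweets],
               id2word=dictionary, num_topics=2)

for tokens in tokenized_tweets:
    bow = dictionary.doc2bow(tokens)            # same dictionary as used in training
    topic_probs = lda[bow]                      # list of (topic_id, probability) pairs
    best_topic, best_prob = max(topic_probs, key=lambda tp: tp[1])
    print(tokens, '->', best_topic, round(float(best_prob), 3))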

Related

How to convert small dataset into word embeddings instead of one-hot encoding?

I have a dataset of 33 words that are a mix of verbs and nouns, e.g. father, sing, etc. I have tried converting them to one-hot encodings, but for my use case it has been suggested that I look into word2vec embeddings. I have looked at gensim and GloVe but am struggling to make them work.
How could I convert my data into embeddings, such that two words that are semantically closer have a smaller distance between their respective vectors? How may this be achieved, and is there any helpful material on the topic?
Since your dataset is quite small, and I'm assuming it doesn't contain any jargon, it's best to use a pre-trained model to save on training time.
With gensim, it's as simple as:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
The 'word2vec-google-news-300' model has been pre-trained on a part of the Google News Dataset and generalizes well enough to most tasks. Following this, you can create word embeddings/vectors like so:
vec = wv['father']
And, finally, for computing word similarity:
similarity_score = wv.similarity('father', 'sing')
Lastly, one major limitation of Word2Vec is its inability to deal with words that are OOV (out of vocabulary). For such cases, it's best to train a custom model on your corpus.
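If you do need to handle OOV words, a minimal training sketch with gensim might look like the following (my_sentences is a placeholder for your own tokenized corpus; on gensim 3.x the parameter is size rather than vector_size):
from gensim.models import Word2Vec

# Placeholder corpus: replace with your own tokenized sentences.
my_sentences = [["father", "sings", "a", "song"], ["the", "child", "sings", "too"]]

custom_model = Word2Vec(my_sentences, vector_size=100, window=5, min_count=1)

vec = custom_model.wv["father"]                        # embedding for a word
score = custom_model.wv.similarity("father", "sings")  # similarity between two words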

finding semantic similarity between 2 statements

I am currently working on a small application in Python that has search functionality (currently using difflib), but I want to build semantic search that can return the top 5 or 10 results from my database based on user-input text, the same way a search engine like Google works. I found some solutions here.
But the problem is that the two statements below, taken from one of those solutions, are treated as semantically different, and I don't care about that distinction, because it makes things harder than I need. Also, the solution should be a pre-trained neural network model or library that I can use easily.
Pete and Rob have found a dog near the station.
Pete and Rob have never found a dog near the station.
I also found some solutions that use gensim and GloVe embeddings, but they find similarity between words, not sentences.
What I want
Suppose my db has the statement display classes, and user inputs such as show, showed, displayed, displayed class, show types, etc. should all be treated as the same. And if the two statements above are treated as the same too, I don't care. displayed and displayed class already match with difflib.
Points to be noted
Search within a fixed set of statements, but the user-input statements can differ
Must work for statements
I think what you found is not a 'gensim embedding' but a word2vec embedding. Whatever it is, what you need here is tensorflow_hub.
The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.
I believe what you need here is text classification or semantic similarity, because you want to find the top 5 or 10 nearest statements for a given user statement.
It is easy to use, but the model is ≈ 1 GB in size. It works with words, sentences, phrases, or short paragraphs. The input is variable-length English text and the output is a 512-dimensional vector. You can find more information about it here.
Code
import tensorflow_hub as hub
import numpy as np
# Load model. It will download first time.
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/5"
model = hub.load(module_url)
# first data[0] is your actual value
data = ["display classes", "show", "showed" ,"displayed class", "show types"]
# find high-dimensional vectors.
vecs = model(data)
# find distance between statements using inner product
dists = np.inner(vecs[0], vecs)
# print dists
print(dists)
Output
array([0.9999999 , 0.5633253 , 0.46475542, 0.85303843, 0.61701006],dtype=float32)
Conclusion
First value 0.999999 is distance between display classes and display classes itself. second 0.5633253 is distance between display classes and show and last 0.61701006 is distance between display classes and show types.
Using this, you can find distance between given input and statements in db. then rank them according to distance.
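For example, a rough ranking sketch (my own addition, reusing the model and numpy import loaded above; db_statements is hypothetical data):
# Hypothetical db contents and user query.
db_statements = ["display classes", "add student", "delete record", "show marks"]
query = "show types"

vectors = model(db_statements + [query])          # encode db rows and the query together
scores = np.inner(vectors[-1], vectors[:-1])      # score of the query against each db row
ranked = sorted(zip(db_statements, scores), key=lambda pair: pair[1], reverse=True)
print(ranked[:5])                                 # top matches, highest score first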
You can use WordNet to find synonyms, and then use those synonyms to find similar statements.
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')
def get_syn_list(gword):
    syn_list = []
    try:
        syn_list.extend(wn.synsets(gword, pos=wn.NOUN))
        syn_list.extend(wn.synsets(gword, pos=wn.VERB))
        syn_list.extend(wn.synsets(gword, pos=wn.ADJ))
        syn_list.extend(wn.synsets(gword, pos=wn.ADV))
    except:
        print("Something Wrong Happened")
    syn_words = []
    for i in syn_list:
        syn_words.append(i.lemmas()[0].name())
    return syn_words
Now split each statement in your db into words, like this:
stat = ["display classes"]
syn_dict = {}
for i in stat:
    tmp = []
    for x in i.split(" "):
        tmp.extend(get_syn_list(x))
    syn_dict[i] = set(tmp)
Now you have the synonyms; just compare them with the input text. Use a lemmatizer before comparing words, so that displayed becomes display.
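Here is a rough sketch of that comparison step (my own addition; it reuses syn_dict from the snippet above and lemmatizes with NLTK's WordNetLemmatizer):
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def rank_statements(user_text, syn_dict):
    # Lemmatize the user's words as verbs, so that 'displayed' becomes 'display'.
    user_words = {lemmatizer.lemmatize(w, pos='v') for w in user_text.lower().split()}
    results = []
    for statement, synonyms in syn_dict.items():
        overlap = user_words & {s.lower() for s in synonyms}
        if overlap:
            results.append((statement, len(overlap)))
    # Statements sharing the most synonyms with the input come first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)

print(rank_statements("displayed class", syn_dict))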
Hey, you can use spaCy.
This answer is from https://medium.com/better-programming/the-beginners-guide-to-similarity-matching-using-spacy-782fc2922f7c
import spacy
nlp = spacy.load("en_core_web_lg")
doc1 = nlp("display classes")
doc2 = nlp("show types")
print(doc1.similarity(doc2))
Output
0.6277548513279427
Edit
Run the following command first; it will download the model.
!python -m spacy download en_core_web_lg

word2vec limit similar_by_vector() result to re-trained corpus

Assume you have a (Wikipedia) pre-trained word2vec model and you train it on an additional corpus (very small, 1000 sentences).
Can you imagine a way to limit a vector-search to the "re-trained" corpus only?
For example
model.wv.similar_by_vector()
will simply find the closest word for a given vector, no matter if it is part of the Wikipedia corpus, or the re-trained vocabulary.
On the other hand, for 'word' search the concept exists:
most_similar_to_given('house',['garden','boat'])
I have tried training on the small corpus from scratch, and it somewhat works as expected. But of course it could be much more powerful if the assigned vectors came from a pre-trained set.
Sharing an efficient way to do this manually:
re-train word2vec on the additional corpus
create full unique word-index of corpus
fetch re-trained vectors for each word in the index
instead of the canned function "similar_by_vector", use scipy.spatial.KDTree.query()
This finds the closest word within the given corpus only and works as expected.
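A minimal sketch of that manual approach (my own illustration; model is assumed to be the re-trained Word2Vec model and corpus_words the unique vocabulary of the small corpus):
import numpy as np
from scipy.spatial import KDTree

# Keep only corpus words the model knows, and stack their vectors.
# Note: KDTree uses Euclidean distance; unit-normalize the vectors first
# if you want the ranking to match cosine similarity.
corpus_words = [w for w in corpus_words if w in model.wv]
word_vectors = np.array([model.wv[w] for w in corpus_words])
tree = KDTree(word_vectors)

query_vec = model.wv['house']               # any query vector
dists, idxs = tree.query(query_vec, k=5)    # 5 nearest neighbours, corpus-only
print([(corpus_words[i], d) for i, d in zip(idxs, dists)])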
Similar to the approach for creating a subset of doc-vectors in a new KeyedVectors instance suggested here, assuming small_vocab is a list of the words in your new corpus, you could try:
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors  # gensim 3.x

subset_vectors = WordEmbeddingsKeyedVectors(vector_size)
subset_vectors.add(small_vocab, w2v_model.wv[small_vocab])
Then subset_vectors contains just the words you've selected, but supports familiar operations like most_similar().
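For example (assuming 'house' is among the words in small_vocab):
# Nearest neighbours are now restricted to the new-corpus vocabulary.
print(subset_vectors.most_similar('house', topn=5))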

load Doc2Vec model and get new sentence's vectors for test

I have read lots of examples regarding doc2vec, but I couldn't find an answer. As a real example: I want to build a model with doc2vec and then train some ML models with it. After that, how can I get the vector of a new raw string from the exact Doc2Vec model I trained? I need to feed my ML model vectors of the same size and meaning.
There are a collection of example Jupyter (aka IPython) notebooks in the gensim docs/notebooks directory. You can view them online at:
https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
But they'll be in your gensim installation directory, if you can find that for your current working environment.
Those that include doc2vec in their name demonstrate the use of the Doc2Vec class. The most basic intro operates on the 'Lee' corpus that's bundled with gensim for use in its unit tests. (It's really too small for real Doc2Vec success, but by forcing smaller models and many training iterations the notebook just barely manages to get some consistent results.) See:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
It includes a section on inferring a vector for a new text (in the same notebook linked above).
Note that inference is performed on a list of string tokens, not a raw string. And those tokens should have been preprocessed/tokenized the same way as the original training data for the model, so that the vocabularies are compatible. (Any unknown words in a new text are silently ignored.)
Note also that especially on short texts, it often helps to provide a much-larger-than-default value of the optional steps parameter to infer_vector() - say 50 or 200 rather than the default 5. It may also help to provide a starting alpha parameter more like the training default of 0.025 than the method-default of 0.1.
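Putting that together, a small hypothetical sketch (the file name and text are placeholders; the parameter names match gensim 3.x, where infer_vector() takes steps):
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

# Load the exact model you trained earlier (path is a placeholder).
model = Doc2Vec.load('my_doc2vec.model')

# Tokenize the raw string the same way the training texts were tokenized.
tokens = simple_preprocess("the raw string you want a vector for")

# More inference steps and a training-like alpha often help, especially for short texts.
vector = model.infer_vector(tokens, steps=50, alpha=0.025)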

Gensim train word2vec on wikipedia - preprocessing and parameters

I am trying to train the word2vec model from gensim using the Italian Wikipedia dump:
"http://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2"
However, I am not sure what the best preprocessing for this corpus is.
The gensim model accepts a list of tokenized sentences.
My first try is to just use the standard WikiCorpus preprocessor from gensim. This extracts each article, removes punctuation and splits words on spaces. With this tool, each 'sentence' fed to the model corresponds to an entire article, and I am not sure of the impact of this fact on the model.
After this I train the model with default parameters. Unfortunately, after training it seems that I do not manage to obtain very meaningful similarities.
What is the most appropriate preprocessing of the Wikipedia corpus for this task? (If this question is too broad, please help me by pointing to a relevant tutorial / article.)
This is the code of my first trial:
from gensim.corpora import WikiCorpus
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
corpus = WikiCorpus('itwiki-latest-pages-articles.xml.bz2', dictionary=False)
max_sentence = -1

def generate_lines():
    for index, text in enumerate(corpus.get_texts()):
        if index < max_sentence or max_sentence == -1:
            yield text
        else:
            break

from gensim.models.word2vec import BrownCorpus, Word2Vec
model = Word2Vec()
model.build_vocab(generate_lines())  # This strangely builds a vocab of "only" 747904 words, which is << the ~10M words reported in the literature
model.train(generate_lines(), chunksize=500)
Your approach is fine.
model.build_vocab(generate_lines())  # This strangely builds a vocab of "only" 747904 words, which is << the ~10M words reported in the literature
This could be because of pruning infrequent words (the default is min_count=5).
To speed up computation, you can consider "caching" the preprocessed articles as a plain .txt.gz file, one sentence (document) per line, and then simply using word2vec.LineSentence corpus. This saves parsing the bzipped wiki XML on every iteration.
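A possible sketch of that caching step (my own illustration; file names are placeholders):
import gzip
from gensim.corpora import WikiCorpus
from gensim.models.word2vec import Word2Vec, LineSentence

# Write each preprocessed article as one whitespace-separated line...
corpus = WikiCorpus('itwiki-latest-pages-articles.xml.bz2', dictionary=False)
with gzip.open('itwiki-articles.txt.gz', 'wt', encoding='utf-8') as out:
    for tokens in corpus.get_texts():
        out.write(' '.join(tokens) + '\n')

# ...then train straight from the cached text, skipping the XML parsing on later runs.
sentences = LineSentence('itwiki-articles.txt.gz')
model = Word2Vec(sentences, min_count=5)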
Why word2vec doesn't produce "meaningful similarities" for Italian wiki, I don't know. English wiki seems to work fine. See also here.
I've been working on a project to massage the Wikipedia corpus and get vectors out of it.
I might generate the Italian vectors soon, but in case you want to do it on your own, take a look at:
https://github.com/idio/wiki2vec
