Gensim: Not able to load the id2word file - nlp

I am working on topic inference on a new corpus, given a previously trained LDA model. The model loads without problems, but I cannot load the id2word file to create the corpora.Dictionary object needed to map the new corpus to token ids: the load method raises an AttributeError on a dict object, and I don't understand why. Below is minimal code that reproduces the situation, including the packages used.
Thank you in advance for your response.
import numpy as np
import os
import pandas as pd
import gensim
from gensim import corpora
import datetime
import nltk
model_name = "lda_sub_full_35"
dictionary_name = "lda_sub_full_35.id2word"
model_for_inference = gensim.models.LdaModel.load(model_name, mmap='r')
print('Successfully load the model')
lda_dictionary = corpora.Dictionary.load(dictionary_name, mmap='r')
I expect both the dictionary and the model to load, but loading the dictionary raises the error below:
File "topic_inference.py", line 31, in <module>
lda_dictionary = corpora.Dictionary.load(dictionary_name, mmap='r')
File "/topic_modeling/env/lib/python3.8/site-packages/gensim/utils.py", line 487, in load
obj._load_specials(fname, mmap, compress, subname)
AttributeError: 'dict' object has no attribute '_load_specials'

How were the contents of the lda_sub_full_35.id2word file originally saved?
Only if it was saved by a Gensim corpora.Dictionary object's .save() method should it be loaded as you've tried, with corpora.Dictionary.load().
The traceback says the object read from the file is a plain dict, which suggests the file was not written by Dictionary.save(). If it was a plain Python dict saved with pickle, you need to load it in a symmetrically matched way. That might be as simple as:
import pickle
with open(path, 'rb') as f:
    lda_dictionary = pickle.load(f)
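As a rough illustration of the pickle route, the sketch below saves a small plain dict, loads it back, and uses it to map tokens of a new document to ids. The mapping contents, the file name, and the assumption that the dict maps ids to tokens are all hypothetical; the real file may be laid out differently, so inspect the loaded object first.

```python
import os
import pickle
import tempfile
from collections import Counter

# Hypothetical id -> token mapping, standing in for the saved id2word file.
id2word = {0: "topic", 1: "model", 2: "corpus"}

# Save it the way a plain dict would be saved: with pickle, not Dictionary.save().
path = os.path.join(tempfile.mkdtemp(), "example.id2word")
with open(path, "wb") as f:
    pickle.dump(id2word, f)

# Load it symmetrically and check what kind of object came back.
with open(path, "rb") as f:
    loaded = pickle.load(f)
print(type(loaded).__name__)  # a plain dict has no _load_specials method

# Invert the mapping so tokens can be turned into ids; unknown tokens are skipped.
token2id = {token: idx for idx, token in loaded.items()}
new_doc = ["topic", "model", "topic", "unseen"]
bow = sorted(Counter(token2id[t] for t in new_doc if t in token2id).items())
print(bow)
```

The resulting list of (id, count) pairs has the same shape as a bag-of-words vector, which is what an LDA model expects for inference.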

Related

How to create wordcloud from LDA model?

Following the documentation of gensim.models.ldamodel, I want to train an LDA model and create a word cloud from it (based on this SO answer). I am using the following code, combined from both sources:
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
import gensim
import matplotlib.pyplot as plt
from wordcloud import WordCloud
common_dictionary = Dictionary(common_texts) # create corpus
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
lda = gensim.models.LdaModel(common_corpus, num_topics=10) # train model on corpus
for t in range(lda.num_topics):
    plt.figure()
    plt.imshow(WordCloud().fit_words(lda.show_topic(t, 200)))
    plt.axis("off")
    plt.title("Topic #" + str(t))
    plt.show()
However, I get an AttributeError: 'list' object has no attribute 'items' on the line plt.imshow(...)
Can someone help me out here? (Answers to similar questions have not been working for me and I am trying to compile a minimal pipeline with this.)
From the docs, the method WordCloud.fit_words() expects a dictionary as input.
Your error seems to highlight that it's looking for an attribute 'items', typically an attribute of dictionaries, but instead finds a list object.
So the problem is: lda.show_topic(t, 200) returns a list instead of a dictionary. Use dict() to cast it!
Finally:
plt.imshow(WordCloud().fit_words(dict(lda.show_topic(t, 200))))
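To see why the cast helps, here is a small sketch using made-up (word, weight) pairs in the shape that show_topic returns. The words and weights are hypothetical, not output from a trained model.

```python
# Hypothetical output in the shape show_topic returns: (word, weight) pairs.
topic_terms = [("graph", 0.21), ("trees", 0.17), ("minors", 0.09)]

# A list has no .items(), which is what the AttributeError complains about.
print(hasattr(topic_terms, "items"))

# dict() turns the pairs into a word -> weight mapping.
frequencies = dict(topic_terms)
print(hasattr(frequencies, "items"))
print(frequencies["graph"])
```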

Getting error while trying to save and apply existing machine learning model to new dataset?

I am trying to use this model: https://github.com/aninda052/Disasters-on-social-media-NLP/blob/master/Disasters%20on%20social%20media.ipynb
I searched for a way to save the model and use it with a new dataset in another application, and found that I can use pickle. I added it to the code like this:
import pickle
from sklearn.linear_model import LogisticRegression
model_tfidf = LogisticRegression(C=30.0, class_weight='balanced', solver='newton-cg',
                                 multi_class='multinomial', n_jobs=-1, random_state=5)
model_tfidf.fit(x_train_tfidf, y_train)
predicted_tfidf = model_tfidf.predict(x_test_tfidf)
Pkl_Filename = "Pickle_RL_Model.pkl"
with open(Pkl_Filename, 'wb') as file:
    pickle.dump(model_tfidf, file)
After that, I created a new project to load and use this model. The code is:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
import pickle
with open('Pickle_RL_Model.pkl', 'rb') as file:
    Pickled_LR_Model = pickle.load(file)
x=["hi disaster","flood disaster","cry sad bad ","srong storm"]
tfd=TfidfVectorizer()
new_data_vec=tfd.fit_transform(x)
Ypredict = Pickled_LR_Model.predict(new_data_vec)
but I got an error saying:
X has 8 features per sample; expecting 16988
I don't know what I did wrong, any help please.
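The error message points to a feature-count mismatch: the model was trained on 16988 TF-IDF features, while the new vectorizer, fitted on four short strings, produces only 8. One likely cause, judging from the code above, is that a fresh TfidfVectorizer is fitted on the new data instead of reusing the one fitted during training. The sketch below saves the fitted vectorizer together with the model and calls transform (not fit_transform) on new text. The toy data, variable names, and default model settings are assumptions for illustration, not the notebook's actual setup.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data standing in for the notebook's dataset.
train_texts = ["flood disaster in the city", "strong storm hits coast",
               "nice sunny day", "happy birthday party"]
train_labels = [1, 1, 0, 0]

# Fit the vectorizer on the training data only, then train the model.
vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(train_texts)
model = LogisticRegression().fit(x_train, train_labels)

# Persist both objects together (writing the bytes to a file works the same way).
blob = pickle.dumps({"vectorizer": vectorizer, "model": model})

# Later, in the other project: load both and reuse the fitted vocabulary.
loaded = pickle.loads(blob)
new_texts = ["hi disaster", "flood disaster"]
x_new = loaded["vectorizer"].transform(new_texts)  # transform, not fit_transform
predictions = loaded["model"].predict(x_new)
print(x_new.shape[1] == x_train.shape[1])
```

Because the loaded vectorizer keeps the training vocabulary, the new matrix has the same number of columns the model expects.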

Can't load HDF5 in python

I am following this tutorial: https://github.com/fastai/fastai/tree/master/courses/dl2/imdb_scripts
I downloaded the pre-trained model in part 3b.
I want to open the .h5 files and look/use the weights. I tried to use python to do this, but it is not opening.
Here’s the code I used:
import tables
import pandas as pd
filename = "…bwd_wt103.h5"
file = tables.open_file(filename)
Here’s the error:
OSError: HDF5 error back trace
File "C:\ci\hdf5_1525883595717\work\src\H5F.c", line 511, in H5Fopen
unable to open file
File "C:\ci\hdf5_1525883595717\work\src\H5Fint.c", line 1604, in H5F_open
unable to read superblock
File "C:\ci\hdf5_1525883595717\work\src\H5Fsuper.c", line 413, in H5F__super_read
file signature not found
End of HDF5 error back trace
Unable to open/create file 'C:/Users/Rishabh/Documents/School and Work/Classes/8
Fall2019/Senior Design/ULMFiT/Wiki Data/wt103/models/bwd_wt103.h5'
I also used The HDF Group HDF Viewer: https://support.hdfgroup.org/products/java/release/download.html
But that didn’t work either. It gave an error saying “Failed to open the file… Unsupported format”
Is there a way to load the weights in Python? I ultimately want to access the last layer of the stacked LSTMs to create word embeddings.
Thanks in advance.
That's because, despite the .h5 extension, the file is a torch (PyTorch) model rather than an HDF5 file. You can load it on your local machine using torch like so:
>>> import torch
>>> filename = "bwd_wt103.h5"
>>> f = torch.load(filename, map_location=torch.device('cpu'))
Now, let's explore it:
>>> type(f)
<class 'collections.OrderedDict'>
>>> len(f.keys())
15
>>> list(f.keys())
['0.encoder.weight',
'0.encoder_with_dropout.embed.weight',
'0.rnns.0.module.weight_ih_l0',
'0.rnns.0.module.bias_ih_l0',
'0.rnns.0.module.bias_hh_l0',
'0.rnns.0.module.weight_hh_l0_raw',
'0.rnns.1.module.weight_ih_l0',
'0.rnns.1.module.bias_ih_l0',
'0.rnns.1.module.bias_hh_l0',
'0.rnns.1.module.weight_hh_l0_raw',
'0.rnns.2.module.weight_ih_l0',
'0.rnns.2.module.bias_ih_l0',
'0.rnns.2.module.bias_hh_l0',
'0.rnns.2.module.weight_hh_l0_raw',
'1.decoder.weight']
You can access the weights of 0.rnns.2.module.weight_hh_l0_raw like so:
>>> wts = f['0.rnns.2.module.weight_hh_l0_raw']
>>> wts.shape
torch.Size([1600, 400])
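The "file signature not found" message in the traceback can also be checked by hand. An HDF5 file without a user block starts with the 8-byte signature \x89HDF\r\n\x1a\n, so reading the first bytes gives a quick hint about whether a file is HDF5 at all. The helper name and the two temporary test files below are hypothetical; this is only a sketch of the check, not a full format detector.

```python
import os
import tempfile

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # first 8 bytes of an HDF5 file (no user block)

def looks_like_hdf5(path):
    """Return True if the file starts with the HDF5 signature."""
    with open(path, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE

# Two hypothetical files: one with the signature, one without.
tmp_dir = tempfile.mkdtemp()
real_path = os.path.join(tmp_dir, "real.h5")
fake_path = os.path.join(tmp_dir, "fake.h5")
with open(real_path, "wb") as f:
    f.write(HDF5_SIGNATURE + b"\x00" * 16)
with open(fake_path, "wb") as f:
    f.write(b"\x80\x02not hdf5")

print(looks_like_hdf5(real_path))
print(looks_like_hdf5(fake_path))
```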

Pickle fit-object

I wrote a class where some data are fitted. Since the fitting takes very long when lots of data have to be fitted, I want to save the fit-object of this class so I do not have to repeat the fitting when I want to use the fitted data later. Using pickle, I get the following error calling the save method on an object:
AttributeError: Can't pickle local object 'ConstantModel.__init__.<locals>.constant'
I only have this problem when pickling the object after fitting; pickling works if I save the object before fitting.
Is there a way to pickle fitted data or is there a nice workaround?
import pickle
import lmfit

class pattern:
    def fitting(self):
        mod_total = lmfit.models.ConstantModel()
        pars_total = mod_total.guess(self.y, x=self.x)
        self.fit = mod_total.fit(self.y, pars_total, x=self.x)

    def save(self, path):
        with open(path, 'wb') as filehandler:
            pickle.dump(self, filehandler)
I found a solution for this problem: Using dill instead of pickle works (as I want it to do).
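For readers who cannot add dill as a dependency, one possible workaround is to exclude the unpicklable attribute from the pickled state and keep only plain data. The sketch below reproduces the failure with a hypothetical stand-in class (FakeModel, which stores a function defined inside __init__, as the error message suggests ConstantModel does) and then pickles only the extracted values. All class and attribute names here are assumptions for illustration and are not part of lmfit.

```python
import pickle

class FakeModel:
    """Hypothetical stand-in for a model that stores a locally defined function."""
    def __init__(self):
        def constant(x, c=0.0):  # local function: defined inside __init__
            return c
        self.func = constant

class Pattern:
    def __init__(self):
        self.fit = None
        self.best_values = None

    def fitting(self):
        self.fit = FakeModel()          # holds a reference to a local function
        self.best_values = {"c": 1.5}   # plain data extracted from the fit

    def __getstate__(self):
        # Drop the unpicklable attribute; keep only plain data.
        state = self.__dict__.copy()
        state["fit"] = None
        return state

p = Pattern()
p.fitting()

# Pickling the fit object directly fails because of the local function.
try:
    pickle.dumps(p.fit)
    direct_pickle_ok = True
except (AttributeError, pickle.PicklingError):
    direct_pickle_ok = False

# Pickling the wrapper works, since __getstate__ leaves the fit object out.
restored = pickle.loads(pickle.dumps(p))
print(direct_pickle_ok, restored.best_values)
```

The trade-off is that the restored object no longer carries the full fit result, only the values that were copied out before saving.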

LightGBM: loading from json

I am trying to load a LightGBM.Booster from a JSON file pointer, and can't find an example online.
import json
import lightgbm
import numpy as np
X_train = np.arange(0, 200).reshape((100, 2))
y_train = np.tile([0, 1], 50)
tr_dataset = lightgbm.Dataset(X_train, label=y_train)
booster = lightgbm.train({}, train_set=tr_dataset)
model_json = booster.dump_model()
with open('model.json', 'w+') as f:
    json.dump(model_json, f, indent=4)
with open('model.json') as f2:
    model_json = json.load(f2)
How can I create a lightGBM booster from f2 or model_json? This snippet only shows dumping to JSON. model_from_string might help but seems to require an instance of the booster, which I won't have before loading.
There is no method for creating a Booster directly from the JSON produced by dump_model(). I could not find one in the source code or the documentation, and there is no GitHub issue about it either.
Because of that, I just load models from a text file via
import lightgbm as lgb
gbm.save_model('model.txt')  # gbm is a trained Booster instance
# ...
bst = lgb.Booster(model_file='model.txt')
or use pickle to dump and load models:
import pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(gbm, f)
# ...
with open('model.pkl', 'rb') as f:
    gbm = pickle.load(f)
Unfortunately, pickle files are not human-readable (or, at least, their contents are not easy to inspect). But it's better than nothing.
