I am working on topic inference on a new corpus given a previously derived lda model. I am able to load the model perfectly, while I am not able to load the id2word file to create the corpora.Dictionary object needed to map the new corpus into numbers: the load method returns a dict attribute error that I don't know why. Below is the minimal code that replicates the situation, and I have attached the code (and packages used) here.
Thank you in advance for your response...
import numpy as np
import os
import pandas as pd
import gensim
from gensim import corpora
import datetime
import nltk
model_name = "lda_sub_full_35"
dictionary_name = "lda_sub_full_35.id2word"
model_for_inference = gensim.models.LdaModel.load(model_name, mmap='r')
print('Successfully load the model')
lda_dictionary = corpora.Dictionary.load(dictionary_name, mmap='r')
I expect to have both the dictionary and the model loaded, but it turns out that when I load the dictionary, I got the below error:
File "topic_inference.py", line 31, in <module>
lda_dictionary = corpora.Dictionary.load(dictionary_name, mmap='r')
File "/topic_modeling/env/lib/python3.8/site-packages/gensim/utils.py", line 487, in load
obj._load_specials(fname, mmap, compress, subname)
AttributeError: 'dict' object has no attribute '_load_specials'```
How were the contents of the lda_sub_full_35.id2word file originally saved?
Only if it was saved by a Gensim corpora.Dictionary object's .save() method should it be loaded as you've tried, with corpora.Dictionary.load().
If, by any chance, it was just a plain Python dict saved via some other method of writing a pickle()-created object, then you would need to load it in a symmetrically-matched way. That might be as simple as:
import pickle
with open(path, 'rb') as f:
lda_dictionary = pickle.load(f)
I have a document in a text file in 2 lines as shown below. I wanted to apply tf-idf to it and I get the error as shown below, I am not sure where is int object in my file? why would it throw this error?
Env:
Jupter notebook, python 3.7
Error:
AttributeError: 'int' object has no attribute 'lower'
file.txt:
Random person from the random hill came to a running mill and I have a count of the hill. This is my house.
A person is from a great hill and he loves to run a mill.
Sub-disciplines of biology are defined by the research methods employed and the kind of system studied: theoretical biology uses mathematical methods to formulate quantitative models while experimental biology performs empirical experiments.
The objects of our research will be the different forms and manifestations of life, the conditions and laws under which these phenomena occur, and the causes through which they have been effected. The science that concerns itself with these objects we will indicate by the name biology.
Code:
import pandas as pd
import spacy
import csv
import collections
import sys
import itertools
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
from nltk.tokenize import sent_tokenize
from gensim import corpora, models
from stop_words import get_stop_words
from nltk.stem import PorterStemmer
data = pd.read_csv('file.txt', sep="\n", header=None)
data.dtypes
0 object
dtype: object
data.shape()
4, 1
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data)
print(X)
I solved it by reading the file like this:
with open('file.txt') as f:
lines = [line.rstrip() for line in f]
This seems like a very important issue for this library, and so far I don't see a decisive answer, although it seems like for the most part, the answer is 'No.'
Right now, any method that uses the transformer api in sklearn returns a numpy array as its results. Usually this is fine, but if you're chaining together a multi-step process that expands or reduces the number of columns, not having a clean way to track how they relate to the original column labels makes it difficult to use this section of the library to its fullest.
As an example, here's a snippet that I just recently used, where the inability to map new columns to the ones originally in the dataset was a big drawback:
numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
cat_columns = train.select_dtypes(include=np.object).columns.tolist()
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
transformers = [
('num', numeric_pipeline, numeric_columns),
('cat', cat_pipeline, cat_columns)
]
combined_pipe = ColumnTransformer(transformers)
train_clean = combined_pipe.fit_transform(train)
test_clean = combined_pipe.transform(test)
In this example I split up my dataset using the ColumnTransformer and then added additional columns using the OneHotEncoder, so my arrangement of columns is not the same as what I started out with.
I could easily have different arrangements if I used different modules that use the same API. OrdinalEncoer, select_k_best, etc.
If you're doing multi-step transformations, is there a way to consistently see how your new columns relate to your original dataset?
There's an extensive discussion about it here, but I don't think anything has been finalized yet.
Yes, you are right that there isn't a complete support for tracking the feature_names in sklearn as of now. Initially, it was decide to keep it as generic at the level of numpy array. Latest progress on the feature names addition to sklearn estimators can be tracked here.
Anyhow, we can create wrappers to get the feature names of the ColumnTransformer. I am not sure whether it can capture all the possible types of ColumnTransformers. But at-least, it can solve your problem.
From Documentation of ColumnTransformer:
Notes
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
Try this!
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import _VectorizerMixin
from sklearn.feature_selection._base import SelectorMixin
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer
train = pd.DataFrame({'age': [23,12, 12, np.nan],
'Gender': ['M','F', np.nan, 'F'],
'income': ['high','low','low','medium'],
'sales': [10000, 100020, 110000, 100],
'foo' : [1,0,0,1],
'text': ['I will test this',
'need to write more sentence',
'want to keep it simple',
'hope you got that these sentences are junk'],
'y': [0,1,1,1]})
numeric_columns = ['age']
cat_columns = ['Gender','income']
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
text_pipeline = make_pipeline(CountVectorizer(), SelectKBest(k=5))
transformers = [
('num', numeric_pipeline, numeric_columns),
('cat', cat_pipeline, cat_columns),
('text', text_pipeline, 'text'),
('simple_transformer', MinMaxScaler(), ['sales']),
]
combined_pipe = ColumnTransformer(
transformers, remainder='passthrough')
transformed_data = combined_pipe.fit_transform(
train.drop('y',1), train['y'])
def get_feature_out(estimator, feature_in):
if hasattr(estimator,'get_feature_names'):
if isinstance(estimator, _VectorizerMixin):
# handling all vectorizers
return [f'vec_{f}' \
for f in estimator.get_feature_names()]
else:
return estimator.get_feature_names(feature_in)
elif isinstance(estimator, SelectorMixin):
return np.array(feature_in)[estimator.get_support()]
else:
return feature_in
def get_ct_feature_names(ct):
# handles all estimators, pipelines inside ColumnTransfomer
# doesn't work when remainder =='passthrough'
# which requires the input column names.
output_features = []
for name, estimator, features in ct.transformers_:
if name!='remainder':
if isinstance(estimator, Pipeline):
current_features = features
for step in estimator:
current_features = get_feature_out(step, current_features)
features_out = current_features
else:
features_out = get_feature_out(estimator, features)
output_features.extend(features_out)
elif estimator=='passthrough':
output_features.extend(ct._feature_names_in[features])
return output_features
pd.DataFrame(transformed_data,
columns=get_ct_feature_names(combined_pipe))
I am learning on how to implement nlp, so i started with data cleaning and now i am trying to Vectorizing Data using bag-of-words, this is my code
import pandas as pd
import numpy as np
import string
import re
import nltk
stopword=nltk.corpus.stopwords.words('english')
wn=nltk.WordNetLemmatizer()
from sklearn.feature_extraction.text import CountVectorizer
count_vect=CountVectorizer(analyzer=clean_text)
x_count=count_vect.fit_transform(lematizing_words)
print(x_count.shape)
but, when i run this code i get the following error
NameError: name 'clean_text' is not defined
how can i solve this?
i have referred this blog for the nlp implementation
def cleanText(text):
text = "".join([word.lower() for word in text if word not in string.punctuation])
tokens = re.split('\W+', text)
text = [ps.stem(word) for word in tokens if word not in stopwords]
return text
stopwords = nltk.corpus.stopwords.words('english')
Here is the function that the Badreesh put into github but is not in the blog.
The error message describes well your problem. What is clean_text ? Have you defined a clean_text function? Or imported the right python module containing this function?
I am trying to use this model https://github.com/aninda052/Disasters-on-social-media-NLP/blob/master/Disasters%20on%20social%20media.ipynb
, I searched for a way to save this model and use it with new dataset in other application an I find out use pickle, and I add this to code like this
import pickle
model_tfidf=LogisticRegression( C=30.0,class_weight='balanced', solver='newton-cg',
multi_class='multinomial', n_jobs=-1, random_state=5)
model_tfidf.fit(x_train_tfidf, y_train)
predicted_tfidf=model_tfidf.predict(x_test_tfidf)
Pkl_Filename = "Pickle_RL_Model.pkl"
with open(Pkl_Filename, 'wb') as file:
pickle.dump(model_tfidf, file)
after that I tried to create new project to load and use this model and the code is:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
import pickle
with open('Pickle_RL_Model.pkl', 'rb') as file:
Pickled_LR_Model = pickle.load(file)
x=["hi disaster","flood disaster","cry sad bad ","srong storm"]
tfd=TfidfVectorizer()
new_data_vec=tfd.fit_transform(x)
Ypredict = Pickled_LR_Model.predict(new_data_vec)
but I got error said:
X has 8 features per sample; expecting 16988
I don't know what I did wrong, any help please.