Hello, I am running the following experiment. First I created a TF-IDF vectorizer:
tfidf_vectorizer = TfidfVectorizer(min_df=10,ngram_range=(1,3),analyzer='word',max_features=500)
Then I vectorized the following list:
tfidf = tfidf_vectorizer.fit_transform(listComments)
My list of comments looks as follows:
listComments = ["hello this is a test","the car is red",...]
I tried to save the model as follows:
#Saving tfidf
with open('vectorizerTFIDF.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
I would like to use my vectorizer to apply the same tfidf to the following list:
lastComment = ["this is a car"]
Opening Model:
with open('vectorizerTFIDF.pickle', 'rb') as infile:
    tdf = pickle.load(infile)
vector = tdf.transform(lastComment)
However I am getting:
Traceback (most recent call last):
File "C:/Users/LDA_test/ldaTest.py", line 141, in <module>
vector = tdf.transform(lastComment)
File "C:\Program Files\Anaconda3\lib\site-packages\scipy\sparse\base.py", line 559, in __getattr__
raise AttributeError(attr + " not found")
AttributeError: transform not found
I hope someone can help me with this issue. Thanks in advance.
You've pickled the transformed matrix, not the vectorizer. You need to dump the transformer itself: pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)
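For completeness, a minimal corrected round trip using the same names and file as above:
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=10, ngram_range=(1, 3), analyzer='word', max_features=500)
tfidf = tfidf_vectorizer.fit_transform(listComments)

# Save the fitted vectorizer (the transformer), not the transformed matrix
with open('vectorizerTFIDF.pickle', 'wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)

# Load it back and transform new text with the same fitted vocabulary
with open('vectorizerTFIDF.pickle', 'rb') as infile:
    vectorizer = pickle.load(infile)
vector = vectorizer.transform(lastComment)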
I copied the code from DataCamp to try Naive Bayes classification on my own in Python 3.8, but when I run the code the interpreter gives this error:
Traceback (most recent call last):
File "c:\Users\USER\Desktop\DATA MINING\NaiveTest.py", line 34, in <module>
model.fit(features,label)
File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\naive_bayes.py", line 207, in fit
X, y = self._validate_data(X, y)
File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\base.py", line 433, in _validate_data
X, y = check_X_y(X, y, **check_params)
File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\utils\validation.py", line 814, in check_X_y
X = check_array(X, accept_sparse=accept_sparse,
File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\utils\validation.py", line 630, in check_array
raise ValueError(
ValueError: Expected 2D array, got scalar array instead:
array=<zip object at 0x0F2C4C28>.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I am posting the whole code because I'm not sure which part causes this, so I'm asking for help to solve it.
# Assigning features and label variables
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny','Rainy','Sunny','Overcast','Overcast','Rainy']
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']
play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
# Import LabelEncoder
from sklearn import preprocessing
#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
weather_encoded=le.fit_transform(weather)
print (weather_encoded)
temp_encoded=le.fit_transform(temp)
label=le.fit_transform(play)
print ("Temp:",temp_encoded)
print ("Play:",label)
#Combining weather and temp into a single list of tuples
features=zip(weather_encoded,temp_encoded)
print(list(zip(weather_encoded,temp_encoded)))
print([i for i in zip(weather_encoded,temp_encoded)])
from sklearn.naive_bayes import GaussianNB
#Create a Gaussian Classifier
model = GaussianNB()
# Train the model using the training sets
model.fit(features,label)
#Predict Output
predicted= model.predict([[0,2]]) # 0:Overcast, 2:Mild
print ("Predicted Value:", predicted)
Supposedly the result should be something like Predicted Value: [1],
but it gave the error above instead.
The problem is that features must be list-like when passed to model.fit; as written it is a zip object:
#Combining weather and temp into a single list of tuples
features=zip(weather_encoded,temp_encoded)
You need to convert features to a list, e.g.
#Combining weather and temp into a single list of tuples
features=list(zip(weather_encoded,temp_encoded))
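The underlying reason: in Python 3, zip() returns a one-shot iterator rather than a list, so scikit-learn's input validation sees it as a scalar-like object instead of a 2D array. A quick demonstration:
features = zip([0, 1], [2, 3])
print(list(features))  # [(0, 2), (1, 3)]
print(list(features))  # [] -- the iterator is exhausted after one pass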
I am trying to train my model using Glove. My code is as follows:
#!/usr/bin/env python3
from __future__ import print_function
import argparse
import pprint
import gensim
from glove import Glove
from tensorflow.python.keras.utils.data_utils import Sequence
def read_corpus(filename):
    delchars = [chr(c) for c in range(256)]
    delchars = [x for x in delchars if not x.isalnum()]
    delchars.remove(' ')
    delchars = ''.join(delchars)
    with open(filename, 'r') as datafile:
        for line in datafile:
            # str.translate takes a mapping table in Python 3
            yield line.lower().translate(str.maketrans('', '', delchars)).split(' ')
if __name__ == '__main__':
    base_path = "/home/hunzala_awan/vocab.pubmed1.txt"
    get_data = read_corpus(base_path)
    glove = Glove(no_components=100, learning_rate=0.05)
    glove.fit(get_data, epochs=10, verbose=True)
    pprint.pprint(glove.most_similar("cancer", number=10))
When I try to run this code, I get the following error:
Traceback (most recent call last):
File "mytest3.py", line 36, in
glove.fit(get_data, epochs=10, verbose=True)
File "/usr/local/lib/python3.5/dist-packages/glove/glove.py", line 86, in fit
shape = matrix.shape
AttributeError: 'generator' object has no attribute 'shape'
What am I missing? Any help in this issue will be highly appreciated.
Thanks in advance
I'm not familiar with Glove, but it seems that it can't fit from a generator function. You can try materializing the generator ahead of time by converting it to a list (this is more memory-consuming):
glove.fit(list(get_data), epochs=10, verbose=True)
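Note that, judging from the traceback (shape = matrix.shape), Glove.fit appears to expect a co-occurrence matrix rather than raw token lists. Assuming this is the glove-python package, the usual pattern builds that matrix with Corpus first; a sketch:
from glove import Corpus, Glove

corpus = Corpus()
# build the word co-occurrence matrix from the token lists
corpus.fit(list(read_corpus(base_path)), window=10)
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=10, verbose=True)
glove.add_dictionary(corpus.dictionary)  # required before most_similar lookups
pprint.pprint(glove.most_similar("cancer", number=10))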
I recently read a paper about UNet++, and I want to implement this structure with TensorFlow 2.0 and a customized Keras model. As the structure is complicated, I decided to manage the Keras layers with a dictionary. Everything went well in training, but an error occurred while saving the model. Here is minimal code that reproduces the error:
class DicModel(tf.keras.Model):
    def __init__(self):
        super(DicModel, self).__init__(name='SequenceEECNN')
        self.c = {}
        self.c[0] = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same'),
            tf.keras.layers.BatchNormalization()]
        )
        self.c[1] = tf.keras.layers.Conv2D(3, 3, activation='softmax', padding='same')

    def call(self, images):
        x = self.c[0](images)
        x = self.c[1](x)
        return x
X_train,y_train = load_data()
X_test,y_test = load_data()
class_weight.compute_class_weight('balanced',np.ravel(np.unique(y_train)),np.ravel(y_train))
model = DicModel()
model_name = 'test'
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='logs/'+model_name+'/')
early_stop_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss',patience=100,mode='min')
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss=tf.keras.losses.sparse_categorical_crossentropy,
              metrics=['accuracy'])
results = model.fit(X_train, y_train, batch_size=4, epochs=5, validation_data=(X_test, y_test),
                    callbacks=[tensorboard_callback, early_stop_callback],
                    class_weight=[0.2, 2.0, 100.0])
model.save_weights('model/'+model_name, save_format='tf')
The error information is:
Traceback (most recent call last):
File "/media/xrzhang/Data/ZHS/Research/CNN-TF2/learn_tf2/test_model.py", line 61, in \<module>
model.save_weights('model/'+model_name,save_format='tf')
File "/media/xrzhang/Data/ZHS/Research/CNN-TF2/venv/lib/python3.6/site-packages/tensorflow/python/keras/engine/network.py", line 1328, in save_weights
self.\_trackable_saver.save(filepath, session=session)
File "/media/xrzhang/Data/ZHS/Research/CNN-TF2/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/util.py", line 1106, in save
file_prefix=file_prefix_tensor, object_graph_tensor=object_graph_tensor)
File "/media/xrzhang/Data/ZHS/Research/CNN-TF2/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/util.py", line 1046, in \_save_cached_when_graph_building
object_graph_tensor=object_graph_tensor)
File "/media/xrzhang/Data/ZHS/Research/CNN-TF2/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/util.py", line 1014, in \_gather_saveables
feed_additions) = self.\_graph_view.serialize_object_graph()
File "/media/xrzhang/Data/ZHS/Research/CNN-TF2/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/graph_view.py", line 379, in serialize_object_graph
trackable_objects, path_to_root = self.\_breadth_first_traversal()
File "/media/xrzhang/Data/ZHS/Research/CNN-TF2/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/graph_view.py", line 199, in \_breadth_first_traversal
for name, dependency in self.list_dependencies(current_trackable):
File "/media/xrzhang/Data/ZHS/Research/CNN-TF2/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/graph_view.py", line 159, in list_dependencies
return obj.\_checkpoint_dependencies
File "/media/xrzhang/Data/ZHS/Research/CNN-TF2/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/data_structures.py", line 690, in \_\_getattribute\_\_
return object.\_\_getattribute\_\_(self, name)
File "/media/xrzhang/Data/ZHS/Research/CNN-TF2/venv/lib/python3.6/site-packages/tensorflow/python/training/tracking/data_structures.py", line 732, in \_checkpoint_dependencies
"ignored." % (self,))
ValueError: Unable to save the object {0: \<tensorflow.python.keras.engine.sequential.Sequential object at 0x7fb5c6c36588>, 1: \<tensorflow.python.keras.layers.convolutional.Conv2D object at 0x7fb5c6c36630>} (a dictionary wrapper constructed automatically on attribute assignment). The wrapped dictionary contains a non-string key which maps to a trackable object or mutable data structure.
If you don't need this dictionary checkpointed, wrap it in a tf.contrib.checkpoint.NoDependency object; it will be automatically un-wrapped and subsequently ignored.
tf.contrib.checkpoint.NoDependency seems to have been removed from TensorFlow 2.0 (https://medium.com/tensorflow/whats-coming-in-tensorflow-2-0-d3663832e9b8). How can I fix this issue? Or should I just give up on using a dictionary in a customized Keras model? Thank you for your time and help!
Use string keys. For some reason TensorFlow doesn't like int keys.
The exception message was incorrect in TensorFlow 2.0; it has been fixed in 2.2.
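A minimal sketch of that fix applied to the model above (only the dictionary keys change):
self.c = {}
self.c['0'] = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same'),
    tf.keras.layers.BatchNormalization()])
self.c['1'] = tf.keras.layers.Conv2D(3, 3, activation='softmax', padding='same')
and in call:
x = self.c['0'](images)
x = self.c['1'](x)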
Alternatively, you can avoid the problem by wrapping the c attribute like this:
from tensorflow.python.training.tracking.data_structures import NoDependency
self.c = NoDependency({})
For more details check this issue.
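In context that looks like the sketch below. One caveat: the error message above says a wrapped dict is "automatically un-wrapped and subsequently ignored", so layers stored this way may be excluded from checkpoint tracking; string keys are the safer fix if you need save_weights to capture them.
from tensorflow.python.training.tracking.data_structures import NoDependency

class DicModel(tf.keras.Model):
    def __init__(self):
        super(DicModel, self).__init__(name='SequenceEECNN')
        # the dict itself is no longer tracked as a checkpoint dependency
        self.c = NoDependency({})
        self.c[0] = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same'),
            tf.keras.layers.BatchNormalization()])
        self.c[1] = tf.keras.layers.Conv2D(3, 3, activation='softmax', padding='same')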
I am experiencing difficulty in this area: I get a ValueError in the following code, and I have tried solutions online but to no avail.
Here's my original code, which raises a string-to-float conversion error
(ValueError: could not convert string to float: '3,1,0,0,0,1,0,1,89874,49.99'):
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

training_data_df = pd.read_csv('./data/sales_data_training.csv')
scaler = MinMaxScaler(feature_range=(0,1))
scaled_training = scaler.fit_transform(training_data_df)
scaled_training_df = pd.DataFrame(scaled_training, columns=training_data_df.columns.values)
My CSV Data:
"critic_rating,is_action,is_exclusive_to_us,is_portable,is_role_playing,is_sequel,is_sports,suitable_for_kids,total_earnings,unit_price"
"3.5,1,0,1,0,1,0,0,132717,59.99"
"4.5,0,0,0,0,1,1,0,83407,49.99"...
'3,1,0,0,0,1,0,1,89874,49.99'
I have 10 columns of data across 1,000 rows (roughly 10,000 values, with the first row being the header).
Regards,
Yuki
The full error is as follows:
Traceback (most recent call last):
File "C:/Users/YukiKawaii/PycharmProjects/PandasTest/module2_NN/test.py", line 6, in <module>
scaled_training= scaler.fit_transform(training_data_df)
File "C:\Users\YukiKawaii\Python\Python35\lib\site-packages\sklearn\base.py", line 517, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "C:\Users\YukiKawaii\Python\Python35\lib\site-packages\sklearn\preprocessing\data.py", line 308, in fit
return self.partial_fit(X, y)
File "C:\Users\YukiKawaii\Python\Python35\lib\site-packages\sklearn\preprocessing\data.py", line 334, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "C:\Users\YukiKawaii\Python\Python35\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: '3,1,0,0,0,1,0,1,89874,49.99'
You should remove the "" and '' wrapped around each line in the CSV file.
By default pd.read_csv() treats quoted text as a single field, so with the quotes in place each line is read as one long string instead of ten separate values, and those strings cannot be converted to floats.
So the csv file should look as follows.
critic_rating,is_action,is_exclusive_to_us,is_portable,is_role_playing,is_sequel,is_sports,suitable_for_kids,total_earnings,unit_price
3.5,1,0,1,0,1,0,0,132717,59.99
4.5,0,0,0,0,1,1,0,83407,49.99
3,1,0,0,0,1,0,1,89874,49.99
I just verified by running your code after making the above change.
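If editing the file by hand is impractical, the same cleanup can be done in code; a small sketch (the path is the one from the question):
import io
import pandas as pd

# strip both quote styles, then parse the cleaned text directly
with open('./data/sales_data_training.csv') as f:
    cleaned = f.read().replace('"', '').replace("'", '')

training_data_df = pd.read_csv(io.StringIO(cleaned))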
Why am I getting this error? Please help; I am a newbie to machine learning.
This is my code, in which I apply lemmatization to the 20 newsgroups dataset.
The code aims to get the 500 words with the highest counts while applying filtering.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer
def letters_only(astr):
    return astr.isalpha()
cv = CountVectorizer(stop_words="english", max_features=500)
groups = fetch_20newsgroups()
cleaned = []
all_names = set(names.words())
lemmatizer = WordNetLemmatizer()
for post in groups.data:
    cleaned.append(' '.join([lemmatizer.lemmatize(word.lower()
                             for word in post.split()
                             if letters_only(word) and word not in all_names)]))
transformed = cv.fit_transform(cleaned)
print(cv.get_feature_names())
Error:
Traceback (most recent call last):
File "<ipython-input-91-7158a74bae71>", line 18, in <module>
for word in post.split()
File "C:\Program Files\Anaconda3\lib\site-packages\nltk\stem\wordnet.py", line 40, in lemmatize
lemmas = wordnet._morphy(word, pos)
File "C:\Program Files\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1712, in _morphy
forms = apply_rules([form])
File "C:\Program Files\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1692, in apply_rules
for form in forms
File "C:\Program Files\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1694, in <listcomp>
if form.endswith(old)]
AttributeError: 'generator' object has no attribute 'endswith'
The cause is a misplaced closing parenthesis: in the one-liner, lemmatizer.lemmatize(...) is called on the whole generator expression (word.lower() for word in ...) instead of on each individual word, which is why NLTK fails with 'generator' object has no attribute 'endswith'. Turning the one-liner into a regular for loop solves the problem:
for post in groups.data:
    words = []
    for word in post.split():
        if letters_only(word) and word not in all_names:
            words.append(lemmatizer.lemmatize(word.lower()))
    # keep one cleaned string per post so CountVectorizer sees each post as one document
    cleaned.append(' '.join(words))
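Equivalently, the original one-liner works once the closing parenthesis of lemmatize() is moved so that it wraps a single word rather than the whole generator expression:
for post in groups.data:
    cleaned.append(' '.join([lemmatizer.lemmatize(word.lower())
                             for word in post.split()
                             if letters_only(word) and word not in all_names]))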