FastText: Can't get cross_validation - scikit-learn

I am struggling to implement FastText (FTTransformer) into a Pipeline that iterates over different vectorizers. More particular, I can't get cross-validation scores. Following code is used:
%%time
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
from gensim.sklearn_api.ftmodel import FTTransformer
np.random.seed(0)
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split(data.text, data.label, random_state=0)
w2v_texts = [simple_preprocess(doc) for doc in X_train]
models = [FTTransformer(size=10, min_count=0, seed=42)]
classifiers = [LogisticRegression(random_state=0)]
for model in models:
for classifier in classifiers:
model.fit(w2v_texts)
classifier.fit(model.transform(X_train), y_train)
pipeline = Pipeline([
('vec', model),
('clf', classifier)
])
print(pipeline.score(X_train, y_train))
#print(model.gensim_model.wv.most_similar('kirk'))
cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
KeyError: 'all ngrams for word "Machine learning can be useful
branding sometimes" absent from model'
How can the problem be solved?
Sidenote: My other pipelines with D2VTransformer or TfIdfVectorizer work just fine. Here, I can simply apply pipeline.fit(X_train, y_train) after defining the pipeline, instead of the two fits as shown above. It seems like FTTransformer doesn't integrate so well with other given vectorizers?

Yes, to be used in a pipeline, FTTransformer needs to be modified to split documents to words inside its fit method. One can do it as follows:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
from gensim.sklearn_api.ftmodel import FTTransformer
np.random.seed(0)
class FTTransformer2(FTTransformer):
def fit(self, x, y):
super().fit([simple_preprocess(doc) for doc in x])
return self
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split(data.text, data.label, random_state=0)
classifiers = [LogisticRegression(random_state=0)]
for classifier in classifiers:
pipeline = Pipeline([
('ftt', FTTransformer2(size=10, min_count=0, seed=0)),
('clf', classifier)
])
score = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
print(score)

Related

LogisticRegression classifier

I need to use Logistic Regression classifier I have dataset the length of each column 2000 this is all my code:
from statistics import mode
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.metrics import plot_confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
# Importing the datasets
###Social_Network_Ads
datasets = pd.read_csv('C:/Users/n3.csv',header=None)
X = datasets.iloc[:, 0:5].values
Y = datasets.iloc[:, 5].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
# instantiate the model (using the default parameters)
model = LogisticRegression()
# fit the model with data
model.fit(X_Train, Y_Train)
predicted = cross_val_predict(mode, X_Train, Y_Train, cv=5)
train_acc = model.score(X_Train, Y_Train)
print("The Accuracy for Training Set is {}".format(train_acc*100))
But in I got on this error:
TypeError: Cannot clone object '<function mode at 0x000000FD6579B9D0>'
(type <class 'function'>): it does not seem to be a scikit-learn
estimator as it does not implement a 'get_params' method.
How solve this?
Change this line
predicted = cross_val_predict(mode, X_Train, Y_Train, cv=5)
to
predicted = cross_val_predict(model, X_Train, Y_Train, cv=5)
You have a simple typo. You want to pass your estimator to the function but instead you passed mode which is imported from statistics. That's why the error tells you that it can not clone an object of type function. You are passing a function but it expects an estimator.

IBM Cloud Data Pak

How can I export a trained model using pickle in IBM Cloud Data Pak?
Example:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=10)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train, y_train)
lr.score(x_test, y_test)
# What I did is below and I can't find the model saved anywhere
import pickle
with open('model.pickle', 'wb') as f:
pickle.dump(lr,f)

Is there a way to check(or print) the scaled X_train at RandomizedSearchCV in sklearn?

I wonder if there is a way to check(or to print) the scaled(StandardScaler) X_train
after the RandomizedSearchCV or pipeline.
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
# get some data
iris = load_digits()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=777)
# specify parameters and distributions to sample from
param_dist = {
'rbf_svm__C': [1, 10, 100, 1000],
'rbf_svm__gamma': [0.001, 0.0001],
'rbf_svm__kernel': ['rbf', 'linear'],
}
# create pipeline with a scaler
steps01 = [('scaler', StandardScaler()), ('rbf_svm', SVC())]
pipeline01 = Pipeline(steps01)
# do search
search01 = RandomizedSearchCV(pipeline01, param_distributions=param_dist, n_iter=50)
search01.fit(x_train, y_train)
print('best parameters with scaler : ', search01.best_params_)
print('best score with scaler : ', search01.best_score_)

How can I get the history of the KerasRegressor?

I want to get KerasRegressor history but all the time I get (...) object has no attribute 'History'
'''
# Regression Example With Boston Dataset: Standardized and Wider
import numpy as np
from pandas import read_csv
from keras.models import Sequential
from keras.layers import Dense
#from keras.wrappers.scikit_learn import KerasRegressor
from scikeras.wrappers import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import keras.backend as K
# load dataset
dataframe = read_csv("Data 1398-2.csv")
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:10]
Y = dataset[:,10]
############
from sklearn import preprocessing
from sklearn.metrics import r2_score
min_max_scaler = preprocessing.MinMaxScaler()
X_scale = min_max_scaler.fit_transform(X)
from sklearn.model_selection import train_test_split
X_train, X_val_and_test, Y_train, Y_val_and_test = train_test_split(X_scale, Y, test_size=0.25)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_and_test, Y_val_and_test, test_size=0.55)
##################
# define wider model
def wider_model():
# create model
model = Sequential()
model.add(Dense(40, input_dim=10, kernel_initializer='normal', activation='relu'))
model.add(Dense(20, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# Compile model
model.compile(loss='mean_squared_error',metrics=['mae'], optimizer='adam')
#history = model.fit(X, Y, epochs=10, batch_size=len(X), verbose=1)
return model
# evaluate model with standardized dataset
from keras.callbacks import History
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp',KerasRegressor(model=wider_model, epochs=100, batch_size=2, verbose=0) ))
pipeline = Pipeline(estimators)
kfold = KFold(n_splits=5)
results = cross_val_score(pipeline, X_train, Y_train, cv=kfold)
print("Wider: %.2f (%.2f) MSE" % (results.mean(), results.std()))
import matplotlib.pyplot as plt
#plt.plot(history.history['loss'])
#plt.plot(history.history['val_loss'])
#plt.title('Model loss')
#plt.ylabel('Loss')
#plt.xlabel('Epoch')
#plt.legend(['Train', 'Val'], loc='upper right')
#plt.show()
'''
Model is at index 1 in your case, but you can also find it. Now to get history object:
pipeline.steps[1][1].model.history.history
If you are sure that Keras Model is always the last estimator, you can also use:
pipeline._final_estimator.model.history.history

Confusion Matrix in SkLearn showing error

I am trying to plot a confusion matrix for my classification model given the iris dataset. However, I keep getting an error. I hope someone can guide.Thanks
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
from sklearn.metrics import confusion_matrix
def train_and_predict(train_input_features, train_outputs, prediction_features):
classifier=tree.DecisionTreeClassifier()
classifier.fit(train_input_features,train_outputs)
predictions=classifier.predict(prediction_features)
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,test_size=0.3, random_state=0)
y_pred = train_and_predict(X_train, y_train, X_test)
print(confusion_matrix(y_test, predictions))
OUT: NameError: name 'predictions' is not defined
I found out that I needed to paste the code within the function,i.e.:
def train_and_predict(train_input_features, train_outputs, prediction_features):
classifier=tree.DecisionTreeClassifier()
classifier.fit(train_input_features,train_outputs)
predictions=classifier.predict(prediction_features)
print(predictions)
print('Confusion matrix\n',confusion_matrix(y_test,classifier.predict(X_test)))

Resources