Error when exporting predictions of 4 machine learning models - scikit-learn

I am training and testing my data with k-fold cross-validation (k = 10) and 4 different models. For each model, I would like to export the predictions and the correct classes for each split.
This is my code and the result:
for train_index, test_index in kf.split(X, labels):
    print('TRAIN:', train_index, 'TEST:', test_index)
    X_train, X_val = X[train_index], X[test_index]
    y_train, y_val = labels[train_index], labels[test_index]
    model1 = LinearSVC()
    model2 = MultinomialNB()
    model3 = LogisticRegression()
    model4 = RandomForestClassifier()
    model1.fit(X_train, y_train)
    model2.fit(X_train, y_train)
    model3.fit(X_train, y_train)
    model4.fit(X_train, y_train)
    result1 = model1.predict(X_val)
    result2 = model2.predict(X_val)
    result3 = model3.predict(X_val)
    result4 = model4.predict(X_val)
    df = pd.DataFrame(data={"id": X_val, "Prediction": y_val})
    df.to_excel('result.xlsx')
So far I have this, but it only exports the first lines (1-198) and I do not understand the export. Could you help me? I have approximately 2000 sentences.

When you set n_splits=10 in KFold, the .split() method splits your dataset into 10 portions. On each iteration, test_index holds the indices of the i-th portion while train_index holds the indices of the remaining 9 portions.
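To see this concretely, here is a tiny standalone check (the toy data is illustrative, not from the question):

from sklearn.model_selection import KFold
import numpy as np

X_toy = np.arange(20)          # 20 samples, so each of the 10 folds holds 2
kf_toy = KFold(n_splits=10)
for train_index, test_index in kf_toy.split(X_toy):
    print(len(train_index), len(test_index))   # prints "18 2" on each of the 10 iterations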
In your original code, df shows the test set (X_val, y_val) for each iteration, not the predictions.
I am not sure what you intend to do, but if you would like to see the predictions of each model, the following code will do it:
df = pd.DataFrame(data={
    "id": [],
    "ground_true": [],
    "original_sentence": [],
    "pred_model1": [],
    "pred_model2": [],
    "pred_model3": [],
    "pred_model4": []})

for train_index, test_index in kf.split(X, labels):
    print('TRAIN:', train_index, 'TEST:', test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
    model1 = LinearSVC()
    model2 = MultinomialNB()
    model3 = LogisticRegression()
    model4 = RandomForestClassifier()
    model1.fit(X_train, y_train)
    model2.fit(X_train, y_train)
    model3.fit(X_train, y_train)
    model4.fit(X_train, y_train)
    result1 = model1.predict(X_test)
    result2 = model2.predict(X_test)
    result3 = model3.predict(X_test)
    result4 = model4.predict(X_test)
    temp_df = pd.DataFrame(data={
        "id": X_test,
        "ground_true": y_test,
        "original_sentence": verbatim_train_remove_stop_words[test_index],
        "pred_model1": result1,
        "pred_model2": result2,
        "pred_model3": result3,
        "pred_model4": result4})
    df = pd.concat([df, temp_df])
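The loop above only accumulates df in memory. To actually export everything (the original goal), write the file once after the loop finishes; writing inside the loop, as in the original code, overwrites result.xlsx on every iteration, which is why only one fold's rows ever show up. A minimal sketch, assuming an Excel engine such as openpyxl is installed (the filename is arbitrary):

df.to_excel('all_folds_predictions.xlsx', index=False)  # one file containing every fold's rows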

Related

Log loss computed manually diverging from scikit-learn's cross_val_score

I have a question about how cross_val_score() from scikit-learn works. I tried to divide the dataset into 10 folds with KFold() and compute the log loss on both the training and validation sets for each fold. However, I got different answers when using cross_val_score with the parameter scoring='neg_log_loss'.
X and y are arrays of shape (1800, 12) and (1800, 1), respectively.
kfold = KFold(n_splits=10)
train_loss = []
val_loss = []
for train_index, val_index in kfold.split(X, y):
    clf_logreg = LogisticRegression()
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    clf_logreg.fit(X_train, y_train)
    y_train_pred = clf_logreg.predict(X_train)
    y_val_pred = clf_logreg.predict(X_val)
    train_loss.append(log_loss(y_train, y_train_pred))
    val_loss.append(log_loss(y_val, y_val_pred))

clf_logreg.fit(X, y)
y_error = cross_val_score(clf_logreg, X, y, cv=kfold, scoring='neg_log_loss')
print("cross_val log_loss: ", -y_error)
print("\ntrain_loss: ", train_loss)
print("\nval_loss: ", val_loss)
The answers I got:
cross_val log_loss: [0.18546779 0.18002459 0.21591202 0.15872213 0.22852112 0.18766844
0.28641203 0.14923009 0.21446935 0.20373971]
train_loss: [2.79298449379999, 2.7290223160363962, 2.558457002245472, 2.835624958485065, 2.5797806896386337, 2.622420660745048, 2.5797797024813125, 2.6224201671663874, 2.5797782217453302, 2.6863818513513213]
val_loss: [1.9188431218680995, 2.1107385395747733, 3.645826363693089, 2.110734097366828, 3.2620355282797417, 2.686367043991502, 3.453913177154633, 2.4944849529086657, 2.8782624616981765, 2.4944938373245567]
As Ben Reiniger noted in the comments, log_loss expects probabilities in y_train_pred and y_val_pred. So you need to change
clf_logreg.predict
to:
clf_logreg.predict_proba
Example:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
y = y == 1

kfold = KFold(n_splits=10, random_state=1, shuffle=True)
train_loss = []
val_loss = []
for train_index, val_index in kfold.split(X, y):
    clf_logreg = LogisticRegression()
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    clf_logreg.fit(X_train, y_train)
    y_train_pred = clf_logreg.predict_proba(X_train)
    y_val_pred = clf_logreg.predict_proba(X_val)
    train_loss.append(log_loss(y_train, y_train_pred))
    val_loss.append(log_loss(y_val, y_val_pred))

clf_logreg.fit(X, y)
y_error = cross_val_score(clf_logreg, X, y, cv=kfold, scoring="neg_log_loss")
print("cross_val log_loss: ", -y_error)
print("\nval_loss: ", val_loss)
Results:
cross_val log_loss: [0.53548503 0.54200945 0.60324094 0.64781483 0.43323992 0.37625601
0.55101127 0.46172226 0.50216316 0.64359642]
val_loss: [0.5354850268015129, 0.5420094471965571, 0.6032409439788419, 0.647814828089315, 0.43323991804482626, 0.3762560144867495, 0.5510112741331039, 0.46172225526408, 0.5021631570133954, 0.6435964210060579]
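The gap between the two sets of numbers is expected: log_loss punishes confident mistakes heavily, and clf_logreg.predict returns hard 0/1 labels that act like probabilities of exactly 0 or 1, so every misclassified sample contributes a near-maximal penalty. A tiny illustration (toy numbers, not from the question):

from sklearn.metrics import log_loss

print(log_loss([1, 0, 1], [0.99, 0.01, 0.01]))  # ~1.54, one confident mistake dominates
print(log_loss([1, 0, 1], [0.9, 0.2, 0.4]))     # ~0.41, the same mistake, hedged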

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType' when implementing ABC

I am implementing artificial bee colony optimization for an ANN using [this][1] API, but I am getting this error. This is my code:
def ANN(optimizer="adam", neurons=32, batch_size=32, epochs=50,
        activation="relu", patience=5, loss='mse'):
    model = Sequential()
    model.add(Dense(neurons, input_dim=look_back, activation=activation))
    model.add(Dense(neurons, activation=activation))
    model.add(Dense(1))
    model.compile(optimizer=optimizer, loss=loss)
    early_stopping = EarlyStopping(monitor="loss", patience=patience)
    history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
                        callbacks=[early_stopping], verbose=0)
    return model

boundaries = [(0,2), (0,2), (0,2), (0,2), (10,100), (20,50), (3,20)]

def performance(x_train, y_train, x_test, y_test, optimizer=None, activation=None,
                loss=None, batch_size=None, neurons=None, epochs=None, patience=None):
    model = ANN(optimizer=optimizer, activation=activation, loss=loss,
                batch_size=batch_size, neurons=neurons, epochs=epochs, patience=patience)
    trainScore = model.evaluate(x_train, y_train, verbose=0)
    print('Train Score: %.2f MSE (%.2f RMSE)' % (trainScore, math.sqrt(trainScore)))
    testScore = model.evaluate(x_test, y_test, verbose=0)
    print('Test Score: %.2f MSE (%.2f RMSE)' % (testScore, math.sqrt(testScore)))
    trainPredict = model.predict(x_train)
    testPredict = model.predict(x_test)
    # calculate mean absolute percent error
    trainMAPE = mean_absolute_error(y_train, trainPredict)
    testMAPE = mean_absolute_error(y_test, testPredict)
    return print('testMAPE: %.2f MAPE' % trainMAPE), print('testMAPE: %.2f MAPE' % testMAPE)

writer = pd.ExcelWriter('/content/Scores.xlsx')
for sheetNum in range(1, 5):
    dataframe = pd.read_excel('Fri.xlsx', sheet_name='Sheet'+str(sheetNum))
    # load the dataset
    dataset = dataframe.values
    dataset = dataset.astype('float32')
    train_size = int(len(dataset) * 0.48)
    test_size = len(dataset) - train_size
    train, test = dataset[0:train_size, :], dataset[train_size:len(dataset), :]
    # reshape into X=t and Y=t+1
    look_back = 10
    x_train, y_train = create_dataset(train, look_back)
    x_test, y_test = create_dataset(test, look_back)
    # normalize the dataset
    scaler = MinMaxScaler(feature_range=(0, 1))
    x_train = scaler.fit_transform(x_train)
    x_test = scaler.fit_transform(x_test)
    abc_obj = abc(performance(x_train, y_train, x_test, y_test), boundaries)
    abc_obj.fit()
    # Get solution obtained after fit() execution:
    solution = abc_obj.get_solution()
This is my error:
TypeError Traceback (most recent call last)
<ipython-input-38-f9098d8d18fc> in <module>()
23 x_train = scaler.fit_transform(x_train)
24 x_test = scaler.fit_transform(x_test)
---> 25 abc_obj = abc(performance(x_train, y_train, x_test, y_test), boundaries)
26 abc_obj.fit()
27
2 frames
/usr/local/lib/python3.7/dist-packages/keras/layers/core.py in __init__(self, units, activation, use_bias, kernel_initializer, bias_initializer, kernel_regularizer, bias_regularizer, activity_regularizer, kernel_constraint, bias_constraint, **kwargs)
1144 activity_regularizer=activity_regularizer, **kwargs)
1145
-> 1146 self.units = int(units) if not isinstance(units, int) else units
1147 self.activation = activations.get(activation)
1148 self.use_bias = use_bias
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
Can you help me with this, please? I think I am not defining the function performance correctly, but I don't understand how to make it better.
[1]: https://pypi.org/project/beecolpy/
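For what it's worth (no answer is shown for this question): the traceback occurs because performance(x_train, y_train, x_test, y_test) is executed immediately, so ANN runs once with every hyperparameter left at its None default and Dense(units=None) raises the TypeError; abc would then receive the call's return value (a tuple of Nones from the print calls) instead of a callable. A minimal sketch of the usual pattern, assuming beecolpy calls the objective with one list of floats matching boundaries; the float-to-hyperparameter mapping below is a guess, not from the question:

# hypothetical mapping from a 7-float candidate to hyperparameters
optimizers = ['adam', 'rmsprop', 'sgd']
activations = ['relu', 'tanh', 'sigmoid']
losses = ['mse', 'mae', 'msle']

def objective(params):
    # params: one candidate solution, 7 floats inside `boundaries`
    model = ANN(optimizer=optimizers[int(params[0])],
                activation=activations[int(params[1])],
                loss=losses[int(params[2])],
                batch_size=32,               # the params[3] slot is left unused here
                neurons=int(params[4]),
                epochs=int(params[5]),
                patience=int(params[6]))
    return model.evaluate(x_test, y_test, verbose=0)  # value for abc to minimize

abc_obj = abc(objective, boundaries)  # pass the function itself, do not call it
abc_obj.fit()
solution = abc_obj.get_solution()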

Predicting Future values with Keras LSTM

I have created an LSTM sales prediction model that works really well on the train and test sets. I would now like to predict beyond the dates in the entire dataset.
I have tried following this answer (how to use the Keras model to forecast for future dates or events?), but I really can't figure out how to adjust my code to make future predictions.
Also, I changed my code from
X_train, y_train = train_set_scaled[:, 1:], train_set_scaled[:, 0:1]
X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])
X_test, y_test = test_set_scaled[:, 1:], test_set_scaled[:, 0:1]
X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1])
to
X_train, y_train = train_set_scaled[:, 1:], train_set_scaled[:, 1:8]
X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])
X_test, y_test = test_set_scaled[:, 1:], test_set_scaled[:, 1:8]
X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1])
after trying the solution in Keras time series can I predict next 6 month in one time
Here is the code where the training and modelling happen:
# changed to initial
for df in m:
    train_set, test_set = m[df][0:-6].values, m[df][-6:].values
    # apply Min Max Scaler
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaler = scaler.fit(train_set)
    # reshape training set
    train_set = train_set.reshape(train_set.shape[0], train_set.shape[1])
    train_set_scaled = scaler.transform(train_set)
    # reshape test set
    test_set = test_set.reshape(test_set.shape[0], test_set.shape[1])
    test_set_scaled = scaler.transform(test_set)
    # build the LSTM model
    X_train, y_train = train_set_scaled[:, 1:], train_set_scaled[:, 0:1]
    X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])
    X_test, y_test = test_set_scaled[:, 1:], test_set_scaled[:, 0:1]
    X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1])
    print('Fitting model for: {}'.format(df))
    # fit the LSTM model
    model = Sequential()
    model.add(LSTM(4, batch_input_shape=(1, X_train.shape[1], X_train.shape[2]), stateful=True))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    model.fit(X_train, y_train, epochs=500, batch_size=1, verbose=1, shuffle=False)
    # model.save('lstm_model.h5')
    print('Predictions for: {}'.format(df))
    # check prediction
    y_pred = model.predict(X_test, batch_size=1)
    print('Inverse Transform for: {}'.format(df))
    # inverse transformation to see actual sales
    # reshape y_pred
    y_pred = y_pred.reshape(y_pred.shape[0], 1, y_pred.shape[1])
    # rebuild test set for inverse transform
    pred_test_set = []
    for index in range(0, len(y_pred)):
        print(np.concatenate([y_pred[index], X_test[index]], axis=1))
        pred_test_set.append(np.concatenate([y_pred[index], X_test[index]], axis=1))
    # reshape pred_test_set
    pred_test_set = np.array(pred_test_set)
    pred_test_set = pred_test_set.reshape(pred_test_set.shape[0], pred_test_set.shape[2])
    # inverse transform
    pred_test_set_inverted = scaler.inverse_transform(pred_test_set)
I would like the predictions to go beyond the data in the dataset.
UPDATE: I trained the model and took its predictions on the test set, then used these as input for another LSTM model to fit and predict 12 months ahead. It worked for me. I also changed my last Dense layer (above) to predict 1 point at a time instead of 7, as I had before.
Below is the code:
from numpy import array

for df in d:
    if df in list_df:
        # df_ADIDAS DYN PUL DEO 150 FCA5421
        # KEEP
        result_list = []
        sales_dates = list(d["{}".format(df)][-7:].Month)
        act_sales = list(d["{}".format(df)][-7:].Sale)
        for index in range(0, len(pred_test_set_inverted)):
            result_dict = {}
            result_dict['pred_value'] = int(pred_test_set_inverted[index][0] + act_sales[index])  # changed to 0 from act_sales[index]
            result_dict['date'] = sales_dates[index]  # >>> REVIEW
            result_list.append(result_dict)
        df_result = pd.DataFrame(result_list)
        predictions = list(df_result['pred_value'])
        forecasts = []
        for i in range(len(result_list)):
            forecasts.append(result_list[i]['pred_value'])

        def split_sequence(sequence, n_steps):
            X, y = list(), list()
            for i in range(len(sequence)):
                # find the end of this pattern
                end_ix = i + n_steps
                # check if we are beyond the sequence
                if end_ix > len(sequence) - 1:
                    break
                # gather input and output parts of the pattern
                seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
                X.append(seq_x)
                y.append(seq_y)
            return array(X), array(y)

        # choose a number of time steps
        n_steps = 4
        # split into samples
        X, y = split_sequence(forecasts, n_steps)
        # summarize the data
        # for i in range(len(X)):
        #     print(X[i], y[i])
        n_features = 1
        X = X.reshape((X.shape[0], X.shape[1], n_features))
        # define model
        model = Sequential()
        model.add(LSTM(50, activation='relu', input_shape=(n_steps, n_features)))
        model.add(Dense(1))
        model.compile(optimizer='adam', loss='mse')
        # fit model
        model.fit(X, y, epochs=200, verbose=0)
        # demonstrate prediction
        x_input = array(predictions[-4:])
        x_input = x_input.reshape((1, n_steps, n_features))
        yhat = model.predict(x_input, verbose=0)
        # print(yhat)
        currentStep = yhat[:, -1:]
        print('Twelve Month Prediction for {}'.format(df))
        for i in range(12):
            if i == 0:
                x_input = x_input.reshape((1, n_steps, n_features))
                yhat = model.predict(x_input, verbose=0)
                print(yhat)
            else:
                x0_input = np.append(x_input, [currentStep[i-1]])
                x0_input = x0_input.reshape((1, n_steps+1, n_features))
                x_input = x0_input[:, 1:]
                yhat = model.predict(x_input)
                currentStep = np.append(currentStep, yhat[:, -1:])
                print(yhat)
Your last Dense layer says that you are predicting 7 points at a time. Save those predictions and feed them to the model again to predict the next 7; that gives you 14 predictions, and so on. Alternatively, change the number of output nodes and the shape of y from 7 to the corresponding number and train again.
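A minimal sketch of that feedback loop, assuming a trained Keras model that maps the last n_steps values to a single next value (the names are illustrative, not from the question):

import numpy as np

def roll_forecast(model, history, n_steps, horizon=12):
    # slide a window over the series, feeding each prediction
    # back in as the newest observation
    window = list(history[-n_steps:])
    preds = []
    for _ in range(horizon):
        x = np.array(window[-n_steps:]).reshape(1, n_steps, 1)
        yhat = float(model.predict(x, verbose=0)[0, 0])
        preds.append(yhat)
        window.append(yhat)
    return preds

Each step beyond the data is conditioned on earlier model outputs, so errors compound and the uncertainty grows with the horizon.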

Always getting an accuracy of 1 - how to fix it?

I'm trying to apply logistic regression to my dataset, but it gives an accuracy of 1.
df = pd.read_csv("train.csv", header=0)
df = df[["PassengerId", "Survived", "Sex", "Age", "Embarked"]]
df.dropna(inplace=True)
X = df[["Sex", "Age"]]
X_train = np.array(X)
Y = df["Survived"]
Y_train = np.array(Y)

clf = LogisticRegression()
clf.fit(X_train, Y_train)

df1 = pd.read_csv("test.csv", header=0)
df1 = df1[["PassengerId", "Survived", "Sex", "Age", "Embarked"]]
df1.dropna(inplace=True)
X = df1[["Sex", "Age"]]
X_test = np.array(X)
Y = df1["Survived"]
Y_test = np.array(Y)

# convert string data to float
X_test = X_test.astype(float)
Y_test = Y_test.astype(float)

accuracy = clf.score(X_test, Y_test)
print("Accuracy = ", accuracy)
I expect the output to be between 0 and 1, but I always get 1.0.
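No answer is shown for this question, but a common sanity check for a suspicious 1.0 is to hold out part of the labelled train.csv itself, rather than scoring on a separate file whose labels may not be independent of the training data. A minimal sketch, assuming Sex has already been encoded numerically:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# hold out 25% of the training data for an honest accuracy estimate
X_tr, X_te, y_tr, y_te = train_test_split(X_train, Y_train,
                                          test_size=0.25, random_state=0)
clf = LogisticRegression()
clf.fit(X_tr, y_tr)
print("Held-out accuracy:", clf.score(X_te, y_te))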

invalid literal for int() with base 10 with a GRU model

My input is simply a CSV file with 50K rows and two columns for Arabic sentiment analysis, but I keep getting the error below while trying to train my data on a stacked GRU model:
ValueError: invalid literal for int() with base 10: 'اللهم اني احسن
التدبير فادبر امري'
X_train, X_test, y_train, y_test = train_test_split(df.text, df.sentiment,
                                                    test_size=0.1, random_state=37)
assert X_train.shape[0] == y_train.shape[0]
assert X_test.shape[0] == y_test.shape[0]

tk = Tokenizer(num_words=NB_WORDS,
               filters='!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~\t\n',
               lower=True,
               split=" ")
tk.fit_on_texts(X_train)

X_train_seq = tk.texts_to_sequences(X_train)
X_test_seq = tk.texts_to_sequences(X_test)

def one_hot_seq(seqs, nb_features=NB_WORDS):
    ohs = np.zeros((len(seqs), nb_features))
    for i, s in enumerate(seqs):
        ohs[i, s] = 1.
    return ohs

X_train_oh = one_hot_seq(X_train_seq)
X_test_oh = one_hot_seq(X_test_seq)

assert X_valid.shape[0] == y_valid.shape[0]
assert X_train_rest.shape[0] == y_train_rest.shape[0]

max_words = 500
top_words = 5000
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

model = Sequential()
model.add(Embedding(top_words, 100, input_length=max_words))
model.add(GRU(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

# Train
model.fit(X_train_oh, y_train_oh, epochs=3, batch_size=64)

# Final evaluation of the model
scores = model.evaluate(X_test_oh, y_test_oh, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

# Predict the label for test data
y_predict = model.predict(X_test)
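No answer is shown for this question either, but this particular ValueError usually means pad_sequences received raw strings: at the point where it is called, X_train still holds the original sentences rather than the integer sequences produced by the tokenizer, so Keras ends up trying to cast an Arabic sentence to int. A minimal sketch of the usual order, with illustrative variable names:

# tokenize first, then pad the integer sequences rather than the raw text
X_train_seq = tk.texts_to_sequences(X_train)
X_test_seq = tk.texts_to_sequences(X_test)

X_train_pad = sequence.pad_sequences(X_train_seq, maxlen=max_words)
X_test_pad = sequence.pad_sequences(X_test_seq, maxlen=max_words)

model.fit(X_train_pad, y_train, epochs=3, batch_size=64)
scores = model.evaluate(X_test_pad, y_test, verbose=0)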
