Keras Multiclass Classification (Dense model) - Confusion Matrix Incorrect - keras

I have a labeled dataset whose last column (index 78) contains 4 types of attack. The confusion matrix produced by the code below is correct for two attack types. Can anyone help me modify the code for Keras multiclass attack detection so that I get a correct confusion matrix, and show the correct way to compute precision, FPR, and TPR for the multiclass case? Thanks.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras.utils import to_categorical
dataset_original = pd.read_csv('./XYZ.csv')
# Drop NaN values from the DataFrame
dataset = dataset_original.dropna()
# data cleansing
X = dataset.iloc[:, 0:78]
print(X.info())
print(type(X))
y = dataset.iloc[:, 78] #78 is labeled column contains 4 anomaly type
print(y)
# encode the labels as integers
print(y[100:110])
encoder = LabelEncoder()
y = encoder.fit_transform(y)
print([y[100:110]])
# Split the dataset now
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.2, random_state=0)
# feature scaling
scalar = StandardScaler()
XTrain = scalar.fit_transform(XTrain)
XTest = scalar.transform(XTest)
# modeling
model = Sequential()
model.add(Dense(units=16, kernel_initializer='uniform', activation='relu', input_dim=78))
model.add(Dense(units=8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(units=6, kernel_initializer='uniform', activation='relu'))
model.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(XTrain, yTrain, batch_size=1000, epochs=10)
history = model.fit(XTrain, yTrain, batch_size=1000, epochs=10, verbose=1,
                    validation_data=(XTest, yTest))
yPred = model.predict(XTest)
yPred = [1 if y > 0.5 else 0 for y in yPred]
matrix = confusion_matrix(yTest, yPred)
print(matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / (matrix[0][0] + matrix[0][1] + matrix[1][0] + matrix[1][1])
print("Accuracy: " + str(accuracy * 100) + "%")

If I understand correctly, you are trying to solve a multiclass classification problem where your target label belongs to one of 4 different attack types. Therefore, your output Dense layer should have 4 units instead of 1, with a 'softmax' activation function (not 'sigmoid'). Additionally, you should use the 'categorical_crossentropy' loss in place of 'binary_crossentropy' when compiling your model.
With this setting, applying argmax to the prediction result (which contains 4 class probabilities for each test sample) gives you the final label/class.
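A minimal sketch of those changes, reusing the variable names from your code (to_categorical one-hot encodes the integer labels, and sklearn's classification_report is one convenient way to get per-class precision and recall; this is an illustration, not a drop-in replacement):
import numpy as np
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import confusion_matrix, classification_report

nClasses = 4  # four attack types
yTrainCat = to_categorical(yTrain, num_classes=nClasses)

model = Sequential()
model.add(Dense(units=16, activation='relu', input_dim=78))
model.add(Dense(units=8, activation='relu'))
model.add(Dense(units=nClasses, activation='softmax'))  # one output unit per class
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(XTrain, yTrainCat, batch_size=1000, epochs=10)

# argmax over the 4 class probabilities gives the predicted integer label
yPred = np.argmax(model.predict(XTest), axis=1)
matrix = confusion_matrix(yTest, yPred)  # now a 4x4 matrix
print(matrix)
print(classification_report(yTest, yPred))  # per-class precision and recall (TPR)

# per-class TPR and FPR derived from the multiclass confusion matrix
TP = np.diag(matrix)
FP = matrix.sum(axis=0) - TP
FN = matrix.sum(axis=1) - TP
TN = matrix.sum() - (TP + FP + FN)
print("TPR:", TP / (TP + FN))
print("FPR:", FP / (FP + TN))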
[Edit]
Your confusion matrix and high accuracy indicate that you are working with an imbalanced dataset: perhaps a very high number of samples are from class 0 and only a few from the remaining 3 classes. To handle this you may want to apply sample weighting or over-sampling/under-sampling approaches.
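For example, a sketch of sample weighting via Keras' class_weight argument (the 'balanced' heuristic from sklearn is just one possible choice):
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(yTrain), y=yTrain)
# rare classes get proportionally larger weights during training
model.fit(XTrain, yTrainCat, batch_size=1000, epochs=10,
          class_weight=dict(enumerate(weights)))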

Related

Questions about Multitask deep neural network modeling using Keras

I'm trying to develop a multitask deep neural network (MTDNN) to make predictions of small-molecule bioactivity against kinase targets, and something is definitely wrong with my model structure, but I can't figure out what.
For my training data (highly imbalanced, with 0 as inactive and 1 as active), I have 423 unique kinase targets (tasks) and over 400k unique compounds. I first calculate ECFP fingerprints from the SMILES strings, and then randomly split the input data into train, test, and valid sets in an 8:1:1 ratio using RandomStratifiedSplitter from the deepchem package. After training the model on the train set, I want to make predictions on the test set to check model performance.
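For reference, a minimal sketch of the fingerprint step with RDKit (the radius of 2 is an assumption; the question does not state it):
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

def ecfp_from_smiles(smiles, n_bits=1024, radius=2):
    # Morgan/ECFP bit vector of the requested width
    mol = Chem.MolFromSmiles(smiles)
    fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)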
Here's what my data looks like (screenshot example):
(https://i.stack.imgur.com/8Hp36.png)
Here's my code:
# Import Packages
import numpy as np
import pandas as pd
import deepchem as dc
from sklearn.metrics import roc_auc_score, roc_curve, auc, confusion_matrix
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import initializers, regularizers
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Input, Dropout, Reshape
from tensorflow.keras.optimizers import SGD
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
# Build Model
inputs = keras.Input(shape=(1024,))
x = keras.layers.Dense(2000, activation='relu', name="dense2000",
                       kernel_initializer=initializers.RandomNormal(stddev=0.02),
                       bias_initializer=initializers.Ones(),
                       kernel_regularizer=regularizers.L2(l2=.0001))(inputs)
x = keras.layers.Dropout(rate=0.25)(x)
x = keras.layers.Dense(500, activation='relu', name='dense500')(x)
x = keras.layers.Dropout(rate=0.25)(x)
x = keras.layers.Dense(846, activation='relu', name='output1')(x)
logits = Reshape([423, 2])(x)
outputs = keras.layers.Softmax(axis=2)(logits)
Model1 = keras.Model(inputs=inputs, outputs=outputs, name='MTDNN')
Model1.summary()
opt = keras.optimizers.SGD(learning_rate=.0003, momentum=0.9)

def loss_function(output, labels):
    loss = tf.nn.softmax_cross_entropy_with_logits(output, labels)
    return loss

loss_fn = loss_function
Model1.compile(loss=loss_fn, optimizer=opt,
               metrics=[keras.metrics.Accuracy(),
                        keras.metrics.AUC(),
                        keras.metrics.Precision(),
                        keras.metrics.Recall()])

for train, test, valid in split2:
    trainX = pd.DataFrame(train.X)
    trainy = pd.DataFrame(train.y)
    trainy2 = tf.one_hot(trainy, 2)
    testX = pd.DataFrame(test.X)
    testy = pd.DataFrame(test.y)
    testy2 = tf.one_hot(testy, 2)
    validX = pd.DataFrame(valid.X)
    validy = pd.DataFrame(valid.y)
    validy2 = tf.one_hot(validy, 2)
    history = Model1.fit(x=trainX, y=trainy2,
                         shuffle=True,
                         epochs=10,
                         verbose=1,
                         batch_size=100,
                         validation_data=(validX, validy2))
    y_pred = Model1.predict(testX)
    y_pred2 = y_pred[:, :, 1]
    y_pred3 = np.round(y_pred2)

# Check the # of nonzero in assay
(y_pred3 != 0).sum()  # all 0s
My questions are:
The ROC-AUC and precision/recall values are all extremely high (>0.99), yet the test-set predictions are all 0s, with no actives at all. I also ran a randomized dataset with the same active:inactive ratio per task to check whether these values are too good to be true, and it turns out all values are still above 0.99, including the ROC-AUC, which should be around 0.5 for random data.
Can anyone help me identify what is wrong with my model and how I should fix it?
Can I use the built-in functions in sklearn to calculate ROC/accuracy/precision-recall, or should I calculate the metrics from the confusion matrix myself for the multitask setting? Why or why not?

Is there any way to fit and apply a deep learning algorithm to chemical SMILES data in a sequential model?

I have written code for this where my input is X: c1ccccc1 (a SMILES string) and the Y value is water/methanol as the classification category.
# multi-class classification with Keras
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
# load dataset
dataframe = pandas.read_csv("ADLV3_1.csv", header=None)
dataset = dataframe.values
X = dataset[:,2:3]
Y = dataset[:,3:4]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)
# define baseline model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(8, input_dim=4, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
estimator = KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=5, verbose=0)
kfold = KFold(n_splits=10, shuffle=True)
results = cross_val_score(estimator, X, dummy_y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
The code runs, but I get the following warnings (and the cross-validation score comes out as nan):
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_label.py:235: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_label.py:268: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).
FitFailedWarning)
Baseline: nan% (nan%)
Is there any way to make the algorithm work? I can't predict any values.
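One likely cause, judging from the ValueError, is that X still contains SMILES strings, which cannot be converted to a float tensor; the DataConversionWarning additionally asks for a 1-D y. A minimal sketch of both fixes (the character n-gram featurization here is only an illustration, not a standard approach for SMILES):
from sklearn.feature_extraction.text import CountVectorizer

# SMILES strings must be turned into numbers before feeding a Dense layer;
# character n-gram counts are one simple, if crude, featurization
vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 2))
X_num = vectorizer.fit_transform(dataset[:, 2].astype(str)).toarray().astype('float32')

# ravel() yields the 1-D array LabelEncoder expects, silencing the warning
encoded_Y = encoder.fit_transform(Y.ravel())

# the input layer must then match the featurized width, e.g.
# model.add(Dense(8, input_dim=X_num.shape[1], activation='relu'))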

TensorFlow custom loss ValueError: No gradients provided for any variable:

I am implementing a custom loss function, as in the code below, for a simple classification task. However, when I run the code I get the error ValueError: No gradients provided for any variable:
import os
os.environ['KERAS_BACKEND'] = "tensorflow"
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import statistics as st
import tensorflow as tf
from keras.utils import np_utils
# if the probability is greater than 0.75 then set the value to 1 for buy or sell else set it to None
# convert the y_pred to 0 and 1 using argmax function
# add the two matrices y_pred and y_true
# if value is 2 then set that to 0
# multiply by misclassification matrix
# add the losses to give a unique number
def custom_loss(y_true, y_pred):
    y_pred = y_pred.numpy()
    y_pred_dummy = np.zeros_like(y_pred)
    y_pred_dummy[np.arange(len(y_pred)), y_pred.argmax(1)] = 1
    y_pred = y_pred_dummy
    y_true = y_true.numpy()
    y_final = y_pred + y_true
    y_final[y_final == 2] = 0
    w_array = [[1, 1, 5], [1, 1, 1], [5, 1, 1]]
    return tf.convert_to_tensor(np.sum(np.dot(y_final, w_array)))
model = keras.Sequential()
model.add(layers.Dense(32, input_dim=4, activation='relu'))
model.add(layers.Dense(16, input_dim=4, activation='relu'))
model.add(layers.Dense(8, input_dim=4, activation='relu'))
model.add(layers.Dense(3, activation='softmax'))
model.compile(loss=custom_loss, optimizer='adam', run_eagerly=True)
I do not understand what I am doing incorrectly here. I read through the related TensorFlow issues, and one suggested cause is that the link between the loss function and the input variables is broken, but I am using y_true in the loss function.
Thanks
You cannot use numpy inside a custom loss function. The loss function is part of the computation graph and has to operate on tensors, not arrays; numpy operations do not support backpropagation of gradients.
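A sketch of one differentiable alternative that stays in TensorFlow ops: weight the standard categorical crossentropy by the misclassification-cost matrix from the question. Note this is not a drop-in equivalent of the original logic — argmax has no gradient, so the hard 0/1 rounding is replaced by the predicted probabilities:
import tensorflow as tf

# cost matrix from the question: confusing class 0 with class 2 costs 5x
W = tf.constant([[1., 1., 5.],
                 [1., 1., 1.],
                 [5., 1., 1.]])

def weighted_loss(y_true, y_pred):
    # expected misclassification cost of the predicted distribution,
    # given the one-hot true labels
    cost = tf.reduce_sum(tf.matmul(y_true, W) * y_pred, axis=-1)
    cce = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
    return cost * cce

model.compile(loss=weighted_loss, optimizer='adam')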

Model trained using LSTM is predicting only the same value for all inputs

I have a dataset with 4000 rows and two columns. The first column contains sentences and the second contains a numeric code for each. There are about 4000 sentences, categorized by about 100 different codes. For example:
Sentences                                           Codes
Google headquarters is in California                87390
Steve Jobs was a great man                          70214
Steve Jobs has done great technology innovations    70214
Google pixel is a very nice phone                   87390
Microsoft is another great giant in technology      67012
Bill Gates founded Microsoft                        67012
Similarly, there are a total of 4000 rows containing these sentences, and the rows are classified with 100 such codes.
I have tried the code below, but when I am predicting, it predicts the same value for everything; in other words, y_pred is an array of identical values.
May I know where the code is going wrong?
import pandas as pd
import numpy as np
xl = pd.ExcelFile("dataSet.xlsx")
df = xl.parse('Sheet1')
df = df.sample(frac=1).reset_index(drop=True)  # shuffling the dataframe
X = df.iloc[:, 0].values
Y = df.iloc[:, 1].values
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import pickle
count_vect = CountVectorizer()
X = count_vect.fit_transform(X)
tfidf_transformer = TfidfTransformer()
X = tfidf_transformer.fit_transform(X)
X = X.toarray()
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
y = Y.reshape(-1, 1) # Because Y has only one column
onehotencoder = OneHotEncoder(categories='auto')
Y = onehotencoder.fit_transform(y).toarray()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
inputDataLength = len(X_test[0])
outputDataLength = len(Y[0])
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.layers import Dropout
# fitting the model
embedding_vector_length = 100
model = Sequential()
model.add(Embedding(outputDataLength,embedding_vector_length, input_length=inputDataLength))
model.add(Dropout(0.2))
model.add(LSTM(outputDataLength))
model.add(Dense(outputDataLength, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=20)
y_pred = model.predict(X_test)
invorg = model.inverse_transform(y_test)
y_test = labelencoder_Y.inverse_transform(invorg)
inv = onehotencoder.inverse_transform(y_pred)
y_pred = labelencoder_Y.inverse_transform(inv)
You are using binary_crossentropy even though you have 100 classes, which is not the right thing to do. You have to use categorical_crossentropy for this task.
Compile your model like this:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Also, you are predicting with the model and converting to class labels like this,
y_pred = model.predict(X_test)
inv = onehotencoder.inverse_transform(y_pred)
y_pred = labelencoder_Y.inverse_transform(inv)
Since your model's output is softmax-activated, to get the class label you have to take the argmax of the predictions.
For example, if the prediction were [0.2, 0.3, 0.0005, 0.99], taking the argmax would give you 3, the class with the highest probability.
So you have to modify the prediction code like this:
y_pred = model.predict(X_test)
y_pred = np.argmax(y_pred, axis=1)
y_pred = labelencoder_Y.inverse_transform(y_pred)
invorg = np.argmax(y_test, axis=1)
invorg = labelencoder_Y.inverse_transform(invorg)
Now you will have the actual class labels in invorg and the predicted class labels in y_pred.
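From there, the decoded labels can be compared directly with sklearn's metrics, for instance:
from sklearn.metrics import accuracy_score, confusion_matrix

print(accuracy_score(invorg, y_pred))
print(confusion_matrix(invorg, y_pred))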

How do I implement a multilabel classification neural network with Keras?

I am attempting to implement a neural network using Keras for a problem that involves multilabel classification. I understand that one way to tackle this is to transform it into several binary classification problems. I have implemented one of these, but am not sure how to proceed with the others, mostly how to combine them. My data set has 5 input variables and 5 labels. A single sample generally has 1-2 labels; it is rare for a sample to have more than two.
Here is my code (thanks to machinelearningmastery.com):
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataframe = pandas.read_csv("Realdata.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:5].astype(float)
Y = dataset[:,5]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# baseline model
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(5, input_dim=5, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    scores = model.evaluate(X, encoded_Y)
    print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    # Make predictions....change the model.predict to whatever you want instead of X
    predictions = model.predict(X)
    # round predictions
    rounded = [round(x[0]) for x in predictions]
    print(rounded)
    return model
# evaluate model with standardized dataset
estimator = KerasClassifier(build_fn=create_baseline, epochs=100, batch_size=5, verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, encoded_Y, cv=kfold)
print("Results: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
The approach you are referring to is the one-versus-all or the one-versus-one strategy for multi-label classification. However, when using a neural network, the easiest solution for a multi-label classification problem with 5 labels is to use a single model with 5 output nodes. With keras:
model = Sequential()
model.add(Dense(5, input_dim=5, kernel_initializer='normal', activation='relu'))
model.add(Dense(5, kernel_initializer='normal', activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd')
You can provide the training labels as binary-encoded vectors of length 5. For instance, an example that corresponds to classes 2 and 3 would have the label [0 1 1 0 0].
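A sketch of building such label vectors with sklearn (the label ids 1-5 and the example label sets here are made up for illustration):
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(classes=[1, 2, 3, 4, 5])
Y = mlb.fit_transform([[2, 3], [1], [4, 5]])  # first row -> [0 1 1 0 0]

# at prediction time each sigmoid output is thresholded independently,
# so a sample can receive any number of labels
predicted = (model.predict(X) > 0.5).astype(int)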
