Model trained using LSTM predicts the same value for all inputs - python-3.x

I have a dataset with 4000 rows and two columns. The first column contains sentences and the second column contains a numeric code for each sentence.
The 4000 sentences are categorized under about 100 different codes. For example:
Sentences | Codes
Google headquarters is in California | 87390
Steve Jobs was a great man | 70214
Steve Jobs has done great technology innovations | 70214
Google pixel is a very nice phone | 87390
Microsoft is another great giant in technology | 67012
Bill Gates founded Microsoft | 67012
Similarly, there are a total of 4000 rows containing such sentences, and these rows are classified with 100 such codes.
I have tried the code below, but when I predict, it returns the same value for every input. In other words, y_pred is an array of identical values.
Where is the code going wrong?
import pandas as pd
import numpy as np
xl = pd.ExcelFile("dataSet.xlsx")
df = xl.parse('Sheet1')
#df = df.sample(frac=1).reset_index(drop=True)# shuffling the dataframe
df = df.sample(frac=1).reset_index(drop=True)# shuffling the dataframe
X = df.iloc[:, 0].values
Y = df.iloc[:, 1].values
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import pickle
count_vect = CountVectorizer()
X = count_vect.fit_transform(X)
tfidf_transformer = TfidfTransformer()
X = tfidf_transformer.fit_transform(X)
X = X.toarray()
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
y = Y.reshape(-1, 1) # Because Y has only one column
onehotencoder = OneHotEncoder(categories='auto')
Y = onehotencoder.fit_transform(y).toarray()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
inputDataLength = len(X_test[0])
outputDataLength = len(Y[0])
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.layers import Dropout
# fitting the model
embedding_vector_length = 100
model = Sequential()
model.add(Embedding(outputDataLength,embedding_vector_length, input_length=inputDataLength))
model.add(Dropout(0.2))
model.add(LSTM(outputDataLength))
model.add(Dense(outputDataLength, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=20)
y_pred = model.predict(X_test)
invorg = model.inverse_transform(y_test)
y_test = labelencoder_Y.inverse_transform(invorg)
inv = onehotencoder.inverse_transform(y_pred)
y_pred = labelencoder_Y.inverse_transform(inv)

You are using binary_crossentropy even though you have 100 classes, which is not the right loss for this task. You have to use categorical_crossentropy instead.
Compile your model like this:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Also, you are predicting with the model and converting to class labels like this,
y_pred = model.predict(X_test)
inv = onehotencoder.inverse_transform(y_pred)
y_pred = labelencoder_Y.inverse_transform(inv)
Since your model's output layer uses softmax, you have to take the argmax of the predictions to get the class label.
For example, if the prediction was [0.05, 0.10, 0.05, 0.80], argmax gives 3, the index of the class with the highest probability.
So you have to modify the prediction code like this:
y_pred = model.predict(X_test)
y_pred = np.argmax(y_pred, axis=1)
y_pred = labelencoder_Y.inverse_transform(y_pred)
invorg = np.argmax(y_test, axis=1)
invorg = labelencoder_Y.inverse_transform(invorg)
Now you will have the actual class labels in invorg and the predicted class labels in y_pred.
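The decoding step above can be illustrated without Keras or scikit-learn. This is only a minimal sketch: `classes` is a hypothetical stand-in for `labelencoder_Y.classes_` and `probs` is a made-up stand-in for the output of `model.predict(X_test)`.

```python
import numpy as np

# Hypothetical stand-ins: `classes` plays the role of labelencoder_Y.classes_
# (sorted unique codes) and `probs` plays the role of model.predict(X_test).
classes = np.array([67012, 70214, 87390])
probs = np.array([
    [0.10, 0.20, 0.70],
    [0.80, 0.15, 0.05],
    [0.25, 0.50, 0.25],
])

# argmax along axis 1 picks the index of the most probable class per row.
pred_idx = np.argmax(probs, axis=1)

# Indexing the class array with those indices mirrors inverse_transform.
pred_codes = classes[pred_idx]

print(pred_idx.tolist())    # [2, 0, 1]
print(pred_codes.tolist())  # [87390, 67012, 70214]
```

If every row of `pred_idx` comes out identical on real data, the problem lies in training rather than in this decoding step.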

Related

Is there any way to fit and apply deep learning algorithm on chemical smiles data in sequential model?

I have written code for this, where the input X is a SMILES string (for example c1ccccc1)
and the Y value is a classification category such as water/methanol.
# multi-class classification with Keras
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
# load dataset
dataframe = pandas.read_csv("ADLV3_1.csv", header=None)
dataset = dataframe.values
X = dataset[:,2:3]
Y = dataset[:,3:4]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)
# define baseline model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(8, input_dim=4, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
estimator = KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=5, verbose=0)
kfold = KFold(n_splits=10, shuffle=True)
results = cross_val_score(estimator, X, dummy_y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
The code runs, but I get the following warnings:
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_label.py:235: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_label.py:268: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).
FitFailedWarning)
Baseline: nan% (nan%)
Is there any solution to make the algorithm workable? I can't predict any values

Unable to calculate Model performance for Decision Tree Regressor

Although my code runs fine on repl and gives me results, it fails on the Katacoda testing environment.
I am attaching the repl file here for your review as well. It also contains the question, which is commented just above the code I have written.
Kindly review it and let me know what mistakes I am making.
Repl Link
https://repl.it/repls/WarmRobustOolanguage
The code is also shared below.
The question instructions are included as comments.
#Import two modules sklearn.datasets, and sklearn.model_selection.
#Import numpy and set random seed to 100.
#Load popular Boston dataset from sklearn.datasets module and assign it to variable boston.
#Split boston.data into two sets names X_train and X_test. Also, split boston.target into two sets Y_train and Y_test.
#Hint: Use train_test_split method from sklearn.model_selection; set random_state to 30.
#Print the shape of X_train dataset.
#Print the shape of X_test dataset.
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
import numpy as np
np.random.seed(100)
max_depth = range(2, 6)
boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, random_state=30)
print(X_train.shape)
print(X_test.shape)
#Import required module from sklearn.tree.
#Build a Decision tree Regressor model from X_train set and Y_train labels, with default parameters. Name the model as dt_reg.
#Evaluate the model accuracy on training data set and print its score.
#Evaluate the model accuracy on testing data set and print its score.
#Predict the housing price for first two samples of X_test set and print them. (Hint: Use predict() function)
dt_reg = DecisionTreeRegressor(random_state=1)
dt_reg = dt_reg.fit(X_train, Y_train)
print('Accuracy of Train Data :', cross_val_score(dt_reg, X_train,Y_train, cv=10 ))
print('Accuracy of Test Data :', cross_val_score(dt_reg, X_test,Y_test, cv=10 ))
predicted = dt_reg.predict(X_test[:2])
print(predicted)
#Fit multiple Decision tree regressors on X_train data and Y_train labels with max_depth parameter value changing from 2 to 5.
#Evaluate each model accuracy on testing data set.
#Hint: Make use of for loop
#Print the max_depth value of the model with highest accuracy.
dt_reg = DecisionTreeRegressor()
random_grid = {'max_depth': max_depth}
dt_random = RandomizedSearchCV(estimator = dt_reg, param_distributions = random_grid,
n_iter = 90, cv = 3, verbose=2, random_state=42, n_jobs = -1)
dt_random.fit(X_train, Y_train)
dt_random.best_params_
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    return accuracy
best_random = dt_random.best_estimator_
random_accuracy = evaluate(best_random, X_test,Y_test)
print("Accuracy Scores of the Model ",random_accuracy)
best_parameters = (dt_random.best_params_['max_depth']);
print(best_parameters)
The question asks for default parameters, so try removing random_state=1.
Current line:
dt_reg = DecisionTreeRegressor(random_state=1)
Updated line:
dt_reg = DecisionTreeRegressor()
I think that should work.
# ================================================================================
# Machine Learning Using Scikit-Learn | 3 | Decision Trees
# ================================================================================
import sklearn.datasets as datasets
import sklearn.model_selection as model_selection
import numpy as np
from sklearn.tree import DecisionTreeRegressor
np.random.seed(100)
# Load popular Boston dataset from sklearn.datasets module and assign it to variable boston.
boston = datasets.load_boston()
# print(boston)
# Split boston.data into two sets names X_train and X_test. Also, split boston.target into two sets Y_train and Y_test
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(boston.data, boston.target, random_state=30)
# Print the shape of X_train dataset
print(X_train.shape)
# Print the shape of X_test dataset.
print(X_test.shape)
# Build a Decision tree Regressor model from X_train set and Y_train labels, with default parameters. Name the model as dt_reg
dt_Regressor = DecisionTreeRegressor()
dt_reg = dt_Regressor.fit(X_train, Y_train)
print(dt_reg.score(X_train,Y_train))
print(dt_reg.score(X_test,Y_test))
predicted = dt_reg.predict(X_test[:2])
print(predicted)
# Get the max depth
maxdepth = 2
maxscore = 0
for x in range(2, 6):
    dt_Regressor = DecisionTreeRegressor(max_depth=x)
    dt_reg = dt_Regressor.fit(X_train, Y_train)
    score = dt_reg.score(X_test, Y_test)
    if(maxscore < score):
        maxdepth = x
        maxscore = score
print(maxdepth)
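The value returned by `score` on a scikit-learn regressor is the coefficient of determination (R²), not a classification accuracy. The sketch below recomputes it by hand on made-up numbers; the helper name `r2_score_manual` is hypothetical and only for illustration.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2_score_manual(y_true, y_true)                   # exact predictions
mean_only = r2_score_manual(y_true, [6.0, 6.0, 6.0, 6.0])   # always predict the mean
reversed_ = r2_score_manual(y_true, [9.0, 7.0, 5.0, 3.0])   # worse than the mean

print(perfect, mean_only, reversed_)  # 1.0 0.0 -3.0
```

Because R² can be negative, initialising `maxscore` to 0 as in the loop above would leave `maxdepth` at 2 if every depth scored below zero; starting from negative infinity avoids that edge case.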

My keras neural network model is giving me accuracy 0.0000e+00

I am trying to code a simple neural network to predict the total number of corona cases given several factors related to each country.
However, when using the dataset I created, the accuracy is 0.0000e+00. When I tried this code on a different dataset I downloaded online concerning house pricing, the accuracy went up to 60%.
Both datasets are around 200 rows.
Here is my code below.
import pandas as pd
df = pd.read_excel (r'Dataset2.xlsx', sheet_name='class 2')
df.head()
dataset = df.values
X = dataset[:,1:7]
Y = dataset[:,7]
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
X_scale = min_max_scaler.fit_transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size=0.1, random_state=4)
from keras.models import Sequential
from keras.layers import Dense
model = Sequential([ Dense(32, activation='relu', input_shape=(6,)), Dense(32, activation='relu'), Dense(1, activation='sigmoid'),])
model.compile(optimizer='sgd', loss='mse',metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=25, batch_size=32, verbose=1, validation_data=(X_test,y_test))
Also here is a screenshot of my dataset.
Accuracy is a metric for classification, while from your description your task is regression. Use metrics suited to regression instead, such as MAE or MSE.
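As a rough sketch of the regression metrics mentioned above, MAE and MSE can be computed by hand as follows. The numbers are made up; the exact-match rate is included only to illustrate why an accuracy-style metric tends toward zero on continuous targets.

```python
import numpy as np

# Made-up targets and predictions for a regression task.
y_true = np.array([100.0, 250.0, 400.0])
y_pred = np.array([110.0, 240.0, 430.0])

errors = y_pred - y_true          # 10, -10, 30

mae = np.mean(np.abs(errors))     # mean absolute error: (10 + 10 + 30) / 3
mse = np.mean(errors ** 2)        # mean squared error: (100 + 100 + 900) / 3

# Fraction of predictions that equal the target exactly.
exact_matches = np.mean(y_pred == y_true)

print(mae, mse, exact_matches)
```

In Keras, passing `metrics=['mae']` to `model.compile` reports the first of these during training.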

Keras Multiclass Classification (Dense model) - Confusion Matrix Incorrect

I have a labeled dataset. The last column (78) contains 4 types of attack. The confusion matrix produced by the following code is correct for two types of attack. Can anyone help me modify the code for Keras multiclass attack detection so that it produces a correct confusion matrix, and also give correct code for precision, FPR, and TPR in the multiclass case? Thanks.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from keras.utils.np_utils import to_categorical
dataset_original = pd.read_csv('./XYZ.csv')
# Drop NaN values from the DataFrame
dataset = dataset_original.dropna()
# data cleansing
X = dataset.iloc[:, 0:78]
print(X.info())
print(type(X))
y = dataset.iloc[:, 78] #78 is labeled column contains 4 anomaly type
print(y)
# encode the labels to 0, 1 respectively
print(y[100:110])
encoder = LabelEncoder()
y = encoder.fit_transform(y)
print([y[100:110]])
# Split the dataset now
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.2, random_state=0)
# feature scaling
scalar = StandardScaler()
XTrain = scalar.fit_transform(XTrain)
XTest = scalar.transform(XTest)
# modeling
model = Sequential()
model.add(Dense(units=16, kernel_initializer='uniform', activation='relu', input_dim=78))
model.add(Dense(units=8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(units=6, kernel_initializer='uniform', activation='relu'))
model.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(XTrain, yTrain, batch_size=1000, epochs=10)
history = model.fit(XTrain, yTrain, batch_size=1000, epochs=10, verbose=1, validation_data=(XTest,
yTest))
yPred = model.predict(XTest)
yPred = [1 if y > 0.5 else 0 for y in yPred]
matrix = confusion_matrix(yTest, yPred)
print(matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / (matrix[0][0] + matrix[0][1] + matrix[1][0] + matrix[1][1])
print("Accuracy: " + str(accuracy * 100) + "%")
If I understand correctly, you are trying to solve a multiclass classification problem where the target label is one of 4 different attacks. Therefore, the output Dense layer should have 4 units instead of 1, with a 'softmax' activation function (not 'sigmoid'). Additionally, you should use 'categorical_crossentropy' loss in place of 'binary_crossentropy' when compiling your model. Note that 'categorical_crossentropy' expects one-hot encoded labels (for example via to_categorical, which you already import); with the integer labels produced by LabelEncoder, use 'sparse_categorical_crossentropy' instead.
Furthermore, with this setting, applying argmax to the prediction result (which has 4 class probabilities for each test sample) gives you the final label/class.
[Edit]
Your confusion matrix and high accuracy indicate that you are working with an imbalanced dataset. Perhaps a very high number of samples are from class 0 and only a few are from the remaining 3 classes. To handle this, you may want to apply sample weighting or over-sampling/under-sampling approaches.
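The question also asks for precision, TPR, and FPR in the multiclass case. One common way is to treat each class one-vs-rest and read the counts off the confusion matrix. The following is a minimal NumPy sketch on made-up softmax outputs; the helper names `confusion_matrix_np` and `per_class_rates` are hypothetical, and `probs` stands in for `model.predict(XTest)`.

```python
import numpy as np

def confusion_matrix_np(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_rates(cm):
    """One-vs-rest precision, TPR (recall) and FPR for every class."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp      # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp      # belongs to the class, but missed
    tn = cm.sum() - tp - fp - fn
    # np.maximum(..., 1) avoids division by zero for empty classes.
    precision = tp / np.maximum(tp + fp, 1)
    tpr = tp / np.maximum(tp + fn, 1)
    fpr = fp / np.maximum(fp + tn, 1)
    return precision, tpr, fpr

# Made-up softmax outputs for 6 samples and 4 classes.
probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.10, 0.80, 0.05, 0.05],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.50, 0.20, 0.20, 0.10],
])
y_true = np.array([0, 0, 1, 2, 3, 3])
y_pred = probs.argmax(axis=1)     # final class = most probable column

cm = confusion_matrix_np(y_true, y_pred, n_classes=4)
precision, tpr, fpr = per_class_rates(cm)
accuracy = np.trace(cm) / cm.sum()

print(cm)
print(precision, tpr, fpr, accuracy)
```

Overall accuracy is the trace of the matrix divided by its sum, which generalises the two-class formula in the question to any number of classes.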

Training a Neural Network for Word Embedding

Attached is the link to the Entities file. I want to train a neural network to represent each entity as a vector. My training code is below.
import pandas as pd
import numpy as np
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.models import Model
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Input
from keras.layers.embeddings import Embedding
from sklearn.model_selection import train_test_split
file_path = '/content/drive/My Drive/Colab Notebooks/Deep Learning/NLP/Data/entities.txt'
df = pd.read_csv(file_path, delimiter = '\t', engine='python', quoting = 3, header = None)
df.columns = ['Entity']
Entity = df['Entity']
X_train, X_test = train_test_split(Entity, test_size = 0.10)
print('Total Entities: {}'.format(len(Entity)))
print('Training Entities: {}'.format(len(X_train)))
print('Test Entities: {}'.format(len(X_test)))
vocab_size = len(Entity)
X_train_encode = [one_hot(d, vocab_size,lower=True, split=' ') for d in X_train]
X_test_encode = [one_hot(d, vocab_size,lower=True, split=' ') for d in X_test]
model = Sequential()
model.add(Embedding(input_length=1,input_dim=vocab_size, output_dim=100))
model.add(Flatten())
model.add(Dense(vocab_size, activation='softmax'))
model.compile(optimizer='adam', loss='mse', metrics=['acc'])
print(model.summary())
model.fit(X_train_encode, X_train_encode, epochs=20, batch_size=1000, verbose=1)
I encounter the following error when I try to execute the code.
Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 1 array(s), but instead got the following list of 34826 arrays:
You are passing a Python list to model.fit, which Keras interprets as a list of separate input arrays. The following code produces such lists for X_train_encode and X_test_encode.
X_train_encode = [one_hot(d, vocab_size,lower=True, split=' ') for d in X_train]
X_test_encode = [one_hot(d, vocab_size,lower=True, split=' ') for d in X_test]
Convert these lists to NumPy arrays before passing them to model.fit.
X_train_encode = np.array(X_train_encode)
X_test_encode = np.array(X_test_encode)
Also, I don't see the need to one-hot encode X_train and X_test: the Embedding layer expects integers (in your case word indexes), not one-hot encoded values of the words' indexes. So if X_train and X_test are arrays of word indexes, you can feed them directly to model.fit.
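To see why integer indexes are the natural input, an embedding layer can be thought of as a lookup table with one row per word. The sketch below imitates that lookup with plain NumPy indexing; the sizes and the random matrix are made up and are not the weights of any trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4  # hypothetical sizes, far smaller than a real vocabulary

# Stand-in for the trainable weight matrix of an Embedding layer.
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# Three samples, each a single integer word index (input_length=1).
word_indexes = np.array([[3], [0], [5]])

# The lookup is plain integer indexing: one row of the matrix per index.
vectors = embedding_matrix[word_indexes]

print(vectors.shape)  # (3, 1, 4)
```

The input here is a single 2-D integer array of shape (samples, 1), which is the form a model with one input expects, rather than a Python list with one entry per sample.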
EDIT:
Currently 'mse' loss is being used. Since the last layer is a softmax layer, cross-entropy loss is more applicable here. And since the targets are integer class (word) indexes, sparse categorical cross-entropy should be used as the loss.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
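The difference between the sparse and the one-hot form of the loss is only in how the labels are supplied. The following NumPy sketch, on made-up probabilities, computes both and shows they agree; it is an illustration of the formula, not Keras's internal implementation.

```python
import numpy as np

# Made-up softmax outputs for 2 samples and 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
labels = np.array([0, 2])  # integer class indexes, as the sparse loss expects

# Sparse form: pick the predicted probability of the true class by index.
sparse_loss = -np.log(probs[np.arange(len(labels)), labels]).mean()

# Dense form: the same labels one-hot encoded, as categorical_crossentropy expects.
one_hot = np.eye(probs.shape[1])[labels]
dense_loss = -(one_hot * np.log(probs)).sum(axis=1).mean()

print(np.isclose(sparse_loss, dense_loss))  # True
```

With a large vocabulary the sparse form avoids building a one-hot target matrix with one column per word.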

Resources