Keras LSTM batch input shape

Hello, I can't seem to figure out the relationship between the reshaping of X and Y and the batch_input_shape of Keras when dealing with an LSTM.
The current database is an 84119 x 190 pandas DataFrame I am bringing in; from there I break it out into X and Y, so the number of features is 189. If you could point out where I am going wrong as it relates to (sequence, timestep, dimensions), it would be appreciated.
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import LSTM
# load dataset
training_data_df = pd.read_csv("C:/Users/####/python_folders/stock_folder/XYstore/Big_data22.csv")
X = training_data_df.drop('Change Month End Stock Price', axis=1).values
Y = training_data_df[['Change Month End Stock Price']].values
data_dim = 189
timesteps = 4
numberofSequence = 1

X = X.reshape(numberofSequence, timesteps, data_dim)
Y = Y.reshape(numberofSequence, timesteps, 1)

model = Sequential()
model.add(LSTM(200, return_sequences=True, batch_input_shape=(timesteps, data_dim)))
model.add(LSTM(200, return_sequences=True))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(1, return_sequences=False, activation='linear'))
model.compile(loss='mse',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.fit(X, Y, epochs=100)
Edit (to fix the issue)
Thanks to the help below; both answers helped me think through the problem. I still have some work to do to really understand it.
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import LSTM
training_data_df = pd.read_csv("C:/Users/TurnerJ/python_folders/stock_folder/XYstore/Big_data22.csv")
training_data_df.replace(np.nan,value=0,inplace=True)
training_data_df.replace(np.inf,value=0,inplace=True)
training_data_df = training_data_df.loc[279:,:]
X = training_data_df.drop('Change Month End Stock Price', axis=1).values
Y = training_data_df[['Change Month End Stock Price']].values
data_dim = 189
timesteps = 1
numberofSequence = 83840

X = X.reshape(numberofSequence, timesteps, data_dim)
Y = Y.reshape(numberofSequence, timesteps, 1)

model = Sequential()
model.add(LSTM(200, return_sequences=True, batch_input_shape=(32, timesteps, data_dim)))
model.add(LSTM(200, return_sequences=True))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(1, return_sequences=True, activation='linear'))
model.compile(loss='mse',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.fit(X, Y, epochs=100)

batch_input_shape needs the size of the batch as well: (batch_size, timesteps, data_dim).
input_shape needs only the shape of a sample: (timesteps, data_dim). An illustrative first layer written both ways is below.
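For example (a sketch of the two alternatives, with 32 as an assumed batch size):

# Either fix the batch size up front:
model.add(LSTM(200, return_sequences=True, batch_input_shape=(32, timesteps, data_dim)))
# ...or leave the batch size free and give only a sample's shape:
model.add(LSTM(200, return_sequences=True, input_shape=(timesteps, data_dim)))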
Now, there is a problem: 84,119 is not a multiple of 4, so how can we expect to reshape it in steps of 4?
There may also be other issues, such as:
If you have one single long sequence, why divide it into steps?
If you intend to use a sliding-window approach, you need to prepare your data like this: Sample 1 = [step1, step2, step3, step4]; Sample 2 = [step2, step3, step4, step5]; and so on. This implies numberofSequence > 1 (close to the 84,119 rows you started with); a minimal sketch follows this list.
If you intend to have a single sequence divided up because of memory/performance issues, you should be using stateful=True and call model.reset_states() at the beginning of every epoch.
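A minimal sketch of that sliding-window preparation, assuming X is the 2-D feature array of shape (84119, 189) and Y the matching label column (make_windows is an illustrative helper, not from the question):

import numpy as np

def make_windows(X, Y, window):
    # Overlapping samples: sample i covers rows i .. i+window-1,
    # and its label is the label of the window's last row.
    xs = np.stack([X[i:i + window] for i in range(len(X) - window + 1)])
    ys = Y[window - 1:]
    return xs, ys

# 84,119 rows with window=4 gives 84,116 samples of shape (4, 189),
# so the first layer can use input_shape=(4, 189) with any batch size.
X_win, Y_win = make_windows(X, Y, window=4)

Note that each consecutive sample shares three of its four timesteps with the previous one, which is exactly the Sample 1 / Sample 2 overlap described above.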

Related

Questions about Multitask deep neural network modeling using Keras

I'm trying to develop a multitask deep neural network (MTDNN) to make predictions on small-molecule bioactivity against kinase targets, and something is definitely wrong with my model structure, but I can't figure out what.
For my training data (highly imbalanced, with 0 as inactive and 1 as active), I have 423 unique kinase targets (tasks) and over 400k unique compounds. I first calculate the ECFP fingerprints from SMILES, and then I randomly split the input data into train, test, and valid sets in an 8:1:1 ratio using RandomStratifiedSplitter from the deepchem package. After training my model on the train set, I want to make predictions on the test set to check model performance.
Here's what my data looks like (screenshot example): https://i.stack.imgur.com/8Hp36.png
Here's my code:
# Import Packages
import numpy as np
import pandas as pd
import deepchem as dc
from sklearn.metrics import roc_auc_score, roc_curve, auc, confusion_matrix
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import initializers, regularizers
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Input, Dropout, Reshape
from tensorflow.keras.optimizers import SGD
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
# Build Model
inputs = keras.Input(shape=(1024, ))
x = keras.layers.Dense(2000, activation='relu', name="dense2000",
                       kernel_initializer=initializers.RandomNormal(stddev=0.02),
                       bias_initializer=initializers.Ones(),
                       kernel_regularizer=regularizers.L2(l2=.0001))(inputs)
x = keras.layers.Dropout(rate=0.25)(x)
x = keras.layers.Dense(500, activation='relu', name='dense500')(x)
x = keras.layers.Dropout(rate=0.25)(x)
x = keras.layers.Dense(846, activation='relu', name='output1')(x)
logits = Reshape([423, 2])(x)
outputs = keras.layers.Softmax(axis=2)(logits)
Model1 = keras.Model(inputs=inputs, outputs=outputs, name='MTDNN')
Model1.summary()

opt = keras.optimizers.SGD(learning_rate=.0003, momentum=0.9)

def loss_function(output, labels):
    loss = tf.nn.softmax_cross_entropy_with_logits(output, labels)
    return loss

loss_fn = loss_function
Model1.compile(loss=loss_fn, optimizer=opt,
               metrics=[keras.metrics.Accuracy(),
                        keras.metrics.AUC(),
                        keras.metrics.Precision(),
                        keras.metrics.Recall()])

for train, test, valid in split2:
    trainX = pd.DataFrame(train.X)
    trainy = pd.DataFrame(train.y)
    trainy2 = tf.one_hot(trainy, 2)
    testX = pd.DataFrame(test.X)
    testy = pd.DataFrame(test.y)
    testy2 = tf.one_hot(testy, 2)
    validX = pd.DataFrame(valid.X)
    validy = pd.DataFrame(valid.y)
    validy2 = tf.one_hot(validy, 2)

history = Model1.fit(x=trainX, y=trainy2,
                     shuffle=True,
                     epochs=10,
                     verbose=1,
                     batch_size=100,
                     validation_data=(validX, validy2))
y_pred = Model1.predict(testX)
y_pred2 = y_pred[:, :, 1]
y_pred3 = np.round(y_pred2)

# Check the number of nonzero predictions per assay
(y_pred3 != 0).sum()  # all 0s
My questions are:
The ROC and precision-recall values are all extremely high (>0.99), but the prediction result on the test set contains all 0s, with no actives at all. I also used a randomized dataset with the same active:inactive ratio for each task to test whether those values are too good to be true, and it turns out all the values are still above 0.99, including ROC, which was expected to be 0.5.
Can anyone help me identify what is wrong with my model and how I should fix it, please?
Can I use built-in functions in sklearn to calculate ROC/accuracy/precision-recall, or should I manually calculate the metrics from the confusion matrix myself for the multitask case? Why or why not?
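For the last question, per-task evaluation with sklearn's built-ins could look like this minimal sketch (assuming testy holds the (n_compounds, 423) 0/1 label DataFrame and y_pred2 the matching array of predicted active-class probabilities from the code above; task_aucs is an illustrative name):

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = testy.to_numpy()  # (n_compounds, 423) binary labels
y_score = y_pred2          # (n_compounds, 423) predicted P(active) per task

task_aucs = []
for t in range(y_true.shape[1]):
    # roc_auc_score is undefined for a task whose test labels are all one class
    if len(np.unique(y_true[:, t])) == 2:
        task_aucs.append(roc_auc_score(y_true[:, t], y_score[:, t]))
print("mean per-task ROC AUC:", np.mean(task_aucs))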

Speed up the Keras sequential model in for loop

I am trying to decrease the execution time of a Keras sequential model that runs in a loop several times.
My training dataset shape is (1, 9526, 32736, 1), i.e. (1, ntimes, ngrid, 1), and the test data shape is (1, 1059, 32736, 1).
The test data's time dimension is not fixed (it varies), but ngrid is fixed.
I created a dummy dimension at the end so that when I slice the training data in the for loop, the shape will be (1, ntimes, 1).
This is a description of what the model does:
First, the model does a convolution along the time axis for a single grid point.
Then it subtracts the output of the convolution from the input data.
Then it does a convolution (along the time axis) of the output from the second layer.
The above steps are repeated 32736 (ngrid) times.
Here is the code:
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv1D, subtract
from tqdm import tqdm

print(tf.__version__)     # 2.4.1
print(keras.__version__)  # 2.4.0

no_epochs = 1000
validation_split = 0
verbosity = 0

pred = np.ones(xtest.shape[1:3])
for i in tqdm(range(ngrid)):
    keras.backend.clear_session()
    inputs = Input(shape=(None, 1), batch_size=1, name='input_layer')
    smoth1 = Conv1D(1, kernel_size=90, padding='same', activation='linear')(inputs)
    diff = subtract([inputs, smoth1])
    smoth2 = Conv1D(1, kernel_size=30, padding='same', activation='linear')(diff)
    model = Model(inputs=inputs, outputs=smoth2)
    model.compile(optimizer='adam', loss='mse')
    model.fit(xtrain[:, :, i, :], ytrain[:, :, i, :],
              epochs=no_epochs, validation_split=validation_split, verbose=verbosity)
    pred[:, i] = model.predict(xtest[:, :, i, :]).squeeze()
    del model
I am looking for other alternatives that can speed up my code. Any suggestions are greatly appreciated.

Keras Multiclass Classification (Dense model) - Confusion Matrix Incorrect

I have a labeled dataset. The last column (78) contains 4 types of attack. The following code's confusion matrix is correct for two types of attack. Can anyone help me modify the code for Keras multiclass attack detection so it produces the correct confusion matrix, and give correct code for precision, FPR, and TPR in the multiclass case? Thanks.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from keras.utils.np_utils import to_categorical

dataset_original = pd.read_csv('./XYZ.csv')

# drop NaN values from the DataFrame
dataset = dataset_original.dropna()

# data cleansing
X = dataset.iloc[:, 0:78]
print(X.info())
print(type(X))
y = dataset.iloc[:, 78]  # column 78 is the label column containing 4 anomaly types
print(y)

# encode the labels to integers
print(y[100:110])
encoder = LabelEncoder()
y = encoder.fit_transform(y)
print([y[100:110]])

# split the dataset
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.2, random_state=0)

# feature scaling
scalar = StandardScaler()
XTrain = scalar.fit_transform(XTrain)
XTest = scalar.transform(XTest)

# modeling
model = Sequential()
model.add(Dense(units=16, kernel_initializer='uniform', activation='relu', input_dim=78))
model.add(Dense(units=8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(units=6, kernel_initializer='uniform', activation='relu'))
model.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(XTrain, yTrain, batch_size=1000, epochs=10)
history = model.fit(XTrain, yTrain, batch_size=1000, epochs=10, verbose=1,
                    validation_data=(XTest, yTest))

yPred = model.predict(XTest)
yPred = [1 if y > 0.5 else 0 for y in yPred]
matrix = confusion_matrix(yTest, yPred)
print(matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / (matrix[0][0] + matrix[0][1] + matrix[1][0] + matrix[1][1])
print("Accuracy: " + str(accuracy * 100) + "%")
If I understand correctly, you are trying to solve a multiclass classification problem where your target label belongs to one of 4 different attacks. Therefore, you should give the output Dense layer 4 units instead of 1, with a 'softmax' activation function (not 'sigmoid'). Additionally, you should use the 'categorical_crossentropy' loss in place of 'binary_crossentropy' when compiling your model.
Furthermore, with this setting, applying argmax to the prediction result (which has 4 class-probability values for each test sample) gives you the final label/class. A minimal sketch of these changes follows.
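A sketch of those changes, reusing XTrain/yTrain/XTest/yTest from the question and assuming the LabelEncoder mapped the 4 attack types to 0-3:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import confusion_matrix

yTrainCat = to_categorical(yTrain, num_classes=4)  # one-hot targets for 4 classes

model = Sequential()
model.add(Dense(units=16, activation='relu', input_dim=78))
model.add(Dense(units=8, activation='relu'))
model.add(Dense(units=4, activation='softmax'))  # 4 units, one per attack type
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(XTrain, yTrainCat, batch_size=1000, epochs=10)

yPred = np.argmax(model.predict(XTest), axis=1)  # argmax -> class index per sample
matrix = confusion_matrix(yTest, yPred)          # now a 4x4 confusion matrix
print(matrix)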
[Edit]
Your confusion matrix and high accuracy indicate that you are working with an imbalanced dataset. Perhaps a very high number of samples are from class 0 and only a few are from the remaining 3 classes. To handle this, you may want to apply sample weighting or over-/under-sampling approaches.

How Keras LSTM inputs correspond with outputs

I am trying to predict the stock price with the help of investor sentiment and the previous stock price.
The head of the data frame is as follows:
time_p close sent_sum output
2007-01-03 10:00:00 10.837820 0.4 10.6838
2007-01-03 11:00:00 10.849175 0.6 10.8062
2007-01-03 12:00:00 10.823942 -0.3 10.7898
2007-01-03 13:00:00 10.810063 -0.2 10.7747
2007-01-03 14:00:00 10.680111 0.1 10.7078
How do I preprocess the data?
The above df contains stock data where time_p is an hourly datetime (not included in the model) that corresponds to the hourly closing price close, sent_sum is investor sentiment, and output holds the labels for the model. output is shifted upward with df.output.shift(-8); in other words, I want to predict +1 hour into the future based upon the previous 7 hours of close (price) plus the previous 7 hours of sent_sum (investor sentiment). A sketch of this alignment is below.
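For instance, that label alignment could be produced like this (a sketch, assuming the frame is named df with the columns shown in the head above):

# Label at time t becomes the value 8 rows into the future;
# the trailing rows with no future value become NaN and are dropped.
df['output'] = df['output'].shift(-8)
df = df.dropna()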
I am trying to fit a model like this:
import tensorflow as tf
from pandas_datareader import data
import urllib.request, json
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.recurrent import LSTM
from keras import optimizers
import math
import keras as k
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('AAPL_final.csv')
raw = data.iloc[:, [2, 3]].values
raw2 = data.iloc[:, [4]].values

# scaling of data
scaler = MinMaxScaler(feature_range=(-1, 1))
scaler_y = MinMaxScaler(feature_range=(-1, 1))
scaled_x = scaler.fit_transform(raw)
scaled_y = scaler_y.fit_transform(raw2)

# train/test split
train = scaled_x[:14000].reshape(2000, 7, 2)            # Train_X data
train_ = scaled_y[:14000].reshape(2000, 7, 1)           # Train_Y
test_xdata = scaled_x[14000:17542].reshape(506, 7, 2)   # Test_X
test_ydata = scaled_y[14000:17542].reshape(506, 7, 1)   # Test_Y
train_x, train_y = train, train_
test_x, test_y = test_xdata, test_ydata
print('shapes of train_x, train_y, test_x and test_y:',
      train_x.shape, train_y.shape, test_x.shape, test_y.shape)

model = Sequential()
model.add(LSTM(100, input_shape=(7, 2), return_sequences=True))
model.add(Dropout(0.1))
model.add(LSTM(100, return_sequences=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='sgd',
              metrics=['accuracy', 'mae', 'mape', 'cosine'])  # sgd / rmsprop
My question: I suspect that, since I have already shifted the label data by -7 points so that current inputs are matched with values +7 hours in the future, is it OK to write train_ = scaled_y[:14000].reshape(2000, 7, 1) # Train_Y in the (2000, 7, 1) shape, or am I doing something wrong?
Secondly, I am confused about how a Keras LSTM matches inputs with labels; I mean, how does input_shape really work?
Is there a good way to fit this model? Please suggest one.
I would be grateful for the help.
You can do something like the following on scaled_x and scaled_y.
I used a toy dataset to show an example; here, data and labels initially have shapes ((150, 4), (150,)), using the following script:
import numpy as np

seq_length = 10
dataX = []
dataY = []
for i in range(0, 150 - seq_length, 1):
    dataX.append(data[i:i + seq_length])
    dataY.append(labels[i + seq_length - 1])

dataX = np.reshape(dataX, (-1, seq_length, 4))
dataY = np.reshape(dataY, (-1, 1))
# dataX.shape, dataY.shape
Output: ((140, 10, 4), (140, 1))
Like this example, you can create sequences from your 7 steps of data, with the target being the next period; a sketch applying this to the question's arrays follows.
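Applied to the question's arrays (a sketch, assuming scaled_x has shape (n, 2) and scaled_y has shape (n, 1)):

import numpy as np

seq_length = 7
dataX, dataY = [], []
for i in range(len(scaled_x) - seq_length):
    dataX.append(scaled_x[i:i + seq_length])    # 7 hours of (close, sent_sum)
    dataY.append(scaled_y[i + seq_length - 1])  # the already-shifted future label
dataX = np.reshape(dataX, (-1, seq_length, 2))
dataY = np.reshape(dataY, (-1, 1))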
A Keras LSTM layer expects the input to be 3-dimensional, as (batch_size, seq_length, input_dims), like this:
input_dims = ...  # an integer
seq_length = ...  # an integer
model = Sequential()
model.add(LSTM(128, activation='relu', input_shape=(seq_length, input_dims), return_sequences=True))
Note: batch_size is not used when defining the layer; the model infers it during fit.
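For instance, the batch size appears only at training time (a hypothetical call with the arrays built above, once an output layer is added and the model is compiled):

model.fit(dataX, dataY, batch_size=32, epochs=10)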

Keras - Text Classification - LSTM - How to input text?

I'm trying to understand how to use an LSTM to classify a certain dataset that I have.
I researched and found this Keras IMDB example:
https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py
However, I'm confused about how the dataset must be processed for input.
I know Keras has text pre-processing methods, but I'm not sure which to use.
The x contains n lines of text and the y classifies the text by happiness/sadness. Basically, 1.0 means 100% happy and 0.0 means totally sad; the numbers may vary, for example 0.25 and so on.
So my question is: how do I input x and y properly? Do I have to use bag of words?
Any tip is appreciated!
I coded this below, but I keep getting the same error:
#('Bad input argument to theano function with name ... at index 1(0-based)',
# 'could not convert string to float: negative')
import keras.preprocessing.text
import numpy as np
np.random.seed(1337)  # for reproducibility
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
import pandas

print('Loading data...')
thedata = pandas.read_csv("dataset/text.csv", sep=', ', delimiter=',', header='infer', names=None)
x = thedata['text']
y = thedata['sentiment']
x = x.iloc[:].values
y = y.iloc[:].values

# tokenize the raw text into integer sequences
tk = keras.preprocessing.text.Tokenizer(nb_words=2000, filters=keras.preprocessing.text.base_filter(),
                                        lower=True, split=" ")
tk.fit_on_texts(x)
x = tk.texts_to_sequences(x)

# pad every sequence to the same length
max_len = 80
print('max_len', max_len)
print('Pad sequences (samples x time)')
x = sequence.pad_sequences(x, maxlen=max_len)

# build the model
max_features = 20000
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128, input_length=max_len, dropout=0.2))
model.add(LSTM(128, dropout_W=0.2, dropout_U=0.2))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(x, y=y, batch_size=200, nb_epoch=1, verbose=1, validation_split=0.2,
          show_accuracy=True, shuffle=True)
# error: at index 1(0-based)', 'could not convert string to float: negative')
Review how you are using your CSV parser to read the text in. Ensure that the fields are in the format Text, Sentiment if you want to make use of the parser as you've written it in your code; a minimal sketch of a stricter read is below.
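A sketch of such a read, assuming a two-column file whose header row is text,sentiment (the dtype mapping is illustrative):

import pandas
thedata = pandas.read_csv("dataset/text.csv", sep=',', header=0,
                          names=['text', 'sentiment'],
                          dtype={'text': str, 'sentiment': float})
# A stray string such as 'negative' in the sentiment column now fails
# here, at read time, instead of later inside model.fit.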
