How Kera LSTM inputs corresponds with output - python-3.x

I am trying to predict the stock price with help of investor sentiments and previous stock price.
head of data frame is as under:
time_p close sent_sum output
2007-01-03 10:00:00 10.837820 0.4 10.6838
2007-01-03 11:00:00 10.849175 0.6 10.8062
2007-01-03 12:00:00 10.823942 -0.3 10.7898
2007-01-03 13:00:00 10.810063 -0.2 10.7747
2007-01-03 14:00:00 10.680111 0.1 10.7078
How I preprocess Data?
Above df contains stock data where,time_p is hourly datetime(not included in model) that coresponds to houly closing price close, sent_sum is invostor sentiment and output is labels for model. output is shifted upword with df.output.shitf(-8) in other words I want to predict +1 hour into future based upon -7 hours close(price) plus -7hours sent_sum (investor sentimnets).
I am trying to fit a model like this:
import tensorflow as tf
from pandas_datareader import data
import urllib.request, json
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.recurrent import LSTM
from keras import optimizers
import math
import keras as k
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('AAPL_final.csv')
raw= data.iloc[:,[2,3]].values
raw2= data.iloc[:,[4]].values
#############scalling fo data######
scaler = MinMaxScaler(feature_range=(-1, 1))
scaler_y = MinMaxScaler(feature_range=(-1, 1))
scaled_x = scaler.fit_transform(raw)
scaled_y = scaler_y.fit_transform(raw2)
########tran test set##############
train= scaled_x[:14000].reshape(2000,7,2) # Train_X data
train_= scaled_y[:14000].reshape(2000,7,1) #train_Y
test_xdata= scaled_x[14000:17542].reshape(506,7,2)# Test_x
test_ydata= scaled_y[14000:17542].reshape(506,7,1)#Test_y
train_x,train_y= train, train_
test_x, test_y = test_xdata, test_ydata
print('shapes of tranx,teainy,testx and testy',train_x.shape, train_y.shape, test_x.shape, test_y.shape)
model = Sequential()
model.add(LSTM(100,input_shape=(7,2),return_sequences=True))
model.add(Dropout(0.1))
model.add(LSTM(100,return_sequences=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='sgd',metrics=['accuracy', 'mae', 'mape', 'cosine'])#sgd#rmsprop
My Questiton: I suspect that once I have alreay shifted the label data with -7 points round the way matched current inputs with +7 hours in future time period is it ok to write train_= scaled_y[:14000].reshape(2000,7,1) #train_Y in (2000,7,1) shape or I am doing something worge.
Secondly, I am confused with how keras_lstm matches input with labels, I mean how input_shape really works?
Is there any good way to fit this model?, please suggest.
I shall be grateful for the help.

You can do something like following on the scaled_x and scaled_y
I used toy dataset to show an example, here data and labels are of shape ((150, 4), (150,)) initially, using the following script:
seq_length = 10
dataX = []
dataY = []
for i in range(0, 150 - seq_length, 1):
dataX.append(data[i:i+seq_length])
dataY.append(labels[i+seq_length-1])
import numpy as np
dataX = np.reshape(dataX, (-1, seq_length, 4))
dataY = np.reshape(dataY, (-1, 1))
# dataX.shape, dataY.shape
Output: ((140, 10, 4), (140, 1))
Like this example you can create the sequence with the 7 days data, with target for next day.
Keras LSTM layer expects the input to be 3 dims as (batch_size, seq_length, input_dims) as this
input_dims = # an integer
seq_length = #an integer
model = Sequential()
model.add(LSTM(128, activation='relu', input_shape=(seq_length, input_dims), return_sequences=True))
Note: batch_size is not used on defining the layer, model will fill itself while fit.

Related

Questions about Multitask deep neural network modeling using Keras

I'm trying to develop a multitask deep neural network (MTDNN) to make prediction on small molecule bioactivity against kinase targets and something is definitely wrong with my model structure but I can't figure out what.
For my training data (highly imbalanced data with 0 as inactive and 1 as active), I have 423 unique kinase targets (tasks) and over 400k unique compounds. I first calculate the ECFP fingerprint using smiles, and then I randomly split the input data into train, test, and valid sets based on 8:1:1 ratio using RandomStratifiedSplitter from deepchem package. After training my model using the train set and I want to make prediction on the test set to check model performance.
Here's what my data looks like (screenshot example):
(https://i.stack.imgur.com/8Hp36.png)
Here's my code:
# Import Packages
import numpy as np
import pandas as pd
import deepchem as dc
from sklearn.metrics import roc_auc_score, roc_curve, auc, confusion_matrix
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import initializers, regularizers
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Input, Dropout, Reshape
from tensorflow.keras.optimizers import SGD
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
# Build Model
inputs = keras.Input(shape = (1024, ))
x = keras.layers.Dense(2000, activation='relu', name="dense2000",
kernel_initializer=initializers.RandomNormal(stddev=0.02),
bias_initializer=initializers.Ones(),
kernel_regularizer=regularizers.L2(l2=.0001))(inputs)
x = keras.layers.Dropout(rate=0.25)(x)
x = keras.layers.Dense(500, activation='relu', name='dense500')(x)
x = keras.layers.Dropout(rate=0.25)(x)
x = keras.layers.Dense(846, activation='relu', name='output1')(x)
logits = Reshape([423, 2])(x)
outputs = keras.layers.Softmax(axis=2)(logits)
Model1 = keras.Model(inputs=inputs, outputs=outputs, name='MTDNN')
Model1.summary()
opt = keras.optimizers.SGD(learning_rate=.0003, momentum=0.9)
def loss_function (output, labels):
loss = tf.nn.softmax_cross_entropy_with_logits(output,labels)
return loss
loss_fn = loss_function
Model1.compile(loss=loss_fn, optimizer=opt,
metrics=[keras.metrics.Accuracy(),
keras.metrics.AUC(),
keras.metrics.Precision(),
keras.metrics.Recall()])
for train, test, valid in split2:
trainX = pd.DataFrame(train.X)
trainy = pd.DataFrame(train.y)
trainy2 = tf.one_hot(trainy,2)
testX = pd.DataFrame(test.X)
testy = pd.DataFrame(test.y)
testy2 = tf.one_hot(testy,2)
validX = pd.DataFrame(valid.X)
validy = pd.DataFrame(valid.y)
validy2 = tf.one_hot(validy,2)
history = Model1.fit(x=trainX, y=trainy2,
shuffle=True,
epochs=10,
verbose=1,
batch_size=100,
validation_data=(validX, validy2))
y_pred = Model1.predict(testX)
y_pred2 = y_pred[:, :, 1]
y_pred3 = np.round(y_pred2)
# Check the # of nonzero in assay
(y_pred3!=0).sum () #all 0s
My questions are:
The roc and precision recall are all extremely high (>0.99), but the prediction result of test set contains all 0s, no actives at all. I also use the randomized dataset with same active:inactive ratio for each task to test if those values are too good to be true, and turns out all values are still above 0.99, including roc which is expected to be 0.5.
Can anyone help me to identify what is wrong with my model and how should I fix it please?
Can I use built-in functions in sklearn to calculate roc/accuracy/precision-recall? Or should I manually calculate the metrics based on confusion matrix on my own for multitasking purpose. Why and why not?

Value Error problem from using kernel regularizer

I got a ValueError when using TensorFlow to create a model. Based on the error there is a problem that occurs with the kernel regularizer applied on the Conv2D layer and the mean squared error function. I used the L1 regularizer provided by the TensorFlow keras package. I've tried setting different values for the L1 regularization factor and even setting the value to 0, but I get the same error.
Context: Creating a model that predicts phenotype traits given genotypes and phenotypes datasets. The genotype input data has 4276 samples, and the input shape that the model takes is (28220,1). My labels represent the phenotype data. The labels include 4276 samples with 20 as the number of phenotype traits in the dataset. In this model we use differential privacy(DP) and add it to a CNN model which uses the Mean squared error loss function and the DPKerasAdamOptimizer to add DP. I'm just wondering if MSE would be a good choice as a loss function?
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
!pip install tensorflow-privacy
import numpy as np
import tensorflow as tf
from tensorflow_privacy import *
import tensorflow_privacy
from matplotlib import pyplot as plt
import pylab as pl
import numpy as np
import pandas as pd
from tensorflow.keras.models import Model
from tensorflow.keras import datasets, layers, models, losses
from tensorflow.keras import backend as bke
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l1, l2, l1_l2 #meaning of norm
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
batch_size = 32
epochs = 4
microbatches = 8
inChannel = 1
kr = 0#1e-5
num_kernels=8
drop_perc=0.25
dim = 1
l2_norm_clip = 1.5
noise_multiplier = 1.3
learning_rate = 0.25
latent_dim = 0
def print_datashape():
print('genotype data: ', genotype_data.shape)
print('phenotype data: ', single_pheno.shape)
genotype_data = tf.random.uniform([4276, 28220],1,3, dtype=tf.dtypes.int32)
phenotype_data = tf.random.uniform([4276, 20],-4.359688,34,dtype=tf.dtypes.float32)
genotype_data = genotype_data.numpy()
phenotype_data = phenotype_data.numpy()
small_geno = genotype_data
single_pheno = phenotype_data[:, 1]
print_datashape()
df = small_geno
min_max_scaler = preprocessing.MinMaxScaler()
df = min_max_scaler.fit_transform(df)
scaled_pheno = min_max_scaler.fit_transform(single_pheno.reshape(-1,1)).reshape(-1)
feature_size= df.shape[1]
df = df.reshape(-1, feature_size, 1, 1)
print("df: ", df.shape)
print("scaled: ", scaled_pheno.shape)
# split train to train and valid
train_data,test_data,train_Y,test_Y = train_test_split(df, scaled_pheno, test_size=0.2, random_state=13)
train_X,valid_X,train_Y,valid_Y = train_test_split(train_data, train_Y, test_size=0.2, random_state=13)
def print_shapes():
print('train_X: {}'.format(train_X.shape))
print('train_Y: {}'.format(train_Y.shape))
print('valid_X: {}'.format(valid_X.shape))
print('valid_Y: {}'.format(valid_Y.shape))
input_shape= (feature_size, dim, inChannel)
predictor = tf.keras.Sequential()
predictor.add(layers.Conv2D(num_kernels, (5,1), padding='same', strides=(12, 1), activation='relu', kernel_regularizer=tf.keras.regularizers.L1(kr),input_shape= input_shape))
predictor.add(layers.AveragePooling2D(pool_size=(2,1)))
predictor.add(layers.Dropout(drop_perc))
predictor.add(layers.Flatten())
predictor.add(layers.Dense(int(feature_size / 4), activation='relu'))
predictor.add(layers.Dropout(drop_perc))
predictor.add(layers.Dense(int(feature_size / 10), activation='relu'))
predictor.add(layers.Dropout(drop_perc))
predictor.add(layers.Dense(1))
mse = tf.keras.losses.MeanSquaredError(reduction=tf.keras.losses.Reduction.NONE)
optimizer = DPKerasAdamOptimizer(learning_rate=learning_rate, l2_norm_clip=l2_norm_clip, noise_multiplier=noise_multiplier, num_microbatches=microbatches)
# compile
predictor.compile(loss=mse, optimizer=optimizer, metrics=['mse'])
#summary
predictor.summary()
print_shapes()
predictor.fit(train_X, train_Y,batch_size=batch_size,epochs=epochs,verbose=1, validation_data=(valid_X, valid_Y))
ValueError: Shapes must be equal rank, but are 1 and 0
From merging shape 0 with other shapes. for '{{node AddN}} = AddN[N=2, T=DT_FLOAT](mean_squared_error/weighted_loss/Mul, conv2d_2/kernel/Regularizer/mul)' with input shapes: [?], [].

Keras Multiclass Classification (Dense model) - Confusion Matrix Incorrect

I have a labeled dataset. last column (78) contains 4 types of attack. following codes confusion matrix is correct for two types of attack. can any one help to modify the code for keras multiclass attack detection and correction for get correct confusion matrix? and for correct code for precision, FPR,TPR for multiclass. Thanks.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from keras.utils.np_utils import to_categorical
dataset_original = pd.read_csv('./XYZ.csv')
# Dron NaN value from Data Frame
dataset = dataset_original.dropna()
# data cleansing
X = dataset.iloc[:, 0:78]
print(X.info())
print(type(X))
y = dataset.iloc[:, 78] #78 is labeled column contains 4 anomaly type
print(y)
# encode the labels to 0, 1 respectively
print(y[100:110])
encoder = LabelEncoder()
y = encoder.fit_transform(y)
print([y[100:110]])
# Split the dataset now
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.2, random_state=0)
# feature scaling
scalar = StandardScaler()
XTrain = scalar.fit_transform(XTrain)
XTest = scalar.transform(XTest)
# modeling
model = Sequential()
model.add(Dense(units=16, kernel_initializer='uniform', activation='relu', input_dim=78))
model.add(Dense(units=8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(units=6, kernel_initializer='uniform', activation='relu'))
model.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(XTrain, yTrain, batch_size=1000, epochs=10)
history = model.fit(XTrain, yTrain, batch_size=1000, epochs=10, verbose=1, validation_data=(XTest,
yTest))
yPred = model.predict(XTest)
yPred = [1 if y > 0.5 else 0 for y in yPred]
matrix = confusion_matrix(yTest, yPred)`enter code here`
print(matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / (matrix[0][0] + matrix[0][1] + matrix[1][0] + matrix[1][1])
print("Accuracy: " + str(accuracy * 100) + "%")
If i understand correctly, you are trying to solve a multiclass classification problem where your target label belongs to 4 different attacks. Therefore, you should use the output Dense layer having 4 units instead of 1 with a 'softmax' activation function (not 'sigmoid' activation). Additionally, you should use 'categorical_crossentropy' loss in place of 'binary_crossentropy' while compiling your model.
Furthermore, with this setting, applying argmax on prediction result (that has 4 class probability values for each test sample) you will get the final label/class.
[Edit]
Your confusion matrix and high accuracy indicates that you are working with an imbalanced dataset. May be very high number of samples are from class 0 and few samples are from the remaining 3 classes. To handle this you may want to apply weighting samples or over-sampling/under-sampling approaches.

Keras LSTM batch input shape

Hello I can not seem to figure out the relationship between the reshapping of X,Y with the batch input shape of Keras when dealing with a LSTM.
current database is a 84119,190 pandas dataframe i am bringing in. from there break out to X and Y. so features is 189. If you could point out where i am wrong as it relates to the (sequence, timestep, dimensions) it would be appreciated.
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import LSTM
# load dataset
training_data_df = pd.read_csv("C:/Users/####/python_folders/stock_folder/XYstore/Big_data22.csv")
X = training_data_df.drop('Change Month End Stock Price', axis=1).values
Y = training_data_df[['Change Month End Stock Price']].values
data_dim = 189
timesteps = 4
numberofSequence = 1
X=X.reshape(numberofSequence,timesteps,data_dim)
Y=Y.reshape(numberofSequence,timesteps, 1)
model = Sequential()
model.add(LSTM(200, return_sequences=True,batch_input_shape=(timesteps,data_dim)))
model.add(LSTM(200, return_sequences=True))
model.add(LSTM(100,return_sequences=True))
model.add(LSTM(1,return_sequences=False, activation='linear'))
model.compile(loss='mse',
optimizer='rmsprop',
metrics=['accuracy'])
model.fit(X,Y,epochs=100)
Edit to fix issue
thanks to the help below. Both helped me think through the problem. still have some work to do to really understand it.
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import LSTM
training_data_df = pd.read_csv("C:/Users/TurnerJ/python_folders/stock_folder/XYstore/Big_data22.csv")
training_data_df.replace(np.nan,value=0,inplace=True)
training_data_df.replace(np.inf,value=0,inplace=True)
training_data_df = training_data_df.loc[279:,:]
X = training_data_df.drop('Change Month End Stock Price', axis=1).values
Y = training_data_df[['Change Month End Stock Price']].values
data_dim = 189
timesteps = 1
numberofSequence = 83840
X=X.reshape(numberofSequence,timesteps,data_dim)
Y=Y.reshape(numberofSequence,timesteps, 1)
model = Sequential()
model.add(LSTM(200, return_sequences=True,batch_input_shape=(32,timesteps,data_dim)))
model.add(LSTM(200, return_sequences=True))
model.add(LSTM(100,return_sequences=True))
model.add(LSTM(1,return_sequences=True, activation='linear'))
model.compile(loss='mse',
optimizer='rmsprop',
metrics=['accuracy'])
model.fit(X,Y,epochs=100)
batch_input_shape needs the size of the batch: (numberofSequence,timesteps,data_dim)
input_shape needs only the shape of a sample: (timesteps,data_dim).
Now, there is a problem.... 84199 is not a multiple of 4, how can we expect to reshape it in steps of 4?
There may also be other issues, such as:
If you have one single long sequence, why divide it in steps?
If you intend to use a sliding window case, you need to prepare your data as this: Sample 1 = [step1,step2,step3,step4]; Sample 2 = [step2,step3,step4,step5]; and so on. This will imply in numberofSquence > 1 (something near the 90000)
If you intend to have a single sequence divided because of memory/performance issues, you should be using stateful=True and call model.reset_states() at the beginning of every epoch.

Running LSTM on keras.io for multivariable time-series data (LONG)

I have a data-set of quarterly Economic variables and I want to try and predict some binary variable (government of the day) by utilising an LSTM framework.
The data-set would look something like:
Date X1 X2 X3 X4 X5 Y1
------------------------------------------
01-Jan-1954 0.1 2.4 5.5 0.66 0.2 1
01-April-1954 0.3 2.0 5.0 0.67 0.4 1
.
.
.
01-July-2016 0.4 5.0 1.0 0.88 0.4 0
that is: I want to use X1-X5 in periods 0,...,t to be able to predict Y1 in period t+1.
I have got thus far using the sequential model...
import pandas as pd
from random import random
import numpy as np
from keras.models import Sequential
from keras.layers import Input, Embedding, LSTM, Dense, merge
import scipy
from sklearn.preprocessing import MinMaxScaler
from keras.layers import Input, Embedding, LSTM, Dense, merge
Macro=pd.read_csv('MACROUS2.csv')
Macro=Macro.ix[:,Macro.columns!= 'DATE']
'''
The model architecture is made to pull in
'''
def _load_data(Macro, n_prev = 16):
"""
data should be pd.DataFrame()
"""
docX, docY = [], []
for i in range(len(Macro)-n_prev):
docX.append(Macro.iloc[i:i+n_prev].as_matrix())
docY.append(Macro.iloc[i+n_prev].as_matrix())
alsX = np.array(docX)
alsY = np.array(docY)
return alsX, alsY
def train_test_split(df, test_size=0.1):
"""
This just splits data to training and testing parts
"""
ntrn = round(len(df) * (1 - test_size))
X_train, y_train = _load_data(df.iloc[0:ntrn])
X_test, y_test = _load_data(df.iloc[ntrn:])
return (X_train, y_train[:][0]), (X_test, y_test[:][0])
(X_train, y_train), (X_test, y_test) = train_test_split(Macro)
my data has the shape:
X_train has shape: (209, 16, 6)
y_train has shape: (6,)
X_test has shape: (9, 16, 6)
y_test has shape: (6,)
I had it this shape to reflect a single 'time piece' being 16 quarters (4 years).
I define the architecture in the following way:
in_neurons = np.shape(X_train)[2]
out_neurons = np.shape(y_train)[0]
hidden_neurons = 20
model = Sequential()
model.add(LSTM(output_dim=hidden_neurons, input_dim=in_neurons, return_sequences=False))
model.add(Dense(output_dim=out_neurons, input_dim=hidden_neurons))
model.add(Activation("linear"))
model.compile(loss="mse", optimizer="adam")
However, upon trying to train the model I encounter some problems:
(X_train, y_train), (X_test, y_test) = train_test_split(Macro) # retrieve data
model.fit(X_train, y_train, batch_size=450, nb_epoch=1, validation_split=0.05)
predicted = model.predict(X_test)
rmse = np.sqrt(((predicted - y_test) ** 2).mean(axis=0))
I get the following exception:
Error when checking model target: expected activation_1 to have shape (None, 6) but got array with shape (6, 1)
I am quite new to making RNN's so excuse me if I am ignorant in my questions or how I attempted this problem. Any advice on how to continue and to resolve this Exception would be greatly appreciated!
credit to http://danielhnyk.cz/predicting-sequences-vectors-keras-using-rnn-lstm/ as a major source for how I tried to do this.

Resources