Keras - Text Classification - LSTM - How to input text? - theano

Im trying to understand how to use LSTM to classify a certain dataset that i have.
I researched and found this example of keras and imdb :
https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py
However, im confused about how the data set must be processed to input.
I know keras has pre-processing text methods, but im not sure which to use.
The x contain n lines with texts and the y classify the text by happiness/sadness. Basically, 1.0 means 100% happy and 0.0 means totally sad. the numbers may vary, for example 0.25~~ and so on.
So my question is, How i input x and y properly? Do i have to use bag of words?
Any tip is appreciated!
I coded this below but i keep getting the same error:
#('Bad input argument to theano function with name ... at index 1(0-based)',
'could not convert string to float: negative')
import keras.preprocessing.text
import numpy as np
np.random.seed(1337) # for reproducibility
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
print('Loading data...')
import pandas
thedata = pandas.read_csv("dataset/text.csv", sep=', ', delimiter=',', header='infer', names=None)
x = thedata['text']
y = thedata['sentiment']
x = x.iloc[:].values
y = y.iloc[:].values
###################################
tk = keras.preprocessing.text.Tokenizer(nb_words=2000, filters=keras.preprocessing.text.base_filter(), lower=True, split=" ")
tk.fit_on_texts(x)
x = tk.texts_to_sequences(x)
###################################
max_len = 80
print "max_len ", max_len
print('Pad sequences (samples x time)')
x = sequence.pad_sequences(x, maxlen=max_len)
#########################
max_features = 20000
model = Sequential()
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128, input_length=max_len, dropout=0.2))
model.add(LSTM(128, dropout_W=0.2, dropout_U=0.2))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(x, y=y, batch_size=200, nb_epoch=1, verbose=1, validation_split=0.2, show_accuracy=True, shuffle=True)
# at index 1(0-based)', 'could not convert string to float: negative')

Review how you are using your CSV parser to read the text in. Ensure that the fields are in the format Text, Sentiment if you want to to make use of the parser as you've written it in your code.

Related

Questions about Multitask deep neural network modeling using Keras

I'm trying to develop a multitask deep neural network (MTDNN) to make prediction on small molecule bioactivity against kinase targets and something is definitely wrong with my model structure but I can't figure out what.
For my training data (highly imbalanced data with 0 as inactive and 1 as active), I have 423 unique kinase targets (tasks) and over 400k unique compounds. I first calculate the ECFP fingerprint using smiles, and then I randomly split the input data into train, test, and valid sets based on 8:1:1 ratio using RandomStratifiedSplitter from deepchem package. After training my model using the train set and I want to make prediction on the test set to check model performance.
Here's what my data looks like (screenshot example):
(https://i.stack.imgur.com/8Hp36.png)
Here's my code:
# Import Packages
import numpy as np
import pandas as pd
import deepchem as dc
from sklearn.metrics import roc_auc_score, roc_curve, auc, confusion_matrix
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import initializers, regularizers
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Input, Dropout, Reshape
from tensorflow.keras.optimizers import SGD
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
# Build Model
inputs = keras.Input(shape = (1024, ))
x = keras.layers.Dense(2000, activation='relu', name="dense2000",
kernel_initializer=initializers.RandomNormal(stddev=0.02),
bias_initializer=initializers.Ones(),
kernel_regularizer=regularizers.L2(l2=.0001))(inputs)
x = keras.layers.Dropout(rate=0.25)(x)
x = keras.layers.Dense(500, activation='relu', name='dense500')(x)
x = keras.layers.Dropout(rate=0.25)(x)
x = keras.layers.Dense(846, activation='relu', name='output1')(x)
logits = Reshape([423, 2])(x)
outputs = keras.layers.Softmax(axis=2)(logits)
Model1 = keras.Model(inputs=inputs, outputs=outputs, name='MTDNN')
Model1.summary()
opt = keras.optimizers.SGD(learning_rate=.0003, momentum=0.9)
def loss_function (output, labels):
loss = tf.nn.softmax_cross_entropy_with_logits(output,labels)
return loss
loss_fn = loss_function
Model1.compile(loss=loss_fn, optimizer=opt,
metrics=[keras.metrics.Accuracy(),
keras.metrics.AUC(),
keras.metrics.Precision(),
keras.metrics.Recall()])
for train, test, valid in split2:
trainX = pd.DataFrame(train.X)
trainy = pd.DataFrame(train.y)
trainy2 = tf.one_hot(trainy,2)
testX = pd.DataFrame(test.X)
testy = pd.DataFrame(test.y)
testy2 = tf.one_hot(testy,2)
validX = pd.DataFrame(valid.X)
validy = pd.DataFrame(valid.y)
validy2 = tf.one_hot(validy,2)
history = Model1.fit(x=trainX, y=trainy2,
shuffle=True,
epochs=10,
verbose=1,
batch_size=100,
validation_data=(validX, validy2))
y_pred = Model1.predict(testX)
y_pred2 = y_pred[:, :, 1]
y_pred3 = np.round(y_pred2)
# Check the # of nonzero in assay
(y_pred3!=0).sum () #all 0s
My questions are:
The roc and precision recall are all extremely high (>0.99), but the prediction result of test set contains all 0s, no actives at all. I also use the randomized dataset with same active:inactive ratio for each task to test if those values are too good to be true, and turns out all values are still above 0.99, including roc which is expected to be 0.5.
Can anyone help me to identify what is wrong with my model and how should I fix it please?
Can I use built-in functions in sklearn to calculate roc/accuracy/precision-recall? Or should I manually calculate the metrics based on confusion matrix on my own for multitasking purpose. Why and why not?

Is there any way to fit and apply deep learning algorithm on chemical smiles data in sequential model?

I have written a code for this where my input as an X
X : c1ccccc1 and Y value is water/methanol as classification category.
# multi-class classification with Keras
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
# load dataset
dataframe = pandas.read_csv("ADLV3_1.csv", header=None)
dataset = dataframe.values
X = dataset[:,2:3]
Y = dataset[:,3:4]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)
# define baseline model
def baseline_model():
# create model
model = Sequential()
model.add(Dense(8, input_dim=4, activation='relu'))
model.add(Dense(3, activation='softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
estimator = KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=5, verbose=0)
kfold = KFold(n_splits=10, shuffle=True)
results = cross_val_score(estimator, X, dummy_y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
Code running successfully but getting warning as
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_label.py:235: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_label.py:268: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).
FitFailedWarning)
Baseline: nan% (nan%)
Is there any solution to make the algorithm workable? I can't predict any values

Keras Embedding Layer not accepting input

I am trying to create a forecasting model which takes the lag features along with embeddings to predict next 10 days cumulative value. Embedding layer is trained by using the order basket with gensim.
Below is my network:
from keras.layers import Embedding, Flatten, Input, Dense, Dropout, Flatten, Activation
inp = Input(shape=(1, )) #ucode length will be 1
x = Embedding(len(model.wv.vocab), WV_DIM,
weights=[model.wv.vectors],
trainable=False)(inp)
x = Flatten()(x)
x = Dense(32, activation='relu', name='Embedding_out')(x)
features_input = Input(shape=(122,)) ##lag Features
concat = concatenate([features_input, x],name="ConcatenatedwFeatures")
output = Dense(256, activation="relu",name="L1_Relu")(concat)
output = Dense(128, activation="relu",name="L2_Relu")(output)
output = Dense(1)(output)
EmbeddingModel = Model(inputs=[inp,features_input], outputs=output)
EmbeddingModel.summary()
adam = optimizers.adam(clipvalue=1.,lr=3e-4)
EmbeddingModel.compile(loss='mse',
optimizer=adam,
metrics = ['mae', 'mse'])
hist = EmbeddingModel.fit([ucode_array[20:25],X_train[20:25]], [y_train[20:25]], validation_split=0.05,
epochs=10, batch_size=32)
Error:
ValueError: could not convert string to float: 'I33946'
Input Values:
ucode_array=sales_train_grid['ucode']
ucode_array[20:25]
15683 I33946
15685 I33946
15687 I33946
15688 126310
15689 126310
Name: ucode, dtype: object
Testing if value is present in embedding layer:
test1=model.wv.most_similar(positive=['I00731'], topn=10)
display(test1)
[x[0] for x in test1]
Returns 10 similar objects. Returns none if i had pasted any random values.
Following things tried:
1. ucode_array[20:25].values
2. ucode_array[20:25].values.tolist()
gensim version: 3.4.0
TensorFlow version: 1.12.0
Usually, we have to feed numerical values to the training process.
Making sure converting all the object and strings to embeddings will solve the problem.
putting down basic preprocessing operation which before we actually, for others to find this helpful.
sample code.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(list(model.wv.vocab.keys())
encoded_ucode = tokenizer.texts_to_sequences(ucode_array)
Check this,
use float to convert a string into a float. I guess this will solve your problem.

Training a Neural Network for Word Embedding

Attached is the link file for Entities. I want to train a Neural Network to represent each entity into a vector. Attach is my code for training
import pandas as pd
import numpy as np
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.models import Model
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Input
from keras.layers.embeddings import Embedding
from sklearn.model_selection import train_test_split
file_path = '/content/drive/My Drive/Colab Notebooks/Deep Learning/NLP/Data/entities.txt'
df = pd.read_csv(file_path, delimiter = '\t', engine='python', quoting = 3, header = None)
df.columns = ['Entity']
Entity = df['Entity']
X_train, X_test = train_test_split(Entity, test_size = 0.10)
print('Total Entities: {}'.format(len(Entity)))
print('Training Entities: {}'.format(len(X_train)))
print('Test Entities: {}'.format(len(X_test)))
vocab_size = len(Entity)
X_train_encode = [one_hot(d, vocab_size,lower=True, split=' ') for d in X_train]
X_test_encode = [one_hot(d, vocab_size,lower=True, split=' ') for d in X_test]
model = Sequential()
model.add(Embedding(input_length=1,input_dim=vocab_size, output_dim=100))
model.add(Flatten())
model.add(Dense(vocab_size, activation='softmax'))
model.compile(optimizer='adam', loss='mse', metrics=['acc'])
print(model.summary())
model.fit(X_train_encode, X_train_encode, epochs=20, batch_size=1000, verbose=1)
The following error encountered when I am trying to execute the code.
Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 1 array(s), but instead got the following list of 34826 arrays:
You are passing list of numpy arrays for model.fit. The following code produces list of arrays for x_train_encode and X_test_encode.
X_train_encode = [one_hot(d, vocab_size,lower=True, split=' ') for d in X_train]
X_test_encode = [one_hot(d, vocab_size,lower=True, split=' ') for d in X_test]
Change these lists into numpy array when passing to model.fit method.
X_train_encode = np.array(X_train_encode)
X_test_encode = np.array(X_test_encode)
And I don't see the need to one_hot encode the X_train and X_test, embedding layer expects integer(in your case word indexes) not one hot encoded value of the the words' indexes. So if X_train and X_test are array of the indexes of the words then you can directly feed this into the model.fit method.
EDIT:
Currently 'mse' loss is being used. Since the last layer is softmax layer cross entropy loss is more applicable here. And also the outputs are integer values of a class(words) sparse categorical should be used for loss.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])

Conv2d input parameter mismatch

I am giving variable size images (all 278 images of different size 139 of each category) input to my cnn model. As a fact that cnn required fixed size images, so from here i got solution for this is to make input_shape=(None,Nonen,1) (for tensorflow backend and gray scale). but this solution doesnot work with flatten layer, so from their only i got solution of using GlobleMaxpooling or Globalaveragepooling. So from uses these facrts i am making a cnn model in keras to train my network with following code:
import os,cv2
import numpy as np
from sklearn.utils import shuffle
from keras import backend as K
from keras.utils import np_utils
from keras.models import Sequential
from keras.optimizers import SGD,RMSprop,adam
from keras.layers import Conv2D, MaxPooling2D,BatchNormalization,GlobalAveragePooling2D
from keras.layers import Activation, Dropout, Flatten, Dense
from keras import regularizers
from keras import initializers
from skimage.io import imread_collection
from keras.preprocessing import image
from keras import Input
import keras
from keras import backend as K
#%%
PATH = os.getcwd()
# Define data path
data_path = PATH+'/current_exp'
data_dir_list = os.listdir(data_path)
img_rows=None
img_cols=None
num_channel=1
# Define the number of classes
num_classes = 2
img_data_list=[]
for dataset in data_dir_list:
img_list=os.listdir(data_path+'/'+ dataset)
print ('Loaded the images of dataset-'+'{}\n'.format(dataset))
for img in img_list:
input_img=cv2.imread(data_path + '/'+ dataset + '/'+ img,0)
img_data_list.append(input_img)
img_data = np.array(img_data_list)
if num_channel==1:
if K.image_dim_ordering()=='th':
img_data= np.expand_dims(img_data, axis=1)
print (img_data.shape)
else:
img_data= np.expand_dims(img_data, axis=4)
print (img_data.shape)
else:
if K.image_dim_ordering()=='th':
img_data=np.rollaxis(img_data,3,1)
print (img_data.shape)
#%%
num_classes = 2
#Total 278 sample, 139 for 0 category and 139 for category 1
num_of_samples = img_data.shape[0]
labels = np.ones((num_of_samples,),dtype='int64')
labels[0:138]=0
labels[138:]=1
x,y = shuffle(img_data,labels, random_state=2)
y = keras.utils.to_categorical(y, 2)
model = Sequential()
model.add(Conv2D(32,(2,2),input_shape=(None,None,1),activation='tanh',kernel_initializer=initializers.glorot_uniform(seed=100)))
model.add(Conv2D(32, (2,2),activation='tanh'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (2,2),activation='tanh'))
model.add(Conv2D(64, (2,2),activation='tanh'))
model.add(MaxPooling2D())
model.add(Dropout(0.25))
#model.add(Flatten())
model.add(GlobalAveragePooling2D())
model.add(Dense(256,activation='tanh'))
model.add(Dropout(0.25))
model.add(Dense(2,activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['accuracy'])
model.fit(x, y,batch_size=1,epochs=5,verbose=1)
but i am getting following error:
ValueError: Error when checking input: expected conv2d_1_input to have 4 dimensions, but got array with shape (278, 1)
how to solve it.
In the docs for Conv2D it says that the input tensor has to be in this format:
(samples, channels, rows, cols)
I believe you can't have a variable input size unless your network is fully convolutional.
Maybe what you want to do is to keep it to a fixed input size, and just resize the image to that size before feeding it into your network?
Your array with input data cannot have variable dimensions (this is a numpy limitation).
So the array, instead of being a regular array of numbers with 4 dimensions is being created as an array of arrays.
You should fit each image individually because of this limitation.
for epoch in range(epochs):
for img,class in zip(x,y):
#expand the first dimension of the image to have a batch size
img = img.reshape((1,) + img.shape)) #print and check there are 4 dimensions, like (1, width, height, 1).
class = class.reshape((1,) + class.shape)) #print and check there are two dimensions, like (1, classes).
model.train_on_batch(img,class,....)

Resources