Why is keras tokenizer applying lowercase() to its own tokens? - keras

I am running my first cnn text-classifier using the IMDB dataset with the in-built
tf.keras.datasets.imdb.load_data()
I understand that the AttributeError: 'int' object has no attribute 'lower' error indicates a lowercase function is being applied to int objects (seemingly by the tokenizer). However, I don't know why it is thrown in this case, as I am loading the data directly through the in-built tf.keras.datasets.imdb.load_data().
I am not experienced with using embeddings in text classification.
The code excluding the CNN model is:
import tensorflow as tf
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding, LSTM
from keras.layers import Conv1D, Flatten, MaxPooling1D
from keras.datasets import imdb
import wandb
from wandb.keras import WandbCallback
import numpy as np
from keras.preprocessing import text
import imdb
wandb.init(mode="disabled") # disabled for debugging
config = wandb.config
# set parameters:
config.vocab_size = 1000
config.maxlen = 1000
config.batch_size = 32
config.embedding_dims = 10
config.filters = 16
config.kernel_size = 3
config.hidden_dims = 250
config.epochs = 10
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.imdb.load_data()
tokenizer = text.Tokenizer(num_words=config.vocab_size)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_matrix(X_train)
X_test = tokenizer.texts_to_matrix(X_test)
X_train = sequence.pad_sequences(X_train, maxlen=config.maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=config.maxlen)
Line 34 referred to in the error is tokenizer.fit_on_texts(X_train)
The exact error thrown (includes Deprecation warnings) is:
C:\Users\Keegan\anaconda3\envs\oldK\lib\site-
packages\tensorflow_core\python\keras\datasets\imdb.py:129:
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-
or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If
you meant to do this, you must specify 'dtype=object' when creating the ndarray.
x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
C:\Users\Keegan\anaconda3\envs\oldK\lib\site-
packages\tensorflow_core\python\keras\datasets\imdb.py:130: VisibleDeprecationWarning: Creating
an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or
ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must
specify 'dtype=object' when creating the ndarray.
x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])
Traceback (most recent call last):
File "imdb-cnn.py", line 34, in <module>
tokenizer.fit_on_texts(X_train)
File "C:\Users\Keegan\anaconda3\envs\oldK\lib\site-packages\keras_preprocessing\text.py",
line 217, in fit_on_texts
text = [text_elem.lower() for text_elem in text]
File "C:\Users\Keegan\anaconda3\envs\oldK\lib\site-packages\keras_preprocessing\text.py", line 217, in <listcomp>
text = [text_elem.lower() for text_elem in text]
AttributeError: 'int' object has no attribute 'lower'
The Anaconda venv has Python 3.7.1, Tensorflow 2.1.0 and Keras 2.3.1

The Keras tokenizer has an attribute lower which can be set either to True or False.
I guess the reason why the pre-packaged IMDB data is lower-cased by default is that the dataset is pretty small. If you did not lower-case it, the capitalized and lower-cased words would get different embeddings, but the capitalized forms probably do not appear frequently enough in the training data to train the embeddings appropriately. This of course changes once you use pre-trained embeddings or pre-trained contextualized models such as BERT, which were pre-trained on large data.
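As a hedged illustration of that flag (the sample strings below are made up): lower is a constructor argument of the Tokenizer, and it only comes into play when fitting on raw strings, not on the integer-encoded sequences that imdb.load_data() returns.
from keras.preprocessing.text import Tokenizer
# lower=False keeps capitalized and lower-cased forms as distinct tokens
tokenizer = Tokenizer(num_words=1000, lower=False)
tokenizer.fit_on_texts(["Some RAW review", "Another Example review"])
print(tokenizer.word_index)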

Related

How can I use LSTM with pretrained static word vectors on aclImdb dataset

I am trying to do sentiment classification with an LSTM and pre-trained BERT embeddings, and later language translation with a Transformer.
First of all I installed
!pip install ktrain
!pip install tensorflow_text
And I imported the necessary libraries
import pathlib
import random
import numpy as np
from typing import Tuple, List
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
# tensorflow imports
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import (
TextVectorization, LSTM, Dense, Embedding, Dropout,
Layer, Input, MultiHeadAttention, LayerNormalization)
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.initializers import Constant
from tensorflow.keras import backend as K
import tensorflow_text as tf_text
import ktrain
from ktrain import text
And I downloaded and extracted the Large Movie Review dataset from Stanford
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xzf aclImdb_v1.tar.gz
1- I try to use an LSTM, creating the training and test sets with the texts_from_folder function of the ktrain.text module
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID";
os.environ["CUDA_VISIBLE_DEVICES"]="0";
DATADIR ='/content/aclImdb'
trn, val, preproc = text.texts_from_folder(DATADIR,max_features=20000, maxlen=400, ngram_range=1, preprocess_mode='standard', train_test_names=['train', 'test'],classes=['pos', 'neg'])
And I am trying to build the LSTM model here:
K.clear_session()
def build_LSTM_model(
        embedding_size: int,
        total_words: int,
        lstm_hidden_size: int,
        dropout_rate: float) -> Sequential:
    model = Sequential()
    model.add(Embedding(input_dim=total_words, output_dim=embedding_size, input_length=total_words))
    model.add(LSTM(lstm_hidden_size, return_sequences=True, name="lstm_layer"))
    model.add(GlobalMaxPool1D())
    # model.add(Dense(total_words, activation='softmax'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(MAX_SEQUENCE_LEN, activation="relu"))
    # adam = Adam(lr=0.01)
    model.compile(loss='CategoricalCrossentropy', optimizer=Adam(lr=0.01), metrics=['CategoricalAccuracy'])
    model.summary()
    return model
with the following requirements for a sequential model. The model should include:
One Embedding layer at the beginning (watch out for proper parameterization!).
At least one LSTM layer.
At least one Dropout layer for regularization.
One final Dense layer mapping to the outputs.
Compile the model with categorical_crossentropy loss and the adam optimizer; you might also want to add other metrics, for example CategoricalAccuracy makes sense here.
And then I want to use the ktrain library's get_learner method to create an easily trainable version of the previous model, using the test set as val_data to see the performance (this does not include a proper train-validation-test split, but it could be extended if required).
I am using the learner's lr_find and lr_plot methods to determine the most effective learning rate for the model, specifying the max_epochs parameter of lr_find (a couple of epochs!) to limit the time this takes, and picking the best learning rate from the plot as a balance between the fastest convergence and stability.
learner: ktrain.Learner
model = text.text_classifier('bert', trn , preproc=preproc)
learner.lr_find()
learner.lr_plot()
learner.fit_onecycle(1e-4, 1)
I faced the following error:
ValueError Traceback (most recent call last)
in ()
6 # workers=8, use_multiprocessing=False, batch_size=64)
7
----> 8 model = text.text_classifier('bert', trn , preproc=preproc)
10 # learner.lr_find()
1 frames
/usr/local/lib/python3.7/dist-packages/ktrain/text/models.py in _text_model(name, train_data, preproc, multilabel, classification, metrics, verbose)
109 raise ValueError(
110 "if '%s' is selected model, then preprocess_mode='%s' should be used and vice versa"
--> 111 % (BERT, BERT)
112 )
113 is_huggingface = U.is_huggingface(data=train_data)
ValueError: if 'bert' is selected model, then preprocess_mode='bert' should be used and vice versa
And the next step is to do it with an LSTM with pretrained static word vectors.
If you're using BERT for pretrained word vectors supplied as features to an LSTM, then you don't need to build a separate BERT classification model. You can use TransformerEmbedding to generate word vectors for your dataset (or use sentence-transformers):
In [1]: from ktrain.text import TransformerEmbedding
In [2]: te = TransformerEmbedding('bert-base-cased')
In [3]: te.embed('George Washington went to Washington .').shape
Out[3]: (1, 6, 768)
This is what the included NER models in ktrain do under-the-hood.
Also, the input feature format for a BERT model is completely different than input features for an LSTM. As the error message indicates, to preprocess your texts for BERT classification model, you'll need to supply preprocess_mode='bert' to texts_from_folder.
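A hedged sketch of that fix, reusing the names from the question (DATADIR, trn, val, preproc, learner); the exact argument values are illustrative, the point is only that preprocess_mode must match the chosen model:
trn, val, preproc = text.texts_from_folder(
    DATADIR, maxlen=400,
    preprocess_mode='bert',  # must match the 'bert' model below
    train_test_names=['train', 'test'],
    classes=['pos', 'neg'])
model = text.text_classifier('bert', trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)  # small batches: BERT is memory-hungry
learner.lr_find(max_epochs=2)
learner.lr_plot()
learner.fit_onecycle(2e-5, 1)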

Using an Embedding layer in the Keras functional API

Doing basic things with the Keras Functional API seems to produce errors. For example, the following fails:
from keras.layers import InputLayer, Embedding
input = InputLayer(name="input", input_shape=(1, ))
embedding = Embedding(10000, 64)(input)
This produces the error:
AttributeError: 'str' object has no attribute 'base_dtype'
I can then "cheat" by using the input_length argument but this then fails when I try to concatenate two such embeddings:
from keras.layers import InputLayer, Embedding, Concatenate
embedding1 = Embedding(10000, 64, input_length=1)
embedding2 = Embedding(10000, 64, input_length=1)
concat = Concatenate()([embedding1 , embedding2])
This gives the error:
TypeError: 'NoneType' object is not subscriptable
Same error when I use "concatenate" (lower case) instead (some sources seem to say that this should be used instead if using the functional API).
What am I doing wrong?
I am on tensorflow version 2.3.1, keras version 2.4.3, python version 3.6.7
I strongly suggest using tf.keras and not keras.
It doesn't work because InputLayer is an instance of keras.Layer, whereas keras.layers.Input() returns a Tensor. The argument to layer.__call__() should be a Tensor, not a Layer.
import tensorflow as tf
inputs = tf.keras.layers.Input((1,))
print(type(inputs)) # <class 'tensorflow.python.framework.ops.Tensor'>
input_layer = tf.keras.layers.InputLayer(input_shape=(1,))
print(type(input_layer)) # <class 'tensorflow.python.keras.engine.input_layer.InputLayer'>
You use InputLayer with Sequential API. When you use functional API you should use tf.keras.layers.Input() instead:
import tensorflow as tf
inputs = tf.keras.layers.Input((1, ), name="input", )
embedding = tf.keras.layers.Embedding(10000, 64)(inputs)
Same with the second example:
import tensorflow as tf
inputs = tf.keras.layers.Input((1, ), name="input", )
embedding1 = tf.keras.layers.Embedding(10000, 64)(inputs)
embedding2 = tf.keras.layers.Embedding(10000, 64)(inputs)
concat = tf.keras.layers.Concatenate()([embedding1, embedding2])
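If it helps, here is a minimal hedged continuation (the Flatten/Dense head is just illustrative) that wraps the concatenation in a Model, to confirm the graph builds:
flat = tf.keras.layers.Flatten()(concat)
output = tf.keras.layers.Dense(1, activation="sigmoid")(flat)
model = tf.keras.Model(inputs=inputs, outputs=output)
model.summary()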

Calling K.eval() on input_tensor inside keras custom loss function?

I'm trying to convert the input tensor to a numpy array inside a custom keras loss function, after following the instructions here.
The above code runs on my machine with no errors. Now, I want to extract a numpy array with values from the input tensor. However, I get the following error:
"tensorflow.python.framework.errors_impl.InvalidArgumentError: You
must feed a value for placeholder tensor 'input_1' with dtype float
[[Node: input_1 = Placeholderdtype=DT_FLOAT, shape=[],
_device="/job:localhost/replica:0/task:0/cpu:0"]]"
I need to convert to a numpy array because I have other keras models that must operate on the input - I haven't shown those lines below in joint_loss, but even the code sample below doesn't run at all.
import numpy as np
from keras.models import Model, Sequential
from keras.layers import Dense, Activation, Input
import keras.backend as K
def joint_loss_wrapper(x):
    def joint_loss(y_true, y_pred):
        x_val = K.eval(x)
        return y_true - y_pred
    return joint_loss
input_tensor = Input(shape=(6,))
hidden1 = Dense(30, activation='relu')(input_tensor)
hidden2 = Dense(40, activation='sigmoid')(hidden1)
out = Dense(1, activation='sigmoid')(hidden2)
model = Model(input_tensor, out)
model.compile(loss=joint_loss_wrapper(input_tensor), optimizer='adam')
I figured it out!
What you want to do is use the Functional API for Keras.
Then your submodels' outputs can be obtained as tensors via y_pred_submodel = submodel(x).
This is similar to how a Keras layer operates on a tensor.
Manipulate only tensors within the loss function. That should work fine.
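A hedged sketch of that pattern, where submodel stands in for whatever auxiliary Functional-API model needs to operate on the input (the combined loss shown is only an example):
def joint_loss_wrapper(x, submodel):
    def joint_loss(y_true, y_pred):
        aux_pred = submodel(x)  # a tensor, so everything stays symbolic; no K.eval needed
        return K.mean(K.square(y_true - y_pred)) + K.mean(K.square(aux_pred))
    return joint_loss

model.compile(loss=joint_loss_wrapper(input_tensor, submodel), optimizer='adam')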

Keras: Functional API -- Layer Datatype Error

I am trying to separate each of the outputs of a keras Conv2D layer using a for loop, and then adding another layer to it through the Functional API, but I am getting a type error. The code is:
import keras
from keras.models import Sequential, Model
from keras.layers import Flatten, Dense, Dropout, Input, Activation
from keras.layers.convolutional import Conv2D, MaxPooling2D, ZeroPadding2D
from keras.layers.merge import Add
from keras.optimizers import SGD
import cv2, numpy as np
import glob
import csv
def conv_layer():
    input = Input(shape=(3,224,224))
    k = 64
    x = np.empty(k, dtype=object)
    y = np.empty(k, dtype=object)
    z = np.empty(k, dtype=object)
    for i in range(0, k):
        x[i] = Conv2D(1, (3,3), data_format='channels_first', padding='same')(input)
        y[i] = Conv2D(1, (3,3), data_format='channels_first', padding='same')(x[i])
        z[i] = keras.layers.add([x[i], y[i]])
    out = Activation('relu')(z)
    model = Model(input, out, name='split-layer-model')
    return model
But, it is throwing the following error:
Traceback (most recent call last):
File "vgg16-local-connections.py", line 352, in <module>
model = VGG_16_local_connections()
File "vgg16-local-connections.py", line 40, in VGG_16_local_connections
out = Activation('relu')(z)
File "/Users/klab/anaconda2/lib/python2.7/site-packages/keras/engine/topology.py", line 519, in __call__
input_shapes.append(K.int_shape(x_elem))
File "/Users/klab/anaconda2/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 409, in int_shape
shape = x.get_shape()
AttributeError: 'numpy.ndarray' object has no attribute 'get_shape'
So, the datatype of z does not match the one of the Functional API. How can I fix this? Any help will be deeply appreciated!
I think you meant:
out = Activation('relu')(z[k - 1])
Your code passes the whole numpy vector z, containing all the layer outputs, as input to Activation, which Keras does not know how to handle.
Since I had defined the z[i]-s as separate layers, I thought z would effectively be a stack of those z[i]-s. But they basically had to be concatenated to make the stack I wanted:
z = keras.layers.concatenate([z[i] for i in range (0,k)], axis=1)
out = Activation('relu')(z)
Since I was using data_format='channels_first', the concatenation was done with axis=1, but for the more common data_format='channels_last', the concatenation has to be done with axis=3.

Using model.pop() changes the model's summary but does not affect the output

I am trying to remove the top layers from a model I have previously trained.
This is the code I use:
import os
import h5py
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, ZeroPadding2D
from keras.layers import Activation, Dropout, Flatten, Dense
# KERAS_BACKEND=theano python
import keras
keras.backend.set_image_dim_ordering("th")
img_width, img_height = 150, 150
data_dir = '//shared_directory/projects/try_CD/data/validation'
nb_train_samples = 2000
nb_validation_samples = 800
nb_epoch = 50
def make_bottleneck_features(model):
    datagen = ImageDataGenerator(rescale=1./255)
    generator = datagen.flow_from_directory(
        data_dir,
        target_size=(img_width, img_height),
        batch_size=32,
        class_mode=None,
        shuffle=False)
    bottleneck_features = model.predict_generator(generator, nb_validation_samples)
    return (bottleneck_features)
model=keras.models.load_model('/shared_directory/projects/think_exp/CD_M1.h5')
A = make_bottleneck_features(model)
model.summary()
for i in range(6):
    model.pop()
B = make_bottleneck_features(model)
model.summary()
Comparing the results of the two calls to model.summary(), I can see that the 6 topmost layers were indeed removed.
However, the model's output (saved to A and B) does not change after discarding these layers.
What is the source of that discrepancy?
How can I retrieve the output of the desired layer instead of that of the entire model?
Thanks in advance!
You can't drop layers like that; for it to have an effect, you need to recompile the model (i.e. call model.compile).
But that's not the best way to obtain outputs from intermediate layers: you can just use K.function (where K is keras.backend) to build a function from the input to one of the layers and then call that function. More details are available in this answer.
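A minimal sketch of that K.function approach, assuming (hypothetically) that the layer of interest is the 7th from the top and that x_batch is a numpy batch of inputs:
from keras import backend as K
get_intermediate = K.function([model.input], [model.layers[-7].output])
intermediate_output = get_intermediate([x_batch])[0]  # output of the chosen intermediate layer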
