Padding sequences in tensorflow with tf.pad - python-3.x

I am trying to load imdb dataset in python. I want to pad the sequences so that each sequence is of same length. I am currently doing it with numpy. What is a good way to do it in tensorflow with tf.pad. I saw the given here but I dont know how to apply it with a 2 d matrix.
Here is my current code
import tensorflow as tf
from keras.datasets import imdb
max_features = 5000
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
def padSequence(dataset,max_length):
dataset_p = []
for x in dataset:
if(len(x) <=max_length):
dataset_p.append(np.pad(x,pad_width=(0,max_length-len(x)),mode='constant',constant_values=0))
else:
dataset_p.append(x[0:max_length])
return np.array(x_train_p)
max_length = max(len(x) for x in x_train)
x_train_p = padSequence(x_train,max_length)
x_test_p = padSequence(x_test,max_length)
print("input x shape: " ,x_train_p.shape)
Can someone please help ?
I am using tensorflow 1.0
In Response to the comment:
The padding dimensions are given by
# 'paddings' is [[1, 1,], [2, 2]].
I have a 2 d matrix where every row is of different length. I want to be able to pad to to make them of equal length. In my padSequence(dataset,max_length) function, I get the length of every row with len(x) function. Should I just do the same with tf ? Or is there a way to do it like Keras Function
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)

If you want to use tf.pad, according to me you have to iterate for each row.
Code will be something like this:
max_length = 250
number_of_samples = 5
padded_data = np.ndarray(shape=[number_of_samples, max_length],dtype=np.int32)
sess = tf.InteractiveSession()
for i in range(number_of_samples):
reviewToBePadded = dataSet[i] #dataSet numpy array
paddings = [[0,0], [0, maxLength-len(reviewToBePadded)]]
data_tf = tf.convert_to_tensor(reviewToBePadded,tf.int32)
data_tf = tf.reshape(data_tf,[1,len(reviewToBePadded)])
data_tf = tf.pad(data_tf, paddings, 'CONSTANT')
padded_data[i] = data_tf.eval()
print(padded_data)
sess.close()
New to Python, possibly not the best code. But I just want to explain the concept.

Related

using a list as a value in pandas.DataFeame and tensorFlow

Im tring to use list as a value in pandas.DataFrame
but Im getting Exception when trying to use use the adapt function in on the Normalization layer with the NumPy array
this is the error:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
and this is the code:
import pandas as pd
import numpy as np
# Make NumPy printouts easier to read.
np.set_printoptions(precision=3, suppress=True)
import tensorflow as tf
from tensorflow.keras import layers
data = [[45.975, 45.81, 45.715, 45.52, 45.62, 45.65, 4],
[55.67, 55.975, 55.97, 56.27, 56.23, 56.275, 5],
[86.87, 86.925, 86.85, 85.78, 86.165, 86.165, 3],
[64.3, 64.27, 64.285, 64.29, 64.325, 64.245, 6],
[35.655, 35.735, 35.66, 35.69, 35.665, 35.63, 5]
]
lables = [0, 1, 0, 1, 1]
def do():
d_1 = None
for l, d in zip(lables, data):
if d_1 is None:
d_1 = pd.DataFrame({'lable': l, 'close_price': [d]})
else:
d_1 = d_1.append({'lable': l, 'close_price': d}, ignore_index=True)
dataset = d_1.copy()
print(dataset.isna().sum())
dataset = dataset.dropna()
print(dataset.keys())
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)
print(train_dataset.describe().transpose())
train_features = train_dataset.copy()
test_features = test_dataset.copy()
train_labels = train_features.pop('lable')
test_labels = test_features.pop('lable')
print(train_dataset.describe().transpose()[['mean', 'std']])
normalizer = tf.keras.layers.Normalization(axis=-1)
ar = np.array(train_features)
normalizer.adapt(ar)
print(normalizer.mean.numpy())
first = np.array(train_features[:1])
with np.printoptions(precision=2, suppress=True):
print('First example:', first)
print()
print('Normalized:', normalizer(first).numpy())
diraction = np.array(train_features)
diraction_normalizer = layers.Normalization(input_shape=[1, ], axis=None)
diraction_normalizer.adapt(diraction)
diraction_model = tf.keras.Sequential([
diraction_normalizer,
layers.Dense(units=1)
])
print(diraction_model.summary())
print(diraction_model.predict(diraction[:10]))
diraction_model.compile(
optimizer=tf.optimizers.Adam(learning_rate=0.1),
loss='mean_absolute_error')
print(train_features['close_price'])
history = diraction_model.fit(
train_features['close_price'],
train_labels,
epochs=100,
# Suppress logging.
verbose=0,
# Calculate validation results on 20% of the training data.
validation_split=0.2)
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
print(hist.tail())
test_results = {}
test_results['diraction_model'] = diraction_model.evaluate(
test_features,
test_labels, verbose=0)
x = tf.linspace(0.0, 250, 251)
y = diraction_model.predict(x)
print("end")
def main():
do()
if __name__ == "__main__":
main()
I think it is not the usual practice to shrink your features into one column.
Quick-fix is you may put the following line
train_features = np.array(train_features['close_price'].to_list())
before
normalizer = tf.keras.layers.Normalization(axis=-1)
to get rid of the error, but now because your train_features has changed from a DataFrame into a np.array, your subsequent code may suffer, so you need to take care of that too.
If I were you, however, I would have constructed the DataFrame this way
df = pd.DataFrame(data)
df['label'] = lables
Please consider.

How to get a specific sample from pytorch DataLoader?

In Pytorch, is there any way of loading a specific single sample using the torch.utils.data.DataLoader class? I'd like to do some testing with it.
The tutorial uses
trainloader = torch.utils.data.DataLoader(...)
images, labels = next(iter(trainloader))
to fetch a random batch of samples. Is there are way, using DataLoader, to get a specific sample?
Cheers
Turn off the shuffle in DataLoader
Use batch_size to calculate the batch in which the desired sample you are looking for falls in
Iterate to the desired batch
Code
import torch
import numpy as np
import itertools
X= np.arange(100)
batch_size = 2
dataloader = torch.utils.data.DataLoader(X, batch_size=batch_size, shuffle=False)
sample_at = 5
k = int(np.floor(sample_at/batch_size))
my_sample = next(itertools.islice(dataloader, k, None))
print (my_sample)
Output:
tensor([4, 5])
if you want to get a specific signle sample from your dataset you can
you should check Subset class.(https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset)
something like this:
indices = [0,1,2] # select your indices here as a list
subset = torch.utils.data.Subset(train_set, indices)
trainloader = DataLoader(subset , batch_size = 16 , shuffle =False) #set shuffle to False
for image , label in trainloader:
print(image.size() , '\t' , label.size())
print(image[0], '\t' , label[0]) # index the specific sample
here is a useful link if you want to learn more about the Pytorch data loading utility
(https://pytorch.org/docs/stable/data.html)

BERT binary Textclassification get different results every run

I do binary text classification with BERT from the Simpletransformer.
I work in Colab with GPU runtime type.
I have generated train and test set with the sklearn StratifiedKFold Method. I have two files with the dictionaries containing my folds.
I run my classification in the following while loop:
from sklearn.metrics import matthews_corrcoef, f1_score
import sklearn
counter = 0
resultatos = []
while counter != len(trainfolds):
model = ClassificationModel('bert', 'bert-base-multilingual-cased',args={'num_train_epochs': 4, 'learning_rate': 1e-5, 'fp16': False,
'max_seq_length': 160, 'train_batch_size': 24,'eval_batch_size': 24 ,
'warmup_ratio': 0.0,'weight_decay': 0.00,
'overwrite_output_dir': True})
print("start with fold_{}".format(counter))
trainfolds["{}_fold".format(counter)].to_csv("/content/data/train.tsv", sep="\t", index = False, header=False)
print("{}_fold Train als train.tsv exportiert". format(counter))
testfolds["{}_fold".format(counter)].to_csv("/content/data/dev.tsv", sep="\t", index = False, header=False)
print("{}_fold test als train.tsv exportiert". format(counter))
train_df = pd.read_csv("/content/data/train.tsv", delimiter='\t', header=None)
eval_df = df = pd.read_csv("/content/data/dev.tsv", delimiter='\t', header=None)
train_df = pd.DataFrame({
'text': train_df[3].replace(r'\n', ' ', regex=True),
'label':train_df[1]})
eval_df = pd.DataFrame({
'text': eval_df[3].replace(r'\n', ' ', regex=True),
'label':eval_df[1]})
model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df, f1 = sklearn.metrics.f1_score)
print(result)
resultatos.append(result)
shutil.rmtree("outputs")
shutil.rmtree("cache_dir")
#shutil.rmtree("runs")
counter += 1
And i get different Results Running this code for the same Folds:
Here for example the F1 Scores for two runs:
0.6237942122186495
0.6189111747851003
0.6172839506172839
0.632183908045977
0.6182965299684542
0.5942492012779553
0.6025641025641025
0.6153846153846154
0.6390532544378699
0.6627906976744187
The F1 Score is: 0.6224511646974427
0.6064516129032258
0.6282420749279539
0.6402439024390244
0.5971014492753622
0.6135693215339232
0.6191950464396285
0.6382978723404256
0.6388059701492537
0.6097560975609756
0.5956112852664576
The F1 Score is: 0.618727463283623
How can they be that diffeerent for the same folds?
What i tried already is give a fixed Random seed right before my loop starts:
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
I came up with approach of having the Model initialized in the loop because, when its outside the loop, it somehow remembers what it has learned - that means after the 2nd fold I get f1 score of almost one - despite the fact that i delete the cache..
I figured it out myself, just set all seeds plus torch.backends.cudnn.deterministic = True
and
torch.backends.cudnn.benchmark = False
like shown in this post and i get the same results for all runs!

What would be the equivalent of keras.layers.Masking in pytorch?

I have time-series sequences which I needed to keep the length of sequences fixed to a number by padding zeroes into matrix and using keras.layers.Masking in keras I could neglect those padded zeros for further computations, I am wondering how could it be done in Pytorch?
Either I need to do the padding in pytroch and pytorch can't handle the sequences with varying lengths what is the equivalent to Masking layer of keras in pytorch, or if pytorch handles the sequences with varying lengths, how could it be done?
You can use PackedSequence class as equivalent to keras masking. you can find more features at torch.nn.utils.rnn
Here putting example from packing for variable-length sequence inputs for rnn
import torch
import torch.nn as nn
from torch.autograd import Variable
batch_size = 3
max_length = 3
hidden_size = 2
n_layers =1
# container
batch_in = torch.zeros((batch_size, 1, max_length))
#data
vec_1 = torch.FloatTensor([[1, 2, 3]])
vec_2 = torch.FloatTensor([[1, 2, 0]])
vec_3 = torch.FloatTensor([[1, 0, 0]])
batch_in[0] = vec_1
batch_in[1] = vec_2
batch_in[2] = vec_3
batch_in = Variable(batch_in)
seq_lengths = [3,2,1] # list of integers holding information about the batch size at each sequence step
# pack it
pack = torch.nn.utils.rnn.pack_padded_sequence(batch_in, seq_lengths, batch_first=True)
>>> pack
PackedSequence(data=Variable containing:
1 2 3
1 2 0
1 0 0
[torch.FloatTensor of size 3x3]
, batch_sizes=[3])
# initialize
rnn = nn.RNN(max_length, hidden_size, n_layers, batch_first=True)
h0 = Variable(torch.randn(n_layers, batch_size, hidden_size))
#forward
out, _ = rnn(pack, h0)
# unpack
unpacked, unpacked_len = torch.nn.utils.rnn.pad_packed_sequence(out)
>>> unpacked
Variable containing:
(0 ,.,.) =
-0.7883 -0.7972
0.3367 -0.6102
0.1502 -0.4654
[torch.FloatTensor of size 1x3x2]
more you would find this article useful. [Jum to Title - "How the PackedSequence object works"] - link
You can use a packed sequence to mask a timestep in the sequence dimension:
batch_mask = ... # boolean mask e.g. (seq x batch)
# move `padding` at right place then it will be cut when packing
compact_seq = torch.zeros_like(x)
for i, seq_len in enumerate(batch_mask.sum(0)):
compact_seq[:seq_len, i] = x[batch_mask[:,i],i]
# pack in sequence dimension (the number of agents)
packed_x = pack_padded_sequence(compact_seq, batch_mask.sum(0).cpu().numpy(), enforce_sorted=False)
packed_scores, rnn_hxs = nn.GRU(packed_x, rnn_hxs)
# restore sequence dimension
scores, _ = pad_packed_sequence(packed_scores)
# restore order, moving padding in its place
scores = torch.zeros((*batch_mask.shape,scores.size(-1))).to(scores.device).masked_scatter(batch_mask.unsqueeze(-1), scores)
instead use a mask select/scatter to mask in the batch dimension:
batch_mask = torch.any(x, -1).unsqueeze(-1) # boolean mask (batch,1)
batch_x = torch.masked_select(x, batch_mask).reshape(-1, x.size(-1))
batch_rnn_hxs = torch.masked_select(rnn_hxs, batch_mask).reshape(-1, rnn_hxs.size(-1))
batch_rnn_hxs = nn.GRUCell(batch_x, batch_rnn_hxs)
rnn_hxs = rnn_hxs.masked_scatter(batch_mask, batch_rnn_hxs) # restore batch
Note that using scatter function is safe for gradient backpropagation

Pad data using tf.data.Dataset

I have to use tf.data.Dataset for creating a input pipeline for an RNN model in tensorflow. I am providing a basic code, by which I need to pad the data in batch with a pad token and use it for further manipulation.
import pandas as pd
import numpy as np
import tensorflow as tf
import functools
total_data_size = 10000
embedding_dimension = 25
max_len = 17
varying_length = np.random.randint(max_len, size=(10000)) # varying length data
X = np.array([np.random.randint(1000, size=(value)).tolist()for index, value in enumerate(varying_length)]) # data of arying length
Y = np.random.randint(2, size=(total_data_size)).astype(np.int32) # target binary
embedding = np.random.uniform(-1,1,(1000, embedding_dimension)) # word embedding
def gen():
for index in range(len(X)):
yield X[index] , Y[index]
dataset = tf.data.Dataset.from_generator(gen,(tf.int32,tf.int32))
dataset = dataset.batch(batch_size=25)
padded_shapes = (tf.TensorShape([None])) # sentence of unknown size
padding_values = (tf.constant(-111)) # the value with which pad index needs to be filled
dataset = (dataset
.padded_batch(25, padded_shapes=padded_shapes, padding_values=padding_values)
)
iter2 = dataset.make_initializable_iterator()
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
sess.run(iter2.initializer)
print(sess.run(iter2.get_next()))
I hope the code is self explanatory with comments. But I am getting following error,
InvalidArgumentError (see above for traceback): Cannot batch tensors with different shapes in component 0. First element had shape [11] and element 1 had shape [12].
[[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,?], [?]], output_types=[DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
I believe that since your generator yields two outputs, your padded_shapes and padded_values tuples must have a length of two. For me, this works:
dataset = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32))
dataset = dataset.batch(batch_size=25)
padded_shapes = (tf.TensorShape([None]), tf.TensorShape([None])) # sentence of unknown size
padding_values = (tf.constant(-111), tf.constant(-111)) # the value with which pad index needs to be filled
dataset = (dataset
.padded_batch(25, padded_shapes=padded_shapes, padding_values=padding_values)
)
iter2 = dataset.make_initializable_iterator()
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
sess.run(iter2.initializer)
Finally got the answer. The issue was for the second padded shapes instead of Tensorshape([None]), we should provide [], because the second item returned by the generator is a scalar. If using Tensorshape([None]),, make sure we are returning a vector
import pandas as pd
import numpy as np
import tensorflow as tf
import functools
total_data_size = 10000
embedding_dimension = 25
max_len = 17
varying_length = np.random.randint(max_len, size=(10000)) # varying length data
X = np.array([np.random.randint(1000, size=(value)).tolist()for index, value in enumerate(varying_length)]) # data of arying length
Y = np.random.randint(2, size=(total_data_size)).astype(np.int32) # target binary
embedding = np.random.uniform(-1,1,(1000, embedding_dimension)) # word embedding
def gen():
for index in range(len(X)):
yield X[index] , Y[index]
dataset = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32), (tf.TensorShape([None]), []))
padded_shapes = (tf.TensorShape([None]), []) # sentence of unknown size
dataset = (dataset
.padded_batch(25, padded_shapes=padded_shapes, padding_values=(-111, 0))
)
iter2 = dataset.make_initializable_iterator()
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
sess.run(iter2.initializer)
sess.run(iter2.get_next())

Resources