I'm trying to fine-tune BERT for document classification.
I started by tokenizing the documents to generate the input_ids, attention_mask, and token_type_ids lists to feed into my TFBertModel:
def tokenize_sequences(tokenizer, max_length, corpus):
    input_ids = []
    token_type_ids = []
    attention_masks = []
    for i in tqdm(range(len(corpus))):
        encoded = tokenizer.encode_plus(
            corpus[i],
            max_length=max_length,
            add_special_tokens=True,
            padding='max_length',
            truncation=True,
            return_token_type_ids=True,
            return_attention_mask=True,  # attention mask so the model does not focus on pad tokens
        )
        input_ids.append(encoded["input_ids"])
        attention_masks.append(encoded["attention_mask"])
        token_type_ids.append(encoded["token_type_ids"])
    input_ids = tf.convert_to_tensor(input_ids)
    attention_masks = tf.convert_to_tensor(attention_masks)
    token_type_ids = tf.convert_to_tensor(token_type_ids)
    # print(input_ids.shape, attention_masks.shape, token_type_ids.shape)
    return [input_ids, attention_masks, token_type_ids]
Then, I tried to fit my model:
x_train = tokenize_sequences(tokenizer, MAXLEN, corpus_train)
model = loadBertModel()
model.fit(
    x_train, y_bin_train,
    epochs=N_EPOCHS,
    verbose=1,
    batch_size=4,
)
And I get this error:
InvalidArgumentError: indices[3] = [1,5] is out of order. Many sparse ops require sorted indices.
Use tf.sparse.reorder to create a correctly ordered copy.
I tried to solve the issue following this suggestion, by modifying the input_ids, attention_masks, and token_type_ids tensors returned by tokenize_sequences:
input_ids = tf.sparse.reorder(input_ids)
attention_masks = tf.sparse.reorder(attention_masks)
token_type_ids = tf.sparse.reorder(token_type_ids)
But then another error occurred:
TypeError: Input must be a SparseTensor.
PS: When I checked the type of my tensors, I noticed that they were <class 'tensorflow.python.framework.ops.EagerTensor'>.
Any ideas on how to solve this?
I don't have enough reputation to comment, so I'm answering instead.
It seems this question is the same as yours:
Multiclass text classification TypeError: Input must be a SparseTensor
In my case, I solved a similar issue by simply converting the inputs using .toarray() instead of trying to reorder them.
input_ids = input_ids.toarray()
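Note that .toarray() exists on scipy sparse matrices (for example, labels produced by a binarizer or vectorizer with sparse output), not on TensorFlow EagerTensors, so first check which of your arrays is actually sparse. A minimal sketch of that idea, under the assumption that the sparse object here is the label matrix y_bin_train:
import numpy as np
from scipy import sparse

# Assumption: y_bin_train came out of a sparse binarizer/vectorizer.
if sparse.issparse(y_bin_train):
    y_bin_train = y_bin_train.toarray()  # dense NumPy array that model.fit() accepts

model.fit(
    x_train, y_bin_train,
    epochs=N_EPOCHS,
    verbose=1,
    batch_size=4,
)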
I want to fine-tune the BLIP model on the ROCO dataset for captioning chest X-ray images, but I am getting an error about integer indexing.
Can anyone please help me understand the cause of the error and how to fix it?
This is the code:
def read_data(filepath, csv_path, n_samples):
    df = pd.read_csv(csv_path)
    images = []
    capts = []
    for idx in range(len(df)):
        if 'hest x-ray' in df['caption'][idx] or 'hest X-ray' in df['caption'][idx]:
            if len(images) > n_samples:
                break
            else:
                images.append(Image.open(os.path.join(filepath, df['name'][idx])).convert('L'))
                capts.append(df['caption'][idx])
    return images, capts
def get_data():
    imgtrpath = 'all_data/train/radiology/images'
    trcsvpath = 'all_data/train/radiology/traindata.csv'
    imgtspath = 'all_data/test/radiology/images'
    tscsvpath = 'all_data/test/radiology/testdata.csv'
    imgvalpath = 'all_data/validation/radiology/images'
    valcsvpath = 'all_data/validation/radiology/valdata.csv'

    print('Extracting Training Data')
    trainimgs, traincapts = read_data(imgtrpath, trcsvpath, 1800)
    print('Extracting Testing Data')
    testimgs, testcapts = read_data(imgtrpath, trcsvpath, 100)
    print('Extracting Validation Data')
    valimgs, valcapts = read_data(imgtrpath, trcsvpath, 100)
    return trainimgs, traincapts, testimgs, testcapts, valimgs, valcapts
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
trainimgs, traincapts, testimgs, testcapts, valimgs, valcapts = get_data()
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
metric = evaluate.load("accuracy")
traindata = processor(text=traincapts, images=trainimgs, return_tensors="pt", padding=True, truncation=True)
evaldata = processor(text=testcapts, images=testimgs, return_tensors="pt", padding=True, truncation=True)
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=traindata,
    eval_dataset=evaldata,
    compute_metrics=compute_metrics,
)
trainer.train()
The code is meant to fine-tune the BLIP model on chest X-ray images from the ROCO dataset for image captioning.
But when I run it, I am getting this error:
File "C:\Users\omair\anaconda3\envs\torch\lib\site-packages\transformers\feature_extraction_utils.py", line 86, in __getitem__
raise KeyError("Indexing with integers is not available when using Python based feature extractors")
KeyError: 'Indexing with integers is not available when using Python based feature extractors'
There are two issues here:
1. You're not providing labels during training; your ...capts are passed as the model's "question". There is an example of how to do that in the link below.
2. Fine-tuning HF's BlipForConditionalGeneration is not supported at the moment; see https://discuss.huggingface.co/t/finetune-blip-on-customer-dataset-20893/28446, where they just fixed BlipForQuestionAnswering. If you create a dataset based on that link, you will also get the error ValueError: Expected input batch_size (0) to match target batch_size (511), which can be solved if you put in the effort to reproduce the changes made for BlipForQuestionAnswering in BlipForConditionalGeneration.
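For the first point, here is a minimal sketch of wrapping the images and captions in a standard PyTorch Dataset so that each item can be indexed by an integer and carries labels. CaptionDataset is a hypothetical name for illustration, not code from the linked thread, and this only addresses the indexing/labels part, not the second point above.
import torch
from torch.utils.data import Dataset

class CaptionDataset(Dataset):  # hypothetical helper, for illustration only
    def __init__(self, images, captions, processor, max_length=128):
        self.images = images
        self.captions = captions
        self.processor = processor
        self.max_length = max_length

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # Process one image/caption pair so the Trainer can index items by integer.
        encoding = self.processor(
            images=self.images[idx],
            text=self.captions[idx],
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        item = {k: v.squeeze(0) for k, v in encoding.items()}
        # Use the caption token ids as the training target.
        item["labels"] = item["input_ids"].clone()
        return item

traindata = CaptionDataset(trainimgs, traincapts, processor)
evaldata = CaptionDataset(testimgs, testcapts, processor)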
I have successfully built a sentiment analysis tool with BertForSequenceClassification from huggingface/transformers to classify $tsla tweets as positive or negative.
However, I can't figure out how to obtain the feature vector for each tweet (more specifically, the embedding of [CLS]) from my fine-tuned model.
More info on the model used:
model = BertForSequenceClassification.from_pretrained(OUTPUT_DIR, num_labels=num_labels)
model.config.output_hidden_states = True
tokenizer = BertTokenizer(OUTPUT_DIR+'vocab.txt')
However, when I run the code below, the output variable only contains the logits.
model.eval()
eval_loss = 0
nb_eval_steps = 0
preds = []
for input_ids, input_mask, segment_ids, label_ids in tqdm_notebook(eval_dataloader, desc="Evaluating"):
    input_ids = input_ids.to(device)
    input_mask = input_mask.to(device)
    segment_ids = segment_ids.to(device)
    label_ids = label_ids.to(device)

    with torch.no_grad():
        output = model(input_ids, token_type_ids=segment_ids, attention_mask=input_mask)
I also had this problem after fine-tuning BertForSequenceClassification. I understand that your goal is to get the hidden state of [CLS] as the representation of each tweet, right? Following the API documentation, I think the code is:
model = BertForSequenceClassification.from_pretrained(OUTPUT_DIR, output_hidden_states=True)
logits, hidden_states = model(input_ids, attn_masks)
cls_hidden_state = hidden_states[-1][:, 0, :]  # hidden state of the first token ([CLS]) in the last layer
or
model = BertForSequenceClassification.from_pretrained(OUTPUT_DIR, output_hidden_states=True)
last_hidden_states = model.bert(input_ids, attn_masks)[0]
cls_hidden_state = last_hidden_states[:, 0, :]
BertForSequenceClassification is a wrapper that consists of two parts: the BERT model (attribute bert) and a classifier (attribute classifier).
You can call the underlying BERT model directly. If you pass your input to it, you will get the hidden states. It returns a tuple: the first member is the last-layer hidden states for all tokens, and the second is the pooled output derived from the [CLS] vector.
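Putting that together with the evaluation loop from the question, a minimal sketch of collecting the [CLS] embeddings, assuming the tuple-style outputs of the transformers version used above:
model.eval()
cls_embeddings = []

with torch.no_grad():
    for input_ids, input_mask, segment_ids, label_ids in eval_dataloader:
        input_ids = input_ids.to(device)
        input_mask = input_mask.to(device)
        segment_ids = segment_ids.to(device)

        # Call the underlying BERT encoder directly; its first output is the
        # last-layer hidden state for every token.
        last_hidden_states = model.bert(
            input_ids,
            attention_mask=input_mask,
            token_type_ids=segment_ids,
        )[0]

        # Token position 0 is [CLS].
        cls_embeddings.append(last_hidden_states[:, 0, :].cpu())

cls_embeddings = torch.cat(cls_embeddings, dim=0)  # shape: (num_tweets, hidden_size)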
I am playing with Variational Autoencoders and would like to adapt a Keras example I found on GitHub.
Basically, the example is very simple and based on the MNIST dataset, and I would like to apply it to a more difficult, more realistic dataset.
Code I'm trying to modify:
vae_dfc.fit(
    x_train,
    epochs=epochs,
    steps_per_epoch=train_size // batch_size,
    validation_data=(x_val),
    validation_steps=val_size // batch_size,
    verbose=1
)
With more complex datasets it is nearly impossible to load everything into memory, so we need to use fit_generator() to train the model. But it doesn't seem able to handle this:
image_generator = image.ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2
)

train_generator = image_generator.flow_from_directory(
    dir,
    class_mode=None,
    color_mode='rgb',
    target_size=(ORIGINAL_SHAPE[0], ORIGINAL_SHAPE[1]),
    batch_size=BATCH_SIZE,
    subset='training'
)
vae.fit_generator(
    train_generator,
    epochs=EPOCHS,
    steps_per_epoch=train_generator.samples // BATCH_SIZE,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // BATCH_SIZE
)
My understanding is that class_mode=None produces output similar to the original simple example, but fit_generator() is unable to handle it. Are there any workarounds for the fit_generator error?
Configurations:
tensorflow-gpu==1.12.0
Python 3.6
Windows 10
Cuda 9.0
Full error:
File "xxx\venv\lib\site-packages\tensorflow\python\keras\engine\training.py",
line 2177, in fit_generator
initial_epoch=initial_epoch)
File "xxx\venv\lib\site-packages\tensorflow\python\keras\engine\training_generator.py",
line 162, in fit_generator
'or (x, y). Found: ' + str(generator_output)) ValueError: Output of generator should be a tuple (x, y, sample_weight) or (x, y).
Found: [[[[0.48627454 0.34901962 0.2901961 ] ....]]]
An autoencoder needs outputs = inputs. It's different from not having outputs.
I believe you can try class_mode='input'.
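A minimal sketch of that first option, reusing the generator setup from the question (only class_mode changes, so the generator yields (x, x) batches):
train_generator = image_generator.flow_from_directory(
    dir,
    class_mode='input',   # targets are identical to the inputs
    color_mode='rgb',
    target_size=(ORIGINAL_SHAPE[0], ORIGINAL_SHAPE[1]),
    batch_size=BATCH_SIZE,
    subset='training'
)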
If this doesn't work, you can create a wrapper generator for outputting both:
class AutoencGenerator(keras.utils.Sequence):
    def __init__(self, originalGenerator):
        self.generator = originalGenerator

    def __len__(self):
        return len(self.generator)

    def __getitem__(self, i):
        x = self.generator[i]
        return x, x

    def on_epoch_end(self):
        self.generator.on_epoch_end()  # only needed if the original generator has an on_epoch_end

train_autoenc_generator = AutoencGenerator(train_generator)
Both options require that your model has outputs, of course. If the model was created without outputs (unusual), make it output the results and pass the loss function in model.compile(loss=the_loss).
Example of VAE
inputs = Input(shape)

means, sigmas = encoder(inputs)

def encode(x):
    means, sigmas = x
    randomSamples = tf.random_normal(K.shape(means))  # samples
    encoded = (sigmas * randomSamples) + means
    return encoded

encodings = Lambda(encode)([means, sigmas])
outputs = decoder(encodings)

kl_loss = some_tensor_function(means, sigmas)

VAE = Model(inputs, outputs)
VAE.add_loss(kl_loss)
VAE.compile(loss='mse', optimizer='adam')
Train with the generator:
VAE.fit_generator(train_autoenc_generator, ...)
I'm trying to build a very simple LSTM to classify text.
def encoded(texts):
    res = [one_hot(text, 100000, filters='!"#$%&()*+,-./:;<=>?#[\]^_`{|}~', split=' ') for text in texts]
    return res

def train(X, y, X_t, y_t):
    X = encoded(X)
    X_t = encoded(X_t)
    model = Sequential()
    model.add(Embedding(100000, 100))
    model.add(Bidirectional(LSTM(20, return_sequences=True), merge_mode='ave'))
    model.add(TimeDistributed(Dense(1, activation='sigmoid')))
    model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    model.fit(np.array(X), np.array(y), batch_size=16, epochs=8)
    score = model.evaluate(np.array(X_t), np.array(y_t), batch_size=16)
    print(score)
However, I got this error:
ValueError: setting an array element with a sequence.
It seems like the embedding layer didn't create vectors of the right dimension, or something is wrong with the format of the input X (and X_t).
Any ideas?
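For reference, one common cause of this ValueError is that one_hot returns sequences of different lengths, so np.array(X) ends up as a ragged object array instead of a rectangular 2-D array. A minimal, illustrative sketch (not from the original post) of padding the sequences to a fixed length before fitting, assuming a maximum length MAXLEN:
import numpy as np
from keras.preprocessing.sequence import pad_sequences

MAXLEN = 200  # assumed maximum sequence length

X = pad_sequences(encoded(X), maxlen=MAXLEN, padding='post')      # shape: (n_samples, MAXLEN)
X_t = pad_sequences(encoded(X_t), maxlen=MAXLEN, padding='post')

model.fit(X, np.array(y), batch_size=16, epochs=8)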
I'm trying to build a generator for a Keras model which will be trained on a large HDF store.
To speed up training, I have already pre-calculated all features, including the one-hot encoding, in the HDF store, so reading from it should be straightforward.
To feed chunks of my data into the network, I'm trying to use fit_generator, but I'm struggling to get it up and running.
The generator:
def myGenerator(myStore, generateFrom, generateTo):
    # Create batches of features and labels from the HDF store
    while True:
        X = pd.read_hdf(myStore, 'X', start=generateFrom, stop=generateTo)
        y = pd.read_hdf(myStore, 'y', start=generateFrom, stop=generateTo)
        yield X, y
Network and fitting:
def get_model(shape):
    '''Create a keras model.'''
    inputlayer = Input(shape=shape)

    model = BatchNormalization()(inputlayer)
    model = Dense(1024, activation='relu')(model)
    model = Dropout(0.25)(model)

    model = BatchNormalization()(inputlayer)
    model = Dense(512, activation='relu')(model)
    model = Dropout(0.25)(model)

    model = BatchNormalization()(inputlayer)
    model = Dense(256, activation='relu')(model)
    model = Dropout(0.25)(model)

    model = BatchNormalization()(inputlayer)
    model = Dense(128, activation='relu')(model)
    model = Dropout(0.25)(model)

    # 11 because background noise has been taken out
    model = Dense(2, activation='tanh')(model)

    model = Model(inputs=inputlayer, outputs=model)
    return model
shape = (6603,10000)
model = get_model(shape)
model.compile(loss='mean_squared_error', optimizer=Adam(), metrics=['accuracy'])
#X = generator(myStore)
#Xt = generator(myStore)
labelbinarizer = LabelBinarizer()
y = labelbinarizer.fit_transform(y)
#yt = labelbinarizer.fit_transform(yt)
generateFrom = 0
for i in range(10):
    generateTo = generateFrom + 10000
    model.fit_generator(
        generator=myGenerator(myStore, generateFrom, generateTo),
        epochs=1,
        steps_per_epoch=X[0].shape[0] // 1000)
    generateFrom = generateTo
I have tried both approaches: calling fit_generator inside a loop and plugging in the range (as shown above), and handling the range inside the generator. Neither works. Currently I'm running into:
TypeError: 'generator' object is not subscriptable
Likely I have some misunderstanding of how fit_generator() is supposed to be used in this context. Most examples out there are about generating tensors from images.
Any hint is appreciated.
Thanks
The function read_hdf returns a pandas object; you need to convert it to a NumPy array.
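A minimal sketch of that change inside the question's generator, using .values to get NumPy arrays (on newer pandas versions, .to_numpy() is the equivalent):
def myGenerator(myStore, generateFrom, generateTo):
    while True:
        X = pd.read_hdf(myStore, 'X', start=generateFrom, stop=generateTo)
        y = pd.read_hdf(myStore, 'y', start=generateFrom, stop=generateTo)
        # Convert the pandas objects to NumPy arrays before yielding to Keras.
        yield X.values, y.values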