save model output in pytorch - nlp

dic = []
for step, batch in tqdm(enumerate(train_dataloader)):
    inpt = batch[0].to(device)
    msks = batch[1].to(device)
    # Run the sentences through the model
    outputs = model_obj(inpt, msks)
    dic.append({
        'hidden_states': outputs[2],
        'pooled_output': outputs[1]})
I want to save the model outputs at each iteration, but I get the error below even for a small dataset.
RuntimeError: CUDA out of memory.
Note that without the line below, my model runs correctly.
dic.append({'hidden_states': outputs[2], 'pooled_output': outputs[1]})
How can I save these outputs at each iteration?

First of all, you should always post the full error stacktrace. Secondly, you should move the outputs off your GPU when you want to store them, to free up memory:
dic.append({
    'hidden_states': outputs[2].detach().cpu().tolist(),
    'pooled_output': outputs[1].detach().cpu().tolist()
})
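If you want to keep the stored values as tensors rather than nested Python lists (for example, to feed them back into torch later), .detach().cpu() on its own is enough. A minimal sketch, assuming outputs[1] and outputs[2] are plain tensors as in the snippet above:
dic.append({
    'hidden_states': outputs[2].detach().cpu(),  # CPU copy, so storing it does not keep GPU memory alive
    'pooled_output': outputs[1].detach().cpu()
})
# Optionally persist everything once the loop finishes:
torch.save(dic, 'model_outputs.pt')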

Related

How to integrate pytorch lightning profiler with tensorboard?

I know we can use the PyTorch profiler with TensorBoard using something like this:
with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
    record_shapes=True,
    with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        if step >= (1 + 1 + 3) * 2:
            break
        train(batch_data)
        prof.step()  # Need to call this at the end of each step to notify the profiler of the step boundary.
It works perfectly with plain PyTorch, but the problem is that I have to use PyTorch Lightning, and if I put this in my training step it doesn't create the log file or a profiler entry. All I get is lightning_logs, which isn't the profiler output. I couldn't find anything in the docs about the Lightning profiler and TensorBoard, so does anyone have any idea?
Here's what my training function looks like:
def training_step(self, train_batch, batch_idx):
    with torch.profiler.profile(
        activities=[ProfilerActivity.CPU],
        schedule=torch.profiler.schedule(
            wait=1,
            warmup=1,
            active=2,
            repeat=1),
        with_stack=True,
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./logs'),
    ) as profiler:
        x, y = train_batch
        x = x.float()
        logits = self.forward(x)
        loss = self.loss_fn(logits, y)
        profiler.step()
    return loss
You don't have to use raw torch.profiler at all. There is a whole page in the Lightning docs dedicated to profiling, and it's as easy as passing a Trainer flag called profiler:
# other profilers are "simple", "advanced" etc
trainer = pl.Trainer(profiler="pytorch")
Also, set TensorBoardLogger as your preferred logger, as you normally would:
trainer = pl.Trainer(profiler="pytorch", logger=TensorBoardLogger(..))
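Putting the two together, a minimal sketch (save_dir and name here are just placeholder values):
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

logger = TensorBoardLogger(save_dir="tb_logs", name="profiled_run")
trainer = pl.Trainer(profiler="pytorch", logger=logger, max_epochs=1)
# trainer.fit(model, train_dataloader)  # your LightningModule and DataLoader
The profiler output should then land alongside the TensorBoard logs (the exact location depends on the Lightning version).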

Running multiple inferences in parallel with PyTorch

I'm trying to implement Double DQN (not to be confused with DQN with a slightly delayed Q-target network) in PyTorch to train an agent to play an Atari OpenAI Gym game. Here I discuss the implementation of the following formula:
Update of Q-network, formula taken from Sutton & Barto.
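The formula image is not reproduced here. For reference, the tabular Double Q-learning update in Sutton & Barto, which Q_1 and Q_2 below approximate with networks, has the form
Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \, Q_2\big(S_{t+1}, \arg\max_a Q_1(S_{t+1}, a)\big) - Q_1(S_t, A_t) \right]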
My first implementation is:
Q_pred = self.Q_1.forward(s_now)[T.arange(batch_size), actions.long()]
Q_next_all = self.Q_1.forward(s_next)
maxA_id = T.argmax(Q_next_all, dim=1)
Q_pred2 = self.Q_2.forward(s_next)[T.arange(batch_size), maxA_id]
Q_target = (rewards + (~dones) * self.GAMMA * Q_pred2).detach()
self.Q_1.optimizer.zero_grad()
self.Q_1.loss(Q_target, Q_pred).backward()
self.Q_1.optimizer.step()
(Q_1 and Q_2 are nn.Module classes, and all of the variables involved here are already torch tensors lying in the GPU.)
I noticed that my program ran much slower than a previous implementation which used plain DQN.
I realized that I can combine the batches entering Q_1, so there will be one combined batch being forwarded in the neural network, instead of two batches in sequence. The code becomes:
s_combined = T.cat((s_now, s_next))
Q_combined = self.Q_1.forward(s_combined)
Q_pred = Q_combined[T.arange(batch_size), actions.long()]
Q_next_all = Q_combined[batch_size:]
Q_pred2_all = self.Q_2.forward(s_next)
maxA_id = T.argmax(Q_next_all, dim=1)
Q_pred2 = Q_pred2_all[T.arange(batch_size), maxA_id]
Q_target = (rewards + (~dones) * self.GAMMA * Q_pred2).detach()
self.Q_1.optimizer.zero_grad()
self.Q_1.loss(Q_target, Q_pred).backward()
self.Q_1.optimizer.step()
(This proves that I understand how to do batch training in PyTorch, so don't mark this as a duplicate of this question.)
Furthermore, I realized that Q_1 and Q_2 can process their batches in parallel. So I looked up how to do multiprocessing in PyTorch. Unfortunately, I couldn't find a good example. I tried to adapt code that looks similar to my scenario, and my code becomes:
def spawned():
    s_combined = T.cat((s_now, s_next))
    Q_combined = self.Q_1.forward(s_combined)
    Q_pred = Q_combined[T.arange(batch_size), actions.long()]
    Q_next_all = Q_combined[batch_size:]
mp.set_start_method('spawn', force=True)
p = mp.Process(target=spawned)
p.start()
Q_pred2_all = self.Q_2.forward(s_next)
p.join()
maxA_id = T.argmax(Q_next_all, dim=1)
Q_pred2 = Q_pred2_all[T.arange(batch_size), maxA_id]
Q_target = (rewards + (~dones) * self.GAMMA * Q_pred2).detach()
self.Q_1.optimizer.zero_grad()
self.Q_1.loss(Q_target, Q_pred).backward()
self.Q_1.optimizer.step()
This crashes with the error message:
AttributeError: Can't pickle local object 'Agent.learn.<locals>.spawned'
So how do I make this work?
(Achieving this in CUDA programming is trivial: one simply launches two device kernels from sequential host code, and the two kernels are computed in parallel on the GPU.)
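For context on the error itself: with the 'spawn' start method, the target passed to mp.Process is pickled and re-created in the child process, and a function defined inside a method (Agent.learn.<locals>.spawned) cannot be pickled, so the target has to live at module level. A minimal, hypothetical sketch of that structure, independent of the DQN code above and not a full answer to the parallelism question:
import torch.multiprocessing as mp

def forward_worker(model, batch, queue):
    # Module-level function (not defined inside a method), so 'spawn' can pickle it.
    queue.put(model(batch).detach().cpu())

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    # queue = mp.Queue()
    # p = mp.Process(target=forward_worker, args=(q_net, states, queue))
    # p.start(); ...; p.join()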

nvidia_deeplearningexamples_tacotron2: RuntimeError: CUDA error: invalid device function

Set up the runtime: Python 3 and GPU.
Run the code step by step.
I only ran the code successfully the first time.
After that, when running the part below, I got "RuntimeError: CUDA error: invalid device function":
sequence = np.array(tacotron2.text_to_sequence(text, ['english_cleaners']))[None, :]
sequence = torch.from_numpy(sequence).to(device='cuda', dtype=torch.int64)
with torch.no_grad():
    _, mel, _, _ = tacotron2.infer(sequence)
    audio = waveglow.infer(mel)
audio_numpy = audio[0].data.cpu().numpy()
rate = 22050
Do you know the root cause? And can the pre-trained model be run on a local CPU?
At the time of writing, you can solve this issue by adding
!pip install torch==1.1.0 torchvision==0.3.0
before import torch
in https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/nvidia_deeplearningexamples_tacotron2.ipynb
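If the pinned versions stop working, a quick sanity check (not a fix in itself) is to compare the installed build with the GPU Colab has assigned; "invalid device function" typically means the PyTorch binaries were not compiled for that GPU's compute capability:
import torch

print(torch.__version__, torch.version.cuda)   # installed PyTorch / CUDA toolkit versions
print(torch.cuda.get_device_name(0))           # which GPU Colab assigned
print(torch.cuda.get_device_capability(0))     # its compute capability, e.g. (7, 5)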

Doing feature generation in serving_input_fn for Tensorflow model

I've been playing around with BERT and TensorFlow following the example here and have a trained working model.
I then wanted to save and deploy the model, so used the export_saved_model function, which requires you build a serving_input_fn to handle any incoming requests when the model is reloaded.
I wanted to be able to pass a single string for sentiment analysis to the deployed model, rather than having a theoretical client-side application do the tokenisation, feature generation, etc., so I tried to write an input function that would handle that and pass the constructed features to the model. Is this possible? I wrote the following, which I feel should do what I want:
import json
import base64

def plain_text_serving_input_fn():
    input_string = tf.placeholder(dtype=tf.string, shape=None, name='input_string_text')

    # What format to expect input in.
    receiver_tensors = {'input_text': input_string}

    input_examples = [run_classifier.InputExample(guid="", text_a=str(input_string), text_b=None, label=0)]  # here, "" is just a dummy label
    input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)

    variables = {}
    for i in input_features:
        variables["input_ids"] = i.input_ids
        variables["input_mask"] = i.input_mask
        variables["segment_ids"] = i.segment_ids
        variables["label_id"] = i.label_id

    feature_spec = {
        "input_ids": tf.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
        "input_mask": tf.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
        "segment_ids": tf.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
        "label_ids": tf.FixedLenFeature([], tf.int64)
    }

    string_variables = json.dumps(variables)
    encode_input = base64.b64encode(string_variables.encode('utf-8'))
    encode_string = base64.decodestring(encode_input)

    features_to_input = tf.parse_example([encode_string], feature_spec)

    return tf.estimator.export.ServingInputReceiver(features_to_input, receiver_tensors)
I would expect that this would allow me to call predict on my deployed model with
variables = {"input_text" : "This is some test input"}
predictor.predict(variables)
I've tried a range of variations of this (putting it in an array, converting to base64, etc.), but I get a range of errors, either telling me
"error": "Failed to process element: 0 of 'instances' list. Error: Invalid argument: JSON Value: {\n \"input_text\": \"This is some test input\"\n} not formatted correctly for base64 data" }"
or
Object of type 'bytes' is not JSON serializable
I suspect I'm formatting my requests incorrectly, but I also can't find any examples of this being done in a serving_input_fn. Has anyone ever done something similar?
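Not a full answer, but on the second error specifically: json.dumps cannot serialize bytes, so a base64-encoded value has to be decoded to str before it goes into the JSON payload. A minimal sketch with placeholder values:
import base64
import json

payload = base64.b64encode(b"This is some test input").decode("utf-8")  # str, not bytes
body = json.dumps({"input_text": payload})  # now JSON-serializable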

Avoid memory leaks with promises and loop in coffee-script (no await)

I am currently trying to perform some operations using promises in a loop, but I ended up with huge memory leaks.
My problem is exactly the one pointed out in this article, but as opposed to the author, I am writing in coffee-script (yes, with a hyphen, which means CoffeeScript 1.12 and not the latest version). Thus, I am not able to use the "await" keyword (this is an educated guess, since each time I try to use it I get an "await is not defined" error).
This is my original code (with memory leaks):
recursiveFunction: (next = _.noop) ->
  _data = @getSomeData()
  functionWithPromise(_data).then (_enrichedData) =>
    @doStuffWithEnrichedData(_enrichedData)
    @recursiveFunction()
  .catch (_err) =>
    @log.error _err.message
    @recursiveFunction()
So according to the article I linked, I would have to do something like this:
recursiveFunction: (next = _.noop) ->
  _data = @getSomeData()
  _enrichedData = await functionWithPromise(_data)
  @recursiveFunction()
But then again, I am stuck because I can't use the "await" keyword. What would be the best approach, then?
EDIT:
Here is my real original code. What I am trying to achieve is a face-detection application. This function is located in a lib, and I am using a "Service" variable to expose variables between libs. In order to get frames from the webcam, I am using opencv4nodejs.
faceapi = require('face-api.js')
tfjs = require('@tensorflow/tfjs-node')
(...)
# Analyse the new frame
analyseFrame: (next = _.noop) ->
  # Skip if not capturing
  return unless Service.isCapturing

  # get frame
  _frame = Service.videoCapture.getFrame()

  # get frame date
  @currentFrameTime = Date.now()

  # clear old faces in history
  @refreshFaceHistory(@currentFrameTime)

  # convert frame to a tensor
  try
    _data = new Uint8Array(_frame.cvtColor(cv.COLOR_BGR2RGB).getData().buffer)
    _tensorFrame = tfjs.tensor3d(_data, [_frame.rows, _frame.cols, 3])
  catch _err
    @log.error "Error instantiating tensor !!!"
    @log.error _err.message

  # find faces on frames
  faceapi.detectAllFaces(_tensorFrame, @faceDetectionOptions).then (_detectedFaces) =>
    @log.debug _detectedFaces

    # fill face history with detected faces
    _detectedFaces = @fillFacesHistory(_detectedFaces)

    # draw boxes on image
    Service.videoCapture.drawFaceBoxes(_frame, _detectedFaces)

    # Get partial time
    Service.frameDuration = Date.now() - @currentFrameTime

    # write latency on image
    Service.videoCapture.writeLatency(_frame, Service.frameDuration)

    # show image
    Service.faceRecoUtils.showImage(_frame)

    # Call next
    _delayNextFrame = Math.max(0, 1000 / @options.fps - Service.frameDuration)
    setTimeout =>
      # console.log "Next frame : #{_delayNextFrame}ms - TOTAL : #{_frameDuration}ms"
      @analyseFrame()
    , (_delayNextFrame)
The solution was to dispose of the tensor copy sent to detectAllFaces: tfjs tensors hold memory outside the JavaScript heap, so they are not garbage-collected and must be released explicitly.
faceapi.detectAllFaces(_tensorFrame, @faceDetectionOptions).then (_detectedFaces) =>
  (...)
  _tensorFrame.dispose()
  (...)

Resources