I have a severe performance problem when I execute an inference batch loop on a single GPU.
The slowdown appears after the first batch has been processed -
that is, when the GPU is already almost full and its memory needs to be recycled to accept the next batch.
With a pristine GPU state, the performance is super fast (as expected).
I hope both the following code snippet and the output illustrate the problem in a nutshell.
(I've removed the print and time measurements from the snippet for brevity)
predictions = None
for i, batch in enumerate(self.test_dataloader):

    # if this line is active - the bottleneck after the first batch moves here, rather than below
    # i.e. when i > 0
    # torch.cuda.empty_cache()

    # HUGE PERFORMANCE HIT HAPPENS HERE - after the first batch
    # i.e. when i > 0
    # obviously tensor.to(device) uses torch.cuda.empty_cache() internally when needed
    # and it is inexplicably SLOW
    batch = tuple(t.to(device) for t in batch)  # to GPU (or CPU) when gpu

    b_input_ids, b_input_mask, b_labels = batch

    with torch.no_grad():
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

    logits = outputs[0]
    logits = logits.detach()

    # that doesn't help alleviate the issue
    del outputs

    predictions = logits if predictions is None else torch.cat((predictions, logits), 0)

    # nor does any of the below - freeing the references doesn't speed things up
    del logits
    del b_input_ids
    del b_input_mask
    del b_labels
    for o in batch:
        del o
    del batch
Output:
start empty cache... 0.00082
end empty cache... 1.9e-05
start to device... 3e-06
end to device... 0.001179 - HERE - time is super fast (as expected)
start outputs... 8e-06
end outputs... 0.334536
logits... 6e-06
start detach... 1.7e-05
end detach... 0.004036
start empty cache... 0.335932
end empty cache... 4e-06
start to device... 3e-06
end to device... 16.553849 - HERE - time is ridiculously high - it's 16 seconds to move tensor to GPU
start outputs... 2.3e-05
end outputs... 0.020878
logits... 7e-06
start detach... 1.4e-05
end detach... 0.00036
start empty cache... 0.00082
end empty cache... 6e-06
start to device... 4e-06
end to device... 17.385204 - HERE - time is ridiculously high
start outputs... 2.9e-05
end outputs... 0.021351
logits... 4e-06
start detach... 1.3e-05
end detach... 1.1e-05
...
Have I missed something obvious or is this the expected GPU behavior?
I am posting this question before engaging in more complex coding that juggles between the couple of GPUs and the CPU available on my server.
Thanks in advance,
Albert
EDIT
RESOLVED: The issue was in the DataLoader constructor - I changed pin_memory to False (True was causing the issue). That made the .to(device) calls roughly 3.5x-4x faster.
self.test_dataloader = DataLoader(
    test_dataset,
    sampler=SequentialSampler(test_dataset),
    # batch_size=len(test_dataset)  # AKA - single batch - nope! no mem for that
    batch_size=BATCH_SIZE_AKA_MAX_ROWS_PER_GUESS_TO_FIT_GPU_MEM,
    # tests
    num_workers=8,
    # maybe this is the culprit as suggested by user12750353 in stackoverflow
    # pin_memory=True
    pin_memory=False
)
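As an aside, pinned memory is mainly useful when paired with asynchronous host-to-device copies. A minimal sketch of that pairing, assuming the same device and batch variables as in the loop above, in case pin_memory=True is ever revisited:

# Hypothetical variant of the copy step for use with pin_memory=True in the DataLoader:
# page-locked host memory lets the host-to-device copy be issued asynchronously,
# so it can overlap with other work instead of blocking the Python thread.
batch = tuple(t.to(device, non_blocking=True) for t in batch)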
You should not need to clear the cache if you are properly clearing the references to the previously allocated variables. Cached memory is effectively free memory - memory that your script can reuse for new variables.
Also notice that
a = torch.zeros(10**9, dtype=torch.float)
a = torch.zeros(10**9, dtype=torch.float)
Requires 8GB of memory at its peak, even though a only uses 4GB (1B elements of 4 bytes each). This happens because torch.zeros allocates the new tensor before the previous content of a is released. The same may be happening in your model on a larger scale, depending on how it is implemented.
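If that peak allocation is the problem, releasing the old reference before allocating its replacement keeps the footprint at roughly the size of one tensor. A minimal sketch of the same toy example:

a = torch.zeros(10**9, dtype=torch.float)
del a                                       # drop the old reference first, so its ~4GB can be reused
a = torch.zeros(10**9, dtype=torch.float)   # peak stays around 4GB instead of 8GB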
Edit 1
One suspicious thing is that you are loading your batch to the GPU one example at a time.
Just to illustrate what I mean
import torch
device = 'cuda'
batch = torch.zeros((4500, 10));
Creating the batch as a tuple
batch_gpu = tuple(t.to(device) for t in batch)
torch.cuda.synchronize()
254 ms ± 36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Creating the batch as a list
batch_gpu = list(t.to(device) for t in batch)
torch.cuda.synchronize()
235 ms ± 3.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Copying the whole batch at once
batch_gpu = batch.to(device)
torch.cuda.synchronize()
115 µs ± 2.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In this example it was about 2000x faster to copy the entire batch at once than to copy it one example at a time.
Notice that the GPU works asynchronously with respect to the CPU, so you may keep calling functions that return before the operation has finished. To make meaningful measurements you should call torch.cuda.synchronize() to make the time boundaries clear.
The code to be instrumented is this
for i, batch in enumerate(self.test_dataloader):

    # torch.cuda.empty_cache()
    # torch.cuda.synchronize()  # if empty_cache is used

    # start timer for copy
    batch = tuple(t.to(device) for t in batch)  # to GPU (or CPU) when gpu
    torch.cuda.synchronize()
    # stop timer for copy

    b_input_ids, b_input_mask, b_labels = batch

    # start timer for inference
    with torch.no_grad():
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    torch.cuda.synchronize()
    # stop timer for inference

    logits = outputs[0]
    logits = logits.detach()
    # if you copy outputs to the CPU it will be synchronized anyway
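CUDA events are another way to time GPU work without scattering synchronize() calls around the whole loop. A minimal sketch, assuming the same model and batch variables as above, that measures just the forward pass:

start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

start_evt.record()
with torch.no_grad():
    outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
end_evt.record()

torch.cuda.synchronize()  # wait until the recorded events have completed
print("inference time (ms):", start_evt.elapsed_time(end_evt))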
Related
So I have the following code for running skopt.forest_minimize(), but the biggest challenge I am facing right now is that it is taking upwards of days to finish running even just 2 iterations.
SPACE = [skopt.space.Integer(4, max_neighbour, name='n_neighbors', prior='log-uniform'),
         skopt.space.Integer(6, 10, name='nr_cubes', prior='uniform'),
         skopt.space.Categorical(overlap_cat, name='overlap_perc')]

@skopt.utils.use_named_args(SPACE)
def objective(**params):
    score, scomp = tune_clustering(X_cont=X_cont, df=df, pl_brewer=pl_brewer, **params)
    if score == 0:
        print('saving new scomp')
        with open(scomp_file, 'w') as filehandle:
            json.dump(scomp, filehandle, default=json_default)
    return score

results = skopt.forest_minimize(objective, SPACE, n_calls=1, n_initial_points=1, callback=[scoring])
Is it possible to optimize the code above so that it computes faster? I noticed that it barely makes use of my CPU; the highest CPU utilization is about 30% (it's a 9th-gen i7 with 8 cores).
Also, a question while I'm at it: is it possible to utilize a GPU for these computational tasks? I have a 3050 that I can use.
I want to measure ONLY the inference time on the Jetson TX2. How can I improve my function to do that? Right now I am measuring:
the transfer of the image from CPU to GPU
transfer of results from GPU to CPU
the inference
Or is that not possible because of the way GPUs work? I mean, how many times will I have to use stream.synchronize() if I divide/segment the function into 3 parts:
transfer from CPU to GPU
Inference
transfer from GPU to CPU
Thank you
CODE IN INFERENCE.PY
def do_inference(engine, pics_1, h_input, d_input, h_output, d_output, stream, batch_size):
    """
    This is the function to run the inference
    Args:
        engine : Path to the TensorRT engine.
        pics_1 : Input images to the model.
        h_input: Input in the host (CPU).
        d_input: Input in the device (GPU).
        h_output: Output in the host (CPU).
        d_output: Output in the device (GPU).
        stream: CUDA stream.
        batch_size : Batch size for execution time.
        height: Height of the output image.
        width: Width of the output image.
    Output:
        The list of output images.
    """
    # Context for executing inference using ICudaEngine
    with engine.create_execution_context() as context:
        # Transfer input data from CPU to GPU.
        cuda.memcpy_htod_async(d_input, h_input, stream)
        # Run inference.
        # context.profiler = trt.Profiler()  # shows execution time (ms) of each layer
        context.execute(batch_size=1, bindings=[int(d_input), int(d_output)])
        # Transfer predictions back from the GPU to the CPU.
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        # Synchronize the stream.
        stream.synchronize()
        # Return the host output.
        out = h_output
        return out
CODE IN TIMER.PY
for i in range(count):
    start = time.perf_counter()
    # Classification - calling TX2_classify.py
    out = eng.do_inference(engine, image, h_input, d_input, h_output, d_output, stream, 1)
    inference_time = time.perf_counter() - start
    print("TIME")
    print(inference_time * 1000)
    print("\n")
    pred = postprocess_inception(out)
    print(pred)
    print("\n")
I have spent considerable time trying to debug some PyTorch code, for which I have created a minimal example to help better understand what the issue might be.
I have removed all unnecessary portions of the code unrelated to the issue, so the remaining piece won't make much sense from a functional standpoint, but it still displays the error I'm facing.
The overall task I'm working on is a loop: every pass computes the embedding of the image and adds it to a variable storing it. It effectively aggregates it (not concatenating, so the size remains the same). I don't expect the number of iterations to force the datatype to overflow, and I don't see that happening here or in my code.
I have added multiple metrics to evaluate the size of the tensors I'm working with, to make sure they're not growing in memory footprint.
I'm checking the overall GPU memory usage to verify the issue leading to the final RuntimeError: CUDA out of memory.
My environment is as follows:
- python 3.6.2
- Pytorch 1.4.0
- Cudatoolkit 10.0
- Driver version 410.78
- GPU: Nvidia GeForce GT 1030 (2GB VRAM)
(though I've replicated this experiment with the same result on a Titan RTX with 24GB,
same pytorch version and cuda toolkit and driver, it only goes out of memory further in the loop).
Complete code below. I have marked 2 lines as culprits, as deleting them removes the issue, though obviously I need to find a way to execute them without having memory issues. Any help would be much appreciated! You may try with any image named "source_image.bmp" to replicate the issue.
import torch
from PIL import Image
import torchvision
from torchvision import transforms
from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit
import sys
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'  # this is necessary on my system to allow the environment to recognize my nvidia GPU for some reason
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # to debug by having all CUDA functions executed in place

torch.set_default_tensor_type('torch.cuda.FloatTensor')

# Preprocess image
tfms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
img = tfms(Image.open('source_image.bmp')).unsqueeze(0).cuda()

model = torchvision.models.resnet50(pretrained=True).cuda()
model.eval()  # we put the model in evaluation mode, to prevent storage of gradient which might accumulate

nvmlInit()
h = nvmlDeviceGetHandleByIndex(0)
info = nvmlDeviceGetMemoryInfo(h)
print(f'Total available memory : {info.total / 1000000000}')

feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])
orig_embedding = feature_extractor(img)

embedding_depth = 2048

mem0 = 0
embedding = torch.zeros(2048, img.shape[2], img.shape[3])  # , dtype=torch.float)

patch_size = [4, 4]
patch_stride = [2, 2]
patch_value = 0.0

# Here, we iterate over the patch placement, defined at the top left location
for row in range(img.shape[2] - 1):
    for col in range(img.shape[3] - 1):
        print("######################################################")

        ######################################################
        # Isolated line, culprit 1 of the GPU memory leak
        ######################################################
        patched_embedding = feature_extractor(img)

        delta_embedding = (patched_embedding - orig_embedding).view(-1, 1, 1)

        ######################################################
        # Isolated line, culprit 2 of the GPU memory leak
        ######################################################
        embedding[:, row:row + 1, col:col + 1] = torch.add(embedding[:, row:row + 1, col:col + 1], delta_embedding)

        print("img size:\t\t", img.element_size() * img.nelement())
        print("patched_embedding size:\t", patched_embedding.element_size() * patched_embedding.nelement())
        print("delta_embedding size:\t", delta_embedding.element_size() * delta_embedding.nelement())
        print("Embedding size:\t\t", embedding.element_size() * embedding.nelement())

        del patched_embedding, delta_embedding
        torch.cuda.empty_cache()

        info = nvmlDeviceGetMemoryInfo(h)
        print("\nMem usage increase:\t", info.used / 1000000000 - mem0)
        mem0 = info.used / 1000000000
        print(f'Free:\t\t\t {(info.total - info.used) / 1000000000}')

print("Done.")
Add this to your code as soon as you load the model
for param in model.parameters():
    param.requires_grad = False
from https://pytorch.org/docs/stable/notes/autograd.html#excluding-subgraphs-from-backward
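An equivalent (and more local) option is to wrap the feature extraction in a torch.no_grad() block, so no autograd graph is kept alive across loop iterations. A minimal sketch against the loop above:

with torch.no_grad():  # nothing is recorded for backward, so memory no longer accumulates
    patched_embedding = feature_extractor(img)
    delta_embedding = (patched_embedding - orig_embedding).view(-1, 1, 1)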
I'm trying to train my own NLP model with CreateML in an Xcode playground, following Apple's tutorial: https://developer.apple.com/documentation/createml/creating_a_text_classifier_model
but the program terminates with EXC_BAD_ACCESS (code=1, address=0x0).
I found some explanations on the Internet stating that this error means a pointer is pointing to NULL when the variable is accessed.
import Foundation
import CreateML
let source = "icecream"
let data = try MLDataTable(contentsOf: URL(fileURLWithPath: "/path/to/\(source).csv"))
let (trainingData, testingData) = data.randomSplit(by: 0.8, seed: 0)
// program stopped here
let sentimentClassifier = try MLTextClassifier(trainingData: trainingData, textColumn: "text", labelColumn: "sentiment")
// error
error: Execution was interrupted, reason: EXC_BAD_ACCESS (code=1, address=0x0).
// output
Finished parsing file /path/to/icecream.csv
Parsing completed. Parsed 100 lines in 0.03412 secs.
Finished parsing file /path/to/icecream.csv
Parsing completed. Parsed 188 lines in 0.008235 secs.
Automatically generating validation set from 10% of the data.
Tokenizing data and extracting features
Starting MaxEnt training with 146 samples
Iteration 1 training accuracy 0.650685
Iteration 2 training accuracy 0.869863
Iteration 3 training accuracy 0.945205
Iteration 4 training accuracy 0.986301
Iteration 5 training accuracy 0.993151
Finished MaxEnt training in 0.04 seconds
I am new to TensorFlow. Currently, I am trying to evaluate the performance of distributed TensorFlow using Inception model provided by TensorFlow team.
The thing I want is to generate timestamps for some critical operations in a Parameter Server - Worker architecture, so I can measure the bottleneck (the network lag due to parameter transfer/synchronization or parameter computation cost) on replicas for one iteration (batch).
I came up with the idea of adding a customized dummy py_func operator designated of printing timestamps inside inception_distributed_train.py, with some control dependencies. Here are some pieces of code that I added:
def timer(s):
    print("-------- thread ID ", threading.current_thread().ident, ", ---- Process ID ----- ", getpid(), " ~~~~~~~~~~~~~~~ ", s, datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S.%f'))
    return False

dummy1 = tf.py_func(timer, ["got gradients, before dequeues token "], tf.bool)
dummy2 = tf.py_func(timer, ["finished dequeueing the token "], tf.bool)
I modified
apply_gradients_op = opt.apply_gradients(grads, global_step=global_step)
with tf.control_dependencies([apply_gradients_op]):
    train_op = tf.identity(total_loss, name='train_op')
into
with tf.control_dependencies([dummy1]):
    apply_gradients_op = opt.apply_gradients(grads, global_step=global_step)
with tf.control_dependencies([apply_gradients_op]):
    with tf.control_dependencies([dummy2]):
        train_op = tf.identity(total_loss, name='train_op')
hoping to print the timestamps right before apply_gradients_op starts evaluating and right after it finishes, by enforcing node dependencies.
I did similar things inside sync_replicas_optimizer.apply_gradients, by adding two dummy print nodes before and after update_op:
dummy1 = py_func(timer, ["---------- before update_op "], tf.bool)
dummy2 = py_func(timer, ["---------- finished update_op "], tf.bool)

# sync_op will be assigned to the same device as the global step.
with ops.device(global_step.device), ops.name_scope(""):
    with ops.control_dependencies([dummy1]):
        update_op = self._opt.apply_gradients(aggregated_grads_and_vars, global_step)

    # Clear all the gradients queues in case there are stale gradients.
    clear_queue_ops = []
    with ops.control_dependencies([update_op]):
        with ops.control_dependencies([dummy2]):
            for queue, dev in self._one_element_queue_list:
                with ops.device(dev):
                    stale_grads = queue.dequeue_many(queue.size())
                    clear_queue_ops.append(stale_grads)
I understand that apply_gradients_op is the train_op returned by sync_replicas_optimizer.apply_gradients, and that it is the op that dequeues a token (global_step) from the sync_queue managed by the chief worker via the chief_queue_runner, so that a replica can exit the current batch and start a new one.
In theory, apply_gradients_op should take some time, since a replica has to wait before it can dequeue the token (global_step) from the sync_queue. But in the printed results for one replica, the time difference for executing apply_gradients_op is pretty short (~1/1000 sec), and sometimes the print output is non-deterministic (especially for the chief worker). Here is a snippet of the output on the workers (I am running 2 workers and 1 PS):
chief worker (worker 0) output
worker 1 output
My questions are:
1) I need to record the time TensorFlow takes to execute an op (such as train_op, apply_gradients_op, compute_gradients_op, etc.)
2) Is this the right direction to go, given my ultimate goal is to record the elapsed time for executing certain operations (such as the difference between the time a replica finishes computing gradients and the time it gets the global_step from sync_token)?
3) If this is not the way it should go, please guide me with some insights about the possible ways I could achieve my ultimate goal.
Thank you so much for reading my long post - I have spent weeks working on this!
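For per-op timings, TensorFlow's built-in tracing may be more reliable than py_func printing, since it records start and end times for every op on the device itself. A minimal sketch using RunOptions/RunMetadata and the Chrome-trace timeline, assuming an existing sess and train_op from the training loop:

import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# Run one training step with tracing enabled.
sess.run(train_op, options=run_options, run_metadata=run_metadata)

# Dump a Chrome trace with per-op start/end times (including apply_gradients and the dequeue ops).
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())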