Currently, I am doing a Udemy Python course for data science. In it, there is the following example to train a model in TensorFlow:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout
model = Sequential()
# Choose whatever number of layers/neurons you want.
model.add(Dense(units=78,activation='relu'))
model.add(Dense(units=39,activation='relu'))
model.add(Dense(units=19,activation='relu'))
model.add(Dense(units=1,activation='sigmoid'))
# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(x=X_train,
          y=y_train,
          epochs=3,
          validation_data=(X_test, y_test),
          verbose=1)
My goal now was to get this to run on my GPU. For that, I altered the last part as follows (the epochs are low on purpose, I just want to see how long it takes per epoch before scaling up):
with tf.device("/gpu:0"):
model.fit(x=X_train,
y=y_train,
epochs=3,
validation_data=(X_test, y_test), verbose=1
)
and for comparison, also as follows:
with tf.device("/cpu:0"):
model.fit(x=X_train,
y=y_train,
epochs=3,
validation_data=(X_test, y_test), verbose=1
)
However, the result is very unexpected: either both versions occupy all of the GPU's memory but seemingly don't do any calculations on it and take exactly the same time per epoch, or the GPU version simply crashes with the following error:
C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\gpu\lib\site-packages\six.py in raise_from(value, from_value)
InternalError: Blas GEMM launch failed : a.shape=(32, 78), b.shape=(78, 78), m=32, n=78, k=78
[[node sequential/dense/MatMul (defined at <ipython-input-115-79c9a84ee89a>:8) ]] [Op:__inference_distributed_function_874]
Function call stack:
distributed_function
Sometimes it crashes, sometimes it kind of works but takes as long as the CPU. Sometimes even the CPU version takes 20 sec per epoch, other times it takes 40 sec. The code stays the same; all that changes is that I restart the Kernel in between. I really don't understand it.
When I test the GPU and conda environment using the following code, everything seems to work fine and reproducibly, and the GPU is about 20x as fast as the CPU:
# https://gist.github.com/ikarus-999/1a845437b454cdfcc1eb5455d373fe63
import sys
import numpy as np
import tensorflow.compat.v1 as tf # compatibility for TF 1 code
from datetime import datetime
def test_device(device_name: str):
    shape = (int(10000), int(10000))
    startTime = datetime.now()
    with tf.device(device_name):
        random_matrix = tf.random.uniform(shape=shape, minval=0, maxval=1)
        dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
        sum_operation = tf.reduce_sum(dot_operation)
        result = sum_operation
    print("Shape:", shape, "Device:", device_name)
    print("-" * 50)
    print(result)
    print("Time taken:", datetime.now() - startTime)
    print("\n" * 2)

test_device("/cpu:0")  # 6 sec
test_device("/gpu:0")  # 0.3 sec
So, I am sure there is something I am doing wrong.
TL;DR:
What would be the correct way to call model.fit on the GPU? How can different runs (without changing the code) result in such drastically different outcomes (crash, vastly different calculation times)?
Any help is greatly appreciated, thx!
After a lot of trial and error I finally found a working way to either force CPU or "mixed usage". GPU only doesn't seem to work, though. The with tf.device() method from my original post doesn't seem to do anything in this scenario. I have to hide the GPU if I want to use the CPU only (TensorFlow 2.1.0):
CPU only
# force CPU (make CPU visible)
cpus = tf.config.experimental.list_physical_devices('CPU')
print(cpus)
tf.config.set_visible_devices([], 'GPU') # hide the GPU
tf.config.set_visible_devices(cpus[0], 'CPU') # unhide potentially hidden CPU
tf.config.get_visible_devices()
model.fit(x=X_train,
          y=y_train,
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test),
          verbose=1)
This results in 3-4 sec per epoch and does not tax the GPU.
Restart the Kernel, then:
GPU only
# force GPU (make GPU visible)
# note: does not work without restarting the kernel, otherwise:
# "Visible devices cannot be modified after being initialized"
gpus = tf.config.experimental.list_physical_devices('GPU')
print(gpus)
tf.config.set_visible_devices([], 'CPU') # hide the CPU
tf.config.set_visible_devices(gpus[0], 'GPU') # unhide potentially hidden GPU
tf.config.get_visible_devices()
model.fit(x=X_train,
          y=y_train,
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test),
          verbose=1)
That doesn't work as, apparently, the CPU is required by this model:
"NotFoundError: No CPU devices are available in this process"
Default (mixed CPU & GPU):
Restart the Kernel, then:
# test if CPU and GPU are visible
tf.config.get_visible_devices()
# [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
# PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
model.fit(x=X_train,
          y=y_train,
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test),
          verbose=1)
This results in 5-6 sec per epoch, consumes all the RAM of the GPU and uses a small amount of processing power of the GPU (<10%). Apparently, this is slower than using the CPU only for this model (8 GB video RAM vs. 16 GB System RAM??).
If the default mode (CPU & GPU) throws the following error, it seems the GPU is occupied by another process and restarting Windows helps:
"InternalError: Blas GEMM launch failed"
There are still lots of mysteries left for me:
Why is the "mixed" mode slower than CPU only?
Can you change visible devices without having to restart the Kernel to avoid the following error? "Visible devices cannot be modified after being initialized"
Why does the with tf.device() method not work for this model (no effect), whereas it works for the test_device() code?
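One thing that might help with the last point (whether with tf.device() has any effect here) is device-placement logging; a minimal sketch, assuming the TF 2.x tf.debugging API and that it runs right after importing TensorFlow:
import tensorflow as tf
# Must be set before any op executes, otherwise earlier placements are not logged
tf.debugging.set_log_device_placement(True)
# From here on, each op logs where it runs, e.g. something like:
# "Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0"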
If anybody can provide some insight, thank you very much :)
I'm trying to optimize some weights in PyTorch but I keep getting this error:
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 8000000000000 bytes. Error code 12 (Cannot allocate memory).
Namely, things blow up when I run (weights * col).sum() / weights.sum(). weights is a tensor of size (1000000, 1) and col is also a tensor of size (1000000, 1). Both tensors are decently sized, but it seems odd that I'm using up all the memory in my computer (8 GB) for these operations.
It could be that your weights and col tensors are not aligned, i.e. one of them is transposed so that it is (1, 1000000) instead of (1000000, 1). When you then do (weights * col), the shapes are broadcast together and the result is a (1000000, 1000000) tensor, which is probably where the extreme memory usage comes from (the resulting tensor is 1000000 times bigger than your original tensor).
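A minimal sketch of the blow-up and the fix, assuming the tensor names from the question (col_t here is just a hypothetical, accidentally transposed col); note that 1,000,000 x 1,000,000 float64 values at 8 bytes each is exactly the 8000000000000 bytes from the error message:
import torch
weights = torch.rand(1_000_000, 1, dtype=torch.float64)
col = torch.rand(1_000_000, 1, dtype=torch.float64)
col_t = col.reshape(1, -1)            # accidentally transposed: shape (1, 1000000)
print((weights * col).shape)          # torch.Size([1000000, 1]) -- matched shapes, fine
# (weights * col_t) would broadcast to (1000000, 1000000) and try to allocate
# roughly 8 TB, which matches the allocation reported in the error message.
# Fix: bring both operands back to the same shape before multiplying
result = (weights * col_t.reshape(-1, 1)).sum() / weights.sum()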
I'm surprised to face an out-of-memory (OOM) error using the tf.keras.applications.ResNet50 implementation on an Nvidia RTX 2080 Ti (with 11 GB of memory!).
Question:
Is there something wrong with the workflow I use?
Notes:
I'm using tensorflow-gpu==2.0.0b1 with CUDA v10.1
I work on a segmentation task, thus the large output_shape
I build the batches myself, thus the use of train_on_batch()
Even when setting memory_growth to True (see the sketch after these notes), the memory gets filled up from 700 MB to 10850 MB in a fraction of a second.
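For reference, a minimal sketch of how memory growth is typically enabled in TF 2.x, using the standard tf.config.experimental API (not necessarily the exact snippet used here); it has to run before the GPU is initialized:
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    # Allocate GPU memory on demand instead of grabbing it all up front
    tf.config.experimental.set_memory_growth(gpu, True)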
Code:
import tensorflow as tf
import tensorflow.keras as ke
import numpy as np
ke.backend.clear_session()
inputs = ke.layers.Input(shape=(512,1024,3), dtype="float32")
outputs = ke.applications.ResNet50(include_top=False, weights="imagenet")(inputs)
outputs = ke.layers.Lambda(lambda x: tf.compat.v1.image.resize_bilinear(x, size=(512,1024)))(outputs)
outputs = ke.layers.Conv2D(2, 1, activation="softmax")(outputs)
model = ke.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=ke.optimizers.RMSprop(lr=0.001), loss=ke.losses.CategoricalCrossentropy())
images = np.zeros((1,512,1024,3), dtype=np.float32)
targets = np.zeros((1,512,1024,2), dtype=np.float32)
model.train_on_batch(images, targets)
ResNet being a complex model, the dimensions of the input might be the reason for the OOM error. Try reducing the input dimensions and the corresponding batch size (as much as the memory can hold) and try again.
As mentioned in the comments, it worked with batch size 1 and with dimensions 700*512.
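For illustration, a minimal sketch of that kind of reduction, reusing the model definition from the question with halved spatial dimensions and batch size 1 (the exact numbers are only an example, not the 700*512 mentioned in the comments):
import tensorflow as tf
import tensorflow.keras as ke
import numpy as np
ke.backend.clear_session()
# Same architecture as in the question, only at a lower resolution
inputs = ke.layers.Input(shape=(256,512,3), dtype="float32")
outputs = ke.applications.ResNet50(include_top=False, weights="imagenet")(inputs)
outputs = ke.layers.Lambda(lambda x: tf.compat.v1.image.resize_bilinear(x, size=(256,512)))(outputs)
outputs = ke.layers.Conv2D(2, 1, activation="softmax")(outputs)
model = ke.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=ke.optimizers.RMSprop(lr=0.001), loss=ke.losses.CategoricalCrossentropy())
images = np.zeros((1,256,512,3), dtype=np.float32)   # batch size 1
targets = np.zeros((1,256,512,2), dtype=np.float32)
model.train_on_batch(images, targets)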
I am a beginner with TF and I am trying to run the TensorFlow Object Detection API with:
GeForce 2GB-MX150
16GB RAM
I7 8550U
I get the following error when it starts training and I can't figure out what's wrong.
I have tried changing some parameters like the batch size multiple times, but I still get the same error.
In this picture you can see the total and available memory that the computer has.
I'd be grateful if you could help me.
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,1024,52,38] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/block3/unit_20/bottleneck_v1/conv3/Conv2D
= Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true,
_device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/block3/unit_20/bottleneck_v1/conv2/Relu, FirstStageFeatureExtractor/resnet_v1_101/block3/unit_20/bottleneck_v1/conv3/weights/read/_2629)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: gradients/FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/block3/unit_18/bottleneck_v1/conv3/Conv2D_grad/tuple/control_dependency_1/_3229
= _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_6894_...pendency_1", tensor_type=DT_FLOAT,
_device="/job:localhost/replica:0/task:0/device:CPU:0"]()]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
2 GB of GPU memory on your card is far too little for a huge model like ResNet-101.
I failed with the third way: t3 is still on the CPU. No idea why.
import numpy as np
import torch

a = np.random.randn(1, 1, 2, 3)
t1 = torch.tensor(a)
t1 = t1.to(torch.device('cuda'))                    # way 1: create on CPU, then move with .to()
t2 = torch.tensor(a)
t2 = t2.cuda()                                      # way 2: create on CPU, then move with .cuda()
t3 = torch.tensor(a, device=torch.device('cuda'))   # way 3: create directly on the GPU
All three methods worked for me.
In 1 and 2, you create a tensor on CPU and then move it to GPU when you use .to(device) or .cuda(). They are the same here.
However, when you use the .to(device) method you can explicitly tell torch to move to a specific GPU by setting device=torch.device("cuda:<id>"). With .cuda() you have to do .cuda(<id>) to move to a particular GPU.
Why do these two methods exist then?
.to(device) was introduced in 0.4 because it is easier to declare device variable at top of the code as
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
and use .to(device) everywhere. This makes it quite easy to switch from CPU to GPU and vice versa.
Before this, we had to use .cuda(), and your code would need an if check for cuda.is_available() everywhere, which made it cumbersome to switch between GPU and CPU.
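A minimal sketch of that pattern (net and x are placeholder names, not from the question):
import torch
import torch.nn as nn
# Pick the device once at the top of the script
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
net = nn.Linear(10, 1).to(device)    # move the model once
x = torch.randn(32, 10).to(device)   # move each batch the same way
out = net(x)                         # runs on the GPU if available, otherwise on the CPU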
The third method doesn't create a tensor on the CPU first; it allocates the data directly on the GPU, which is more efficient.
Example to make a 50 by 50 tensor of 0's directly on your Nvidia GPU:
zeros_tensor_gpu = torch.zeros((50, 50), device='cuda')
This will immensely speed up creation for big tensors such as 4000 by 4000.