How to run multi-GPU for part of a model? - pytorch

I"m trying to run multimodal data, whose sample consists of text tokens shape of (64, 512) and an image shape of (3, 256, 256).
Due to Memory issue, i am trying to run single sample, as my batch,
while running text tokens in parallel with 4 GPUS, where I pass (16, 512) to each GPU through encoder (i.e. BERT).
I'm trying to get average embedding of all of them. (i.e. (64, 768) -> (1, 768)). Then with this average embedding, afterwards the model proceeds with one GPU, to compute its relationship with the image.
My question is how can I now run/implement only a part of model for multi-gpus?
Thank you in advance!
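One approach that seems to match this setup (a minimal sketch, assuming the encoder returns a plain (chunks, 768) tensor; text_encoder, image_encoder and head are hypothetical placeholder modules) is to wrap only the text encoder in nn.DataParallel, so that stage alone is replicated across the 4 GPUs while the rest of the model stays on one device:

import torch.nn as nn

class PartiallyParallelModel(nn.Module):
    def __init__(self, text_encoder, image_encoder, head):
        super().__init__()
        # DataParallel splits dim 0 of the input across the visible GPUs,
        # so the (64, 512) token tensor becomes four (16, 512) chunks on
        # 4 GPUs; the per-chunk outputs are gathered back on the default device.
        self.text_encoder = nn.DataParallel(text_encoder)
        self.image_encoder = image_encoder  # runs on a single GPU
        self.head = head

    def forward(self, tokens, image):
        text_emb = self.text_encoder(tokens)           # (64, 768)
        text_emb = text_emb.mean(dim=0, keepdim=True)  # (1, 768)
        img_emb = self.image_encoder(image.unsqueeze(0))
        return self.head(text_emb, img_emb)

Note that DataParallel gathers its outputs onto a single default GPU, so the averaging and the image branch run there. This is only a sketch of the idea, not a drop-in for a Hugging Face BERT, whose forward returns a ModelOutput object rather than a bare tensor.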

Related

What does the `None` in `keras.summary()`'s output shape mean? [duplicate]

What is the meaning of the (None, 100) in the Output Shape?
Is this ("None") the sample number or the hidden dimension?
None means this dimension is variable.
The first dimension in a Keras model is always the batch size. You don't need fixed batch sizes, except in very specific cases (for instance, when working with stateful=True LSTM layers).
That's why this dimension is often ignored when you define your model. For instance, when you define input_shape=(100,200), you're actually ignoring the batch size and defining the shape of "each sample". Internally the shape will be (None, 100, 200), allowing a variable batch size, with each sample in the batch having the shape (100, 200).
The batch size will be then automatically defined in the fit or predict methods.
Other None dimensions:
It's not only the batch dimension that can be None; many others can be as well.
For instance, in a 2D convolutional network, where the expected input is (batchSize, height, width, channels), you can have shapes like (None, None, None, 3), allowing variable image sizes.
In recurrent networks and in 1D convolutions, you can also make the length/timesteps dimension variable, with shapes like (None, None, featuresOrChannels).
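A quick way to see this in practice (a small illustrative model, not taken from the question):

from tensorflow import keras
from tensorflow.keras import layers

# Only the per-sample shape is given; the batch dimension stays variable.
model = keras.Sequential([
    keras.Input(shape=(100, 200)),
    layers.Dense(64),
])
model.summary()  # Dense output shape is reported as (None, 100, 64)

# Variable spatial dimensions work the same way:
conv_model = keras.Sequential([
    keras.Input(shape=(None, None, 3)),  # any image size, 3 channels
    layers.Conv2D(16, (3, 3)),
])
conv_model.summary()  # Conv2D output shape: (None, None, None, 16)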
Yes, None in the summary means a dynamic batch (mini-batch) dimension.
This is why you can use any batch size with your model.
The summary() method is part of the Keras API incorporated into TF and calls the Keras print_summary() method under the hood.

Visualizing convolutional layers in an autoencoder

I have built a variational autoencoder using 2D convolutions (Conv2D) in the encoder and decoder. I'm using Keras. In total I have 2 layers with 32 and 64 filters, each with a kernel size of 4x4 and a stride of 2x2. My input images are (64, 80, 1). I'm using the MSE loss. Now, I would like to visualize the individual convolutional layers (i.e. what they learn) as done here.
So, first I load my model using the load_weights() function, and then I call visualize_layer(encoder, 'conv2d_1') from the above-mentioned code, where conv2d_1 is the name of the first convolutional layer in my encoder.
When I do so, I get the error message
tensorflow.python.framework.errors_impl.UnimplementedError: Fused conv implementation does not support grouped convolutions for now.
[[{{node conv2d_1/BiasAdd}}]]
When I use the VGG16 model, as in the example code, it works. Does somebody know how I can adapt the code to work for my case?

Keras: Time CNN+LSTM for video recognition

I am trying to implement the model shown in the picture above, which basically consists of time-distributed CNNs followed by a sequence of LSTMs, using Keras with TF.
I have two classes of video, and I extract frames from each captured video. The number of frames extracted varies per video; it is not fixed.
However, I am having trouble figuring out how to load my image frames for each video in each class to build x_train, x_test, y_train, y_test.
model = Sequential()
model.add(
    TimeDistributed(
        Conv2D(64, (3, 3), activation='relu'),
        input_shape=(data.num_frames, data.width, data.height, 1)
    )
)
I don't know what to put for data.num_frames if each video contains a different number of extracted frames.
The inputs are short videos, just 3-8 seconds long (i.e. a sequence of frames).
You can use None since this dimension doesn't affect the number of trainable weights of your model.
You will have problems, though, creating a batch of videos with numpy, since numpy doesn't accept variable sizes.
You can train each video individually, or you can create dummy frames (zero padding) to make all videos reach the same maximum length and then use a Masking layer to ignore these frames. (Certain Keras versions have problems when using TimeDistributed + Masking.) A sketch of the variable-length approach follows.
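A rough sketch of that idea (assuming 64x64 grayscale frames and two classes; the sizes are illustrative, not taken from the question):

from tensorflow import keras
from tensorflow.keras import layers

# The frames dimension is None, so each video can have a different length.
model = keras.Sequential([
    keras.Input(shape=(None, 64, 64, 1)),
    layers.TimeDistributed(layers.Conv2D(64, (3, 3), activation='relu')),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Flatten()),
    layers.LSTM(128),
    layers.Dense(2, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')

# Training one video at a time sidesteps numpy's fixed-size requirement:
# x_video has shape (1, n_frames, 64, 64, 1), y_video has shape (1, 2).
# model.train_on_batch(x_video, y_video)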

CNN-LSTM Image Classification

Is it possible to reshape a 512x512 RGB image to (timestep, dim)? In other words, I am trying to convert this reshape layer, Reshape((23, 3887)), to work with 512 instead of 299. Also, is there any documentation explaining how to determine input_dim and timestep for Keras?
It seems like your problem is similar to one that I had earlier today. Look at it here: Keras functional API: Combine CNN model with a RNN to look at sequences of images
Now, to add to the answer from the question I linked to: let the number of images be n. In your case the original data format would be (n, 512, 512, 3). All you then need to do is decide how many images you want per sequence. Say you want a sequence of 5 images and have 5000 images in total. Then reshaping to (1000, 5, 512, 512, 3) should do it. This way the model sees 1000 sequences of 5 images.
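For instance (a small numpy sketch with placeholder data matching the numbers above):

import numpy as np

n = 5000  # total number of images
images = np.zeros((n, 512, 512, 3), dtype=np.float32)  # placeholder data

frames_per_sequence = 5
sequences = images.reshape(
    n // frames_per_sequence, frames_per_sequence, 512, 512, 3
)
print(sequences.shape)  # (1000, 5, 512, 512, 3)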

Maxpooling Layer causes error in Keras

I have created a CNN in Keras with 12 convolutional layers, each followed by BatchNormalization, Activation and MaxPooling. A sample layer is:
model.add(Conv2D(256, (3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=2))
I start with 32 feature maps and end with 512. If I add MaxPooling after every Conv layer, as in the code above, I get an error in the final layer:
ValueError: Negative dimension size caused by subtracting 2 from 1 for 'max_pooling2d_11/MaxPool' (op: 'MaxPool') with input shapes: [?,1,1,512].
If I omit the MaxPooling in any one layer, the model compiles and starts training. I am using TensorFlow as the backend, and I have the correct input shape for the image in the first layer.
Are there any suggestions as to why this may be happening?
If your spatial dimensions are 256x256, then you cannot have more than 8 max-pooling layers in your network. As 2 ** 8 == 256, after downsampling by a factor of two eight times, your feature maps will be 1x1 in the spatial dimensions, meaning you cannot perform max pooling again, as you would get a 0x0 or negative dimension.
It's just an obvious limitation of max pooling, but it is not always discussed in papers.
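To make the arithmetic concrete (a pure-Python check, assuming the 256x256 input mentioned above):

size = 256
for i in range(1, 9):
    size //= 2  # each MaxPooling2D with pool_size=2 halves the spatial size
    print(f"after pooling layer {i}: {size}x{size}")
# After pooling layer 8 the feature maps are 1x1, so a ninth pooling layer
# would try to pool a 1x1 map and Keras raises the negative-dimension error.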
This can also be caused by having your input image in the wrong format: if you're passing (3, X, Y) and the model expects (X, Y, 3), then the downsampling occurs on the colour channels and causes issues.
