nn.Conv1d consumes too much GPU memory - pytorch

I have a ResBlock, shown below, which can also change the feature vector length. However, it consumes far too much GPU memory: a single ResBlock like this can use as much as 2.3GB, which causes CUDA_OUT_OF_MEMORY errors all the time.
Typical input size (batch size included): (65536, 256) or (65536, 63)
Typical output size (per ResBlock): (65536, 256)
The UpwardsConv1d module can change the feature vector length with convolution.
You might think the batch size is too big, but a linear layer handles it fine, consuming only around 100MB of GPU memory per layer. I would expect the Conv1d layers, with significantly fewer trainable parameters, to handle it as well.
nn.Linear(63, 256) trainable parameters: 16384
ResBlock(63, 256) trainable parameters: 104
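The two parameter counts above can be checked by hand, and the same arithmetic gives a rough size for the intermediate tensors. The sketch below is plain Python, with no PyTorch needed. It assumes float32 activations and counts only the forward outputs of each layer, so it ignores autograd buffers, cuDNN workspace and allocator caching. It therefore does not by itself explain the 2.3GB figure. It only shows that activation memory scales with the batch size and not with the parameter count.

```python
def conv1d_params(c_in, c_out, kernel, bias=True):
    """Trainable parameters of a Conv1d layer."""
    return c_in * c_out * kernel + (c_out if bias else 0)


def linear_params(n_in, n_out, bias=True):
    """Trainable parameters of a Linear layer."""
    return n_in * n_out + (n_out if bias else 0)


def activation_bytes(*shape, bytes_per_value=4):
    """Size of one float32 tensor of the given shape."""
    total = bytes_per_value
    for dim in shape:
        total *= dim
    return total


BATCH, SIZE_IN, SIZE_OUT, SIZE_H, K = 65536, 63, 256, 2, 3
factor = -(-SIZE_OUT // SIZE_IN)  # ceil(256 / 63) = 5

# ResBlock(63, 256) = conv0 + UpwardsConv1d.conv0 + UpwardsConv1d.conv1
resblock_param_count = (
    conv1d_params(1, SIZE_H, K)
    + conv1d_params(SIZE_H, factor, K)
    + conv1d_params(1, 1, SIZE_IN * factor + 1 - SIZE_OUT)
)
linear_param_count = linear_params(SIZE_IN, SIZE_OUT)

# Forward outputs only (an assumption; real usage is higher).
resblock_activations = {
    "ResBlock.conv0": activation_bytes(BATCH, SIZE_H, SIZE_IN),
    "UpwardsConv1d.conv0": activation_bytes(BATCH, factor, SIZE_IN),
    "UpwardsConv1d.conv1": activation_bytes(BATCH, 1, SIZE_OUT),
}
linear_activation = activation_bytes(BATCH, SIZE_OUT)

print(resblock_param_count, linear_param_count)
print(sum(resblock_activations.values()) / 2**20, "MiB vs",
      linear_activation / 2**20, "MiB")
```

Even in this optimistic count, the convolutional block keeps roughly three times as many output values as the linear layer, despite having about 150 times fewer parameters.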
import math
import torch.nn as nn

class UpwardsConv1d(nn.Module):
    """Increase feature vector length by flattening dimensions."""

    def __init__(self, size_in, size_out=None, size_h=8, k_size=3):
        super().__init__()
        self.size_in = size_in
        if size_out is None:
            self.size_out = size_in
        else:
            self.size_out = size_out
        self.size_h = size_h
        self.k_size = k_size
        factor = math.ceil(self.size_out/self.size_in)
        # (B, size_h, size_in) -> (B, factor, size_in); the padding keeps the length.
        self.conv0 = nn.Conv1d(self.size_h, factor,
                               self.k_size, padding=(self.k_size-1)//2)
        self.f = nn.Flatten(1, 2)
        # Unpadded conv that trims size_in*factor down to exactly size_out.
        self.conv1 = nn.Conv1d(1, 1, self.size_in*factor + 1 - self.size_out)

    def forward(self, x):
        x = self.conv0(x)
        x = self.f(x)
        x = x.unsqueeze(-2)
        x = self.conv1(x)
        return x
class ResBlock(nn.Module):
    """Acts much like an nn.Linear module, but uses convolution and so has fewer trainable parameters."""

    def __init__(self, size_in, size_out=None, size_h=2, k_size=3):
        super().__init__()
        self.size_in = size_in
        if size_out is None:
            self.size_out = size_in
        else:
            self.size_out = size_out
        self.size_h = size_h
        self.k_size = k_size
        # The end of this call is cut off in the post. Length-preserving padding
        # is assumed, since UpwardsConv1d expects an input of length size_in.
        self.conv0 = nn.Conv1d(1, self.size_h, self.k_size,
                               padding=(self.k_size-1)//2)
        if self.size_in == self.size_out:
            self.conv1 = nn.Conv1d(self.size_h, 1, (self.k_size-1)//2)
        else:
            self.conv1 = UpwardsConv1d(
                self.size_in, self.size_out, self.size_h, self.k_size)

    def forward(self, x):
        x = x.unsqueeze(-2)   # (B, L) -> (B, 1, L)
        x = self.conv0(x)
        x = self.conv1(x)
        x = x.squeeze(-2)     # (B, 1, L_out) -> (B, L_out)
        return x


DQN not converging

I am trying to implement DQN in openai-gym's "lunar lander" environment.
It shows no sign of converging after 3000 training episodes. (For comparison, a very simple policy gradient method converges after 2000 episodes.)
I went through my code several times but can't find what's wrong. I hope someone here can point out where the problem is. Below is my code:
I use a simple fully-connected network:
class Net(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # No activation layers appear between the Linear layers in the post.
        self.main = nn.Sequential(
            nn.Linear(8, 16),
            nn.Linear(16, 16),
            nn.Linear(16, 4)
        )

    def forward(self, state):
        return self.main(state)
I use epsilon-greedy when choosing actions, and epsilon (starting from 0.5) decays exponentially over time:
def sample_action(self, state):
    # Epsilon decays on every call, i.e. on every action selection.
    self.epsilon = self.epsilon * 0.99
    # The network outputs Q-values, despite the variable name.
    action_probs = self.network_train(state)
    random_number = random.random()
    if random_number < (1-self.epsilon):
        action = torch.argmax(action_probs, dim=-1).item()
    else:
        action = random.choice([0, 1, 2, 3])
    return action
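To get a feel for how fast this schedule decays, the short sketch below counts how many calls to sample_action it takes for epsilon to fall below a threshold. The helper name and the 1% threshold are illustrative assumptions, not part of the original code. Whether this decay is too fast for Lunar Lander is a judgment call.

```python
def steps_until_below(start, decay, threshold):
    """Count multiplicative decay steps until epsilon drops below threshold."""
    epsilon, steps = start, 0
    while epsilon >= threshold:
        epsilon *= decay
        steps += 1
    return steps


# Same start value and decay factor as in sample_action above.
steps_to_one_percent = steps_until_below(start=0.5, decay=0.99, threshold=0.01)
print(steps_to_one_percent)
```

Because the decay is applied per action and not per episode, exploration is almost entirely switched off after a few hundred environment steps.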
When training, I use a replay buffer, batch size of 64, and gradient clipping:
def learn(self):
    if len(self.buffer) >= BATCH_SIZE:
        self.learn_counter += 1
        transitions = self.buffer.sample(BATCH_SIZE)
        batch = Transition(*zip(*transitions))
        state = torch.from_numpy(np.concatenate(batch.state)).reshape(-1, 8)
        action = torch.tensor(batch.action).reshape(-1, 1)
        reward = torch.tensor(batch.reward).reshape(-1, 1)
        # Q(s, a) for the actions that were actually taken.
        state_value = self.network_train(state).gather(1, action)
        next_state = torch.from_numpy(np.concatenate(batch.next_state)).reshape(-1, 8)
        # max over a' of Q_target(s', a'); detached so no gradient reaches the target network.
        next_state_value = self.network_target(next_state).max(1)[0].reshape(-1, 1).detach()
        loss = F.mse_loss(state_value.float(), (self.DISCOUNT_FACTOR*next_state_value + reward).float())
        # The optimizer calls are not shown in the post; self.optimizer is an assumed name.
        self.optimizer.zero_grad()
        loss.backward()
        for param in self.network_train.parameters():
            param.grad.data.clamp_(-1, 1)
        self.optimizer.step()
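The regression target used in the loss above is reward + DISCOUNT_FACTOR * max Q_target(next_state). The sketch below computes it in plain Python for two made-up transitions. The optional dones argument is an assumption that is not in the original code: many DQN implementations drop the bootstrap term on terminal transitions, and it is shown here only for comparison.

```python
def td_targets(rewards, next_q_values, gamma, dones=None):
    """One-step TD targets: r + gamma * max_a' Q_target(s', a').

    dones is optional and hypothetical here: when given, terminal
    transitions use the reward alone as their target.
    """
    if dones is None:
        dones = [False] * len(rewards)
    targets = []
    for reward, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(reward + bootstrap)
    return targets


# Illustrative numbers only (4 actions, as in Lunar Lander).
rewards = [1.0, -0.5]
next_q = [[0.2, 0.7, 0.1, 0.0], [1.0, 0.3, 0.2, 0.9]]

as_in_post = td_targets(rewards, next_q, gamma=0.99)
with_terminal_mask = td_targets(rewards, next_q, gamma=0.99, dones=[False, True])
print(as_in_post, with_terminal_mask)
```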
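I also use a target network, whose parameters are updated every 100 timesteps:
def update_network_target(self):
    if (self.learn_counter % 100) == 0:
        # The body is missing in the post; a hard copy of the training weights is assumed.
        self.network_target.load_state_dict(self.network_train.state_dict())
BTW, I use an Adam optimizer and an LR of 1e-3.
Solved. Apparently the target network was being updated too frequently. I set it to update every 10 episodes, which fixed the problem.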

Distributed sequential windowed data in pytorch

At every epoch of my training, I need to split my dataset in n batches of t consecutive samples. For example, if my data is [1,2,3,4,5,6,7,8,9,10], n = 2 and t = 3 then valid batches would be
[1-2-3, 4-5-6] and [7-8-9, 10-1-2]
[2-3-4, 8-9-10] and [5-6-7, 1-2-3]
My old version is the following, but it samples every point in the data, meaning that I would parse the whole dataset t times per epoch.
train_dataset = list(range(n))
train_sampler = None
if distributed:
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=bsize, shuffle=(train_sampler is None),
    pin_memory=True, sampler=train_sampler)
for epoch in range(epochs):
    if distributed:
        # The body is missing in the post; set_epoch is the usual call here.
        train_sampler.set_epoch(epoch)
    for starting_i in train_loader:
        # Expand each starting index into t consecutive indices, wrapping around at n.
        batch = np.array([np.mod(np.arange(i, i + t), n) for i in starting_i])
I have now implemented my own sampling function that splits the data into random batches in which each starting index is exactly t away from its two nearest neighbours. In the non-distributed scenario, I can do
for epoch in range(epochs):
    pad = np.random.randint(n)
    train_loader = np.mod(np.arange(pad, n + pad, t), n)
    train_loader = np.array_split(train_loader,
                                  np.ceil(len(train_loader) / bsize))
    for starting_i in train_loader:
        batch = np.array([np.mod(np.arange(i, i + t), n) for i in starting_i])
How do I make this version distributed? Do I need to make a custom torch.nn.parallel.DistributedDataParallel or torch.utils.data.DataLoader?
I have checked the DistributedSampler class, and my guess is that I have to override the __iter__ method. Am I right?
How does DistributedSampler split the dataset? Is it sequential among num_replicas?
Say num_replicas = 2. Would my dataset be split into [1,2,3,4,5] and [6,7,8,9,10] between the 2 workers? Or is it random, like [1,4,7,3,10] and [2,9,5,8,6]? The first case would be OK for me because it keeps samples sequential, but the second would not.
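As far as I can tell from the PyTorch source, DistributedSampler does neither of these. It builds one index list (shuffled with a seed shared by all replicas when shuffle=True), pads it so that it divides evenly, and then gives each rank every num_replicas-th index. The sketch below emulates that in plain Python. The function name is hypothetical, random.Random stands in for torch's seeded permutation, and drop_last is not handled, so it is worth checking against the sampler in your installed version.

```python
import math
import random


def emulate_distributed_sampler(dataset_len, num_replicas, rank,
                                shuffle=False, seed=0, epoch=0):
    """Indices one rank would see, following DistributedSampler's strided split."""
    indices = list(range(dataset_len))
    if shuffle:
        # Stand-in for the seeded permutation; every rank uses the same order.
        random.Random(seed + epoch).shuffle(indices)
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    # Pad by repeating the first indices so all ranks get the same count.
    indices += indices[: total_size - len(indices)]
    # Strided split: rank, rank + num_replicas, rank + 2 * num_replicas, ...
    return indices[rank:total_size:num_replicas]


rank0 = emulate_distributed_sampler(10, num_replicas=2, rank=0)
rank1 = emulate_distributed_sampler(10, num_replicas=2, rank=1)
print(rank0, rank1)
```

With shuffle=False and two replicas the split is interleaved (even indices on one rank, odd on the other), so neither replica receives a contiguous half of the dataset.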
I ended up making my own Dataset where the data is [t, t + window, ... t + n * window]. Every time it is called it randomizes the starting indices of the window. Then the sampler does the shuffling as usual. For reproducibility, it has a set_seed method similar to set_epoch of samplers.
class SequentialWindowedDataset(Dataset):
    def __init__(self, size, window):
        self.size = size
        self.window = window
        self.seed = 0
        self.data = np.arange(0, self.size, self.window)

    def __getitem__(self, index):
        rng = np.random.default_rng(self.seed)
        pad = rng.integers(0, self.size)
        data = (self.data + pad) % self.size
        return data[index]

    def __len__(self):
        return len(self.data)

    def set_seed(self, seed):
        self.seed = seed
The following version randomizes the data outside the call and it is much much faster.
class SequentialWindowedDataset(Dataset):
    def __init__(self, size, window):
        self.size = size
        self.window = window
        self.data = np.arange(0, self.size, self.window)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

    def randomize(self, seed):
        rng = np.random.default_rng(seed)
        pad = rng.integers(0, self.size)
        self.data = (self.data + pad) % self.size
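The per-epoch logic of this dataset can be tried without PyTorch. The NumPy-only sketch below draws a random offset per epoch, shifts the window starts, and expands each start into a window of consecutive indices with broadcasting, replacing the list comprehension used earlier. The helper names and the small sizes (10 samples, windows of 3) are illustrative assumptions.

```python
import numpy as np


def window_starts(size, window, seed):
    """Starting indices spaced `window` apart, shifted by a seeded random offset."""
    rng = np.random.default_rng(seed)
    pad = int(rng.integers(0, size))
    return (np.arange(0, size, window) + pad) % size


def expand_windows(starts, window, size):
    """Turn each start into `window` consecutive indices, wrapping at `size`."""
    starts = np.asarray(starts)
    return (starts[:, None] + np.arange(window)) % size


size, window = 10, 3
for epoch in range(3):
    starts = window_starts(size, window, seed=epoch)
    batch = expand_windows(starts, window, size)
    # Each row is one window; every sample appears in at least one window.
    covered = set(batch.ravel().tolist())
    print(epoch, batch.tolist(), covered == set(range(size)))
```

Seeding with the epoch number mirrors what set_epoch does for samplers, so every process that uses the same seed sees the same offset.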

PyTorch: GRU, one-to-many / many-to-one

I would like to implement a GRU that encodes a sequence of vectors into one vector (many-to-one), and then another GRU that decodes a vector into a sequence of vectors (one-to-many). The size of the vectors would not change. I would like an opinion on what I implemented.
Here is the code:
class AEGRU(nn.Module):
    def __init__(self, opt):
        super(AEGRU, self).__init__()
        self.length = 256
        self.latent_space = 256
        self.num_layers = 1
        self.GRU_enc = nn.GRU(input_size=3, hidden_size=self.latent_space, num_layers=self.num_layers, batch_first=True)
        self.fc_enc = nn.Linear(self.latent_space, self.latent_space)
        self.GRU_dec = nn.GRU(input_size=self.latent_space, hidden_size=3, num_layers=self.num_layers, batch_first=True)
        self.fc_dec = nn.Linear(3, 3)

    def enc(self, x):
        # x has shape: Batch_size x self.length x 3
        h0 = torch.zeros(self.num_layers, x.shape[0], self.latent_space).cuda()
        out, _ = self.GRU_enc(x, h0)
        out = out[:, -1, :]
        out = self.fc_enc(out)
        return out

    def dec(self, x):
        # x has shape: Batch_size x self.latent_space
        x = x[:, None, :]
        h = torch.zeros(self.num_layers, x.shape[0], 3).cuda()
        # method 1 ??
        '''outputs = torch.zeros(x.shape[0], self.length, 3).cuda()
        for i in range(self.length):
            out, h = self.GRU_dec(x, h)
            outputs[:, i, :] = out[:, 0, :]'''
        # method 2 ??
        x = x.repeat(1, self.length, 1)
        outputs, _ = self.GRU_dec(x, h)
        # linear layer
        outputs = self.fc_dec(outputs)
        return outputs

    def forward(self, x):
        self.indices = []
        latent = self.enc(x)
        output = self.dec(latent)
        return output
I am not sure whether this is a good way to do a one-to-many GRU. Could I have some opinions about this?
Thanks for reading!
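One observation that may help: "method 1" and "method 2" in dec() should give the same outputs, because both feed the same latent vector at every time step and carry the hidden state forward. Method 2 simply lets nn.GRU run the loop internally. The sketch below checks this on small, made-up sizes on the CPU. The dimensions and variable names are assumptions for illustration, not the ones from the model above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, length, latent, out_dim = 4, 6, 8, 3

gru = nn.GRU(input_size=latent, hidden_size=out_dim,
             num_layers=1, batch_first=True)
z = torch.randn(batch, 1, latent)       # one latent vector per sample
h0 = torch.zeros(1, batch, out_dim)     # initial hidden state

with torch.no_grad():
    # Method 1: step the GRU manually, feeding the same vector each time.
    h = h0
    steps = []
    for _ in range(length):
        out, h = gru(z, h)
        steps.append(out[:, 0, :])
    stepwise = torch.stack(steps, dim=1)

    # Method 2: repeat the vector along the time axis and run the GRU once.
    repeated, _ = gru(z.repeat(1, length, 1), h0)

max_abs_diff = (stepwise - repeated).abs().max().item()
print(stepwise.shape, max_abs_diff)
```

If the two agree, the single call of method 2 is the simpler choice. A different design, where each step's output is fed back as the next input, would need the explicit loop.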

save embedding layer in pytorch model

I have this model:
class model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=12,out_channels=64,kernel_size=3,stride= 1,padding=1)
        # self.conv2 = nn.Conv2d(in_channels=64,out_channels=64,kernel_size=3,stride= 1,padding=1)
        self.fc1 = nn.Linear(24576, 128)
        self.bn = nn.BatchNorm1d(128)
        self.dropout1 = nn.Dropout2d(0.5)
        self.fc2 = nn.Linear(128, 10)
        self.fc3 = nn.Linear(10, 3)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        # x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, (2,2))
        # print(x.shape)
        x = x.view(-1,24576)
        x = self.bn(F.relu(self.fc1(x)))
        x = self.dropout1(x)
        embeding_stage = F.relu(self.fc2(x))
        x = self.fc3(embeding_stage)
        return x
and I want to save the embeding_stage layer the way I save the model here:
model = model()
# Raw string, so the backslashes in the Windows path are not treated as escapes.
torch.save(model.state_dict(), r'C:\project\count_speakers\model_pytorch.h5')
I'm not sure I understand what you mean by "save the embedding_stage layer", but if you want to save fc2 or fc3 or something, you can do that with torch.save().
Ex: to save fc3: torch.save(model.fc3, 'C:\...\fc3.pt')
OP wants the output of the embedding_stage.
You can do that in several ways:
Load your model with model.load_state_dict(torch.load('C:\...\model_pytorch.h5')),
then build model = nn.Sequential(*list(model.children())[:-1]). The output of model is the embeding_stage. Note that this only works when forward() just calls the child modules in order; the model above also applies F.relu, F.max_pool2d and view inside forward(), which nn.Sequential would skip, so the next option is the safer one here.
Or make a Model2(nn.Module), exactly the same as your first Model(), but replace return x in def forward(self, x): with return embeding_stage. Then load the state of your first model into your second model like this: model2.load_state_dict(torch.load('C:\...\model_pytorch.h5'))
This way fc3 will be loaded, but not used. The output of model2(x) will be the embeding_stage.
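A minimal sketch of the second approach follows, under some assumptions. TinyNet is a hypothetical, much smaller stand-in for the model in the question, and the weights are copied in memory, not read from the .h5 file. The subclass reuses all layers and only changes what forward() returns.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyNet(nn.Module):
    """Hypothetical stand-in for the model in the question."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(20, 16)
        self.fc2 = nn.Linear(16, 10)
        self.fc3 = nn.Linear(10, 3)

    def embed(self, x):
        x = F.relu(self.fc1(x))
        return F.relu(self.fc2(x))      # plays the role of embeding_stage

    def forward(self, x):
        return self.fc3(self.embed(x))


class TinyNetEmbedding(TinyNet):
    """Same layers and state_dict keys, but forward() returns the embedding."""

    def forward(self, x):
        return self.embed(x)


torch.manual_seed(0)
trained = TinyNet().eval()
embedder = TinyNetEmbedding().eval()
# fc3 weights are loaded too, they are just never used by embedder.
embedder.load_state_dict(trained.state_dict())

x = torch.randn(5, 20)
with torch.no_grad():
    embedding = embedder(x)
    logits = trained(x)
    consistent = torch.allclose(trained.fc3(embedding), logits)
print(embedding.shape, consistent)
```

The embedding tensor itself can then be stored with torch.save like any other tensor.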

Has anyone written weldon pooling for keras?

Has the Weldon pooling [1] been implemented in Keras?
I can see that it has been implemented in pytorch by the authors [2] but cannot find a keras equivalent.
[1] T. Durand, N. Thome, and M. Cord. Weldon: Weakly supervised learning of deep convolutional neural networks. In CVPR, 2016.
[2] https://github.com/durandtibo/weldon.resnet.pytorch/tree/master/weldon
Here is one based on the Lua version. (There is a PyTorch implementation, but I think it has an error in how it averages max + min.) I'm assuming the Lua version's average of the top max and min values is correct. I haven't tested all the custom-layer aspects, but it is close enough to get something going. Comments welcome.
# Keras 2.x-era imports (assumed; adjust to your Keras/TensorFlow version).
import tensorflow as tf
from keras.layers import Layer, InputSpec
from keras.utils import conv_utils

class WeldonPooling(Layer):
    """Class to implement Weldon selective spatial pooling with negative evidence."""

    def __init__(self, kmax, kmin=-1, data_format=None, **kwargs):
        super(WeldonPooling, self).__init__(**kwargs)
        self.kmax = kmax
        self.kmin = kmin
        self.data_format = conv_utils.normalize_data_format(data_format)
        self.input_spec = InputSpec(ndim=4)

    def compute_output_shape(self, input_shape):
        if self.data_format == 'channels_last':
            return (input_shape[0], input_shape[3])
        else:
            return (input_shape[0], input_shape[1])

    def get_config(self):
        config = {'kmax': self.kmax, 'kmin': self.kmin,
                  'data_format': self.data_format}
        base_config = super(WeldonPooling, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

    def call(self, inputs):
        if self.data_format == "channels_last":
            inputs = tf.transpose(inputs, [0, 3, 1, 2])
        shape = tf.shape(inputs)
        batch_size = shape[0]
        num_channels = shape[1]
        h = shape[2]
        w = shape[3]
        n = h * w
        view = tf.reshape(inputs, [batch_size, num_channels, n])
        # Sort the n spatial scores of each channel in descending order.
        sorted_vals, indices = tf.nn.top_k(view, n, sorted=True)
        # Mean of the kmax highest scores.
        output = tf.reduce_sum(tf.slice(sorted_vals, [0, 0, 0], [batch_size, num_channels, self.kmax]), 2) / self.kmax
        if self.kmin > 0:
            # Add the mean of the kmin lowest scores (negative evidence).
            output = output + tf.reduce_sum(tf.slice(sorted_vals, [0, 0, n - self.kmin], [batch_size, num_channels, self.kmin]), 2) / self.kmin
        return tf.reshape(output, [batch_size, num_channels])
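To make the arithmetic of the layer concrete, here is a plain-Python sketch of what it computes for one channel of one image, following the layer above: the mean of the kmax highest scores plus, when kmin > 0, the mean of the kmin lowest scores. The function name and the example scores are made up for illustration.

```python
def weldon_pool(scores, kmax, kmin=0):
    """Mean of the kmax largest scores plus mean of the kmin smallest ones."""
    ordered = sorted(scores, reverse=True)
    pooled = sum(ordered[:kmax]) / kmax
    if kmin > 0:
        # Negative evidence: the lowest-scoring regions also contribute.
        pooled += sum(ordered[-kmin:]) / kmin
    return pooled


# Flattened h*w score map of a single channel (illustrative values).
region_scores = [1, 5, 3, -2, 0, 4]
top_only = weldon_pool(region_scores, kmax=2)
with_negative_evidence = weldon_pool(region_scores, kmax=2, kmin=2)
print(top_only, with_negative_evidence)
```

Here the two highest scores (5 and 4) average to 4.5 and the two lowest (0 and -2) average to -1.0, so the pooled value with negative evidence is 3.5.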
