Efficient way of loading minibatches < gpu memory - theano

I have the following scenario:
My dataset >> gpu memory
My minibatches < gpu memory ... such that depending on size I can fit up to 10 in memory at once while still training no problem.
The size of my dataset means I won't revisit datapoints, so I guess there is no point in making them shared? Or is there? I was thinking that it might be beneficial to have up to 10 initialised shared variables of size=mini-batch, so that I can swap 10 in at once instead of just one at a time. Also, is it possible to preload mini-batches in parallel?

If you're not revisiting datapoints then there probably isn't any value in using shared variables.
The following code could be modified and used to evaluate the different methods of getting data into your specific computation.
The "input" method is the one that will probably be best when you have no need to revisit data. The "shared_all" method may outperform everything else but only if you can fit the entire dataset in GPU memory. The "shared_batched" allows you to evaluate whether hierarchically batching your data could help.
In the "shared_batched" method, the dataset is divided into many macro batches and each macro batch is divided into many micro batches. A single shared variable is used to hold a single macro batch. The code evaluates all the micro batches within the current macro batch. Once a complete macro batch has been processed the next macro batch is loaded into the shared variable and the code iterates over the micro batches within it again.
In general, a small number of large memory transfers can be expected to run faster than a larger number of smaller transfers (where the total amount transferred is the same in both cases). But this needs to be tested (e.g. with the code below) before it can be known for sure; YMMV.
The use of the "borrow" parameter may also have a significant impact on the performance, but be aware of the implications before using it.
import math
import timeit

import numpy
import theano
import theano.tensor as tt


def test_input(data, batch_size):
    # "input" method: every batch is passed as an ordinary function argument.
    assert data.shape[0] % batch_size == 0
    batch_count = data.shape[0] // batch_size
    x = tt.tensor4()
    f = theano.function([x], outputs=x.sum())
    total = 0.
    start = timeit.default_timer()
    for batch_index in range(batch_count):
        total += f(data[batch_index * batch_size: (batch_index + 1) * batch_size])
    print('IN\tNA\t%s\t%s\t%s\t%s' % (batch_size, batch_size, timeit.default_timer() - start, total))


def test_shared_all(data, batch_size):
    # "shared_all" method: the whole dataset lives in one shared variable.
    batch_count = data.shape[0] // batch_size
    for borrow in (True, False):
        start = timeit.default_timer()
        all_data = theano.shared(data, borrow=borrow)
        load_time = timeit.default_timer() - start
        x = tt.tensor4()
        i = tt.lscalar()
        f = theano.function([i], outputs=x.sum(),
                            givens={x: all_data[i * batch_size:(i + 1) * batch_size]})
        total = 0.
        start = timeit.default_timer()
        for batch_index in range(batch_count):
            total += f(batch_index)
        print('SA\t%s\t%s\t%s\t%s\t%s' % (
            borrow, batch_size, batch_size, load_time + timeit.default_timer() - start, total))


def test_shared_batched(data, macro_batch_size, micro_batch_size):
    # "shared_batched" method: one shared variable holds the current macro batch.
    assert data.shape[0] % macro_batch_size == 0
    assert macro_batch_size % micro_batch_size == 0
    macro_batch_count = data.shape[0] // macro_batch_size
    micro_batch_count = macro_batch_size // micro_batch_size
    macro_batch = theano.shared(
        numpy.empty((macro_batch_size,) + data.shape[1:], dtype=theano.config.floatX),
        borrow=True)
    x = tt.tensor4()
    i = tt.lscalar()
    f = theano.function([i], outputs=x.sum(),
                        givens={x: macro_batch[i * micro_batch_size:(i + 1) * micro_batch_size]})
    for borrow in (True, False):
        total = 0.
        start = timeit.default_timer()
        for macro_batch_index in range(macro_batch_count):
            # One large transfer per macro batch ...
            macro_batch.set_value(
                data[macro_batch_index * macro_batch_size: (macro_batch_index + 1) * macro_batch_size],
                borrow=borrow)
            # ... followed by many cheap slices of the data already on the device.
            for micro_batch_index in range(micro_batch_count):
                total += f(micro_batch_index)
        print('SB\t%s\t%s\t%s\t%s\t%s' % (
            borrow, macro_batch_size, micro_batch_size, timeit.default_timer() - start, total))


def main():
    numpy.random.seed(1)
    shape = (20000, 3, 32, 32)
    print('Creating random data with shape %s' % (shape,))
    data = numpy.random.standard_normal(size=shape).astype(theano.config.floatX)
    print('Running tests')
    for macro_batch_size in (shape[0] // 10 ** i for i in range(int(math.log(shape[0], 10)))):
        test_shared_all(data, macro_batch_size)
        test_input(data, macro_batch_size)
        for micro_batch_size in (macro_batch_size // 10 ** i for i in
                                 range(int(math.log(macro_batch_size, 10)) + 1)):
            test_shared_batched(data, macro_batch_size, micro_batch_size)


main()
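The macro/micro batching idea can also be looked at in isolation, without Theano. The following is only a sketch of the index bookkeeping: the name iter_micro_batches is made up for this illustration, and the copy made with numpy.array merely stands in for the host-to-device transfer that set_value performs on a GPU.

```python
import numpy as np


def iter_micro_batches(data, macro_batch_size, micro_batch_size):
    """Yield micro batches while holding only one macro batch at a time."""
    assert data.shape[0] % macro_batch_size == 0
    assert macro_batch_size % micro_batch_size == 0
    for macro_start in range(0, data.shape[0], macro_batch_size):
        # Stand-in for the single large transfer of one macro batch.
        macro_batch = np.array(data[macro_start:macro_start + macro_batch_size])
        for micro_start in range(0, macro_batch_size, micro_batch_size):
            # Stand-in for slicing the data that is already on the device.
            yield macro_batch[micro_start:micro_start + micro_batch_size]


data = np.arange(24, dtype=np.float32).reshape(12, 2)
batches = list(iter_micro_batches(data, macro_batch_size=6, micro_batch_size=2))
total = sum(float(b.sum()) for b in batches)
print(len(batches), total)
```

Every row is visited exactly once, so the total matches data.sum(); only the number and size of the large copies changes with macro_batch_size.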

Related

Limit of Python recursion functions (Process finished with exit code 139)

I had an old script that from a pandas dataframe calculates new columns from others, but also from the previous result of that column being calculated.
This script used for loops, and it was quite slow.
For this reason, I replaced the for loops with recursive functions.
The new script is around 100 times faster than the old one, which is good news. But I am now encountering a limit that I did not have before. As soon as I have more than 29952 rows in my dataset, I get the following error:
"Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)"
I made this little script with lists reflecting my problem :
If I increase the size of the lists (list_lenght) to more than 29952, the script crashes (on my computer)
import random
import sys

def list_generator(min_value, max_value, list_lenght):
    return [random.randrange(min_value,max_value) for i in range(list_lenght)]

def recursive_function(list_1, list_2, n, result):
    if n == len(list_1):
        return result
    elif list_1[n] <= list_2[n]:
        result.append(1 + result[n - 1])
    else:
        result.append(0)
    return recursive_function(list_1, list_2, (n + 1), result)

list_lenght = 29952 # How to increase this limit without generating an error?
min_value = 10
max_value = 20
list_one = list_generator(min_value, max_value, list_lenght)
list_two = list_generator(min_value, max_value, list_lenght)
# Set recursion limit
sys.setrecursionlimit(list_lenght * 2)
# Compute a new list from list_one and list_two
list_result = recursive_function(list_one, list_two, 1, [0])
I suspect a memory problem, but how do you take advantage of all the power of python's recursive functions while avoiding this limit as well as possible?
Thanks in advance
Following a comment from @trincot, here is a version of the code without a recursive function... which is ultimately faster than the recursive version above! And with it there are no more limits.
def no_recursive_function(list_1, list_2, n, result):
    if list_1[n] <= list_2[n]:
        return 1 + result[n - 1]
    else:
        return 0

list_lenght = 29952
min_value = 10
max_value = 20
list_one = list_generator(min_value, max_value, list_lenght)
list_two = list_generator(min_value, max_value, list_lenght)
# Set recursion limit
sys.setrecursionlimit(list_lenght * 2)
list_result_2 = [0]
for n in range(list_lenght - 1):
    result = no_recursive_function(list_one, list_two, n + 1, list_result_2)
    list_result_2.append(result)
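Some background on the crash: sys.setrecursionlimit only raises Python's own guard. Each Python-level call still consumes C stack, and the Python documentation warns that a limit set too high can crash the interpreter, which is consistent with the SIGSEGV seen here. Since the original data lives in a pandas dataframe, the same "running count that resets to 0" can probably also be computed without any Python-level loop. The sketch below assumes NumPy is available and that the inputs are non-empty; the function names are invented for this example.

```python
import numpy as np


def streak_loop(list_1, list_2):
    """Reference implementation: the plain loop from the post."""
    result = [0]
    for n in range(1, len(list_1)):
        result.append(1 + result[n - 1] if list_1[n] <= list_2[n] else 0)
    return result


def streak_vectorised(list_1, list_2):
    """Same result using cumulative sums instead of a Python loop."""
    cond = np.asarray(list_1) <= np.asarray(list_2)
    cond[0] = False  # the first result is fixed at 0, as in the post
    running = np.cumsum(cond)
    # Value of the running count at the most recent position where cond was False.
    last_reset = np.maximum.accumulate(np.where(cond, 0, running))
    return running - last_reset


rng = np.random.default_rng(0)
list_one = rng.integers(10, 20, size=100000)
list_two = rng.integers(10, 20, size=100000)
fast = streak_vectorised(list_one, list_two)
slow = streak_loop(list_one.tolist(), list_two.tolist())
print(fast[:10], slow[:10])
```

Both versions handle lists far longer than 29952 elements, because neither depends on the depth of the call stack.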

How do I reduce memory usage for deep reinforcement learning algorithms?

I wrote a DQN script to play BreakoutDeterministic and ran it on my school GPU server. However, it seems that the code is taking up 97% of the total RAM (more than 100GB)!
I would like to know which part of the script is responsible for this high RAM usage. I ran memory-profiler for 3 episodes on my laptop, and the memory requirement seems to increase linearly with each time step.
I wrote the script in PyCharm with Python 3.6. My laptop has 12GB of RAM and no GPU; the school server runs Ubuntu with a P100 GPU.
import gym
import numpy as np
import random
from collections import deque
from keras.layers import Dense, Input, Lambda, convolutional, core
from keras.models import Model
from keras.optimizers import Adam
import matplotlib.pyplot as plt
import os
import time as dt

plt.switch_backend('agg')

def preprocess(state):
    process_state = np.mean(state, axis=2).astype(np.uint8)
    process_state = process_state[::2, ::2]
    process_state_size = list(process_state.shape)
    process_state_size.append(1)
    process_state = np.reshape(process_state, process_state_size)
    return process_state

class DQNAgent:
    def __init__(self, env):
        self.env = env
        self.action_size = env.action_space.n
        self.state_size = self.select_state_size()
        self.memory = deque(maxlen=1000000) # specify memory size
        self.gamma = 0.99
        self.eps = 1.0
        self.eps_min = 0.01
        self.decay = 0.95
        self.lr = 0.00025
        self.start_life = 5 # get from environment
        self.tau = 0.125 # special since 2 models to be trained
        self.model = self.create_cnnmodel()
        self.target_model = self.create_cnnmodel()

    def select_state_size(self):
        process_state = preprocess(self.env.reset())
        state_size = process_state.shape
        return state_size

    def create_cnnmodel(self):
        data_input = Input(shape=self.state_size, name='data_input', dtype='int32')
        normalized = Lambda(lambda x: x/255)(data_input)
        conv1 = convolutional.Convolution2D(32, 8, strides=(4, 4), activation='relu')(normalized)
        conv2 = convolutional.Convolution2D(64, 4, strides=(2,2), activation='relu')(conv1)
        conv3 = convolutional.Convolution2D(64, 3, strides=(1,1), activation='relu')(conv2)
        conv_flatten = core.Flatten()(conv3) # flatten to feed cnn to fc
        h4 = Dense(512, activation='relu')(conv_flatten)
        prediction_output = Dense(self.action_size, name='prediction_output', activation='linear')(h4)
        model = Model(inputs=data_input, outputs=prediction_output)
        model.compile(optimizer=Adam(lr=self.lr),
                      loss='mean_squared_error') # 'mean_squared_error') keras.losses.logcosh(y_true, y_pred)
        return model

    def remember(self, state, action, reward, new_state, done): # store past experience as a pre-defined table
        self.memory.append([state, action, reward, new_state, done])

    def replay(self, batch_size):
        if batch_size > len(self.memory):
            return
        all_states = []
        all_targets = []
        samples = random.sample(self.memory, batch_size)
        for sample in samples:
            state, action, reward, new_state, done = sample
            target = self.target_model.predict(state)
            if done:
                target[0][action] = reward
            else:
                target[0][action] = reward + self.gamma*np.max(self.target_model.predict(new_state)[0])
            all_states.append(state)
            all_targets.append(target)
        history = self.model.fit(np.vstack(all_states), np.vstack(all_targets), epochs=1, verbose=0)
        return history

    def act(self, state):
        self.eps *= self.decay
        self.eps = max(self.eps_min, self.eps)
        if np.random.random() < self.eps:
            return self.env.action_space.sample()
        return np.argmax(self.model.predict(state)[0])

    def train_target(self):
        weights = self.model.get_weights()
        target_weights = self.target_model.get_weights()
        for i in range(len(target_weights)):
            target_weights[i] = (1-self.tau)*target_weights[i] + self.tau*weights[i]
        self.target_model.set_weights(target_weights) #
def main(episodes):
    env = gym.make('BreakoutDeterministic-v4')
    agent = DQNAgent(env)  # the post passed an undefined extra argument `cnn`
    time = env._max_episode_steps
    batch_size = 32
    save_model = 'y'
    filepath = os.getcwd()
    date = dt.strftime('%d%m%Y')
    clock = dt.strftime('%H.%M.%S')
    print('++ Training started on {} at {} ++'.format(date, clock))
    start_time = dt.time()
    tot_r = []
    tot_loss = []
    it_r = []
    it_loss = []
    tot_frames = 0
    for e in range(episodes):
        r = []
        loss = []
        state = env.reset()
        state = preprocess(state)
        state = state[None,:]
        current_life = agent.start_life
        for t in range(time):
            if rend_env == 'y':  # rend_env is never defined in the post
                pass  # the body of this branch is missing from the post
            action = agent.act(state)
            new_state, reward, terminal_life, life = env.step(action)
            new_state = preprocess(new_state)
            new_state = new_state[None,:]
            if life['ale.lives'] < current_life:
                reward = -1
            current_life = life['ale.lives']
            agent.remember(state, action, reward, new_state, terminal_life)
            hist = agent.replay(batch_size)
            agent.train_target()
            state = new_state
            r.append(reward)
            tot_frames += 1
            if hist is None:
                loss.append(0.0)
            else:
                loss.append(hist.history['loss'][0])
            if t%20 == 0:
                print('Frame : {}, Cum Reward = {}, Avg Loss = {}, Curr Life: {}'.format(
                    t, np.sum(r), round(np.mean(loss[-20:-1]),3), current_life))
                agent.model.save('{}/Mod_Fig/DQN_BO_model_{}.h5'.format(filepath, date))
                agent.model.save_weights('{}/Mod_Fig/DQN_BO_weights_{}.h5'.format(filepath, date))
            if current_life == 0 or terminal_life:
                print('Episode {} of {}, Cum Reward = {}, Avg Loss = {}'.format(e, episodes, np.sum(r), np.mean(loss)))
                break
        tot_r.append(np.sum(r))
        tot_loss.append(np.mean(loss))
        it_r.append(r)
        it_loss.append(loss)
    print('Training ended on {} at {}'.format(date, clock))
    run_time = dt.time() - start_time
    print('Total Training time: %d Hrs %d Mins %d s' % (
        run_time // 3600, (run_time % 3600) // 60, (run_time % 3600) % 60 // 1))
    if save_model == 'y':
        agent.model.save('{}/Mod_Fig/DQN_BO_finalmodel_{}_{}.h5'.format(filepath, date, clock))
        agent.model.save_weights('{}/Mod_Fig/DQN_BO_finalweights_{}_{}.h5'.format(filepath, date, clock))
    agent.model.summary()
    return tot_r, tot_loss, it_r, it_loss, tot_frames

if __name__ == '__main__':
    episodes = 3
    total_reward, total_loss, rewards_iter, loss_iter, frames_epi = main(episodes=episodes)
Would really appreciate your comments and help on writing memory and speed efficient deep RL codes! I hope to train my DQN on breakout for 5000 episodes but the remote server only allows maximum of 48 hours of training. Thanks in advance!
It sounds like you have a memory leak.
This line
agent.remember(state, action, reward, new_state, terminal_life)
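gets called up to 5000 * env._max_episode_steps times. Each raw frame is a (210, 160, 3) array; after preprocess it is a (105, 80, 1) uint8 array of 8,400 bytes, and every stored transition holds two of them (state and new_state). A full buffer of 1,000,000 transitions therefore needs roughly 17 GB for the frames alone, before any Python object overhead. The first thing to try would be to reduce the size of self.memory = deque(maxlen=1000000) # specify memory size to verify that this is the sole cause.
If you really believe you need that much capacity, you should dump self.memory to disk and keep only a small subsample in memory.
Additionally: sampling from a deque is very slow. A deque is implemented as a linked list of blocks, so indexing into the middle of it is O(N) and drawing M samples costs O(N*M). You should consider implementing your own ring buffer for self.memory.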
Alternatively: you might consider a probabilistic buffer (I don't know the proper name), where each time you would append to a full buffer, remove an element at random and append the new element. This means any (state, action, reward, ...) tuple that is encountered has a nonzero probability of being contained in the buffer, with recent tuples being more likely than older ones.
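A ring buffer along those lines could look roughly like the sketch below. It is only an illustration: the class name RingReplayBuffer and its methods are invented here, frames are assumed to be uint8 arrays of a fixed shape, and sampling is done with replacement for simplicity.

```python
import numpy as np


class RingReplayBuffer:
    """Fixed-size replay memory backed by preallocated NumPy arrays."""

    def __init__(self, capacity, frame_shape):
        self.capacity = capacity
        # All memory is allocated once, so usage stays constant during training.
        self.states = np.zeros((capacity,) + tuple(frame_shape), dtype=np.uint8)
        self.next_states = np.zeros((capacity,) + tuple(frame_shape), dtype=np.uint8)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=bool)
        self.position = 0  # index that the next transition is written to
        self.size = 0      # number of valid transitions stored so far

    def append(self, state, action, reward, next_state, done):
        i = self.position
        self.states[i] = state
        self.next_states[i] = next_state
        self.actions[i] = action
        self.rewards[i] = reward
        self.dones[i] = done
        # Wrap around and overwrite the oldest entry once the buffer is full.
        self.position = (i + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        idx = rng.integers(0, self.size, size=batch_size)  # O(1) per index
        return (self.states[idx], self.actions[idx], self.rewards[idx],
                self.next_states[idx], self.dones[idx])


buf = RingReplayBuffer(capacity=3, frame_shape=(2, 2))
for k in range(5):
    frame = np.full((2, 2), k, dtype=np.uint8)
    buf.append(frame, k, float(k), frame, False)
states, actions, rewards, next_states, dones = buf.sample(4)
print(buf.size, states.shape, states.dtype)
```

Because sample returns whole arrays, the target network can be called once per batch instead of once per transition.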
I had similar problems with memory and I still do.
The main cause of the large memory consumption are the states. But here's what I did to make it better:
Step 1: Resize them to 84 x 84 using OpenCV (some people downsample the images to 84 x 84 instead). This results in each state having the shape (84,84,3).
Step 2: Convert these frames to grayscale (basically, black and white). This should change the shape to (84,84,1).
Step 3: Use dtype=np.uint8 for storing states. They consume minimal memory and are perfect for the pixel intensity values ranged 0-255.
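To get a feeling for how much these steps save, the bytes per frame can be compared directly. This is just a back-of-the-envelope sketch: it uses a random array in place of a real Atari frame and plain stride-2 slicing (giving 105 x 80) instead of an OpenCV resize to 84 x 84, so the exact numbers are illustrative only.

```python
import numpy as np

# Random stand-in for one raw Atari frame.
frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)

gray_float = frame.mean(axis=2)              # grayscale, but float64
small_float = gray_float[::2, ::2]           # downsampled to (105, 80), still float64
small_uint8 = small_float.astype(np.uint8)   # same pixels stored as uint8

capacity = 1000000  # replay memory size used in the question
for name, arr in [('raw uint8', frame),
                  ('small float64', small_float),
                  ('small uint8', small_uint8)]:
    # Each transition stores two frames: state and new_state.
    total_gib = 2 * capacity * arr.nbytes / 2.0 ** 30
    print('{:14s} {:7d} bytes/frame, {:6.1f} GiB for a full buffer'.format(
        name, arr.nbytes, total_gib))
```

The dtype matters as much as the resolution: the same downsampled frame takes eight times more memory as float64 than as uint8.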
Additional Info
I run my code on free Google Colab notebooks (Tesla K80 GPU and 13GB RAM), periodically saving the replay buffer to my drive.
For steps 1 and 2, consider using the OpenAI Baselines Atari wrappers, as there is no point in reinventing the wheel.
You could also use this snippet to check the amount of RAM used by your own program at each step, like I did:
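import os
import psutil

def show_RAM_usage(self):
    # Written as a method of my own class; drop `self` to use it as a plain function.
    py = psutil.Process(os.getpid())
    # memory_info()[0] is the resident set size (RSS) in bytes.
    print('RAM usage: {} GB'.format(py.memory_info()[0]/2. ** 30))

This snippet is adapted from the original answer for use in my own program.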

Calculating the Cumulative Mean in Python

I am new to programming and Python. I made a simulation of an M/M/1 queue, ran it, and collected the results. I have 5000 outputs. Now I need to calculate the cumulative mean of the average delays for every 100 periods (1 to 100, 1 to 200, ... up to 1 to 5000).
#data 4 (delay time) set assign to list of numpy array
npdelaytime = np.array(data[4][0:5000])
#reshape the list of delay time 100 customer in each sample
npdelayreshape100 = np.reshape(npdelaytime, (-1,100))
#mean of this reshape matrix
meandelayreshape100 = np.mean(npdelayreshape100, axis=1)
cumsummdr100 = np.cumsum(meandelayreshape100)
a = range(1,51)
meancsmdr100 = cumsummdr100 / a
This is how I worked it out: first reshape the 5000 sample points into a 50 x 100 matrix, then take the mean of each row, and finally divide the cumulative sum of these means by the number of rows included so far.
My question: is there an easier way to do this?
What about replacing range by np.arange ?
Try:
meancsmdr100 = cumsummdr100 / np.arange(1,51)
def cum_mean(arr):
    # Accumulate as float so the division below is not truncated for integer input.
    cum_sum = np.cumsum(arr, axis=0, dtype=float)
    for i in range(cum_sum.shape[0]):
        if i == 0:
            continue
        print(cum_sum[i] / (i + 1))
        cum_sum[i] = cum_sum[i] / (i + 1)
    return cum_sum
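The reshape step can also be skipped altogether by taking the cumulative sum over all samples and reading it off at every 100th position. The helper below is a sketch with an invented name (cumulative_mean_every); it assumes a one-dimensional input whose length is at least step.

```python
import numpy as np


def cumulative_mean_every(values, step):
    """Mean of values[:step], values[:2*step], ... as a 1-D array."""
    values = np.asarray(values, dtype=float)
    counts = np.arange(step, values.size + 1, step)  # 100, 200, ..., 5000
    return np.cumsum(values)[counts - 1] / counts


# Random stand-in for the 5000 delay times from the simulation.
delays = np.random.default_rng(0).exponential(size=5000)
direct = cumulative_mean_every(delays, 100)

# The approach from the question, for comparison.
block_means = delays.reshape(-1, 100).mean(axis=1)
via_blocks = np.cumsum(block_means) / np.arange(1, 51)
print(direct[:3], via_blocks[:3])
```

The two results agree because every block has the same length, so the mean of the first k block means equals the mean of the first 100*k samples.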

Weighted moving average in python with different width in different regions

I was trying to take a moving average of highly oscillating data. The oscillations are not uniform; there are fewer oscillations in the initial region.
x = np.linspace(0, 1000, 1000001)
y = some oscillating data say, sin(x^2)
(The original data file is huge, so I can't upload it)
I want to take a weighted moving average of the function and plot it. Initially the period of the function is larger, so I want to average over a large interval there, while a smaller interval will do later on.
I have found a possible elegant solution in following post:
Weighted moving average in python
However, I want to have different width in different regions of x. Say when x is between (0,100) I want the width=0.6, while when x is between (101, 300) width=0.2 and so on.
This is what I have tried to implement (with my limited knowledge of programming!):
def weighted_moving_average(x,y,step_size=0.05):#change the width to control average
    bin_centers = np.arange(np.min(x),np.max(x)-0.5*step_size,step_size)+0.5*step_size
    bin_avg = np.zeros(len(bin_centers))
    #We're going to weight with a Gaussian function
    def gaussian(x,amp=1,mean=0,sigma=1):
        return amp*np.exp(-(x-mean)**2/(2*sigma**2))
    if x.any() < 100:
        for index in range(0,len(bin_centers)):
            bin_center = bin_centers[index]
            weights = gaussian(x,mean=bin_center,sigma=0.6)
            bin_avg[index] = np.average(y,weights=weights)
    else:
        for index in range(0,len(bin_centers)):
            bin_center = bin_centers[index]
            weights = gaussian(x,mean=bin_center,sigma=0.1)
            bin_avg[index] = np.average(y,weights=weights)
    return (bin_centers,bin_avg)
Needless to say, this is not working! The plot only ever uses the first value of sigma. Please help...
The following snippet should do more or less what you tried to do. You have mainly a logical problem in your code, x.any() < 100 will always be True, so you'll never execute the second part.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 1000)
y = np.sin(x**2)

def gaussian(x,amp=1,mean=0,sigma=1):
    return amp*np.exp(-(x-mean)**2/(2*sigma**2))

def weighted_average(x,y,step_size=0.3):
    weights = np.zeros_like(x)
    bin_centers = np.arange(np.min(x),np.max(x)-.5*step_size,step_size)+.5*step_size
    bin_avg = np.zeros_like(bin_centers)
    for i, center in enumerate(bin_centers):
        # Select the indices that should count to that bin
        idx = ((x >= center-.5*step_size) & (x <= center+.5*step_size))
        weights = gaussian(x[idx], mean=center, sigma=step_size)
        bin_avg[i] = np.average(y[idx], weights=weights)
    return (bin_centers,bin_avg)

idx = x <= 4
plt.plot(*weighted_average(x[idx],y[idx], step_size=0.6))
idx = x >= 3
plt.plot(*weighted_average(x[idx],y[idx], step_size=0.1))
plt.plot(x,y)
plt.legend(['0.6', '0.1', 'y'])
plt.show()
However, depending on the usage, you could also implement moving average directly:
x = np.linspace(0, 60, 1000)
y = np.sin(x**2)
z = np.zeros_like(x)
z[0] = y[0]  # start the average from the first data value
a = .2       # smoothing factor
for i, t in enumerate(x[1:]):
    z[i+1] = a*y[i+1] + (1-a)*z[i]
plt.plot(x,y)
plt.plot(x,z)
plt.legend(['data', 'moving average'])
plt.show()
Of course you could then change a adaptively, e.g. depending on the local variance. Also note that this has a priori a small bias depending on a and the step size in x.
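If the goal is a single curve whose Gaussian width depends on the region of x, one option is to choose sigma per bin center rather than with one global if. The sketch below assumes that approach; variable_width_average and sigma_of are names made up for this example, and the region boundary at 4 is arbitrary.

```python
import numpy as np


def gaussian(x, mean, sigma):
    return np.exp(-(x - mean) ** 2 / (2 * sigma ** 2))


def variable_width_average(x, y, sigma_of, step_size=0.05):
    """Gaussian-weighted average where sigma_of(center) sets the width per bin."""
    centers = np.arange(x.min(), x.max() - 0.5 * step_size, step_size) + 0.5 * step_size
    averages = np.empty_like(centers)
    for i, center in enumerate(centers):
        weights = gaussian(x, center, sigma_of(center))
        averages[i] = np.average(y, weights=weights)
    return centers, averages


def sigma_of(center):
    # Wide window where the oscillation is slow, narrow window later on.
    return 0.6 if center < 4 else 0.1


x = np.linspace(0, 10, 1000)
y = np.sin(x ** 2)
centers, averages = variable_width_average(x, y, sigma_of, step_size=0.5)
print(centers[:3], averages[:3])
```

This avoids the x.any() < 100 pitfall entirely: x.any() collapses the whole array to a single boolean, whereas here the width is decided separately for each bin.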

Incremental PCA

I've never used the incremental PCA that exists in sklearn, and I'm a bit confused about its parameters and unable to find a good explanation of them.
I see that there is batch_size in the constructor, but when using the partial_fit method you can again pass only a part of your data. I've found the following way:
n = df.shape[0]
chunk_size = 100000
iterations = n//chunk_size
ipca = IncrementalPCA(n_components=40, batch_size=1000)
for i in range(0, iterations):
    ipca.partial_fit(df[i*chunk_size : (i+1)*chunk_size].values)
ipca.partial_fit(df[iterations*chunk_size : n].values)
Now, what I don't understand is the following - when using partial fit, does the batch_size play any role at all, or not? And how are they related?
Moreover, if both are considered, how should I change their values properly, when wanting to increase the precision while increasing memory footprint (and the other way around, decrease the memory consumption for the price of decreased accuracy)?
The docs say:
batch_size : int or None, (default=None)
The number of samples to use for each batch. Only used when calling fit...
This param is not used within partial_fit, where the batch-size is controlled by the user.
Bigger batches will increase memory-consumption, smaller ones will decrease it.
This is also written in the docs:
This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory.
Despite some checks and parameter-heuristics, the whole fit-function looks like this:
for batch in gen_batches(n_samples, self.batch_size_):
    self.partial_fit(X[batch], check_input=False)
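The relationship can be checked empirically. The sketch below assumes scikit-learn and NumPy are installed and uses random data whose length is a multiple of the batch size; it fits once with fit and batch_size, and once by calling partial_fit manually on chunks of the same size.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X = rng.standard_normal((1000, 5))
batch_size = 100

# fit() splits X into batches of batch_size internally.
ipca_fit = IncrementalPCA(n_components=2, batch_size=batch_size).fit(X)

# Here the caller decides the chunk size; batch_size is not needed.
ipca_manual = IncrementalPCA(n_components=2)
for start in range(0, X.shape[0], batch_size):
    ipca_manual.partial_fit(X[start:start + batch_size])

same = np.allclose(ipca_fit.components_, ipca_manual.components_)
print(same)
```

So when you drive partial_fit yourself, the chunk size you pass plays the role that batch_size plays inside fit, and it is the knob that trades memory for accuracy.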
Here is some incremental PCA code based on https://github.com/kevinhughes27/pyIPCA, which is an implementation of the CCIPCA method.
import numpy as np
from scipy import linalg as la

class CCIPCA:
    def __init__(self, n_components, n_features, amnesic=2.0, copy=True):
        self.n_components = n_components
        self.n_features = n_features
        self.copy = copy
        self.amnesic = amnesic
        self.iteration = 0
        # np.float was removed from recent NumPy releases; the builtin float is equivalent
        self.mean_ = np.zeros([self.n_features], float)
        self.components_ = np.ones((self.n_components,self.n_features)) / \
            (self.n_features*self.n_components)

    def partial_fit(self, u):
        n = float(self.iteration)
        V = self.components_
        # amnesic learning params
        if n <= int(self.amnesic):
            w1 = float(n+2-1)/float(n+2)
            w2 = float(1)/float(n+2)
        else:
            w1 = float(n+2-self.amnesic)/float(n+2)
            w2 = float(1+self.amnesic)/float(n+2)
        # update mean
        self.mean_ = w1*self.mean_ + w2*u
        # mean center u
        u = u - self.mean_
        # update components
        for j in range(0,self.n_components):
            if j > n:
                pass
            elif j == n:
                V[j,:] = u
            else:
                # update the components
                V[j,:] = w1*V[j,:] + w2*np.dot(u,V[j,:])*u / la.norm(V[j,:])
                normedV = V[j,:] / la.norm(V[j,:])
                normedV = normedV.reshape((self.n_features, 1))
                # remove the part of u explained by this component
                u = u - np.dot(np.dot(u,normedV),normedV.T)
        self.iteration += 1
        self.components_ = V / la.norm(V)
        return

    def post_process(self):
        self.explained_variance_ratio_ = np.sqrt(np.sum(self.components_**2,axis=1))
        idx = np.argsort(-self.explained_variance_ratio_)
        self.explained_variance_ratio_ = self.explained_variance_ratio_[idx]
        self.components_ = self.components_[idx,:]
        self.explained_variance_ratio_ = (self.explained_variance_ratio_ / \
            self.explained_variance_ratio_.sum())
        # normalise every component to unit length
        for r in range(0,self.components_.shape[0]):
            d = np.sqrt(np.dot(self.components_[r,:],self.components_[r,:]))
            self.components_[r,:] /= d
You can test it with
import numpy as np
import pandas as pd, ccipca

df = pd.read_csv('iris.csv')
df = np.array(df)[:,:4].astype(float)
pca = ccipca.CCIPCA(n_components=2,n_features=4)
S = 10
print(df[0, :])
for i in range(150):
    pca.partial_fit(df[i, :])
pca.post_process()
The resulting eigenvectors and eigenvalues will not be exactly the same as those from batch PCA. The results are approximate, but they are useful.
