I want to extract the VGG features of a set of images and keep them in memory in a dictionary. The dictionary ends up holding 8091 tensors each of shape (1,4096), but my machine crashes with an out of memory error after about 6% of the way. Does anybody have a clue why this is happening and how to prevent it?
In fact, this seems to be triggered by the call to VGG rather than by the stored tensors themselves, since merely keeping the VGG classification output is enough to trigger the error.
Below is the simplest code I've found that reproduces the error. First, a helper function is defined:
import torch, torchvision
from tqdm import tqdm

vgg = torchvision.models.vgg16(weights='DEFAULT')

def try_and_crash(gen_data):
    store_out = {}
    for i in tqdm(range(8091)):
        my_output = gen_data(torch.randn(1, 3, 224, 224))
        store_out[i] = my_output
    return store_out
Calling it to quickly produce a large tensor doesn't cause a fuss
just_fine = try_and_crash(lambda x: torch.randn(1,4096))
but calling it to use vgg causes the machine to crash:
will_crash = try_and_crash(vgg)
The problem is that each element of the dictionary store_out[i] also keeps a reference to the autograd graph that produced it, so it ends up being much larger than a plain 1x4096 tensor.
Running the code with torch.no_grad(), or equivalently with torch.set_grad_enabled(False), solves the issue. We can test it by slightly changing the helper function:
def try_and_crash_grad(gen_data, grad_enabled):
    store_out = {}
    for i in tqdm(range(8091)):
        with torch.set_grad_enabled(grad_enabled):
            my_output = gen_data(torch.randn(1, 3, 224, 224))
            store_out[i] = my_output
    return store_out
Now the following works
works_fine = try_and_crash_grad(vgg, False)
while the following throws an out of memory error
crashes = try_and_crash_grad(vgg, True)
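If gradients are still needed elsewhere and only the stored copies should be graph-free, an equivalent fix is to detach the output before storing it; detach() returns a tensor that no longer references the autograd graph, so the graph becomes unreachable and is freed after each iteration. A minimal sketch of that variant:

import torch, torchvision
from tqdm import tqdm

vgg = torchvision.models.vgg16(weights='DEFAULT')

store_out = {}
for i in tqdm(range(8091)):
    # detach() cuts the stored tensor off from the graph, so only the
    # 1x1000 classification output is kept in memory
    store_out[i] = vgg(torch.randn(1, 3, 224, 224)).detach()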
Related
I am currently developing some code that deals with big multidimensional arrays. Of course, Python gets very slow if you try to perform these computations serially. Therefore, I got into code parallelization, and one of the possible solutions I found involves the multiprocessing library.
What I have come up with so far is first dividing the big array into smaller chunks and then doing some operation on each of those chunks in parallel, using a Pool of workers from multiprocessing. For that to be efficient, and based on this answer, I believe I should use a shared-memory array object defined as a global variable, to avoid copying it every time a process from the pool is called.
Here I add some minimal example of what I'm trying to do, to illustrate the issue:
import numpy as np
from functools import partial
import multiprocessing as mp
import ctypes

class Trials:
    # Perform computation along first dimension of shared array, representing the chunks
    def Compute(i, shared_array):
        shared_array[i] = shared_array[i] + 2

    # The function you actually call
    def DoSomething(self):
        # Initializer function for Pool, should define the global variable shared_array
        # I have also tried putting this function outside DoSomething, as a part of the class,
        # with the same results
        def initialize(base, State):
            global shared_array
            shared_array = np.ctypeslib.as_array(base.get_obj()).reshape(125, 100, 100) + State

        base = mp.Array(ctypes.c_float, 125*100*100)  # Create base array
        state = np.random.rand(125, 100, 100)         # Create seed

        # Initialize pool of workers and perform calculations
        with mp.Pool(processes=10,
                     initializer=initialize,
                     initargs=(base, state,)) as pool:
            run = partial(self.Compute,
                          shared_array=shared_array)  # Here the error says that shared_array is not defined
            pool.map(run, np.arange(125))
            pool.close()
            pool.join()
        print(shared_array)

if __name__ == '__main__':
    Trials = Trials()
    Trials.DoSomething()
The trouble I am encountering is that when I define the partial function, I get the following error:
NameError: name 'shared_array' is not defined
From what I understand, this means that I cannot access the global variable shared_array. I'm sure the initialize function is executing, since putting a print statement inside it produces output in the terminal.
What am I doing incorrectly, and is there any way to solve this issue?
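One way to sidestep the NameError, sketched below under the assumption that the workers are meant to read the global directly: shared_array only exists inside the worker processes (the initializer runs there, not in the parent), so instead of passing it through partial, let the worker function reference the global itself and map only the chunk indices.

import numpy as np
import multiprocessing as mp
import ctypes

def initialize(base, state):
    # Runs once per worker process; the global is defined there, not in the parent
    global shared_array
    shared_array = np.ctypeslib.as_array(base.get_obj()).reshape(125, 100, 100) + state

def compute(i):
    # Reference the worker-local global instead of receiving it as an argument
    shared_array[i] = shared_array[i] + 2

if __name__ == '__main__':
    base = mp.Array(ctypes.c_float, 125 * 100 * 100)
    state = np.random.rand(125, 100, 100)
    with mp.Pool(processes=10, initializer=initialize, initargs=(base, state)) as pool:
        pool.map(compute, np.arange(125))
    # Note: the "+ state" above makes a private copy per worker, so these writes do not
    # land in the shared buffer; drop the "+ state" (or write it into the buffer first)
    # if the parent needs to see the results.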
I have a function func that takes on average 9 seconds to run. But when I try to use multiprocessing to parallelize it (even using torch.multiprocessing), each inference takes on average 20 seconds. Why is that?
func is an inference function which takes in a patient_name and runs a torch model in inference on that patient's data.
device = torch.device('cpu')

def func(patient_name):
    data = np.load(my_dict[patient_name]['data_path'])
    model_state = torch.load(my_dict[patient_name]['model_state_path'], map_location='cpu')
    model = my_net(my_dict[patient_name]['HPs'])
    model = model.to(device)
    model.load_state_dict(model_state)
    model.eval()
    result = model(torch.FloatTensor(data).to(device))
    return result

from torch.multiprocessing import Pool

core_cnt = 10
pool = Pool(core_cnt)
out = pool.starmap(func, pool_args)
Please check whether the inference on your model architecture, with the data that you provide, already uses substantial computational power. That would make the OS process scheduler switch between the processes, which takes additional time.
Also, you load the model every time inside your function. It is always faster to copy a data object between processes than to read it from disk (that is the strategy with plain multiprocessing), or to share the model altogether with torch.multiprocessing.
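A hedged sketch of that second point, assuming a single model shared by all patients (my_net, my_dict and patient_names are taken from or assumed to match the question; model_state_path and hyper_params are placeholders): load the model once in the parent, move its parameters to shared memory, and let each call only read the patient's data. This relies on a fork-based start method, where the workers inherit the already-loaded model.

import numpy as np
import torch
from torch.multiprocessing import Pool

model = my_net(hyper_params)                       # hyper_params: placeholder for the HPs
model.load_state_dict(torch.load(model_state_path, map_location='cpu'))
model.eval()
model.share_memory()                               # put the parameters into shared memory

def func(patient_name):
    data = np.load(my_dict[patient_name]['data_path'])
    with torch.no_grad():                          # inference only, no autograd graph
        return model(torch.FloatTensor(data))

if __name__ == '__main__':
    with Pool(processes=10) as pool:
        out = pool.map(func, patient_names)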
I'm trying to build a PyTorch project on an IterableDataset with zarr as the storage backend.
import zarr
from itertools import islice
from torch.utils.data import IterableDataset

class Data(IterableDataset):
    def __init__(self, path, start=None, end=None):
        super(Data, self).__init__()
        store = zarr.DirectoryStore(path)
        self.array = zarr.open(store, mode='r')
        if start is None:
            start = 0
        if end is None:
            end = self.array.shape[0]
        assert end > start
        self.start = start
        self.end = end

    def __iter__(self):
        return islice(self.array, self.start, self.end)
This works quite nicely with small test datasets, but once I move to my actual dataset (480,000,000 x 290) I run into a memory leak. I've tried logging the Python heap periodically as everything slows to a crawl, but I couldn't see anything growing abnormally in size, so the library I used (pympler) didn't actually catch the leak.
I'm kind of at my wits end, so if anybody has any idea how to further debug this, it would be greatly appreciated.
Cross-posted on PyTorch Forums.
Turns out that I had an issue in my validation routine:
with torch.no_grad():
    for batch in tqdm(testloader, **params):
        x = batch[:, 1:].to(device)
        y = batch[:, 0].unsqueeze(0).T
        y_test_pred = torch.sigmoid(sxnet(x))
        y_pred_tag = torch.round(y_test_pred)
        y_pred_list.append(y_pred_tag.cpu().numpy())
        y_list.append(y.numpy())
I originally thought I was well clear of running into trouble by appending my results to lists, but the issue is that the result of .numpy() was an array of arrays (since the original datatype was a 1xn tensor).
Adding .flatten() on the numpy arrays has fixed this issue and the RAM consumption is now as I originally provisioned.
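For reference, the fixed append calls simply flatten the arrays before storing them:

y_pred_list.append(y_pred_tag.cpu().numpy().flatten())
y_list.append(y.numpy().flatten())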
I am trying to solve a stochastic problem. The problem runs for 24 steps, and in each step I solve it for 450 different instances.
It is possible to solve the problem serially, but that takes a lot of time, so I wanted to solve those 450 instances in parallel.
I approached this in 2 ways:
using solver_manager and multiple Pyro MIP servers. Here I just have to queue the instances and the solver manager assigns the problems to the Pyro MIP servers. But the Pyro MIP servers occupy a lot of memory and retain it even after the final result is computed.
using opt solver.
a) I can solve using the opt solver directly, but it is slow.
b) using multithreading. I tried to create a ThreadPoolExecutor with 5 threads and then solve those 450 instances in different threads. But just shifting to multithreading causes Pyomo to throw errors.
def optimize(self):
    optsolver = SolverFactory(self.solver_name)
    value = self.initialize_all_values()
    for timestep in range(24):
        instance_list = self.create_instance_list(value)
        futures = {}
        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
            for instance_object in instance_list:
                futures = {executor.submit(self.thread_solver, instance_object["instance"], optsolver, timestep)}
            for future in concurrent.futures.as_completed(futures):
                try:
                    result = future.result()
                    value.update(result)
                except Exception as e:
                    self.logger.error(e)

def thread_solver(self, instance, optsolver, timestep):
    result = optsolver.solve(instance)
    if (result.solver.status == SolverStatus.ok) and (
            result.solver.termination_condition == TerminationCondition.optimal):
        instance.solutions.load_from(result)
        my_dict = {}
        for v in instance.component_objects(Var, active=True):
            varobject = getattr(instance, str(v))
            var_list = []
            try:
                for index in varobject:
                    var_list.append(varobject[index].value)
                my_dict[str(v)] = var_list
            except Exception as e:
                self.logger.error("error reading result " + str(e))
        value = {key: my_dict[k][0]}
        return value
    elif result.solver.termination_condition == TerminationCondition.infeasible:
        self.logger.info("Termination condition is infeasible")
        return {}
    else:
        self.logger.info("Nothing fits")
        return {}
I get the following errors/outputs:
Nothing fits
Solver failed to locate input problem file: /usr/src/app/temp/pyomo/tmpn23yhmz_.pyomo.lp
I have set the pyomo temp file location to "/usr/src/app/temp/pyomo"
The same code runs perfectly fine without multithreading.
Is there any reason why Pyomo fails to solve the problem?
UPDATE 1:
I tried using a mutex around the call to optsolver.solve(instance). This reduced the above errors but did not completely eliminate them. Whenever such an error occurs, I try to solve the instance again, and on the second attempt it is solved. So multithreading does work, but I don't know why the errors still occur.
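For what it's worth, one common workaround (an assumption on my side, not something stated in the original post) is to give every thread its own solver object, so the shell-based solver interfaces never share temporary problem files. A self-contained toy sketch, assuming GLPK is installed locally:

import concurrent.futures
from pyomo.environ import (ConcreteModel, Var, Objective, Constraint,
                           NonNegativeReals, SolverFactory, value)

def build_instance(c):
    # Tiny stand-in for the real stochastic instance
    m = ConcreteModel()
    m.x = Var(within=NonNegativeReals)
    m.limit = Constraint(expr=m.x >= c)
    m.obj = Objective(expr=2 * m.x)
    return m

def solve_one(instance):
    solver = SolverFactory('glpk')   # fresh solver per call, nothing shared between threads
    solver.solve(instance)
    return value(instance.x)

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(solve_one, (build_instance(c) for c in range(450))))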
I'm trying to leverage concurrent.futures.ProcessPoolExecutor in Python3 to process a large matrix in parallel. The general structure of the code is:
class X(object):
    self.matrix

    def f(self, i, row_i):
        <cpu-bound process>

    def fetch_multiple(self, ids):
        with ProcessPoolExecutor() as executor:
            futures = [executor.submit(self.f, i, self.matrix.getrow(i)) for i in ids]
            return [f.result() for f in as_completed(futures)]
self.matrix is a large scipy csr_matrix. f is my concurrent function that takes a row of self.matrix and applies a CPU-bound process to it. Finally, fetch_multiple is a function that runs multiple instances of f in parallel and returns the results.
The problem is that after running the script, all CPU cores are less than 50% busy.
Why aren't all cores busy?
I think the problem is the large self.matrix object and the cost of passing row vectors between processes. How can I solve this problem?
Yes, that is likely the problem.
The overhead should not be that big, but it is likely the cause of your CPUs appearing idle (although they should be busy passing the data around anyway).
Try the recipe here to pass a "pointer" to the object to the subprocesses using shared memory:
http://briansimulator.org/sharing-numpy-arrays-between-processes/
Quoting from there:
from multiprocessing import sharedctypes
size = S.size
shape = S.shape
S.shape = size
S_ctypes = sharedctypes.RawArray('d', S)
S = numpy.frombuffer(S_ctypes, dtype=numpy.float64, count=size)
S.shape = shape
Now we can send S_ctypes and shape to a child process in multiprocessing, and convert it back to a numpy array in the child process as follows:
from numpy import ctypeslib
S = ctypeslib.as_array(S_ctypes)
S.shape = shape
It can be tricky to take care of reference counting, but I suppose numpy.ctypeslib takes care of that. So, just coordinate the passing of the actual row numbers to the sub-processes so that they don't work on the same data.
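To make the quoted recipe concrete in the shape of the original question, here is a minimal sketch (assuming a dense numpy matrix; a scipy csr_matrix would need its data, indices and indptr arrays shared separately). The matrix is copied once into a RawArray before the pool starts, and the workers receive only row indices; process_row is a hypothetical stand-in for the real CPU-bound function.

import numpy as np
from multiprocessing import Pool, sharedctypes

def init_worker(raw, shape):
    # Rebuild a numpy view onto the shared buffer inside each worker (no copy)
    global matrix
    matrix = np.ctypeslib.as_array(raw).reshape(shape)

def process_row(i):
    return i, matrix[i].sum()          # placeholder for the CPU-bound work on row i

if __name__ == '__main__':
    data = np.random.rand(1000, 500)
    raw = sharedctypes.RawArray('d', data.size)
    np.ctypeslib.as_array(raw).reshape(data.shape)[:] = data   # copy once into shared memory
    with Pool(processes=4, initializer=init_worker, initargs=(raw, data.shape)) as pool:
        results = pool.map(process_row, range(data.shape[0]))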