A confession first: I'm a novice programmer who does occasional scripting. I've been trying to work out the memory consumption of this simple piece of code, without success, and searching the answered questions hasn't helped either. I'm fetching some JSON data over a REST API, and the piece of code below ends up consuming a lot of RAM. Windows Task Manager shows the memory consumption increasing with each iteration of the loop. I'm overwriting the same variable on each API call, so I would expect the previous response to be replaced.
while Flag == True:
urlpart= 'data/device/statistics/approutestatsstatistics?scrollId='+varScrollId
response = json.loads(obj1.get_request(urlpart))
lstDataList = lstDataList + response['data']
Flag = response['pageInfo']['hasMoreData']
varScrollId = response['pageInfo']['scrollId']
count += 1
print("Fetched {} records out of {}".format(len(lstDataList), recordCount))
print('Size of List is now {}'.format(str(sys.getsizeof(lstDataList))))
return lstDataList
I tried to profile memory usage with memory_profiler... here's what it shows:
Line #    Mem usage    Increment   Line Contents
================================================
    92  119.348 MiB    0.000 MiB   count = 0
    93  806.938 MiB    0.000 MiB   while Flag == True:
    94  806.938 MiB    0.000 MiB   urlpart= 'data/device/statistics/approutestatsstatistics?scrollId='+varScrollId
    95  807.559 MiB   30.293 MiB   response = json.loads(obj1.get_request(urlpart))
    96  806.859 MiB    0.000 MiB   print('Size of response within the loop is {}'.format(sys.getsizeof(response)))
    97  806.938 MiB    1.070 MiB   lstDataList = lstDataList + response['data']
    98  806.938 MiB    0.000 MiB   Flag = response['pageInfo']['hasMoreData']
    99  806.938 MiB    0.000 MiB   varScrollId = response['pageInfo']['scrollId']
   100  806.938 MiB    0.000 MiB   count += 1
   101  806.938 MiB    0.000 MiB   print("Fetched {} records out of {}".format(len(lstDataList), recordCount))
   102  806.938 MiB    0.000 MiB   print('Size of List is now {}'.format(str(sys.getsizeof(lstDataList))))
   103                             return lstDataList
obj1 is an object of Cisco's rest_api_lib class. Link to code here
In fact the program ends up consuming ~1.6 GB of RAM. The data I'm fetching has roughly 570K records, and the API limits responses to 10K records at a time, so the loop runs ~56 times. Line 95 of the code consumes ~30 MB of RAM per the memory_profiler output. It's as if each iteration consumes 30 MB, ending up at ~1.6 GB, which is in the same ballpark. I can't figure out why the memory consumption keeps accumulating across the loop.
Thanks.
I would suspect it is the line lstDataList = lstDataList + response['data']
This is accumulating response['data'] over time. Also, your indentation seems off; should it be:
while Flag == True:
    urlpart = 'data/device/statistics/approutestatsstatistics?scrollId=' + varScrollId
    response = json.loads(obj1.get_request(urlpart))
    lstDataList = lstDataList + response['data']
    Flag = response['pageInfo']['hasMoreData']
    varScrollId = response['pageInfo']['scrollId']
    count += 1
    print("Fetched {} records out of {}".format(len(lstDataList), recordCount))
    print('Size of List is now {}'.format(str(sys.getsizeof(lstDataList))))
return lstDataList
As far as I can tell, lstDataList will keep growing with each request, leading to the memory increase. Hope that helps, Happy Friday!
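If you don't actually need all ~570K records in memory at once, one way around that growth is to write each page out as it arrives instead of concatenating it onto lstDataList. A rough sketch only, assuming the same obj1.get_request API and response layout as in the question (fetch_to_file and records.jsonl are made-up names):

import json

def fetch_to_file(obj1, scroll_id, out_path="records.jsonl"):
    """Stream each page of results to a JSON-lines file instead of holding them all in RAM."""
    has_more = True
    total = 0
    with open(out_path, "w") as fh:
        while has_more:
            urlpart = 'data/device/statistics/approutestatsstatistics?scrollId=' + scroll_id
            response = json.loads(obj1.get_request(urlpart))
            # Write the page out immediately; nothing accumulates in memory.
            for record in response['data']:
                fh.write(json.dumps(record) + "\n")
            total += len(response['data'])
            has_more = response['pageInfo']['hasMoreData']
            scroll_id = response['pageInfo']['scrollId']
            print("Fetched {} records so far".format(total))
    return out_path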
it's as if each iteration consumes 30M
That is exactly what is happening. You need to free memory that you don't need any more, for example once you have extracted the data from response. You can delete it like so:
del response
See the Python documentation for more on del and on garbage collection.
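For example, folding that into the loop from the question might look like this (just a sketch: obj1, the endpoint, and the response layout are assumed to be exactly as described above, and fetch_all is a made-up wrapper name):

import json

def fetch_all(obj1, scroll_id):
    """Same loop as in the question, but dropping the parsed response each iteration."""
    records = []
    has_more = True
    while has_more:
        urlpart = 'data/device/statistics/approutestatsstatistics?scrollId=' + scroll_id
        response = json.loads(obj1.get_request(urlpart))
        records = records + response['data']  # same accumulation as in the question
        has_more = response['pageInfo']['hasMoreData']
        scroll_id = response['pageInfo']['scrollId']
        del response  # nothing below needs the parsed page any more
    return records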
I am using Ray to parallelize some computations, but it seems to be accumulating spillage.
I don't mind it spilling objects to my hard drive, but I do mind if that means using more than 130 GiB to process about 1.6 GiB of simulations.
Below is a trace of what is happening:
Number of steps: 55 (9,091 simulations each)
0%
(raylet) Spilled 3702 MiB, 12 objects, write throughput 661 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet) Spilled 5542 MiB, 17 objects, write throughput 737 MiB/s.
2%
(raylet) Spilled 9883 MiB, 33 objects, write throughput 849 MiB/s.
5%
(raylet) Spilled 16704 MiB, 58 objects, write throughput 997 MiB/s.
13%
(raylet) Spilled 32903 MiB, 124 objects, write throughput 784 MiB/s.
29%
(raylet) Spilled 66027 MiB, 268 objects, write throughput 661 MiB/s.
53%
(raylet) Spilled 131920 MiB, 524 objects, write throughput 461 MiB/s.
60%
And here is the code I am running:
def get_res_parallel(simulations, num_loads=num_cpus):
    load_size = simulations.shape[0] / num_loads
    simulations_per_load = [simulations[round(n * load_size): round((n+1) * load_size)]
                            for n in range(num_loads)]
    # 2D numpy arrays
    results = ray.get([get_res_ray.remote(simulations=simulations)
                       for simulations in simulations_per_load])
    return np.vstack(results)

MAX_RAM = 6 * 2**30  # 6 GiB

def get_expected_res(simulations, MAX_RAM=MAX_RAM):
    expected_result = np.zeros(shape=87_381, dtype=np.float64)
    bytes_per_res = len(expected_result) * (64 // 8)
    num_steps = simulations.shape[0] * bytes_per_res // MAX_RAM + 1
    step_size = simulations.shape[0] / num_steps
    print(f"Number of steps: {num_steps} ({step_size:,.0f} simulations each)")
    for n in range(num_steps):
        print(f"\r{n / num_steps:.0%}", end="")
        step_simulations = simulations[round(n * step_size): round((n+1) * step_size)]
        results = get_res_parallel(simulations=step_simulations)
        expected_result += results.mean(axis=0)
    print(f"\r100%")
    return expected_result / num_steps
Running on a Mac M1 with 16 GiB of RAM, Ray 2.0.0 and Python 3.9.13.
Question
Given my code, is it normal behavior?
What can I do to resolve this problem? Force garbage collection?
Do you know the expected size of the array returned by get_res_ray?
Ray will spill objects returned by remote tasks as well as objects passed to remote tasks, so in this case there are two possible places that can cause memory pressure:
1. The ObjectRefs returned by get_res_ray.remote.
2. The simulations passed to get_res_ray.remote. Since these are large, Ray will automatically put them in the local object store to reduce the size of the task definition.
It may be expected to spill if the size of these objects combined is greater than 30% of the RAM on your machine (this is the default size of Ray's object store). It's not suggested to increase the size of the object store, since this can cause memory pressure on the functions instead.
But you can try to process fewer things in each iteration and/or release ObjectRefs sooner. In particular, you should try to release the results from the previous iteration as soon as possible, so that Ray can GC the objects for you. You can do this by calling del results once you're done using them.
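For example, the question's get_expected_res could drop its per-step results explicitly at the end of each iteration (a sketch only; get_res_parallel is assumed to be the version from the question, and everything else is unchanged):

import numpy as np

MAX_RAM = 6 * 2**30  # 6 GiB

def get_expected_res(simulations, MAX_RAM=MAX_RAM):
    expected_result = np.zeros(shape=87_381, dtype=np.float64)
    bytes_per_res = len(expected_result) * (64 // 8)
    num_steps = simulations.shape[0] * bytes_per_res // MAX_RAM + 1
    step_size = simulations.shape[0] / num_steps
    for n in range(num_steps):
        step_simulations = simulations[round(n * step_size): round((n + 1) * step_size)]
        results = get_res_parallel(simulations=step_simulations)  # the question's helper, unchanged
        expected_result += results.mean(axis=0)
        del results  # release the stacked result arrays before fetching the next batch
    return expected_result / num_steps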
Here's a full suggestion that will do the same thing by feeding the array results into another task instead of getting them on the driver. This is usually a better approach because it avoids adding memory pressure on the driver and you're less likely to be accidentally pinning results in the driver's memory.
@ray.remote
def mean(*arrays):
    return np.vstack(arrays).mean(axis=0)

def get_res_parallel(simulations, num_loads=num_cpus):
    load_size = simulations.shape[0] / num_loads
    simulations_per_load = [simulations[round(n * load_size): round((n+1) * load_size)]
                            for n in range(num_loads)]
    # 2D numpy arrays
    # Use the * syntax in Python to unpack the ObjectRefs as function arguments.
    result = mean.remote(*[get_res_ray.remote(simulations=simulations)
                           for simulations in simulations_per_load])
    # We never have the result arrays stored in driver's memory.
    return ray.get(result)
MAX_RAM = 6 * 2**30  # 6 GiB

def get_expected_res(simulations, MAX_RAM=MAX_RAM):
    expected_result = np.zeros(shape=87_381, dtype=np.float64)
    bytes_per_res = len(expected_result) * (64 // 8)
    num_steps = simulations.shape[0] * bytes_per_res // MAX_RAM + 1
    step_size = simulations.shape[0] / num_steps
    print(f"Number of steps: {num_steps} ({step_size:,.0f} simulations each)")
    for n in range(num_steps):
        print(f"\r{n / num_steps:.0%}", end="")
        step_simulations = simulations[round(n * step_size): round((n+1) * step_size)]
        expected_result += get_res_parallel(simulations=step_simulations)
    print(f"\r100%")
    return expected_result / num_steps
I am trying to run this code from fastai
from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'
def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224), num_workers=0)
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
I get the following error
RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0;
4.00 GiB total capacity; 2.20 GiB already allocated; 6.20 MiB free; 2.23 GiB reserved in total by PyTorch)
I also tried running
import torch
torch.cuda.empty_cache()
and restarting the kernel, but neither helped.
Any help would be appreciated
The default batch_size used by ImageDataLoaders.from_name_func is 64, according to the documentation here. Reducing it should solve your problem: pass an extra parameter such as bs=32 (or any other smaller value) to ImageDataLoaders.from_name_func until the error is no longer thrown.
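For example, the call from the question with a smaller, illustrative batch size passed through bs (tune the value to your GPU):

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224), num_workers=0,
    bs=32)  # smaller batch size so the activations fit into the 4 GiB GPU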
I'm learning about threads and how they interact with Node's native cluster module. I saw some behavior I can't explain that I'd like some help understanding.
My code:
process.env.UV_THREADPOOL_SIZE = 1;
const cluster = require('cluster');

if (cluster.isMaster) {
  cluster.fork();
} else {
  const crypto = require('crypto');
  const express = require('express');
  const app = express();

  app.get('/', (req, res) => {
    crypto.pbkdf2('a', 'b', 100000, 512, 'sha512', () => {
      res.send('Hi there');
    });
  });

  app.listen(3000);
}
I benchmarked this code with one request using ApacheBench.
ab -c 1 -n 1 localhost:3000/ yielded these connection times:
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 0
Processing: 605 605 0.0 605 605
Waiting: 605 605 0.0 605 605
Total: 605 605 0.0 605 605
So far so good. I then ran ab -c 2 -n 2 localhost:3000/ (doubling the number of calls from the baseline). I expected the total time to double, since I had limited the libuv thread pool to one thread per child process and only started one child process. But nothing really changed. Here are those results:
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 0
Processing: 608 610 3.2 612 612
Waiting: 607 610 3.2 612 612
Total: 608 610 3.3 612 612
For extra info, when I further increase the number of calls with ab -c 3 -n 3 localhost:3000/, I start to see a slowdown.
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 0
Processing: 599 814 352.5 922 1221
Waiting: 599 814 352.5 922 1221
Total: 599 815 352.5 922 1221
I'm running all this on a quad-core Mac using Node v14.13.1.
tl;dr: how did my benchmark not use up all my threads? I forked one child process with one thread in its libuv pool, so the single call in my baseline benchmark should have been all it could handle without taking longer. And yet the second test (the one that doubled the number of calls) took the same amount of time as the baseline.
I have to serialize the result of tracemalloc to a JSON string.
current_mem, peak_mem = tracemalloc.get_traced_memory()
overhead = tracemalloc.get_tracemalloc_memory()
stats = tracemalloc.take_snapshot().statistics('traceback')[:top]
summary = "traced memory: %d KiB peak: %d KiB overhead: %d KiB" % (
int(current_mem // 1024), int(peak_mem // 1024), int(overhead // 1024)
)
logging.info("%s", summary)
out_lines = [ summary ]
for trace in stats:
out_lines.append("---")
out_lines.append( "%d KiB in %d blocks" % (int(trace.size // 1024), int(trace.count)) )
logging.info("%s", out_lines)
out_lines.extend( trace.traceback.format() )
out_lines.append('')
data = {}
data['traceback'] = '\n'.join(out_lines).encode('utf-8')
res = json.dumps(data)
print(res)
When I dump data to JSON I get:
Object of type bytes is not JSON serializable
From logging I can see the string output:
2020-01-08 11:54:25 - INFO - traced memory: 35 KiB peak: 91 KiB overhead: 31 KiB
2020-01-08 11:54:25 - INFO - ['traced memory: 35 KiB peak: 91 KiB overhead: 31 KiB', '---', '1 KiB in 4 blocks']
and then in the loop:
2020-01-08 11:54:26 - ERROR - ['traced memory: 35 KiB peak: 91 KiB overhead: 31 KiB', '---', '1 KiB in 4 blocks', ' File "/usr/local/lib/python3.7/site-packages/tornado/routing.py", line 256', ' self.delegate.finish()', ' File "/usr/local/lib/python3.7/site-packages/tornado/web.py", line 2195', ' self.execute()', ' File "/usr/local/lib/python3.7/site-packages/tornado/web.py", line 2228', ' **self.path_kwargs)', ' File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 326', ' yielded = next(result)', ' File "/usr/local/lib/python3.7/site-packages/tornado/web.py", line 1590', ' result = method(*self.path_args, **self.path_kwargs)', ' File "/tornado/handlers/memTraceHandler.py", line 56', ' self.write(json.dumps(response.getData()))', '---', '0 KiB in 2 blocks']
So where is the bytes (b"...") string that I cannot serialize coming from?
YOU are creating the bytes object here:
data['traceback'] = '\n'.join(out_lines).encode('utf-8')
That's what calling encode does.
Simply do:
data['traceback'] = '\n'.join(out_lines)
And it will dump out fine.
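A quick way to see the difference, independent of the tracemalloc code above:

import json

lines = ["traced memory: 35 KiB", "---", "1 KiB in 4 blocks"]

print(json.dumps({"traceback": "\n".join(lines)}))  # fine: the value is a str

try:
    json.dumps({"traceback": "\n".join(lines).encode("utf-8")})  # encode() returns bytes
except TypeError as exc:
    print(exc)  # Object of type bytes is not JSON serializable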
I want to write a logger (please, no comments about why, or "use ..." suggestions).
But I am confused by the Node.js (event?) loop / forEach.
As an example:
for (var i = 0; i < 100; i++) {
  process.stdout.write(Date.now().toString() + "\n", "utf8");
}
Output: 1466021578453, 1466021578453, 1466021578469, 1466021578469
Questions: Where does the 16 ms delay come from, and how can I prevent it?
EDIT: Windows 7, x64 (on Ubuntu 15 the delay is at most 2 ms).
sudo ltrace -o outlog node myTest.js
This is likely more than you want. The call that Date.now() translates into on my machine is clock_gettime. You want to look at what happens between subsequent calls to clock_gettime. You're also writing to STDOUT, and each time you do that there is overhead. You can run the whole process under ltrace to see what's happening, and get a summary with -c.
For me, it runs in 3 ms when not running it under ltrace.
% time seconds usecs/call calls function
------ ----------- ----------- --------- --------------------
28.45 6.629315 209 31690 memcpy
26.69 6.219529 217 28544 memcmp
16.78 3.910686 217 17990 free
9.73 2.266705 214 10590 malloc
2.92 0.679971 220 3083 _Znam
2.86 0.666421 216 3082 _ZdaPv
2.55 0.593798 206 2880 _ZdlPv
2.16 0.502644 211 2378 _Znwm
1.09 0.255114 213 1196 strlen
0.69 0.161741 215 750 pthread_getspecific
0.67 0.155609 209 744 memmove
0.57 0.133857 212 631 _ZNSo6sentryC1ERSo
0.57 0.133344 226 589 pthread_mutex_lock
0.52 0.121342 206 589 pthread_mutex_unlock
0.46 0.106343 207 512 clock_gettime
0.40 0.093022 204 454 memset
0.39 0.089857 216 416 _ZNSt9basic_iosIcSt11char_traitsIcEE4initEPSt15basic_streambufIcS1_E
0.22 0.050741 195 259 strcmp
0.20 0.047454 228 208 _ZNSt8ios_baseC2Ev
0.20 0.047236 227 208 floor
0.19 0.044603 214 208 _ZNSt6localeC1Ev
0.19 0.044536 212 210 _ZNSs4_Rep10_M_destroyERKSaIcE
0.19 0.044200 212 208 _ZNSt8ios_baseD2Ev
I'm not sure why there are 31,690 memcpy calls and 28,544 memcmp calls in there; that seems a bit excessive, but perhaps that's just the JIT start-up cost. As for the runtime cost, you can see there are 512 calls to clock_gettime. I have no idea why there are that many calls either, but you can see 106 ms lost in clock_gettime. Good luck with it.