Memory error in pycharm using scipy's welch function - python-3.x

I want to compute Welch's periodogram using scipy.signal in PyCharm. My signal is a 5-min audio file with Fs = 48 kHz, so I guess it's a very big signal. The line was:
f, p = signal.welch(audio, Fs, nperseg=512)
I am getting a memory error. I was wondering whether that's a PyCharm configuration issue or whether the signal is simply too big. My RAM is 8 GB.
Sometimes it works with some audio files, but the idea is to process several, and after one or two the error is raised.

I've tested your setup and welch does not seem to be the problem. For further analysis the entire script you are running would be necessary.
import numpy as np
from scipy.signal import welch
fs = 48000
signal_length = 5 * 60 * fs
audio_signal = np.random.rand(signal_length)
f, Pxx = welch(audio_signal, fs=fs, nperseg=512)
On my computer (Windows 10, 64-bit) the call to welch consumes 600 MB of peak memory, which is released directly afterwards, in addition to ~600 MB allocated for the initial array and Python itself. The call to welch itself does not lead to any significant permanent memory increase.
You can do the following:
Upgrade to the newest version of scipy, as there have been problems with welch previously.
Check that your PC has enough free memory and close memory-hungry applications (e.g. Chrome).
Convert your array to a smaller datatype, e.g. from float64 to float32 or float16.
Make sure to free variables that are no longer needed. Especially if you load several signals and store the results in different arrays, memory use can accumulate quite quickly. Only keep what you need, delete variables via del variable_name, and check that no references remain elsewhere in the program. E.g. if you don't need the audio variable, either delete it explicitly after welch(...) or overwrite it with the next audio data.
Run the garbage collector with gc.collect(). However, this will probably not solve your problem, as garbage is managed automatically in Python anyway.
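The advice above about keeping only what you need can be sketched as follows. This is a minimal illustration, not the asker's script: fake_recordings is a hypothetical stand-in for loading audio files, and welch_one_at_a_time is a made-up helper name. The signals are generated as float32 and only the small spectrum of each one is kept.

```python
import numpy as np
from scipy.signal import welch

def fake_recordings(count, fs, seconds):
    """Hypothetical stand-in for loading audio files one at a time."""
    rng = np.random.default_rng(0)
    for _ in range(count):
        # float32 halves the memory of the default float64
        yield rng.random(seconds * fs, dtype=np.float32)

def welch_one_at_a_time(recordings, fs, nperseg=512):
    """Compute one spectrum per recording, keeping only the small results."""
    f = None
    spectra = []
    for audio in recordings:
        f, pxx = welch(audio, fs=fs, nperseg=nperseg)
        spectra.append(pxx)  # nperseg // 2 + 1 values per recording
        del audio            # drop our reference before the next signal arrives
    return f, spectra

fs = 48000
f, spectra = welch_one_at_a_time(fake_recordings(3, fs, seconds=10), fs)
print(len(spectra), f.shape, spectra[0].shape)
```

Because the recordings come from a generator, at most one full signal is alive at a time, which is the behaviour the list above aims for.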

Related

What is the fastest way to sum 2 matrices using Numba?

I am trying to find the fastest way to sum 2 matrices of the same size using Numba. I came up with 3 different approaches but none of them could beat Numpy.
Here is my code:
import numpy as np
from numba import njit,vectorize, prange,float64
import timeit
import time
# function 1:
def sum_numpy(A,B):
    return A+B
# function 2:
sum_numba_simple= njit(cache=True,fastmath=True) (sum_numpy)
# function 3:
@vectorize([float64(float64, float64)])
def sum_numba_vectorized(A,B):
    return A+B
# function 4:
@njit('(float64[:,:],float64[:,:])', cache=True, fastmath=True, parallel=True)
def sum_numba_loop(A,B):
    n=A.shape[0]
    m=A.shape[1]
    C = np.empty((n, m), A.dtype)
    for i in prange(n):
        for j in prange(m):
            C[i,j]=A[i,j]+B[i,j]
    return C
#Test the functions with 2 matrices of size 1,000,000x3:
N=1000000
np.random.seed(123)
A=np.random.uniform(low=-10, high=10, size=(N,3))
B=np.random.uniform(low=-5, high=5, size=(N,3))
t1=min(timeit.repeat(stmt='sum_numpy(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t2=min(timeit.repeat(stmt='sum_numba_simple(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t3=min(timeit.repeat(stmt='sum_numba_vectorized(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t4=min(timeit.repeat(stmt='sum_numba_loop(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
print("function 1 (sum_numpy): t1= ",t1,"\n")
print("function 2 (sum_numba_simple): t2= ",t2,"\n")
print("function 3 (sum_numba_vectorized): t3= ",t3,"\n")
print("function 4 (sum_numba_loop): t4= ",t4,"\n")
Here are the results:
function 1 (sum_numpy): t1= 0.1655790419999903
function 2 (sum_numba_simple): t2= 0.3019776669998464
function 3 (sum_numba_vectorized): t3= 0.16486266700030683
function 4 (sum_numba_loop): t4= 0.1862256660001549
As you can see, the results show that there isn't any advantage in using Numba in this case. Therefore, my question is:
Is there any other implementation that would speed up the summation?
Your code is bound by page faults (see here, here and there for more information about this). Page faults happen because the output array is newly allocated on every call. A solution is to preallocate it and then write into it, so that no pages need to be remapped in physical memory. np.add(A, B, out=C) does this, as indicated by @August in the comments. Another solution could be to adapt the standard allocator so that it does not give the memory back to the OS, at the expense of a significant memory usage overhead (AFAIK TC-Malloc can do that, for example).
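As a minimal sketch of the preallocation idea, the output buffer C is created once and reused on every call. The array sizes here are kept small for illustration and are not the ones from the question.

```python
import numpy as np

rng = np.random.default_rng(123)
A = rng.uniform(-10, 10, size=(1000, 3))
B = rng.uniform(-5, 5, size=(1000, 3))

# Allocate the output once, outside the hot loop.
C = np.empty_like(A)

for _ in range(100):
    # Writes into the existing buffer instead of allocating a new array.
    result = np.add(A, B, out=C)

print(result is C)
```

np.add returns the array passed as out, so no new array is created per call and the pages backing C stay mapped between iterations.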
There is another issue on most platforms (especially x86 ones): the cache-line write allocations of write-back caches are expensive during writes. The typical solution to avoid this is to use non-temporal stores (if available on the target processor, which is the case on x86-64 but maybe not on others). That being said, neither Numpy nor Numba is able to do that yet. For Numba, I filed an issue covering a simple use-case. Compilers themselves (GCC for Numpy and Clang for Numba) tend not to generate such instructions because they can be detrimental to performance when arrays fit in cache, and compilers do not know the size of the array at compile time (they could generate specific code when they can evaluate the amount of data computed, but this is not easy and can slow down some other codes). AFAIK, the only possible way to fix this is to write C code and use low-level instructions, or to use compiler directives. In your case, about 25% of the bandwidth is lost due to this effect, causing a slowdown of up to 33%.
Using multiple threads does not always make memory-bound code faster. In fact, such code generally barely scales, because using more cores does not speed up the execution when the RAM is already saturated. Only a few cores are generally required to saturate the RAM on most platforms. Page faults can benefit from using multiple cores, depending on the target system (Linux handles them in parallel quite well, Windows generally does not scale well, IDK for macOS).
Finally, there is another issue: the code is not vectorized (at least not on my machine, although it can be). One solution is to flatten the array view and do one big loop that the compiler can more easily vectorize (the j-based loop is too small for SIMD instructions to be effective). The contiguity of the input arrays should also be specified so that the compiler can generate fast SIMD code. Here is the resulting Numba code:
@njit('(float64[:,::1], float64[:,::1], float64[:,::1])', cache=True, fastmath=True, parallel=True)
def sum_numba_fast_loop(A, B, C):
    n, m = A.shape
    assert C.shape == A.shape
    A_flat = A.reshape(n*m)
    B_flat = B.reshape(n*m)
    C_flat = C.reshape(n*m)
    for i in prange(n*m):
        C_flat[i]=A_flat[i]+B_flat[i]
    return C
Here are results on my 6-core i5-9600KF processor with a ~42 GiB/s RAM:
sum_numpy: 0.642 s 13.9 GiB/s
sum_numba_simple: 0.851 s 10.5 GiB/s
sum_numba_vectorized: 0.639 s 14.0 GiB/s
sum_numba_loop serial: 0.759 s 11.8 GiB/s
sum_numba_loop parallel: 0.472 s 18.9 GiB/s
Numpy "np.add(A, B, out=C)": 0.281 s 31.8 GiB/s <----
Numba fast: 0.288 s 31.0 GiB/s <----
Optimal time: 0.209 s 32.0 GiB/s
The Numba code and the Numpy one saturate my RAM. Using more cores does not help (in fact it is a bit slower, most likely due to contention in the memory controller). Both are sub-optimal since they do not use non-temporal store instructions, which would prevent cache-line write allocations (these cause data to be fetched from the RAM before being written back). The optimal time is the one expected when using such instructions. Note that one can expect to reach only 65-80% of the RAM bandwidth because of mixed reads and writes. Indeed, interleaving reads and writes causes low-level overheads that prevent the RAM from being saturated. For more information about how RAM works, please consider reading Introduction to High Performance Scientific Computing -- Chapter 1.3 and What Every Programmer Should Know About Memory (and possibly this).
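The throughput figures in the table appear consistent with counting four array-sized transfers per call: reading A, reading B, the write-allocate read of C, and the write of C. That is an inference from the numbers rather than something the answer states, and it assumes 100 calls per timing, as in the question's timeit setup. A short check of the arithmetic:

```python
GIB = 2**30
ARRAY_BYTES = 1_000_000 * 3 * 8  # one float64 array of shape (1_000_000, 3)
CALLS = 100                      # assumed: number=100 as in the question's timeit calls

def bandwidth_gib_s(seconds, transfers):
    """Effective throughput if each call moves `transfers` array-sized blocks."""
    return transfers * ARRAY_BYTES * CALLS / seconds / GIB

timings = [
    ("sum_numpy", 0.642),
    ("np.add(A, B, out=C)", 0.281),
    ("Numba fast", 0.288),
]
for name, seconds in timings:
    print(f"{name}: {bandwidth_gib_s(seconds, transfers=4):.1f} GiB/s")
```

With non-temporal stores only three transfers would be needed. That matches the earlier remark that about 25% of the bandwidth is lost (one transfer in four) and that the slowdown is up to 33% (four transfers instead of three).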

Matplotlib savefig() slow ...just the way things are or ideas to speed up?

I've got some code where I'm using MPL (not pyplot) via imshow() to show some arrays and then am using savefig() to save them as PNG files.
The arrays are approx 3,000 x 4,000 in size.
My problem is that saving is taking a long time - on the order of 4 seconds or so per image.
Minor Details
The arrays are floats
I'm using cmap of gray
I'm making sure the figure resolution is the same as the images, and the axes fills the entire figure (so fig size * dpi matches exactly the shape of the arrays)
I'm using imshow() with interpolation of none.
Running on macbook pro - but running on anything else is about the same (assuming SSD)
The slowness seems to be due to a CPU bottleneck. Wrapping my code with time shows real and user time to be about the same, so it doesn't seem to be an I/O bottleneck.
However, (very curiously!), if I run the code via Multiprocessing in multiple processes, it doesn't seem to help much with overall real time (even with 4 cores).
Questions
Is saving to PNG taking around 4 seconds 'normal'?
Any tips or ideas on how to speed things up?
I have never tried it, but I think you could try to run the code via multiprocessing on the GPU (which may be better suited to the task) if you have an Nvidia graphics card.
https://documen.tician.de/pycuda/
Other than that I don't think you can speed up the process more.
From the details of what you are doing it sounds like you just want to save the array as a (false) color image. You are very carefully setting up Matplotlib to do that for you, but we do not have the logic to notice we can take any short-cuts so are still going through all of the resampling logic.
You can generate an equivalent output more simply via:
import matplotlib.colors as mcolors
import matplotlib.cm as mcm
import numpy as np
import PIL.Image
import time
# "data"
my_data = np.random.randn(3000, 4000) * 50
start_time = time.monotonic()
# to scale the data to [0, 1]
my_norm = mcolors.Normalize(-50, 50)
# to map the scaled data to gray scale RGB
my_cmap = mcm.get_cmap('gray')
setup_time = time.monotonic()
# apply the above transforms
color_mapped = my_cmap(my_norm(my_data), bytes=True)
mapping_time = time.monotonic()
# use pillow to save the png
PIL.Image.fromarray(color_mapped).save('/tmp/so.png', compress_level=1)
end_time = time.monotonic()
print(f"total took {end_time - start_time}")
print(f" setup took {setup_time - start_time}")
print(f" mapping took {mapping_time - setup_time}")
print(f" saving took {end_time - mapping_time}")
but the majority of the time is still spent in .save(...). By playing with compress_level you can make that line more or less expensive, which suggests that the cost is inside libpng while compressing the data (see https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#png).
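To get a feel for how much the compression level alone can matter, here is a rough sketch that uses the standard-library zlib module as a stand-in. PNG compression is DEFLATE-based, but this does not call Pillow or write a PNG, and the payload is a hypothetical repetitive byte pattern rather than real image data.

```python
import time
import zlib

def time_compression(payload, levels=(0, 1, 6, 9)):
    """Return {level: (compressed_size, seconds)} for each zlib level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(payload, level)
        elapsed = time.perf_counter() - start
        results[level] = (len(compressed), elapsed)
    return results

# Hypothetical stand-in for image bytes: a repeating 0..255 ramp, about 1 MiB.
payload = bytes(range(256)) * 4096
results = time_compression(payload)
for level, (size, elapsed) in results.items():
    print(f"level {level}: {size} bytes in {elapsed:.4f} s")
```

Level 0 stores the data uncompressed, while higher levels trade CPU time for a smaller output. The actual timings depend heavily on the data, so treat the printed numbers as illustrative only.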

Python 3.8 RAM overflow and loading issues

First, I want to mention that this is our first project on a bigger scale, so we don't know everything, but we learn fast.
We developed code for image recognition. We tried it on a Raspberry Pi 4B but quickly found that it is way too slow overall. Currently we are using an NVIDIA Jetson Nano. The first recognition was OK (around 30 sec.) and the second try was even better (around 6-7 sec.). The first took so long because the model is loaded for the first time. The image recognition can be triggered via an API, and the metadata from the AI model is returned as the response. We use FastAPI for this.
But there is a problem right now: if I load my CNN as a global variable at the beginning of my classification file (loaded on import) and use it within a thread, I need to use mp.set_start_method('spawn'), because otherwise I get the following error:
"RuntimeError: Cannot re-initialize CUDA in forked subprocess.
To use CUDA with multiprocessing, you must use the 'spawn' start method"
Now that is of course an easy fix: just add the call above before starting my thread. This does work, but another challenge occurs at the same time. After setting the start method to 'spawn' the error disappears, but the Jetson starts to allocate way too much memory.
Because of the overhead and the preloaded CNN model, RAM usage is around 2.5 GB before the thread starts. After the start it doesn't stop allocating RAM; it consumes all 4 GB of RAM and also the whole 6 GB of swap. Right after this, the whole API process is killed with the error "cannot allocate memory", which is to be expected.
I managed to fix that as well, just by loading the CNN model inside the classification function (not preloading it on the GPU as in the two cases before). However, this causes a problem too. Loading the model onto the GPU takes around 15-20 s, and this happens every time the recognition starts. This is not suitable for us, and we are wondering why we cannot preload the model without killing the whole thing after two image recognitions. Our goal is to be under 5 sec.
#classify
import torchvision.transforms as transforms
from skimage import io
import time
from torch.utils.data import Dataset
from .loader import *
from .ResNet import *
#if this part is in the classify() function then no allocation problem occurs
net = ResNet152(num_classes=25)
net = net.to('cuda')
save_file = torch.load("./model.pt", map_location=torch.device('cuda'))
net.load_state_dict(save_file)
def classify(imgp=""):
    #do some classification with the net
    pass
if __name__ == '__main__':
    mp.set_start_method('spawn') #if commented out the first error occurs
    manager = mp.Manager()
    return_dict = manager.dict()
    p = mp.Process(target=classify, args=('./bild.jpg', return_dict))
    p.start()
    p.join()
    print(return_dict.values())
Any help here will be much appreciated. Thank you.

Python keeps running for 10+mins (after last statement in program) when there is huge (33GB) data structure in memory (nothing in swap)

I need to parse a huge gz file (~10GB compressed, ~100GB uncompressed). The code creates a data structure ('data_struct') in memory. I am running on a machine with an Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz with 16 CPUs and plenty of RAM (i.e. 200+ GB), running CentOS-6.9. I have implemented this using a class in Python 3.6.3 (CPython) as shown below:
import subprocess
class my_class():
    def __init__(self):
        # -c makes gunzip write the decompressed data to stdout
        cmd = 'gunzip -c huge-file.gz'
        self.process = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True)
        self.data_struct = dict()
    def populate_struct(self):
        for line in self.process.stdout:
            pass  # <populate the self.data_struct dictionary>
    def __del__(self):
        self.process.wait()
        #del self.data_struct # presence/absence of this statement decreases/increases runtime respectively
#================End of my_class===================
def main():
    my_object = my_class()
    my_object.populate_struct()
    print(f'~~~~ Finished populate_struct() ~~~~') # last statement in my program.
    ## Python keeps running at 100% past the previous statement for 10+mins
if __name__ == '__main__':
    main()
#================End of Main=======================
The resident memory consumption of my data_struct in memory (RAM only, no swap) is about ~33GB. I did $ top to find the PID of Python process and traced the Python process using $ strace -p <PID> -o <out_file> (to see what Python is doing). While it is executing populate_struct(), I can see in the out_file of strace that Python is using calls like mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b0684160000 to create data_struct. While Python was running past the last print() statement, I found that Python was issuing only munmap() operations as shown below :
munmap(0x2b3c75375000, 41947136) = 0
munmap(0x2b3c73374000, 33558528) = 0
munmap(0x2b4015d2a000, 262144) = 0
munmap(0x2b4015cea000, 262144) = 0
munmap(0x2b4015caa000, 262144) = 0
munmap(0x2b4015c6a000, 262144) = 0
munmap(0x2b4015c2a000, 262144) = 0
munmap(0x2b4015bea000, 262144) = 0
munmap(0x2b4015baa000, 262144) = 0
...
...
Python keeps running for anywhere between 10 and 12 mins after the last print() statement. One observation is that if I have the del self.data_struct statement in the __del__() method, it takes only 2 mins. I have done these experiments multiple times, and the runtime consistently decreases/increases with the presence/absence of del self.data_struct in __del__().
My questions :
My understanding is that Python is doing cleanup work by using munmap(), but unlike Python, other languages like Perl immediately release memory and exit the program. Am I doing it right by implementing it as shown above? Is there a way to tell Python to avoid this munmap()?
Why does it take 10+ mins to clean up if there is no del self.data_struct statement in __del__(), and only 2 mins if there is one?
Is there a way to speed up the cleanup work, i.e. munmap()?
Is there a way to exit the program immediately without the cleanup work?
Other thoughts/suggestions about tackling this problem are appreciated.
Please try a more recent version of Python (at least 3.8). This shows several signs of being a mild(!) form of a worst-case quadratic-time algorithm in CPython's object deallocator, which was rewritten here (and note that the issue linked to here in turn contains a link to an older StackOverflow post with more details):
https://bugs.python.org/issue37029
Some glosses
If my guess is right, the amount of memory isn't particularly important - it's instead the sheer number of distinct Python objects being managed by CPython's "small object allocator" (obmalloc.c), combined with "bad luck" in the order in which their memory is released.
When that code was first written, RAM wasn't big enough to hold millions of Python objects, so nobody noticed that one particular part of deallocation logic could take time quadratic in the number of allocated "arenas" (details aren't really helpful, but "arenas" are the granularity at which system mmap() and munmap() calls are made - 256 KiB chunks).
It's not those mapping calls that are consuming mounds of time, and any decent implementation of any language using OS memory mapping facilities will eventually call munmap() enough times to release the OS resources consumed by its mmap() calls.
So that's a red herring. munmap() is being called many times simply because you allocated many objects, which required many mmap() calls.
There isn't any crisp or easy way to explain exactly when the problem shows up. See "bad luck" above ;-) The relevant code was rewritten for CPython 3.8 to be worst-case linear time instead, which gave a factor of ~250 speedup for the specific program that triggered the issue report (see the link already given).
As a comment noted, you can exit your program immediately at any time by invoking os._exit(), but the leading underscore is meant to scare you off: "immediately" means "immediately". No cleanups of any kind are performed. For example, the __del__ method in your class? Skipped. __del__ is run as a side effect of deallocation, but if you actually "immediately release memory and exit the program" then no destructors of any kind are run, nor any handlers registered with the atexit module, etc etc. It's as drastic as a program dying, e.g., with a segfault.
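The difference between a normal exit and os._exit() can be seen in a small experiment. This is only a sketch: the child script and the run_child helper are hypothetical names, and the child runs in a separate interpreter so that the hard exit does not take the parent down with it.

```python
import subprocess
import sys
import textwrap

# Child program: registers an atexit handler, then exits normally or via os._exit().
CHILD = textwrap.dedent("""
    import atexit, os, sys
    atexit.register(lambda: print("atexit handler ran"))
    print("work done", flush=True)  # flush explicitly: os._exit() skips buffer flushing
    if sys.argv[1] == "hard":
        os._exit(0)
""")

def run_child(mode):
    """Run the child in a fresh interpreter and return what it printed."""
    done = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
    )
    return done.stdout

normal_output = run_child("normal")
hard_output = run_child("hard")
print("normal exit:", normal_output.split())
print("os._exit(): ", hard_output.split())
```

In the "hard" run the atexit handler never fires, which is the kind of skipped cleanup described above. Anything that must be written out, such as file buffers or results, has to be flushed explicitly before calling os._exit().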

Python3 multiprocessing: Memory Allocation Error

I know that this question has been asked a lot of times, but the answers are not applicable.
This is the first answer to a question about parallelizing a loop with multiprocessing on Stack Overflow:
import multiprocessing as mp
def processInput(i):
    return i * i
if __name__ == '__main__':
    inputs = range(1000000)
    pool = mp.Pool(processes=4)
    results = pool.map(processInput, inputs)
    print(results)
This code works fine. But if I increase the range to 1000000000, my 16GB of RAM fill up completely and I get [Errno 12] Cannot allocate memory. It seems as if the map function starts as many processes as possible. How do I limit the number of parallel processes?
The pool.map function starts 4 processes as you instructed it (in the line processes=4 you instruct the pool on how many processes it can use to perform your logic).
There is however a different issue underlying this implementation.
The pool.map function will return a list of objects, in this case numbers.
Python numbers do not act like ints in ANSI C: they carry overhead and will not overflow (e.g. wrap around to -2^31 on reaching 2^31 on 32-bit).
Also, Python lists are not arrays and do incur an overhead.
To be more specific, on python 3.6, running the following code will reveal some overhead:
>>> import sys
>>> t = [1,2,3,4]
>>> sys.getsizeof(t)
96
>>> t = [x for x in range(1000)]
>>> sys.getsizeof(t)
9024
So this means 24 bytes per number on small lists and ~9 bytes on large lists.
So for a list of size 10^9 we get about 8.5GB.
EDIT:
1. As tfb mentioned, this is not even the size of the underlying number objects, just pointers and list overhead, meaning there is much more memory overhead that I did not account for in the original answer.
2. The default Python installation on Windows is 32-bit (you can get a 64-bit installation, but you need to check the section with all available downloads on the Python website), so I assumed you are using the 32-bit installation.
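The estimate above, together with the point from the edit about the number objects themselves, can be reproduced with a small back-of-the-envelope sketch. It relies on sys.getsizeof as reported by CPython, the helper name estimate_list_bytes is made up for illustration, and the figures differ between 32-bit and 64-bit builds.

```python
import sys

def estimate_list_bytes(n_items, sample_value=10**6):
    """Rough memory estimate for a list of n_items distinct ints (CPython-specific)."""
    # Cost of the list's own slots: one pointer per element.
    slot_bytes = (sys.getsizeof([None] * 1000) - sys.getsizeof([])) / 1000
    # Cost of each int object the slots point to (small ints are cached, larger ones are not).
    int_bytes = sys.getsizeof(sample_value)
    return n_items * slot_bytes, n_items * int_bytes

slots, ints = estimate_list_bytes(10**9)
print(f"list slots:  {slots / 1e9:.1f} GB")
print(f"int objects: {ints / 1e9:.1f} GB")
```

The int objects dominate the total, so a results list with 10^9 entries needs considerably more than the pointer-only estimate.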
In Python 2, range(1000000000) creates a list of 10^9 ints. This is around 8GB for the list's pointers alone (8 bytes per pointer on a 64-bit system). In Python 3, range is lazy and does not build that list, but pool.map still has to collect 10^9 results into a list of the same size. A really, really smart implementation might be able to do this on a 16GB machine, but it's basically a lost cause.
In Python 2 you could try using xrange, which might or might not help. In Python 3, range already behaves like xrange.
