I'm trying to load a fairly large number of audio segments with librosa (about 173K), most of them under 15 seconds. When I run my function, my RAM is at 90%+ capacity within 30 minutes, and eventually my computer crashes completely.
The segments are .wav files, and I've tried soundfile and audioread as standalone loaders with the same result. I also tried different iterator methods, which don't work either. I've run diagnostics on my RAM and everything is fine. Am I simply trying to loop through too many audio files at once? I would imagine that since my files are extremely small this shouldn't be a problem. I've had no issues with memory leaks in the past or with running large model job batches.
RAM: 16.0 GB
Disk space for cache: 2TB of space
Tried this:
import os
import librosa

def load_wavs(wav_dir, sr):
    wavs = list()
    for file in os.listdir(wav_dir):
        file_path = os.path.join(wav_dir, file)
        wav, _ = librosa.load(file_path, sr=sr, mono=True)
        # wav = wav.astype(np.float64)
        wavs.append(wav)
    return wavs
Tried this:
def load_segs(audio_arrays):
    segments_data = []
    for a in audio_arrays:
        data, _ = librosa.load(a, sr=16000, mono=True)  # unpack so data is the waveform, not a (wav, sr) tuple
        segments_data.append(data)
        print(librosa.display.waveplot(data))
    return segments_data
And tried this:
audio_data_all = []
for i in audio_arrays:
    data = librosa.load(i, sr=16000, mono=True)
    audio_data_all.append(data)
And this in each function:
audio_data = [librosa.load(i, sr=16000, mono=True) for i in audio_arrays]
Any help would be much appreciated, thanks.
Each loaded audio file will take up memory. This is roughly samplewidth_bytes * channels * samplerate * seconds_per_file bytes per file, multiplied by the number of files.
Using a 16 kHz sample rate, loaded as 64-bit float, 1 channel, up to 15 seconds each, and 173k audio files, this is (8 * 1 * 16000 * 15 * 173000) / 1e9 ≈ 332 GB.
So it will not fit in 16 GB of RAM.
This is not a memory leak issue, just that you are trying to load too much data at a time. Process the audio files one by one or in batches of up to 1-2k files instead.
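A minimal sketch of the batch approach, assuming you can compute whatever per-file result you need and then discard the raw audio (extract_features here is a hypothetical placeholder for your own processing, not part of the original question):

import os
import librosa

def iter_wav_batches(wav_dir, sr=16000, batch_size=1000):
    # yield lists of loaded waveforms, batch_size files at a time
    batch = []
    for name in sorted(os.listdir(wav_dir)):
        path = os.path.join(wav_dir, name)
        wav, _ = librosa.load(path, sr=sr, mono=True)
        batch.append(wav)
        if len(batch) == batch_size:
            yield batch
            batch = []  # drop the reference so the old batch can be garbage collected
    if batch:
        yield batch

# keep only the (small) per-file results in memory, never all 173k waveforms
results = []
for batch in iter_wav_batches("wav_dir", sr=16000, batch_size=1000):
    for wav in batch:
        results.append(extract_features(wav))  # hypothetical per-file processing

This keeps at most one batch of raw audio in RAM at any time; the memory footprint is then bounded by batch_size rather than by the total number of files.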
Related
So I'm making a program that corrects stereo imbalance for an audio file. I'm using pysoundfile to read/write the files. The code looks something like this.
import soundfile as sf

data, rate = sf.read("Input.wav")
for d in data:
    ...  # process audio
sf.write("Output.wav", data, rate, 'PCM_24')
The issue is that I'm working with DJ mixes that can be a couple of hours long, so loading the entire mix into RAM causes the program to be killed.
My question is how do I read/write the file in smaller sections vs loading the entire thing?
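One possible approach (not from the original thread) is to stream the input in blocks with soundfile's blocks() generator and write each processed block out incrementally with a SoundFile opened for writing. A minimal sketch, assuming the balance correction only needs one block at a time (the one-minute block size is an arbitrary choice):

import soundfile as sf

info = sf.info("Input.wav")
blocksize = info.samplerate * 60  # one minute of audio per block

with sf.SoundFile("Output.wav", 'w', samplerate=info.samplerate,
                  channels=info.channels, subtype='PCM_24') as out:
    for block in sf.blocks("Input.wav", blocksize=blocksize):
        # block is a (frames, channels) numpy array; apply the
        # stereo-balance correction to it here before writing it out
        out.write(block)

This keeps only one block in memory at a time, so the peak memory usage no longer depends on the length of the mix.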
I need to process about 200 folders, each containing 300 pictures (~205 KB each), from an external HD.
I have the following loop running inside a thread.
ffs = FileFrameStream(lFramePaths).start()
count = 0

# ___ while loop through the frames ____
image, path = ffs.read()
while ffs.more():  # while there are frames in the queue to read
    try:
        img = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # some more operations....
    except:
        print(f"Error in picture: {path}")
        image, path = ffs.read()
        count += 1
        continue
    image, path = ffs.read()
    count += 1
ffs.stop()
The code runs fast for 1 to 30-40 folders: one folder takes around 0.5 s, and 20 folders take about 13.2 s, but if I want to analyse all 200 folders it takes 500-600 s. So I don't know what I'm doing wrong or how I can increase the performance of the code.
I appreciate any help you can provide.
Eduardo
You are probably seeing the effects of your operating system's file cache. When parsing 200 folders' worth of files, it may run out of capacity, and the actual streaming is then done directly from disk instead of from RAM.
Check whether your OS file cache capacity is smaller than the total size of those files, and whether it caches the external drive at all. Once the data stops fitting in the cache, performance drops sharply, unless the disk drive is nearly as fast as RAM (which it presumably is not).
I want to download and extract 100 tar.gz files that are each 1GB in size. Currently, I've sped it up with multithreading and by avoiding disk IO via in-memory byte streams, but can anyone show me how to make this faster (just for curiosity's sake)?
from bs4 import BeautifulSoup
import requests
import tarfile
import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor

# speed up by only extracting what we need
def select(members):
    for file in members:
        if any(ext in file.name for ext in [".tif", ".img"]):
            yield file

# for each url, download the tar.gz and extract the necessary files
def download_and_extract(x):
    # read and unzip as a byte stream
    r = requests.get(x, stream=True)
    tar = tarfile.open(fileobj=r.raw, mode='r|gz')
    tar.extractall(members=select(tar))
    tar.close()

# parallel download and extract the 96 1GB tar.gz files
links = get_asset_links()

# 3 * cpu count seemed to be fastest on a 4 core cpu
with ThreadPoolExecutor(3 * mp.cpu_count()) as executor:
    executor.map(download_and_extract, links)
My current approach takes 20-30 minutes. I'm not sure what the theoretically possible speed-up is, but if it's helpful, the download speed for a single file is 20 MB/s in isolation.
If anyone could indulge my curiosity, that would be greatly appreciated! Some things I looked into were asyncio, aiohttp, aiomultiprocess, io.BytesIO, etc., but I wasn't able to get them to work well with the tarfile library.
Your computation is likely IO bound. Decompression is generally a slow task, especially with the gzip algorithm (newer algorithms can be much faster). From the provided information, the average reading speed is about 70 MB/s, which means the storage throughput is at least roughly 140 MB/s. That looks entirely normal and expected, especially if you use an HDD or a slow SSD.
Besides this, it seems you iterate over the files twice due to the selection of members. Keep in mind that tar.gz files are one big block of files packed together and then compressed with gzip: to iterate over the filenames, the tar file needs to be at least partially decompressed. This may not be a problem depending on the implementation of tarfile (possible caching). If the total size of the discarded files is small, it may be better to simply decompress the whole archive in one go and then remove the files you want to discard. Moreover, if you have a lot of memory and the total size of the discarded files is not small, you can decompress the files into an in-memory virtual storage device first, so that writing the soon-to-be-discarded files is cheap. This can be done natively on Linux systems.
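A rough sketch of that last idea, assuming a Linux machine where /dev/shm is a RAM-backed tmpfs (the scratch path and the kept extensions are assumptions for illustration, not part of the original answer): extract the whole stream in a single pass into the in-memory directory, then keep only the files you actually need.

import os
import shutil
import tarfile

import requests

KEEP_EXT = (".tif", ".img")  # assumed extensions to keep

def download_extract_filter(url, final_dir, scratch_dir="/dev/shm/tar_scratch"):
    # extract the whole archive into scratch_dir (ideally RAM-backed),
    # then move only the wanted files to final_dir and drop the rest
    os.makedirs(scratch_dir, exist_ok=True)
    os.makedirs(final_dir, exist_ok=True)

    r = requests.get(url, stream=True)
    with tarfile.open(fileobj=r.raw, mode="r|gz") as tar:
        tar.extractall(path=scratch_dir)  # single pass over the compressed stream

    for root, _, files in os.walk(scratch_dir):
        for name in files:
            if name.endswith(KEEP_EXT):
                shutil.move(os.path.join(root, name), os.path.join(final_dir, name))

    shutil.rmtree(scratch_dir)  # discard everything else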
I want to get the Welch periodogram using scipy.signal in PyCharm. My signal is a 5-minute audio file with Fs = 48 kHz, so I guess it's a very big signal. The line was:
f, p = signal.welch(audio, Fs, nperseg=512)
I am getting a memory error. I was wondering if that's a PyCharm configuration thing, or whether the signal is just too big. My RAM is 8 GB.
Sometimes it works with some audio files, but the idea is to do it with several, so after one or two, the error is raised.
I've tested your setup and welch does not seem to be the problem. For further analysis, the entire script you are running would be necessary.
import numpy as np
from scipy.signal import welch
fs = 48000
signal_length = 5 * 60 * fs
audio_signal = np.random.rand(signal_length)
f, Pxx = welch(audio_signal, fs=fs, nperseg=512)
On my computer (Windows 10, 64-bit) it consumes 600 MB of peak memory during the call to welch, which is released directly afterwards, in addition to the ~600 MB allocated for the initial array and Python itself. The call to welch itself does not lead to any permanent significant memory increase.
You can do the following:
Upgrade to the newest version of scipy, as there have been problems with welch previously.
Check that your PC has enough free memory and close memory-hungry applications (e.g. Chrome).
Convert your array to a lower-precision datatype, e.g. from float64 to float32 or float16.
Make sure to free variables that are not needed anymore. Especially if you load several signals and store the results in different arrays, memory use can accumulate quite quickly. Only keep what you need, delete variables via del variable_name, and check that no references remain elsewhere in the program. E.g. if you don't need the audio variable, either delete it explicitly after welch(...) or overwrite it with the next audio data (see the sketch after this list).
Run the garbage collector with gc.collect(). However, this will probably not solve your problem, as garbage is managed automatically in Python anyway.
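Putting the datatype and variable-lifetime suggestions together, a minimal sketch of processing several files without letting the raw audio accumulate (the audio_paths list and the use of librosa for loading are assumptions about your setup):

import gc

import librosa
import numpy as np
from scipy.signal import welch

fs = 48000
results = {}

for path in audio_paths:  # assumed list of audio file paths
    # load as float32 so the raw signal takes half the memory of float64
    audio, _ = librosa.load(path, sr=fs, mono=True, dtype=np.float32)
    f, Pxx = welch(audio, fs=fs, nperseg=512)
    results[path] = Pxx  # keep only the small periodogram
    del audio            # drop the large raw signal right away
    gc.collect()         # usually unnecessary, but forces immediate cleanup

Only the periodograms (a few thousand floats each) stay in memory, so processing many files no longer grows the footprint by the size of each raw signal.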
I'm creating a basic GUI as a college project. It scans a user-selected hard drive on their PC and gives them information about it, such as the number of files on it.
There's a part of my scanning function that, for each file on the drive, takes the size of that file in bytes and adds it to a running total. At the end, after comparing the number to the Windows total, I always find that my Python script finds less data than Windows says is on the drive.
Below is the code...
import os

overall_space_used = 0

def Scan(drive):
    global overall_space_used
    for path, subdirs, files in os.walk(r"" + drive + "\\"):
        for file in files:
            overall_space_used = overall_space_used + os.path.getsize(os.path.join(path, file))
    print(overall_space_used)
When this is executed on one of my HDDs, Python says there are 23,328,445,304 bytes of data in total (21.7 GB). However, when I look at the drive in Windows, it says there are 23,536,922,624 bytes of data (21.9 GB). Why is there this difference?
I calculated it by hand using the same formula Windows uses to convert from bytes to gibibytes (gibibytes = bytes / 1024**3), and I still arrived 0.2 GB short. Why is Python finding less data?
With os.path.getsize(...) you get the actual size of the file.
But NTFS, FAT32, ... filesystems store data in clusters, and a file usually does not fill its last cluster completely.
You can see this difference when you go to the properties of a file: there is a difference between 'Size' and 'Size on disk'. When Windows reports the used space of the whole drive, it gives you the size of the used-up clusters, not the sizes of the files added up.
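To estimate the "size on disk" instead, you can round each file's size up to a whole number of clusters. A minimal sketch, assuming a 4096-byte allocation unit (check your drive's actual cluster size, which depends on the filesystem and how it was formatted; very small NTFS files stored inside the MFT will still be over-counted):

import os

CLUSTER_SIZE = 4096  # assumed allocation unit size; check your drive's actual value

def size_on_disk(path, cluster=CLUSTER_SIZE):
    # round the logical file size up to a whole number of clusters
    size = os.path.getsize(path)
    return ((size + cluster - 1) // cluster) * cluster

def scan(drive, cluster=CLUSTER_SIZE):
    total = 0
    for root, _, files in os.walk(drive + "\\"):
        for name in files:
            total += size_on_disk(os.path.join(root, name), cluster)
    return total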
Here is some more detailed information:
Why is There a Big Difference Between ‘Size’ and ‘Size on Disk’?