I want to download and extract 100 tar.gz files that are each 1GB in size. Currently, I've sped it up with multithreading and by avoiding disk IO via in-memory byte streams, but can anyone show me how to make this faster (just for curiosity's sake)?
from bs4 import BeautifulSoup
import requests
import tarfile
import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor

# speed up by only extracting what we need
def select(members):
    for file in members:
        if any(ext in file.name for ext in [".tif", ".img"]):
            yield file

# for each url, download the tar.gz and extract the necessary files
def download_and_extract(x):
    # read and unzip as a byte stream
    r = requests.get(x, stream=True)
    tar = tarfile.open(fileobj=r.raw, mode='r|gz')
    tar.extractall(members=select(tar))
    tar.close()

# parallel download and extract the 96 1GB tar.gz files
links = get_asset_links()

# 3 * cpu count seemed to be fastest on a 4 core cpu
with ThreadPoolExecutor(3 * mp.cpu_count()) as executor:
    executor.map(download_and_extract, links)
My current approach takes 20-30 minutes. I'm not sure what the theoretically possible speed-up is, but if it's helpful, the download speed for a single file is 20 MB/s in isolation.
If anyone could indulge my curiosity, that would be greatly appreciated! Some things I looked into were asyncio, aiohttp, aiomultiprocess, io.BytesIO, etc., but I wasn't able to get them to work well with the tarfile library.
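For example, this is the kind of io.BytesIO variant I had in mind (just a sketch, reusing select() from above): buffer each archive fully in memory so tarfile gets a seekable file object instead of the raw stream.
import io
import tarfile
import requests

def download_and_extract_buffered(url):
    # Buffer the whole ~1GB archive in RAM; tarfile can then seek freely.
    # Whether this beats streaming straight from r.raw is exactly what I'm
    # unsure about.
    buf = io.BytesIO(requests.get(url).content)
    with tarfile.open(fileobj=buf, mode='r:gz') as tar:
        tar.extractall(members=select(tar))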
Your computation is likely IO bound. Decompression is generally a slow task, especially with the gzip algorithm (newer algorithms can be much faster). From the information provided, the average download speed is about 70 MB/s (96 files of 1 GB in 20-30 minutes). If the archives decompress to roughly twice their compressed size, the storage throughput is then at least roughly 140 MB/s. That looks completely normal and expected, especially if you use an HDD or a slow SSD.
Besides this, it seems you iterate over the files twice due to the selection of members. Keep in mind that tar.gz files are one big block of files packed together and then compressed with gzip: to iterate over the filenames, the tar stream needs to be at least partially decompressed. This may not be a problem depending on the implementation of tarfile (possible caching). If the total size of the discarded files is small, it may be better to simply decompress the whole archive in one go and then delete the files to discard. Moreover, if you have a lot of memory and the total size of the discarded files is not small, you can extract everything to an in-memory virtual storage device first, so the discarded files never hit the real disk. This can be done natively on Linux systems.
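As a rough sketch of that last idea (assuming a Linux system where /dev/shm is a RAM-backed tmpfs mount; the destination directory is a placeholder), you could extract everything into RAM and only move the wanted files to real storage:
import os
import shutil
import tarfile
import requests

def download_and_extract_via_tmpfs(url, final_dir="extracted"):
    # RAM-backed scratch space on Linux; use one scratch dir per archive
    # if this is called concurrently
    tmp_dir = "/dev/shm/extract_tmp"
    os.makedirs(tmp_dir, exist_ok=True)
    os.makedirs(final_dir, exist_ok=True)
    r = requests.get(url, stream=True)
    with tarfile.open(fileobj=r.raw, mode='r|gz') as tar:
        tar.extractall(path=tmp_dir)   # decompress everything once, into RAM
    # keep only the wanted files; the rest never touches the real disk
    for root, _, files in os.walk(tmp_dir):
        for name in files:
            src = os.path.join(root, name)
            if name.endswith((".tif", ".img")):
                shutil.move(src, os.path.join(final_dir, name))
            else:
                os.remove(src)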
So I'm making a program that corrects stereo imbalance in an audio file. I'm using pysoundfile to read/write the files. The code looks something like this:
import soundfile as sf

data, rate = sf.read("Input.wav")
for d in data:
    ...  # process audio here
sf.write("Output.wav", data, rate, 'PCM_24')
The issue is that I'm working with DJ mixes that can be a couple of hours long, so loading the entire mix into RAM is causing the program to be killed.
My question is: how do I read/write the file in smaller sections instead of loading the entire thing?
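One block-wise direction (just a sketch, not tested; process() stands in for the actual balance correction) would be to stream the input with sf.blocks() and append each processed chunk to a SoundFile opened for writing, so only one block is in memory at a time:
import soundfile as sf

def process(block):
    # placeholder for the real stereo-balance correction
    return block

info = sf.info("Input.wav")
with sf.SoundFile("Output.wav", 'w', samplerate=info.samplerate,
                  channels=info.channels, subtype='PCM_24') as out:
    # read ~64k frames at a time instead of the whole mix
    for block in sf.blocks("Input.wav", blocksize=65536):
        out.write(process(block))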
I need to process about 200 folders, each containing 300 pictures (about 205 kB each), from an external HD.
I have the following loop running inside a thread.
import cv2

ffs = FileFrameStream(lFramePaths).start()
count = 0

# ___ while loop through the frames ___
image, path = ffs.read()
while ffs.more():  # while there are frames in the queue to read
    try:
        img = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # some more operations....
    except:
        print(f"Error in picture: {path}")
        image, path = ffs.read()
        count += 1
        continue
    image, path = ffs.read()
    count += 1

ffs.stop()
The code runs fast for 1 to 30-40 folders: one folder takes around 0.5 s, and 20 folders take about 13.2 s. But if I want to analyse all 200 folders, it takes 500-600 s. I don't know what I'm doing wrong or how I can increase the performance of the code.
I appreciate any help you can provide.
Eduardo
You are probably seeing the effects of your operating system's file cache. When processing 200 folders' worth of files, the cache may run out of capacity, and the actual streaming is then done directly from disk instead of from RAM.
Check whether your OS file cache capacity is smaller than the total size of the files, and whether it caches the external drive at all. Once the data stops fitting in the cache, performance drops sharply, unless the disk drive is nearly as fast as RAM (I guess it's not).
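For instance, a quick way to get that total size to compare against your available RAM (a sketch, assuming lFramePaths holds all the image paths from your question):
import os

# Sum the sizes of all images to see whether they could fit in the file cache.
total_bytes = sum(os.path.getsize(p) for p in lFramePaths)
print(f"Total data size: {total_bytes / 1024**3:.2f} GiB")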
I have a lot of files which add up to more than 125 GB. I want to compress them because they will be transferred over the internet quite often, so I'm thinking about zipping them before transfer and then decompressing them afterwards, but a colleague told me that it would have to fit in memory in order to do it using Python.
Is there a way to do it without using up all my memory? It is possible that the built-in zipfile module already avoids loading all data into memory (and that my colleague is mistaken), but I have not found any source with the answer.
so I'm thinking about zipping them before transfer and then decompressing them afterwards, but a colleague told me that it would have to fit in memory in order to do it using Python.
This is not true, and certainly not true if you control both ends (the zip format does technically allow for zip files that can't be stream-unzipped, but I have yet to actually see one).
You can use stream-zip and stream-unzip to do this (full disclosure: written mostly by me). Both avoid not only storing any zip file or member file in memory, but also avoid even having to have them on disk - random access is not required.
The details depend a bit on where the files are and how you want to transfer them, but here is an example of streaming a zip:
from datetime import datetime
from stream_zip import ZIP_64, stream_zip

def unzipped_files():
    modified_at = datetime.now()
    perms = 0o600

    def file_data():
        # An iterable that yields bytes of the file
        # e.g. could come from disk or an http request
        yield b'Some bytes 1'
        yield b'Some bytes 2'

    # ZIP_64 mode
    yield 'my-file-1.txt', modified_at, perms, ZIP_64, file_data()

# An iterable of bytes to then, for example, save to disk,
# or send via an http request
zipped_chunks = stream_zip(unzipped_files())
and an example to stream unzip:
from stream_unzip import stream_unzip

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks):
    # Can save the unzipped chunks to disk here instead of printing
    # them out
    for chunk in unzipped_chunks:
        print(chunk)
In the above examples zipped_chunks and unzipped_chunks are iterables that yield bytes. And to save any such iterables to disk, you can use a pattern like this:
with open('my.zip', 'wb') as f:
    for chunk in zipped_chunks:
        f.write(chunk)
You could use gzip instead, which more readily supports stream compression and decompression.
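For example, here is a minimal sketch using only the standard library (file names are placeholders; since gzip compresses a single byte stream, you would typically tar the files together first). shutil.copyfileobj copies in fixed-size chunks, so memory use stays small regardless of file size.
import gzip
import shutil

# Compress a large file as a stream
with open('big_input.tar', 'rb') as src, gzip.open('big_input.tar.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)

# Decompress it as a stream
with gzip.open('big_input.tar.gz', 'rb') as src, open('restored.tar', 'wb') as dst:
    shutil.copyfileobj(src, dst)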
I'm downloading daily 600MB netcdf-4 files that have this structure:
netcdf myfile {
dimensions:
        time_counter = 18 ;
        depth = 50 ;
        latitude = 361 ;
        longitude = 601 ;
variables:
        salinity
        temp, etc
I'm looking for a better way to convert the time_counter dimension from a fixed size (18) to an unlimited dimension.
I found a way of doing it with the netcdf commands and sed. Like this:
ncdump myfile.nc | sed -e "s#^.time_counter = 18 ;#time_counter = UNLIMITED ; // (currently 18)#" | ncgen -o myfileunlimited.nc
which worked for me for small files, but dumping a 600 MB netCDF file takes too much memory and time.
Does anybody know another method for accomplishing this?
Your answers are very insightful. I'm not really looking for a way to improve this ncdump-sed-ncgen method; I know that dumping a 600 MB netCDF file uses almost 5 times more space as a text file (CDL representation), and modifying some header text and then generating the netCDF file again doesn't feel very efficient.
I read the latest NCO commands documentation and found an option specific to ncks, "--mk_rec_dmn". Ncks mainly extracts and writes or appends data to a new netCDF file, so this seems the better approach: extract all the data of myfile.nc and write it with a new record (unlimited) dimension, which is what "--mk_rec_dmn" does, then replace the old file.
ncks --mk_rec_dmn time_counter myfile.nc -o myfileunlimited.nc ; mv myfileunlimited.nc myfile.nc
The opposite operation (record dimension back to fixed size) would be:
ncks --fix_rec_dmn time_counter myfile.nc -o myfilefixedsize.nc ; mv myfilefixedsize.nc myfile.nc
The shell pipeline can only be marginally improved by making the sed step only modify the beginning of the file and pass everything else through, but the expression you have is very cheap to process and will not make a dent in the time spent.
The core problem is likely that you're spending a lot of time in ncdump formatting the file information into textual data, and in ncgen parsing textual data into a NetCDF file format again.
Since the route through dump+gen is about as fast as it will get, that leaves using the NetCDF functionality itself to do the conversion of your data files.
If you're lucky, there may be tools that operate directly on your data files to do changes or conversions. If not, you may have to write them yourself with the NetCDF libraries.
If you're extremely unlucky, you may have to exploit the fact that NetCDF-4 files are HDF5 files with some extra metadata. In particular, the length of the dimensions is stored in the _netcdf_dim_info dataset in group _netCDF (or so the documentation tells me).
It may be possible to modify the information there to turn the current length of the time_counter dimension into the value for UNLIMITED (which is the number 0), but if you do this, you really need to verify the integrity of the resulting file, as the documentation neatly puts it:
"Note that modifying these files with HDF5 will almost certainly make them unreadable to netCDF-4."
As a side note, if this process is important to your group, it may be worth looking into what hardware could do the task faster. On my Bulldozer system, the process of converting a 78 megabyte file takes 20 seconds, using around 500 MB memory for ncgen working set (1 GB virtual) and 12 MB memory for ncdump working set (111 MB virtual), each task taking up the better part of a core.
Any decent disk should read/sink your files in 10 seconds or so, memory doesn't matter as long as you don't swap, so CPU is probably your primary concern if you take the dump+gen route.
If concurrent memory use is a big concern, you can trade some disk space for memory by saving the intermediate result from sed onto disk, which will likely take up to 1.5 gigabytes or so.
You can use the xarray Python package's to_netcdf() method and manage memory usage with Dask.
You just need to pass the names of the dimensions to make unlimited to the unlimited_dims argument and use chunks to split the data. For instance:
import xarray as xr
ds = xr.open_dataset('myfile.nc', chunks={'time_counter': 18})
ds.to_netcdf('myfileunlimited.nc', unlimited_dims={'time_counter':True})
There is a nice summary of combining Dask and xarray linked here.
I have a single gzip-compressed file (100 GB uncompressed, 40 GB compressed). Now I would like to modify some bytes / ranges of bytes - I do NOT want to change the file's size.
For example
Bytes 8-10 and bytes 5000-40000
Is this possible without recompressing the whole file?
Stefan
Whether you want to change the file's size makes no difference (since the resulting gzip isn't laid out according to the original file sizes anyway). But if you split the compressed file into parts, so that the parts you want to modify are in isolated chunks, and use a multiple-file compression format instead of the single-file gzip format, you could update just the changed files without decompressing and compressing the entire archive.
In your example:
bytes1-7.bin         \
bytes8-10.bin         \  bytes.zip
bytes11-4999.bin      /
bytes5000-40000.bin  /
Then you could update bytes8-10.bin and bytes5000-40000.bin but not the other two. But whether this will take less time is dubious.
In a word, no. It would be necessary to replace one or more deflate blocks with new blocks with exactly the same total number of bits, but with different contents. If the new data is less compressible with deflate, this becomes impossible. Even if it is more compressible, it would require a lot of bit twiddling by hand to try to get the bits to match. And it still might not be possible.
The man page for gzip says "If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip." I believe that means that gzip compression continues through the files, therefore is context-sensitive, and therefore will not permit what you want.
Either decompress/patch/recompress, or switch to a different representation of your data (perhaps an uncompressed tar or zip of individually compressed files, so you only have to decompress/recompress the one you want to change.) The latter will not store your data as compactly, in general, but that's the tradeoff you have to make.
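As a rough sketch of that second option (chunk file names are placeholders): store each chunk gzipped on its own inside an uncompressed tar, so changing one chunk only requires recompressing that chunk and rewriting the cheap, uncompressed container.
import gzip
import io
import tarfile

chunk_names = ['bytes1-7.bin', 'bytes8-10.bin',
               'bytes11-4999.bin', 'bytes5000-40000.bin']

# mode 'w' writes an uncompressed tar; each member is individually gzipped.
# gzip.compress reads a whole chunk into memory, which is fine for modest
# chunk sizes; stream the compression for very large chunks.
with tarfile.open('data.tar', 'w') as tar:
    for name in chunk_names:
        with open(name, 'rb') as f:
            compressed = gzip.compress(f.read())
        info = tarfile.TarInfo(name + '.gz')
        info.size = len(compressed)
        tar.addfile(info, io.BytesIO(compressed))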