Wrap an io.BufferedIOBase such that it becomes seek-able - python-3.x

I was trying to craft a response to a question about streaming audio from an HTTP server and then playing it with PyGame. I had the code mostly complete, but hit an error where the PyGame music functions tried to seek() on the urllib.HTTPResponse object.
According to the urllib docs, the urllib.HTTPResponse object (since v3.5) is an io.BufferedIOBase. I expected this would make the stream seek()able, however it does not.
Is there a way to wrap the io.BufferedIOBase such that it is smart enough to buffer enough data to handle the seek operation?
import pygame
import urllib.request
import io

# Window size
WINDOW_WIDTH = 400
WINDOW_HEIGHT = 400
# background colour
SKY_BLUE = (161, 255, 254)

### Begin the streaming of a file
### Return the urllib.HTTPResponse, a file-like-object
def openURL( url ):
    result = None
    try:
        http_response = urllib.request.urlopen( url )
        print( "streamHTTP() - Fetching URL [%s]" % ( http_response.geturl() ) )
        print( "streamHTTP() - Response Status [%d] / [%s]" % ( http_response.status, http_response.reason ) )
        result = http_response
    except:
        print( "streamHTTP() - Error Fetching URL [%s]" % ( url ) )
    return result

### MAIN
pygame.init()
window = pygame.display.set_mode( ( WINDOW_WIDTH, WINDOW_HEIGHT ) )
pygame.display.set_caption("Music Streamer")
clock = pygame.time.Clock()
done = False
while not done:
    # Handle user-input
    for event in pygame.event.get():
        if ( event.type == pygame.QUIT ):
            done = True

    # Keys
    keys = pygame.key.get_pressed()
    if ( keys[pygame.K_UP] ):
        if ( pygame.mixer.music.get_busy() ):
            print("busy")
        else:
            print("play")
            remote_music = openURL( 'http://127.0.0.1/example.wav' )
            if ( remote_music != None and remote_music.status == 200 ):
                pygame.mixer.music.load( io.BufferedReader( remote_music ) )
                pygame.mixer.music.play()

    # Re-draw the screen
    window.fill( SKY_BLUE )

    # Update the window, but not more than 60fps
    pygame.display.flip()
    clock.tick_busy_loop( 60 )

pygame.quit()
When this code runs, and Up is pushed, it fails with the error:
streamHTTP() - Fetching URL [http://127.0.0.1/example.wav]
streamHTTP() - Response Status [200] / [OK]
io.UnsupportedOperation: seek
io.UnsupportedOperation: File or stream is not seekable.
io.UnsupportedOperation: seek
io.UnsupportedOperation: File or stream is not seekable.
Traceback (most recent call last):
File "./sound_stream.py", line 57, in <module>
pygame.mixer.music.load( io.BufferedReader( remote_music ) )
pygame.error: Unknown WAVE format
I also tried re-opening the io stream, and various other re-implementations of the same sort of thing.

Seeking seeking
According to the urllib docs, the urllib.HTTPResponse object (since v3.5) is an io.BufferedIOBase. I expected this would make the stream seek()able, however it does not.
That's correct. The io.BufferedIOBase interface doesn't guarantee the I/O object is seekable. For HTTPResponse objects, IOBase.seekable() returns False:
>>> import urllib.request
>>> response = urllib.request.urlopen("http://httpbin.org/get")
>>> response
<http.client.HTTPResponse object at 0x110870ca0>
>>> response.seekable()
False
That's because the BufferedIOBase implementation offered by HTTPResponse is wrapping a socket object, and sockets are not seekable either.
You can't wrap a BufferedIOBase object in a BufferedReader object and add seeking support. The Buffered* wrapper objects can only wrap RawIOBase types, and they rely on the wrapped object to provide seeking support. You would have to emulate seeking at the raw I/O level, see below.
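To make that concrete, here is a minimal sketch that needs no network. OneWayRaw is a made-up stand-in for a socket-like raw stream (it is not part of urllib or http.client); wrapping it in a BufferedReader adds buffering, but seek() still fails because the raw object reports that it is not seekable:

```python
import io


class OneWayRaw(io.RawIOBase):
    """Hypothetical forward-only raw stream, standing in for a socket."""

    def __init__(self, payload: bytes) -> None:
        self._source = io.BytesIO(payload)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        # like a socket, this stream cannot be repositioned
        return False

    def readinto(self, buffer) -> int:
        return self._source.readinto(buffer)


def try_seek(stream) -> str:
    """Attempt to rewind the stream and report what happened."""
    try:
        stream.seek(0)
    except io.UnsupportedOperation as exc:
        return f"UnsupportedOperation: {exc}"
    return "seek succeeded"


wrapped = io.BufferedReader(OneWayRaw(b"RIFF....WAVE"))
print(wrapped.seekable())  # the wrapper simply asks the raw object
print(try_seek(wrapped))
```

Reading from the wrapper still works as usual; only repositioning is unavailable.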
You can still provide the same functionality at a higher level, but take into account that seeking on remote data is a lot more involved; this isn't a simple "change an OS variable that represents a file position on disk" operation. For larger remote file data, seeking without backing the whole file on disk locally could be as sophisticated as using HTTP range requests and local (in-memory or on-disk) buffers to balance sound playback performance against minimising local data storage. Doing this correctly for a wide range of use-cases can be a lot of effort, so it is certainly not part of the Python standard library.
If your sound files are small
If your HTTP-sourced sound files are small enough (a few MB at most) then just read the whole response into an in-memory io.BytesIO() file object. I really do not think it is worth making this more complicated than that, because the moment you have enough data to make that worth pursuing your files are large enough to take up too much memory!
So this would be more than enough if your sound files are smaller (no more than a few MB):
from io import BytesIO
import urllib.error
import urllib.request

def open_url(url):
    try:
        http_response = urllib.request.urlopen(url)
        print(f"streamHTTP() - Fetching URL [{http_response.geturl()}]")
        print(f"streamHTTP() - Response Status [{http_response.status}] / [{http_response.reason}]")
    except urllib.error.URLError:
        # note the f prefix, without it {url} is printed literally
        print(f"streamHTTP() - Error Fetching URL [{url}]")
        return

    if http_response.status != 200:
        print(f"streamHTTP() - Error Fetching URL [{url}]")
        return

    # copy the whole response body into a seekable in-memory file
    return BytesIO(http_response.read())
This doesn't require writing a wrapper object, and because BytesIO is a native implementation, once the data is fully copied over, access to the data is faster than any Python-code wrapper could ever give you.
Note that this returns a BytesIO file object, so you no longer need to test for the response status:
remote_music = open_url('http://127.0.0.1/example.wav')
if remote_music is not None:
    pygame.mixer.music.load(remote_music)
    pygame.mixer.music.play()
If they are more than a few MB
Once you go beyond a few megabytes, you could try pre-loading the data into a local file object. You can make this more sophisticated by using a thread to have shutil.copyfileobj() copy most of the data into that file in the background and give the file to PyGame after loading just an initial amount of data.
By using an actual file object, you can actually help performance here, as PyGame will try to minimize interjecting itself between the SDL mixer and the file data. If there is an actual file on disk with a file number (the OS-level identifier for a stream, something that the SDL mixer library can make use of), then PyGame will operate directly on that and so minimize blocking the GIL (which in turn will help the Python portions of your game perform better!). And if you pass in a filename (just a string), then PyGame gets out of the way entirely and leaves all file operations over to the SDL library.
Here's such an implementation; this should, on normal Python interpreter exit, clean up the downloaded files automatically. It returns a filename for PyGame to work on, and finalizing downloading the data is done in a thread after the initial few KB has been buffered. It will avoid loading the same URL more than once, and I've made it thread-safe:
import shutil
import urllib.error
import urllib.request
from tempfile import NamedTemporaryFile
from threading import Lock, Thread

INITIAL_BUFFER = 1024 * 8  # 8kb initial file read to start URL-backed files
_url_files_lock = Lock()
# stores open NamedTemporaryFile objects, keeping them 'alive'
# removing entries from here causes the file data to be deleted.
_url_files = {}


def open_url(url):
    with _url_files_lock:
        if url in _url_files:
            return _url_files[url].name

    try:
        http_response = urllib.request.urlopen(url)
        print(f"streamHTTP() - Fetching URL [{http_response.geturl()}]")
        print(f"streamHTTP() - Response Status [{http_response.status}] / [{http_response.reason}]")
    except urllib.error.URLError:
        # note the f prefix, without it {url} is printed literally
        print(f"streamHTTP() - Error Fetching URL [{url}]")
        return

    if http_response.status != 200:
        print(f"streamHTTP() - Error Fetching URL [{url}]")
        return

    fileobj = NamedTemporaryFile()

    content_length = http_response.getheader("Content-Length")
    if content_length is not None:
        try:
            content_length = int(content_length)
        except ValueError:
            content_length = None
        if content_length:
            # create sparse file of full length
            fileobj.seek(content_length - 1)
            fileobj.write(b"\0")
            fileobj.seek(0)

    fileobj.write(http_response.read(INITIAL_BUFFER))
    with _url_files_lock:
        if url in _url_files:
            # another thread raced us to this point, we lost, return their
            # result after cleaning up here
            fileobj.close()
            http_response.close()
            return _url_files[url].name

        # store the file object for this URL; this keeps the file
        # open and so readable if you have the filename.
        _url_files[url] = fileobj

    def copy_response_remainder():
        # copies file data from response to disk, for all data past INITIAL_BUFFER
        with http_response:
            shutil.copyfileobj(http_response, fileobj)

    t = Thread(daemon=True, target=copy_response_remainder)
    t.start()

    return fileobj.name
Like the BytesIO() solution, the above returns either None or a value ready to pass to pygame.mixer.music.load().
The above will probably not work if you try to immediately set an advanced playing position in your sound files, as later data may not yet have been copied into the file. It's a trade-off.
Seeking and finding third party libraries
If you need to have full seeking support on remote URLs and don't want to use on-disk space for them and don't want to have to worry about their size, you don't need to re-invent the HTTP-as-seekable-file wheel here. You could use an existing project that offers the same functionality. I found two that offer io.BufferedIOBase-based implementations:
smart_open
httpio
Both use HTTP Range requests to implement seeking support. Just use httpio.open(URL) or smart_open.open(URL) and pass that directly to pygame.mixer.music.load(); if the URL can't be opened, you can catch that by handling the IOError exception:
from smart_open import open as url_open  # or from httpio import open

try:
    remote_music = url_open('http://127.0.0.1/example.wav')
except IOError:
    pass
else:
    pygame.mixer.music.load(remote_music)
    pygame.mixer.music.play()
smart_open uses an in-memory buffer to satisfy reads of a fixed size, but creates a new HTTP Range request for every call to seek that changes the current file position, so performance may vary. Since the SDL mixer executes a few seeks on audio files to determine their type, I expect this to be a little slower.
httpio can buffer blocks of data and so might handle seeks better, but from a brief glance at the source code, when actually setting a buffer size the cached blocks are never evicted from memory again so you'd end up with the whole file in memory, eventually.
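As a rough illustration of what "one Range request per seek" means on the wire, the sketch below only builds the request object and sends nothing. The helper name range_request is made up for this example and is not taken from smart_open or httpio; the header format is the standard bytes=start-end form:

```python
from typing import Optional
from urllib.request import Request


def range_request(url: str, start: int, end: Optional[int] = None) -> Request:
    """Build a request for bytes start..end (inclusive); open-ended if end is None."""
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    last = "" if end is None else str(end)
    request = Request(url)
    request.add_header("Range", f"bytes={start}-{last}")
    return request


# seeking to offset 1024 and reading 1 KiB would translate to:
req = range_request("http://127.0.0.1/example.wav", 1024, 2047)
print(req.get_header("Range"))
```

A server that honours the header answers with status 206 and a Content-Range header describing the slice it actually served.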
Implementing seeking ourselves, via io.RawIOBase
And finally, because I'm not able to find efficient HTTP-Range-backed I/O implementations, I wrote my own. The following implements the io.RawIOBase interface, specifically so you can then wrap the object in an io.BufferedReader() and so delegate caching to a caching buffer that will be managed correctly when seeking:
import io
from copy import deepcopy
from functools import wraps
from typing import cast, overload, Callable, Optional, Tuple, TypeVar, Union
from urllib.request import urlopen, Request

T = TypeVar("T")

@overload
def _check_closed(_f: T) -> T: ...
@overload
def _check_closed(*, connect: bool, default: Union[bytes, int]) -> Callable[[T], T]: ...

def _check_closed(
    _f: Optional[T] = None,
    *,
    connect: bool = False,
    default: Optional[Union[bytes, int]] = None,
) -> Union[T, Callable[[T], T]]:
    def decorator(f: T) -> T:
        @wraps(cast(Callable, f))
        def wrapper(self, *args, **kwargs):
            if self.closed:
                raise ValueError("I/O operation on closed file.")
            # parenthesised so that methods with connect=False never
            # touch self._fp.closed while self._fp is None
            if connect and (self._fp is None or self._fp.closed):
                self._connect()
                if self._fp is None:
                    # outside the seekable range, exit early
                    return default
            try:
                return f(self, *args, **kwargs)
            except Exception:
                self.close()
                raise
            finally:
                if (
                    self._fp is not None
                    and self._range_end
                    and self._pos >= self._range_end
                ):
                    self._fp.close()
                    del self._fp

        return cast(T, wrapper)

    if _f is not None:
        return decorator(_f)

    return decorator

def _parse_content_range(
    content_range: str
) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Parse a Content-Range header into a (start, end, length) tuple"""
    units, *range_spec = content_range.split(None, 1)
    if units != "bytes" or not range_spec:
        return (None, None, None)
    start_end, _, size = range_spec[0].partition("/")
    try:
        length: Optional[int] = int(size)
    except ValueError:
        length = None
    start_val, has_start_end, end_val = start_end.partition("-")
    start = end = None
    if has_start_end:
        try:
            start, end = int(start_val), int(end_val)
        except ValueError:
            pass
    return (start, end, length)

class HTTPRawIO(io.RawIOBase):
    """Wrap a HTTP socket to handle seeking via HTTP Range"""

    url: str
    closed: bool = False
    _pos: int = 0
    _size: Optional[int] = None
    _range_end: Optional[int] = None
    _fp: Optional[io.RawIOBase] = None

    def __init__(self, url_or_request: Union[Request, str]) -> None:
        if isinstance(url_or_request, str):
            self._request = Request(url_or_request)
        else:
            # copy request objects to avoid sharing state
            self._request = deepcopy(url_or_request)
        self.url = self._request.full_url
        self._connect(initial=True)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def close(self) -> None:
        if self.closed:
            return
        if self._fp:
            self._fp.close()
            del self._fp
        self.closed = True

    @_check_closed
    def tell(self) -> int:
        return self._pos

    def _connect(self, initial: bool = False) -> None:
        if self._fp is not None:
            self._fp.close()
        if self._size is not None and self._pos >= self._size:
            # can't read past the end
            return
        request = self._request
        request.add_unredirected_header("Range", f"bytes={self._pos}-")
        response = urlopen(request)

        self.url = response.geturl()  # could have been redirected
        if response.status not in (200, 206):
            raise OSError(
                f"Failed to open {self.url}: "
                f"{response.status} ({response.reason})"
            )

        if initial:
            # verify that the server supports range requests. Capture the
            # content length if available
            if response.getheader("Accept-Ranges") != "bytes":
                raise OSError(
                    f"Resource doesn't support range requests: {self.url}"
                )
            try:
                length = int(response.getheader("Content-Length", ""))
                if length >= 0:
                    self._size = length
            except ValueError:
                pass

        # validate the range we are being served
        start, end, length = _parse_content_range(
            response.getheader("Content-Range", "")
        )
        if self._size is None:
            self._size = length
        if (start is not None and start != self._pos) or (
            length is not None and length != self._size
        ):
            # non-sensical range response
            raise OSError(
                f"Resource at {self.url} served invalid range: pos is "
                f"{self._pos}, range {start}-{end}/{length}"
            )
        if self._size and end is not None and end + 1 < self._size:
            # incomplete range, not reaching all the way to the end
            self._range_end = end
        else:
            self._range_end = None

        fp = cast(io.BufferedIOBase, response.fp)  # typeshed doesn't name fp
        self._fp = fp.detach()  # assume responsibility for the raw socket IO

    @_check_closed
    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        relative_to = {
            io.SEEK_SET: 0,
            io.SEEK_CUR: self._pos,
            io.SEEK_END: self._size,
        }.get(whence)
        if relative_to is None:
            if whence == io.SEEK_END:
                raise IOError(
                    f"Can't seek from end on unsized resource {self.url}"
                )
            raise ValueError(f"whence value {whence} unsupported")
        if -offset > relative_to:  # can't seek to a point before the start
            raise OSError(22, "Invalid argument")

        self._pos = relative_to + offset
        # there is no point in optimising an existing connection
        # by reading from it if seeking forward below some threshold.
        # Use a BufferedReader to avoid seeking by small amounts or by 0
        if self._fp:
            self._fp.close()
            del self._fp
        return self._pos

    # all read* methods delegate to the SocketIO object (itself a RawIO
    # implementation).

    @_check_closed(connect=True, default=b"")
    def read(self, size: int = -1) -> Optional[bytes]:
        assert self._fp is not None  # show type checkers we already checked
        res = self._fp.read(size)
        if res is not None:
            self._pos += len(res)
        return res

    @_check_closed(connect=True, default=b"")
    def readall(self) -> bytes:
        assert self._fp is not None  # show type checkers we already checked
        res = self._fp.readall()
        self._pos += len(res)
        return res

    @_check_closed(connect=True, default=0)
    def readinto(self, buffer: bytearray) -> Optional[int]:
        assert self._fp is not None  # show type checkers we already checked
        n = self._fp.readinto(buffer)
        self._pos += n or 0
        return n
Remember that this is a RawIOBase object, which you really want to wrap in a BufferedReader(). Doing so in open_url() looks like this:
def open_url(url, *args, **kwargs):
    return io.BufferedReader(HTTPRawIO(url), *args, **kwargs)
This gives you fully buffered I/O, with full seeking support, over a remote URL, and the BufferedReader implementation will minimise resetting the HTTP connection when seeking. I've found that when using this with the PyGame mixer, only a single HTTP connection is made, as all the test seeks are within the default 8KB buffer.
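The buffering effect can be observed without any HTTP involved. The sketch below uses a made-up CountingRaw class (an in-memory stand-in for HTTPRawIO, not part of the answer's code) that counts how often the buffered wrapper reaches through to the raw layer; on CPython, a seek back to a position that is still inside the buffer should not call the raw seek() at all:

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical seekable raw stream that counts raw-level calls."""

    def __init__(self, payload: bytes) -> None:
        self._source = io.BytesIO(payload)
        self.seek_calls = 0
        self.read_calls = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        # defined separately so position queries are not counted as seeks
        return self._source.tell()

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        self.seek_calls += 1
        return self._source.seek(offset, whence)

    def readinto(self, buffer) -> int:
        self.read_calls += 1
        return self._source.readinto(buffer)


raw = CountingRaw(bytes(range(256)) * 4)  # 1 KiB, well below the 8 KiB default buffer
buffered = io.BufferedReader(raw)

header = buffered.read(12)  # fills the buffer from the raw stream
buffered.seek(0)            # target is still inside the buffer
again = buffered.read(12)   # served from the buffer

print(raw.seek_calls, raw.read_calls)
```

With HTTPRawIO in place of CountingRaw, every avoided raw seek is an avoided reconnect.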

If you're fine with using the requests module (which supports streaming) instead of urllib, you could use a wrapper like this:
from io import BytesIO, SEEK_SET, SEEK_END

class ResponseStream(object):
    def __init__(self, request_iterator):
        self._bytes = BytesIO()
        self._iterator = request_iterator

    def _load_all(self):
        self._bytes.seek(0, SEEK_END)
        for chunk in self._iterator:
            self._bytes.write(chunk)

    def _load_until(self, goal_position):
        current_position = self._bytes.seek(0, SEEK_END)
        while current_position < goal_position:
            try:
                # write() returns the number of bytes written, so add it
                current_position += self._bytes.write(next(self._iterator))
            except StopIteration:
                break

    def tell(self):
        return self._bytes.tell()

    def read(self, size=None):
        left_off_at = self._bytes.tell()
        if size is None or size < 0:
            self._load_all()
        else:
            goal_position = left_off_at + size
            self._load_until(goal_position)

        self._bytes.seek(left_off_at)
        return self._bytes.read(size)

    def seek(self, position, whence=SEEK_SET):
        if whence == SEEK_END:
            # the end is only known once everything has been downloaded
            self._load_all()
        return self._bytes.seek(position, whence)
Then I guess you can do something like this:
import pygame
import requests
from threading import Thread

WINDOW_WIDTH = 400
WINDOW_HEIGHT = 400
SKY_BLUE = (161, 255, 254)
URL = 'http://localhost:8000/example.wav'

pygame.init()
window = pygame.display.set_mode( ( WINDOW_WIDTH, WINDOW_HEIGHT ) )
pygame.display.set_caption("Music Streamer")
clock = pygame.time.Clock()
done = False
font = pygame.font.SysFont(None, 32)
state = 0

def play_music():
    global state  # without this, the assignment below only sets a local
    response = requests.get(URL, stream=True)
    if (response.status_code == 200):
        stream = ResponseStream(response.iter_content(64))
        pygame.mixer.music.load(stream)
        pygame.mixer.music.play()
    else:
        state = 0

while not done:
    for event in pygame.event.get():
        if ( event.type == pygame.QUIT ):
            done = True
        if event.type == pygame.KEYDOWN and state == 0:
            Thread(target=play_music).start()
            state = 1

    window.fill( SKY_BLUE )
    window.blit(font.render(str(pygame.time.get_ticks()), True, (0,0,0)), (32, 32))
    pygame.display.flip()
    clock.tick_busy_loop( 60 )

pygame.quit()
This uses a Thread to start streaming.
I'm not sure this works 100%, but give it a try.

Related

How can I use BytesIO as a pandas.read_csv data source

I am trying to parse CSV data using pandas.read_csv(bytes, chunksize=n), where bytes is an ongoing stream of data which I want to receive from a database CLOB field, reading it in chunks.
reader = pandas.read_csv(io.BytesIO(b'1;qwer\n2;asdf\n3;zxcv'), sep=';', chunksize=2)
for row_chunk in reader:
    print(row_chunk)
The code above works fine, but I want to use some updatable stream instead of a fixed io.BytesIO(b'...').
I tried to redefine the read method like this:
class BlobIO(io.BytesIO):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._chunk_size = 4
        self._file_data_table = 'my_table'
        self._job_id = 'job_id'
        self._get_raw_sql = """
            select dbms_lob.substr(body, {0}, {1})
            from {2}
            where job_id = '{3}'
        """
        dsn_tns = cx_Oracle.makedsn('host', 'port', 'service_name')
        self.ora_con = cx_Oracle.connect('ora_user', 'ora_pass', dsn_tns)
        self.res = b''
        self.ora_cur = self.ora_con.cursor()
        self.chunker = self.get_chunk()
        next(self.chunker)

    def get_chunk(self):
        returned = 0
        sended = (yield)
        self._chunk_size = sended or self._chunk_size
        while True:
            to_exec = self._get_raw_sql.format(
                self._chunk_size,
                returned + 1,
                self._file_data_table,
                self._job_id)
            self.ora_cur.execute(to_exec)
            self.res = self.ora_cur.fetchall()[0][0]
            returned += self._chunk_size
            yield self.res
            sended = (yield self.res)
            self._chunk_size = sended or self._chunk_size
            if not self.res:
                break

    def read(self, nbytes=None):
        if nbytes:
            self.chunker.send(nbytes)
        else:
            self.chunker.send(self._chunk_size)
        try:
            to_return = next(self.chunker)
        except StopIteration:
            self.ora_con.close()
            to_return = b''
        return to_return

buffer = BlobIO()
reader = pandas.read_csv(buffer, encoding='cp1251', sep=';', chunksize=2)
but it looks like I'm doing something completely wrong because pd.read_csv never got executed here at the last line and I don't understand what is happening there.
Maybe creating buffer = BytesIO(b'') and then writing new data to the buffer buffer.write(new_chunk_from_db) could be a better approach but I don't understand when exactly should I call such a write action.
I believe I can create a temporary file with the contents of a CLOB which I can then pass to read_csv, but I really want to skip this step and read data directly from database.
Please give me some directions.
cx_Oracle provides a native way to read LOBs. It seems that overriding the BytesIO read method with the cx_Oracle LOB read does the job:
from io import BytesIO
import pandas

# db and JobContext are helpers from the answerer's own project (not shown)
class BlobIO(BytesIO):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.res = b''
        self.ora_con = db.get_conn()
        self.ora_cur = self.ora_con.cursor()
        self.ora_cur.execute("select lob from table")
        self.res = self.ora_cur.fetchall()[0][0]
        self.offset = 1

    def read(self, size=None):
        r = self.res.read(self.offset, size)
        self.offset += size
        # size + 1 should be here to perform non-overlapping reads,
        # but it looks like the pandas C parser uses some kind of overlapping,
        # because while testing size + 1 the parser occasionally missed some bytes
        if not r:
            self.ora_cur.close()
            self.ora_con.close()
        return r

blob_buffer = BlobIO()
reader = pandas.read_csv(
    blob_buffer,
    chunksize=JobContext.rchunk_size)
for row_chunk in reader:
    print(row_chunk)
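A more general variant of the same idea, sketched under assumptions: instead of subclassing BytesIO, implement a small read-only io.RawIOBase over any iterator of bytes chunks and let the standard buffering layers handle partial reads. IterStream and fake_lob_chunks are made-up names, the chunks stand in for pieces fetched from the CLOB, and the stdlib csv module is used here only to keep the sketch self-contained; since pandas.read_csv accepts file-like objects, the buffered stream could presumably be handed to it in the same way.

```python
import csv
import io
from typing import Iterable, Iterator


class IterStream(io.RawIOBase):
    """Hypothetical read-only raw stream over an iterator of bytes chunks."""

    def __init__(self, chunks: Iterable[bytes]) -> None:
        self._chunks: Iterator[bytes] = iter(chunks)
        self._leftover = b""

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # refill from the iterator, skipping empty chunks
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # signals EOF to the buffering layer
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def fake_lob_chunks() -> Iterator[bytes]:
    # stand-in for chunks read from the database; boundaries are arbitrary
    yield b"1;qw"
    yield b"er\n2;as"
    yield b"df\n3;zxcv"


binary = io.BufferedReader(IterStream(fake_lob_chunks()))
text = io.TextIOWrapper(binary, encoding="cp1251", newline="")
rows = list(csv.reader(text, delimiter=";"))
print(rows)
```

The chunk boundaries deliberately fall in the middle of fields to show that the consumer never sees them.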

Python Multiprocessing Pool Stops Abruptly

I am trying to perform parallel processing for my requirements, and the code seems to be working as expected for 4k-5k elements in parallel. But as soon as the elements to be processed start increasing, the code processes a few listings and then without throwing any error, the program stops running abruptly.
I checked and the program is not hung, RAM is available (I have 16 GB of RAM) and CPU utilization is not even 30%. I can't seem to figure out what is happening. I have 1 million elements to be processed.
def get_items_to_download():
    #iterator to fetch all items that are to be downloaded
    yield download_item

def start_download_process():
    multiproc_pool = multiprocessing.Pool(processes=10)
    for download_item in get_items_to_download():
        multiproc_pool.apply_async(start_processing, args = (download_item, ), callback = results_callback)

    multiproc_pool.close()
    multiproc_pool.join()

def start_processing(download_item):
    try:
        # Code to download item from web API
        # Code to perform some processing on the data
        # Code to update data into database
        return True
    except Exception as e:
        return False

def results_callback(result):
    print(result)

if __name__ == "__main__":
    start_download_process()
UPDATE -
Found the error- BrokenPipeError: [Errno 32] Broken pipe
Trace -
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/pool.py", line 125, in worker
put((job, i, result))
File "/usr/lib/python3.6/multiprocessing/queues.py", line 347, in put
self._writer.send_bytes(obj)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
The code looks correct. The only thing I can think of is that all of your processes are hanging, waiting for completion. Here is a suggestion: rather than using the callback mechanism provided by apply_async, use the AsyncResult object that is returned to get the return value from the process. You can call get on this object, specifying a timeout value (30 seconds, arbitrarily specified below, possibly not long enough). If the task has not completed in that duration, a timeout exception will be thrown (you could catch it, if you wish). This will test the hypothesis that the processes are hanging. Just be sure to specify a timeout value that is sufficiently large that the task should complete within that time period. I have also broken the task submissions up into batches of 1000, not because I think the size of 1,000,000 is a problem per se, but just so you don't have a list of 1,000,000 result objects. But if you find that you no longer hang as a result, then try increasing the batch size and see if it does make a difference.
import multiprocessing

def get_items_to_download():
    #iterator to fetch all items that are to be downloaded
    yield download_item

BATCH_SIZE = 1000

def start_download_process():
    with multiprocessing.Pool(processes=10) as multiproc_pool:
        results = []
        for download_item in get_items_to_download():
            results.append(multiproc_pool.apply_async(start_processing, args = (download_item, )))
            if len(results) == BATCH_SIZE:
                process_results(results)
                results = []
        if len(results):
            process_results(results)

def start_processing(download_item):
    try:
        # Code to download item from web API
        # Code to perform some processing on the data
        # Code to update data into database
        return True
    except Exception as e:
        return False

TIMEOUT_VALUE = 30 # or some suitable value

def process_results(results):
    for result in results:
        return_value = result.get(TIMEOUT_VALUE) # will cause an exception if process is hanging
        print(return_value)

if __name__ == "__main__":
    start_download_process()
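The batching logic above is interleaved with the submission loop. As a small sketch (the helper name batched is made up here and is not part of multiprocessing), the same grouping can be factored into a generator that never materialises more than one batch of items at a time:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# five items in batches of two; the last batch is shorter
for batch in batched(range(5), 2):
    print(batch)
```

Each yielded batch could then be submitted to the pool and its results collected before the next batch is pulled from the item generator.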
Update
Based on Googling several pages for your broken pipe error, it appears that your error could be the result of exhausting memory. See Python Multiprocessing: Broken Pipe exception after increasing Pool size, for example. The following reworking attempts to utilize less memory. If it works, you can then try to increase the batch size:
import multiprocessing

BATCH_SIZE = 1000
POOL_SIZE = 10

def get_items_to_download():
    #iterator to fetch all items that are to be downloaded
    yield download_item

def start_download_process():
    with multiprocessing.Pool(processes=POOL_SIZE) as multiproc_pool:
        items = []
        for download_item in get_items_to_download():
            items.append(download_item)
            if len(items) == BATCH_SIZE:
                process_items(multiproc_pool, items)
                items = []
        if len(items):
            process_items(multiproc_pool, items)

def start_processing(download_item):
    try:
        # Code to download item from web API
        # Code to perform some processing on the data
        # Code to update data into database
        return True
    except Exception as e:
        return False

def compute_chunksize(iterable_size):
    if iterable_size == 0:
        return 0
    chunksize, extra = divmod(iterable_size, POOL_SIZE * 4)
    if extra:
        chunksize += 1
    return chunksize

def process_items(multiproc_pool, items):
    chunksize = compute_chunksize(len(items))
    # you must iterate the iterable returned:
    for return_value in multiproc_pool.imap(start_processing, items, chunksize):
        print(return_value)

if __name__ == "__main__":
    start_download_process()
def get_items_to_download():
    #instead of yield, return the complete generator object to avoid iterating over this function.
    #Return type - generator (download_item1, download_item2...)
    return download_item

def start_download_process():
    download_item = get_items_to_download()
    # specify the chunksize to get faster results.
    with multiprocessing.Pool(processes=10) as pool:
        #map_async() is also available, if that's your use case.
        results= pool.map(start_processing, download_item, chunksize=XX )
    print(results)
    return(results)

def start_processing(download_item):
    try:
        # Code to download item from web API
        # Code to perform some processing on the data
        # Code to update data into database
        return True
    except Exception as e:
        return False

def results_callback(result):
    print(result)

if __name__ == "__main__":
    start_download_process()
I had the same experience with Python 3.8 on Linux. I set up a new environment with Python 3.7 and multiprocessing.Pool() works now without any issue.

Python Selector with FIFO running to infinite loop

I am trying to write some non-blocking FIFO code with kqueue on my BSD machine. Here's the small server code: server.py
import os
import selectors

sel = selectors.KqueueSelector()
TMP_PATH="/tmp/myfifo"

def fifo_read(fd, mask):
    data = os.read(fd, 8)
    print("fd:{} gives:{} \n".format(fd, data))
    sel.unregister(fd)
    print("unregistered")

def fifo_accept(listen_fd, mask):
    print("accepted {}".format(listen_fd))
    fd = os.dup(listen_fd)
    print("duped to {}".format(fd))
    sel.register(fd, selectors.EVENT_READ, fifo_read)

if __name__ == "__main__":
    try:
        os.unlink(TMP_PATH)
    except:
        pass
    os.mkfifo(TMP_PATH)
    listen_fd = os.open(TMP_PATH, os.O_RDONLY, mode=0o600)
    sel.register(listen_fd, selectors.EVENT_READ, fifo_accept)
    while True:
        events = sel.select()
        for key, mask in events:
            cb = key.data
            cb(key.fileobj, mask)
    sel.close()
Now, when I run a client.py as:
import os
TMP_PATH="/tmp/myfifo"
fd = os.open(TMP_PATH, os.O_WRONLY, mode=0o600)
res = os.write(fd, b"1234567")
print("sent {}".format(res))
When I run the client, I get:
sent 7
But the server runs into an infinite loop. Now I understand why the infinite loop is happening. I actually tried mimicking the socket way of using selectors in this Python Docs example.
Here's what I have tried:
I did try the code without duplicating the fd, but it still ends up in an infinite loop.
I tried calling sel.unregister on the original listen_fd, but in this case, running the client a second time doesn't work (which is expected).
Can anyone please let me know if I'm missing something?
So I found one solution to this problem. With sockets, we get a new socket object on accept. So we need to emulate that behaviour by calling unregister on the original fileobj, open again and call register on that.
Fixed code:
import os
import selectors

sel = selectors.KqueueSelector()
try:
    os.unlink("./myfifo")
except:
    pass
os.mkfifo("./myfifo", 0o600)

def cb(fp):
    sel.unregister(fp)
    print(f"got {fp.read()}")
    fp.close()
    fp2 = open("./myfifo", "rb")
    sel.register(fp2, selectors.EVENT_READ, cb)

if __name__ == "__main__":
    orig_fp = open("./myfifo", "rb")
    print("open done")
    ev = sel.register(orig_fp, selectors.EVENT_READ, cb)
    print(f"registration done for {ev}")
    while True:
        events = sel.select()
        print(events)
        for key, mask in events:
            key.data(key.fileobj)
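One likely reason the original server spins, shown as a small sketch rather than a diagnosis of that exact code: once every writer has closed its end, a pipe or FIFO stays readable forever, and each read returns b"" (end of file). A selector therefore keeps reporting the descriptor until it is unregistered or closed. The example uses an anonymous os.pipe() and selectors.DefaultSelector so it does not depend on kqueue; drain_pipe_events is a made-up helper name.

```python
import os
import selectors
from typing import List


def drain_pipe_events(payload: bytes, max_rounds: int = 5) -> List[bytes]:
    """Write payload to a pipe, close the writer, and record what each wake-up reads."""
    read_fd, write_fd = os.pipe()
    sel = selectors.DefaultSelector()
    sel.register(read_fd, selectors.EVENT_READ)

    os.write(write_fd, payload)
    os.close(write_fd)  # the only writer goes away, like the client exiting

    reads = []
    for _ in range(max_rounds):
        if not sel.select(timeout=1):
            break
        chunk = os.read(read_fd, 1024)
        reads.append(chunk)
        if not chunk:
            # EOF: without this check the loop would wake up again and again
            break

    sel.unregister(read_fd)
    sel.close()
    os.close(read_fd)
    return reads


print(drain_pipe_events(b"1234567"))
```

Treating an empty read as "the writer is gone" and then unregistering, closing, and reopening the FIFO is what the fixed code above does.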

Create sub stream in python

How can I create a "sub stream" in Python? Let's say I have a file opened for reading. I want to return a file-like object that you can use to read only part of that file.
with open(filename, 'rb') as f:
    start = 0x34
    size = 0x20
    return Substream(f, start, size) # <-- How do I do this?
Seeking to 0 on this object should go to "start" on the f object. Furthermore, reading past size should trigger EOF behavior. Hope this makes sense. How do I accomplish this?
A quick subclass of io.RawIOBase seems to do the trick, at least for my use case. I understand this is not a full implementation of the io.RawIOBase interface, but it gets the job done.
import io

class Substream(io.RawIOBase):
    """Represents a view of a subset of a file-like object"""
    def __init__(self, file: io.RawIOBase, start, size):
        self.file = file
        self.start = start
        self.size = size
        self.p = 0  # current position, relative to start

    def seek(self, offset, origin=0):
        if origin == 0:
            self.p = offset
        elif origin == 1:
            self.p += offset
        # TODO: origin == 2
        else:
            raise ValueError("Unexpected origin: {}".format(origin))
        return self.p  # seek() is expected to return the new position

    def read(self, n):
        # remember the underlying position so the parent file is left untouched
        prev = self.file.tell()
        self.file.seek(self.start + self.p)
        data = self.file.read(n if self.p + n <= self.size else self.size - self.p)
        self.p += len(data)
        self.file.seek(prev)
        return data
Use it like so:
with open(filename, 'rb') as f:
    print(Substream(f, 10, 100).read(10))
I wonder if this can be done on file descriptor level instead somehow...?
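On the file descriptor question: on POSIX systems, os.pread reads at an absolute offset without moving the descriptor's own position, so the tell()/seek() save-and-restore step is not needed. A minimal sketch, assuming a Unix-like platform; FdSubstream is a hypothetical name, not something from the answer above:

```python
import io
import os


class FdSubstream(io.RawIOBase):
    """Read-only window [start, start + size) over a file descriptor.

    Hypothetical helper; relies on os.pread, which is Unix-only.
    """

    def __init__(self, fd, start, size):
        self.fd = fd
        self.start = start
        self.size = size
        self.pos = 0  # position relative to start

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self.pos

    def seek(self, offset, whence=os.SEEK_SET):
        if whence == os.SEEK_SET:
            new_pos = offset
        elif whence == os.SEEK_CUR:
            new_pos = self.pos + offset
        elif whence == os.SEEK_END:
            new_pos = self.size + offset
        else:
            raise ValueError("Unexpected whence: {}".format(whence))
        if new_pos < 0:
            raise ValueError("negative seek position")
        self.pos = new_pos
        return self.pos

    def readinto(self, buffer):
        # Clamp the request to what is left in the window.
        remaining = max(self.size - self.pos, 0)
        n = min(len(buffer), remaining)
        if n == 0:
            return 0  # end of the window behaves like EOF
        data = os.pread(self.fd, n, self.start + self.pos)
        buffer[:len(data)] = data
        self.pos += len(data)
        return len(data)
```

Because only readinto() is implemented, read() and readall() come from io.RawIOBase, and the object can be wrapped in io.BufferedReader if buffering is wanted. The descriptor is not closed by this class; the caller keeps ownership of it.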

Pygame sound file just beeping [duplicate]

I tried pygame for playing wav file like this:
import pygame
pygame.init()
pygame.mixer.music.load("mysound.wav")
pygame.mixer.music.play()
pygame.event.wait()
but it changes the sound, and I don't know why!
I read the solutions at this link, but they did not solve my problem with playing the WAV file.
For this solution, I don't know what I should import:
s = Sound()
s.read('sound.wav')
s.play()
And for this solution, /dev/dsp doesn't exist in newer versions of Linux:
import ossaudiodev
from sys import byteorder
from wave import open as waveOpen
from ossaudiodev import open as ossOpen
s = waveOpen('tada.wav','rb')
(nc,sw,fr,nf,comptype, compname) = s.getparams( )
dsp = ossOpen('/dev/dsp','w')
try:
    from ossaudiodev import AFMT_S16_NE
except ImportError:
    # Fall back to the format that matches the machine's byte order
    if byteorder == "little":
        AFMT_S16_NE = ossaudiodev.AFMT_S16_LE
    else:
        AFMT_S16_NE = ossaudiodev.AFMT_S16_BE
dsp.setparameters(AFMT_S16_NE, nc, fr)
data = s.readframes(nf)
s.close()
dsp.write(data)
dsp.close()
And when I tried pyglet, it gave me this error:
import pyglet
music = pyglet.resource.media('mysound.wav')
music.play()
pyglet.app.run()
--------------------------
nima#ca005 Desktop]$ python play.py
Traceback (most recent call last):
File "play.py", line 4, in <module>
music = pyglet.resource.media('mysound.wav')
File "/usr/lib/python2.7/site-packages/pyglet/resource.py", line 587, in media
return media.load(path, streaming=streaming)
File "/usr/lib/python2.7/site-packages/pyglet/media/__init__.py", line 1386, in load
source = _source_class(filename, file)
File "/usr/lib/python2.7/site-packages/pyglet/media/riff.py", line 194, in __init__
format = wave_form.get_format_chunk()
File "/usr/lib/python2.7/site-packages/pyglet/media/riff.py", line 174, in get_format_chunk
for chunk in self.get_chunks():
File "/usr/lib/python2.7/site-packages/pyglet/media/riff.py", line 110, in get_chunks
chunk = cls(self.file, name, length, offset)
File "/usr/lib/python2.7/site-packages/pyglet/media/riff.py", line 155, in __init__
raise RIFFFormatException('Size of format chunk is incorrect.')
pyglet.media.riff.RIFFFormatException: Size of format chunk is incorrect.
AL lib: ReleaseALC: 1 device not closed
You can use PyAudio. Here is an example that works on my Linux machine:
#!/usr/bin/env python
#coding=utf-8
import pyaudio
import wave
#define stream chunk
chunk = 1024
#open a wav format music
f = wave.open(r"/usr/share/sounds/alsa/Rear_Center.wav","rb")
#instantiate PyAudio
p = pyaudio.PyAudio()
#open stream
stream = p.open(format = p.get_format_from_width(f.getsampwidth()),
                channels = f.getnchannels(),
                rate = f.getframerate(),
                output = True)
#read data
data = f.readframes(chunk)
#play stream
while data:
    stream.write(data)
    data = f.readframes(chunk)
#stop stream
stream.stop_stream()
stream.close()
#close PyAudio
p.terminate()
Works for me on Windows:
https://pypi.org/project/playsound/
>>> from playsound import playsound
>>> playsound('/path/to/a/sound/file/you/want/to/play.wav')
NOTE: This has a bug in Windows where it doesn't close the stream.
I've added a PR for a fix here:
https://github.com/TaylorSMarks/playsound/pull/53/commits/53240d970aef483b38fc6d364a0ae0ad6f8bf9a0
The reason pygame changes your audio is that the mixer defaults to a 22050 Hz sample rate:
initialize the mixer module
pygame.mixer.init(frequency=22050, size=-16, channels=2, buffer=4096): return None
Your WAV file was probably recorded at a lower rate, such as 8 kHz, so pygame plays it back much faster than intended. Pass your file's actual sample rate as the frequency argument to pygame.mixer.init().
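One way to avoid guessing the rate is to read it from the file header with the standard wave module. A minimal sketch; mixer_settings_for is a hypothetical helper name, and it assumes an uncompressed PCM WAV with 8-bit or 16-bit samples:

```python
import wave


def mixer_settings_for(path_or_file):
    """Return pygame.mixer.init() keyword arguments matching a PCM WAV file."""
    with wave.open(path_or_file, "rb") as wav:
        sample_width = wav.getsampwidth()  # bytes per sample
        return {
            "frequency": wav.getframerate(),
            # WAV stores 8-bit samples unsigned and wider samples signed;
            # pygame marks signed sample sizes with a negative bit count.
            "size": 8 if sample_width == 1 else -8 * sample_width,
            "channels": wav.getnchannels(),
        }
```

The result can then be passed on before loading the file, for example pygame.mixer.init(**mixer_settings_for("mysound.wav")).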
Pyglet has some problems correctly reading RIFF headers. If you have a very basic wav file (with exactly a 16 byte fmt block) with no other information in the fmt chunk (like 'fact' data), it works. But it makes no provision for additional data in the chunks, so it's really not adhering to the RIFF interface specification.
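To check whether a file has the plain 16-byte fmt chunk described above, the top-level chunks can be listed with the standard library alone. A minimal sketch; list_riff_chunks is a hypothetical helper, and it assumes the whole file fits in memory:

```python
import struct


def list_riff_chunks(data):
    """Return [(chunk_id, chunk_size), ...] for the top-level chunks of a WAV file."""
    riff, _total_size, wave_id = struct.unpack_from("<4sI4s", data, 0)
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    chunks = []
    pos = 12  # first chunk header follows the 12-byte RIFF header
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        chunks.append((chunk_id, size))
        # Chunk bodies are padded to an even number of bytes.
        pos += 8 + size + (size & 1)
    return chunks
```

Calling it on open("mysound.wav", "rb").read() shows the size of the b"fmt " chunk and any extra chunks that sit between it and b"data".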
Pygame has two different modules for playing sound and music: pygame.mixer and pygame.mixer.music. The pygame.mixer module contains classes for loading Sound objects and controlling playback. The difference is explained in the documentation:
The difference between the music playback and regular Sound playback is that the music is streamed, and never actually loaded all at once. The mixer system only supports a single music stream at once.
If you want to play a single wav file, you have to initialize the module and create a pygame.mixer.Sound() object from the file. Invoke play() to start playing the file. Finally, you have to wait for the file to play.
Use get_length() to get the length of the sound in seconds and wait for the sound to finish:
(The argument to pygame.time.wait() is in milliseconds)
import pygame
pygame.mixer.init()
my_sound = pygame.mixer.Sound('mysound.wav')
my_sound.play()
pygame.time.wait(int(my_sound.get_length() * 1000))
Alternatively you can use pygame.mixer.get_busy to test if a sound is being mixed. Query the status of the mixer continuously in a loop:
import pygame
pygame.init()
pygame.mixer.init()
my_sound = pygame.mixer.Sound('mysound.wav')
my_sound.play()
while pygame.mixer.get_busy():
    pygame.time.delay(10)
    pygame.event.poll()
Windows
winsound
If you are a Windows user, the easiest way is to use winsound. You don't even need to install it.
It is not recommended in general, because it offers very few features.
import winsound
winsound.PlaySound("Wet Hands.wav", winsound.SND_FILENAME)
# add the winsound.SND_ASYNC flag if you do not want to wait for playback to finish,
# like winsound.PlaySound("Wet Hands.wav", winsound.SND_FILENAME | winsound.SND_ASYNC)
mp3play
If you are looking for more advanced features, you can try mp3play.
Unfortunately, mp3play is only available for Python 2 on Windows.
If you need other platforms, use playsound despite its limited features. If you need Python 3, a modified version that runs on Python 3 is given at the bottom of this answer.
Also, mp3play is really good at playing wave files, and it gives you more options.
import time
import mp3play
music = mp3play.load("Wet Hands.wav")
music.play()
time.sleep(music.seconds())
Cross-platform
playsound
Playsound is very easy to use, but it is not recommended because you can't pause playback or get any information about the music, and errors often occur. Try it only if the other ways don't work at all.
import playsound
playsound.playsound("Wet Hands.wav", block=True)
pygame
I tested this code on Ubuntu 22.04 and it works.
If it doesn't work on your machine, consider updating your pygame library.
import pygame
pygame.mixer.init()
pygame.mixer.music.load("Wet Hands.wav")
pygame.mixer.music.play()
while pygame.mixer.music.get_busy():
    pass
pyglet
This works on Windows but not on my Ubuntu machine, and I could not find a fix for that.
import pyglet
import time
sound = pyglet.media.load("Wet Hands.wav", "Wet Hands.wav")
sound.play()
time.sleep(sound.duration)
Conclusion
It seems that you are using Linux, so playsound may be your choice. My pygame and pyglet code may not solve your problem, because I always use Windows. If none of the solutions work on your machine, I suggest you run the program on Windows...
To other users seeing my answer: I have tested many libraries, so if you are using Windows, you may try mp3play, which can play both mp3 and wave files. In my experience it is the most pythonic, easy, lightweight and functional of them.
mp3play in Python3
Create a file named mp3play.py in your working directory and paste the code below into it.
import random
from ctypes import windll, c_buffer
class _mci:
    def __init__(self):
        self.w32mci = windll.winmm.mciSendStringA
        self.w32mcierror = windll.winmm.mciGetErrorStringA
    def send(self, command):
        buffer = c_buffer(255)
        command = command.encode(encoding="utf-8")
        errorcode = self.w32mci(command, buffer, 254, 0)
        if errorcode:
            return errorcode, self.get_error(errorcode)
        else:
            # Decode to str so callers can compare against 'playing' / 'paused'
            return errorcode, buffer.value.decode(errors="replace")
    def get_error(self, error):
        error = int(error)
        buffer = c_buffer(255)
        self.w32mcierror(error, buffer, 254)
        return buffer.value
    def directsend(self, txt):
        (err, buf) = self.send(txt)
        # if err != 0:
        #     print('Error %s for "%s": %s' % (str(err), txt, buf))
        return err, buf
class _AudioClip(object):
    def __init__(self, filename):
        filename = filename.replace('/', '\\')
        self.filename = filename
        self._alias = 'mp3_%s' % str(random.random())
        self._mci = _mci()
        self._mci.directsend(r'open "%s" alias %s' % (filename, self._alias))
        self._mci.directsend('set %s time format milliseconds' % self._alias)
        err, buf = self._mci.directsend('status %s length' % self._alias)
        self._length_ms = int(buf)
    def volume(self, level):
        """Sets the volume between 0 and 100."""
        self._mci.directsend('setaudio %s volume to %d' %
                             (self._alias, level * 10))
    def play(self, start_ms=None, end_ms=None):
        start_ms = 0 if not start_ms else start_ms
        end_ms = self.milliseconds() if not end_ms else end_ms
        err, buf = self._mci.directsend('play %s from %d to %d'
                                        % (self._alias, start_ms, end_ms))
    def isplaying(self):
        return self._mode() == 'playing'
    def _mode(self):
        err, buf = self._mci.directsend('status %s mode' % self._alias)
        return buf
    def pause(self):
        self._mci.directsend('pause %s' % self._alias)
    def unpause(self):
        self._mci.directsend('resume %s' % self._alias)
    def ispaused(self):
        return self._mode() == 'paused'
    def stop(self):
        self._mci.directsend('stop %s' % self._alias)
        self._mci.directsend('seek %s to start' % self._alias)
    def milliseconds(self):
        return self._length_ms
    def __del__(self):
        self._mci.directsend('close %s' % self._alias)
_PlatformSpecificAudioClip = _AudioClip
class AudioClip(object):
    __slots__ = ['_clip']
    def __init__(self, filename):
        self._clip = _PlatformSpecificAudioClip(filename)
    def play(self, start_ms=None, end_ms=None):
        # Comparing against None raises TypeError on Python 3, so check both
        if start_ms is not None and end_ms is not None and end_ms < start_ms:
            return
        else:
            return self._clip.play(start_ms, end_ms)
    def volume(self, level):
        assert 0 <= level <= 100
        return self._clip.volume(level)
    def isplaying(self):
        return self._clip.isplaying()
    def pause(self):
        return self._clip.pause()
    def unpause(self):
        return self._clip.unpause()
    def ispaused(self):
        return self._clip.ispaused()
    def stop(self):
        return self._clip.stop()
    def seconds(self):
        return int(round(float(self.milliseconds()) / 1000))
    def milliseconds(self):
        return self._clip.milliseconds()
def load(filename):
    """Return an AudioClip for the given filename."""
    return AudioClip(filename)

Resources