Slicing MediaRecorder stream bytes received over a Python WebSocket gives FFmpeg's "Invalid data found when processing input" error - audio

I have a WebSocket server written in Python and a JS client.
The client uses .getUserMedia() to capture an audio-only stream, creates a MediaRecorder(stream), and in .ondataavailable sends each blob to the WebSocket as a message.
The server receives each message successfully with message = websocket.recv(), and I can write the received audio bytes to a WAV file by extending a list:
audio_bytes.extend(message)
if len(audio_bytes) > 20000:
    by = io.BytesIO(bytes(audio_bytes))
    sound = AudioSegment.from_file(by).export('new_sample.wav', format='wav')
Works fine.
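For context, a minimal sketch of this receive-and-export loop, assuming an async handler from the websockets library plus pydub (names here are illustrative):

import io
from pydub import AudioSegment

async def handler(websocket):                 # registered with websockets.serve(...)
    audio_bytes = []
    while True:
        message = await websocket.recv()      # one blob from MediaRecorder.ondataavailable
        audio_bytes.extend(message)
        if len(audio_bytes) > 20000:
            by = io.BytesIO(bytes(audio_bytes))
            AudioSegment.from_file(by).export('new_sample.wav', format='wav')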
However, I don't want to keep all the bytes in a list for the entire run of the script.
The problem appears as soon as the data no longer starts at the very beginning of the stream: if I clear the list and keep extending it with new messages, or if I slice the original list and try to write the slice to a WAV file using the same method as before, the conversion fails.
audio_bytes.extend(message)
if len(audio_bytes) > 20000:
    audio_bytes2 = audio_bytes[3000:12000]
    by = io.BytesIO(bytes(audio_bytes2))
    sound = AudioSegment.from_file(by).export('new_sample2.wav', format='wav')
I basically want to clear the list once its length exceeds 20000, load the exported chunk with y, sr = librosa.load('new_sample.wav'), do further analysis, and repeat.
I'm not very experienced in working with audio. I know the headers take up a certain number of bytes at the start (around 54?).
So I stored the first 1000 elements of the list from the first message received on the WebSocket in a header variable (header = audio_bytes[:1000]) and extended this header into the new list I created once the length of the received data exceeded 20000.
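In code, that attempt looks roughly like this (a sketch only; handle_message is a hypothetical callback standing in for wherever websocket.recv() delivers each message, and this is the variant that still fails):

import io
import librosa
from pydub import AudioSegment

audio_bytes = []
header = None

def handle_message(message):
    global audio_bytes, header
    if header is None:
        header = list(message[:1000])            # keep the first 1000 bytes as a "header"
    audio_bytes.extend(message)
    if len(audio_bytes) > 20000:
        by = io.BytesIO(bytes(audio_bytes))
        AudioSegment.from_file(by).export('new_sample.wav', format='wav')
        y, sr = librosa.load('new_sample.wav')   # further analysis goes here
        audio_bytes = list(header)               # start the next chunk with the stored header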
I tested this, and although the first 1000 bytes stayed the same, I was still receiving from FFmpeg:
Invalid data found when processing input
I have also written the bytes directly to a .txt file and loaded that file directly with librosa; again this works for the full stream, but if I chunk the stream and try to load a chunk, librosa gives the error
Format not recognised

Related

How to split an image file into chunks of n bytes to be sent to an api in Node.js?

I'm trying to upload an image to an API that requires it to be sent in chunks of n bytes at a time (The chunk size is dynamic and I get that earlier). The parameters for the request are the chunk index and the image payload. So if I have the file, how would I go about splitting it into n-byte chunks to send in an axios request? Thanks!
You can use the split-file npm package to do that.
There are lots of options to get a chunk of a file. Here are some of those options:
let stream = fs.createReadStream(filename, {start: firstByte, end: lastByte});
Then, you can pipe that stream to a response or attach your own data listener and as the bytes from the stream come in, you do something with them. The start and end options will automatically limit the stream to that particular chunk of the file.
You could also open the file, then read a specific chunk with fs.read() where you pass the position and length arguments.

Why does zlib decompression break after an http request is reinitated?

I have a python script that "streams" a very large gzip file using urllib3 and feeds it into a zlib.decompressobj. This zlib decompression object is configured to read gzip compression. If this initial http connection is interrupted then the zlib.decompressobj begins to throw errors after the connection is "resumed". See my source code below if you want to cut to the chase.
These errors occur despite the fact that the script initiates a new HTTP connection with a Range header specifying the number of bytes previously read, so that it resumes from the point reached when the connection was broken. I believe this arbitrary resume point is the source of my problem.
If I don't try to decompress the chunks of data being read in by urllib3 and instead just write them to a file, everything works just fine, even when there is an interruption: the completed archive is valid, it is the same size as one downloaded by a browser, and the MD5 hash of the .gz file matches one downloaded directly with Chrome.
On the other hand, if I try to decompress the chunks of data coming in after the interruption, even with the Range header specified, the zlib library throws all kinds of errors. The most recent was Error -3 while decompressing data: invalid block type
Additional Notes:
The site that I am using has the Accept-Ranges header set to bytes, meaning I am able to submit modified Range headers to the server.
I am not using the requests library in this script, since it ultimately wraps urllib3; I am using urllib3 directly in an attempt to cut out the middle man.
This script is an oversimplification of my ultimate goal, which is to stream the compressed data directly from where it is hosted, enrich it, and store it in a MySQL database on the local network.
I am heavily resource constrained inside of the docker container where this processing will occur.
The genesis of this question is present in a question I asked almost 3 weeks ago: requests.iter_content() thinks file is complete but it's not
The most common problem I am encountering with the urllib3 (and requests) library is the IncompleteRead(self._fp_bytes_read, self.length_remaining) error.
This error only appears if the urllib3 library has been patched to raise an exception when an incomplete read occurs.
My best guess:
I am guessing that the break in the data stream being fed to zlib.decompressobj is causing zlib to somehow lose context and start attempting to decompress the data again at an odd location. Sometimes it does resume, but the data stream comes out garbled, which makes me believe the byte offset used for the new Range header fell in the middle of some bytes that were then incorrectly interpreted as headers. I do not know how to counteract this and I have been trying to solve it for several weeks. The fact that the data are still valid when downloaded whole (without being decompressed before completion), even when an interruption occurs, makes me believe that some "loss of context" within zlib is the cause.
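To sanity-check this, here is a small standalone test (illustrative only, separate from the script below) showing that decompressobj is fine with arbitrary chunk boundaries as long as the stream stays contiguous:

import gzip
import zlib

# Toy check of the "loss of context" hypothesis: a gzip stream decompresses
# fine when every byte is fed to decompressobj exactly once and in order,
# regardless of chunk size.
payload = b'{"example": "record"}\n' * 5000
compressed = gzip.compress(payload)

d = zlib.decompressobj(zlib.MAX_WBITS | 16)
out = bytearray()
for start in range(0, len(compressed), 2048):
    out += d.decompress(compressed[start:start + 2048])
out += d.flush()
assert bytes(out) == payload
# If any bytes are repeated or skipped between chunks (as happens when a
# resumed request re-sends data already given to the decompressor), the
# deflate state no longer matches the input and zlib raises errors such as
# "Error -3 ... invalid block type".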
Source Code: (Has been updated to include a "buffer")
This code is a little bit slapped together so forgive me. Also, this target gzip file is quite a lot smaller than the actual file I will be using. Additionally, the target file in this example will no longer be available from Rapid7 in about a month's time. You may choose to substitute a different .gz file if that suits you.
import urllib3
import certifi
import inspect
import os
import time
import zlib

def patch_urllib3():
    """Set urllib3's enforce_content_length to True by default."""
    previous_init = urllib3.HTTPResponse.__init__
    def new_init(self, *args, **kwargs):
        previous_init(self, *args, enforce_content_length=True, **kwargs)
    urllib3.HTTPResponse.__init__ = new_init

#Patch the urllib3 module to throw an exception for IncompleteRead
patch_urllib3()

#Set the target URL
url = "https://opendata.rapid7.com/sonar.http/2021-11-27-1638020044-http_get_8899.json.gz"

#Set the local filename
local_filename = '2021-11-27-1638020044-http_get_8899_script.json.gz'

#Configure the PoolManager to handle https (I think...)
http = urllib3.PoolManager(ca_certs=certifi.where())

#Initiate start bytes at 0 then update as download occurs
sum_bytes_read = 0
session_bytes_read = 0
total_bytes_read = 0

#Dummy variable to silence console output from file write
writer = 0

#Set zlib window bits to 16 bits for gzip decompression
decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)

#Build a buffer list
buf_list = []
i = 0

while True:
    print("Building request. Bytes read:", total_bytes_read)
    resp = http.request(
        'GET',
        url,
        timeout=urllib3.Timeout(connect=15, read=40),
        preload_content=False)

    print("Setting headers.")
    #This header should cause the request to resume at "total_bytes_read"
    resp.headers['Range'] = 'bytes=%s' % (total_bytes_read)

    print("Local filename:", local_filename)
    #If file already exists then append to it
    if os.path.exists(local_filename):
        print("File already exists.")
        try:
            print("Starting appended download.")
            with open(local_filename, 'ab') as f:
                for chunk in resp.stream(2048):
                    buf_list.append(chunk)
                    #Use i to offset the chunk being read from the "buffer"
                    #I.E. load 3 chunks (0,1,2) in the buffer list before starting to read from it
                    if i > 2:
                        buffered_chunk = buf_list.pop(0)
                        writer = f.write(buffered_chunk)
                        #Comment out the below line to stop the error from occurring.
                        #File download should complete successfully even if interrupted when the following line is commented out.
                        decompressed_chunk = decompressor.decompress(buffered_chunk)
                    #Increment i so that the buffer list will fill before reading from it
                    i = i + 1
                    session_bytes_read = resp._fp_bytes_read
                    #Sum bytes read is an updated value that isn't stored. It is only used for console print
                    sum_bytes_read = total_bytes_read + session_bytes_read
                    print("[+] Bytes read:", str(format(sum_bytes_read, ",")), end='\r')
                print("\nAppended download complete.")
                break
        except Exception as e:
            print(e)
            #Add the current session bytes to the total bytes read each time the loop needs to repeat
            total_bytes_read = total_bytes_read + session_bytes_read
            print("Bytes Read:", total_bytes_read)
            #Mod the total_bytes back to the nearest chunk size so it can be re-requested
            total_bytes_read = total_bytes_read - (total_bytes_read % 2048) - 2048
            print("Rounded bytes Read:", total_bytes_read)
            #Pop the last entry off of the buffer since it may be incomplete
            buf_list.pop()
            #Reset i so that the buffer has to be rebuilt
            i = 0
            print("Sleeping for 30 seconds before re-attempt...")
            time.sleep(30)
    #If file doesn't already exist then write to it directly
    else:
        print("File does not exist.")
        try:
            print("Starting initial download.")
            with open(local_filename, 'wb') as f:
                for chunk in resp.stream(2048):
                    buf_list.append(chunk)
                    #Use i to offset the chunk being read from the "buffer"
                    #I.E. load 3 chunks (0,1,2) in the buffer list before starting to read from it
                    if i > 2:
                        buffered_chunk = buf_list.pop(0)
                        #print("Buffered Chunk", str(i-2), "-", buffered_chunk)
                        writer = f.write(buffered_chunk)
                        decompressed_chunk = decompressor.decompress(buffered_chunk)
                    #Increment i so that the buffer list will fill before reading from it
                    i = i + 1
                    session_bytes_read = resp._fp_bytes_read
                    print("[+] Bytes read:", str(format(session_bytes_read, ",")), end='\r')
                print("\nInitial download complete.")
                break
        except Exception as e:
            print(e)
            #Set the total bytes read equal to the session bytes since this is the first failure
            total_bytes_read = session_bytes_read
            print("Bytes Read:", total_bytes_read)
            #Mod the total_bytes back to the nearest chunk size so it can be re-requested
            total_bytes_read = total_bytes_read - (total_bytes_read % 2048) - 2048
            print("Rounded bytes Read:", total_bytes_read)
            #Pop the last entry off of the buffer since it may be incomplete
            buf_list.pop()
            #Reset i so that the buffer has to be rebuilt
            i = 0
            print("Sleeping for 30 seconds before re-attempt...")
            time.sleep(30)
    print("Looping...")

#Finish writing from buffer into file
#BE SURE TO SET TO "APPEND" with "ab" or you will overwrite the start of the file
f = open(local_filename, 'ab')
print("[+] Finishing write from buffer.")
while not len(buf_list) == 0:
    buffered_chunk = buf_list.pop(0)
    writer = f.write(buffered_chunk)
    decompressed_chunk = decompressor.decompress(buffered_chunk)

#Flush and close the file
f.flush()
f.close()
resp.release_conn()
Reproducing the error
To reproduce the error perform the following actions:
Run the script and let the download start
Be sure that the decompressed_chunk = decompressor.decompress(buffered_chunk) line is not commented out
Turn off your network connection until an exception is raised
Turn your network connection back on immediately.
If the decompressor.decompress(buffered_chunk) line is removed from the script then it will download the file, and the data can be successfully decompressed from the file itself. However, if that line is present and an interruption occurs, the zlib library will not be able to continue decompressing the data stream. I need to decompress the data stream as it arrives, because I cannot store the actual file I am trying to use.
Is there some way to prevent this from occurring? I have now attempted to add a "buffer" list that stores the chunks; after a failure the script discards the last chunk and moves back to a point in the file that precedes the "failed" chunk. I am able to re-establish the connection and even pull back all the data correctly, but even with the "buffer" my ability to decompress the stream is interrupted. Somehow I must not be recovering the data back into the buffer cleanly.
Visualization:
I put this together very quickly in an attempt to better describe what I am trying to do...
I bet Mark Adler is hiding out there somewhere...
r+b doesn't append. You would need to use ab for that. It appears that on the re-try, you are reading the entire gzip file again from the start. With r+b, that file is written correctly to your output file, by overwriting what was read before.
However, you are feeding the initial read to the decompressor, and then the start of the file again. Not surprisingly, the decompressor then soon detects invalid compressed data.
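For what it's worth, a minimal sketch of the resume pattern this answer implies might look like the following (illustrative only, not a drop-in fix; the URL is a placeholder and the retry policy is an assumption). The key points are that the Range header is supplied via the headers argument of http.request so it goes out with the request, and that every compressed byte reaches the decompressor exactly once:

import zlib
import urllib3
import certifi

# Sketch of a resumable download that keeps the deflate stream contiguous:
# each retry asks the server only for the bytes that have not been seen yet.
url = "https://example.com/archive.json.gz"            # placeholder URL
local_filename = "archive.json.gz"

http = urllib3.PoolManager(ca_certs=certifi.where())
decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
bytes_read = 0

while True:
    resp = http.request(
        'GET', url,
        headers={'Range': 'bytes=%d-' % bytes_read},    # sent with the request
        preload_content=False)
    try:
        with open(local_filename, 'ab') as f:           # 'ab' so retries append
            for chunk in resp.stream(2048):
                f.write(chunk)
                decompressor.decompress(chunk)          # contiguous, no repeats
                bytes_read += len(chunk)
        break                                           # finished cleanly
    except Exception as exc:
        print("Interrupted after", bytes_read, "bytes:", exc)
    finally:
        resp.release_conn()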

How to input audio as bytes in moviepy

I have audio as bytes in the form of:
b'ID3\x04\x00\x00\x00\x00\x00#TSSE\x00\x00\x00\x0f\x00\x00\x03Lavf57.71.100\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\...
That I got from Amazon web services:
import boto3
client = boto3.client('polly')
response = client.synthesize_speech(
Engine='neural',
LanguageCode='en-GB',
OutputFormat='mp3',
SampleRate='8000',
Text='hey whats up this is a test',
VoiceId='Brian'
)
And I want to feed it into a moviepy audio clip using
AudioFileClip()
AudioFileClip takes a filename or an array representing a sound. I know I can save the audio as a file and read it back, but I would like to have AudioFileClip take the bytes output I showed above.
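For reference, the save-to-a-file route I'm trying to avoid looks roughly like this (a sketch, assuming moviepy 1.x and the boto3 response above):

import tempfile
from moviepy.editor import AudioFileClip

audio_bytes = response['AudioStream'].read()
with tempfile.NamedTemporaryFile(suffix='.mp3', delete=False) as tmp:
    tmp.write(audio_bytes)

clip = AudioFileClip(tmp.name)   # works, but goes through the filesystem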
I tried:
AudioFileClip(response['AudioStream'].read())
But this gives the error:
TypeError: endswith first arg must be bytes or a tuple of bytes, not str
What can I do?
You need to convert the stream of audio to a different type (that's why it's called a TypeError). You are passing it as a string and it wants a bytes format.
You can convert a str to bytes by using the bytearray function:
https://docs.python.org/3/library/functions.html#func-bytearray
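For example (a tiny, generic illustration of that conversion; it is not specific to moviepy):

s = "hello"
b1 = bytearray(s, "utf-8")   # bytearray built from a str
b2 = s.encode("utf-8")       # the more common str -> bytes conversion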
You can also look at this question:
Best way to convert string to bytes in Python 3?
For more help just comment on this answer, and I'll try to help you as soon as possible.
Hope this can help you on your project,
PythonMasterLua

Win32api for serial communication in excel VBA

I have data acquisition devices I would like to pull information from. I've started a small project in Python 3.5, using pyserial to communicate with a device. I can send commands and receive data.
import serial
ser = serial.Serial()
ser.port = 'COM1'
ser.baudrate = 9600
ser.parity = serial.PARITY_NONE
ser.timeout=.5
ser.open()
ser.write(b'#02\r')
print(ser.readline())
ser.close()
This sends a command to retrieve data in the buffer, and when I use the readline command, I pull in data.
b'>-999999-999999-999999-999999 -999999\r'
I've created an Excel sheet to host the data tables and test criteria I am judging our machines' performance against. This was initially for manual user input, but I decided I'd try to see if I can automate this directly in Excel. I've pored over several webpages and found several companies that ask for payment for code, etc. I've finally settled on work done by The Scarms, which uses the Win32 API to deal with serial I/O instead of the original mscomm32.ocx driver.
I've been able to bring his files into my project, and used the sample code to start. I can send a message, and visually verify it from the device I'm communicating through, but I don't get any reply from my end data acquisition device.
strData = "#02\r"
lngSize = Len(strData)
lngStatus = CommWrite(intPortID, strData)
The variable strData is a string. When sending a message using pyserial, it's prefaced with "b", which (to my knowledge) signals that it should be sent as bytes over the serial port.
I've been trying to look through the modCOMM code that gets added to VBA from the code provided by the link above, but I can't seem to get an input at all. Am I sending the information incorrectly using the WIN32API?
How do I send this command over the bus properly in order to get a response from the end device?
The end device in question is an Advantech ADAM 4017+.

How to determine HTTP range request start & end bytes (nodejs + mongodb)

I wonder if it is possible to determine the start and end bytes of HTTP range requests, or to let the browser somehow know where to start and have it use some user-defined chunk size.
I have a file in my database and it is split into multiple chunks, each chunk is 2 MB.
eg. 20 MB file => 10 chunks
When the browser starts downloading the file (a video file) - I have studied Chrome - it first requests the byte range 'range=bytes 0-', and if the server successfully responds with the 'right' bytes and a 206 header, it then sends another request for the end bytes of the file, e.g. 'range=bytes 1900000-'.
It just checks whether your server responds well to the partial request.
On the server side I have coded my app so that it will send 2 MB partials if you ask it nicely :)
What I want the browser to do
range=bytes 0-'
range=bytes 2000000-4000000 bytes'
range=bytes 4000000-6000000 bytes'
But if you ask for a partial that doesn't fit in a 2 MB chunk, it will give an error, or it will just not play from the right position for an audio/video file.
range=bytes 2500000-4000000 bytes'
range=bytes 0-1000000 bytes'
These will give an error because I cannot start sending from part-way through a chunk. Otherwise I would have to slice my chunks and do some buffer operations, but I want to keep it clean.
If this is possible please let me know.
I am assuming that you are streaming an mp4 file? Different parts (boxes) of the mp4 have different purposes; it's not possible to jump to a random position in the file and start playing without first identifying the location of each frame by preloading the index (moov). The moov can be at the beginning or end of a file, so the browser MAY need the end of the file first. It can determine this by starting from the beginning and looking for the moov; if it is not at the start, there will be a pointer to the location of the next box. It can leapfrog through the file until it finds the index. Once the moov header is downloaded, the browser will know the EXACT byte offset and size of every single frame in the video, and can jump around the file as you seek. This is all possible because the browser knows how to parse mp4 natively. TL;DR: No, your solution will not work.
