pyav - cannot save stream as mono - audio

I'm trying to use PyAV to convert an arbitrary audio file to a low-quality, mono WAV file.
I almost managed it, but the output is stereo, and I couldn't find out how to make it mono. Furthermore, I think I made some mistake here, as I had to repeat the rate in both output_container.add_stream and the AudioResampler - it seems redundant, and I can't understand what would happen if those numbers didn't match.
My code is:
import av
input_file = 'some.mp3'
output_file = 'new.wav'
rate = 22000
output_container = av.open(output_file, 'w')
# can I tell `output_stream` to just use `resampler`'s info?
# or, if not, how can I tell it to have only 1 channel?
output_stream = output_container.add_stream('pcm_u8', rate)
resampler = av.audio.resampler.AudioResampler('u8p', 'mono', rate)
input_container = av.open(input_file)
for frame in input_container.decode(audio=0):
    out_frames = resampler.resample(frame)
    for out_frame in out_frames:
        for packet in output_stream.encode(out_frame):
            output_container.mux(packet)
output_container.close()
And not related to my main question, but any comments on my code, or corrections of mistakes, are welcome. I could hardly find any usage examples to use as a reference, and the PyAV API documentation isn't very detailed...

Searching around on Stack Overflow, I found https://stackoverflow.com/a/72386137/1543290 which has:
out_stream = out_container.add_stream(
    'pcm_s16le',
    rate=sample_rate,
    layout='mono'
)
So, the answer is adding layout='mono'.
Sadly, this parameter is not documented.
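Putting the two pieces together, the script becomes something like the following (treat it as an untested sketch: the final encode(None) flush and closing the input container are extra steps I added that follow the usual PyAV encode pattern, not something from the linked answer):

import av

input_file = 'some.mp3'
output_file = 'new.wav'
rate = 22000

output_container = av.open(output_file, 'w')
# layout='mono' makes the output stream single-channel
output_stream = output_container.add_stream('pcm_u8', rate=rate, layout='mono')
resampler = av.audio.resampler.AudioResampler(format='u8p', layout='mono', rate=rate)

input_container = av.open(input_file)
for frame in input_container.decode(audio=0):
    for out_frame in resampler.resample(frame):
        for packet in output_stream.encode(out_frame):
            output_container.mux(packet)

# flush any packets still buffered in the encoder
for packet in output_stream.encode(None):
    output_container.mux(packet)

output_container.close()
input_container.close()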

Related

How do I detect certain things that I said in a speech recognizer script

I am trying to make a voice-activated virtual assistant of sorts using Python, but I am not sure how to detect and distinguish between different voice commands. Currently it just repeats back to you, "You said [whatever I said]", but I want it to respond differently to different things that I say. I am quite new to Python and don't know what I should do. Does anyone know how I could do this?
You have to define what you want it to do. The last two lines of the code below tell the program to do something if the input is "hello": when you run it and say "hello", you get a different response; if it does not detect that you said "hello", it does nothing. I would recommend finding a project on GitHub where someone has already built an assistant like this, trying to understand what they did, and editing it to your own specifications.
import speech_recognition as sr
sample_rate = 48000
chunk_size = 2048
r = sr.Recognizer()
device_id = 1
with sr.Microphone(device_index=device_id, sample_rate=sample_rate, chunk_size=chunk_size) as source:
print("Say something...")
r.adjust_for_ambient_noise(source)
audio = r.listen(source)
text = r.recognize_google(audio)
if text.lower() == "hello":
print("Hi, how are you?")

Using lame from within SoxSharp

I've always used this command line to create an MP3 with a bit rate of 32 kbit/s and a sample rate of 22050 Hz:
"lame -b 32 --resample 22050 input.wav output.mp3"
Now I want to use SoxSharp for that, and since it has an MP3 option and uses libmp3lame.dll, I guess it should work.
However, I'm unable to figure out the right parameters.
The available parameters for the mp3 output are listed below.
Using nSox As Sox = New Sox("d:\dev\projects\sox-14-4-0\sox.exe")
    nSox.Output.Type = FileType.MP3
    nSox.Output.SampleRate = ...  ' I guess that would be 22050 in my case?
    nSox.Output.Channels = 1      ' yep, I want mono
    nSox.Output.Encoding = ...    ' not sure what to make of it
    nSox.Output.SampleSize = ...  ' not sure what to make of it
    nSox.Output.ByteOrder = ...   ' I guess I shouldn't touch that
    nSox.Output.ReverseBits = ... ' I guess I shouldn't touch that
    nSox.Output.Compression = ... ' absolutely not sure what I should choose here
    nSox.Process("input.wav", "output.mp3")
End Using
Does anybody see where I should insert my "32"? And is .SampleRate = 22050 correct in my case? The Windows file properties dialogue doesn't give me any real hints as to whether I'm doing it correctly, and Audacity converts the audio to the format of my project.
Thank you very much for the help!
Looking into the source code of SoxSharp, it can't handle even the most basic lame options out of the box. Basically everything has to be put in the "CustomArguments" property.

Adding watermark to video

I am able to use the moviepy library to add a watermark to a section of video. However, when I do this it takes the watermarked segment and creates a new file from it. I am trying to figure out whether it is possible to simply splice the edited part back into the original video, because MoviePy is EXTREMELY slow at writing to disk, so the smaller the segment the better.
I was thinking maybe using shutil?
import moviepy.editor as mp

video = mp.VideoFileClip("C:\\Users\\admin\\Desktop\\Test\\demovideo.mp4").subclip(10, 20)

logo = (mp.ImageClip("C:\\Users\\admin\\Desktop\\Watermark\\watermarkpic.png")
        .set_duration(20)
        .resize(height=20)                     # if you need to resize...
        .margin(right=8, bottom=8, opacity=0)  # (optional) logo-border padding
        .set_pos(("right", "bottom")))

final = mp.CompositeVideoClip([video, logo])
final.write_videofile("C:\\Users\\admin\\Desktop\\output\\demovideo(watermarked).mp4", audio=True, progress_bar=False)
Is there a way to copy the 10 second watermarked snippet back into the original video file? Or is there another library that allows me to do this?
What is slow in your use case is that MoviePy needs to decode and re-encode every frame of the movie. If you want speed, I believe there are ways to ask FFmpeg to copy video segments without re-encoding.
So you could use FFmpeg to cut the video into three subclips (before.mp4 / fragment.mp4 / after.mp4), only process fragment.mp4, then concatenate all the clips back together with FFmpeg.
Cutting into three clips with FFmpeg can be done from MoviePy:
https://github.com/Zulko/moviepy/blob/master/moviepy/video/io/ffmpeg_tools.py#L27
However, for concatenating everything back together you may need to call FFmpeg directly.
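A rough, untested sketch of that workflow, assuming MoviePy's ffmpeg_extract_subclip helper from the link above and an ffmpeg binary on the PATH (the 0/10/20 cut points and file names are placeholders, and the stream-copy concatenation only works cleanly if the re-encoded fragment ends up with codec parameters compatible with the original):

import subprocess
import moviepy.editor as mp
from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip

source = "demovideo.mp4"
duration = mp.VideoFileClip(source).duration

# 1. cut the original into three pieces without re-encoding them
ffmpeg_extract_subclip(source, 0, 10, targetname="before.mp4")
ffmpeg_extract_subclip(source, 10, 20, targetname="fragment.mp4")
ffmpeg_extract_subclip(source, 20, duration, targetname="after.mp4")

# 2. watermark only fragment.mp4 with MoviePy (as in the question),
#    writing the result to fragment_marked.mp4

# 3. concatenate with FFmpeg's concat demuxer, copying the streams
with open("parts.txt", "w") as f:
    f.write("file 'before.mp4'\nfile 'fragment_marked.mp4'\nfile 'after.mp4'\n")

subprocess.run(["ffmpeg", "-f", "concat", "-safe", "0",
                "-i", "parts.txt", "-c", "copy", "output.mp4"], check=True)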

audio file isn't being parsed with Google Speech

This question is a followup to a previous question.
The snippet of code below almost works... it runs without error, yet gives back a None value for results_list. This means it is accessing the file (I think) but just can't extract anything from it.
I have a file, sample.wav, living publicly here: https://storage.googleapis.com/speech_proj_files/sample.wav
I am trying to access it by specifying source_uri='gs://speech_proj_files/sample.wav'.
I don't understand why this isn't working. I don't think it's a permissions problem. My session is instantiated fine. The code chugs for a second, yet always comes up with no result. How can I debug this? Any advice is much appreciated.
from google.cloud import speech
speech_client = speech.Client()
audio_sample = speech_client.sample(
    content=None,
    source_uri='gs://speech_proj_files/sample.wav',
    encoding='LINEAR16',
    sample_rate_hertz=44100)
results_list = audio_sample.async_recognize(language_code='en-US')
Ah, that's my fault from the last question. That's the async_recognize command, not the sync_recognize command.
That library has three recognize commands. sync_recognize reads the whole file and returns the results. That's probably the one you want. Remove the letter "a" and try again.
Here's an example Python program that does this: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/cloud-client/transcribe.py
FYI, here's a summary of the other types:
async_recognize starts a long-running, server-side operation to transcribe the whole file. You can make further calls to the server with the operation.poll() method to check whether it has finished and, once it is complete, get the results via operation.results.
The third type is streaming_recognize, which sends you results continually as they are processed. This can be useful for long files where you want some results immediately, or if you're continuously uploading live audio.
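Applied to the snippet in the question, the synchronous call would look roughly like this (a sketch only; it keeps the same old-style Client/sample interface the question uses, and the transcript/confidence attributes mirror the asker's later code, so details may vary between library versions):

from google.cloud import speech

speech_client = speech.Client()
audio_sample = speech_client.sample(
    content=None,
    source_uri='gs://speech_proj_files/sample.wav',
    encoding='LINEAR16',
    sample_rate_hertz=44100)

# sync_recognize blocks until the whole file has been processed
results = audio_sample.sync_recognize(language_code='en-US')
for result in results:
    print(result.transcript, result.confidence)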
I finally got something to work:
import time
from google.cloud import speech

speech_client = speech.Client()
sample = speech_client.sample(
    content=None,
    source_uri='gs://speech_proj_files/sample.wav',
    encoding='LINEAR16',
    sample_rate_hertz=44100)

operation = sample.async_recognize(language_code='en-US')

retry_count = 100
while retry_count > 0 and not operation.complete:
    retry_count -= 1
    time.sleep(10)
    operation.poll()  # API call

print(operation.complete)
print(operation.results[0].transcript)
print(operation.results[0].confidence)
Then something like
for op in operation.results:
    print(op.transcript)

Combining multiple audio files in Python (with delay)

I'm looking to combine a range of different audio files (mp3) in Python. One of the requirements is that I need to be able to specify a delay at the end of each file. To illustrate, something like:
[file1.mp3 ----- 3 seconds -----][delay ----- 2 seconds -----][file2.mp3 ----- 4 seconds -----][delay ----- 2 seconds -----][file3.mp3 ----- 3 seconds -----]
Does anyone here know of any mp3 libraries that can accomplish this? Python isn't really a necessity here. If it'll be easier in another language, that'll be fine.
I think FFmpeg can do this, given the right arguments. No real need to use a library.
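A rough, untested sketch of that idea, calling FFmpeg from Python: generate the gap once with FFmpeg's anullsrc silence source, list the pieces for the concat demuxer, and let FFmpeg re-encode the result (the file names and the 2-second gap are placeholders, and an FFmpeg build with MP3 support is assumed):

import subprocess

# 2 seconds of silence as an MP3 (anullsrc is FFmpeg's silent audio source)
subprocess.run(["ffmpeg", "-f", "lavfi", "-t", "2",
                "-i", "anullsrc=r=44100:cl=stereo", "silence.mp3"], check=True)

# list the pieces in playback order for the concat demuxer
with open("parts.txt", "w") as f:
    for name in ["file1.mp3", "silence.mp3", "file2.mp3", "silence.mp3", "file3.mp3"]:
        f.write("file '%s'\n" % name)

# concatenate; re-encoding avoids glitches if the inputs use different bit rates
subprocess.run(["ffmpeg", "-f", "concat", "-safe", "0",
                "-i", "parts.txt", "combined.mp3"], check=True)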
To combine wav or aiff files, you can do something like this: (inspiration from here)
import aifc
def concatenate(*items):
    data = []
    for item in items:
        f = aifc.open(item, 'rb')
        data.append([f.getparams(), f.readframes(f.getnframes())])
        f.close()
    output = aifc.open('output.aif', 'wb')
    output.setparams(data[0][0])
    for item in data:
        output.writeframes(item[1])
    output.close()
See the link for the wav format (it's pretty much the same, but with the wave library)
To add silence, I would just make a one-second silent file using your favorite audio editor and then concatenate in the proper amount of silence.
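If you'd rather not open an editor, the silent chunk can also be generated in code with the same aifc module. A small untested sketch that writes a given number of seconds of silence using the parameters of an existing AIFF file (AIFF stores signed PCM, so zero-valued bytes are silence):

import aifc

def make_silence(reference_path, out_path, seconds):
    # copy the format parameters from an existing file
    ref = aifc.open(reference_path, 'rb')
    params = ref.getparams()
    ref.close()

    # one frame is nchannels * sampwidth bytes; zero samples are silence
    frame_bytes = params.nchannels * params.sampwidth
    nframes = int(params.framerate * seconds)

    out = aifc.open(out_path, 'wb')
    out.setparams(params)
    out.writeframes(b'\x00' * (frame_bytes * nframes))
    out.close()

The resulting file can then be passed to the concatenate() call above wherever a delay is needed.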

Resources