Setting up Google Cloud Speech API to transcribe interviews - speech-to-text

I've got over 100 hours of audio associated with video interviews for a documentary that need to be transcribed to text, ideally with timecode markers every 30 seconds or so, so that the video can easily be matched up to the text in the edit suite.
The files are BWAV 24-bit 96 kHz and WAV 16-bit 48 kHz and last anywhere from 20 minutes to 2 hours.
What kind of resources need to be set up in a VM to do this kind of activity? I suspect it will be rather compute-intensive, so the VM might need 32 cores and a fair amount of memory, but there is no need for real-time response, so it is OK if priorities are low and it takes several hours to process a file. My budget is minuscule: $300 is about the most we can afford for all the files (which is one reason we aren't sending these files out to a transcription service at $75+/hour).
I've already got a Cloud Platform account but have never used it. There is no point in my floundering around if someone has already done something similar and can give me some help.
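For the 30-second timecode markers, here is a minimal sketch in Python. It assumes the recognizer can return a start time for each word (Cloud Speech-to-Text has a word time offset option for this). The `(start_seconds, word)` input shape and both function names are made up for illustration and are not part of any API. Timecodes are plain HH:MM:SS without frame numbers.

```python
def format_timecode(seconds):
    """Format a number of seconds as HH:MM:SS (no frames)."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def mark_every(words, interval_s=30):
    """Group (start_seconds, word) pairs into blocks of interval_s seconds.

    Returns one text line per non-empty block, prefixed with the
    timecode at which that block starts.
    """
    buckets = {}
    for start_s, word in words:
        # Start of the block this word falls into, e.g. 31.2 s -> 30 s.
        bucket_start = int(start_s // interval_s) * interval_s
        buckets.setdefault(bucket_start, []).append(word)
    return [
        f"[{format_timecode(start)}] {' '.join(buckets[start])}"
        for start in sorted(buckets)
    ]


# Hypothetical sample data, only to show the output shape.
sample = [(0.4, "So"), (1.0, "tell"), (31.2, "Well"), (3725.0, "later")]
for line in mark_every(sample):
    print(line)
```

Blocks with no speech are simply skipped, so long pauses do not produce empty marker lines.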

Related

record screen with high quality and minimum size in ElectronJS (Windows)

As I said in the title, I need to record my screen from an Electron app.
My needs are:
high quality (720p or 1080p)
minimum size
record audio + screen + mic
low impact on PC hardware while recording
no waiting after the recorder has stopped
By minimum size I mean about 400 MB at 720p and 700 MB at 1080p for a 3 to 4 hour recording. We have already achieved this with Bandicam and OBS, so it is possible.
I already tried:
The simple MediaStreamRecorder API using RecordRTC.js; it produces huge file sizes, like 1 GB per hour for 720p video.
Compressing the output video using FFmpeg; it can take up to 1 hour for a 3 hour recording.
Saving every chunk in the 'ondataavailable' event, then immediately running FFmpeg to convert and shrink it, and finally appending all the compressed files (also with FFmpeg); there are two problems. First, the chunks have different PTS values, but that can be fixed by tuning the compression command arguments. Second, and this is the main problem, the audio data headers are only available in the first chunk, so this approach produces a video that only has audio for the first few seconds.
Recording the video with FFmpeg itself; the end users need to change some things manually (Stereo Mix), the configs are too complex, it makes the whole PC run slower while recording (e.g. FPS drops, even if I set -threads to 1), and in some cases it needs a long time to wrap everything up after recording has finished.
Searching the internet for applications that can be used from the command line; I couldn't find much. Well-known applications like Bandicam and OBS have command-line arguments, but there are not many to play with and I can't set many options, which leads to other problems.
I don't know what else I can do. Please tell me if you know a way or a simple tool that can be used through the CLI to achieve this, and guide me through it.
I ended up using the portable mode of high-level third-party applications like obs-studio and adding them to our final package. I also created a JS file to control the application through the CLI.
This way I could pre-set my options (such as the CRF value, etc.), and now our average output size for a 3:30 hour video at 1080p resolution is about 700 MB, which is impressive.
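To put those file sizes in perspective, here is a small Python sketch that turns a size and a duration into an average bitrate. It assumes 1 MB = 1,000,000 bytes and treats the whole file as one stream (audio, video and container overhead together). The function name is made up for illustration.

```python
def average_bitrate_kbps(size_mb, hours):
    """Average total bitrate in kbit/s for a file of size_mb megabytes.

    Assumes 1 MB = 1,000,000 bytes; audio, video and container
    overhead are all counted together.
    """
    bits = size_mb * 1_000_000 * 8
    seconds = hours * 3600
    return bits / seconds / 1000


# About 700 MB for a 3.5 hour 1080p recording.
print(round(average_bitrate_kbps(700, 3.5)))   # 444
# About 1 GB per hour, as seen with the MediaStreamRecorder approach.
print(round(average_bitrate_kbps(1000, 1)))    # 2222
```

So the target is roughly a fifth of the bitrate the browser recorder produced, which is why an encoder with a tunable quality setting such as CRF makes such a difference.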

ESP32: BLE transmission speed is very slow

I am trying to build an Android app that interfaces with the ESP32 using BLE. I am using the RxBluetoothKotlin library from Vincent Masselis for the Android side. For the ESP32 side, I am using the default Kolban libraries that are included in the Arduino IDE. My phone is a OnePlus 5T and my ESP32 is a MH ET Live ESP32DevKIT. My Android app can be found here, and my ESP32 program here.
The whole system works pretty much perfectly for me in terms of pure functionality. That is to say, every button does what it's supposed to do, and I get the exact behaviour I had expected to get. However, the communication itself is very slow. Around 200 bytes/second. My test button in the Android app requests a bunch of text data from the ESP32, and displays this in a dialog. It also lists a number which represents the time between request and reception in milliseconds. Using this, I get around 2 seconds for 440 bytes of data. When I send less data, the time decreases approximately linearly with data size. 40 bytes of data will take around 200ms, and 20 bytes or under typically takes less than 100ms.
This seems rather slow to me. From what I understand, I should be able to at least get a few kilobytes per second. I have tried to check the speed using nRF Connect, but I get the same 2 seconds timespan for my data transfer. This suggests that the problem is not in my app, since I also have it with a completely different app. I also put the code in my main loop inside of callbacks instead (which I probably should have done in the first place), but this didn't change things at all. I have tried taking the microcontroller and my phone to a few different locations, hoping to eliminate interference. I have tried to mess with BLEDevice::setPower and BLEDevice::setMTU, as well as setting RxBluetoothGatt.requestMtu(500) on the Android side. Everything so far seems to have had little to no effect. The only thing that did anything, was adding the line "pServer->updatePeerMTU(0,500);" in my loop during the connection phase. This caused the first 23 bytes of data to be repeated whenever I pressed the test button in my app, and made the data transfer take about 3 seconds. If I'm lucky, I can get maybe a bit under 1.8 seconds for 440 bytes, but this is a very small change when I'm expecting an order of magnitude of difference, and might even be down to pure chance rather than anything I did.
Does anyone have an idea of how to increase my transfer speed?
The data transmission speed is mainly influenced by the Bluetooth LE connection interval (between 7.5 ms and 4 seconds), which is negotiated between the master (central unit) and the peripheral device. The master establishes a connection with a parameter set, and the peripheral can propose changes to this parameter set. In the end, however, the central unit decides which parameter set is used.
However, Android applications, which normally act in the central role, cannot change the Bluetooth connection interval directly. Instead, an app can request a connection priority, which is known to influence the connection interval.
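A rough back-of-the-envelope model in Python shows why the interval matters so much. It assumes the default ATT MTU of 23 bytes (20 bytes of payload after the 3-byte ATT header) and, as a deliberate simplification, exactly one packet per connection interval; real stacks can send several packets per connection event. The 90 ms interval below is back-solved from the 2 seconds reported in the question, not a measured value, and the function names are made up for illustration.

```python
import math

ATT_HEADER_BYTES = 3  # ATT opcode + handle overhead per packet


def packets_needed(total_bytes, mtu=23):
    """Number of packets needed to move total_bytes at a given ATT MTU."""
    payload = mtu - ATT_HEADER_BYTES
    return math.ceil(total_bytes / payload)


def transfer_time_s(total_bytes, interval_ms, mtu=23, packets_per_interval=1):
    """Estimated transfer time, assuming a fixed number of packets per interval."""
    packets = packets_needed(total_bytes, mtu)
    intervals = math.ceil(packets / packets_per_interval)
    return intervals * interval_ms / 1000.0


print(packets_needed(440))              # 22 packets at the default MTU
print(transfer_time_s(440, 90))         # about 2 s, close to what the question reports
print(transfer_time_s(440, 7.5))        # the same data at the shortest interval
print(transfer_time_s(440, 7.5, 500))   # one packet if a 500-byte MTU is really in effect
```

On Android, the connection priority request mentioned above is `BluetoothGatt.requestConnectionPriority(...)`, for example with `BluetoothGatt.CONNECTION_PRIORITY_HIGH`. Whether the wrapper library in use exposes it directly would need checking.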

Speaker Recognition and Response Time?

I know that Speaker Recognition is in preview and the only available location is the West Coast, and I'm hoping that's why I am seeing this 'delay'.
I'm on the East Coast (NY), and with just 3 speakers in my search it takes 6 seconds to return a confirmation. Don't get me wrong, 6 seconds is impressive for what it does, but that long a delay makes the use case more limited than a quicker reply would.
Main question is: should I see a quicker reply once the service adds a closer location? (It's not like the latency should cause a big issue...) Or is there anything else that may speed up replies? Or, of course, is this simply 'the way it's going to be'?
Thanks!
I assume you're talking about Microsoft Speaker Recognition.
The processing time is a function of the audio length. For 15 seconds of audio, you can expect less than 1 second of latency, and yes, in general, you should see a quicker response when the service expands to closer locations.

Download last 30 seconds of an mp3

Is it possible to download only the last 30 seconds of an mp3? Or is it necessary to download the whole thing and crop it after the fact? I would be downloading via http, i.e. I have the URL of the file but that's it.
No, it is not possible... at least not without knowing some more information first.
The real problem here is determining at what byte offset the last 30 seconds is. This is a product of knowing:
Sample Rate
Bit Depth (per sample)
# of Channels
CBR or VBR
Bit Rate
Even then, you're not going to get that with a VBR MP3 file, and even with CBR, who knows how big the ID3 and other crap at the beginning of the file is. Even if you know all of that, there is still some variability, as you have the problem of the bit reservoir.
The only way to know would be to download the whole file and use a tool such as FFmpeg to find the right offset. Then, if you want to play it, you'll want to add the appropriate headers and make sure you are trimming on an eligible frame, or fix the bit reservoir yourself.
Now, if this could all be figured out server-side ahead of time, then yes, you could request the offset from the server, and then download from there. As for how to download it, your question is very incomplete and didn't mention what protocol you were using, so I cannot help you there.
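For the narrow case where the file is known ahead of time to be CBR at a known bit rate, a rough estimate is possible. This Python sketch builds an HTTP suffix range (`bytes=-N`) for the last N bytes. It assumes the server honours `Range` requests, and the function names are made up for illustration. The result will most likely start in the middle of a frame, so a decoder still has to resync, and the bit reservoir caveat above still applies.

```python
import urllib.request


def suffix_range_header(bitrate_kbps, seconds=30, trailing_tag_bytes=0):
    """Build a Range header for roughly the last `seconds` of a CBR MP3.

    bitrate_kbps must be known in advance (the caller's assumption).
    trailing_tag_bytes covers metadata at the end of the file, e.g.
    128 bytes for an ID3v1 tag.
    """
    bytes_per_second = bitrate_kbps * 1000 // 8
    length = bytes_per_second * seconds + trailing_tag_bytes
    return {"Range": f"bytes=-{length}"}


def fetch_tail(url, bitrate_kbps, seconds=30):
    """Download only the tail of the file; fail if the server ignores Range."""
    request = urllib.request.Request(url, headers=suffix_range_header(bitrate_kbps, seconds))
    with urllib.request.urlopen(request) as response:
        if response.status != 206:  # 206 Partial Content
            raise RuntimeError("server did not honour the Range request")
        return response.read()


print(suffix_range_header(128))  # {'Range': 'bytes=-480000'}
```

For a VBR file this estimate is meaningless, which is the point made above.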

Debug NAudio MP3 reading difference?

My code using NAudio to read one particular MP3 gets different results than several other commercial apps.
Specifically: My NAudio-based code finds ~1.4 sec of silence at the beginning of this MP3 before "audible audio" (a drum pickup) starts, whereas other apps (Windows Media Player, RealPlayer, WavePad) show ~2.5 sec of silence before that same drum pickup.
The particular MP3 is "Like A Rolling Stone" downloaded from Amazon.com. Tested several other MP3s and none show any similar difference between my code and other apps. Most MP3s don't start with such a long silence so I suspect that's the source of the difference.
Debugging problems:
I can't actually find a way to even prove that the other apps are right and NAudio/me is wrong, i.e. to compare block-by-block my code's results to a "known good reference implementation"; therefore I can't even precisely define the "error" I need to debug.
Since my code reads thousands of samples during those 1.4 sec with no obvious errors, I can't think how to narrow down where/when in the input stream to look for a bug.
The heart of the NAudio code is a P/Invoke call to acmStreamConvert(), which is a Windows "black box" call which I can't think how to error-check.
Can anyone think of any tricks/techniques to debug this?
The NAudio ACM code was never originally intended for MP3s, but for decoding constant bit rate telephony codecs. One day I tried setting up the WaveFormat to specify MP3 as an experiment, and what came out sounded good enough. However, I have always felt a bit nervous about decoding MP3s (especially VBR) with ACM (e.g. what comes out if ID3 tags or album art get passed in - could that account for extra silence?), and I've never been 100% convinced that NAudio does it right - there is very little documentation on how exactly you are supposed to use the ACM codecs. Sadly there is no managed MP3 decoder with a license I can use in NAudio, so ACM remains the only option for the time being.
I'm not sure what approach other media players take to playing back MP3, but I suspect many of them have their own built-in MP3 decoders, rather than relying on the operating system.
I've found some partial answers to my own Q:
Since my problem boils down to consuming too much MP3 w/o producing enough PCM, I used conditional-on-hit-count breakpoints to find just where this was happening, then drilled into that.
This showed me that some acmStreamConvert() calls are returning success, consuming 417 src bytes, but producing 0 "dest bytes used".
Next I plan to try acmStreamSize() to ask the codec how many src bytes it "wants" to consume, rather than "telling" it to consume 417.
Edit (followup): I fixed it!
It came down to passing acmStreamConvert() enough src bytes to make it happy. Giving it its acmStreamSize() requested size fixed the problem in some places but then it popped up in others; giving it its requested size times 3 seems to cure the "0 dest bytes used" result in all MP3s I've tested.
With this fix, acmStreamConvert() then sometimes returned much larger converted chunks (almost 32 KB), so I also had to modify some other NAudio code to pass in larger destination buffers to hold the results.
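The shape of that fix can be sketched in Python with a stand-in for the codec. This is not the NAudio or ACM code: `convert` is a hypothetical callable returning `(src_bytes_used, pcm_bytes)`, and `fake_convert` only imitates the "success but 0 dest bytes used" behaviour. Carrying unconsumed source bytes over to the next call is an addition in this sketch and is not something the post describes.

```python
def convert_stream(source, convert, requested_size, multiplier=3):
    """Feed `source` to `convert` in blocks of requested_size * multiplier.

    convert(block) must return (src_bytes_used, pcm_bytes). Bytes the
    converter did not consume are kept and passed again with the next block.
    """
    block_size = requested_size * multiplier
    pending = b""
    output = bytearray()
    pos = 0
    while pos < len(source) or pending:
        pending += source[pos:pos + block_size]
        pos += block_size
        used, pcm = convert(pending)
        output += pcm
        if used == 0 and pos >= len(source):
            break  # nothing left to add, and the converter wants more
        pending = pending[used:]
    return bytes(output)


FRAME = 417  # source frame size seen in the post


def fake_convert(block):
    """Hypothetical codec: needs at least two whole frames to emit anything."""
    frames = len(block) // FRAME
    if frames < 2:
        return 0, b""  # "success", but 0 dest bytes used
    return frames * FRAME, b"x" * frames


data = bytes(FRAME * 12)
print(len(convert_stream(data, fake_convert, FRAME)))  # 12
```

The output buffer here grows as needed, which corresponds to the larger destination buffers mentioned above.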

Resources