Problem with transcription of short audio clips ("yes", "no") with Google - dialogflow-es

I'm having difficulties transcribing short user audio answers such as "yes" or "no".
I'm using Dialogflow's detectIntent function with audio as input, but the same thing happens with the Google Speech-to-Text API; I assume both use the same algorithms. Basically, the problem is that in a lot of cases the response is empty.
The audio clips are taken from a phone call (MULAW, 8 kHz), and the encoding and sample rate match what I'm sending in the request, because it works with almost all the audios.
We only have a problem with short responses. We listen to the audio and the word (yes/no) is quite clear, but both Dialogflow and Google Speech-to-Text return an empty response.
Has anyone had the same problem? Is there any configuration that can be applied to solve or mitigate it?
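Not a confirmed fix, but a sketch of two request settings that sometimes help with very short telephony utterances: the phone_call model and phrase hints biased toward the expected answers. This uses the Speech-to-Text Python client rather than the poster's exact Dialogflow setup, and the file name and language code are assumptions:

from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.MULAW,
    sample_rate_hertz=8000,
    language_code="en-US",
    model="phone_call",  # model tuned for 8 kHz telephony audio
    # Bias recognition toward the short answers we expect to hear.
    speech_contexts=[speech.SpeechContext(phrases=["yes", "no"])],
)

with open("answer.raw", "rb") as f:  # hypothetical MULAW clip from the call
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)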

Related

Mixing OPUS Audio on a server for a meeting

I am currently trying to challenge myself and write a little meeting tool in order to learn stuff. I am currently stuck at the audio part. My audio stack is working; each user sends an OPUS stream (20 ms packets) to the server. I am now thinking about how I will handle the audio. The outcome shall be: all users receive all audio, but not their own audio (so nobody hears themselves). I want the meeting to support as many concurrent users as possible.
I have the following ideas, but none feels quite right:
send all audio streams to all users, which would mean more traffic; mixing would be done on the client side
mix the audio on a per-user basis, which means that with n users I would need to produce n encodings per frame (see the sketch after this post)
mix the whole audio together for all users and send it to everyone; each user would also receive a second Opus packet containing the "own" Opus audio packet which was sent to the server (or it would be numbered and stored on the client side so it does not need retransmission). I don't know whether, after decoding, I can remove the "own" audio from the stream without getting some unclean audio.
How is this normally done? Are there options I am missing? I have profiled all the steps involved; the most expensive part is encoding (one 20 ms encode takes about 600 ns), while decoding and mixing take nearly no time at all (5-10 ns each).
Currently I would prefer option 3, but I cannot find information on whether the audio will be clean or whether it will come out washed out or with some crackling.
The whole thing is written in C++, but I did not include the tag since I don't need code examples, just information on this topic. I tried googling a lot and read a lot of the Opus documentation, but did not find anything related to this.
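For what it's worth, option 2 is usually implemented as a "mix-minus": build the full mix once per frame, then subtract each user's own decoded signal before re-encoding, so the mixing cost stays linear in the user count. A minimal sketch of that arithmetic in plain Python (the project is C++, but the math is the same; the int16 clipping and frame layout are assumptions):

# Mix-minus sketch: one full mix per frame, then per-user subtraction.
# Assumes every stream is already decoded to equally sized int16 PCM frames.
def mix_minus(frames, frame_len):
    # Sum all users' samples once.
    full_mix = [0] * frame_len
    for pcm in frames.values():
        for i, s in enumerate(pcm):
            full_mix[i] += s
    # Each user hears the mix minus their own signal, clipped to int16 range.
    out = {}
    for user, pcm in frames.items():
        out[user] = [max(-32768, min(32767, full_mix[i] - pcm[i]))
                     for i in range(frame_len)]
    return out  # one PCM frame per user, ready to encode with Opus

Note this only cancels cleanly server-side on the decoded PCM; subtracting the "own" packet client-side (option 3) leaves residual artifacts, because the re-encoded mix is itself lossy and no longer contains the own signal exactly.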

What would you recommend for transcribing audio files into .txt?

I am working on a small school project where I have to take a lot of audio files and transcribe them into .txt files. I am a beginner at programming.
So far I've tried alexkras's method using Google's Cloud Speech API. But I can't use this for mass transcribing, because it requires converting the audio to .wav using external software (this can be done through ffmpeg too, so not a big deal) and splitting the new .wav file into <60 s parts, since Cloud Speech can only transcribe <60 s at a time, which is a big loss in transcription quality, unless you upload the files to GCS. But that is also a problem for mass transcribing, because some .wav files are large enough (a 1-hour podcast I used turned into an 800 MB file) that the process is slowed down.
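For reference, the convert-and-split step can be scripted rather than done by hand. A sketch using ffmpeg's segment muxer from Python; the file names, 16 kHz mono output, and the 55 s segment length are assumptions:

import subprocess

# Convert to 16 kHz mono WAV and split into 55 s chunks in one pass.
subprocess.run([
    "ffmpeg", "-i", "podcast.mp3",           # placeholder input file
    "-ar", "16000", "-ac", "1",              # sample rate/channels Cloud Speech accepts
    "-f", "segment", "-segment_time", "55",  # stay under the 60 s limit
    "chunk%03d.wav",
], check=True)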
The next one I tried is using the gcloud SDK and directly transcribing audio files on GCS with a small bit of code in my terminal. The problem I observed here is that the transcription is not complete, and it is shown this way:
Example from Google:
{
  "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.9840146,
          "transcript": "how old is the Brooklyn Bridge"
        }
      ]
    }
  ]
}
This is not ideal; maybe there is a way to transfer it into a text file, but the transcriptions I've done so far are not complete. I got a total of fewer than 30 lines of text from an 11-minute video.
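For what it's worth, extracting just the transcript lines from that JSON into a .txt is straightforward; a sketch, assuming the output above was saved as result.json (hypothetical name):

import json

# Load the LongRunningRecognizeResponse JSON saved from the gcloud command.
with open("result.json") as f:
    response = json.load(f)

# Keep the top alternative of each result, one per line.
lines = [
    result["alternatives"][0]["transcript"]
    for result in response.get("results", [])
    if result.get("alternatives")
]

with open("transcript.txt", "w") as f:
    f.write("\n".join(lines))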
The most effective method I've tried is the alexkras one, but as I've said above, there are problems with that too (in my case). I've been looking into machine learning methods for speech-to-text so it can recognise or transcribe audio with accents too.
Do you guys know any method to help me transcribe audio into text effectively, at scale? I would have been so happy with the alexkras method if it weren't for the splitting of files or the uploading to GCS. I would greatly appreciate any help, suggestions or guidance with this. Thank you.
You can try the Watson STT API; the file/stream size limit is 100 MB, which means that with the proper encoding you can decode files up to several hours long. You can use sox or ffmpeg for the audio conversion if needed; the lightest-weight codec is audio/ogg.
https://www.ibm.com/watson/developercloud/speech-to-text/api/v1/#recognize_sessionless12
See the curl example to get you started.
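Roughly the same sessionless call from Python instead of curl; a sketch only, and the endpoint and basic-auth style reflect the older Watson service (newer IBM Cloud accounts use API keys), so treat the URL and credentials as placeholders:

import requests

# Sessionless recognize: POST the audio body with its content type.
with open("audio.ogg", "rb") as f:
    r = requests.post(
        "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize",
        headers={"Content-Type": "audio/ogg"},
        data=f,
        auth=("username", "password"),  # placeholder service credentials
    )
print(r.json())  # transcript alternatives, similar in shape to the JSON above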
I've just been exploring the AWS Transcribe product. It requires an AWS account, which can be obtained for free, with a credit card on file for payment if you go over the free limits.
It provides up to 60 minutes of audio transcription per month. If you go beyond 60 minutes of audio, you'll need to pay a bit less than $1.50 per hour of audio transcribed.
The transcription results in a JSON file that is not easy to read, but there is a PHP script on GitHub that turns the JSON file into a very easy-to-read transcript.
I've found it to be pretty accurate and relatively easy to use. I'd look into it if I were you.
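If you go that route, starting a job is a few lines with boto3; a sketch, with the bucket, file and job names as placeholders:

import boto3

transcribe = boto3.client("transcribe")

# Kick off an asynchronous transcription job on a file already in S3.
transcribe.start_transcription_job(
    TranscriptionJobName="podcast-episode-1",               # any unique name
    Media={"MediaFileUri": "s3://my-bucket/episode1.mp3"},  # placeholder URI
    MediaFormat="mp3",
    LanguageCode="en-US",
)

# Poll for completion; the finished job exposes a URI to the JSON transcript.
job = transcribe.get_transcription_job(TranscriptionJobName="podcast-episode-1")
print(job["TranscriptionJob"]["TranscriptionJobStatus"])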

Manipulating audio to bypass content ID detection

I'm using YouTube's "auto-generated" captions feature to generate transcripts of mp3 files. I do this by first converting the mp3 to a blank mp4, uploading it to YouTube, waiting for the auto-generated captions to appear, then extracting the SRT file.
The issue I'm having, though, is that a few of the mp3 files I've uploaded have been flagged as containing copyrighted content, and as such no auto-generated captions have been made for them.
I have no desire to publish the mp3s on YouTube; they're uploaded as unlisted videos, and all I require are the SRT files. Is there a way to manipulate the audio to bypass YouTube's Content ID system? I've tried altering the pitch in Audacity, but no matter how subtle or extreme the pitch change is, they're still flagged as containing copyrighted content. Is there anything else I can do to the audio, other than adjusting the pitch, that might work?
I'm hoping this post doesn't breach any rules on here, and I can't stress enough that I'm not looking to publish these mp3s; I just want the auto-generated SRTs.
No one can know how to cheat on Content ID
Obviously, as Content ID is a private algorithm developed by Google, no one can know for sure how it detects copyrighted audio in a video.
But we can assume that one of the first things they did was make their algorithm pitch-independent; otherwise, everyone would change the pitch of their videos and cheat Content ID easily.
How to use Youtube to get your subtitles anyway
If I am not mistaken, Content ID blocks you because of musical content rather than vocal content. Thus, to address your original problem, one solution would be to detect the musical content (based on spectral analysis) and cut it from the original audio. If the problem occurs with pure vocal content as well, you could try filtering it heavily, and that might work.
Other solutions
YouTube being made by Google, why not directly use the Speech API that Google offers, which most likely performs the audio transcription on YouTube? And if the results are not satisfying, you could try other services (IBM, Microsoft, Amazon and others have their own).

How to anonymize (mask) audio (human voice) using JavaScript

I'm hoping to record the audio of some stories from remote study participants via web browsers. I would like to give them an option of anonymizing their voices before they submit their audio clips. Is there a way to do that in JavaScript (or in any other library, for example in Python, that I can invoke in the background on the server before serving the clip back to the participant to verify before they submit)?
This YouTube video comes really close to what I would like to accomplish. Thanks in advance for your suggestions and advice!
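On the server-side Python option the question mentions: a minimal sketch using librosa's pitch shifter. The file names and the 4-semitone shift are assumptions, and note that simple pitch shifting on its own is fairly weak anonymization:

import librosa
import soundfile as sf

# Load the uploaded clip (placeholder name), keeping its native sample rate.
y, sr = librosa.load("submission.wav", sr=None, mono=True)

# Shift the voice up 4 semitones; negative n_steps would shift it down.
masked = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)

# Write the masked version to serve back for the participant to review.
sf.write("submission_masked.wav", masked, sr)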
You can use the PitchShifter object of SoundTouchJS to change the 'key' of the input, and even the playback rate if necessary. It might also be helpful to run further convolvers against the AudioNode to further anonymize it.

Detecting ads in audio streams?

I have never tried, but I'm just curious whether there is any possibility of detecting ads in audio streams, I mean aside from machine learning or the like. Are there any specifics about the byte stream during adverts? Maybe some kind of different loudness value?
From a purely audio standpoint, this isn't possible. There is nothing that distinguishes an advertisement from other audio content. Sure, you could argue that a station playing music will have different spectral characteristics than when talking comes on for an advertisement, but what about ads that also play music? How do you distinguish between an announcer and someone reading an ad? What if the ad is embedded in normal content?
Now, some stations do provide metadata that occasionally contains ad information. If you look at the length of a particular content item, ads are usually going to be under a minute, or 30 seconds. How you get this metadata and deal with it depends on the kind of stream you're working with.
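As an illustration of the metadata route: Shoutcast/Icecast-style streams can interleave StreamTitle metadata into the byte stream when the client asks for it. A sketch (the stream URL is a placeholder, and many stations send nothing useful here):

import requests

url = "http://example.com/stream"  # placeholder stream URL
r = requests.get(url, headers={"Icy-MetaData": "1"}, stream=True)

# The server interleaves a metadata block every icy-metaint audio bytes.
metaint = int(r.headers["icy-metaint"])
raw = r.raw

for _ in range(10):  # read a few metadata blocks, then stop
    raw.read(metaint)                 # skip the audio payload
    length = raw.read(1)[0] * 16      # metadata length in 16-byte units
    if length:
        meta = raw.read(length).rstrip(b"\x00").decode("utf-8", "replace")
        print(meta)                   # e.g. StreamTitle='Artist - Title';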
There are techniques emerging to do this, and they tend to leverage databases of known adverts to get around the theoretical problems that Brad correctly highlights in his answer.
One of the references below, however, uses a technique based on detecting slight differences in the audio when an ad starts as the initial detection trigger.
Some techniques also use both the audio and visual streams to aid detection; for example, the Google paper below uses audio matching first and then the video to validate/verify.
Some sources that might be worth looking at for anyone interested in this area (I realize it is an old question, but it is still topical):
http://www.xavieranguera.com/papers/cimca_2008.pdf
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/55.pdf
https://www.audiblemagic.com/wp-content/uploads/2014/02/ad_detection_datasheet_150406.pdf
