YouTube Song Lyric Recognition - audio

Many YouTube videos have automatic captions for lyrics. We believe they are using the Google Speech Recognition API. However, when we use the Google Speech Recognition API (or any other speech recognition API) ourselves, we do not get accurate lyrics; sometimes we get only one line from the whole song. Why might this be?
Does anyone have suggestions for acquiring real-time lyrics from a song, or an API/library for training on audio?
Thank you for your help!

In case anyone else is wondering, the youtube-transcript-api Python API can be used to get the transcripts/subtitles for a given YouTube video.
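A minimal sketch of using youtube-transcript-api, assuming the library's documented `get_transcript` interface, which returns a list of dicts with `text`, `start`, and `duration` keys. The fetch call is commented out so the example runs offline; the sample entries mimic the shape the library returns.

```python
# Fetching timed captions with youtube-transcript-api, then formatting them
# as a simple time-stamped lyric sheet.

# from youtube_transcript_api import YouTubeTranscriptApi
# entries = YouTubeTranscriptApi.get_transcript("VIDEO_ID")  # network call

# Sample entries in the library's documented shape (text/start/duration):
entries = [
    {"text": "First line of the song", "start": 12.3, "duration": 4.1},
    {"text": "Second line", "start": 16.4, "duration": 3.0},
]

def to_lyric_sheet(entries):
    """Format caption entries as [mm:ss]-stamped lines."""
    lines = []
    for e in entries:
        minutes, seconds = divmod(int(e["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {e['text']}")
    return "\n".join(lines)

print(to_lyric_sheet(entries))
```

Note that for songs these entries are still whatever YouTube's recognizer produced, so the accuracy caveats above apply.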

Related

In Google's Speech-to-Text live streaming, does Google charge me if the user does not speak?

I'm using Google's Speech-to-Text converter. I have to track each user's usage, so I'm wondering: does Google charge me if the user does not speak during live streaming?
Thanks
Yes, you still have to pay: Speech-to-Text bills by the duration of audio you stream, whether or not it contains speech. Use voice activity detection on your side and only send audio that contains speech.
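A minimal sketch of an energy-based gate, as a stand-in for a real voice activity detector such as WebRTC VAD: only frames whose RMS energy exceeds a threshold would be streamed to Speech-to-Text, so silence is never billed. The threshold and frame size here are illustrative, not tuned values.

```python
import array
import math

def is_speech(frame_bytes, threshold=500):
    """Return True if a frame of 16-bit little-endian PCM exceeds the RMS threshold."""
    samples = array.array("h", frame_bytes)
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= threshold

# Silence (all zeros) is gated out; a loud frame passes.
silence = array.array("h", [0] * 160).tobytes()
loud = array.array("h", [4000, -4000] * 80).tobytes()
print(is_speech(silence), is_speech(loud))  # False True
```

In a real pipeline you would buffer a few hundred milliseconds of "speech" frames before opening the streaming request, and close it after a run of silent frames, so you only pay for the voiced segments.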

Using Cortana for dictation of documents

I'm currently doing research about Cortana as I'm interested in doing some development of custom skills for it. Currently I'm using Cortana to invoke Windows Speech Recognition where I can then use WSR as a means to dictate text into Word. I'm experimenting with this as a possibility to be used for recording and generating a transcript in real time for meetings.
Now this is quite a hassle, as I've found, and I'm curious whether there is something I can do to integrate a bot within Cortana for the same purpose. I've looked up and done some reading about the Azure Bot Framework, Cognitive Services, LUIS, etc.
Is it possible to develop such a solution using the above-mentioned services?
Thank you in advance!
Yes, it is possible.
You can feed the streams to the Speech to Text API, chunk the audio according to the returned Offset and Duration of each phrase, then send those chunks to the Speaker Recognition API to identify the speaker by name. That gives you a name for each chunk to pair with its transcribed phrase, from which you can build a dialog.
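A sketch of the chunking step described above. Microsoft's Speech to Text API reports each phrase's Offset and Duration in 100-nanosecond ticks; given the raw 16-bit mono PCM, you can slice out the bytes for each phrase before sending the chunk on to Speaker Recognition. The sample rate and test data are illustrative assumptions.

```python
def slice_phrase(pcm, offset_ticks, duration_ticks, sample_rate=16000):
    """Return the PCM bytes for one recognized phrase (16-bit mono audio).

    offset_ticks / duration_ticks are in 100-nanosecond units, as returned
    by the Speech to Text API; 10_000_000 ticks = 1 second.
    """
    bytes_per_second = sample_rate * 2  # 2 bytes per 16-bit sample
    start = int(offset_ticks / 10_000_000 * bytes_per_second)
    end = int((offset_ticks + duration_ticks) / 10_000_000 * bytes_per_second)
    return pcm[start:end]

# One second of 16 kHz mono audio; a phrase covering the middle half-second.
pcm = bytes(16000 * 2)
chunk = slice_phrase(pcm, offset_ticks=2_500_000, duration_ticks=5_000_000)
print(len(chunk))  # 16000
```

Each returned chunk is what you would submit to the Speaker Recognition API, then tag the corresponding phrase with the identified speaker's name.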
Since you're considering this mainly for meetings: the solution you've described was announced a while ago as a feature of Microsoft Teams, and it is going to be publicly available in the near future. You can also watch a demo that was presented at Build 2018 here.

Get the audio data from Google Assistant

As of now (using api.ai), all I get is the string form of what the user speaks.
I would like to access the raw audio of what the user says when interacting with Google Assistant through the api.ai platform.
Is there a way to get the audio file?
[UPDATE]:
We are aiming to evaluate the quality of the user's speech, so we need to run our algorithms on the raw audio.
No, there is currently no way to get the audio content of what has been sent.
(However, the team is looking to understand the use cases of why you might want or need this feature, so you may want to elaborate on your question further.)

Audio hosting service that offers transcriptions of uploaded file?

Similar to how YouTube captions videos, is there any audio hosting service out there that will transcribe audio and provide a written transcription for accessibility purposes?
No.
You could upload the audio to YouTube as a video file and get its auto-captions, terrible as they are, then extract those.
You should know that YouTube's auto-captioning should never (never) be relied on. You can instead use it to generate a rough time-based set of captions that you can then download and correct.
The easiest way to do that is via No More Craptions, which will take a YouTube video with auto-captions and walk you through correcting them in a simple interface.
You may then download your completed work as a transcript as well. When you do that, remember to offer a plain text link near the audio file / player on the page with a clear indication of what the user will receive.
Let me reiterate — never rely on YouTube auto-captions. Always correct whatever YouTube provides. Always.
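The corrected captions you download from that workflow typically come as a WebVTT file; a minimal sketch of flattening one into the plain-text transcript mentioned above (it handles only the basic cue layout: header, optional numeric cue ids, `start --> end` timing lines, then text lines):

```python
def vtt_to_text(vtt):
    """Strip WebVTT headers, cue ids, and timing lines; join the caption text."""
    lines = []
    for raw in vtt.splitlines():
        line = raw.strip()
        if not line or line == "WEBVTT" or "-->" in line or line.isdigit():
            continue
        lines.append(line)
    return " ".join(lines)

sample = """WEBVTT

1
00:00:01.000 --> 00:00:03.000
Hello everyone

2
00:00:03.500 --> 00:00:06.000
welcome to the show
"""
print(vtt_to_text(sample))  # Hello everyone welcome to the show
```

The resulting plain text is what you would link near the audio player for accessibility.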

Convert live audio stream into text while conferencing using WebRTC

I am implementing a system like video conferencing using WebRTC and NodeJS.
But I want to add an extra feature. Suppose there is one moderator and five audience members who want to ask questions. While one of them is talking with the moderator, the rest record their questions, which are converted to text and shown on the moderator's screen; the moderator can then answer the questions he chooses and skip the unwanted ones. I hope you can picture the system.
First: is this doable?
If yes, any help will be appreciated.
You should simply try the Google Speech Recognition API, the same as Translator.js does. The Speech Recognition API can convert audio into text, which can then be played back as voice using either the Google Translation API or meSpeak.js.
RecordRTC.js can only be used for wav/webm recordings; it cannot convert voice into text.
Updated at: 11:23 am -- Saturday, 7 June 2014 (UTC)
Personally I think the Google Translation API is the only "official", i.e. "non-free", API. The Speech Recognition API is natively supported in both Chrome and Firefox, and it is part of a specification, albeit one submitted by Google developers.
Web Speech API Specification: https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html
