How is Alexa programmed to sing? - audio

If you say "Alexa, sing for me", she will choose one of several songs that have been created with her voice. The voice(s) for each of these songs must have been created somehow.
At first, I thought that SSML would provide the tools necessary to do this, especially the <prosody> tag which has parameters for pitch and rate (duration).
I thought perhaps each syllable of singing could have its pronunciation specified with <phoneme> and its pitch and duration specified with <prosody>, with <break> tags in between:
<speak>
<prosody rate="20%">
<phoneme alphabet="x-sampa" ph="U">oo</phoneme>
<break strength="none" />
</prosody>
<prosody rate="20%" pitch="+50%">
<phoneme alphabet="x-sampa" ph="U">oo</phoneme>
<break strength="none" />
</prosody>
<prosody rate="20%">
<phoneme alphabet="x-sampa" ph="U">oo</phoneme>
</prosody>
</speak>
However, when executed, Alexa applies her built-in inflection (to sound like a real human), and so the tone is not flat. These "ooh" sounds (above), for example, each have a falling tone. (They also have a noticeable break between phonemes even though "no break" was explicitly specified.)
So then, how did the Alexa voice which is heard singing all of those songs get programmed? Was it via tools currently only available to Amazon developers?
It's also perplexing to me that I am apparently the only person on the internet even asking this question (based on zero results on Stack Overflow, Google, etc.), especially this late in the game. Aren't there loads of musicians out there who would love to be able to make Alexa sing whatever they want?
Edit: Guys, I thought it was common knowledge, but there is no human voice actor behind Alexa. Her voice is completely computer-generated.

Alexa's voice is completely computer-generated, and so are the songs. Research is ongoing into generating a singing synthesizer model (#1 and #2).
Here's a video by Popgun Labs regarding how they make their AI sing. Although I am unable to find how Amazon and Google do this, my guess is that it will be something similar.
EDIT: My earlier answer was based on an extension page and drew incorrect conclusions.

My prediction would be either something really fancy like natural language processing or AI/ML along those lines, or that they just had a voice actor sing something (or sing particular tones) and cut the pieces together. I don't own an Alexa, but I do have a HomePod mini and an iPhone, and from the way it pronounces our local singers' names like "Sidhu Moosewala" or "Amrit Maan" (off topic, but still related), I believe they just cut words up and put them back together in a "clean" and "flowing" way.

Perhaps her voice is simply autotuned.
Certainly, pitch-shifting tools can force any desired pitch from any audio source, and I presume such tools can force duration changes as well.
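As a rough illustration of that idea, here is a sketch in Python using librosa (my own choice of library, nothing to do with how Amazon actually does it): it takes a hypothetical recording of a spoken "ooh" and forces it onto a few notes by time-stretching and pitch-shifting.

import numpy as np
import librosa
import soundfile as sf

# Load a spoken syllable (hypothetical recording, e.g. Alexa saying "ooh").
y, sr = librosa.load("ooh.wav", sr=None)

# Slow it down, then force it onto a simple melody (semitone offsets).
stretched = librosa.effects.time_stretch(y, rate=0.5)
melody = [0, 4, 7, 12]

notes = [librosa.effects.pitch_shift(stretched, sr=sr, n_steps=n) for n in melody]
sf.write("sung.wav", np.concatenate(notes), sr)

The result tends to sound robotic, but it does show that pitch and duration can be forced after the voice has already been synthesized.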

Related

How to automatically transcribe a Skype meeting, correctly attributed to each participant?

Assuming each participant agrees to the recording and transcription of the Skype call, is there a way to transcribe the meeting (live, offline, or both) that produces a text transcript in which each utterance is correctly attributed to its speaker? The transcript could then be fed into any variety of search or NLP algorithms.
The top 3 Google search hits of "automatically transcribe Skype" refer to apps which make manual transcription easier:
(1) http://www.dummies.com/how-to/content/how-to-convert-skype-audio-to-text-with-transcribe.html
(2) http://ask.metafilter.com/231400/How-to-record-and-transcribe-Skype-conversation
(3) https://www.ttetranscripts.com/blog/how-to-record-and-transcribe-your-skype-conversations
While it would be trivial to record the audio and send it to a speech-to-text engine, I doubt the result would be very high quality, because the best results usually come from speaker-dependent models (otherwise we wouldn't have to take time to train Dragon NaturallySpeaking).
But before we can apply speaker-dependent transcription models, we need to know which segment of the audio belongs to which speaker. There are two ways this is solved:
There is an easy way to retrieve all the audio that came from each participant, e.g. you just record all the audio from each speaker's microphone during the call, and you don't have to do any segmentation.
If the first option isn't feasible or is prohibitive in some way, we have to use a speaker diarization algorithm, which segments the audio into N clusters/speakers (most algorithms can be told how many speakers are in the audio, but some can figure this out on their own). For a real-time transcript as the call goes on, I imagine we'd need some fancy real-time speaker diarization algorithm.
In any case, once the segmentation is solved, each participant has their trained speaker model, which is then applied to their portions of the audio. At the end of the day, everyone gets a nice conversation transcript, and later on we can do fancy things like topic analysis, or maybe Big Brother wants to sift through everyone's project meetings without having to listen to hours of audio.
My question is, what would be a way to implement this in practice?
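To make that concrete, here is a rough sketch of the pipeline I have in mind, assuming pyannote.audio for the diarization step (my assumption, any diarizer would do) and a placeholder transcribe_for_speaker() standing in for whichever speaker-dependent engine is actually used:

from pyannote.audio import Pipeline

def transcribe_for_speaker(wav_path, start, end, speaker):
    # Placeholder: crop wav_path to [start, end] and run that
    # speaker's trained speech-to-text model on the excerpt.
    return "..."

# Segment the recording into speaker turns (diarization).
# (Newer pyannote releases require a Hugging Face auth token here.)
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("meeting.wav")

# Send each turn to that speaker's model and assemble the transcript.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    text = transcribe_for_speaker("meeting.wav", turn.start, turn.end, speaker)
    print(f"[{turn.start:7.1f}s - {turn.end:7.1f}s] {speaker}: {text}")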

Detecting ads in audio streams?

I have never tried this, but I'm curious whether there is any possibility of detecting ads in audio streams? I mean apart from machine learning or the like. Are there any specifics of the byte stream during adverts, maybe some kind of different loudness value?
From a purely audio standpoint, this isn't possible. There is nothing distinguishable between an advertisement and other audio content. Sure, you could argue that a station playing music will have different spectral characteristics than when talking comes on for an advertisement, but what about ads that also play music? How do you distinguish between an announcer and someone reading an ad? What if the ad is embedded in normal content?
Now, some stations do provide metadata which occasionally contains ad information. If you look at the length of a particular content item, ads are usually going to be under a minute, or even 30 seconds. How you get this metadata and deal with it depends on the kind of stream you're working with.
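As a rough illustration of the metadata route for Shoutcast/Icecast-style streams: the ICY protocol interleaves a StreamTitle block into the audio, so you can time how long each title lasts and flag very short items. This is only a sketch (the stream URL is hypothetical, the 60-second cut-off is just the rule of thumb above, and some servers answer with a non-standard "ICY 200 OK" status line that may need extra handling):

import re
import time
import requests

STREAM_URL = "http://example.com/stream"   # hypothetical stream URL

def icy_titles(url):
    # Yield (title, seconds_it_lasted) each time the stream's metadata changes.
    resp = requests.get(url, headers={"Icy-MetaData": "1"}, stream=True)
    metaint = int(resp.headers["icy-metaint"])   # audio bytes between metadata blocks
    raw = resp.raw
    current, started = None, time.monotonic()
    while True:
        raw.read(metaint)                        # skip the audio payload
        length = raw.read(1)[0] * 16             # metadata block length
        block = raw.read(length).rstrip(b"\0").decode("utf-8", errors="replace")
        match = re.search(r"StreamTitle='([^']*)';", block)
        if match and match.group(1) != current:
            now = time.monotonic()
            if current is not None:
                yield current, now - started
            current, started = match.group(1), now

for title, seconds in icy_titles(STREAM_URL):
    label = "likely ad" if seconds < 60 else "content"
    print(f"{label}: {title!r} lasted {seconds:.0f}s")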
There are techniques emerging to do this and they tend to leverage databases of known adverts to get around the theoretical problems that Brad correctly highlights in his answer.
One of the references below, however, uses a technique based on detecting slight differences in the audio when an ad starts as the initial detection trigger.
Some techniques also use both audio and visual streams to aid detection - for example, the Google paper below uses audio matching first and then the video to validate/verify the match.
Some sources that might be worth looking at for anyone interested in this area (I realise it is an old question but it is still topical):
http://www.xavieranguera.com/papers/cimca_2008.pdf
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/55.pdf
https://www.audiblemagic.com/wp-content/uploads/2014/02/ad_detection_datasheet_150406.pdf
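As a toy illustration of the known-adverts-database idea, plain cross-correlation against a recorded reference ad can already work on clean captures. The papers above use far more robust fingerprinting, so treat this only as a sketch (file names are made up, and the threshold needs tuning):

import numpy as np
import librosa
from scipy.signal import fftconvolve

SR = 8000   # a low sample rate keeps matching fast and is enough for detection

def load(path):
    y, _ = librosa.load(path, sr=SR, mono=True)
    return y / (np.max(np.abs(y)) + 1e-9)

ad = load("known_ad.wav")              # hypothetical reference advert
stream = load("captured_stream.wav")   # hypothetical recorded stream chunk

# Normalised cross-correlation: a sharp peak means the advert occurs in the stream.
corr = fftconvolve(stream, ad[::-1], mode="valid")
window_energy = fftconvolve(stream ** 2, np.ones(len(ad)), mode="valid")
corr /= np.linalg.norm(ad) * np.sqrt(np.abs(window_energy)) + 1e-9

peak = int(np.argmax(corr))
if corr[peak] > 0.6:                   # threshold needs tuning per station
    print(f"Ad detected at {peak / SR:.1f}s (score {corr[peak]:.2f})")
else:
    print("No match")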

CMU Sphinx for Voice/Speaker Recognition

I'm looking for a way to match a known data set, let's say a list of MP3 or WAV files, each of which is a sample of someone speaking. At this point I know file ABC is of Person X speaking.
I would then like to take another sample and do some voice matching to show whose voice it most likely is, given the known data set.
Also, I don't necessarily care what the person has said, as long as I can find a match, i.e. I don't need any transcribing or otherwise.
I'm aware CMU Sphinx doesn't do voice recognition, and it's primarily used for voice-to-text, but I have seen other systems, eg: the LIUM Speaker Diarization (http://cmusphinx.sourceforge.net/wiki/speakerdiarization) or the VoiceID project (https://code.google.com/p/voiceid/) which uses CMU as a base for this type of work.
If I am to use CMU, how can I do voice matching?
Also, if CMU Sphinx isn't the best framework, is there an alternate that's open source?
This is a subject complex enough for a PhD thesis. There are no good and reliable systems as of right now.
The task you're up for is a very complex one. How you should approach it depends on your situation.
Do you have a limited number of people? How many?
How much data do you have for each person?
If you have very few people to recognize, you may attempt something as simple as obtaining formants of those people and comparing them to a sample.
Otherwise, you have to contact some academics who work on the subject or jury-rig a solution of your own. Either way, as I said, it is a difficult problem.
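If you do want to try something yourself, the classic recipe (outside CMU Sphinx) is MFCC features plus one Gaussian mixture model per known speaker. Here is a rough sketch; the samples/PersonX/*.wav directory layout and file names are my own assumption:

import glob
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc(path):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T   # frames x coefficients

# Train one GMM per known speaker from that speaker's sample files.
models = {}
for person_dir in glob.glob("samples/*"):
    feats = np.vstack([mfcc(f) for f in glob.glob(person_dir + "/*.wav")])
    models[person_dir] = GaussianMixture(n_components=16, covariance_type="diag").fit(feats)

# Score an unknown sample against every model; highest average log-likelihood wins.
unknown = mfcc("unknown.wav")
best = max(models, key=lambda name: models[name].score(unknown))
print("Most likely speaker:", best)

With only a handful of speakers and a decent amount of audio per person this can work reasonably well, but it is nowhere near the reliability the problem ultimately demands.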

Suggestion for creating custom sound recognition software to toggle audio

I need to develop a program that toggles a particular audio track on or off when it recognizes a parrot scream or screech. The software would need to recognize a particular range of sounds and allow some variation within that range (as a parrot likely won't replicate its screeches EXACTLY each time).
Example: Bird screeches, no audio. Bird stops screeching for five seconds, audio track praising the bird plays. Regular chattering needs to be ignored completely, as it is not to be discouraged.
I've heard of Java libraries that have speech recognition with dictionaries built in, but the software would need to be taught the particular sounds that my particular parrot makes - not words or any random bird sound. In addition, as I mentioned above, it would need to allow for slight variation in the sound, as the screech will likely never be 100% identical to the recorded version.
What would be the best way to go about this/what language should I look into?
Edit: Alternatively (and perhaps this would be a simpler solution), is there a way to make the audio toggle based on the volume of the input? Then it wouldn't matter what kind of sound the parrot makes, just how loud it is.
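For the loudness-based variant, I imagine something as simple as the sketch below (using Python's sounddevice library purely as an illustration; the threshold and timing would obviously need tuning to my bird):

import time
import numpy as np
import sounddevice as sd

THRESHOLD = 0.1      # RMS level treated as a screech (needs tuning)
QUIET_SECONDS = 5    # how long the bird must stay quiet before the praise track plays

last_loud = time.monotonic()
praise_played = False

def callback(indata, frames, time_info, status):
    global last_loud, praise_played
    rms = float(np.sqrt(np.mean(indata ** 2)))
    if rms > THRESHOLD:
        last_loud = time.monotonic()
        praise_played = False            # screeching again, re-arm the praise track

with sd.InputStream(channels=1, callback=callback):
    while True:
        if not praise_played and time.monotonic() - last_loud > QUIET_SECONDS:
            print("play praise track here")   # e.g. hand off to simpleaudio/pygame
            praise_played = True
        time.sleep(0.1)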
This question seems to be closely related to voice recognition. I would recommend taking a look at this post: How to convert human voice into digital format?

HOW-TO Make computer sing

I'm trying to develop an online application where the user writes some text and the software sings it back to the user.
I can currently generate the audio file with the words spoken by the computer using espeak, but I have no idea how to make it sound like a song or how to add rhythm to it.
I'm able to change the pitch and tempo using rubberband, but that's as far as I've gotten.
Does anyone have a clue how to make this happen?
If you want to use rubberband to change duration and pitch, then I think the hard part is going to be mapping from phonemes/syllables in the text to corresponding audio ranges in the speech synthesis output, for which I have no simple suggestion. (Ideally you'd get inside the speech synthesiser so that it would provide you with the mapping from phonemes to audio locations.)
A simpler alternative might be to try Speech Synthesis Markup Language (SSML). Its <prosody> element has "pitch" and "duration" attributes that can specify absolute pitch in Hz and duration in seconds. You can also specify volume, for controlling dynamics.
Given this, you could try to convert the text into an SSML document and mark up words/syllables/phonemes with pitch, duration, and volume attributes.
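For example, the conversion step might look something like this rough sketch (the note list is invented, and how faithfully the pitch/duration attributes are honoured depends entirely on the synthesiser you feed the document to):

from xml.sax.saxutils import escape

# (syllable, pitch in Hz, duration in seconds) - a made-up melody for illustration.
notes = [("twin", 262, 0.4), ("kle", 262, 0.4), ("twin", 392, 0.4), ("kle", 392, 0.4)]

parts = ["<speak>"]
for syllable, hz, secs in notes:
    parts.append(
        f'<prosody pitch="{hz}Hz" duration="{secs}s" volume="+0dB">{escape(syllable)}</prosody>'
    )
parts.append("</speak>")

print("\n".join(parts))   # feed the resulting SSML to an SSML-capable synthesiser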
I've ended up using Festival's singing mode. It sounds reasonably good, except for the fact that it only works with English voices.
