I'm trying to develop an online application where the user writes some text and the software sings it back to the user.
I can currently generate an audio file of the words spoken by the computer using espeak, but I have no idea how to make it sound like a song or how to add rhythm to it.
I'm able to change the pitch and tempo using rubberband, but that's as far as I've gotten.
Does anyone have a clue how to make this happen?
If you want to use rubberband to change duration and pitch, then I think the hard part is going to be mapping from phonemes/syllables in the text to the corresponding audio ranges in the speech synthesis output, for which I have no simple suggestion. (Ideally you'd get inside the speech synthesiser so that it would provide you with the mapping from phonemes to audio locations.)
A simpler alternative might be to try Speech Synthesis Markup Language (SSML). Its prosody element has "pitch" and "duration" attributes that can specify absolute pitch in Hz and duration in seconds. You can also specify volume, for controlling dynamics.
Given this, you could try to convert the text into an SSML document and mark up words/syllables/phonemes with pitch, duration, and volume attributes.
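For example, here is a hypothetical fragment for the first notes of "Twinkle Twinkle" (392 Hz is roughly G4 and 587 Hz roughly D5; actual attribute support varies by synthesizer, so treat this as a sketch):

<speak>
  <prosody pitch="392Hz" duration="500ms" volume="loud">twin</prosody>
  <prosody pitch="392Hz" duration="500ms">kle</prosody>
  <prosody pitch="587Hz" duration="500ms">twin</prosody>
  <prosody pitch="587Hz" duration="500ms">kle</prosody>
</speak>

Each syllable becomes, in effect, one note: a fixed pitch held for a fixed duration.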
I've ended up using Festival's singing mode. It sounds reasonably good, except for the fact that it only works with English voices.
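For reference, Festival's singing mode takes an XML score. A minimal sketch, modelled on the example song files shipped with the singing mode (element names and the exact invocation may differ in your Festival version):

<?xml version="1.0"?>
<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN" "Singing.v0_1.dtd">
<SINGING BPM="60">
  <PITCH NOTE="G4"><DURATION BEATS="0.5">twin</DURATION></PITCH>
  <PITCH NOTE="G4"><DURATION BEATS="0.5">kle</DURATION></PITCH>
  <PITCH NOTE="D5"><DURATION BEATS="0.5">twin</DURATION></PITCH>
  <PITCH NOTE="D5"><DURATION BEATS="0.5">kle</DURATION></PITCH>
</SINGING>

Rendering should then be something like: text2wave -mode singing -o song.wav song.xml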
If you say "Alexa, sing for me", she will choose one of several songs that have been created with her voice. The voice(s) for each of these songs must have been created somehow.
At first, I thought that SSML would provide the tools necessary to do this, especially the <prosody> tag which has parameters for pitch and rate (duration).
I thought perhaps each syllable of singing could have its pronunciation specified with <phoneme> and its pitch and duration specified with <prosody>, with <break> tags in between:
<speak>
  <prosody rate="20%">
    <phoneme alphabet="x-sampa" ph="U">oo</phoneme>
    <break strength="none" />
  </prosody>
  <prosody rate="20%" pitch="+50%">
    <phoneme alphabet="x-sampa" ph="U">oo</phoneme>
    <break strength="none" />
  </prosody>
  <prosody rate="20%">
    <phoneme alphabet="x-sampa" ph="U">oo</phoneme>
  </prosody>
</speak>
However, when executed, Alexa applies her built-in inflection (to sound like a real human), so the tone is not flat. These "ooh" sounds (above), for example, each have a falling tone. (They also have a noticeable break between phonemes even though "no break" was explicitly specified.)
So then, how did the Alexa voice which is heard singing all of those songs get programmed? Was it via tools currently only available to Amazon developers?
It's also perplexing to me that I am apparently the only person on the internet even asking this question (based on zero results on Stack Overflow, Google, etc.), especially this late in the game. Aren't there loads of musicians out there who would love to be able to make Alexa sing whatever they want?
Edit: Guys, I thought it was common knowledge, but there is no human voice actor behind Alexa. Her voice is completely computer-generated.
Alexa's voice is completely computer generated, and so are the songs. Research is ongoing into singing synthesizer models (#1 and #2).
Here's a video by Popgun Labs about how they make their AI sing. Although I am unable to find out how Amazon and Google do this, my guess is that it's something similar.
EDIT: My earlier answer was based on an extension page and drew incorrect conclusions.
My prediction would be either something really fancy like natural language processing or AI/ML along those lines, or they just had a voice sing something (or sing particular tones) and cut the recordings together. I don't own an Alexa, but I do have a HomePod mini and an iPhone, and from the way it pronounces our local singers' names, like "Sidhu Moosewala" or "Amrit Maan" (off topic, but still related), I believe they just cut and put together words in a "clean" and "flowing" way.
Perhaps her voice is simply autotuned.
Certainly, pitch-shifting tools can force any desired pitch from any audio source, and I presume such tools can force duration changes as well.
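As a concrete sketch, the rubberband command-line tool mentioned earlier can do both at once (double-check the flags against rubberband --help for your version). Assuming an input file in.wav:

rubberband -p 3 -t 1.5 in.wav out.wav

should shift the pitch up three semitones and stretch the result to 1.5 times its original duration.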
I'm looking for a program that can recognize individual audio samples from my computer and reroute them to trigger WAV files from a library. For my project this would need to happen in real time, since latency is not desirable. I've tried dictation software that recognizes words to trigger opening a file, and that's the direction I want to go, except triggered by sounds rather than words, and in real time. I'm not sure where to go from here and am just looking for some guidance. Does anyone have any suggestions of what I should do?
That's a fairly broad question, but I can tell you how I would do it. (Hardly the only way, but where I would start.)
If you're looking for real-time input, the Java Sound library (excellent tutorial here) allows for that. (Just note that microphone input from a web page is difficult everywhere, due to major security concerns, so this would be a desktop application.)
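A minimal sketch of real-time capture with Java Sound (the format and buffer size here are just reasonable defaults, not requirements):

import javax.sound.sampled.*;

public class MicCapture {
    public static void main(String[] args) throws LineUnavailableException {
        // 44.1 kHz, 16-bit, mono, signed, little-endian PCM
        AudioFormat format = new AudioFormat(44100f, 16, 1, true, false);
        TargetDataLine line = AudioSystem.getTargetDataLine(format);
        line.open(format);
        line.start();

        byte[] buffer = new byte[4096];
        while (true) {
            int n = line.read(buffer, 0, buffer.length); // blocks until filled
            // hand buffer[0..n) off to an analysis thread here
        }
    }
}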
If it needs to be real time, the first thing I would suggest is to stream and multithread the hell out of it. I would suggest the Java 8 Stream API, but since you're looking for subsamples that match a specific pattern, each data point will have to be aware of the state of its neighbors, and that isn't easy with streams.
You will probably want to know if a sound roughly resembles an audio profile, so for that, I would pick a tolerance on just how close you want it to be for a match (remembering that samples may not line up 100% anyway, so "exact" is not an option), and then look up Hidden Markov Models. I suggest these because they're what voice recognition software typically uses, and while your sounds may not be voices, it will give you an idea of what has already been done.
You'll also want to maintain a limited list of audio samples in memory. Specifically, you will likely need the most recent data, because an audio signal is a time-variant signal, and you can't get a match from just one point. I wouldn't make it much longer than the longest sample you're looking to recognize, as audio takes up a boatload of memory.
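That "most recent data" window can be a simple fixed-size ring buffer; a sketch:

/** Fixed-size ring buffer holding the most recent audio samples. */
class RecentAudio {
    private final short[] samples;   // capacity ~ your longest target sample
    private int writePos = 0;

    RecentAudio(int capacity) { samples = new short[capacity]; }

    // overwrite the oldest sample with the newest
    void push(short s) {
        samples[writePos] = s;
        writePos = (writePos + 1) % samples.length;
    }

    // oldest-to-newest copy, suitable for pattern matching
    short[] snapshot() {
        short[] out = new short[samples.length];
        for (int i = 0; i < out.length; i++)
            out[i] = samples[(writePos + i) % samples.length];
        return out;
    }
}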
Lastly (for audio), I would recommend picking a standard format for comparison. Make it only as high-quality as you need to get decent results, and start high. You will want to convert everything to that format before you compare it.
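Java Sound can handle that conversion; a sketch, assuming 44.1 kHz 16-bit mono as the standard format (whether a given conversion is supported depends on the JVM's installed codecs):

import java.io.File;
import javax.sound.sampled.*;

static AudioInputStream toStandardFormat(File f) throws Exception {
    AudioInputStream in = AudioSystem.getAudioInputStream(f);
    // the "standard" everything gets compared in
    AudioFormat target = new AudioFormat(44100f, 16, 1, true, false);
    return AudioSystem.getAudioInputStream(target, in);
}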
Once you recognize a specific sound, it's basically the Command Pattern. Specific sounds can be mapped, even with a java.util.HashMap, to specific files, which (if there are few enough) you might even preload.
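In code, the mapping can be as simple as the sketch below (the labels and file paths are made up, and recognition itself is assumed to happen elsewhere):

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import javax.sound.sampled.*;

public class Triggers {
    // map recognized sound labels to the WAV files they should fire
    private final Map<String, File> clips = new HashMap<>();

    public Triggers() {
        clips.put("snare", new File("samples/snare.wav"));
        clips.put("kick",  new File("samples/kick.wav"));
    }

    /** Play the clip mapped to a recognized label, if any. */
    public void fire(String label) throws Exception {
        File f = clips.get(label);
        if (f == null) return;
        Clip clip = AudioSystem.getClip();
        clip.open(AudioSystem.getAudioInputStream(f));
        clip.start(); // playback is asynchronous
    }
}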
Lastly, it's worth looking at the Java Speech API. It's not part of the JDK and it's quite dated, but you might get some good advice from its implementation.
This is of course the advice of a Java-preferring programmer, but I imagine there are decent libraries in Python and Ruby to help you as well, and of course there's something in C somewhere. This may sound like a lot, but most of the material is already implemented and ready to go.
Hopefully this helps; let's look forward to other answers.
I am new here, so sorry in advance if I make any mistakes!
Problem: I need to analyze this music .wav file, particularly for its frequency, amplitude, and pitch over specific intervals of time.
Is there any easy to use software and steps I can take that can help me accomplish this?
I have tried Audacity, Sonic Visualiser, and SigView, but I am unsure how to use these programs appropriately to achieve my specific goal.
Thanks in advance!
Praat is good for these kinds of things. It was specifically designed for speech research, but it can be (and has been) used for analysing music as well.
It has a scripting language that allows for automation, and can analyse the things you mention for specific intervals or for the whole sound. Take a look at the documentation, specifically the sections on Pitch, Intensity, and spectral analysis.
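As an untested sketch of that scripting language (the commands follow current Praat syntax, but verify the arguments against the documentation for your version), a script reporting mean pitch and intensity might look like:

sound = Read from file: "song.wav"
selectObject: sound
pitch = To Pitch: 0, 75, 600
meanF0 = Get mean: 0, 0, "Hertz"
selectObject: sound
intensity = To Intensity: 75, 0, "yes"
meanDb = Get mean: 0, 0, "energy"
writeInfoLine: "Mean F0: ", meanF0, " Hz; mean intensity: ", meanDb, " dB"

Passing 0, 0 as the time range analyses the whole sound; pass real start and end times to restrict the query to a specific interval.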
SigView is super easy to use and is your best bet if you want to be the most scientific about it (without investing too much money or time).
To use SigView, drag and drop an audio file into it. Highlight the portion of the waveform you're interested in, right-click, and select 'open selection in new window'. When you're looking at the portion of the waveform you're interested in, hit Ctrl-F and it will perform an FFT on the segment. Right-click and select 'show peaks' to see the amplitude and frequency of each peak.
I was wondering if there is a tool similar to jCrop, with the exception that instead of an image, it would let the user crop an audio file? Sadly, Google didn't give me any useful results.
The reason I'm asking is that I'm making a tool to convert audio files to popular ringtone formats, and only letting the user specify the offsets as numbers is somewhat inconvenient. Obviously the tool doesn't have to be in JavaScript; anything that fits into a website is OK.
Here's a browser-based audio editor written in Flash that you could probably adapt (it supports cropping):
http://www.hisschemoller.com/2010/audio-editor-1-0/
One thing I found a bit confusing is that you have to hold down the play button in the editor to play the full sound.
I need to develop a program that toggles a particular audio track on or off when it recognizes a parrot scream or screech. The software would need to recognize a particular range of sounds and allow some variation within that range (as a parrot likely won't replicate its screeches EXACTLY each time).
Example: Bird screeches, no audio. Bird stops screeching for five seconds, an audio track praising the bird plays. Regular chattering needs to be ignored completely, as it is not to be discouraged.
I've heard of Java libraries that have speech recognition with built-in dictionaries, but the software would need to be taught the particular sounds that my particular parrot makes, not words or any random bird sound. In addition, as I mentioned above, it would need to allow for slight variation in the sound, as the screech will likely never be 100% identical to the recorded version.
What would be the best way to go about this/what language should I look into?
Edit: Alternatively (and perhaps this would be a more simple solution), is there a way to make the audio toggle based on the volume of input? So it wouldn't matter what kind of sound the parrot makes, just how loud it is?
This question seems to be tightly related to voice recognition. I would recommend taking a look at this post: How to convert human voice into digital format?
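For the volume-based alternative mentioned in the question's edit, you don't need recognition at all; just watch the input level. An untested Java sketch (the threshold and timings are made-up values you would tune by ear):

import javax.sound.sampled.*;

public class ScreechGate {
    public static void main(String[] args) throws Exception {
        AudioFormat fmt = new AudioFormat(44100f, 16, 1, true, false);
        TargetDataLine mic = AudioSystem.getTargetDataLine(fmt);
        mic.open(fmt);
        mic.start();

        byte[] buf = new byte[4410 * 2];   // ~100 ms of 16-bit mono audio
        double threshold = 0.3;            // made-up level; quiet chatter stays below it
        boolean screeched = false;
        long quietSince = System.currentTimeMillis();

        while (true) {
            int n = mic.read(buf, 0, buf.length);
            double sum = 0;
            for (int i = 0; i + 1 < n; i += 2) {
                // assemble a signed 16-bit little-endian sample
                int s = (buf[i + 1] << 8) | (buf[i] & 0xff);
                double x = s / 32768.0;
                sum += x * x;
            }
            double rms = Math.sqrt(sum / (n / 2.0));
            long now = System.currentTimeMillis();
            if (rms > threshold) {                       // screech in progress
                screeched = true;
                quietSince = now;
            } else if (screeched && now - quietSince > 5000) {
                System.out.println("play praise track here"); // 5 s of quiet after a screech
                screeched = false;
            }
        }
    }
}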