I have always wondered how many different search techniques exist for searching text, images, and even videos.
However, I have never come across a solution that searched for content within audio files.
For example: let us assume that I have about 200 podcasts downloaded to my PC as mp3, wav, and ogg files. They are all named generically, say podcast1.mp3, podcast2.mp3, etc., so it is not possible to know what the content is without actually listening to them. Let's say I am interested in finding out which of the podcasts talk about 'game programming'. I want the results to be shown as:
Podcast1.mp3 - 3 result(s) at time index(es) - 0:16:21, 0:43:45, 1:12:31
Podcast21.ogg - 1 result(s) at time index(es) - 0:12:01
So my questions:
How could one approach this problem?
Are there suitable algorithms for doing something like this?
One idea that cropped up in my mind was that one could use 'speech-to-text' software to get transcripts along with time indexes for each of the audio files, then parse the transcripts to get the output.
I was considering this as one of my hobby projects.
Thanks!
If you want to search for text (i.e. what is being said) inside an audio stream, you would have to process it with some kind of speech recognition algorithm and store the text as metadata associated with the files. For video you could also do text recognition for text inside the video. Evernote already does this for text inside image files, but has no support for audio as far as I know.
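As a rough sketch of that pipeline, here is what the "transcribe, then search" idea could look like in Python using the Vosk offline recognizer (just one option among several; the model path, the search phrase, and the mono 16-bit PCM WAV input are all assumptions):

    import json, wave
    from vosk import Model, KaldiRecognizer

    model = Model("model")                    # path to a downloaded Vosk model
    wf = wave.open("podcast1.wav", "rb")      # Vosk wants mono 16-bit PCM
    rec = KaldiRecognizer(model, wf.getframerate())
    rec.SetWords(True)                        # ask for per-word timestamps

    words = []
    while True:
        data = wf.readframes(4000)
        if not data:
            break
        if rec.AcceptWaveform(data):
            words.extend(json.loads(rec.Result()).get("result", []))
    words.extend(json.loads(rec.FinalResult()).get("result", []))

    # each entry looks like {"word": "game", "start": 981.3, "end": 981.7, ...}
    hits = [w["start"] for w, nxt in zip(words, words[1:])
            if w["word"] == "game" and nxt["word"] == "programming"]
    for t in hits:
        print("match at %d:%02d:%02d" % (t // 3600, t % 3600 // 60, t % 60))

Since the transcript only has to be computed once per file, subsequent searches are fast.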
Something similar is possible when using audio to search for audio. I don't know the details of these algorithms, but I'm guessing they involve some kind of frequency analysis. Shazam uses this kind of technology to identify songs from short audio clips.
Here are some Wikipedia articles that may be useful:
Speech recognition
Fast Fourier transform
Frequency analysis (frequency spectrum)
Optical character recognition (OCR)
I want to create software that can convert readable text (non-English) to audio output.
After some searching, I have realized that most existing audio readers sound too robotic and lack human-like speech effects.
I am looking for an algorithm or paper that can give me some idea of how to proceed with implementing such a thing.
or
Does anyone know how some of the world's best text-reader software works?
My expectations are:
Less robotic, more human-like sound
High-quality output
Lightweight, yet fast processing speed
Please edit this question if you think some points are missing on this aspect.
Some small steps might give you a basic idea of what happens:
1. Create a dictionary of words, each word with its name and its recorded sound.
2. Create your own signal processor; this will let you add effects to the sound, in case you want a robotic version, a female version, or something else.
3. Parse the text you want to read, splitting it into words and punctuation to form an array. E.g. "I want to die, this isn't a correct way to live." becomes {I:want:to:die:,:this:isn't:a:correct:way:to:live:.} (a code sketch of steps 3-6 follows at the end of this answer).
4. Use the punctuation to implement lifelike pauses: ',' for a short pause and '.' for a longer pause in your audio reader.
5. Use the words to look up the audio in the dictionary from step 1.
6. Play the whole array continuously, with a pause between each array element; this works like the spaces between words.
I think these are the major steps. To make it faster you can use advanced sound-processing tools to cache small chunks of sound data and add data on the fly while you are modulating the sound signals.
Hope this helps.
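A minimal sketch of steps 3-6 in Python (entirely hypothetical: it assumes a directory of per-word WAV clips named after each word, all recorded at the same sample rate in 16-bit mono):

    import re
    import numpy as np
    from scipy.io import wavfile

    RATE = 22050                              # assumed rate shared by all clips
    PAUSES = {",": 0.2, ".": 0.5, " ": 0.05}  # pause lengths in seconds

    def speak(text, clip_dir="clips"):
        # step 3: split into words and punctuation
        tokens = re.findall(r"[\w']+|[.,!?]", text)
        pieces = []
        for tok in tokens:
            if tok in PAUSES:
                # step 4: punctuation becomes silence of the right length
                pieces.append(np.zeros(int(RATE * PAUSES[tok]), dtype=np.int16))
            else:
                # step 5: look the word up in the recorded dictionary
                _, clip = wavfile.read("%s/%s.wav" % (clip_dir, tok.lower()))
                pieces.append(clip)
                # step 6: a short gap between words, like a space
                pieces.append(np.zeros(int(RATE * PAUSES[" "]), dtype=np.int16))
        wavfile.write("out.wav", RATE, np.concatenate(pieces))

    speak("I want to die, this isn't a correct way to live.")

The signal-processing step (2) would go between reading each clip and appending it.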
It would be nice if you could tell us what kind of app you'll create (mobile, web, desktop) and also what language you'll develop it in (PHP, Java, C++, etc.), because if you search on Google you'll find a lot of website plugins that convert text to audio; you can download them and look at the code.
Also, it's hard to find an app that doesn't sound like a robot, and if you do find one you'll probably have to pay for it.
The "robotic" aspect of text to speech that you are concerned about is a matter of the quality of "prosody". This is an active research area. You could probably get a PhD for working on improving prosody in TTS systems. If you would like to read about current research you can try searching for "improving prosody in text to speech".
A big part of the problem is having an accurate model of speech prosody in a given language. The thesis "MeLos: Analysis and Modelling of Speech Prosody and Speaking Style" by Nicolas Obin (2012) contains a survey of the state of the art in speech prosody modelling. Or try searching for "text to speech prosody survey state of the art".
Let's consider a variation of the "WAV to MIDI" conversion problem. I'm aware of the complexity of such a problem, and I know that a vast literature exists on the more general subject of Music Information Retrieval (MIR).
But let's suppose here that we already have both the WAV and the MIDI representation of a music piece, so we don't actually have to discover the pitches inside the WAV signal from scratch... we "just" have to match the pitches detected (using a suitable algorithm) with the NoteOn events contained in the MIDI representation. I definitely suppose we should use the information contained in the MIDI file to give hints to the pitch detection algorithm.
Such a matching tool could be very useful, for example for MIDI "humanization": we could make the MIDI representation more expressive using the information retrieved from the WAV signal to "fine tune" note onsets, durations, dynamics, etc...
Does anybody know if such a problem has already been addressed in literature?
Any form of contribution or assistance will be greatly appreciated.
Thanks in advance.
At the 2010 Music Hackday in London some people used the MATCH Vamp plugin to align scores to YouTube videos. It was very impressive! Maybe their source code could be of use. I don't know how well MATCH works on audio generated from MIDI files, but that could be worth a try. Here's a link: http://wiki.musichackday.org/index.php?title=Auto_Score_Tubing
This guy appears to have done something similar: http://www.musanim.com/wavalign/ His results are definitely interesting.
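If you want to experiment yourself, here is a rough sketch of one common alignment approach (chroma features plus dynamic time warping), assuming the Python libraries librosa and pretty_midi; the file names are placeholders:

    import numpy as np
    import librosa
    import pretty_midi

    HOP = 512
    y, sr = librosa.load("performance.wav")
    pm = pretty_midi.PrettyMIDI("score.mid")

    # chroma features for both representations, at the same frame rate
    chroma_wav = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=HOP)
    chroma_mid = pm.get_chroma(fs=sr / HOP)

    # dynamic time warping gives a frame-to-frame correspondence
    D, wp = librosa.sequence.dtw(X=chroma_mid, Y=chroma_wav)
    wp = np.asarray(wp)[::-1]                 # (midi_frame, wav_frame) pairs

    def midi_time_to_wav_time(t):
        """Map a NoteOn time in the MIDI file to a time in the recording."""
        frame = int(t * sr / HOP)
        idx = np.argmin(np.abs(wp[:, 0] - frame))
        return wp[idx, 1] * HOP / sr

    for note in pm.instruments[0].notes:
        print("%3d  %7.3f -> %7.3f" % (note.pitch, note.start,
                                       midi_time_to_wav_time(note.start)))

The warped onset times (and the signal's local energy around them) are exactly the kind of data you could feed back into the MIDI file for "humanization".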
This seems like an interesting idea. What are you trying to do: just match the notes' pitches, or do you have something else in mind?
One possible thing you could look into: if you know the note (as an integer value, I think; it's been a while) that will be passed into the noteOn method, you may be able to use that to map it to the WAV signal. It depends on what you are trying to do.
Also, there are some things you could play around with in (I think it is called) the MIDI controller, such as modulation, pitch, volume, pan, or playing a couple of notes simultaneously. What you could do with this is have a background thread that changes some of those effects while a note is being played. For example, you could have a note get quieter the longer it is played, or have a note pan between the left and right speakers, etc.
I haven't really played with this code in a long time, but there are some examples of using a MIDI controller.
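For instance, the "note gets quieter the longer it is played" idea could look like this with the mido Python library (my assumption; the noteOn/controlChange calls in javax.sound.midi are analogous), sending a channel-volume control change while the note sounds:

    import time
    import mido

    out = mido.open_output()                  # assumes a synth is listening
    out.send(mido.Message('note_on', note=60, velocity=100))
    for vol in range(127, -1, -8):
        # controller 7 is channel volume: fade the note out over ~1.6 s
        out.send(mido.Message('control_change', control=7, value=vol))
        time.sleep(0.1)
    out.send(mido.Message('note_off', note=60))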
At my high school we can take a class where we basically learn about a subject on our own for a semester. I was thinking that I want to learn about "sound programming," but I realized that I have no idea what that entails. I'm interested in learning about, for example, how a synthesizer works and how sound works in computer science. I really want to focus on the low-level code part, not so much the composition part. Is this a feasible subject? Are there any good tutorials out there for somebody completely new to this?
I know C++ and am using Windows. The first answer in this is something that interests me (although it's over my head).
"Sound programming" is a very broad field. First of all, it is definitely a feasible subject, but since you need to cram stuff into a single semester you will need to limit your scope. I can see that you're looking for a place to start, so here are some ideas to get you thinking.
Since you have mentioned both "how sound works in computer science" and "synthesizers", it's worth pointing out the difference between analogue sound, sampled sound and synthesized sound, as they are different concepts. I'll explain them briefly here.
Analogue sound is sound as we humans typically interpret it -- vibrations of air sensed by the human ear. You can think of sound as a one-dimensional signal, where the independent variable is time and the dependent variable is amplitude of vibration. Analogue sound is continuous both in the time and amplitude domain. Older sound recording methods (e.g. magnetic tape) used an analogue sound representation. Analogue sound is not frequently used with computers (computers aren't good with storing continuous-domain data), but understanding analogue signals is important nevertheless. Expect to see plenty of math (e.g. complex numbers, Fourier transforms) if you go down this path.
Sampled sound is the sound representation that lends itself well to processing with a computer. People are most familiar with sampled sound through CDs and other musical recordings. An analogue signal is sampled at some frequency (e.g. 44.1 kHz for CD recording), so a sampled sound signal is discrete in the time domain. If the signal is quantized, then it is discrete in the amplitude domain as well. Formats like MP3 are sampled formats. There are lots of things to study in this field if you're interested, such as restoration (removing static, etc.) and compression (again, codecs like MP3 and Ogg Vorbis). It's a lot of fun because there's plenty to experiment with and code.
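To make the sampling and quantization steps concrete, here is a tiny sketch (NumPy assumed) that builds one second of CD-style audio from scratch:

    import numpy as np

    RATE = 44100                     # samples per second, as on a CD
    t = np.arange(RATE) / RATE       # one second, discrete time axis

    # sample a 440 Hz sine wave (concert A) at the CD rate...
    signal = np.sin(2 * np.pi * 440 * t)
    # ...then quantize each sample to a 16-bit integer, as CD audio does
    quantized = np.round(signal * 32767).astype(np.int16)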
Both analogue and sampled sound dig deeply into a field called Digital Signal Processing. Google around for that to get a feel of what it's like. It's often taught as a course at universities, so if you're really keen you can have a look at some lecture slides or even try some of the earlier, simpler projects.
Synthesized sound is a representation that is suited for reproduction of a music track, where the instruments playing the track are known beforehand. Think of it as sheet music for the computer. Somebody has to write the sheet music -- you can't just record it like analogue or sampled sound. This makes synthesized sound a completely different representation from analogue sound and sampled sound. Also, the computer needs to know what the instruments are (e.g. piano) so that it can play (synthesize) the track. If it doesn't know the instrument, it either gives up or picks a close match (e.g. replaces the piano with an electric keyboard). I have never worked with synthesizers before, so I can't comment on the learning curve for them.
So, based on what I wrote -- pick a direction that interests you more, Google around and then refine your question.
EDIT
A good book to read is this. You can probably look around related titles in Amazon and find something newer, but it's been a while since I did my audio processing shopping.
And if you have half an hour to spare, then have a look at this video tutorial. It covers sound, image and video processing -- they're actually closely related fields.
Consider working through the book "Who Is Fourier?: A Mathematical Adventure". You could adapt the examples to make small programming assignments that demonstrate the basic concepts. After you're done you should be able to use the fft to make a spectrogram of your voice as you pronounce the vowels a,e,i,o,u -- identifying the fundamental frequency and the formants of each vowel.
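As a starting point for that exercise, a spectrogram is only a few lines with SciPy and matplotlib (the WAV file name is a placeholder for your own recording):

    import numpy as np
    from scipy.io import wavfile
    from scipy import signal
    import matplotlib.pyplot as plt

    rate, voice = wavfile.read("vowel_a.wav")   # a short mono recording of "a"
    f, t, Sxx = signal.spectrogram(voice, fs=rate, nperseg=1024)
    plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12))  # dB scale
    plt.ylim(0, 4000)   # fundamental and first formants sit well below 4 kHz
    plt.xlabel("time [s]")
    plt.ylabel("frequency [Hz]")
    plt.show()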
I recommend learning Python and the modules NumPy, SciPy, and matplotlib (there's a ton there, so beyond the basic tutorials, just learn as you go). The iPython shell has the option "-pylab -p scipy" to automatically import the most common tools into your namespace. You can record and play audio using PyAudio. There's also Pygame, which expands on SDL (Simple DirectMedia layer), and pyglet, which uses OpenAL (the OpenGL of audio; it does 3D audio and effects).
As to C/C++, there's IT++, SPUC, and FFTW for signal processing, and SDL/SDL_mixer and OpenAL/ALmixer for interfacing with hardware and audio files.
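For example, recording a few seconds from the microphone with PyAudio and saving it as a WAV file looks roughly like this:

    import wave
    import pyaudio

    RATE, CHUNK, SECONDS = 44100, 1024, 3
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    frames = [stream.read(CHUNK) for _ in range(RATE * SECONDS // CHUNK)]
    stream.stop_stream()
    stream.close()
    pa.terminate()

    wf = wave.open("recording.wav", "wb")
    wf.setnchannels(1)
    wf.setsampwidth(pa.get_sample_size(pyaudio.paInt16))
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))
    wf.close()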
I would recommend this book: http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=8218 (part of it is available here: http://books.google.com/books?id=nZ-TetwzVcIC&printsec=frontcover&dq=computer+musical+tutorial&hl=pt-BR&ei=D-dKTaKsBMOB8gbF4KDcDg&sa=X&oi=book_result&ct=result&resnum=1&ved=0CDgQ6AEwAA#v=onepage&q=computer%20musical%20tutorial&f=false )
Another thing you could look at is Pure Data (http://puredata.info/), an open-source graphical environment for sound programming that's great for beginners.
I'm searching for ways to identify scores when someone is playing, e.g., guitar. How can I manage that?
I've heard that MIDI stores music data as musical scores. I wonder if it's a good solution.
MIDI does store musical scores, but it doesn't (normally) extract them from recorded sounds. You can't take an mp3 file and "convert it to MIDI", in a standard or entirely reliable way.
You create a MIDI file using a recorder (or "sequencer"), which might be a desktop application where you "write the score" like a composer does, or it might be a musical device like a keyboard, which records which keys you press, how hard and for how long, and interprets that as a score.
A MIDI player takes the data/score, and reproduces it using its own voice (or "sound font" if you like). So the advantage of MIDI data is firstly that the voice is already available on the playback device (and so the data is very compact), and second that the same data ("tune") can be played using different voices ("instruments")[*].
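A tiny sketch with the mido Python library (one MIDI library among many) shows how little data a "score" actually is:

    import mido

    mid = mido.MidiFile()
    track = mido.MidiTrack()
    mid.tracks.append(track)

    # "which keys you press, how hard and for how long":
    # note 60 = middle C, velocity = how hard, time = delay in ticks
    track.append(mido.Message('note_on', note=60, velocity=80, time=0))
    track.append(mido.Message('note_off', note=60, velocity=0, time=480))
    mid.save("tune.mid")   # a few dozen bytes; the voice lives in the player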
I believe there are MIDI guitars, but I don't know how "good" they are. The tone of an electric guitar comes in part from resonances of the solid body. This could of course be imitated by the voice at playback time, but there are bound to be some things that you can do with an electric guitar but which the MIDI format cannot capture or represent (for example I'd guess feedback is impossible).
Software exists to extract MIDI data from recorded sound - this is a bit like the way OCR extracts ASCII character data from images of text. It's not a major means of recording someone's guitar-playing, but if what you want is to get a first approximation to the score/tabs, you could try it.
Here's a randomly-selected example, found by Googling "convert from wav to MIDI":
http://www.pluto.dti.ne.jp/~araki/amazingmidi/
[*] But members of the audience, you find yourselves wondering, "what is this mindless automaton which bangs out the tunes it's instructed to, without comprehension or any aesthetic sense". Ladies and gentlemen, Colin Sell at the piano.
Music recognition/retrieval is an extremely difficult and almost entirely unsolved AI problem. Try to extract the frequency from a signal file of someone playing a single unwavering note one time - it's much more difficult than just "apply Fourier transform, read off solution". Compound that with polyphony, noise, rubato, vibrato/portamento, plus the fact that (contrary to speech recognition) we don't even have a working a-priori model of what music actually is, and you begin to see the difficulty. There are absolutely fascinating research papers and even entire conferences on the topic, but in the short term, you're just plain out of luck.
Are you aware you are attempting something extremely difficult? It's a very complex topic you could spend years researching yourself or pay $$$$$$ for existing commercial solutions.
MIDI is a reasonable choice for your output format.
For the rest you will need Fast Fourier Transforms working off a high-resolution capture of the input analogue sounds plus at least seven years of musical theory.
Good luck.
If the player is playing in tune, there will be very distinct frequencies in the signal, or at least frequencies with a mathematical separation. It may be possible to characterise a signal using spectral analysis to distinguish music from noise; or at least melodic music from noise - avant-garde experimental music may not pass ;). The distinction may become more difficult with multiple instrumentalists, percussion, and non-standard or poor tuning; traditional Chinese or Indian music, for example, uses different scales than Western music.
Extracting the frequencies in the signal will require signal processing techniques such as the Fast Fourier Transform. Categorising the signal as music/not music could be done by statistical analysis, or by AI techniques such as neural networks or fuzzy logic.
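As a flavour of the simplest possible (monophonic, single-note) case, here is a sketch that picks the strongest FFT bin of a recording and maps it to the nearest MIDI note; note that for a real instrument the strongest bin may well be an overtone rather than the fundamental (the file name is a placeholder, mono input assumed):

    import numpy as np
    from scipy.io import wavfile

    rate, x = wavfile.read("single_note.wav")
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / rate)
    peak = freqs[np.argmax(spectrum)]            # strongest frequency bin

    # map frequency to the nearest MIDI note number (69 = A4 = 440 Hz)
    midi_note = int(round(69 + 12 * np.log2(peak / 440.0)))
    print("peak %.1f Hz -> MIDI note %d" % (peak, midi_note))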
Is there any free software tool or combination that allows me to identify the pitch of a recorded singing session?
The idea is to display some kind of graph with the current pitch in a time line along with markers for the standard notes (C3, C#3, D, etc). I don't need pitch correction and I don't need it to be done in real time, either.
I know that once there was a plugin for Rosegarden that did that, but it has gone missing.
Check out Audacity. It came out of a project to do musical pitch analysis.
Not exactly what you are looking for, but the Singstar lookalike Ultrastar-NG at least does something like this.
http://ultrastar-ng.sourceforge.net/
I'm unaware of any software package that has this built in. If you're interested in writing something like this, you'll want to look at Discrete Fourier Transforms. This turns a time-series sample into a collection of frequencies. But this leaves you with no information about when the various frequencies occur, so you must do a windowed Fourier Transform, with windows of whatever time-resolution you want. Increasing the time resolution decreases the frequency resolution, however.
The simplest thing to do is to find the largest frequency component in each window and call that the pitch. But real music (a) has chords and (b) has overtones and undertones. In addition, singing often has vibrato, where the singer varies the actual pitch around the theoretical pitch at which the music is marked.
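A naive windowed version of that "largest component per window" idea might look like this (mono WAV input assumed; real pitch trackers use autocorrelation or more robust methods precisely because of the chord/overtone problems mentioned above):

    import numpy as np
    from scipy.io import wavfile

    rate, x = wavfile.read("singing.wav")
    WIN = 4096                                   # ~0.09 s at 44.1 kHz

    names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    for start in range(0, len(x) - WIN, WIN // 2):   # 50% overlapping windows
        frame = x[start:start + WIN] * np.hanning(WIN)
        spectrum = np.abs(np.fft.rfft(frame))
        peak = np.fft.rfftfreq(WIN, 1.0 / rate)[np.argmax(spectrum)]
        if peak > 0:
            n = int(round(69 + 12 * np.log2(peak / 440.0)))  # nearest MIDI note
            print("%6.2f s  %6.1f Hz  %s%d"
                  % (start / rate, peak, names[n % 12], n // 12 - 1))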
Praat will at least do automatic pitch estimation of complex sounds, though I don't know whether it can mark the standard notes as you requested.
Rob