I want to learn audio programming [closed]

At my high school we can take a class where we basically learn about a subject on our own for a semester. I was thinking that I want to learn about "sound programming," but I realized that I have no idea what that entails. I'm interested in learning about, for example, how a synthesizer works and how sound works in computer science. I really want to focus on the low-level code part, not so much the composition part. Is this a feasible subject? Are there any good tutorials out there for somebody completely new to this?
I know C++ and am using Windows. The first answer in this is something that interests me (although it's over my head).

"Sound programming" is a very broad field. First of all, it is definitely a feasible subject, but since you need to cram stuff into a single semester you will need to limit your scope. I can see that you're looking for a place to start, so here are some ideas to get you thinking.
Since you have mentioned both "how sound works in computer science" and "synthesizers", it's worth pointing out the difference between analogue sound, sampled sound and synthesized sound, as they are different concepts. I'll explain them briefly here.
Analogue sound is sound as we humans typically interpret it -- vibrations of air sensed by the human ear. You can think of sound as a one-dimensional signal, where the independent variable is time and the dependent variable is amplitude of vibration. Analogue sound is continuous both in the time and amplitude domain. Older sound recording methods (e.g. magnetic tape) used an analogue sound representation. Analogue sound is not frequently used with computers (computers aren't good with storing continuous-domain data), but understanding analogue signals is important nevertheless. Expect to see plenty of math (e.g. complex numbers, Fourier transforms) if you go down this path.
Sampled sound is the sound representation that lends itself well to processing with a computer. People are most familiar with sampled sound through CDs and other musical recordings. An analogue signal is sampled at some frequency (e.g. 44.1 kHz for CD recording). So a sampled sound signal is discrete in the time domain. If the signal is quantized then it will be discrete in the amplitude domain as well. Formats like MP3 are sampled formats. There are lots of things to study in this field if you're interested, such as restoration (removing static, etc.) and compression (again, codecs like MP3 and Ogg Vorbis). It's a lot of fun because there's lots to experiment with and code.
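To make the sampling and quantization steps concrete, here is a minimal sketch of my own (assuming NumPy) that samples a 440 Hz sine wave at 44.1 kHz and quantizes it to 16 bits, which is exactly what a CD or WAV file stores:

```python
import numpy as np

sample_rate = 44100          # samples per second, as on a CD
duration = 1.0               # seconds
frequency = 440.0            # A4 in Hz

# Sample the continuous sine wave at discrete points in time.
t = np.arange(int(sample_rate * duration)) / sample_rate
signal = np.sin(2 * np.pi * frequency * t)

# Quantize the amplitude to 16-bit integers, as a CD or WAV file would store it.
quantized = np.round(signal * 32767).astype(np.int16)
print(quantized[:10])        # the first few stored sample values
```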
Both analogue and sampled sound dig deeply into a field called Digital Signal Processing. Google around for that to get a feel of what it's like. It's often taught as a course at universities, so if you're really keen you can have a look at some lecture slides or even try some of the earlier, simpler projects.
Synthesized sound is a representation that is suited for reproduction of a music track where the instruments playing the track are known beforehand. Think of it as sheet music for the computer. Somebody has to write the sheet music -- you can't just record it like analogue or sampled sound. This makes synthesized sound a completely different representation from analogue sound and sampled sound. Also, the computer needs to know what the instruments are (e.g. piano) so that it can play (synthesize) the track. If it doesn't know the instrument, it either gives up or picks a close match (e.g. replaces the piano with an electric keyboard). I have never worked with synthesizers before so I can't comment on the learning curve for them.
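To illustrate the "sheet music for the computer" idea, here is a toy sketch (my own choice of notes, with a plain sine wave standing in for an instrument, so not a real synthesizer):

```python
import numpy as np

sample_rate = 44100

# A tiny "score": (frequency in Hz, duration in seconds) for each note.
score = [(261.63, 0.5), (293.66, 0.5), (329.63, 0.5), (261.63, 1.0)]  # C D E C

def render(score, sample_rate=44100):
    """Synthesize the score with a sine-wave 'instrument'."""
    chunks = []
    for freq, dur in score:
        t = np.arange(int(sample_rate * dur)) / sample_rate
        note = 0.5 * np.sin(2 * np.pi * freq * t)
        note *= np.linspace(1.0, 0.0, note.size)   # simple decay envelope
        chunks.append(note)
    return np.concatenate(chunks)

audio = render(score)
```

Swapping the sine for a different waveform or envelope is, in miniature, what choosing a different instrument means to a synthesizer.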
So, based on what I wrote -- pick a direction that interests you more, Google around and then refine your question.
EDIT
A good book to read is this. You can probably look around related titles in Amazon and find something newer, but it's been a while since I did my audio processing shopping.
And if you have half an hour to spare, then have a look at this video tutorial. It covers sound, image and video processing -- they're actually closely related fields.

Consider working through the book "Who Is Fourier?: A Mathematical Adventure". You could adapt the examples to make small programming assignments that demonstrate the basic concepts. After you're done you should be able to use the FFT to make a spectrogram of your voice as you pronounce the vowels a, e, i, o, u, identifying the fundamental frequency and the formants of each vowel.
I recommend learning Python and the modules NumPy, SciPy, and matplotlib (there's a ton there, so beyond the basic tutorials, just learn as you go). The IPython shell has the option "-pylab -p scipy" to automatically import the most common tools into your namespace. You can record and play audio using PyAudio. There's also Pygame, which expands on SDL (Simple DirectMedia Layer), and pyglet, which uses OpenAL (the OpenGL of audio; it does 3D audio and effects).
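As a concrete starting point for the vowel-spectrogram exercise above, here is a rough sketch using SciPy and matplotlib; the file name voice.wav is just a placeholder for a recording you would make yourself (with PyAudio, for instance):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, samples = wavfile.read("voice.wav")      # hypothetical recording of a vowel
if samples.ndim > 1:
    samples = samples.mean(axis=1)             # mix stereo down to mono

f, t, Sxx = spectrogram(samples, fs=rate, nperseg=1024)
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="auto")
plt.ylabel("Frequency [Hz]")
plt.xlabel("Time [s]")
plt.ylim(0, 4000)   # the fundamental and the first formants live well below 4 kHz
plt.show()
```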
As to C/C++, there's IT++, SPUC, and FFTW for signal processing, and SDL/SDL_mixer and OpenAL/ALmixer for interfacing with hardware and audio files.

I would recommend this book: http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=8218
(part of it is available here: http://books.google.com/books?id=nZ-TetwzVcIC&printsec=frontcover&dq=computer+musical+tutorial&hl=pt-BR&ei=D-dKTaKsBMOB8gbF4KDcDg&sa=X&oi=book_result&ct=result&resnum=1&ved=0CDgQ6AEwAA#v=onepage&q=computer%20musical%20tutorial&f=false )
Another thing you could look at is Pure Data, an open-source graphical environment for sound programming that is great for beginners. ( http://puredata.info/ )

Related

APCS final project: Converting an audio file to a simpler MIDI file

Let's say I have the audio file for Happy Birthday. I want to convert that audio file into an audio file that sounds like this: happy birthday.
First, I'd like to know if I have the ability to program this. Can a high schooler who's almost finished with APCS program this?
If I can:
How would I change the bpm of the song? I've searched through a bunch of websites, but they weren't very helpful.
I know that audio files can be represented in waveforms. How would I scan for each individual wave in an audio file (I need this to isolate the notes)?
This is a very ambitious project, actually. One reason is that it involves using digital signal processing tools like the FFT (Fast Fourier Transform) to analyze the sound and pick out the pitches. You might be able to find a library that can do this, but as far as coding such a tool yourself, that would involve a steep learning curve.
If you would like to look further into this, there is a good online resource called "The Scientist and Engineer's Guide to Digital Signal Processing". I was able to work through and understand the discrete Fourier transform with only high school math (lots of trig) and a bit of calculus. It was a lift, though.
Trying to analyze rhythm is also no easy task. Even with the advanced tools provided in professional notation systems such as Finale, people have trouble playing rhythms in time well enough for the best transcription tools. Algorithms that "quantize" the beats help, but they also limit the amount of detail that can be included in the playback.
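As a toy illustration of what quantizing beats means, the sketch below snaps slightly sloppy note onsets to a sixteenth-note grid; the tempo and onset times are made-up example values:

```python
bpm = 120.0
sixteenth = 60.0 / bpm / 4          # duration of a sixteenth note in seconds

onsets = [0.02, 0.26, 0.49, 0.77]   # detected onsets from slightly sloppy playing
quantized = [round(t / sixteenth) * sixteenth for t in onsets]
print(quantized)                     # [0.0, 0.25, 0.5, 0.75]
```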
My guess is that as interesting and worthwhile as this project would be, to bring it to completion before the semester ends would require putting together prebuilt pieces. A lot of programming is done that way, these days.
If you scale the project back to something like just getting your code to analyze a short sample of a single note and give its pitch, that would be both impressive and doable with a lot of work. It could be done with a DFT algorithm instead of requiring FFT, reducing the amount of info you'd have to acquire first. That way, you'd only have to work your way up to understanding and implementing the material on this link which is about calculating the DFT. Notice that there is example code in BASIC. The code examples throughout this book are a big help.
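If you go the scaled-back route, a rough sketch of the idea looks like this (Python rather than BASIC, and the file name single_note.wav is hypothetical). Note that the strongest bin is not always the fundamental, since a harmonic can dominate, so treat this as a starting point rather than a finished pitch detector:

```python
import numpy as np
from scipy.io import wavfile

rate, samples = wavfile.read("single_note.wav")   # hypothetical recording of one note
if samples.ndim > 1:
    samples = samples.mean(axis=1)                # mix down to mono
x = samples[:4096].astype(float)                  # analyze a short chunk
N = x.size

# Direct DFT (the O(N^2) textbook version; np.fft.rfft does the same job faster).
n = np.arange(N)
magnitudes = []
for k in range(N // 2):
    real = np.sum(x * np.cos(2 * np.pi * k * n / N))
    imag = -np.sum(x * np.sin(2 * np.pi * k * n / N))
    magnitudes.append(np.hypot(real, imag))

peak_bin = int(np.argmax(magnitudes[1:])) + 1     # skip the DC bin
print("Estimated pitch: %.1f Hz" % (peak_bin * rate / N))
```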

Can FFT be used to find drum solos/breaks in audio files?

Is it possible with FFT to find a drum solo, or a drum break, in an audio file? Is this something FFT is able to do and are there any resources online that could aid me with learning?
In general, an FFT is not a good choice for detecting the onset of percussion sounds:
An FFT is always calculated over a window of samples (in effect a period of time) and yields the magnitude of the signal within each bin and its phase offset. You can therefore determine that there is signal at a particular bin, but not its onset time. The best time resolution available is the window period. Of course, you can make the period shorter at the expense of frequency resolution: at 44.1 kHz, for example, a 1024-sample window spans about 23 ms with bins roughly 43 Hz apart, while a 256-sample window improves time resolution to about 6 ms but widens the bin spacing to roughly 172 Hz.
Percussion sounds tend to look like noise and spread across the spectrum. This would be OK if you only had percussion sounds, but it is not great in real-life polyphonic content.
However, you might be able to find some inference from the different characteristics of the spectra of a drum solo vs instrumental sections of a track.
The problem of finding the time at which percussion sounds start in music is described in academic journals as onset detection, and is one of the many techniques used for feature extraction; the wider field is known as Music Information Retrieval (MIR). Your problem sounds like one of identifying sections in audio files, and this might be described as partitioning.
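As a very rough sketch of the onset-detection idea, here is a spectral-flux approach over an STFT (one common textbook method, not the only one), assuming a recording in a hypothetical drums.wav; real MIR tools are considerably more sophisticated:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

rate, samples = wavfile.read("drums.wav")        # hypothetical recording
if samples.ndim > 1:
    samples = samples.mean(axis=1)               # mix down to mono

f, t, Z = stft(samples.astype(float), fs=rate, nperseg=1024, noverlap=768)
mag = np.abs(Z)

# Spectral flux: how much the magnitude spectrum grows from one frame to the next.
flux = np.sum(np.maximum(mag[:, 1:] - mag[:, :-1], 0.0), axis=0)
threshold = flux.mean() + 2 * flux.std()
onset_times = t[1:][flux > threshold]            # crude list of candidate onsets, in seconds
print(onset_times[:10])
```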
A good place to start is Sonic Visualiser which is a tool written specifically for MIR applications. Plug-ins exist for various types of feature extraction. From these you will be able to easily find the large body of academic work in this area. There is an added bonus that the existing plug-ins are all open source too.
I'd look here; there was a bit of discussion with great pointers on the Gamedev SE: https://gamedev.stackexchange.com/questions/9761/beat-detection-and-fft :-)

Basics of Digital Audio

I have recently started going through sound card drivers in Linux [ALSA].
Can a link or reference be suggested where I can get a good grounding in the basics of audio, like:
Sampling rate, bit size, etc.
I want to know exactly how samples are stored in audio files on a computer, and the reverse of this: how samples (numbers) are played back.
The Audacity tutorial is a good place to start; there is another introduction that covers similar ground. The PureData tutorial at flossmanuals is also a good starting point. Wikipedia is a good source once you have the basics down.
Audio is input into a computer via an analog-to-digital converter (ADC). Digital audio is output via a digital-to-analog converter (DAC).
Sample rate is the number of times per second at which the analog signal is measured and stored digitally. You can think of the sample rate as the time resolution of an audio signal. Bit size is the number of bits used to store each sample. You can think of it as analogous to the color depth of an image pixel.
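Here is a small sketch of both ideas, using only Python's standard wave module plus NumPy: samples are just numbers written at a chosen sample rate and bit size, and playback is the reverse, reading the numbers back and handing them to the DAC. The file name and tone are arbitrary choices of mine:

```python
import wave
import numpy as np

sample_rate = 44100              # samples per second
bit_size = 16                    # bits per sample -> int16 in this sketch

t = np.arange(sample_rate) / sample_rate                   # one second of time points
samples = (np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

with wave.open("tone.wav", "wb") as wav:
    wav.setnchannels(1)                                    # mono
    wav.setsampwidth(bit_size // 8)                        # 2 bytes = 16 bits per sample
    wav.setframerate(sample_rate)
    wav.writeframes(samples.tobytes())

# Reading it back gives you the stored numbers and the format metadata.
with wave.open("tone.wav", "rb") as wav:
    print(wav.getframerate(), wav.getsampwidth() * 8, wav.getnframes())
```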
David Cottle's SuperCollider book also has a great introduction to digital audio.
I was in the same situation, and certainly this kind of information is out there but you need to do some research first. This is what I have found:
Digital Audio processing is a branch of DSP (Digital Signal Processing).
DSP is one of the most powerful technologies that will shape science and engineering in the twenty-first century. Revolutionary changes have already been made in a broad range of fields: communications, medical imaging, radar & sonar, high fidelity music reproduction, and oil prospecting, to name just a few. Each of these areas has developed a deep DSP technology, with its own algorithms, mathematics, and specialized techniques…
This quote was taken from a very helpful guide that covers every topic in depth, called “The Scientist and Engineer's Guide to Digital Signal Processing”. And though you are not asking about DSP specifically, there's a chapter that covers all digital-audio-related topics with a very good explanation.
You can find it in Chapter 22 - Audio Processing, which covers all of these topics:
Human Hearing: how sound is perceived by our ears; this is the basis of how sound is then generated artificially.
Timbre: explains the properties of sound, like loudness, pitch and timbre.
Sound Quality vs. Data Rate: once you know the previous concepts, we start to translate them to the electronic side.
High Fidelity Audio: gives you a picture of how sound is then processed digitally.
Companding: here you can find how sound is processed and compressed for telecommunications.
Speech Synthesis and Recognition: more processes applied to the sound, like filters, synthesis, etc.
Nonlinear Audio Processing: this is more advanced but understandable, covering sound treatment and other topics.
The guide explains the basics of sound in the real world, in case you want to take a look, and then explains how sound is processed in the computer, including exactly what you are asking about.
There are also more specific topics on Wikipedia, such as the “Digital audio” page, which explains every detail of this topic. That page can be used as a reference for further research; right at the beginning you can find links to sample rate, sound waves, digital formats, standards, bit depth, telecommunications, etc. There are a few things you might need to study further, like the Nyquist-Shannon sampling theorem, Fourier transforms, and complex numbers, but these only come up in very specific and advanced topics that you might not need. I mention them just in case you are interested. You can find information in both the DSP guide and Wikipedia, although you will need to study some math.
I’ve been using Python to develop and study these subjects with code, since it has a lot of useful libraries, like NumPy, sounddevice, SciPy, etc. Then you can start playing with sound. On YouTube you can find lots of videos that guide you through this; I’ve found synthesis, filters, voice recognition, and you can create WAV files with just code, which is great. I’ve also seen projects in C/C++, JavaScript, and other languages, so it can help you keep learning and coding fun things.
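For instance, here is a rough sketch of one such experiment, a simple low-pass filter with SciPy; the cutoff frequency and file names are arbitrary choices of mine:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, lfilter

rate, samples = wavfile.read("input.wav")          # hypothetical recording
if samples.ndim > 1:
    samples = samples.mean(axis=1)                 # mix down to mono

b, a = butter(4, 1000, btype="low", fs=rate)       # 4th-order low-pass at 1 kHz
filtered = lfilter(b, a, samples.astype(float))

# Clip back into 16-bit range and write the muffled result out.
wavfile.write("filtered.wav", rate, np.clip(filtered, -32768, 32767).astype(np.int16))
```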
There are a few other references across the internet, but you need to know what you are looking for. This book and the Wikipedia page would be the best starting points for me, since they give you the basics and explain every topic in depth. Then, depending on the goal you want to achieve, you can start looking for more information.

Ways to identify (musical) scores [closed]

I'm searching for ways to identify scores when someone is playing, e.g. guitar. How can I manage that?
I've heard that midi stores music data as musical scores. I wonder if it's a good solution.
MIDI does store musical scores, but it doesn't (normally) extract them from recorded sounds. You can't take an mp3 file and "convert it to MIDI", in a standard or entirely reliable way.
You create a MIDI file using a recorder (or "sequencer"), which might be a desktop application where you "write the score" like a composer does, or it might be a musical device like a keyboard, which records which keys you press, how hard and for how long, and interprets that as a score.
A MIDI player takes the data/score, and reproduces it using its own voice (or "sound font" if you like). So the advantage of MIDI data is firstly that the voice is already available on the playback device (and so the data is very compact), and second that the same data ("tune") can be played using different voices ("instruments")[*].
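To make the data/score distinction concrete, here is a minimal sketch that writes a one-note "score" with the third-party mido library (my choice of tool, purely as an example); any MIDI player will then render it with whatever piano voice it happens to have:

```python
import mido

mid = mido.MidiFile()                       # defaults to 480 ticks per beat
track = mido.MidiTrack()
mid.tracks.append(track)

track.append(mido.Message("program_change", program=0, time=0))        # voice 0: acoustic piano
track.append(mido.Message("note_on", note=60, velocity=64, time=0))    # middle C pressed
track.append(mido.Message("note_off", note=60, velocity=64, time=480)) # released one beat later
mid.save("one_note.mid")                    # a tiny file: the score, not the sound
```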
I believe there are MIDI guitars, but I don't know how "good" they are. The tone of an electric guitar comes in part from resonances of the solid body. This could of course be imitated by the voice at playback time, but there are bound to be some things that you can do with an electric guitar but which the MIDI format cannot capture or represent (for example I'd guess feedback is impossible).
Software exists to extract MIDI data from recorded sound - this is a bit like the way OCR extracts ASCII character data from images of text. It's not a major means of recording someone's guitar-playing, but if what you want is to get a first approximation to the score/tabs, you could try it.
Here's a randomly-selected example, found by Googling "convert from wav to MIDI":
http://www.pluto.dti.ne.jp/~araki/amazingmidi/
[*] But members of the audience, you find yourselves wondering, "what is this mindless automaton which bangs out the tunes it's instructed to, without comprehension or any aesthetic sense". Ladies and gentlemen, Colin Sell at the piano.
Music recognition/retrieval is an extremely difficult and almost entirely unsolved AI problem. Try to extract the frequency from a signal file of someone playing a single unwavering note one time - it's much more difficult than just "apply Fourier transform, read off solution". Compound that with polyphony, noise, rubato, vibrato/portamento, plus the fact that (contrary to speech recognition) we don't even have a working a-priori model of what music actually is, and you begin to see the difficulty. There are absolutely fascinating research papers and even entire conferences on the topic, but in the short term, you're just plain out of luck.
Are you aware you are attempting something extremely difficult? It's a very complex topic; you could spend years researching it yourself, or pay $$$$$$ for existing commercial solutions.
MIDI is a reasonable choice for your output format.
For the rest you will need Fast Fourier Transforms working off a high-resolution capture of the input analogue sounds plus at least seven years of musical theory.
Good luck.
If the player is playing in tune, there will be very distinct frequencies in the signal, or at least frequencies with a mathematical separation. It may be possible to characterise a signal using spectral analysis to distinguish music from noise; or at least melodic music from noise - avant-garde experimental music may not pass ;). The distinction may become more difficult with multiple instrumentalists, percussion, and non-standard or poor tuning; traditional Chinese or Indian music, for example, uses different scales than Western music.
Extracting the frequencies in the signal will require signal processing techniques such as the Fast Fourier Transform. Categorising the signal as music/not music could be done by statistical analysis, or with AI techniques such as neural networks or fuzzy logic.
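One crude way to turn the "distinct frequencies vs. noise" idea into code is spectral flatness, the ratio of the geometric mean to the arithmetic mean of the power spectrum: tonal signals score close to 0, while noise-like signals score much higher. A minimal sketch, assuming NumPy:

```python
import numpy as np

def spectral_flatness(frame):
    """Return a value near 0 for tonal signals, and much larger for noise-like ones."""
    power = np.abs(np.fft.rfft(frame * np.hanning(frame.size))) ** 2 + 1e-12
    return np.exp(np.mean(np.log(power))) / np.mean(power)

rate = 44100
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 440 * t)              # an in-tune note
noise = np.random.randn(rate)                   # broadband noise

print(spectral_flatness(tone[:4096]))           # very small: energy concentrated in one bin
print(spectral_flatness(noise[:4096]))          # much larger: energy spread across the spectrum
```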

3D Audio Engine

Despite all the advances in 3D graphic engines, it strikes me as odd that the same level of attention hasn't been given to audio. Modern games do real-time rendering of 3D scenes, yet we still get more-or-less pre-canned audio accompanying those scenes.
Imagine - if you will - a 3D engine that models not just the physical appearance of items, but also their audio properties. And from these models it can dynamically generate audio based on the materials that come into contact, their velocity, distance from your virtual ears, etcetera. Now, when you're crouching behind the sandbags with bullets flying over your head, each one will yield a unique and realistic sound.
The obvious application of such a technology would be gaming, but I'm sure there are many other possibilities.
Is such a technology being actively developed? Does anyone know of any projects that attempt to achieve this?
Thanks,
Kent
I once did some research toward improving OpenAL, and the problem with simulating 3D audio is that so many of the cues that your mind uses — the slightly different attenuation at various angles, the frequency difference between sounds in front of you and those behind you — are quite specific to your own head and are not quite the same for anyone else!
If you want, say, a pair of headphones to really make it sound like a creature is in the leaves ahead and in front of the character in a game, then you actually have to take that player into a studio, measure how their own particular ears and head change the amplitude and phase of the sound at different distances (amplitude and phase are different, and are both quite important to the way your brain processes sound direction), and then teach the game to attenuate and phase-shift the sounds for that particular player.
There do exist "standard heads" that have been mocked up with plastic and used to get generic frequency-response curves for the various directions around the head, but an average or standard will never sound quite right to most players.
Thus the current technology is basically to sell the player five cheap speakers, have them place them around their desk, and then the sounds — while not particularly well reproduced — actually do sound like they're coming from behind or beside the player because, well, they are coming from the speaker behind the player. :-)
But some games do bother to be careful to compute how sound would be muffled and attenuated through walls and doors (which can get difficult to simulate, because the ear receives the same sound at a few milliseconds different delay through various materials and reflective surfaces in the environment, all of which would have to be included if things were to sound realistic). They tend to keep their libraries under wraps, however, so public reference implementations like OpenAL tend to be pretty primitive.
Edit: here is a link to an online data set that I found at the time, that could be used as a starting point for creating a more realistic OpenAL sound field, from MIT:
http://sound.media.mit.edu/resources/KEMAR.html
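As a rough illustration of how such measurements are used, binaural rendering boils down to convolving a mono source with the left- and right-ear impulse responses measured for a given direction. A sketch assuming you have already extracted one KEMAR measurement pair into hrir_left.wav and hrir_right.wav (hypothetical file names, and all files are assumed to share the same sample rate):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

rate, mono = wavfile.read("footsteps.wav")            # hypothetical source sound
if mono.ndim > 1:
    mono = mono.mean(axis=1)                          # mix down to mono first
_, hrir_l = wavfile.read("hrir_left.wav")             # impulse responses for one direction,
_, hrir_r = wavfile.read("hrir_right.wav")            # e.g. taken from the KEMAR data set

left = fftconvolve(mono.astype(float), hrir_l.astype(float))
right = fftconvolve(mono.astype(float), hrir_r.astype(float))

binaural = np.stack([left, right], axis=1)
binaural = (binaural / np.max(np.abs(binaural)) * 32767).astype(np.int16)
wavfile.write("binaural.wav", rate, binaural)         # listen on headphones
```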
Enjoy! :-)
Aureal did this back in 1998. I still have one of their cards, although I'd need Windows 98 to run it.
Imagine ray-tracing, but with audio. A game using the Aureal API would provide geometric environment information (e.g. a 3D map) and the audio card would ray-trace sound. It was exactly like hearing real things in the world around you. You could focus your eyes on the sound sources and attend to given sources in a noisy environment.
As I understand it, Creative destroyed Aureal by means of legal expenses in a series of patent infringement claims (which were all rejected).
In the public domain world, OpenAL exists - an audio version of OpenGL. I think development stopped a long time ago. They had a very simple 3D audio approach, no geometry - no better than EAX in software.
EAX 4.0 (and I think there is a later version?) has finally, after a decade, incorporated some of the geometric ray-tracing approach Aureal used (Creative bought up their IP after they folded).
The Source (Half-Life 2) engine on the SoundBlaster X-Fi already does this.
It really is something to hear. You can definitely hear the difference between an echo against concrete vs wood vs glass, etc...
A little-known side area is VoIP. While games have actively developed audio software, you are also likely to spend time talking to others while you are gaming.
Mumble ( http://mumble.sourceforge.net/ ) is software that uses plugins to determine who is in-game with you. It will then position their audio in a 360-degree field around you, so someone on your left sounds like they are on your left, and someone behind you sounds like they are behind you. This makes a creepily realistic addition, and while trying it out it led to funny games of "Marco Polo".
Audio took a massive step back with Vista, where hardware was no longer allowed to accelerate it. This killed EAX as it was in the XP days. Software wrappers are gradually being built now.
Very interesting field indeed. So interesting that I'm going to do my master's degree thesis on this subject; in particular, its use in first-person shooters.
My literature research so far has made it clear that this particular field has little theoretical background. Not a lot of research has been done in this field, and most theory is based on movie-audio theory.
As for practical applications, I haven't found any so far. Of course, there are plenty of titles and packages which support real-time audio-effect processing and apply it depending on the general surroundings of the auditor, e.g. the auditor enters a hall, so an echo/reverb effect is applied to the sound samples. This is rather crude. An analogy for visuals would be to subtract 20% of the RGB value of the entire image when someone turns off (or shoots ;) ) one of five lightbulbs in the room. It's a start, but not very realistic at all.
The best work I found was a (2007) PhD thesis by Mark Nicholas Grimshaw, University of Waikato, called The Acoustic Ecology of the First-Person Shooter.
This huge paper proposes a theoretical setup for such an engine, as well as formulating a wealth of taxonomies and terms for analysing game audio. He also argues that the importance of audio for first-person shooters is greatly overlooked, as audio is a powerful force for immersion in the game world.
Just think about it. Imagine playing a game on a monitor with no sound but picture-perfect graphics. Next, imagine hearing realistic game sounds all around you while closing your eyes. The latter will give you a much greater sense of 'being there'.
So why haven't game developers dived into this wholeheartedly already? I think the answer to that is clear: it's much harder to sell. Improved images are easy to sell: you just show a picture or movie and it's easy to see how much prettier it is. It's even easily quantifiable (e.g. more pixels = better picture). For sound it's not so easy. Realism in sound is much more subconscious, and therefore harder to market.
The effects the real world has on sounds are subconsciously perceived. Most people never even notice most of them. Some of these effects cannot even be consciously heard. Still, they all play a part in the perceived realism of the sound. There is an easy experiment you can do yourself which illustrates this. Next time you're walking on the sidewalk, listen carefully to the background sounds of the environment: wind blowing through leaves, all the cars on distant roads, etc. Then, listen to how this sound changes when you walk nearer to or further from a wall, or when you walk under an overhanging balcony, or even when you pass an open door. Do it, listen carefully, and you'll notice a big difference in sound. Probably much bigger than you ever remembered.
In a game world, these types of changes aren't reflected. And even though you don't (yet) consciously miss them, you subconsciously do, and this will have a negative effect on your level of immersion.
So, how good does audio have to be in comparison to the image? More practically: which physical effects in the real world contribute the most to perceived realism? Does this perceived realism depend on the sound and/or the situation? These are the questions I wish to answer with my research. After that, my idea is to design a practical framework for an audio engine which could variably apply some effects to some or all game audio, depending (dynamically) on the amount of available computing power. Yup, I'm setting the bar pretty high :)
I'll be starting in September 2009. If anyone's interested, I'm thinking about setting up a blog to share my progress and findings.
Janne Louw
(BSc Computer Sciences Universiteit Leiden, The Netherlands)

Resources