Reading and understanding a raw audio file (specifically MP3) - audio

I am trying to understand what the raw data from an audio file looks like and how to get that data. I want to take the data and analyze it and see if I am able to make a program that can recognize patterns in a song such as in a hip hop song, finding the same beat in a chorus. In my head I think this could be a doable task if the data is in an integer form.
I've looked up many tutorials for this, but they all either use other libraries or don't explain it in a way I understand (more than likely the source of my issue).
I am wondering if there is someone out there that can help me understand a few things.
1). In an MP3 file, what is actually being stored in the file? Is it an integer that tells the radio/amp/audio player a frequency, another integer for amplitude, etc.? (Oversimplified, because I don't know what other data is stored in an audio file.)
2). If it is stored in an integer format, is there a way to read the integers and analyze them? If it is not stored in an integer format, how is it stored, and is there a way to convert it to an integer format?
3). In visual representations of an audio file like this one, it seems clearer what is what. It seems like the frequency is where on the circle the audio is represented, and the amplitude is how high it jumps. Is this right? Or does it just appear that way and I am understanding it incorrectly?
4). Is this task harder than I think it is? Considering I haven't found any good explanations or tutorials on how to do so, I am skeptical about how easy this would be.
(Sorry if this was poorly phrased, first question on stack and I am just illiterate :^)

Related

How to decrease pitch of audio file in nodejs server side?

I have a .MP3 file stored on my server, and I'd like to modify it to be a bit lower in pitch. I know this can be achieved by increasing the length of the audio, however, I don't know of any libraries in node that can do this.
I've tried using the node web audio api and soundbank-pitch-shift, but the former doesn't seem to be capable of pitch shifting (AFAIK), and the latter seems designed for client-side use.
I need the solution within the realm of node ONLY- that means no external programs, etc., and it needs to be automated as well, so I can't manually pitch shift.
An ideal solution would be a function that takes a file/filepath as an input, and then creates (or overwrites) another MP3 file but with the pitch shifted by x amount, but really, any solution that produces something with a lower pitch than the original, works.
I'm totally lost here. Please help.
An audio file is basically a list of numbers. Those numbers are read one at a time at a particular speed called the 'sample rate', which is the number of audio samples read every second. For example, if an audio file's sample rate is 44100, then 44100 samples (or numbers) are read every second.
If you are with me so far, the simplest way to lower the pitch of an audio file is to play the file back at a lower sample rate than the one it was recorded at. In most cases you won't be able to do this (the playback rate is normally fixed), so you need to achieve the same effect by resampling the file, i.e. adding new samples in between the old samples to make it literally longer. For this you would need to understand interpolation.
The drawback to this technique in either case is that the sound will also play back at a slower speed, as well as at a lower pitch. If it is a problem that the sound has slowed down as well as lowered in pitch as a result of your processing, then you will also have to use a timestretching algorithm to fix the playback speed.
You may also have problems doing this using MP3 files. MP3 data is compressed, so you will have to decode it before you can operate on it in a way that changes the pitch. WAV files are better suited to audio processing. In any case, you essentially need to turn the file into a list of floating point numbers, and change those numbers so that they are effectively read back at a slower rate.
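As a rough illustration of the resampling idea above, here is a minimal Python sketch that stretches a list of already-decoded samples with linear interpolation. The function name stretch_linear and the factor parameter are made up for this example, and linear interpolation is only the simplest choice; a real resampler would also filter the signal.

```python
def stretch_linear(samples, factor):
    """Return `samples` stretched by `factor` using linear interpolation.

    A factor greater than 1 makes the list longer, so playing it back at
    the original sample rate sounds lower in pitch (and slower).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(samples) < 2:
        return [float(s) for s in samples]

    out_len = int(round(len(samples) * factor))
    out = []
    for i in range(out_len):
        pos = i / factor              # fractional position in the original list
        left = int(pos)
        if left >= len(samples) - 1:  # past the last pair: hold the final sample
            out.append(float(samples[-1]))
            continue
        frac = pos - left
        # Weighted average of the two neighbouring original samples.
        out.append(samples[left] * (1.0 - frac) + samples[left + 1] * frac)
    return out
```

To lower the pitch by n semitones, the factor would be 2 ** (n / 12), so a factor of 2.0 drops the sound by one octave and doubles its length.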
Other methods of pitch shifting would probably involve FFTs, and would be a more complicated affair, to say the least.
I am not familiar with nodejs I'm afraid.
I managed to get it working with help from Ollie M's answer and node-lame.
I hadn't known previously that sample rate could affect the speed, but thanks to Ollie, suddenly this problem became a lot simpler.
Using node-lame, all I did was take one of the examples (mp32wav.js) and change the sampleRate parameter of the format object so that it is lower than the base sample rate, which in my application was always a static 24,000. I could also make it dynamic, since node-lame can grab the parameters of the input file in the format object.
Ollie, however, perfectly describes the drawback of this method:
The drawback to this technique in either case is that the sound will
also play back at a slower speed, as well as at a lower pitch. If it
is a problem that the sound has slowed down as well as lowered in
pitch as a result of your processing, then you will also have to use a
timestretching algorithm to fix the playback speed.
I don't have a particular need to implement a time stretching algorithm at the moment (thankfully, because that's a whole other can of worms), since I have the ability to change the initial speed of the file, but others may in the future.
See https://www.npmjs.com/package/audio-decode, https://github.com/audiojs/audio-buffer, and related linked at bottom of audio-buffer readme.

Realtime Sound Routing...Trigger a Sound with Another Sound

I'm looking for a program that is able to recognize individual audio samples from my computer and reroute them to trigger WAV files from a library. It would need to work in real time, since noticeable latency is not acceptable in my project. I tried using dictation software that would recognize words to trigger opening a file, and that's the direction I want to go in, but instead of words I want it to recognize sounds, and it has to happen in real time. I'm not sure where to go and am just looking for some guidance. Does anyone have any suggestions of what I should do?
That's a fairly broad question, but I can tell you how I would do it. (Hardly the only way, but where I would start.)
If you're looking for real time input, the Java Sound library (excellent tutorial here) allows for that. (Just note that microphone input from a web page is difficult on anything, due to major security concerns, so this would be a desktop application.)
If it needs to be real time, the first thing I would suggest is stream and multithread the hell out of it. I would suggest the Java 8 Stream API, but since you're looking for subsamples that match a specific pattern, then each data point will have to be aware of the state of its neighbors, and that isn't easy with streams.
You will probably want to know if a sound roughly resembles an audio profile, so for that, I would pick a tolerance on just how close you want it to be for a match (remembering that samples may not line up 100% anyway, so "exact" is not an option), and then look up Hidden Markov Models. I suggest these because they're what voice recognition software typically uses, and while your sounds may not be voices, it will give you an idea of what has already been done.
You'll also want to maintain a limited list of audio samples in memory. Specifically, you will likely need the most recent data, because an audio signal is a time-variant signal, and you can't get a match from just one point. I wouldn't make it much longer than the longest sample you're looking to recognize, as audio takes up a boatload of memory.
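To make the "limited list of recent samples" and the tolerance idea concrete, here is a small Python sketch. It keeps only the newest window of samples and compares it to a reference sound using plain cosine similarity, which is a much cruder stand-in for a Hidden Markov Model. The names RecentAudio, similarity, and the 0.9 default tolerance are all assumptions made for illustration.

```python
import math
from collections import deque


def similarity(a, b):
    """Cosine similarity of two equal-length sample sequences, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class RecentAudio:
    """Holds only the most recent samples, as many as the template is long."""

    def __init__(self, template, tolerance=0.9):
        self.template = list(template)
        self.tolerance = tolerance
        # A deque with maxlen silently drops the oldest samples when full.
        self.buf = deque(maxlen=len(self.template))

    def push(self, block):
        """Append new samples; return True if the newest window matches."""
        self.buf.extend(block)
        if len(self.buf) < len(self.template):
            return False  # not enough history yet
        return similarity(self.buf, self.template) >= self.tolerance
```

Note that this only checks for a match once per pushed block, so a sound that straddles a block boundary at the wrong offset would be missed. A real detector would slide the window sample by sample or work on spectral features instead.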
Lastly (for audio), I would recommend picking a standard format for comparison. Make it as good as gets you decent results, and start high. You will want to convert everything to that format before you compare it.
Once you recognize a specific sound, it's basically a Command Pattern. Specific sounds can be mapped, even with a java.util.HashMap, to specific files, which (if there are few enough) you might even have pre-loaded.
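The mapping step can be sketched in a few lines. This example is in Python rather than Java, with a dict playing the role of java.util.HashMap. The names make_dispatcher, sound_to_file, and play are hypothetical, and play stands for whatever function actually plays a WAV file.

```python
def make_dispatcher(sound_to_file, play):
    """Return a callback that plays the file mapped to a recognized sound."""

    def on_recognized(sound_id):
        path = sound_to_file.get(sound_id)
        if path is None:
            return False  # unknown sound: nothing to trigger
        play(path)
        return True

    return on_recognized
```

The recognizer then only has to call the returned function with an identifier for the sound it detected, and it never needs to know which files exist.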
Lastly, it's worth looking at the Java Speech API. It's not part of the JDK and it's quite dated, but you might get some good advice from its implementation.
This is of course the advice of a Java-preferring programmer, but I imagine that there might be some decent libraries in Python and Ruby to help you as well; and of course there's something in C somewhere. This may sound like a lot, but most of the material is already implemented and ready-to-go.
Hopefully this helps, let's look forward to other answers.

Sampling WAV files to get the amplitude at a specific time

I am wondering if there is any way to cycle through a .wav file to get the amplitude/dB at a specific point in the file. I am reading it into a byte array now, but that is no help to me from what I can see.
I am using this in conjunction with some hardware I have developed that encodes light data into binary and outputs audio. I won't get into the details, but I need to be able to do this in C# or C++. I can't find any info on this anywhere. I have never programmed anything relating to audio, so excuse me if this is a very easy thing.
I don't have anything started, since this is the starting point, so if anybody can point me to some functions, libraries, or methods for collecting the amplitude of the wave at a specific time in the file, I would greatly appreciate it.
I hope this is enough info, and thank you in advance if you are kind enough to help.
It is possible, and it is done in a straightforward way: a file with PCM audio contains one value per channel for every 1/sample-rate of a second.
The format of those values varies, however: 8-bit integers, 16-bit integers, or single-precision floating-point values. You have to take this into account, which is why you cannot use the bytes from the byte array directly.
The .WAV file also has a header preceding the actual payload.
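For illustration only (the question asks for C# or C++, but the steps are the same in any language), here is a Python sketch using the standard wave and struct modules. It assumes 16-bit signed PCM and rejects anything else. The function name amplitude_at is made up for this example.

```python
import math
import struct
import wave


def amplitude_at(wav_file, seconds):
    """Return (raw values, dBFS values), one entry per channel, at a time offset.

    `wav_file` is a path or a binary file object. Only 16-bit PCM is handled.
    """
    with wave.open(wav_file, "rb") as wav:  # the wave module skips the header
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frame_index = int(seconds * wav.getframerate())
        if not 0 <= frame_index < wav.getnframes():
            raise ValueError("time is outside the file")
        wav.setpos(frame_index)
        raw = wav.readframes(1)  # one frame = one sample per channel
        channels = wav.getnchannels()

    # 16-bit WAV samples are little-endian signed integers.
    values = struct.unpack("<%dh" % channels, raw)
    # dB relative to full scale; silence has no finite dB value.
    db = [20 * math.log10(abs(v) / 32768) if v else float("-inf") for v in values]
    return values, db
```

A single sample is an instantaneous value, so for a meaningful loudness reading you would normally average (for example, take the RMS) over a short window of frames around the time of interest.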

Estimating the time-position in an audio using data?

I am wondering how to estimate my current position in time within an audio stream, using the data itself.
For example, I read data in byte[8192] blocks. How can I know how much time one byte[8192] block is equivalent to?
If this is some sort of raw-ish encoding, like PCM, this is simple. The length in time is a function of the sample rate, bit depth, and number of channels. 30 seconds of 16-bit audio at 44.1kHz in mono is about 2.6MB (30 x 44,100 x 2 = 2,646,000 bytes). However, you also need to factor in headers and container format crapola. WAV files for example can have a lot of other stuff in them.
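The arithmetic for uncompressed PCM can be written out as a short Python sketch. The function names are made up for this example, and header_bytes is whatever the container puts in front of the audio data (a parameter here because it varies from file to file).

```python
def bytes_per_second(sample_rate, bits_per_sample, channels):
    """Number of PCM bytes consumed per second of audio."""
    return sample_rate * (bits_per_sample // 8) * channels


def block_duration(num_bytes, sample_rate, bits_per_sample, channels):
    """Seconds of audio contained in `num_bytes` of raw PCM data."""
    return num_bytes / bytes_per_second(sample_rate, bits_per_sample, channels)


def position_seconds(byte_offset, sample_rate, bits_per_sample, channels,
                     header_bytes=0):
    """Playback time corresponding to an absolute byte offset in the file."""
    audio_bytes = byte_offset - header_bytes
    return audio_bytes / bytes_per_second(sample_rate, bits_per_sample, channels)
```

For example, block_duration(8192, 44100, 16, 2) comes to roughly 0.046 seconds, so each byte[8192] block of 16-bit stereo audio at 44.1kHz is about 46 ms long.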
Compressed formats are much more tricky. You can never be sure where you are without playing through the file to get to where you are. Of course you can always guesstimate based on the percentage of the file length, if that is good enough for your case.
I think this is not what he was asking.
First you have to tell us what kind of data you are using. WAV? MP3? Usually without knowing where that block came from - so you know if you have some kind of frame information and where to find it - you are not able to determine that block's position.
If you have the full stream and this data then you can do a search

How to mix audio samples?

My question is not completely programming-related, but nevertheless I think SO is the right place to ask.
In my program I generate some audio data and save the track to a WAV file. Everything works fine with one sound generator. But now I want to add more generators and mix the generated audio data into one file. Unfortunately it is more complicated than it seems at first sight.
Moreover I didn't find much useful information on how to mix a set of audio samples.
So is there anyone who can give me advice?
edit:
I'm programming in C++. But it doesn't matter, since I was interested in the theory behind mixing two audio tracks. The problem I have is that I cannot just sum up the samples, because this often produces distorted sound.
I assume your problem is that for every audio source you're adding in, you're having to lower the levels.
If the app gives control to a user, just let them control the levels directly. Hotness is their responsibility, not yours. This is "summing."
If the mixing is automated, you're about to go on a journey. You'll probably need compression, if not limiting. (Limiting is an extreme version of compression.)
Note that anything you do to the audio (including compression and limiting) is a form of distortion, so you WILL have coloration of the audio. Your choice of compression and limiting algorithms will affect the sound.
Since you're not generating the audio in real time, you have the possibility of doing "brick wall" limiting. That's because you have foreknowledge of the levels. Realtime limiting is more limited because you can't know what's coming up--you have to be reactive.
Is this music, sound effects, voices, what?
Programmers here deal with this all the time.
Mixing audio samples means adding them together, that's all. Typically you do add them into a larger data type so that you can detect overflow and clamp the values before casting back into your destination buffer. If you know beforehand that you will have overflow then you can scale their amplitudes prior to addition - simply multiply by a floating point value between 0 and 1, again keeping in mind the issue of precision, perhaps converting to a larger data type first.
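A minimal Python sketch of that add, scale, and clamp approach for 16-bit samples follows. The name mix_int16 and the gain parameter are made up for this example. Python integers do not overflow, so the "larger data type" step is implicit here; in C++ you would accumulate into an int32_t or a float before clamping.

```python
def mix_int16(tracks, gain=1.0):
    """Sum equal-length tracks of 16-bit samples, scale, and clamp.

    `tracks` is a list of sample lists. `gain` (between 0 and 1) lowers
    the level before clamping to reduce how often clipping happens.
    """
    mixed = []
    for frame in zip(*tracks):            # one sample from each track
        total = sum(frame) * gain         # wide accumulator, no overflow
        value = int(round(total))
        # Clamp to the signed 16-bit range before storing.
        mixed.append(max(-32768, min(32767, value)))
    return mixed
```

Clamping avoids integer wrap-around, which sounds far worse than clipping, but clipped peaks are still distortion, so choosing a gain that keeps the sum in range is preferable when the levels are known in advance.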
If you have a specific problem that is not addressed by this, feel free to update your original question.
A quick and dirty mix of two samples a and b, assuming both are normalized to the range -1 to 1:
mix = (a + b) - a * b * sign(a + b)
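Written out as Python, with the assumption that a and b are floats in the range -1 to 1 (the function name dirty_mix is made up for this example):

```python
def dirty_mix(a, b):
    """Mix two normalized samples without leaving the range -1..1."""
    s = a + b
    sign = (s > 0) - (s < 0)   # 1, 0, or -1
    return s - a * b * sign
```

For two positive samples this equals 1 - (1 - a) * (1 - b), which is why the result never exceeds 1. The price is that the mix is no longer a plain sum, so it colors the sound.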
You never said what programming language and platform, however for now I'll assume Windows using C#.
http://www.codeplex.com/naudio
Great open source library that really covers off lots of the stuff you'd encounter during most audio operations.

Resources