How to detect average pitch in a time-formatted sound wave?

For a sound wave input like the ones generated by OpenAL microphone capture, how can one go about detecting the "average" pitch in the wave? (E.g. if it's a recording of a violin at 440 Hz, I want to return ~440 Hz.) What's the most basic/intuitive way? Is there a reason to use a different method?
Thanks

Related

Algorithm to deal with Audio click/pop sounds

I am making a sound engine where I can play and stop sound. My issue is that if a user wants to stop the sound, I stop it immediately, i.e. I send 0 as the PCM value. This produces a very annoying pop/click, because the PCM value drops from, let's say, 0.7 to 0 instantly.
Here is a discussion about this.
I am looking for an algorithm or a way to deal with these audio clicks/pops. What is the best practice for dealing with audio clicks? Is there a universal way to go about this? I am very new to audio DSP and I could not find a good answer for this.

When you cut off the sound abruptly, you are multiplying it by a step-shaped signal.
When you multiply two signals together in the time domain, you convolve their spectra in the frequency domain. A step shape has energy at all frequencies, so the multiplication will spread the energy from the sound over all frequencies, making an audible pop.
Instead, you want to fade the sound out over 30ms or so -- that is still very fast, and will sound like an abrupt stop, but there will be no audible pop.
You should use a curve shaped like 1 - t^2 (with t running from 0 to 1 over the fade) to modulate the volume, or something else without significant high-frequency components. That way, when it is convolved with the original sound in the frequency domain, it won't spread noticeable energy into new frequencies.
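A minimal Python sketch of the fade-out described above, assuming the sound is held as a list of float samples. The function name `fade_out` and its parameters are hypothetical, chosen only for illustration, and the 30 ms default is simply the figure suggested in this answer.

```python
def fade_out(samples, sample_rate, fade_ms=30.0):
    """Return a copy of samples whose tail fades to silence with a 1 - t^2 gain curve."""
    # Number of samples covered by the fade, capped at the length of the signal.
    n_fade = min(len(samples), int(sample_rate * fade_ms / 1000.0))
    out = list(samples)
    start = len(out) - n_fade
    for i in range(n_fade):
        t = (i + 1) / n_fade           # runs from just above 0 up to exactly 1
        out[start + i] *= 1.0 - t * t  # gain falls smoothly from about 1 to 0
    return out
```

In a real-time engine the same gain curve would be applied to the samples still to be played after the stop request, rather than to the end of a stored buffer.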

Audio clip sounds different after importing to AE (audio is a hearing test 20khz to 20hz)

I'm trying to create a quick video in After Effects on audible frequencies.
I'm using this audio clip (20 seconds). The clip starts from 20khz and goes down to 20hz.
As you can see (or hear), we can't hear anything in the beginning at 20khz. Most of us start hearing the frequency at 16khz-15khz.
But when I import this audio clip into AE, it sounds completely different. The sound starts playing from the beginning and it's very loud and sounds nothing like the clip I downloaded.
Here's how the audio clip sounds after export: https://www.mboxdrive.com/soundonly.mp3
What's going on here and how do I fix it?

Two things here.
First off, the sound is exhibiting "aliasing" artifacts. This happens when the sound has pitch content that is higher in frequency than the Nyquist frequency for the given sample rate. So, either you are using a low sample rate, or the pitch you are generating has harmonics that are causing the aliasing.
Check to make sure you are using a sine wave (if done correctly there should be no additional harmonic content to the tone besides the fundamental pitch), and that your sample rate is above 40 kHz in order to play a 20 kHz sound without aliasing. The most common sampling rate these days is 44100 fps, but you may be using 8000 fps, which is also still employed and would not work for your application.
Second point, you probably want to change the rate at which you travel through the range of pitches. It sounds like you are going along linearly, but the ear hears things exponentially. The difference in pitch from 100 to 200, for example, is the same (in terms of our perception) as the difference from 1000 to 2000. So you might want to make your rate of descent reflect this if the goal is to spend equal time at every perceived pitch level.
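To illustrate the second point, here is a small Python sketch of an exponential sweep, in which the frequency is multiplied by a constant factor per unit of time so that every octave gets the same share of the clip. The function names and default values are assumptions made for this example (20 kHz down to 20 Hz, as in the question), not something taken from the original clip.

```python
import math

def sweep_frequency(t, duration, f_start=20000.0, f_end=20.0):
    """Frequency in Hz at time t of a sweep that spends equal time in every octave."""
    return f_start * (f_end / f_start) ** (t / duration)

def sweep_samples(duration, sample_rate=44100, f_start=20000.0, f_end=20.0):
    """Render the sweep as floats in [-1, 1] by accumulating phase sample by sample."""
    phase = 0.0
    out = []
    for n in range(int(duration * sample_rate)):
        f = sweep_frequency(n / sample_rate, duration, f_start, f_end)
        out.append(math.sin(phase))
        phase += 2.0 * math.pi * f / sample_rate  # advance by one sample at frequency f
    return out
```

Halfway through a 20-second sweep this gives the geometric mean of the two endpoints (about 632 Hz), whereas a linear sweep would still be near 10 kHz at that point.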

'Mono' FFT Visualization of a Stereo Analog Audio Source

I have created a really basic FFT visualizer using a Teensy microcontroller, a display panel, and a pair of headphone jacks. I used kosme's FFT library for Arduino: https://github.com/kosme/arduinoFFT
Analog audio flows into the headphone input and to a junction where the microcontroller samples it. That junction is also connected to an audio out jack so that audio can be passed to some speakers.
This is all fine and good, but currently I'm only sampling the left audio channel. Any time music is stereo separated, the visualization cannot account for any sound on the right channel. I want to rectify this but I'm not sure whether I should start with hardware or software.
Is there a circuit I should build to mix the left and right audio channels? I figure I could do something like so:
But I'm pretty sure that my schematic is misguided. I included bias voltage to try and DC couple the audio signal so that it will properly ride over the diodes. Making sure that the output matches the input is important to me though.
Or maybe should this best be approached in software? Should I instead just be sampling both channels separately and then doing some math to combine them?

Combining the stereo channels of one end of the fork without combining the other two is very difficult. Working in software is much easier.
If you take two sets of samples, you've doubled the amount of math that the microcontroller needs to do.
But if you take readings from both pins and divide them by two, you can add them together and have one set of samples which represents the 'mono' signal.
Keep in mind that human ears have an uneven response to sound volumes, so a 'medium' volume reading on both pins, summed and halved, will result in a 'lower-medium' value. It's better to divide by 1.5 or 1.75 if you can spare the cycles for more complicated division.
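The mixing step described above can be sketched as follows. This is Python for illustration only (the actual project would be Arduino C++), the name `mix_to_mono` is hypothetical, and the `max_value` of 1023 assumes 10-bit ADC readings, which may not match your configuration.

```python
def mix_to_mono(left, right, divisor=2.0, max_value=1023):
    """Combine paired left/right readings into one mono sample stream."""
    mixed = []
    for l, r in zip(left, right):
        value = (l + r) / divisor
        # With a divisor below 2 the sum can exceed the input range, so clamp it.
        mixed.append(min(value, max_value))
    return mixed
```

With the default divisor of 2 the result can never exceed the range of a single channel; a smaller divisor such as 1.5 gives a louder mix but relies on the clamp whenever both channels are near full scale.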

Making Sound Too High To Hear Or Undetectable with SoX/FFmpeg

I want to make a sound that is too high to be detected by the human ear. From my understanding, humans can hear sounds between 20hz and 44000hz.
With sox, I am making a sound that is 50000hz. The problem is I can still hear it. The command I am using is this:
sox -n -r 50000 output.wav rate -L -s 50050 synth 3 sine
Either I have super good hearing or I am doing something wrong. How can I make this sound undetectable with SoX or FFmpeg?

Human hearing is generally considered to range between 20 Hz and 20 kHz, although most people don't hear much above 16 kHz. Digital signals can only represent frequencies up to half of their sampling rate, known as the Nyquist frequency, and so, in order to accurately reproduce audio for the human ear, a sampling rate of at least 40 kHz is needed. In practice, a sampling rate of 44.1 kHz or 48 kHz is almost always used, leaving plenty of space for an inaudible sound somewhere in the 20-22 kHz range.
For example, this command generates a WAV file with a sampling rate of 48kHz containing a sine wave at 22kHz that is completely inaudible to me:
sox -n -r 48000 output.wav synth 3 sine 22000
I think part of your problem was that you were using the wrong syntax to specify the pitch to sox. This question has some good information about using SoX to generate simple tones.
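As a rough illustration of the Nyquist limit mentioned above, this Python sketch computes which frequency a pure tone ends up at after being sampled without an anti-aliasing filter. The function names are hypothetical, and it models only ideal sampling of a single sine, not what SoX does internally.

```python
def nyquist_frequency(sample_rate):
    """Highest frequency a signal at this sample rate can represent."""
    return sample_rate / 2.0

def aliased_frequency(frequency, sample_rate):
    """Frequency at which a pure tone appears once sampled at sample_rate."""
    folded = frequency % sample_rate
    # Anything above the Nyquist frequency is mirrored back below it.
    return min(folded, sample_rate - folded)
```

For example, a 22 kHz tone at a 48 kHz sample rate stays at 22 kHz, while a 30 kHz tone at the same rate would fold down to an audible 18 kHz.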

Basic unit of Sound?

If we consider computer graphics to be the art of image synthesis where the basic unit is a pixel.
What is the basic unit of sound synthesis?
[This relates to programming as I want to generate this via a computer program.]
Thanks!

The basic unit is a sample
In a WAVE file, the sample is just an integer specifying where to move the speaker head to.
The sample rate determines how often a new sample is fed to the speakers (I'm not entirely sure how this part works, but it does get converted to an analog signal first). The samples are typically laid out in the file one right after another.
When you plot all the samples with x-axis being time and y-axis being sample_value, you can see the waveform.
In a wave file, samples can (theoretically) be any bit-size from 0-65535, which remains constant throughout the wave file. But typically 16 or 24 bits are used.

Computer graphics can also have vector shapes as basic units, not just pixels. Generally, vector graphics are generated via computer tools while captured data tends to appear as a grid of pixels (corresponding to an array of sensors in a camera or other capture device). Obviously there is considerable crossover between those classifications.
Similarly, there are sampled (such as .WAV) and generative (such as .MIDI) forms of computer audio. In the sampled case, the smallest unit is a single sample. Just like an array of pixels in the brightness, x- and y-dimensions come together to form an image, an array of samples in the loudness and time dimensions come together to form a sound. In the generative case, it will be something more like a single tone rendered in a particular voice just like vector graphics have paths drawn with particular textures.

A pixel can have a value and be encoded in digital bitmap samples. The same properties apply to sound and digital audio samples.
A pixel is a physical device that can only render the amplitudes of 3 frequencies of light (Red, Green, Blue) at a time. A speaker is a physical device that can render the amplitudes of a wide range of frequencies (~40,000) at a time. The bit resolution of a sample (number of bits used to store the value of a sample) mainly determines how many colors/tones can be rendered - the fidelity of the physical playback device.
Also, as patterns of pixels can be encoded or compressed, most patterns of sound samples are also encoded or compressed (or both).

The fundamental unit of signal processing (of which audio is a special case) would be the sample.
The frequency at which you need to sample a signal depends on the maximum frequency present in the waveform. The sampling theorem states that it is sufficient to sample at slightly more than twice the maximum frequency present in the signal.
http://en.wikipedia.org/wiki/Sampling_theorem
The human ear is sensitive to sounds up to around 20 kHz (the upper frequency lowers with age). This is why music on CD is sampled at 44.1 kHz.
It is often more useful to think of music as being comprised of individual frequencies.
http://www.phys.unsw.edu.au/jw/sound.spectrum.html
Most sound analysis and creation is based on this idea.
Related concepts:
Psychoacoustics: Human perception of sound. Relates to modern sound compression techniques such as mp3.
Fourier series: How complex waveforms are composed of individual frequencies.

I would say the basic unit of sound synthesis is the sine wave. But your definition of synthesis is perhaps different from what audio people would call sound synthesis. Sound synthesis is the creation of sound using the fundamental components of sound.
With sine waves, we can synthesise sounds using many techniques such as subtractive synthesis, additive synthesis or FM synthesis.
Fourier theory states that every sound is a summation of sine waves of differing phases, frequencies and amplitudes.
OK, so how do we represent a sine wave on a computer? Well, a sine wave will be generated using a buffer (array) of 'samples' that have been generated by a function or read from a table. The same technique applies to any sound captured on a computer.
A 'sample' is typically represented as a number between -1 and 1 that directly correlates to the amplitude of a sound at a given moment in time. A typical sound recorded at 16-bit depth would have 65536 (2^16) possible amplitude values. When recording, typically 44,100 samples are captured per second of sound. This is called the sampling frequency, or simply the sample rate.
Upon playback from your computer, each sample passes through a digital-to-analogue converter and generates a vibration in your PC speaker, which in turn causes your ear to perceive the recorded sound.
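A minimal Python sketch of the buffer-of-samples idea described in this answer: it fills a list with sine values between -1 and 1 and quantizes them to signed 16-bit integers. The function names are made up for this example, and scaling by 32767 is one common convention rather than the only one.

```python
import math
import struct

def sine_samples(frequency, duration, sample_rate=44100, amplitude=1.0):
    """Buffer of floats in [-1, 1] holding a sine wave of the given frequency."""
    count = int(duration * sample_rate)
    return [amplitude * math.sin(2.0 * math.pi * frequency * n / sample_rate)
            for n in range(count)]

def to_int16(samples):
    """Quantize floats in [-1, 1] to signed 16-bit integer sample values."""
    return [max(-32768, min(32767, round(s * 32767))) for s in samples]

def pack_int16(values):
    """Pack 16-bit samples as little-endian bytes, one sample after another."""
    return struct.pack("<%dh" % len(values), *values)
```

One second of a 440 Hz tone produced this way is 44,100 samples, or 88,200 bytes once packed as 16-bit values.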

Sound can be expressed in several different units, but the most common in synthesis/computer music is decibels (dB), which are a relative logarithmic measure of amplitude. Specifically they are normally relative to the maximum amplitude of the audio system.
When measuring sound in "real life", the units are normally A-weighted Decibels or dB(A).
The frequency of a sound (i.e. its pitch) is how quickly its amplitude oscillates over time, or in the digital world, over samples. The number of samples per unit of real time is called the sampling rate; conventional hi-fi systems have sampling rates of 44.1 kHz (44,100 samples per second) and synthesis/recording software usually supports up to 96 kHz.
Every sound in the digital domain can be represented as a waveform with the X-axis representing the time (or sample number) and the Y-axis representing the amplitude.
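As a sketch of decibels measured relative to the system's maximum amplitude (often written dBFS), assuming samples are floats with full scale at 1.0. The function names and the `full_scale` parameter are hypothetical.

```python
import math

def amplitude_to_dbfs(amplitude, full_scale=1.0):
    """Level of an amplitude in decibels relative to full scale."""
    magnitude = abs(amplitude)
    if magnitude == 0.0:
        return float("-inf")  # silence has no finite decibel value
    return 20.0 * math.log10(magnitude / full_scale)

def dbfs_to_amplitude(dbfs, full_scale=1.0):
    """Inverse conversion: decibels relative to full scale back to an amplitude."""
    return full_scale * 10.0 ** (dbfs / 20.0)
```

Full scale maps to 0 dB, and halving the amplitude lowers the level by roughly 6 dB.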

Frequency and amplitude of the wave are what make up sound.
That is for a tone.
Music, or for that matter most noise, is a composite of multiple simultaneous sound waves superimposed on one another.
The unit for amplitude is the Bel. (We use tenths of a Bel, hence the term decibel.)
The unit for frequency is the Hertz.
That being said, synthesis of music is a large field.

Bitmapped graphics are based on sampling the amplitude of light in a 2D space, where each sample is digitized to a given bit depth and often converted to a logarithmic representation at a different bit depth. The samples are always positive, since you can't be darker than pure black. Each of these samples is called a pixel.
Sound recording is most often based on sampling the magnitude of sound pressure at a microphone, where the samples are taken at constant time intervals. These samples can be positive or negative with respect to perfect silence. Most often these samples are not converted to a logarithm, even though sound is perceived in a logarithmic fashion just as light is. There is no special term to refer to these samples as there is with pixels.
The Bels and Decibels mentioned by others are useful in the context of measuring peak or average sound levels. They are not used to describe the individual sound samples.
You might also find it useful to know how sound file formats compare to image file formats. WAVE is an uncompressed format specific to Windows and is analogous to BMP. MP3 is a lossy compression analogous to JPEG. FLAC is a lossless compression analogous to 24-bit PNG.

If computer graphics are colored dots in 2 dimensional space representing a 3 dimensional space, then sound synthesis is amplitude values regularly partitioned in time representing musical events.
If you want your result to sound like music (the kind of music most people like at least), then you are either going to use some standard synthesis techniques, or literally waste decades of your life reinventing them from scratch.
The most basic techniques are additive synthesis, in which the individual elements are the frequencies, amplitudes, and phases of sine oscillators; subtractive synthesis, where you work with filter coefficients and a complex input waveform; frequency modulation synthesis, where you work with modulation depths and rates of stages of modulation; granular synthesis where short (hundredths to tenths of a second long) enveloped pieces of a recorded sound or an artificial waveform are combined in immense numbers. Each of these in practice uses parameters that evolve over the course of a note, and often you will mix elements of various techniques into a larger instrument.
I recommend this book, though it doesn't have the math for many concepts it at least lays the ground for the concepts used, and gives a nice overview of the techniques.
You wouldn't waste your time going sample by sample to do music in practice any more than you would waste your time going pixel by pixel to render 3d (in other words yeah go sample by sample if making a tool for other people to make music with, but that is way too low a level if you are interested in the task of making music).
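For a concrete feel of the simplest of the techniques listed above, additive synthesis, here is a short Python sketch that sums sine oscillators, each described by a frequency, an amplitude and a phase. The function name `additive` and the tuple layout are assumptions for this example, and the parameters stay fixed for the whole note, unlike the evolving parameters a practical instrument would use.

```python
import math

def additive(partials, duration, sample_rate=44100):
    """Sum sine oscillators given as (frequency_hz, amplitude, phase_radians) tuples."""
    out = []
    for n in range(int(duration * sample_rate)):
        t = n / sample_rate
        # Each output sample is the sum of all oscillators evaluated at time t.
        out.append(sum(a * math.sin(2.0 * math.pi * f * t + p)
                       for f, a, p in partials))
    return out

# Example: five harmonics of 220 Hz with amplitudes falling off as 1/k.
harmonics = [(220.0 * k, 1.0 / k, 0.0) for k in range(1, 6)]
note = additive(harmonics, 0.01)
```

The peak of the result can reach the sum of the individual amplitudes, so the output usually needs scaling before it is converted to integer samples.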

Probably the envelope. A tone/note has a shape described by: attack decay sustain release
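One possible way to express such an envelope in code is a piecewise-linear gain function, sketched here in Python. The name `adsr_gain`, the default timings and the linear segments are all assumptions for illustration, and the sketch assumes the note is held at least as long as the attack plus the decay.

```python
def adsr_gain(t, note_length, attack=0.01, decay=0.1, sustain=0.7, release=0.2):
    """Gain in [0, 1] at time t (seconds) for a note held for note_length seconds."""
    if t < 0:
        return 0.0
    if t < attack:                      # attack: rise from 0 to 1
        return t / attack
    if t < attack + decay:              # decay: fall from 1 to the sustain level
        return 1.0 - (1.0 - sustain) * (t - attack) / decay
    if t < note_length:                 # sustain: hold while the note is down
        return sustain
    if t < note_length + release:       # release: fall from sustain level to 0
        return sustain * (1.0 - (t - note_length) / release)
    return 0.0
```

Multiplying each sample of a tone by the gain at its time position shapes the note.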

The byte, or word, depending on the bit-depth of the sound.
