Azure Kinect DK raw data - azure

I'm considering using the Azure Kinect for a product that I am working on but something that I saw in one of the documents that I read gives me pause:
"The depth camera transmits raw modulated IR images to the host PC. On
the PC, the GPU accelerated depth engine software converts the raw
signal into depth maps."
I'm unsure if this is the result of a non-technical writer trying to interpret what an engineer told them or if this should be interpreted as literal truth.
I'm assuming that the LEDs are being modulated and that the phase shift is what is being used to determine the depths. If the sensor inside the Kinect is just a regular camera that has a fast global shutter that can be timed precisely, I suppose that you could take several images in a row and work out on a pixel by pixel basis the phase shift for the light that it is receiving. I'm working with a resource-limited embedded computer that may not have the processing power to deal with working out the depth for every pixel in real time. Is this what is actually happening where the host computer really is determining the depth or does the Kinect directly give you distances?

Related

I need to analyse many audio WAV files for characteristic noise, ideas?

I need to be able to analyze (search thru) hundreds of WAV files and detect but not remove static noise. As done currently now, I must listen to each conversation and find the characteristic noise/static manually, which takes too much time. Ideally, I would need a program that can read each new WAV file and be able to detect characteristic signatures of the static noise such as periods of bursts of white noise or full audio band, high amplitude noise (like AM radio noise over phone conversation such as a wall of white noise) or bursts of peek high frequency high amplitude (as in crackling on the phone line) in a background of normal voice. I do not need to remove the noise but simply detect it and flag the recording for further troubleshooting. Ideas?
I can listen to the recordings and find the static or crackling but this takes time. I need an automated or batch process that can run on its own and flag the troubled call recordings (WAV files for a phone PBX). These are SIP and analog conversations depending on the leg of the conversation so RTSP/SIP packet analysis might be an option, but the raw WAV file is the simplest. I can use Audacity, but this still requires opening each file and looking at the visual representation of the audio spectrometry and is only a little faster than listening to each call but still cumbersome.
I currently have no code or methods for this task. I simply listen to each call wav file to find the noise.
I need a batch Wav file search that can render wav file recordings that contain the characteristic noise or static or crackling over the recording phone conversation.
Unless you can tell the program how the noise looks like, it's going to be challenging to run any sort of batch processing. I was facing a similar challenge and that prompted me to develop (free and open source) software to help user in audio exploration, analysis and signal separation:
App: https://audioexplorer.online/
Docs: https://tracek.github.io/audio-explorer/
Source code: https://github.com/tracek/audio-explorer
Essentially, it visualises audio as a 2d scatter plot rather than only "linear", as in waveform or spectrogram. When you upload audio the following happens:
Onsets are detected (based on high-frequency content algorithm from aubio) according to the threshold you set. Set it to None if you want all.
Per each audio fragment, calculate audio features based on your selection. There's no universal best set of features, all depends on the application. You might try for starter with e.g. Pitch statistics. Consider setting proper values for bandpass filter and sample length (that's the length of audio fragment we're going to use). Sample length could be in future established dynamically. Check docs for more info.
The result is that for each fragment you have many features, e.g. 6 or 60. That means we have then k-dimensional (where k is number of features) structure, which we then project to 2d space with dimensionality reduction algorithm of your selection. Uniform Manifold Approximation and Projection is a sound choice.
In theory, the resulting embedding should be such that similar sounds (according to features we have selected) are closely together, while different further apart. Your noise should be now separated from your "not noise" and form cluster.
When you hover over the graph, in right-upper corner a set of icons appears. One is lasso selection. Use it to mark points, inspect spectrogram and e.g. download table with features that describe that signal. At that moment you can also reduce the noise (extra button appears) in a similar way to Audacity - it analyses the spectrum and reduces these frequencies with some smoothing.
It does not completely solve your problem right now, but could severely cut the effort. Going through hundreds of wavs could take better part of the day, but you will be done. Want it automated? There's CLI (command-line interface) that I am developing at the same time. In not-too-distant future it should take what you have labelled as noise and signal and then use supervised machine learning to go through everything in batch mode.
Suggestions / feedback? Drop an issue on GitHub.

Using microphone input to create a music visualization in real time on a 3D globe

I am involved in a side project that has a loop of LEDs around 1.5m in diameter with a rotor on the bottom which spins the loop. A raspberry pi controls the LEDs so that they create what appears to be a 3D globe of light. I am interested in a project that takes a microphone input and turns it into a column of pixels which is rendered on the loop in real time. The goal of this is to see if we can have it react to music in real-time. So far I've come up with this idea:
Using a FFT to quickly turn the input sound into a function that maps to certain pixels to certain colors based on the amplitude of the resultant function at frequencies, so the equator of the globe would respond to the strength of the lower-frequency sound, progressing upwards towards the poles which would respond to high frequency sound.
I can think of a few potential problems, including:
Performance on a raspberry pi. If the response lags too far behind the music it wouldn't seem to the observer to be responding to the specific song he/she is also hearing.
Without detecting the beat or some overall characteristic of the music that people understand it might be difficult for the observers to understand the output is correlated to the music.
The rotor has different speeds, so the image is only stationary if the rate of spin is matched perfectly to the refresh rate of the LEDs. This is a problem, but also possibly helpful because I might be able to turn down both the refresh rate and the rotor speed to reduce the computational load on the raspberry pi.
With that backstory, I should probably now ask a question. In general, how would you go about doing this? I have some experience with parallel computing and numerical methods but I am totally ignorant of music and tone and what-not. Part of my problem is I know the raspberry pi is the newest model, but I am not sure what its parallel capabilities are. I need to find a few linux friendly tools or libraries that can do an FFT on an ARM processor, and be able to do the post-processing in real time. I think a delay of ~0.25s or about would be acceptable. I feel like I'm in over my head so I thought id ask you guys for input.
Thanks!

Calculate A-weighted (or B or C) SPL decibel iOS

How can I calculate A-weighted and C-weighted dB sound levels from the microphone on iOS?
Here is what I have tried, but the reading I get is far below the sound level meter I have next to my iPhone:
Using the Novocain library, which I have slightly modified to set the audio session mode to Measurement.
Using the Maximilian audio library to run the incoming audio frames through an FFT and converting the amplitudes into dB.
Using the Maximilian audio libraries's Octave Analyser to place the FFT output into octave bins from 10hz to 20480hz.
For each octave bin I am apply a db gain of the relevant dB-weighing (e.g. apply -70.f db gain to the db value stored in the 10hz bin to get an A-weighted dB gain).
Added the db values of each bin together by reducing each dB bin to an amplitude, and the gain to an an amplitude, making the addition, and converting back to a dB value again.
Is this on the correct track, I have my doubts? Can someone outline an approach? Suggest a library and/or other example (I have looked).
To note – I would like approximate dB(A) and dB(C) values, this does not need to be scientific. Not sure how to compensate for the frequency response of the microphone, could the above technique be correct if it were compensating for the response of the microphone?
I don't think you can measure physical Sound Pressure Levels from a device. In step 2 you "convert the amplitudes into dB." However the amplitude that you record from the device has arbitrary units. When recording 16-bit audio, the audio is represented as numbers in the range -32768 to +32767. If you are working with floating point data then this is normalised by 32768 so that it has a range of (approximately) -1 to +1.
The device's microphone has to cope with a wide variety of sound levels. Generally devices will have some form of Automatic Gain Control which adapts to the current average sound level. This means that if you measure a peak value of 1.0 then you have no way of knowing the actual SPL it corresponded to. You can convert a recording to a series of dBs, but this uses a different definition of dB: as a power ratio. This has no correlation with SPL measurements such as dB(A).
It may be possible to produce an approximate dB(A) measurement if you are able to turn off the AGC and calibrate your device against your sound meter
EDIT: JASA have published a paper with a detailed comparison of existing SPL measurement apps with stats for the comparison of different generations of iPhone: Evaluation of smartphone sound measurement applications

Obstacle avoidance using 2 fixed cameras on a robot

I will be start working on a robotics project which involves a mobile robot that has mounted 2 cameras (1.3 MP) fixed at a distance of 0.5m in between.I also have a few ultrasonic sensors, but they have only a 10 metter range and my enviroment is rather large (as an example, take a large warehouse with many pillars, boxes, walls .etc) .My main task is to identify obstacles and also find a roughly "best" route that the robot must take in order to navigate in a "rough" enviroment (the ground floor is not smooth at all). All the image processing is not made on the robot, but on a computer with NVIDIA GT425 2Gb Ram.
My questions are :
Should I mount the cameras on a rotative suport, so that they take pictures on a wider angle?
It is posible creating a reasonable 3D reconstruction based on only 2 views at such a small distance in between? If so, to what degree I can use this for obstacle avoidance and a best route construction?
If a roughly accurate 3D representation of the enviroment can be made, how can it be used as creating a map of the enviroment? (Consider the following example: the robot must sweep an fairly large area and it would be energy efficient if it would not go through the same place (or course) twice;however when a 3D reconstruction is made from one direction, how can it tell if it has already been there if it comes from the opposite direction )
I have found this response on a similar question , but I am still concerned with the accuracy of 3D reconstruction (for example a couple of boxes situated at 100m considering the small resolution and distance between the cameras).
I am just starting gathering information for this project, so if you haved worked on something similar please give me some guidelines (and some links:D) on how should I approach this specific task.
Thanks in advance,
Tamash
If you want to do obstacle avoidance, it is probably easiest to use the ultrasonic sensors. If the robot is moving at speeds suitable for a human environment then their range of 10m gives you ample time to stop the robot. Keep in mind that no system will guarantee that you don't accidentally hit something.
(2) It is posible creating a reasonable 3D reconstruction based on only 2 views at such a small distance in between? If so, to what degree I can use this for obstacle avoidance and a best route construction?
Yes, this is possible. Have a look at ROS and their vSLAM. http://www.ros.org/wiki/vslam and http://www.ros.org/wiki/slam_gmapping would be two of many possible resources.
however when a 3D reconstruction is made from one direction, how can it tell if it has already been there if it comes from the opposite direction
Well, you are trying to find your position given a measurement and a map. That should be possible, and it wouldn't matter from which direction the map was created. However, there is the loop closure problem. Because you are creating a 3D map at the same time as you are trying to find your way around, you don't know whether you are at a new place or at a place you have seen before.
CONCLUSION
This is a difficult task!
Actually, it's more than one. First you have simple obstacle avoidance (i.e. Don't drive into things.). Then you want to do simultaneous localisation and mapping (SLAM, read Wikipedia on that) and finally you want to do path planning (i.e. sweeping the floor without covering area twice).
I hope that helps?
I'd say no if you mean each eye rotating independently. You won't get the accuracy you need to do the stereo correspondence and make calibration a nightmare. But if you want the whole "head" of the robot to pivot, then that may be doable. But you should have some good encoders on the joints.
If you use ROS, there are some tools which help you turn the two stereo images into a 3d point cloud. http://www.ros.org/wiki/stereo_image_proc. There is a tradeoff between your baseline (the distance between the cameras) and your resolution at different ranges. large baseline = greater resolution at large distances, but it also has a large minimum distance. I don't think i would expect more than a few centimeters of accuracy from a static stereo rig. and this accuracy only gets worse when you compound there robot's location uncertainty.
2.5. for mapping and obstacle avoidance the first thing i would try to do is segment out the ground plane. the ground plane goes to mapping, and everything above is an obstacle. check out PCL for some point cloud operating functions: http://pointclouds.org/
if you can't simply put a planar laser on the robot like a SICK or Hokuyo, then i might try to convert the 3d point cloud into a pseudo-laser-scan then use some off the shelf SLAM instead of trying to do visual slam. i think you'll have better results.
Other thoughts:
now that the Microsoft Kinect has been released, it is usually easier (and cheaper) to simply use that to get a 3d point cloud instead of doing actual stereo.
This project sounds a lot like the DARPA LAGR program. (learning applied to ground robots). That program is over, but you may be able to track down papers published from it.

Real time pitch detection

I'm trying to do real time pitch detection of a users singing, but I'm running into alot of problems. I've tried lots of methods, including FFT (FFT Problem (Returns random results)) and autocorrelation (Autocorrelation pitch detection returns random results with mic input), but I can't seem to get any methods to give a good result. Can anyone suggest a method for real-time pitch tracking or how to improve on a method I already have? I can't seem to find any good C / C++ methods for real time pitch detection.
Thanks,
Niall.
Edit: Just to note, i've checked that the mic input data is correct, and that when using a sine wave the results are more or less the correct pitch.
Edit: Sorry this is late, but at the moment, im visualizing the autocolleration by taking the values out of the results array, and each index, and plotting the index on the X axis and the value on the Y axis (both are divided by 100000 or something, and im using OpenGL), plugging the data into a VST host and using VST plugins isn't an option to me. At the moment, it just looks like some random dots. Am i doing it correctly, or can you please point me torwards some code for doing it or help me understand how to visualize the raw audio data and autocorrelation data.
Taking a step back... To get this working you MUST figure out a way to plot intermediate steps of this process. What you're trying to do is not particularly hard, but it is error prone and fiddly. Clipping, windowing, bad wiring, aliasing, DC offsets, reading the wrong channels, the weird FFT frequency axis, impedance mismatches, frame size errors... who knows. But if you can plot the raw data, and then plot the FFT, all will become clear.
I found several open source implementations of real-time pitch tracking
dywapitchtrack uses a wavelet-based algorithm
"Realtime C# Pitch Tracker" uses a modified autocorrelation approach now removed from Codeplex - try searching on GitHub
aubio (mentioned by piem; several algorithms are available)
There are also some pitch trackers out there which might not be designed for real-time, but may be usable that way for all I know, and could also be useful as a reference to compare your real-time tracker to:
Praat is an open source package sometimes used for pitch extraction by linguists and you can find the algorithm documented at http://www.fon.hum.uva.nl/paul/praat.html
Snack and WaveSurfer also contain a pitch extractor
I know this answer isn't going to make everyone happy but here goes.
This stuff is hard, very hard. Firstly go read as many tutorials as you can find on FFT, Autocorrelation, Wavelets. Although I'm still struggling with DSP I did get some insights from the following.
https://www.coursera.org/course/audio the course isn't running at the moment but the videos are still available.
http://miracle.otago.ac.nz/tartini/papers/Philip_McLeod_PhD.pdf thesis about the development of a pitch recognition algorithm.
http://dsp.stackexchange.com a whole site dedicated to digital signal processing.
If like me you didn't do enough maths to completely follow the tutorials don't give up as some of the diagrams and examples still helped me to understand what was going on.
Next is test data and testing. Write yourself a library that generates test files to use in checking your algorithm/s.
1) A super simple pure sine wave generator. So say you are looking at writing YAT(Yet Another Tuner) then use your sine generator to create a series of files around 440Hz say from 420-460Hz in varying increments and see how sensitive and accurate your code is. Can it resolve to within 5Hz, 1Hz, finer still?
2) Then upgrade your sine wave generator so that it adds a series of weaker harmonics to the signal.
3) Next are real world variations on harmonics. So whilst for most stringed instruments you'll see a series of harmonics as simple multiples of the fundamental frequency F0, for instruments like clarinets and flutes because of the way the air behaves in the chamber the even harmonics will be missing or very weak. And for some instruments F0 is missing but can be determined from the distribution of the other harmonics. F0 being what the human ear perceives as pitch.
4) Throw in some deliberate distortion by shifting the harmonic peak frequencies up and down in an irregular manner
The point being that if you are creating files with known results then its easier to verify that what you are building actually works, bugs aside of course.
There are also a number of "libraries" out there containing sound samples.
https://freesound.org from the Coursera series mentioned above.
http://theremin.music.uiowa.edu/MIS.html
Next be aware that your microphone is not perfect and unless you have spent thousands of dollars on it will have a fairly variable frequency response range. In particular if you are working with low notes then cheaper microphones, read the inbuilt ones in your PC or Phone, have significant rolloff starting at around 80-100Hz. For reasonably good external ones you might get down to 30-40Hz. Go find the data on your microphone.
You can also check what happens by playing the tone through speakers and then recording with you favourite microphone. But of course now we are talking about 2 sets of frequency response curves.
When it comes to performance there are a number of freely available libraries out there although do be aware of the various licensing models.
Above all don't give up after your first couple of tries. Best of luck.
Here's the C++ source code for an unusual two-stage algorithm that I devised which can do Realtime Pitch Detection on polyphonic MP3 files while being played on Windows. This free application (PitchScope Player, available on web) is frequently used to detect the notes of a guitar or saxophone solo upon a MP3 recording. The algorithm is designed to detect the most dominant pitch (a musical note) at any given moment in time within a MP3 music file. Note onsets are accurately inferred by a significant change in the most dominant pitch (a musical note) at any given moment during the MP3 recording.
When a single key is pressed upon a piano, what we hear is not just one frequency of sound vibration, but a composite of multiple sound vibrations occurring at different mathematically related frequencies. The elements of this composite of vibrations at differing frequencies are referred to as harmonics or partials. For instance, if we press the Middle C key on the piano, the individual frequencies of the composite's harmonics will start at 261.6 Hz as the fundamental frequency, 523 Hz would be the 2nd Harmonic, 785 Hz would be the 3rd Harmonic, 1046 Hz would be the 4th Harmonic, etc. The later harmonics are integer multiples of the fundamental frequency, 261.6 Hz ( ex: 2 x 261.6 = 523, 3 x 261.6 = 785, 4 x 261.6 = 1046 ). Linked at the bottom, is a snapshot of the actual harmonics which occur during a polyphonic MP3 recording of a guitar solo.
Instead of a FFT, I use a modified DFT transform, with logarithmic frequency spacing, to first detect these possible harmonics by looking for frequencies with peak levels (see diagram below). Because of the way that I gather data for my modified Log DFT, I do NOT have to apply a Windowing Function to the signal, nor do add and overlap. And I have created the DFT so its frequency channels are logarithmically located in order to directly align with the frequencies where harmonics are created by the notes on a guitar, saxophone, etc.
Now being retired, I have decided to release the source code for my pitch detection engine within a free demonstration app called PitchScope Player. PitchScope Player is available on the web, and you could download the executable for Windows to see my algorithm at work on a mp3 file of your choosing. The below link to GitHub.com will lead you to my full source code where you can view how I detect the harmonics with a custom Logarithmic DFT transform, and then look for partials (harmonics) whose frequencies satisfy the correct integer relationship which defines a 'pitch'.
My Pitch Detection Algorithm is actually a two-stage process: a) First the ScalePitch is detected ('ScalePitch' has 12 possible pitch values: {E, F, F#, G, G#, A, A#, B, C, C#, D, D#} ) b) and after ScalePitch is determined, then the Octave is calculated by examining all the harmonics for the 4 possible Octave-Candidate notes. The algorithm is designed to detect the most dominant pitch (a musical note) at any given moment in time within a polyphonic MP3 file. That usually corresponds to the notes of an instrumental solo. Those interested in the C++ source code for my Two-Stage Pitch Detection algorithm might want to start at the Estimate_ScalePitch() function within the SPitchCalc.cpp file at GitHub.com.
https://github.com/CreativeDetectors/PitchScope_Player
Below is the image of a Logarithmic DFT (created by my C++ software) for 3 seconds of a guitar solo on a polyphonic mp3 recording. It shows how the harmonics appear for individual notes on a guitar, while playing a solo. For each note on this Logarithmic DFT we can see its multiple harmonics extending vertically, because each harmonic will have the same time-width. After the Octave of the note is determined, then we know the frequency of the Fundamental.
I had a similar problem with microphone input on a project I did a few years back - turned out to be due to a DC offset.
Make sure you remove any bias before attempting FFT or whatever other method you are using.
It is also possible that you are running into headroom or clipping problems.
Graphs are the best way to diagnose most problems with audio.
Take a look at this sample application:
http://www.codeproject.com/KB/audio-video/SoundCatcher.aspx
I realize the app is in C# and you need C++, and I realize this is .Net/Windows and you're on a mac... But I figured his FFT implementation might be a starting reference point. Try to compare your FFT implementation to his. (His is the iterative, breadth-first version of Cooley-Tukey's FFT). Are they similar?
Also, the "random" behavior you're describing might be because you're grabbing data returned by your sound card directly without assembling the values from the byte-array properly. Did you ask your sound card to sample 16 bit values, and then gave it a byte-array to store the values in? If so, remember that two consecutive bytes in the returned array make up one 16-bit audio sample.
Java code for a real-time real detector is available at http://code.google.com/p/freqazoid/.
It works fairly well on any computer running post-2008 real-time Java. The project has been dropped and could be picked up by any interested party. Contact me if you want further details.
Check out aubio, and open source library which includes several state-of-the-art methods for pitch tracking.
I have asked a similar question here:
C/C++/Obj-C Real-time algorithm to ascertain Note (not Pitch) from Vocal Input
EDIT:
Performous contains a C++ module for realtime pitch detection
Also Yin Pitch-Tracking algorithm
You could do real time pitch detection, be it of a singer's voice, with TarsosDSP
https://github.com/JorenSix/TarsosDSP
just in case anyone hasn't heard of it yet :-)
Can you adapt anything from instrument tuners? My delightfully compact guitar tuner is able to detect the pitch of the strings pretty well. I see this reference to a piano tuner which explains an algorithm to some extent.
Here are some open source libraries that implement pitch detection:
WORLD : speech analysis/synthesis toolkit. This is especially suitable if your source signal is voice.
aubio : audio feature extraction library. Implements many pitch detection algorithms.
Pitch detection : a collection of pitch detection algorithms implemented in C++.
dywapitchtrack : a high quality pitch detection algorithm.
YIN : another implementation of the YIN algorithm in a single C++ source file.

Resources