Using a Kalman filter to track the position of an object, but I need to know the position of that object as an input to the Kalman filter. What is going on?

I am trying to teach myself how to use a Kalman filter to track an object (a ball) moving in a video sequence, so please explain it to me as if I were a child.
Using some algorithms (color analysis, optical flow, ...), I am able to get a binary image of each video frame in which the tracked object is white pixels and the background is black pixels -> I know the object's size, centroid and position -> just draw a bounding box around the object -> finished. Why do I need to use a Kalman filter here?
OK, somebody told me that because I cannot detect the object in every video frame due to noise, I need to use a Kalman filter to estimate the position of the object. OK, fine. But as far as I know, I need to provide inputs to the Kalman filter: the previous state and a measurement.
Previous state (so I think it is the position, velocity, acceleration, ... of the object in the previous frame) -> OK, this makes sense to me.
Measurement of the current state: here is what I cannot understand. What can the measurement be?
- The position of the object in the current frame? That seems absurd, because if I already knew the position of the object, all I would need to do is draw a simple bounding box (rectangle) around it. Why would I still need a Kalman filter? Therefore, it seems impossible to take the position of the object in the current frame as the measurement value.
- "Kalman Filter Based Tracking in an Video Surveillance System" article says
The main role of the Kalman filtering block is to assign a tracking
filter to each of the measurements entering the system from the
optical flow analysis block.
If you read the full paper, you will see that the author takes the maximum number of blobs and the minimum blob size as inputs to the Kalman filter. How can those parameters be used as measurements?
I think I am in a loop now. I want to use a Kalman filter to track the position of an object, but I need to know the position of that object as an input to the Kalman filter. What is going on?
And one more question: I don't understand the term "number of Kalman filters". In a video sequence, if there are two objects to track, do I need two Kalman filters? Is that what it means?

You don't use the Kalman filter to give you an initial estimate of something; you use it to give you an improved estimate based on a series of noisy estimates.
To make this easier to understand, imagine you're measuring something that is not dynamic, like the height of an adult. You measure once, but you're not sure of the accuracy of the result, so you measure again for 10 consecutive days, and each measurement is slightly different, say a few millimeters apart. So which measurement should you choose as the best value? I think it's easy to see that taking the average will give you a better estimate of the person's true height than using any single measurement.
OK, but what has that to do with the Kalman filter?
The Kalman filter is essentially taking an average of a series of measurements, as above, but for dynamic systems. For instance, let's say you're measuring the position of a marathon runner along a race track, using information provided by a GPS + transmitter unit attached to the runner. The GPS gives you one reading per minute. But those readings are inaccurate, and you want to improve your knowledge of the runner's current position. You can do that in the following way:
Step 1) Using the last few readings, you can estimate the runner's velocity and estimate where he will be at any time in the future (this is the prediction part of the Kalman filter).
Step 2) Whenever you receive a new GPS reading, do a weighted average of the reading and of your estimate obtained in step 1 (this is the update part of the Kalman filter). The result of the weighted average is a new estimate that lies in between the predicted and measured position, and is more accurate than either by itself.
Note that you must specify the model you want the Kalman filter to use in the prediction part. In the marathon runner example you could use a constant velocity model.
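To make that concrete, here is a toy fixed-gain sketch in Python (an alpha-beta style simplification of the Kalman filter, not the full thing; the gain and the readings are made-up values). A real Kalman filter computes that gain at every step from the process and measurement noise covariances instead of fixing it.
# Toy 1-D sketch of the two steps above, with a constant velocity model.
# The gain is fixed here for clarity; a real Kalman filter derives it from
# the noise covariances at every step. All numbers are made up.
readings = [100.0, 210.0, 330.0, 430.0]        # noisy GPS positions, one per minute (metres)
est_pos, est_vel = readings[0], 0.0
gain = 0.5                                     # weight given to each new measurement (assumed)
for z in readings[1:]:
    pred_pos = est_pos + est_vel * 1.0         # Step 1: predict (time unit: minutes)
    est_pos = pred_pos + gain * (z - pred_pos) # Step 2: weighted average with the reading
    est_vel = est_vel + gain * (z - pred_pos) / 1.0
print(est_pos, est_vel)                        # improved position and velocity estimate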

The purpose of the Kalman filter is to mitigate the noise and other inaccuracies in your measurements. In your case, the measurement is the x,y position of the object that has been segmented out of the frame. If you can perfectly segment out the ball, and only the ball, from the background in every frame, there is no need for the Kalman filter, since your measurements in effect contain no noise.
In most applications, perfect measurements cannot be guaranteed for a number of reasons (change in lighting, change in background, other moving objects, etc.) so there needs to be a way of filtering the measurements to produce the best estimate of the true track.
What the Kalman Filter does is use a model to predict what the next position should be assuming the model holds true, and then compares that estimate to the actual measurement you pass in. The actual measurement is used in conjunction with the prediction and noise characteristics to form the final position estimate and update a characterization of the noise (measure of how much the measurements are differing from the model).
The model could be anything that models the system you are trying to track. A common model is a constant velocity model which just assumes that the object will continue to move with the same velocity as in the previous estimate. This is not to say that this model will not track something with a changing velocity since the measurements will reflect the change in velocity and affect the estimate.
There are a number of ways you can attack the problem of tracking multiple objects at once. The simplest way is to use an independent Kalman filter for each track. This is where the Kalman filter really starts to pay off because if you are using the simple approach of just using the centroid of a bounding box, what happens if the two objects cross one another? Can you again differentiate which object is which after they separate? With the Kalman filter, you have the model and prediction that will help keep the track correct when other objects are interfering.
There are also more advanced ways of tracking multiple objects jointly like a JPDAF.

Jason has given a good overview of what the Kalman filter is. Regarding your question about how the paper can use the maximum number of blobs and the minimum blob size, this is exactly the power of the Kalman filter.
A measurement need not be a position, a velocity or an acceleration. A measurement can be any quantity that you can observe at a time instant. If you can define a model that predicts your measurement at the next time instant given the current measurement, the Kalman filter can help you mitigate the noise.
I would suggest you look into more introductory materials on Image Processing and Computer Vision. These materials will almost always cover Kalman filter.
Here is a SIGGRAPH course on trackers. It is not introductory but should give you a more in-depth look at the topic.
http://www.cs.unc.edu/~tracker/media/pdf/SIGGRAPH2001_CoursePack_08.pdf

In the case that you can find the ball exactly in every frame, you don't need a Kalman filter. But just because you find some blob which is likely the ball, it doesn't mean that the center of that blob will be the perfect center of the ball. Think of that as your measurement error. Also, if you happen to pick out the wrong blob, using a Kalman filter helps prevent you from trusting that one wrong measurement. And like you said before, if you can't find the ball in a frame, you can also use the filter to estimate where it is likely to be.
Here are some of the matrices you will need, and my guess at what they would be for you. Since the x and y positions of the ball are independent, I think it is easier to have two filters, one for each. Both would look kinda like this:
x = [position ; velocity] // This is the state the filter estimates (its output)
P = [1,0 ; 0,1] // This is the uncertainty of the estimate. I am not quite sure what you should start with, but it will converge once the filter is running.
F = [1,dt ; 0,1] // When you compute F*x, this predicts the next location of the ball. Notice that this assumes the ball keeps moving with the same velocity as before, and just updates the position.
Q = [0,0 ; 0,vSigma^2] // This is the "process noise", one of the matrices you tune to make the filter perform well. In your system, velocity can change at any time, but position never changes except through the velocity. This can be confusing; vSigma should be the standard deviation of those velocity changes (its square goes into Q).
z = [position in x or y] // This is your measurement
H = [1,0] // This is how your measurement maps onto your current state. Since you are only measuring position, only the position component has a 1.
R = [?] // I think you will only need a scalar for R, which is the variance of your measurement error.
With those matrices you should be able to plug them into the formulas that are everywhere for Kalman filters.
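As an untested illustration only, here is roughly how those matrices plug into the standard predict/update equations in Python for one axis; dt, vSigma and measR are assumed values you would have to tune for your own video.
# Rough sketch for one axis of the ball tracker described above.
# dt, vSigma and measR are assumed tuning values, not taken from the question.
import numpy as np

dt = 1.0 / 30.0                       # time between frames, assuming 30 fps
vSigma = 50.0                         # assumed std. dev. of velocity changes (pixels/s)
measR = 4.0                           # assumed measurement noise variance (pixels^2)

x = np.array([[0.0], [0.0]])          # state: [position; velocity]
P = np.eye(2)                         # initial estimate uncertainty
F = np.array([[1.0, dt], [0.0, 1.0]])
Q = np.array([[0.0, 0.0], [0.0, vSigma**2]])
H = np.array([[1.0, 0.0]])            # we only measure position
R = np.array([[measR]])

def track(x, P, z_pos):
    x = F @ x                         # predict next position/velocity
    P = F @ P @ F.T + Q
    y = np.array([[z_pos]]) - H @ x   # compare prediction with the measured centroid
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    x = x + K @ y                     # blend prediction and measurement
    P = (np.eye(2) - K @ H) @ P
    return x, P

# feed it the x (or y) centroid of the detected blob in each frame:
for cx in [120.0, 124.5, 128.9, 133.2]:
    x, P = track(x, P, cx)
print(x[0, 0], x[1, 0])               # filtered position and velocity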
Some good things to read:
Kalman filtering demo
Another great intro; read the page linked to in the third paragraph.

I had this question a few weeks ago. I hope this answer helps other people.
If you can get a good segmentation in each frame (the whole ball), you don't need to use a Kalman filter. But segmentation can give you a set of unconnected blobs (only a few parts of the ball). The problem is to know which parts (blobs) belong to the object and which are just noise. Using a Kalman filter we can assign blobs near the estimated position to the object. E.g. if the ball has a radius of 10 pixels, blobs at a distance greater than 15 should not be considered part of the object.
The Kalman filter uses the previous state to predict the current state, but it uses the current measurement (the current object position) to improve its next prediction. E.g. if a vehicle is at position 10 (previous state) and travels at 5 m/s, the Kalman filter predicts the next position to be 15. But if we measure the position of the object and find it is at position 18, then in order to improve the estimation, the Kalman filter updates the velocity to 8 m/s.
In summary, the Kalman filter is mainly used to solve the data association problem in video tracking. It is also good for estimating the object position, because it takes into account the noise in the source and in the observation.
And for your final question, you are right. It corresponds to the number of objects to track (one Kalman filter per object).

In vision applications, it is common to use your per-frame detection results as the measurement; for example, the location of the ball in each frame is a good measurement.

Related

Bias/Drift compensation for integration of linear accelerometer data using Kalman filtering

I have become part of this never-ending question of how to estimate position from accelerometer data obtained from an inertial measurement unit (IMU). I am wondering how to compensate for integration "drift" during linear movement using Kalman filtering.
At this moment I have my acceleration in a fixed coordinate system and all movements are in known directions with no change in angular position.
So at this point we have acceleration in 3D (x-y-z) in known directions; an acceleration in x implies zero acceleration in y and z, and so on. This assumes perfect conditions, which is of course not the case: some noise will leak into the other directions when moving in one direction, but let's leave that out for now. In addition, it is important to note that the system only has to estimate over a limited period, approximately 1 second, using a sampling frequency of 512 Hz.
It is also important to note that I have compensated for the offset (gravity and misalignment of the accelerometer in the IMU) and the bias of the accelerometer data when static. That means that when the sensor is not moving, all my readings are constant zero before going into the Kalman filter.
To further characterize my problem, I have this graph to illustrate the drift. These are estimates over 5 seconds to show what I am struggling with.
[Figure: position estimation drift problem]
Here we are looking at a movement in one direction: a 20 cm movement in the y direction, which in my case is forward relative to my starting position.
Is there a way to reduce or eliminate this drift when integrating my signal? For instance, by assuming something about the drift when my sensor is not moving, or by computing some correction in my Kalman algorithm to subtract from or add to my estimated velocity and position. The system does not have to run in real time, so any bias compensation tuning can be adjusted by looking back at the data. But it would be preferable if it were possible to take new measurements with slightly different movements without retuning more than needed.
Finally, where/how can I compensate for this: in the Kalman algorithm, or before/after it? Or should I brace for disappointment already?
If I left out some important information, please ask so I can elaborate; and lastly, any thoughts/ideas are welcome!
Remember, I only need to estimate for about a second's worth of time, so my hope is that this makes it more achievable, but I might be wrong.
I can only guess/suggest a few tricks, but you will probably get some significant error if you rely on the accelerometer alone.
It seems that detecting motionlessness resets only the acceleration, not the speed (according to your graph), so this should be an easy fix (see the sketch after this list).
If we are talking about a car or another type of surface motion with contact/friction, your motionless periods can be detected by characterizing the in-motion noise versus the sensor's own noise.
Your Kalman parameters may be off.
Run multiple kernels and average the results (you may also try a particle filter).
If it is not for an online application, you can also try fitting the offsets/drift and removing them, by assuming segments of no motion or constant speed, or other approaches that can replace the Kalman filter, which is designed for real-time best estimation.
The error seems asymmetric in time, so just run it in both directions (:
What are you measuring at 512 Hz? Maybe you can model it better.
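A rough, untested sketch of the first point above: a zero-velocity update that forces the integrated velocity back to zero whenever the sensor looks still. The stillness threshold and window length are made-up values, not from the question.
# Zero-velocity-update (ZUPT) style sketch: when the accelerometer looks
# motionless, reset the integrated velocity (not just the acceleration)
# so it cannot keep drifting. Threshold and window length are assumed.
import numpy as np

fs = 512.0                  # sampling frequency (Hz), as in the question
dt = 1.0 / fs

def integrate_with_zupt(acc, still_thresh=0.05, win=64):
    acc = np.asarray(acc, dtype=float)
    vel = np.zeros_like(acc)
    pos = np.zeros_like(acc)
    for i in range(1, len(acc)):
        vel[i] = vel[i - 1] + acc[i] * dt
        pos[i] = pos[i - 1] + vel[i] * dt
        # crude stillness test: low acceleration variance over the last window
        lo = max(0, i - win)
        if np.std(acc[lo:i + 1]) < still_thresh and abs(acc[i]) < still_thresh:
            vel[i] = 0.0    # reset velocity, not just acceleration
    return vel, pos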
I can go on and on but if you supply data and code, it would be much easier.
Good luck,
Lev

Normalizing audio waveforms code implementation (Peak, RMS)

I have some audio data (array of floats) which I use to plot a simple waveform.
When plotted, the waveform doesn't max out at the edges.
No problem - the data just needs to be normalized. I iterate once to find the max, and then iterate again dividing each by the max. Plot again and everything looks great!
But wait: videos which have a loud intro or a loud explosion cause the rest of the waveform to still be tiny.
After some research, I came across RMS, which is supposed to address this. I iterate through the samples and calculate the RMS, and again divide each sample by the RMS value. This results in considerable "clipping":
What is the best method to solve this?
Intuitively, it seems I might need to calculate a local max or average based on a moving window (rather than the entire set) but I'm not entirely sure. Help?
Note: The waveform is purely for visual purposes (the audio will not be played back to the user).
You could transform it (effectively making the y-axis non-linear; you can think of it as a form of companding).
Assuming the signal is within the range [-1, 1].
One popular quick and simple solution is to apply the hyperbolic tangent function (tanh). This will limit values to [-1, 1] by penalizing higher values more. If you amplify the signal before applying tanh, the effect will be more pronounced.
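A minimal sketch of that idea in Python (the drive value is an assumption to tweak; higher drive squashes loud parts more):
# Soft-limit samples for display with tanh; drive controls how aggressive it is.
import numpy as np

def compand_for_display(samples, drive=3.0):
    x = np.asarray(samples, dtype=float)
    peak = np.max(np.abs(x))
    if peak > 0:
        x = x / peak                            # peak-normalize to [-1, 1] first
    return np.tanh(drive * x) / np.tanh(drive)  # rescale so full scale stays at 1.0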
Another alternative is a logarithmic transform. As the signal changes sign some pre-processing has to be performed.
If r is a series of sample values one approach could be something like this:
r.log1p <- log2(1.1 * (abs(r) + 1)) * sign(r)
That is, for every value take its absolute, add one, multiply with some small constant, take the log and then finally multiply it with the sign of its corresponding old value.
The effect can be something like this:

Determining Note Durations based on Onset Locations

I have a question regarding how to determine the Duration of notes given their Onset Locations.
So for example, I have an array of amplitude values (containing short) and another array of the same size, that contains a 1 if a note onset is detected, and a 0 if not. So basically, the distance between each 1 will be used to determine the duration.
How can I do this? I know that I have to use the Sample Rate and other attributes of the audio data, but is there a particular formula that I can use?
Thank you!
So you are starting with a list of ONSETS; what you are really looking for is a list of OFFSETS.
There are many methods for onset detection (here is a paper on the topic: https://adamhess.github.io/Onset_Detection_Nov302011.pdf).
Many of the same methods can be applied to offset detection:
Since the onset is marked by an INCREASE in spectral content, you can measure a DECREASE in spectral content.
Take a reasonable time window before and after your onset (0.25-0.5 s).
Chop up the window into smaller segments and take 50%-overlapping Fourier transforms.
Compute the difference between the Fourier coefficients of successive windows and only allow negative changes in the spectral difference.
Multiply your results by -1.
Pick the peaks of the result.
Voila, offsets.
(Look at page 7 of the paper listed above for more detail on the spectral difference function; you can apply a modified version of it, as above.)
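A rough numpy sketch of that negative-spectral-difference idea (frame and hop sizes are assumed values; peak picking is left out):
# Offset detection sketch: look for DECREASES in spectral content after an onset.
# Frame/hop sizes are assumed values to be tuned.
import numpy as np

def offset_strength(samples, frame=1024, hop=512):
    samples = np.asarray(samples, dtype=float)
    window = np.hanning(frame)
    strengths = []
    prev_mag = None
    for start in range(0, len(samples) - frame, hop):    # 50% overlapping frames
        mag = np.abs(np.fft.rfft(samples[start:start + frame] * window))
        if prev_mag is not None:
            diff = mag - prev_mag
            diff[diff > 0] = 0.0             # only allow negative spectral changes
            strengths.append(-np.sum(diff))  # multiply by -1 so offsets become peaks
        prev_mag = mag
    return np.array(strengths)               # pick the peaks of this to find offsets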
Well, if your sample rate in Hz is fs, then the time between two notes is equal to
1/fs * <number of samples between the two onset markers>
Very simple :-)
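For example, a minimal sketch of that formula, assuming fs is the sample rate and onsets is the 0/1 array from the question:
# Convert an array of 0/1 onset markers into note durations in seconds.
import numpy as np

def note_durations(onsets, fs):
    idx = np.flatnonzero(onsets)        # sample indices where a note starts
    return np.diff(idx) / float(fs)     # gap between successive onsets, in seconds

# e.g. with fs = 44100 and onsets = [1, 0, ..., 0, 1, ...] you get one duration per note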
Regards

DSP - Filter sweep effect

I'm implementing a 'filter sweep' effect (I don't know if that is the right name for it). What I do is basically create a low-pass filter and make it 'move' along a certain frequency range.
To calculate the filter cut-off frequency at a given moment I use a user-provided linear function, which yields values between 0 and 1.
My first attempt was to directly map the values returned by the linear function to the range of frequencies, as in cf = freqRange * lf(x). Although it worked OK, it seemed as if the sweep ran much faster when moving through the low frequencies and then slowed down on its way to the high-frequency zone. I'm not sure why this is, but I guess it has something to do with human hearing perceiving changes in frequency in a non-linear manner.
My next attempt was to move the filter's cut-off frequency in a logarithmic way. It works much better now but I still feel that the filter doesn't move at a constant perceived speed through the range of frequencies.
How should I divide the frequency space to obtain a constant perceived sweep speed?
Thanks in advance.
The frequency sweep effect you're referring to is likely a wah-wah filter, named for the ubiquitous wah-wah pedal.
We hear frequency in terms of octaves, and sweeping through octaves with a logarithmic scale is the way to linearize it. Not to sound dismissive, but it sounds like what you're doing is physically and mathematically correct. (You should spend as much time between 200 and 400 Hz as you do between 2000 and 4000 Hz, etc.) You just don't like how it sounds. And that's quite okay on both counts -- audio is highly subjective.
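In code, that log/octave mapping boils down to an exponential interpolation between the ends of the sweep; a small sketch (the 200 Hz to 8 kHz range is just an example):
# Map the 0..1 output of the user-provided linear function to a cutoff frequency
# so the sweep covers equal numbers of octaves per unit time.
def cutoff(lf_value, f_low=200.0, f_high=8000.0):   # example sweep range, assumed
    return f_low * (f_high / f_low) ** lf_value     # exponential interpolation

# lf_value 0.0 -> 200 Hz, 0.5 -> about 1265 Hz, 1.0 -> 8000 Hz:
# equal increments of lf_value now cover equal numbers of octaves.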
To mix things up a bit, one option would be to try the Bark scale, which is based on psychoacoustics and the structure of the ear. As I understand it, this is designed to spend equal amounts of time in each of your ear's internal "bandpass filters".
You could always try a quadratic or cubic function between 0 and 1. Audio potentiometers often use a few piecewise quadratic or cubic sections to get their mapping.
Winging it, but try this:
http://en.wikipedia.org/wiki/Physics_of_music#Scales "The following table shows the ratios between the frequencies of all the notes of the just major scale and the fixed frequency of the first note of the scale."
There is then a chart showing the fractional ratios between 1 and 2, and if you tweak your timing to match, you may get what you want. While the overall progression is still logarithmic, the stepping divides each octave up into equal-stepped 8ths (a bit jumpy).
Put another way, every half second adjust one note up. Each octave (I think) will cover twice the frequency range of the prior octave.
EDIT: Also, you'll find the frequencies here: http://en.wikipedia.org/wiki/Middle_C#Designation_by_octave (doesn't the programmer in you wish that C0 was exactly 16hz?)

Downsampling and applying a lowpass filter to digital audio

I've got a 44 kHz audio stream from a CD, represented as an array of 16-bit PCM samples. I'd like to cut it down to an 11 kHz stream. How do I do that? From my days of engineering class many years ago, I know that the stream won't be able to describe anything over 5500 Hz accurately anymore, so I assume I want to cut everything above that out too. Any ideas? Thanks.
Update: There is some code on this page that converts from 48 kHz to 8 kHz using a simple algorithm and a coefficient array that looks like { 1, 4, 12, 12, 4, 1 }. I think that is what I need, but I need it for a factor of 4x rather than 6x. Any idea how those constants are calculated? Also, I end up converting the 16-bit samples to floats anyway, so I can do the downsampling with floats rather than shorts, if that helps the quality at all.
Read up on FIR and IIR filters. These are the filters that use a coefficient array.
If you do a Google search for "FIR or IIR filter designer" you will find lots of software and online applets that do the hard job (getting the coefficients) for you.
EDIT:
This page here ( http://www-users.cs.york.ac.uk/~fisher/mkfilter/ ) lets you enter the parameters of your filter and will spit out ready to use C-Code...
You're right in that you need to apply lowpass filtering to your signal. Any signal content over 5500 Hz will still be present in your downsampled signal, but 'aliased' to another frequency, so you'll have to remove it before downsampling.
It's a good idea to do the filtering with floats. There are fixed-point filter algorithms too, but those generally involve quality tradeoffs. If you've got floats, use them!
Using DFTs for filtering is generally overkill, and it makes things more complicated because DFTs are not a continuous process but work on buffers.
Digital filters generally come in two flavors, FIR and IIR. They're generally the same idea, but IIR filters use feedback loops to achieve a steeper response with far fewer coefficients. This might be a good idea for downsampling because you need a very steep filter slope there.
Downsampling is sort of a special case. Because you're going to throw away 3 out of 4 samples there's no need to calculate them. There is a special class of filters for this called polyphase filters.
Try googling for polyphase IIR or polyphase FIR for more information.
Notice (in addition to the other comments) that the simple, easy, intuitive approach of "downsample by a factor of 4 by replacing each group of 4 consecutive samples with the average value" is not optimal, but it is nevertheless not wrong, neither practically nor conceptually. That is because the averaging amounts precisely to a low-pass filter (a rectangular window, which corresponds to a sinc in frequency). What would be conceptually wrong is to just downsample by taking one out of every 4 samples: that would definitely introduce aliasing.
By the way: practically any software that does some resampling (audio, image or whatever; example for the audio case: sox) takes this into account, and frequently lets you choose the underlying low-pass filter.
You need to apply a lowpass filter before you downsample the signal to avoid "aliasing". The cutoff frequency of the lowpass filter should be less than the Nyquist frequency of the new, lower sample rate, i.e. half of it (5.5 kHz in your case).
The "best" solution possible is indeed a DFT, discarding the top 3/4 of the frequencies, and performing an inverse DFT, with the domain restricted to the bottom 1/4th. Discarding the top 3/4ths is a low-pass filter in this case. Padding to a power of 2 number of samples will probably give you a speed benefit. Be aware of how your FFT package stores samples though. If it's a complex FFT (which is much easier to analyze, and generally has nicer properties), the frequencies will either go from -22 to 22, or 0 to 44. In the first case, you want the middle 1/4th. In the latter, the outermost 1/4th.
You can do an adequate job by averaging sample values together. The naïve way of grabbing samples four by four and doing an equal weighted average works, but isn't too great. Instead you'll want to use a "kernel" function that averages them together in a non-intuitive way.
Mathwise, discarding everything outside the low-frequency band is multiplication by a box function in frequency space. The (inverse) Fourier transform turns pointwise multiplication into a convolution of the (inverse) Fourier transforms of the functions, and vice versa. So, if we want to work in the time domain, we need to perform a convolution with the (inverse) Fourier transform of the box function. This turns out to be proportional to the "sinc" function sin(at)/(at), where a is the width of the box in frequency space. So at every 4th location (since you're downsampling by a factor of 4) you can add up the nearby points, each multiplied by sin(a*dt)/(a*dt), where dt is the distance in time to that location. How nearby? Well, that depends on how good you want it to sound. It's common to ignore everything outside the first zero, for instance, or just take the number of points to be the ratio by which you're downsampling.
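A rough numpy sketch of that windowed-sinc idea (the kernel length and the Hann taper are my own choices, not the answer's):
# Windowed-sinc low-pass + keep every 4th sample (decimate by 4).
# Kernel length and the Hann window are arbitrary choices to be tuned.
import numpy as np

def decimate_by_4(samples, taps=64):
    samples = np.asarray(samples, dtype=float)
    n = np.arange(-taps, taps + 1)
    kernel = np.sinc(n / 4.0) / 4.0        # cutoff at 1/8 of the original rate (the new Nyquist)
    kernel *= np.hanning(len(kernel))      # taper the sinc to reduce ringing
    filtered = np.convolve(samples, kernel, mode='same')
    return filtered[::4]                   # keep the zeroth, fourth, eighth, ... sample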
Finally there's the piss-poor (but fast) way of just discarding the majority of the samples, keeping just the zeroth, the fourth, and so on.
Honestly, if it fits in memory, I'd recommend just going the DFT route. If it doesn't, use one of the software filter packages that others have recommended to construct the filter for you.
The process you're after is called "decimation".
There are 2 steps:
Applying a low-pass filter to the data (in your case an LPF with cutoff at Pi/4).
Downsampling (in your case taking 1 out of every 4 samples).
There are many methods to design and apply the Low Pass Filter.
You may start here:
http://en.wikipedia.org/wiki/Filter_design
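If you'd rather not design the filter yourself, a library can do both steps at once. A small sketch using scipy.signal.decimate (the test tone is only there to make the example self-contained):
# Decimate by 4 (44.1 kHz -> about 11 kHz): scipy applies an anti-aliasing
# low-pass filter before keeping every 4th sample.
import numpy as np
from scipy.signal import decimate

fs_in = 44100
t = np.arange(fs_in) / fs_in
samples = np.sin(2 * np.pi * 1000 * t)      # 1 second of a 1 kHz test tone
out = decimate(samples, 4, ftype='fir')     # anti-alias LPF, then keep 1 of 4 samples
print(len(samples), len(out))               # 44100 -> 11025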
You could make use of libsamplerate to do the heavy lifting. Libsamplerate is a C API, and takes care of calculating the filter coefficients. It lets you select from filters of different quality so that you can trade off quality for speed.
If you would prefer not to write any code, you could just use Audacity to do the sample rate conversion. It offers a powerful GUI, and makes use of libsamplerate for its sample rate conversion.
I would try applying a DFT, chopping off 3/4 of the result, and applying an inverse DFT. I can't tell if it will sound good without actually trying, though.
I recently came across BruteFIR which may already do some of what you're interested in?
You have to apply a low-pass filter (removing frequencies above 5500 Hz) and then apply decimation (keep every Nth sample, every 4th in your case).
For decimation, FIR rather than IIR filters are usually employed, because they don't depend on previous outputs and therefore you don't have to calculate anything for the discarded samples. IIRs generally depend on both inputs and outputs, so unless a specific type of IIR is used, you'd have to calculate every output sample before discarding 3/4 of them.
Just googled an intro-level article on the subject: https://www.dspguru.com/dsp/faqs/multirate/decimation
