How do I interpret an incorrect result?

I have been using libsvm. It produces some good results (95% on positives, 94% on negatives). When I examine the ones that it gets incorrect, however, I am confused about why it got them wrong. How do I determine what it is doing wrong? (More importantly, how do I explain it to my boss?) Some of the testing inputs it gets wrong are very close (visually) to some of the testing inputs it gets right.
About my problem: I am looking at images, 32x32 pixels, 8-bit greyscale. I am evaluating different feature detectors and using them as a dense representation (i.e. at every pixel) of the image. Hence, my feature length is often 1024; some of the feature detectors have multiple outputs, and sometimes I do not use every pixel but every 3rd or 5th, etc. It is a binary classification task, looking for figures in the image; for example, I am trying to find a square, with various letters for negatives. The SVM does well. But sometimes it will classify a T as a square, and I don't know why. If I'm using probabilities, then sometimes the probability is quite high. What do I do to get an insight into what it is doing and why?

Related

The bounding box's position and size are incorrect; how do I improve its accuracy?

I'm using detectron2 for solving a segmentation task,
I'm trying to classify an object into 4 classes,
so I have used COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml.
I have applied 4 kinds of augmentation transforms and after training I get about 0.1
total loss.
But for some reason the accuracy of the bbox is not great on some images in the test set:
the bbox is drawn either larger or smaller, or doesn't cover the whole object.
Moreover, sometimes the predictor draws several bboxes, as if there were several different objects, although there is only a single object.
Are there any suggestions on how to improve its accuracy?
Are there any good-practice approaches to resolving this issue?
Any suggestion or reference material will be helpful.

I would suggest the following:
1. Ensure that your training set has the object you want to detect in all sizes. That way the network learns that the size of the object can vary and is less prone to overfitting (the detector could otherwise assume your object should only be big, for example).
2. Add data. Rather than applying all types of augmentations, try adding much more data. The phenomenon of detecting different objects although there is only one object leads me to believe that your network does not generalize well. Personally I would opt for at least 500 annotations per class.
The biggest step towards improvement will be achieved by means of (2).
Once you have a decent baseline, you could also experiment with augmentations.

How to interpret Random Effects Plot from mgcv

I have a few questions regarding using a random effect in a GAM. First, how do you interpret and communicate the output graph?
I have fire modeled as a random effect in this GAM because it is largely a random occurrence at my different field sites and I only noted it as a binary. It wouldn't work as a normal variable since it has too few levels and there are also relatively few sites with fire. However, it greatly improved the variance captured by the model when included, so I don't want to simply exclude it. I don't know how to interpret the output, and I am also not entirely confident that there wouldn't be another way to include it in the model other than as a random effect. Any help would be greatly appreciated!

The effect has been modelled as a random slope if you didn't code it as a factor in the data. The value on the y axis is the estimated slope; it will be a little smaller in absolute value than if you use Fire as a linear fixed effect in the model formula because it is being penalised (shrunk) towards zero.
This likely should have been fitted as a binary fixed effect; code Fire as a factor with two levels (Yes/No, or Burned/Unburned, say). Just because a variable represents something that is random over the data doesn't mean it is a suitable random effect; fire here has some average effect and the fixed effect describes that well. There's nothing stopping you from using Fire coded as a factor as a random effect via the smooth, but with only two levels the two intercepts aren't going to be estimated that precisely.
Now, if you had repeated observations on n sites and you thought the Fire effect varied across the n sites, then you could do s(Site, Fire, bs = 're') where both Site and Fire are factors, and you'll get different Fire effects for each Site. Then the plot you show would have many points on it, as it is a QQ-plot of the estimated values for the effect of Fire in each Site, hence 1 point per Site. Given the way this model is estimated, these are assumed to be distributed Gaussian with some variance that is inversely proportional to the smoothing parameter selected by gam() when fitting this random effect smoother. That's why the default plot is as it is; it's a QQ-plot comparing the observed distribution of estimated values of the random effects against the theoretical expectation.

How can I build a good approximation of an unknown distribution when only having samples from it in order to draw from it in torch?

Say I just have random samples from the distribution and no other data, e.g. a list of numbers: [1,15,30,4,etc.]. What's the best way to estimate the distribution so that I can draw more samples from it in PyTorch?
I am currently assuming that all samples come from a Normal distribution and just using the mean and std of the samples to build it and draw from it. The samples, however, could come from any distribution.
import torch
from torch.distributions import Normal
samples = torch.tensor([1., 2., 3., 4., 3., 2., 2., 1.])
Normal(samples.mean(), samples.std()).sample()

If you have enough samples (and preferably the sample dimension is higher than 1), you could model the distribution using a Variational Autoencoder or a Generative Adversarial Network (though I would stick with the first approach as it's simpler).
Basically, after a correct implementation and training you would get a deterministic decoder able to decode a hidden code you pass to it (say a vector of size 10 drawn from a normal distribution) into a value from your target distribution.
Note that it might not be reliable at all, though, and it would be even harder if your samples are 1D only.

The best way depends on what you want to achieve. If you don't know the underlying distribution, you need to make assumptions about it and then fit a suitable distribution (that you know how to sample) to your samples. You may start with something simple like a Mixture of Gaussians (several normal distributions with different weightings).
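As a rough sketch of the mixture idea, assuming PyTorch's torch.distributions module: the simplest mixture of Gaussians to build without a fitting step is a kernel density estimate, which places one equally weighted normal on each sample. The bandwidth rule below is an assumption (a common normal-reference rule of thumb), and a mixture with fewer components fitted by EM would be a different and possibly better choice.

```python
import torch
from torch.distributions import Categorical, MixtureSameFamily, Normal

samples = torch.tensor([1., 2., 3., 4., 3., 2., 2., 1.])
n = samples.numel()

# Assumption: normal-reference rule of thumb for the kernel width.
bandwidth = 1.06 * samples.std() * n ** (-0.2)

# One Gaussian component per sample, all with the same weight 1/n.
mixture = MixtureSameFamily(
    Categorical(probs=torch.full((n,), 1.0 / n)),
    Normal(samples, bandwidth),
)

torch.manual_seed(0)
new_samples = mixture.sample((1000,))  # tensor of shape (1000,)
```

Unlike a single Normal, this keeps bumps and skew that are present in the data, at the cost of having to pick a bandwidth.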
Another way is to define a discrete distribution over the values you have. You will give each value the same probability, say p(x)=1/N. When you sample from it, you simply draw a random integer from [0,N) that points to one of your samples.
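The discrete option can be sketched in a few lines of PyTorch. This is only an illustration of drawing a uniform random index into the existing samples; the variable names are made up for the example.

```python
import torch

samples = torch.tensor([1., 2., 3., 4., 3., 2., 2., 1.])

torch.manual_seed(0)
# Draw integer indices uniformly from [0, N) and look up the samples.
idx = torch.randint(0, samples.numel(), (5,))
draws = samples[idx]
```

This can only ever return values that were already observed, which is the main limitation of the approach.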

Why a CNN learns different feature maps

I understand (and please correct me if my understanding is wrong) that the primary purpose of a CNN is to reduce the number of parameters from what you would need if you were to use a fully connected NN. And CNN achieves this by extracting "features" of images.
CNN can do this because in a natural image, there are small features such as lines and elementary curves that may occur in an "invariant" fashion, and constitute the image much like elementary building blocks.
My question is this: say we create layers of feature maps, 5 of them, and we get these by sliding a window of size, say, 5x5 over an image of, say, 100x100 pixels. Initially, these feature maps are initialized as random weight matrices, and must progressively adjust the weights with gradient descent, right? But then, if we are getting these feature maps by using exactly the same sized windows, sliding in exactly the same way (sharing the same starting point and the same stride value), on exactly the same image, how can these maps learn different features of the image? Won't they all come out the same, say, a line or a curve?
Is it due to the different initial values of the weight matrices? (I.e. some weight matrices are more receptive to learning a certain particular feature than others?)
Thanks!! I wrote my 4 questions/opinions and indexed them, for the ease of addressing them separately!

8 bit audio samples to 16 bit

This is my "weekend" hobby problem.
I have some well-loved single-cycle waveforms from the ROMs of a classic synthesizer.
These are 8-bit samples (256 possible values).
Because they are only 8 bits, the noise floor is pretty high. This is due to quantization error. Quantization error is pretty weird. It messes up all frequencies a bit.
I'd like to take these cycles and make "clean" 16-bit versions of them. (Yes, I know people love the dirty versions, so I'll let the user interpolate between dirty and clean to whatever degree they like.)
It sounds impossible, right, because I've lost the low 8 bits forever, right? But this has been in the back of my head for a while, and I'm pretty sure I can do it.
Remember that these are single-cycle waveforms that just get repeated over and over for playback, so this is a special case. (Of course, the synth does all kinds of things to make the sound interesting, including envelopes, modulations, filters, cross-fading, etc.)
For each individual byte sample, what I really know is that it's one of 256 values in the 16-bit version. (Imagine the reverse process, where the 16-bit value is truncated or rounded to 8 bits.)
My evaluation function is trying to get the minimum noise floor. I should be able to judge that with one or more FFTs.
Exhaustive testing would probably take forever, so I could take a lower-resolution first pass. Or do I just randomly push randomly chosen values around (within the known values that would keep the same 8-bit version) and do the evaluation and keep the cleaner version? Or is there something faster I can do? Am I in danger of falling into local minimums when there might be some better minimums elsewhere in the search space? I've had that happen in other similar situations.
Are there any initial guesses I can make, maybe by looking at neighboring values?
Edit: Several people have pointed out that the problem is easier if I remove the requirement that the new waveform would sample to the original. That's true. In fact, if I'm just looking for cleaner sounds, the solution is trivial.

You could put your existing 8-bit sample into the high-order byte of your new 16-bit sample, and then use the low-order byte to linearly interpolate some new 16-bit datapoints between each pair of original 8-bit samples.
This would essentially connect a 16-bit straight line between each of your original 8-bit samples, using several new samples. It would sound much quieter than what you have now, which is a sudden, 8-bit jump between the two original samples.
You could also try applying some low-pass filtering.
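A minimal sketch of that interpolation, assuming the ROM samples are unsigned bytes and that the cycle wraps around (the last sample interpolates toward the first). The function name and the `factor` parameter are made up for illustration.

```python
def upsample_linear(samples_8bit, factor=4):
    """Promote unsigned 8-bit samples to 16-bit and linearly interpolate.

    Each original sample lands in the high byte; `factor - 1` new points
    are placed on the straight line to the next sample.
    """
    n = len(samples_8bit)
    out = []
    for i in range(n):
        a = samples_8bit[i] << 8              # original value in the high byte
        b = samples_8bit[(i + 1) % n] << 8    # next sample, wrapping the cycle
        for k in range(factor):
            out.append(a + (b - a) * k // factor)
    return out


cycle = [128, 130, 127, 120]
smooth = upsample_linear(cycle, factor=4)
```

Note that this multiplies the number of samples per cycle by `factor`, so playback has to step through the table faster by the same factor to keep the pitch.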

Going with the approach in your question, I would suggest looking into hill-climbing algorithms and the like.
http://en.wikipedia.org/wiki/Hill_climbing
has more information on it and the sidebox has links to other algorithms which may be more suitable.
AI is like alchemy - we never reached the final goal, but lots of good stuff came out along the way.
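As a hedged sketch of what hill climbing could look like for this problem: the cost function below (DFT energy above a cutoff harmonic) is only a stand-in for the "noise floor" measure in the question, and `cutoff`, `iterations` and the truncation assumption are all invented for the example. Only the low byte of each sample is moved, so the result still truncates to the original 8-bit data.

```python
import cmath
import math
import random


def high_band_energy(x, cutoff):
    """Hypothetical noise proxy: DFT energy in harmonics above `cutoff`."""
    n = len(x)
    total = 0.0
    for k in range(cutoff + 1, n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        total += abs(coeff) ** 2
    return total


def hill_climb(samples_8bit, cutoff=4, iterations=1000, seed=0):
    rng = random.Random(seed)
    # Start in the middle of each allowed 16-bit interval.
    current = [(s << 8) + 128 for s in samples_8bit]
    best = high_band_energy(current, cutoff)
    for _ in range(iterations):
        i = rng.randrange(len(current))
        old = current[i]
        # Only the low byte changes, so (value >> 8) stays the ROM value.
        current[i] = (samples_8bit[i] << 8) + rng.randrange(256)
        cost = high_band_energy(current, cutoff)
        if cost < best:
            best = cost          # keep the cleaner version
        else:
            current[i] = old     # undo the move
    return current, best


# Test signal: one cycle of a sine, truncated to unsigned 8 bits.
n = 32
ideal = [32768 + 30000 * math.sin(2 * math.pi * t / n) for t in range(n)]
rom = [int(v) >> 8 for v in ideal]
start_cost = high_band_energy([(s << 8) + 128 for s in rom], cutoff=4)
cleaned, final_cost = hill_climb(rom)
```

Plain hill climbing only ever accepts improvements, so it can stall in a local minimum as the question fears; random restarts or simulated annealing are the usual ways around that.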

Well, I would expect some FIR filtering (IIR if you really need to save processing cycles, but FIR can give better results without instability) to clean up the noise. You would have to play with it to get the effect you want, but the basic problem is smoothing out the sharp edges in the audio created by sampling it at 8-bit resolution. I would give a wide berth to the center frequency of the audio and do a low-pass filter, and then listen to make sure I didn't make it sound "flat" with the filter I picked.
It's tough, though. There is only so much you can do: the lower 8 bits are lost, and the best you can do is approximate them.
It's almost impossible to get rid of noise that looks like your signal. If you start tweaking stuff in your frequency band it will take out the signal of interest.
For upsampling, since you're already using an FFT, you can add zeros at the high-frequency end of the frequency-domain signal and do an inverse FFT. This completely preserves the frequency and phase information of the original signal, although it spreads the same energy over more samples. If you shift it by 8 bits to make 16-bit samples first, this won't be too much of a problem. But I usually kick it up by an integer gain factor before doing the transform.
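A small sketch of the zero-padding idea, written with a naive O(N^2) DFT so it has no dependencies (a real FFT library would be the practical choice). In a full two-sided spectrum the high frequencies sit in the middle of the array, so that is where the zeros go here; the function names are made up for the example.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def fft_upsample(x, factor):
    """Interpolate one cycle of a periodic real signal by zero-padding its spectrum."""
    n = len(x)
    X = dft(x)
    half = n // 2
    # Zeros go between the positive- and negative-frequency halves.
    padded = X[:half] + [0j] * (n * factor - n) + X[half:]
    # The gain of `factor` restores the amplitude after the longer inverse
    # transform; the input is real, so only the real part is kept.
    return [factor * v.real for v in idft(padded)]


x = [math.sin(2 * math.pi * t / 8) for t in range(8)]
up = fft_upsample(x, 2)   # 16 samples; every second one matches x
```

The multiplication by `factor` plays the role of the gain mentioned above: without it the interpolated waveform comes out at 1/factor of the original amplitude.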
Pete
Edit:
The comments are getting a little long so I'll move some to the answer.
The peaks in the FFT output are harmonic spikes caused by the quantization. I tend to think of them differently than the noise floor. You can dither, as someone mentioned, to eliminate the harmonic spikes and flatten out the noise floor, but you lose overall signal-to-noise on the flat part of your noise floor. As far as the FFT is concerned: when you interpolate using that method, it retains the same energy and spreads it over more samples, which reduces the amplitude. So before doing the inverse, give your signal more energy by multiplying by a gain factor.
Are the signals simple/complex sinusoids, or do they have hard edges (i.e. triangle waves, square waves, etc.)? I'm assuming they have continuity from cycle to cycle; is that valid? If so, you can also increase your FFT resolution to pinpoint frequencies more precisely by increasing the number of waveform cycles fed to your FFT. If you can precisely identify the frequencies used, assuming they are somewhat discrete, you may be able to completely recreate the intended signal.
The 16-bit to 8-bit via truncation requirement will produce results that do not match the original source. (Thus making finding an optimal answer more difficult.) Typically you would produce a fixed-point waveform by attempting to "get the closest match", which means rounding to the nearest number (truncating is a floor operation). That is most likely how they were originally generated. Adding 0.5 (in this case 0.5 is 128) and then truncating the output would allow you to generate more accurate results. If that's not a worry then OK, but it definitely will have a negative effect on accuracy.
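To make the difference concrete, here is a small sketch comparing the two conversions over every unsigned 16-bit value. The helper names are made up, and the clamp at 255 is an added assumption to stop the largest inputs from overflowing a byte.

```python
def truncate_to_8bit(v16):
    # Floor operation: simply drop the low byte.
    return v16 >> 8


def round_to_8bit(v16):
    # Adding 128 (0.5 of an 8-bit step) before truncating rounds to nearest.
    # Clamp so that values near 65535 do not overflow to 256.
    return min((v16 + 128) >> 8, 255)


values = range(65536)
trunc_err = [v - (truncate_to_8bit(v) << 8) for v in values]
round_err = [v - (round_to_8bit(v) << 8) for v in values]

mean_trunc = sum(trunc_err) / len(trunc_err)   # error sits on one side of zero
mean_round = sum(round_err) / len(round_err)   # error is centered near zero
```

Truncation always errs in the same direction, which is why the reconstruction interval for each byte depends on which of the two conversions originally produced the ROM data.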
UPDATED:
Why? Because the goal of sampling a signal is to reproduce the signal as closely as possible. If the conversion threshold is set poorly in the sampling, all your error is to one side of the signal and not well distributed and centered about zero. On such systems you typically try to maximize the use of the available dynamic range, particularly if you have low resolution such as an 8-bit ADC.
Band-limited versions? If they are filtered at different frequencies, I'd suspect it was to allow you to play the same sound without distortion when you went too far out from the other variation. Kind of like mipmapping in graphics.
I suspect the two are the same signal with different aliasing filters applied, this may be useful in reproducing the original. They should be the same base signal with different convolutions applied.

There might be a simple approach taking advantage of the periodicity of the waveforms. How about if you:
1. Make a 16-bit waveform where the high bytes are the waveform and the low bytes are zero - call it x[n].
2. Calculate the discrete Fourier transform of x[n] and call it X[w].
3. Make a signal Y[w] = (dBMag(X[w]) > Threshold) ? X[w] : 0, where dBMag(k) = 10*log10(real(k)^2 + imag(k)^2), and Threshold is maybe 40 dB, based on 8 bits being roughly 48 dB dynamic range, and allowing ~1.5 bits of noise.
4. Inverse transform Y[w] to get y[n], your new 16-bit waveform.
5. If y[n] doesn't sound nice, dither it with some very low level noise.
Notes:
A. This technique only works if the original waveforms are exactly periodic!
B. Step 5 might be replaced with setting the "0" values to random noise in Y[w] in step 3, you'd have to experiment a bit to see what works better.
This seems easier (to me at least) than an optimization approach. But truncated y[n] will probably not be equal to your original waveforms. I'm not sure how important that constraint is. I feel like this approach will generate waveforms that sound good.
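Steps 1 to 4 can be sketched as follows. This uses a naive DFT to stay dependency-free, and it reads the 40 dB threshold as "40 dB below the strongest non-DC bin", which is an interpretation rather than something the steps spell out. The function names and the test sine are made up for the example, and dithering (step 5) is left out.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def db_mag(c):
    # 10*log10(re^2 + im^2); the tiny floor avoids log10(0).
    return 10 * math.log10(max(c.real ** 2 + c.imag ** 2, 1e-30))


def clean_cycle(samples_8bit, drop_db=40.0):
    x = [s << 8 for s in samples_8bit]        # step 1: high byte, zero low byte
    X = dft(x)                                # step 2
    peak = max(db_mag(c) for c in X[1:])      # assumption: reference is the peak
    Y = [c if db_mag(c) > peak - drop_db else 0j for c in X]   # step 3
    return idft_real(Y)                       # step 4 (floats, not yet integers)


# One cycle of a sine, rounded to unsigned 8 bits, as stand-in ROM data.
n = 64
ideal = [32768 + 30000 * math.sin(2 * math.pi * t / n) for t in range(n)]
rom = [round(v / 256) for v in ideal]
y = clean_cycle(rom)
```

For this test signal only the DC and fundamental bins survive the threshold, so `y` ends up closer to the ideal sine than the zero-filled 16-bit version is. Real ROM waveforms with many harmonics will need the threshold tuned by ear.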
