Creating a "signature" based on data?

Signature: "a distinctive pattern, product, or characteristic by which someone or something can be identified."
I'm creating a neural net that takes songs as inputs and outputs 500 values: similarities with the top artists. Like:
1. 25% like Muse
2. 23% like the Arctic Monkeys
3. 20% like Imagine Dragons, etc
500. 0% like Beethoven
I want to create some type of "signature" from this output, with which I could, hopefully, do some interesting things (programmatically).
Anyone have any ideas?
I should also say, I already have plans for this output. I want to use approximate nearest neighbors to recommend (submitted) songs based on this output. But I want to do more interesting things.


Do convolutional networks to predict RxNorm codes from ICD9 codes make sense?

I have been asked by the company to specifically use a convolutional neural network to predict the type of medication (RxNorm code) prescribed based on the diagnoses given (ICD9 codes). I will be given a million prescriptions written by doctors. Each prescription is independent of the next one.
So an example would be: 110, 670, 890, BB2344
The first 3 items are ICD9 codes, the last one is the output, the RxNorm code. There are a million of these.
Honestly the task seems nonsensical to me. I do not have any idea regarding how to structure the inputs.
There is no inherent order to the diagnoses and no timestamps.
One diagnosis may make another diagnosis more likely; but there are plenty of examples where they are just independent.
The ICD9 coding system has a hierarchical structure, such that codes 110 and 120 (both infections) are more closely related than, say, codes 110 and 890 (an infection and a wound).
Basically, what should my input "image" look like? Or does a CNN not make sense at all for this problem?
CNNs require spatial (or temporal) correlation in the inputs. There is no such thing here, so the short answer is no, it makes no sense. In general, given how simple the data is, I would actually expect a basic linear model (on one-hot encoded data) or even basic rule induction to work well.
The only possible use of "CNN-like" structures is to exploit the graph nature of the codes through graph CNNs, since the hierarchical structure in the input can be considered a "spatial" correlation.
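As a rough illustration of the one-hot idea above, here is a minimal Python sketch that turns an unordered set of diagnosis codes into a fixed-length vector a linear model could consume. The function names and the second record (including the code "AA0001") are made up for illustration; only the first record comes from the question.

```python
def build_vocab(records):
    """Map every distinct diagnosis code to a column index (sorted for stability)."""
    codes = sorted({code for diagnoses, _ in records for code in diagnoses})
    return {code: index for index, code in enumerate(codes)}


def multi_hot(diagnoses, vocab):
    """Order-free encoding: 1.0 in the column of each code that is present."""
    row = [0.0] * len(vocab)
    for code in diagnoses:
        if code in vocab:  # codes never seen in training are ignored
            row[vocab[code]] = 1.0
    return row


# (diagnosis codes, RxNorm code) pairs; the second record is hypothetical.
records = [
    (["110", "670", "890"], "BB2344"),
    (["110", "120"], "AA0001"),
]
vocab = build_vocab(records)
features = [multi_hot(diagnoses, vocab) for diagnoses, _ in records]
labels = [rx for _, rx in records]
```

Because the encoding ignores order, it matches the fact that the diagnoses have no inherent sequence. The resulting rows can be fed to any multiclass linear classifier.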

Obtaining the Standard Error of Weighted Data in SPSS

I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data is weighted, because each of the people who was surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS seems to think that the young man's database entries each represent 28000 measurements when they actually just represent one, and this makes SPSS think we have much more data than we actually do. As a result SPSS is giving very very low standard error estimates and very very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in Statistics is treated as a frequency weight.
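To make the rescaling step concrete, here is a small Python sketch (not SPSS syntax) of dividing each weight by the mean weight so that the weights sum to the actual sample size. The function names and numbers are illustrative assumptions. It only shows that the weighted mean is unchanged by the rescaling; it does not compute design-based standard errors, which is what the Complex Samples procedures are for.

```python
def rescale_weights(weights):
    """Divide each weight by the mean weight so the weights sum to len(weights)."""
    mean_weight = sum(weights) / len(weights)
    return [w / mean_weight for w in weights]


def weighted_mean(values, weights):
    """Ordinary weighted mean; unaffected by multiplying all weights by a constant."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical survey: three respondents standing in for different population counts.
values = [170.0, 182.0, 165.0]
population_weights = [28000.0, 14000.0, 42000.0]
scaled = rescale_weights(population_weights)
```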

What are the efficient and accurate algorithms to exclude outliers from a set of data?

I have a set of 200 data rows (a small data set). I want to carry out some statistical analysis, but before that I want to exclude outliers.
What are the potential algorithms for the purpose? Accuracy is a matter of concern.
I am very new to statistics, so I need help with very basic algorithms.
Overall, the thing that makes a question like this hard is that there is no rigorous definition of an outlier. I would actually recommend against using a certain number of standard deviations as the cutoff for the following reasons:
A few outliers can have a huge impact on your estimate of standard deviation, as standard deviation is not a robust statistic.
The interpretation of standard deviation depends hugely on the distribution of your data. If your data is normally distributed then 3 standard deviations is a lot, but if it's, for example, log-normally distributed, then 3 standard deviations is not a lot.
There are a few good ways to proceed:
Keep all the data, and just use robust statistics (median instead of mean, Wilcoxon test instead of T-test, etc.). Probably good if your dataset is large.
Trim or Winsorize your data. Trimming means removing the top and bottom x%. Winsorizing means setting the top and bottom x% to the xth and 1-xth percentile value respectively.
If you have a small dataset, you could just plot your data and examine it manually for implausible values.
If your data looks reasonably close to normally distributed (no heavy tails and roughly symmetric), then use the median absolute deviation instead of the standard deviation as your test statistic and filter to 3 or 4 median absolute deviations away from the median.
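The trimming, Winsorizing, and median-absolute-deviation options above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: percentiles are approximated by order statistics of the sorted data, the MAD is used unscaled (no 1.4826 normal-consistency factor), and the function names are my own.

```python
import statistics


def trim(data, fraction):
    """Drop the lowest and highest `fraction` of the sorted values."""
    ordered = sorted(data)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k]


def winsorize(data, fraction):
    """Clamp the lowest and highest `fraction` of values to the nearest kept value."""
    ordered = sorted(data)
    k = int(len(ordered) * fraction)
    low, high = ordered[k], ordered[len(ordered) - 1 - k]
    return [min(max(x, low), high) for x in data]


def mad_filter(data, k=3.0):
    """Keep values within k median absolute deviations of the median."""
    median = statistics.median(data)
    mad = statistics.median([abs(x - median) for x in data])
    if mad == 0:  # more than half the values are identical; nothing sensible to cut
        return list(data)
    return [x for x in data if abs(x - median) <= k * mad]


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
```

With 200 rows it is still worth plotting the data first and checking that whatever gets removed really is implausible.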
Start by plotting the leverage of the outliers and then go for some good ol' interocular trauma (aka look at the scatterplot).
Lots of statistical packages have outlier/residual diagnostics, but I prefer Cook's D. You can calculate it by hand if you'd like (the original link to the formula is dead).
You may have heard the expression 'six sigma'.
This refers to plus and minus 3 sigma (i.e., standard deviations) around the mean.
Anything outside the 'six sigma' range could be treated as an outlier.
On reflection, I think 'six sigma' is too wide.
This article describes how it amounts to "3.4 defective parts per million opportunities."
It seems like a pretty stringent requirement for certification purposes. Only you can decide if it suits you.
Depending on your data and its meaning, you might want to look into RANSAC (random sample consensus). This is widely used in computer vision, and generally gives excellent results when trying to fit data with lots of outliers to a model.
And it's very simple to conceptualize and explain. On the other hand, it's non deterministic, which may cause problems depending on the application.
Compute the standard deviation on the set, and exclude everything outside of the first, second or third standard deviation.
Here is how I would go about it in SQL Server.
The query below will get the average weight from a fictional ScaleData table holding a single weigh-in for each person, while not permitting those who are overly fat or thin to throw off the more realistic average:
select w.Gender, Avg(w.Weight) as AvgWeight
from ScaleData w
join (
    -- per-gender mean and a cutoff of two population standard deviations
    select d.Gender,
           Avg(d.Weight) as AvgWeight,
           2*STDEVP(d.Weight) as StdDeviation
    from ScaleData d
    group by d.Gender
) d
  on w.Gender = d.Gender
 -- keep only the weigh-ins that fall within the cutoff around the mean
 and w.Weight between d.AvgWeight-d.StdDeviation
                  and d.AvgWeight+d.StdDeviation
group by w.Gender
There may be a better way to go about this, but it works and works well. If you have come across another more efficient solution, I'd love to hear about it.
NOTE: the above excludes roughly the most extreme 5% of values (about 2.5% in each tail, if the weights are roughly normally distributed) for the purpose of the average. You can adjust how many outliers are removed by changing the 2* in 2*STDEVP.
If you just want to analyse the data, say to compute the correlation with another variable, it's OK to exclude outliers. But if you want to model or predict, it is not always best to exclude them straightaway.
Try to treat them with methods such as capping or, if you suspect the outliers contain information or a pattern, replace them with missing values and model/predict them. I have written some examples of how you can go about this here using R.

Pitch recognition of musical notes on a smart phone, pt. 2

As a follow-up to my previous question: suppose I want my smartphone application to detect a certain musical note, and I only need to know whether the incoming sound is that musical note or not, with a certain amount of fuzziness to allow the note to be off-key by x cents.
Given that, is there a method that is superior to the others for speed and accuracy? That is, knowing that the note you are looking for is, say, a C#3, how best to tell if that note is present or not? I'm assuming that looking for a single note would be easier than separating out all waveforms and then looking at the results for the fundamental frequency.
In the responses to my original question, one respondent suggested that autocorrelation might work well if you know that the notes are within a certain range. I wonder if autocorrelation would then work even better, if you only have to check for the presence or absence of a certain note (+/- x cents).
Those methods being:
Kiss FFT
Discrete Wavelet Transform
zero crossing analysis
octave-spaced filters
Any thoughts would be appreciated.
As you describe it, you just need to determine if a particular pitch is present. A very simple (fast) detector would just record the equivalent of one period of the waveform, then record another period and correlate them, like an oversimplified (single-lag) autocorrelation. If there's a high match, you know the waveform being recorded is repeating at around the same period, or a harmonic of it.
For instance, to detect 1 kHz, record 1 ms of audio (48 samples at 48 kHz), then record another 1 ms, and compare them (correlate = multiply corresponding samples and sum). If they line up (correlation above some threshold), then you're listening to 1 kHz, 2 kHz, 3 kHz, or some other multiple. Doing several periods would give you more confidence in the match.
A true autocorrelation would tell you which harmonic, specifically, if that's important to you.
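The single-lag detector described above could look roughly like this in Python. It is a sketch, not a tuned implementation: the normalization to the range -1..1, the 0.9 threshold, and all names are my own assumptions, and, as noted, it also fires on harmonics of the target frequency.

```python
import math


def period_correlation(samples, period):
    """Normalized correlation between one period of audio and the next one."""
    first = samples[:period]
    second = samples[period:2 * period]
    dot = sum(a * b for a, b in zip(first, second))
    norm = math.sqrt(sum(a * a for a in first) * sum(b * b for b in second))
    return dot / norm if norm else 0.0


def has_pitch(samples, sample_rate, freq, threshold=0.9):
    """True if the signal repeats with the period of `freq` (or of a harmonic)."""
    period = round(sample_rate / freq)
    if len(samples) < 2 * period:
        raise ValueError("need at least two periods of audio")
    return period_correlation(samples, period) >= threshold


def sine(freq, sample_rate, count):
    """Test-signal helper: `count` samples of a unit sine wave."""
    return [math.sin(2 * math.pi * freq * n / sample_rate) for n in range(count)]
```

For example, `has_pitch(sine(1000, 48000, 96), 48000, 1000)` compares two 48-sample windows, exactly as in the 1 kHz example. The threshold sets how much detuning is tolerated and would need calibrating against the desired number of cents.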

Downsampling and applying a lowpass filter to digital audio

I've got a 44 kHz audio stream from a CD, represented as an array of 16-bit PCM samples. I'd like to cut it down to an 11 kHz stream. How do I do that? From my days of engineering class many years ago, I know that the stream won't be able to describe anything over 5500 Hz accurately anymore, so I assume I want to cut everything above that out too. Any ideas? Thanks.
Update: There is some code on this page that converts from 48 kHz to 8 kHz using a simple algorithm and a coefficient array that looks like { 1, 4, 12, 12, 4, 1 }. I think that is what I need, but I need it for a factor of 4x rather than 6x. Any idea how those constants are calculated? Also, I end up converting the 16-bit samples to floats anyway, so I can do the downsampling with floats rather than shorts, if that helps the quality at all.
Read up on FIR and IIR filters. These are the filters that use a coefficient array.
If you do a Google search for "FIR or IIR filter designer" you will find lots of software and online applets that do the hard job (getting the coefficients) for you.
This page lets you enter the parameters of your filter and will spit out ready-to-use C code...
You're right that you need to apply lowpass filtering to your signal. Any signal over 5500 Hz will be present in your downsampled signal but 'aliased' as another frequency, so you'll have to remove those frequencies before downsampling.
It's a good idea to do the filtering with floats. There are fixed-point filter algorithms too, but those generally involve quality tradeoffs. If you've got floats then use them!
Using DFTs for filtering is generally overkill, and it makes things more complicated because DFTs are not a continuous process but work on buffers.
Digital filters generally come in two flavors: FIR and IIR. They're generally the same idea, but IIR filters use feedback loops to achieve a steeper response with far fewer coefficients. This might be a good idea for downsampling because you need a very steep filter slope there.
Downsampling is sort of a special case. Because you're going to throw away 3 out of 4 samples there's no need to calculate them. There is a special class of filters for this called polyphase filters.
Try googling for polyphase IIR or polyphase FIR for more information.
Notice (in addition to the other comments) that the simple, intuitive approach, "downsample by a factor of 4 by replacing each group of 4 consecutive samples with its average value", is not optimal but is nevertheless not wrong, neither practically nor conceptually, because the averaging amounts precisely to a low-pass filter (a rectangular window, which corresponds to a sinc in frequency). What would be conceptually wrong is to downsample by simply taking one out of every 4 samples: that would definitely introduce aliasing.
By the way: practically any software that does some resampling (audio, image or whatever; example for the audio case: sox) takes this into account, and frequently lets you choose the underlying low-pass filter.
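The block-averaging approach described above is short enough to sketch directly. This is a minimal Python illustration (the function name is my own); any leftover samples that do not fill a whole group are dropped.

```python
def decimate_by_averaging(samples, factor=4):
    """Replace each group of `factor` consecutive samples with its average."""
    usable = len(samples) - len(samples) % factor  # ignore an incomplete last group
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]


pcm = [1, 3, 5, 7, 2, 2, 2, 2, 9]
downsampled = decimate_by_averaging(pcm, 4)
```

The averaging acts as a crude rectangular-window low-pass filter, so some aliasing remains. A better kernel trades more arithmetic for a cleaner result.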
You need to apply a lowpass filter before you downsample the signal to avoid "aliasing". The cutoff frequency of the lowpass filter should be less than the Nyquist frequency of the downsampled signal, which is half the new sample frequency.
The "best" solution possible is indeed a DFT, discarding the top 3/4 of the frequencies, and performing an inverse DFT, with the domain restricted to the bottom 1/4th. Discarding the top 3/4ths is a low-pass filter in this case. Padding to a power-of-2 number of samples will probably give you a speed benefit. Be aware of how your FFT package stores samples though. If it's a complex FFT (which is much easier to analyze, and generally has nicer properties), the frequencies will either go from -22 kHz to 22 kHz, or from 0 to 44 kHz. In the first case, you want the middle 1/4th. In the latter, the outermost 1/4th.
You can do an adequate job by averaging sample values together. The naïve way of grabbing samples four by four and doing an equal weighted average works, but isn't too great. Instead you'll want to use a "kernel" function that averages them together in a non-intuitive way.
Mathwise, discarding everything outside the low-frequency band is multiplication by a box function in frequency space. The (inverse) Fourier transform turns pointwise multiplication into a convolution of the (inverse) Fourier transforms of the functions, and vice versa. So, if we want to work in the time domain, we need to perform a convolution with the (inverse) Fourier transform of the box function. This turns out to be proportional to the "sinc" function sin(at)/(at), where a is the width of the box in frequency space. So at every 4th location (since you're downsampling by a factor of 4) you can add up the points near it, each multiplied by sin(a dt)/(a dt), where dt is the distance in time to that location. How nearby? Well, that depends on how good you want it to sound. It's common to ignore everything outside the first zero, for instance, or just take the number of points to be the ratio by which you're downsampling.
Finally there's the piss-poor (but fast) way of just discarding the majority of the samples, keeping just the zeroth, the fourth, and so on.
Honestly, if it fits in memory, I'd recommend just going the DFT route. If it doesn't, use one of the software filter packages that others have recommended to construct the filter for you.
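For the time-domain "kernel" route described above, a truncated sinc can be generated in a few lines of Python. This is a sketch under assumptions of my own: the cutoff is placed at the new Nyquist frequency, the kernel is cut off just inside the first zero on each side, no window function is applied, and the taps are normalized to sum to 1 so that constant signals keep their level.

```python
import math


def sinc(x):
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def sinc_lowpass_taps(factor=4):
    """Truncated-sinc low-pass kernel for downsampling by `factor`.

    The sinc has its first zeros at +/- factor samples, so the kernel
    covers offsets -(factor - 1) .. (factor - 1).
    """
    half_width = factor - 1
    taps = [sinc(n / factor) for n in range(-half_width, half_width + 1)]
    total = sum(taps)
    return [t / total for t in taps]  # unity gain at DC


taps = sinc_lowpass_taps(4)
```

Each output sample is then the sum of the input samples around every 4th position, weighted by these taps. A longer or windowed kernel would give a steeper cutoff at the cost of more multiplications.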
The process you're after is called "decimation".
There are 2 steps:
Applying a low-pass filter to the data (in your case, an LPF with cutoff at Pi / 4).
Downsampling (in your case, taking 1 out of every 4 samples).
There are many methods to design and apply the Low Pass Filter.
You may start here:
You could make use of libsamplerate to do the heavy lifting. Libsamplerate is a C API, and takes care of calculating the filter coefficients. You can select from different quality filters so that you can trade off quality for speed.
If you would prefer not to write any code, you could just use Audacity to do the sample rate conversion. It offers a powerful GUI, and makes use of libsamplerate for its sample rate conversion.
I would try applying a DFT, chopping 3/4 of the result and applying an inverse DFT. I can't tell if it will sound good without actually trying, though.
I recently came across BruteFIR which may already do some of what you're interested in?
You have to apply a low-pass filter (removing frequencies above 5500 Hz) and then apply decimation (keep every Nth sample, every 4th in your case).
For decimation, FIR rather than IIR filters are usually employed, because they don't depend on previous outputs and therefore you don't have to calculate anything for discarded samples. IIRs generally depend on both inputs and outputs, so, unless a specific type of IIR is used, you'd have to calculate every output sample before discarding 3/4 of them.
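The point about FIR filters not needing the discarded outputs can be shown with a short Python sketch. The function name and the choice to produce only outputs for which all taps overlap the input (no padding at the edges) are assumptions for illustration; the taps can come from any FIR design tool.

```python
def fir_decimate(samples, taps, factor=4):
    """Apply an FIR filter but evaluate only every `factor`-th output sample.

    Each output depends on inputs only, never on earlier outputs, so the
    skipped outputs are simply never computed.
    """
    n_taps = len(taps)
    output = []
    for start in range(0, len(samples) - n_taps + 1, factor):
        acc = 0.0
        for k, tap in enumerate(taps):
            # taps[0] multiplies the newest sample in the window (convolution order)
            acc += tap * samples[start + n_taps - 1 - k]
        output.append(acc)
    return output


# A 4-tap moving average as a stand-in for a properly designed low-pass filter.
averaged = fir_decimate([1, 3, 5, 7, 2, 2, 2, 2, 9], [0.25] * 4, factor=4)
```

Compared with filtering every sample and then throwing away 3 out of 4, this does a quarter of the multiply-adds for the same result.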
Just googled an intro-level article on the subject:
