Is Variance/Mean Ratio a good indicator for stability? - statistics

I am working with some time series data and would like to compare 4 companies. I have anonymized them for confidentiality.
My assumption is this:
The VMR (variance-to-mean ratio, also called the index of dispersion) can be used to find the most stable/stationary series.
Is this a correct approach? If not, can you suggest a more appropriate one?
Graphics
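For reference, here is a minimal sketch of how the VMR could be computed for one series with the standard library. The values are made up (the real data is anonymized) and `vmr` is only an illustrative helper name. Note two things it makes visible: the VMR ignores the time ordering of the observations, and it is not scale-free, so the same series expressed in a different unit gets a different VMR.

```python
from statistics import mean, pvariance

def vmr(series):
    """Variance-to-mean ratio (index of dispersion); assumes a nonzero mean."""
    return pvariance(series) / mean(series)

# Hypothetical values standing in for one anonymized company series.
company_a = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
# The same series expressed in a unit ten times smaller.
company_a_rescaled = [10 * x for x in company_a]

print(vmr(company_a))           # variance 4, mean 5
print(vmr(company_a_rescaled))  # variance 400, mean 50
```

If the 4 companies operate at very different scales, this scale dependence is worth keeping in mind before ranking them by VMR.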

Related

How to compare different groups with different sample size?

I am plotting student data from different schools to see the difference between the numbers of male and female students in some majors. I am using Python. I have already plotted the data for some schools and, as I expected, the male numbers are genuinely higher. Then I realized that each school has a different total number of students. Does my work make any sense when the sample sizes are different? If not, could you suggest what I should change?
Now I realize the issue. Look: suppose you have two classes, the first with 2 men and the second with 20 men, plus their marks. Both men in the first class scored 90/100, while the 20 marks in the second class range from 40 to 80. Would it be correct to say "Well, the first class did much better on the test than the second"? Of course not.
To solve this problem, look at min(sample sizes). If it looks too small, drop that program from the comparison, because you do not have enough data to say anything. Also show the total sample size in a proxy legend entry, in a text annotation, or in the title. Either way, it indicates how reliable your results are.
This question is about statistics rather than programming, but I will try to answer.
An important point I didn't get from the question: what are you doing this for? If you are asking something like "Are there more men than women in the population (here, population = everyone in the major program)?", then the individual schools are not important to you, and you can treat the samples as a single sample (but don't forget to pool them).
But you may instead be asking "Are there any differences between the schools?". In that case pooling is not correct. For this purpose I highly recommend a barh plot with stacked=True for each school, using percentages for normalization. Then the difference in sample sizes won't be a problem.
Please, when you ask a question, include some code. Three rows of data and one plot from a sample would be very helpful.
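A small sketch of the percentage normalization suggested above, using only plain Python. The school names and counts are invented for illustration, and `percent_shares` is a hypothetical helper, not something from the question.

```python
# Hypothetical counts per school: (male, female). Not taken from the question.
counts = {
    "School A": (120, 80),
    "School B": (30, 10),
    "School C": (450, 550),
}

def percent_shares(male, female):
    """Return (male %, female %) so schools of different sizes are comparable."""
    total = male + female
    return 100.0 * male / total, 100.0 * female / total

for school, (male, female) in counts.items():
    male_pct, female_pct = percent_shares(male, female)
    # Keep the sample size visible next to the percentages.
    print(f"{school}: {male_pct:.1f}% male, {female_pct:.1f}% female (n={male + female})")
```

If the shares are collected into a pandas DataFrame with one row per school, `df.plot.barh(stacked=True)` should produce the stacked horizontal bars.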

How to handle various units in a single attribute / feature using Pandas?

I have a dataset that I am cleaning, and one of the attributes (features) contains values in various units. For example, some of the values are as follows:
1 kg; 6 LB; 900 gms; 32 oz; etc.
If I apply the standard scaler directly, the result will not be meaningful, because the values are in different units and cannot be treated as is.
Please suggest how to handle such data.
I recommend converting all the values to the same unit first. For example, you can convert every value to kg, or whatever unit suits you best, and then apply the standard scaler.
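A minimal sketch of that conversion step, assuming the values are strings of the form "number unit" like the examples in the question. The `TO_KG` table and the `to_kg` helper are illustrative names, and the list of unit spellings is an assumption that would need extending for real data.

```python
import re

# Conversion factors to kilograms; extend with whatever spellings occur in the data.
TO_KG = {
    "kg": 1.0, "kgs": 1.0,
    "g": 0.001, "gm": 0.001, "gms": 0.001,
    "lb": 0.45359237, "lbs": 0.45359237,
    "oz": 0.028349523125,
}

def to_kg(text):
    """Parse strings such as '6 LB' or '900 gms' and return the weight in kg."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse weight: {text!r}")
    value, unit = float(match.group(1)), match.group(2).lower()
    if unit not in TO_KG:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * TO_KG[unit]

for raw in ["1 kg", "6 LB", "900 gms", "32 oz"]:
    print(raw, "->", round(to_kg(raw), 4), "kg")
```

With pandas, something like `df["weight_kg"] = df["weight"].map(to_kg)` would apply it to a whole column (the column names here are hypothetical), after which the standard scaler can be used on `weight_kg`.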
Thanks, all. I did some research and found that I need to convert the various units into standard units that follow international norms, i.e. SI units https://www.nist.gov/pml/weights-and-measures/metric-si/si-units , which is the same suggestion given by #sharmajee499.
I am moving ahead with this approach. It is going to require a lot of manual code, but there seems to be no short and easy way.
Please post if you have a better solution.

Efficient sampling of discrete random variable [duplicate]

I have a list of US names and their respective frequencies from the US census website. I would like to generate a random name from this list using the given probability. The data is here: US Census data
I have seen algorithms like the roulette wheel selection algorithm that are easy to implement, but I wanted to know if there was any way to generate random names in O(1). For histogram data this is easier, as you could create a hash of integers to birthdays, but I would like to do this for a continuous distribution.
If this is not possible, are there any Python modules that take in probability distributions and generate random values based on those distributions?
There is an O(1)-time method. See this detailed description of Vose's "alias" method. Unfortunately, it suffers from a high initialization cost. For comparative timings of simpler methods, see Eli Bendersky's blog post. More timings can be found in this entry from the Python issue tracker.
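For illustration, here is a compact sketch of the alias method in plain Python. It is one common way to write Vose's construction, not a copy of the linked description. The function names and the example weights are made up. Building the table is O(n), and each draw afterwards needs one random index and one comparison.

```python
import random

def build_alias_table(weights):
    """Build the probability and alias tables for the alias method."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]   # average value becomes 1.0
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]          # chance of keeping column s itself
        alias[s] = l                 # otherwise fall through to l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:          # leftovers are (numerically) exactly 1.0
        prob[i] = 1.0
        alias[i] = i
    return prob, alias

def sample_index(prob, alias, rng=random):
    """Draw one index in O(1): pick a column, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

names = ["NAME_A", "NAME_B", "NAME_C"]   # hypothetical names
weights = [1, 1, 2]                      # hypothetical relative frequencies
prob, alias = build_alias_table(weights)
print(names[sample_index(prob, alias)])
```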
These days it's practical to enumerate the entire US population (~317 million) if you really need O(1) lookup. Just pick a number up to 317 million and get the name from there. (317000000*4 bytes = 1.268GB)
I think there are lots of O(log n) ways. Is there a particular reason you need O(1)? (The O(log n) methods will use a lot less memory.)
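One such O(log n) approach, sketched with the standard library: precompute the running totals of the weights once, then binary-search them for each draw. The names and weights below are hypothetical, and `build_cumulative` and `pick_index` are illustrative helper names.

```python
import bisect
import itertools
import random

def build_cumulative(weights):
    """Running totals of the weights; built once in O(n)."""
    return list(itertools.accumulate(weights))

def pick_index(cumulative, u):
    """Map u in [0, 1) to an index by binary search over the running totals."""
    index = bisect.bisect_right(cumulative, u * cumulative[-1])
    return min(index, len(cumulative) - 1)   # guard against rounding at the top end

names = ["NAME_A", "NAME_B", "NAME_C"]   # hypothetical names
weights = [1, 1, 2]                      # hypothetical relative frequencies
cumulative = build_cumulative(weights)
print(names[pick_index(cumulative, random.random())])
```

For the "are there any Python modules" part of the question, the standard library's `random.choices(names, weights=weights, k=1)` accepts weights directly.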

Divide Data Into Quartiles

I have a dataset that contains the admission rates of all the providers that we work with. I need to divide that data into quartiles, so that each provider can see where their rate lies in comparison to other providers. The rate ranges from 7% to 89%. Can anyone suggest how to do this? I am not sure if this is the right place to ask this question, but if somebody can help me with it, I would really appreciate it.
The other concern is that if a provider's numbers are really small, e.g. 2/4 = 50%, the provider might fall into a worse quartile, but that doesn't mean the provider's performance is bad, because the numbers are so small. I hope this makes sense. Please let me know if I can clarify it further.
There are ways to obtain quantiles without doing a complete sort but unless you've got huge amounts of data there is no point in implementing those algorithms if you haven't already got them available. Presuming you have a sort() function available, all you need to do is:
Given n data points.
Sort the data points.
Find the n/4, n/2 and 3*n/4th points in the sorted data, which are your quartiles.
As you say, if n is less than some number (that you'll have to decide for yourself) you may want to say that the quartile result is "not applicable" or some such.
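The steps above could be sketched in Python as follows. The rates are invented. `MIN_CASES` is a hypothetical threshold that you would have to choose yourself, applied here to the number of cases behind each provider's rate, which is one possible reading of "n" in the previous sentence. The helper names are illustrative.

```python
import bisect

MIN_CASES = 20  # hypothetical cut-off; the choice is left to you

def quartile_cut_points(rates):
    """Return the values at positions n/4, n/2 and 3n/4 of the sorted rates."""
    ordered = sorted(rates)
    n = len(ordered)
    return [ordered[n // 4], ordered[n // 2], ordered[(3 * n) // 4]]

def quartile_of(rate, cuts, cases):
    """Quartile 1-4 for a rate, or None when the provider has too few cases."""
    if cases < MIN_CASES:
        return None  # report as "not applicable"
    return bisect.bisect_right(cuts, rate) + 1

# Hypothetical admission rates in percent.
rates = [7, 12, 20, 35, 41, 50, 66, 89]
cuts = quartile_cut_points(rates)
print(cuts)
print(quartile_of(50, cuts, cases=4))    # too few cases to rank
print(quartile_of(50, cuts, cases=200))
```

If interpolated cut points are preferred, `statistics.quantiles(rates, n=4)` in the standard library (Python 3.8+) is an alternative to indexing into the sorted list.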
First concern: For small n, do not use quartiles. Whether n is small is arbitrary.

Comparing audio recordings

I have 5 recorded wav files. I want to compare the new incoming recordings with these files and determine which one it resembles most.
In the final product I need to implement it in C++ on Linux, but now I am experimenting in Matlab. I can see FFT plots very easily. But I don't know how to compare them.
How can I compute the similarity of two FFT plots?
Edit: There is only speech in the recordings. Actually, I am trying to identify the response of answering machines of a few telecom companies. It's enough to distinguish two messages "this person can not be reached at the moment" and "this number is not used anymore"
This depends a lot on your definition of "resembles most". Depending on your use case, this can mean many things. If you just want to compare the bare spectra of the whole files, you can simply correlate the values returned by the two FFTs.
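A rough sketch of "correlating the two spectra", written in Python for brevity rather than Matlab or C++. It uses a naive O(n^2) DFT so that it has no dependencies (a real implementation would use an FFT library), and it assumes both signals have the same length. The test tones and helper names are made up.

```python
import cmath
import math

def magnitude_spectrum(signal):
    """Magnitudes of DFT bins 0..n/2 (naive O(n^2) transform, demo only)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def tone(cycles, n=64):
    """Synthetic sine with a whole number of cycles in n samples."""
    return [math.sin(2 * math.pi * cycles * t / n) for t in range(n)]

reference = magnitude_spectrum(tone(5))
same = magnitude_spectrum(tone(5))
other = magnitude_spectrum(tone(12))
print(pearson(reference, same))    # close to 1 for identical spectra
print(pearson(reference, other))   # much lower for a different tone
```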
However, spectra tend to change a lot when the files are warped in time. To account for this, you need to do a windowed FFT and compare the spectra window by window. That comparison then defines the distance function you can use in a dynamic time warping (DTW) algorithm.
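To make the DTW step concrete, here is a minimal textbook-style sketch in Python. It takes two sequences of per-window feature vectors (for example, windowed spectra) and a frame distance function. The frames below are invented placeholders rather than real spectra, and no windowing or path constraints are included.

```python
import math

def dtw_distance(frames_a, frames_b, dist):
    """Dynamic time warping cost between two sequences of frames."""
    n, m = len(frames_a), len(frames_b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(frames_a[i - 1], frames_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Hypothetical per-window features; `slow` repeats one frame of `fast`.
fast = [[1.0, 0.0], [2.0, 0.0], [3.0, 1.0]]
slow = [[1.0, 0.0], [2.0, 0.0], [2.0, 0.0], [3.0, 1.0]]
print(dtw_distance(fast, slow, euclidean))  # the repeated frame costs nothing
```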
If you need perceptual resemblance, an FFT probably does not give you what you need. MFCCs of the recordings are most likely much better suited to this problem. Again, you might need to calculate windowed MFCCs instead of MFCCs of the whole recording.
If you have musical recordings, you again need completely different approaches. There is a blog post that describes how Shazam works, which you should be able to find on Google. Or, if you want real musical similarity, have a look at this book.
EDIT:
The best solution for the problem specified above would be the one described here (the "Shazam algorithm" mentioned above). This is, however, a bit complicated to implement, and an easier solution might do well enough.
If you know that there are only 5 different possible incoming files, I would suggest first trying something as simple as the Euclidean distance between the two signals (in the time or Fourier domain). It is likely to give you good results.
Edit: With different possible starts, try computing the cross-correlation between the incoming signal and each stored file, and see which file gives the highest peak.
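A sketch of the correlation-peak idea for recordings that may start at different offsets, in plain Python. It correlates the incoming signal with each stored signal at every lag (usually called cross-correlation), normalizes by the signal energies, and keeps the stored signal with the highest peak. The signals and names are toy placeholders. For real audio this brute-force loop would be too slow, and an FFT-based correlation would be the usual choice.

```python
import math

def xcorr_peak(a, b):
    """Highest normalized cross-correlation of a and b over all lags."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    best = float("-inf")
    for lag in range(-(len(b) - 1), len(a)):
        lo, hi = max(0, lag), min(len(a), len(b) + lag)
        score = sum(a[i] * b[i - lag] for i in range(lo, hi)) / norm
        best = max(best, score)
    return best

def best_match(incoming, references):
    """Name of the reference whose correlation peak with `incoming` is highest."""
    return max(references, key=lambda name: xcorr_peak(incoming, references[name]))

# Hypothetical stored messages (toy signals, not real audio).
references = {
    "not_reachable": [0, 0, 1, 2, 1, 0, 0, 0],
    "not_in_use": [1, -1, 1, -1, 1, -1, 1, -1],
}
incoming = [0, 0, 0, 0, 1, 2, 1, 0]   # "not_reachable" delayed by two samples
print(best_match(incoming, references))
```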
I suggest you compute a simple sound parameter such as the fundamental frequency. There are several methods of obtaining this value. I tried autocorrelation and the cepstrum, and both worked fine for voice signals. With such a function working, you can analyze the signal over time and compare two signals (base: the one you compare against; in: the one you would like to match) by their frequency on a given interval. Comparing several intervals on this criterion can tell you which base sample matches best.
Of course, everything depends on what you mean by "resembles most". For the comparison function you can introduce other parameters such as volume, noise, clicks, pitches...
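As an illustration of the autocorrelation variant, here is a minimal Python sketch that estimates the fundamental frequency of one interval. The search range of 60 to 400 Hz is an assumption for speech, the test signal is a synthetic 100 Hz tone, and `estimate_f0` is an illustrative name. Real recordings would need framing and a voiced/unvoiced check on top of this.

```python
import math

def estimate_f0(signal, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak."""
    n = len(signal)
    lag_lo = max(1, int(sample_rate / f_max))        # shortest period considered
    lag_hi = min(int(sample_rate / f_min), n - 1)    # longest period considered
    best_lag, best_score = lag_lo, float("-inf")
    for lag in range(lag_lo, lag_hi + 1):
        overlap = n - lag
        # Autocorrelation at this lag, averaged over the overlapping samples.
        score = sum(signal[i] * signal[i + lag] for i in range(overlap)) / overlap
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

sample_rate = 8000
tone = [math.sin(2 * math.pi * 100 * t / sample_rate) for t in range(800)]
print(estimate_f0(tone, sample_rate))
```

Running this per interval on the base and incoming signals gives two frequency tracks that can then be compared interval by interval.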
