Divide Data Into Quartiles - statistics

I have a dataset that contains admissions rates of all providers that we work with. I need to divide that data into quartiles, so that each provider can see where their rate lies in comparison to other providers. The rate ranges from 7% to 89%. Can anyone suggest how to do this? I am not sure if this is the right place to ask, but if somebody can help me with this, I would really appreciate it.
The other concern is that if a provider's numbers are really small, e.g. 2/4 = 50%, the provider might fall into a worse quartile, but that doesn't mean the provider's performance is bad, because the numbers are so small. I hope this makes sense. Please let me know if I can clarify further.

There are ways to obtain quantiles without doing a complete sort, but unless you've got huge amounts of data there is no point in implementing those algorithms if you haven't already got them available. Presuming you have a sort() function available, all you need to do is:
Given n data points.
Sort the data points.
Find the points at positions n/4, n/2 and 3n/4 in the sorted data; these are your first quartile, median and third quartile.
As you say, if n is less than some number (that you'll have to decide for yourself) you may want to say that the quartile result is "not applicable" or some such.
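For concreteness, here is a minimal Python sketch of those steps with made-up rates; numpy's percentile function handles the sorting and interpolation:

import numpy as np

# hypothetical admission rates (%), one per provider
rates = np.array([7, 12, 25, 33, 41, 48, 52, 60, 71, 85, 89])

q1, q2, q3 = np.percentile(rates, [25, 50, 75])
print(f"Q1={q1}, median={q2}, Q3={q3}")

# bucket each provider into quartile 1 (lowest rates) through 4 (highest)
buckets = np.searchsorted([q1, q2, q3], rates, side="right") + 1
print(buckets)

Note that different tools use slightly different interpolation rules for quartiles, so the cut points can differ a little between software.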

Regarding the small-numbers concern: for small n, do not use quartiles at all. The cutoff for what counts as a small n is arbitrary and you will have to choose it yourself.

Related

Is Variance/Mean Ratio a good indicator for stability?

I am dealing with some time series data and I would like to compare 4 companies. I censored the names due to confidentiality.
My assumption is this:
the VMR (variance-to-mean ratio, or index of dispersion) can be used to find the most stable/stationary series.
Is this a correct approach? If not, do you have any suggestions for a better one?
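For reference, the index of dispersion is simply the variance divided by the mean, so it is one line in pandas. This sketch only shows the computation on made-up columns, not whether VMR is the right criterion for stability:

import pandas as pd

# hypothetical series for four anonymized companies
df = pd.DataFrame({
    "company_A": [100, 102, 98, 101, 99, 103],
    "company_B": [100, 140, 60, 120, 80, 150],
    "company_C": [50, 52, 49, 51, 50, 53],
    "company_D": [50, 90, 20, 70, 30, 95],
})

vmr = df.var() / df.mean()   # variance-to-mean ratio per company
print(vmr.sort_values())     # lower = less dispersed relative to its mean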

ANOVA test on time series data

In the Analytics Vidhya post linked below, an ANOVA test is performed on COVID data to check whether the difference in positive cases across denser regions is statistically significant.
I believe an ANOVA test can't be performed on this COVID time series data, at least not in the way it has been done in this post.
Sample data has been drawn randomly from the different groups (denser1, denser2, ..., denser4). The data is a time series, so the positive-case counts in a random sample from each group will likely come from different points in time.
It might be that denser1 has random data from early in the pandemic while another region has random data from a different point in time. If that is the case, the F-statistic will almost certainly be high.
Can anyone explain, or share a different opinion?
https://www.analyticsvidhya.com/blog/2020/06/introduction-anova-statistics-data-science-covid-python/
ANOVA should not be applied to time-series data, as the independence assumption is violated. The issue with independence is that days tend to correlate very highly. For example, if you know that today you have 1400 positive cases, you would expect tomorrow to have a similar number of positive cases, regardless of any underlying trends.
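As a quick illustration of that dependence, here is a toy sketch (synthetic data, not the post's dataset) showing how high the day-to-day correlation of a case-count series tends to be:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# toy daily case counts: a smooth epidemic-like wave plus noise
days = np.arange(120)
cases = 1000 * np.exp(-((days - 60) / 25) ** 2) + rng.normal(0, 20, days.size)

print(pd.Series(cases).autocorr(lag=1))  # close to 1: today strongly predicts tomorrow,
                                         # which violates ANOVA's independence assumption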
It sounds like you're trying to determine the causal effect of different treatments (i.e. mask mandates or other restrictions) on positive cases. The best way to infer causality is usually to perform A/B testing, but obviously in this case it would not be reasonable to give different populations different treatments. One method that is good for going back and retroactively inferring causality is called "synthetic control".
https://economics.mit.edu/files/17847
Linked above is a basic paper on the methodology. The hard part of this analysis will be constructing synthetic counterfactuals, or "controls", to test your actual population against.
If this is not what you're looking for, please reply with a clarifying question, but I think this should be an appropriate method that is well-suited to studying time-series data.
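To give a flavor of what building a synthetic control involves, here is a bare-bones sketch with made-up data: it finds non-negative donor weights, summing to one, that best reproduce the treated unit's pre-treatment trajectory. The full method in the linked paper also matches on covariates, so treat this only as an illustration.

import numpy as np
from scipy.optimize import minimize

def synthetic_control_weights(treated_pre, donors_pre):
    """Weights over donor units (non-negative, summing to 1) that minimize the
    squared pre-treatment gap between the treated unit and its synthetic copy."""
    n = donors_pre.shape[1]
    loss = lambda w: np.sum((treated_pre - donors_pre @ w) ** 2)
    res = minimize(loss, np.full(n, 1.0 / n), method="SLSQP",
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x

# toy data: 30 pre-treatment days, 5 untreated donor regions
rng = np.random.default_rng(0)
donors_pre = rng.normal(100, 10, size=(30, 5)).cumsum(axis=0)
treated_pre = donors_pre @ np.array([0.5, 0.3, 0.2, 0.0, 0.0]) + rng.normal(0, 1, 30)

w = synthetic_control_weights(treated_pre, donors_pre)
print(np.round(w, 2))
# after the intervention, donors_post @ w is the counterfactual you compare
# against the treated unit's actual outcome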

How to compare different groups with different sample size?

I am plotting students' data from different schools to see the difference between male and female student numbers in some majors. I am using Python. I have already plotted the data for some schools and, as I expected, the male numbers are generally higher. Then I realized that each school has a different number of total students. Does my work make any sense when the sample sizes are different? If not, may I have some suggestions on what to change?
Now I see the problem. Look: you have two classes, where the first has 2 students and the second has 20, and you have their marks. The 2 students in the first class both score 90/100, while the 20 marks in the second class range from 40 to 80. Would it be correct to say "well, the first class did much better on the test than the second"? Of course not.
To deal with this, look at the minimum of the sample sizes. If it looks too small, set the comparison aside, because you do not have enough data to say anything. Also report each group's total sample size via a legend, a text annotation, or the title; either way it shows the reader how reliable your results are.
This question is more about statistics than programming, but I will try to answer.
An important question you didn't address: what are you doing this for? If your question is "are there more men than women in the population (here, the population = all persons in the major)?", then the individual schools are not important to you, and you can pool the samples and work with them as one (just don't forget to combine them).
But you may instead be asking: "are there any differences between the schools in my samples?" In that case, pooling is not correct. For this purpose I recommend a horizontal bar plot with stacked=True, one bar per school, normalized to percentages; see the sketch below. Then the difference in sample sizes is no longer a problem.
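A minimal matplotlib/pandas sketch of that suggestion, with made-up per-school counts standing in for your data:

import pandas as pd
import matplotlib.pyplot as plt

# hypothetical counts of students in one major, per school
counts = pd.DataFrame(
    {"male": [120, 45, 300], "female": [80, 55, 150]},
    index=["School A", "School B", "School C"],
)

# normalize each school's row to percentages so unequal totals don't distort the bars
percent = counts.div(counts.sum(axis=1), axis=0) * 100

ax = percent.plot.barh(stacked=True)
ax.set_xlim(0, 115)  # leave room on the right for the sample-size labels
ax.set_xlabel("Share of students (%)")
# annotate each bar with the school's total, so the reliability of each row is visible
for y, total in enumerate(counts.sum(axis=1)):
    ax.text(101, y, f"n={total}", va="center")
plt.tight_layout()
plt.show()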
Please, when you ask a question, include some code. Three rows of data and one plot from a sample would be very helpful.

Need help simplifying or improving a weighted distribution formula in excel (math/excel/programming noob)

I've been having some fun creating a rather extensive inventory in Google Sheets for my collection of trading cards. I buy the majority of my collectibles in lots meaning that I pay a total of X dollars for Y number of cards of different values (as opposed to buying each card individually).
In my spreadsheet I have a "Purchase Price" column where I enter the price I paid for each card. If I buy 1 lot of 10 cards, to find the value of each of those cards you would just divide the cost of the lot by the number of cards in the lot. So if I purchased 1 lot of 10 cards for a total of $100, the Purchase Price of each card would equal $10. Simple enough, right?
Well, it would be if you were OK with entering the rare, uncommon, and common cards in the lot as having the exact same purchase price even though their real market values are all different. So what I did was create a formula that automatically adjusts the purchase price for each card that's part of a lot based on its rarity, so it's at least closer to the actual market value of the card.
Here is the formula:
=IFS(B2="C",D2*$B$15*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="U",D2*$B$16*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="R",D2*$B$17*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2)
Not sure if that means much to anyone, so here's a link to an example spreadsheet of the formula in action below.
And if you don't care to check that out, here's a screenshot:
The problem:
So the formula works exactly how I want it to work EXCEPT when there are 0 commons in a lot. When that happens I get a #DIV/0! error saying that "Function DIVIDE parameter 2 cannot be zero." I understand why this is happening since it doesn't like to divide by 0 in the first line, but what I don't understand is how to fix it.
How can I fix the DIV error, or is there a better way to do this, perhaps an alternative formula or approach? I am not a programmer and somewhat of a beginner at Excel.
Two suggestions.
1. Embed each division in an IFERROR() function as shown below. This function will return zero instead of an error; you can substitute another calculation for that. In fact, depending upon which level you introduce the IFERROR at (wrapping just one of the three calculations or all three), you might choose to embed the IFS in another IFS that tests for zeroes. Once you have no more divisions by zero there is no more need for IFERROR, so it becomes a question of formula efficiency.
=IFERROR(D2*$B$15*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,0)
2. Forget about all of this and seek a commercially logical solution. The logic says that you never buy a lot unless it contains some items you want, and the seller never has a lot that doesn't contain rubbish. In the end you get inundated with commons, meaning you have more of them than you can ever hope to sell. So, what's their real, commercial value? Value your rare and uncommon cards individually and all the baggage not at all. You will find the outcome more realistic both for the commons and the rares. BTW, that's what they do with stamp or coin collections.
I realise this is more a comment than an answer, but it's too large to put it as a comment:
Your formula is unreadable, as you can see:
=IFS(B2="C",D2*$B$15*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="U",D2*$B$16*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="R",D2*$B$17*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2)
First, I'd advise you to create a new column (you can always hide it), I:I, which contains the following formula (for I2):
=D2*G2/((D2*const_weight_common)+(E2*const_weight_uncommon)+(F2*const_weight_rare))/D2
(And you give this a meaningful header name)
As far as names for $B$15:$B$17 are concerned, do something like:
$B$15 : const_weight_common
$B$16 : const_weight_uncommon
$B$17 : const_weight_rare
(You do know how to use names in Excel?)
Like this, your formula becomes:
=IFS(B2="C",I2 * const_weigth_common,
B2="U",I2 * const_weigth_uncommon,
B2="R",I2 * const_weigth_rare);
As far as your error is concerned: as mentioned in another answer, this might be tackled using the =IF() formula, so I2 becomes:
=IF(D2<>0, ..., _ERROR_VALUE) // up to you how to change your error value
Like this, your formulas become much clearer and it will be easier to solve possible problems.
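If it helps to see the same allocation outside a spreadsheet, here is a short Python sketch with hypothetical rarity weights standing in for $B$15:$B$17. Because the per-rarity count cancels out of the original formula algebraically, the only division left here is by the lot's total weight, so a lot with zero commons no longer triggers a divide-by-zero:

# hypothetical rarity weights, standing in for $B$15:$B$17
WEIGHTS = {"C": 1.0, "U": 3.0, "R": 10.0}

def price_per_card(rarity, n_common, n_uncommon, n_rare, lot_price):
    """Allocate a lot's purchase price to one card of the given rarity,
    in proportion to the rarity weights."""
    total_weight = (n_common * WEIGHTS["C"]
                    + n_uncommon * WEIGHTS["U"]
                    + n_rare * WEIGHTS["R"])
    if total_weight == 0:
        return 0.0  # empty lot: nothing to allocate
    return WEIGHTS[rarity] * lot_price / total_weight

# example: a $100 lot with 0 commons, 5 uncommons and 2 rares
print(price_per_card("U", 0, 5, 2, 100))  # about 8.57 per uncommon
print(price_per_card("R", 0, 5, 2, 100))  # about 28.57 per rare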
I Ain't No Math-A-Magician
But I can help you with this...
The way I see it, there are three schools of thought and you need to figure out which one is yours:
The programmer - trapping #div/0 errors all day
The purist - there is only one answer, and it is undefined
The pragmatic - a graph shows results approaching a limit
I think the programmer is either dangerous or ineffectual, or dangerously ineffectual. Sure, he can trap the error, but what exactly does that do? I'll tell you exactly what it does: it puts lipstick on a pig. It's literally replacing one string, "#div/0!", with a different, more aesthetic string. Or he can play second fiddle to the devil, publicly killing the one bug everyone knew how to defend against while creating another.
I think the purist is right about one thing: the answer is what it is and it can't be anything else; but he's also wrong. A precisely known theoretical answer may very well be undefined, but where in the real world has anyone ever seen division by zero? It's a mathematical construct like infinity; we can safely ignore it. Don't believe me? What happens when you divide the sun by zero? Go ahead, I'll wait for your answer.
I think these things because I am grounded in pragmatism. One string is not better than the other if they both symbolize an attempt to divide by zero. I prefer knowing who my enemies are so I may keep them in front of me. Nature truly abhors the undefined and is only slightly displeased with a vacuum.

Cannot generalize my Genetic Algorithm to new Data

I've written a GA to model a handful of stocks (4) over a period of time (5 years). It's impressive how quickly the GA can find an optimal solution on the training data, but I am also aware that this is mainly due to its tendency to over-fit in the training phase.
However, I still thought I could take a few precautions and get some kind of prediction on a set of unseen test stocks from the same period.
One precaution I took was:
When multiple stocks can be bought on the same day, the GA only buys one from the list, and it chooses that one randomly. I thought this randomness might help to avoid over-fitting?
Even if over-fitting is still occurring, shouldn't it be absent in the initial generations of the GA, since it hasn't had a chance to over-fit yet?
As a note, I am aware of the no-free-lunch theorem, which demonstrates (I believe) that there is no perfect set of parameters that will produce an optimal output for two different datasets. If we take this further, does the no-free-lunch theorem also prohibit generalization?
The graph below illustrates this.
-> The blue line is the GA output.
-> The red line is the training data (slightly different because of the aforementioned randomness).
-> The yellow line is the stubborn test data, which shows no generalization. In fact this is the most flattering graph I could produce.
The y-axis is profit; the x-axis is the trading strategies, sorted from worst to best (left to right) according to their respective profits (on the y-axis).
Some of the best advice I've received so far (thanks seaotternerd) is to focus on the earlier generations and increase the number of training examples. The graph below has 12 training stocks rather than just 4, and shows only the first 200 generations (instead of 1,000). Again, it's the most flattering chart I could produce, this time with medium selection pressure. It certainly looks a little bit better, but not fantastic either. The red line is the test data.
The problem with over-fitting is that, within a single data-set, it's pretty challenging to tell over-fitting apart from actually getting better in the general case. In many ways, this is more of an art than a science, but here are some general guidelines:
A GA will learn to do exactly what you attach fitness to. If you tell it to get really good at predicting one series of stocks, it will do that. If you keep swapping in different stocks to predict, though, you might be more successful at getting it to generalize. There are a few ways to do this. The one that has had perhaps the most promising results for reducing over-fitting is imposing spatial structure on the population and evaluating on different test cases in different cells, as in the SCALP algorithm. You could also switch out the test cases on a time basis, but I've had more mixed results with that sort of an approach.
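As a toy illustration of the swapping idea (this is not the SCALP algorithm, and the back-test and one-parameter "strategy" below are hypothetical stand-ins for whatever your GA actually evolves):

import random

rng = random.Random(42)

# toy data: a few synthetic daily-return series standing in for real stocks
stock_pool = {
    name: [rng.gauss(0.0005, 0.01) for _ in range(250)]
    for name in ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"]
}

def backtest(threshold, returns):
    """Toy back-test: 'buy' on days whose previous return exceeds `threshold`
    and sum the resulting returns."""
    return sum(r for prev, r in zip(returns, returns[1:]) if prev > threshold)

def fitness(threshold, pool, k=3):
    """Score a strategy on a freshly drawn subset of k stocks on every call,
    so the GA cannot specialize on any fixed set of training series."""
    subset = rng.sample(list(pool.values()), k)
    return sum(backtest(threshold, series) for series in subset) / k

# inside the GA's evaluation step you would call something like:
print(fitness(0.01, stock_pool))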
You are correct that over-fitting should be less of a problem early on. Generally, the longer you run a GA, the more over-fitting will be possible. Typically, people tend to assume that the general rules will be learned first, before the rote memorization of over-fitting takes place. However, I don't think I've actually ever seen this studied rigorously - I could imagine a scenario where over-fitting was so much easier than finding general rules that it happens first. I have no idea how common that is, though. Stopping early will also reduce the ability of the GA to find better general solutions.
Using a larger data-set (four stocks isn't that many) will make your GA less susceptible to over-fitting.
Randomness is an interesting idea. It will definitely hurt the GA's ability to find general rules, but it should also reduce over-fitting. Without knowing more about the specifics of your algorithm, it's hard to say which would win out.
That's a really interesting thought about the no-free-lunch theorem. I'm not 100% sure, but I think it does apply here to some extent - better fitting some data will make your results fit other data worse, by necessity. However, as wide as the range of possible stock behaviors is, it is much narrower than the range of all possible time series in general. This is why it is possible to have optimization algorithms at all - a given problem that we are working with tends to produce data that cluster relatively closely together, relative to the entire space of possible data. So, within that set of inputs that we actually care about, it is possible to get better. There is generally an upper limit of some sort on how well you can do, and it is possible that you have hit that upper limit for your data-set. But generalization is possible to some extent, so I wouldn't give up just yet.
Bottom line: I think that varying the test cases shows the most promise (although I'm biased, because that's one of my primary areas of research), but it is also the most challenging solution, implementation-wise. So as a simpler fix you can try stopping evolution sooner or increasing your data-set.
