How to compare different groups with different sample size?

How to compare different groups with different sample size? - statistics

I am plotting students' data from different schools to see the difference between male and female student numbers at some majors. I am using python, I already plot the data for some schools and as I expected male numbers are genuinely higher, then I realized that for each school I have a different number of total students. does my work make any sense when the sample size is different? if not may I have some suggestion to make some changes.

Now I'm realizing. Look: you have two classes where the first has 2 men, the second one - 20 men. And their marks. 2 men - both are 90/100. And 20 marks in the second one. Let it be a range from 40 to 80. Will it be correct if we say "Well, the first class made the test much better then the second"? Ofc, not.
To solve this problem just take a min(sizes of samples). If it looks too small, so throw away this programm, because you have not enough data to say something. And put a total size of sample via proxy legend or text, or add it in title. Anyway it will show you reliability of your results.

This question is not about programming, but rather about statistics, but I will try to answer.
Important question I didn't get there: What are you doing it for? If you ask question like "Hmm... Are there more men than women in the population(in this case, population = all persons in major programm)?". So each schools aren't important for you,and you can work with samples as you work with one (but don't forger gather them).
But you may ask question: "are there any difference between schools in samples?". In this case, gathering is not correct. For this purpose I highly recommend barh plot with stucked=True for each school. And for normalization just use percents. And difference between samples' size won't be problem.
PLS, If you ask question, put some code. 3 rows and one plot from a sample would be very helpful...

Related

Framework for minimizing time complexity of generalized search

I have training in pure math but not in statistics, computer science, and information theory so I am a bit lost here and would really appreciate any guidance.
I am looking for some helpful ways to frame a general search approach which would minimize the time complexity of the search.
For example, let's say I was playing a modified version of 20-questions with a friend. The friend has thought of a human, presently alive in the US, and I can ask upto 20 questions to uncover the truth. I want to ask as few questions as possible on average to win the game. We will play this game repeatedly and I want to develop a strategy that would minimize my average win time (as measured by the number of questions asked).
Sample Space: 329.5 million humans currently alive in the US
Rule: Ask any question. The question can have yes or no answer or even a descriptive answer. So for instance, it is allowed to ask the first name of the person.
Intuitively, it seems to me that immediately (as a first quesiton) asking a question like "Is it Barack Obama?" is a terrible question because it splits the sample space (or search space) into two sets, one with 1 person, namely the former US President, and the second containing rest of the US population.
Asking, what is their sex (or old school gender) may be a better question as it will split the yes and no answers into sets of roughly equal sizes.
Instead of asking a binary question, asking an n-ary question is likely better because it will split the sample space into n sub-spaces of varying sizes and if the sizes are similar then that's fantastic. For instance, the question could be, what is the first letter of their last name? There are 26 possible answers, although we know that people in the US are much more likely to have their last name begin with "J" rather than "X".
Of course, I can conceivably ask a 329.5 million-ary question whereby I'll have the answer in one-shot.
My questions for you guys are as follows:
If we fix "n", so asking only binary or ternary or fixed-n-ary questions, it seems to me that the efficient approach would be to ask questions which would divide the sample space into "n" roughly equal parts, if I am minimizing time complexity. How can I prove this? What is the right approach or mathematical fraemwork to prove this? Assuming that I am only minimizing time complexity or the average number of questions I need to ask to get to the solution.
If we don't fix "n" then what would be a general way to frame this mathematically? Now I have two variables over which I am operating, "n" and "the relative size of subsets the answer to a n-ary question splits the sample space", to minimize the time complexity. How can I frame this problem mathematically?
Is my intuition even correct? Or are there faster ways to approach this?
What I am describing sounds an awful lot like a Classificaiton Decision Tree in Machine Learning. Is minimzing Entorpy the right way to frame my question?
Who would know or think about this type of stuff ? Information theorists? Computer Scientists? Statisticians? Probability Theorists? Machine Learning folks? Someone else?
What's the right forum on the internet to get help on this question? Reddit? Some specific stackexchange? Anything else?
Thx

Need help simplifying or improving a weighted distribution formula in excel (math/excel/programming noob)

I've been having some fun creating a rather extensive inventory in Google Sheets for my collection of trading cards. I buy the majority of my collectibles in lots meaning that I pay a total of X dollars for Y number of cards of different values (as opposed to buying each card individually).
In my spreadsheet I have a "Purchase Price" column where I enter in the price I paid for each card. If I buy 1 lot of 10 cards, to find the value of each of those cards you would just divide the cost of the lot by the number of cards in the lot. So if I purchased 1 lot of 10 cards for a total of $100, the Purchase Price of each card would equal $10. Simple enough right?
Well, that would be if you were OK with entering the rare, uncommon, and common cards in the lot with having the same exact purchase price even though their real market values would all be different. So, what I did was create a formula that automatically adjusts the purchase price for each card that's part of a lot based on its rarity so it's at least closer in accuracy to the actual market value of the card.
Here is the formula:
=IFS(B2="C",D2*$B$15*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="U",D2*$B$16*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="R",D2*$B$17*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2)
Not sure if that means much to anyone, so here's a link to an example spreadsheet of the formula in action below.
And if you don't care to check that out, here's a screenshot:
The problem:
So the formula works exactly how I want it to work EXCEPT when there are 0 commons in a lot. When that happens I get a #DIV/0! error saying that "Function DIVIDE parameter 2 cannot be zero." I understand why this is happening since it doesn't like to divide by 0 in the first line, but what I don't understand is how to fix it.
How can I fix the DIV error, or is there a better way to do this, perhaps an alternative formula or approach? I am not a programmer and somewhat of a beginner at Excel.

Two suggestions.
Embed each division in an IFERROR() function as shown below. This function will return zero instead of an error. You can substitute another calculation for that. In fact, depending upon which level you introduce the IFERROR at (embracing just one of the three calculations or all three) you might choose to embed the IFS in another IFS that tests for zeroes. Once you have no more divisions by zero there would be no more need for IFERROR. So, it becomes a question of formula efficiency.
=IFERROR(D2*$B$15*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,0)
Forget about all of this and seek a commercially logical solution. The logic says that you never buy a lot unless it contains some items you want, and the seller never has a lot that doesn't contain rubbish. In the end you get inundated with commons, meaning you have more of them than you can ever hope to sell. So, what's their real, commercial value? Valuate your rare and uncommon cards individually and all the baggage not at all. You will find the outcome more realistic both for the Commons and the Rare. BTW, that's what they do with stamp or coin collections.

I realise this is more a comment than an answer, but it's too large to put it as a comment:
Your formula is unreadable, as you can see:
=IFS(B2="C",D2*$B$15*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="U",D2*$B$16*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="R",D2*$B$17*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2)
First I'd advise you to create a new column (you might always hide it), I:I, which contains following formula (for I2):
=B2*G2/((D2*const_weigth_common)+(E2*const_weigth_uncommon)+(F2*const_weigth_rare))/D2
(And you give this a meaningful header name)
As far as names for $B$15:$B$17 are concerned, do something like:
$B$15 : const_weigth_common
$B$16 : const_weigth_uncommon
$B$17 : const_weigth_rare
(You do know how to use names in Excel?)
Like this, your formula becomes:
=IFS(B2="C",I2 * const_weigth_common,
B2="U",I2 * const_weigth_uncommon,
B2="R",I2 * const_weigth_rare);
As far as your error is concerned: as mentioned in another answer, this might be tackled using the =IF() formula, so I2 becomes:
=IF(D2<>0;...;_ERROR_VALUE); // up to you how to change your error value
Like this, your formulas become much clearer and it will be easier to solve possible problems.

I Ain't No Math-A-Magician
But I can help you with this...
The way I see it, there are three schools of thought and you need to figure out which one is yours:
The programmer - trapping #div/0 errors all day
The purist - there is only one answer, and it is undefined
The pragmatic - a graph shows results approaching a limit
I think the programmer is either dangerous or ineffectual, or dangerously ineffectual. Sure he can trap there error but what exactly does that do? I'll tell you exactly what that does, it puts lipstick on a pig. It's literally replacing one string, "#div/0!", with a different more aesthetic string. Or he can play second fiddle to the devil and publicly killing the one bug everyone knew how to defend at defend sgainst and creating another?
I think the purist is right about one thing, the answer is what it is and it can't be a anything else; but he's also wrong. A precisely known theoretical answer may very well be undefined, but where in the real world has anyone ever seen division by zero? It's a mathematical construct like infinity, we can safely ignore it. Don't believe me? What happens when you dvide the sun by zero? Go ahead, I'll wait for your answer.
I think these things because I am grounded in pragmatism One string is not better than the other if they both symbolize an attempt to divide by zero. I prefer knowing who my enemies are so I my keep them in front of it me. Nature truly abhorss the undefined and is only slightly displeased with a vacuum.

How do I get the context of a sentence?

There is a questionnaire that we use to evaluate the student knowledge level (we do this manually, as in a test paper). It consists of the following parts:
Multiple choice
Comprehension Questions (I.e: Is a spider an insect?)
Now I have been given a task to make an expert system that will automate this. So basically we have a proper answer for this. But my problem is the "comprehension questions". I need to compare the context of their answer to the context of the correct answer.
I already initially searched for the answer, but it seems like it's really a big task to do. What I have search so far is I can do this through NLP which is really new to me. Also, if I'm not mistaken, it seems like that I have to find a dictionary of all words that is possible for the examiner to answer.
Am I on the right track? If no, please suggest of what should I do (study what?) or give me some links to the materials that I need. Also, should I make my own dictionary? Because the words that I will be using are in the Filipino language.
Update: Comprehension question
The comprehension section of the questionnaire contains one paragraph explaining a certain scenario. The questions are fairly simple. Here is an example:
Bonnie's uncle told her to pick apples from the tree. Picking up a stick, she poked the fruits so they would fall. In the middle of doing this, a strong gust of wind blew. Due to her fear of the fruits falling on top of her head, she stopped what she was doing. After this, though, she noticed that the wind had caused apples to fall from the tree. These fallen apples were what she brought home to her uncle.
The questions are:
What did Bonnie's uncle tell her to do?
What caused Bonnie to stop picking apples from the tree?
Is Bonnie a good fruit picker? Please explain your answer.
The possible answers that the answer key states are:
For number 1:
1.1 Bonnie's uncle told her to pick apples from the tree
1.2 Get apples
For number 2:
2.1 A strong gust of wind blew
2.2 She might get hit in the head by the fruits
For number 3:
3.1 No, because the apples she got were already on the ground
3.2 No, because the wind was what caused the fruits to fall
3.3 Yes, because it is difficult to pick fruits when it's windy.
3.4 Yes, because at least she tried
Now there are answers that were given to me. The job that the system shall be able to do is to compare the context of the student's answer to the context of the right answer in order for the system to successfully be able to grade the student's answer.

One simplistic way of doing this that I can think of (off the top of my head) is to use a string similarity metric like cosine or jaccard to identify whether certain keywords appear in a test answer and the known correct answer.
Extracting these keywords automatically could be done with part of speech tagging using NLP. For example, you could extract all nouns (and possibly verbs). Then, representing each answer as a vector of keywords, you could compare the test vector with the known correct vector.
For example, in the second question, the vector for the two possible answers could be
gust, wind, blew
hit, head, fruits
An answer like "she picked up a stick" with the keywords: picked, stick would have a very low score as compared to something like "afraid of fruit falling on her head" with keywords: fruit, falling, head.
Notes:
This can detect only wildly wrong answers. Wrong answers containing the right keywords would not be detected by this technique. :)
I'm not sure about non-english sentences. If that is the case, you might want to take every word in the answer as a keyword (removing stopwords). This question might help as well.

Divide Data Into Quartiles

I have a dataset that contains admissions rates of all providers that we work with. I need to divide that data into quartiles, so that each provider can see where their rate lies in comparison to other providers. The rate ranges from 7% to 89%. can anyone suggest me how to do this? I am not sure if this is the right place to ask this question but if somebody can help me with this, I would really appreciate that.
The other concern is that if a provider's numbers is really small eg: 2/4 = 50%, the provider might fall into worse quartile but it doesn't mean that the provider's performance is bad because the numbers are so small. I hope this is making sense. Please let me know if I can clarify it further.

There are ways to obtain quantiles without doing a complete sort but unless you've got huge amounts of data there is no point in implementing those algorithms if you haven't already got them available. Presuming you have a sort() function available, all you need to do is:
Given n data points.
Sort the data points.
Find the n/4, n/2 and 3*n/4th points in the sorted data, which are your quartiles.
As you say, if n is less than some number (that you'll have to decide for yourself) you may want to say that the quartile result is "not applicable" or some such.

First concern: For small n, do not use quartiles. Whether n is small is arbitrary.

1 vs 1 vote: calculate ratings (Flickchart.com)

Instead of rating items with grades from 1 to 10, I would like to have 1 vs 1 "fights". Two items are displayed beside each other and you pick the one which you like more. Based on these "fight" results, an algorithm should calculate ratings for each item.
You can see this approach on Flickchart.com where movies are rated using this approach.
It looks like this:
As you can see, items are pushed upwards if they win a "fight". The ranking is always changing based on the "fight" results. But this can't be only based on the win quote (here 54%) since it's harder to win against "Titanic" than against "25th Hour" or so.
There are a few things which are quite unclear for me:
- How are the ratings calculated? How do you decide which film is on the first place in the ranking? You have to consider how often an items wins and how good are the beaten items.
- How to choose which items have a "fight"?
Of course, you can't tell me how Flickchart exactly does this all. But maybe you can tell me how it could be done. Thanks in advance!

This might not be exactly what flickchart is doing, but you could use a variant of the ELO algorithm used in chess (and other sports), since these are essentially fights/games that they win/lose.
Basically, all movies start off with 0 wins/losses and every time they get a win they get a certain amount of points. You usually have an average around 20 (but any number will do) and winning against a movie with the same rating as yourself will give exactly that 20. Winning against a bad movie will maybe give around 10 points, while winning against a better movie might give you 30 points. The other way around, losing to a good movie you only lose 10 points, but if you lose to a bad movie, you lose 30 points.
The specifics of the algorithm is in the wikipedia link.

How are the ratings calculated? How do you decide which film is on the first place in the ranking? You have to consider how often an items wins and how good are the beaten items.
What you want is a weighted rating, also called a Bayesian estimate.
I think IMDB's Top 250 movies is a better starting point to make a ranking website. Some movies have 300,000+ votes while others others have fewer than 50,000. IMDB uses a Bayesian estimate to rank movies against one another without unfairly weighting popular movies. The algorithm is given at the bottom of the page:
weighted rating (WR) = (v ÷ (v+m)) × R
+ (m ÷ (v+m)) × C where:
R = average for the movie (mean) =
(Rating)
v = number of votes for the
movie = (votes)
m = minimum votes
required to be listed in the Top 250
(currently 3000)
C = the mean vote
across the whole report (currently
6.9)
for the Top 250, only votes from
regular voters are considered.
I don't know how IMDB chose 3000 as their minimum vote. They could have chosen 1000 or 10000, and the list would have been more or less the same. Maybe they're using "average number of votes after 6 weeks in the box office" or maybe they're using trial and error.
In any case, it doesn't really matter. The formula above is pretty much the standard for normalizing votes on ranking websites, and I'm almost certain Flickrchart uses something similar in the background.
The formula works so well because it "pulls" ratings toward the mean, so ratings above the mean are slightly decreased, ratings below the mean are slightly increased. However, the strength of the pull is inversely proportional to the number of votes a movie has. So movies with few votes are pulled more aggressively toward the mean than movies with lots of votes. Here are two data points to demonstrate the property:
Rank Movie Votes Avg Rating Weighted Rating
---- ----- ----- ---------- ---------------
219 La Strada 15,000+ 8.2 8.0
221 Pirates of the 210,000+ 8.0 8.0
Caribbean 2
Both movies' ratings are pulled down, but the pull on La Strada is more dramatic since it has fewer votes and therefore is not as representative as ratings for PotC.
For your specific case, you have two items in a "fight". You should probably design your table as follows:
Items
-----
ItemID (pk)
FightsWon (int)
FightsEngaged (int)
The average rating is FightsWon / FightsEngaged. The weighted rating is calculated using the formula above.
When a user chooses a winner in a fight, increase the winning item's FightsWon field by 1, increase both items FightsEngaged field by 1.
Hope this helps!
- Juliet

I've been toying with the problem of ranking items by means of pair-wise comparison for some time myself, and wanted to take the time to describe the ideas I came up with so far.
For now I'm simply sorting by <fights won> / <total fights>, highest first. This works fine if you're the only one voting, or if there are a lot of people voting. Otherwise it can quickly become inaccurate.
One problem here is how to choose which two items should fight. One thing that does seem to work well (subjectively) is to let the item that has the least fights so far, fight against a random item. This leads to a relatively uniform number of fights for the items (-> accuracy), at the cost of possibly being boring for the voter(s). They will often be comparing the newest item against something else, which is kinda boring. To alleviate that, you can choose the n items with the lowest fight-count and chose one of those randomly as the first contender.
You mentioned that you want to make victories against strong opponents count more than against weak ones. As mentioned in other posts above, rating systems used for chess and the like (Elo, Glicko) may work. Personally I would love to use Microsoft's TrueSkill, as it seems to be the most accurate and also provides a good way to pick two items to pit against each other -- the ones with the highest draw-probability as calculated by TrueSkill. But alas, my math understanding is not good enough to really understand and implement the details of the system, and it may be subject to licensing fees anyway...
Collective Choice: Competitive Ranking Systems has a nice overview of a few different rating systems if you need more information/inspiration.
Other than rating systems, you could also try various simple ladder systems. One example:
Randomize the list of items, so they are ranked 1 to n
Pick two items at random and let them fight
If the winner is ranked above the loser: Do nothing
If the loser is ranked above the winner:
If the loser is directly above the winner: Swap them
Else: Move the winner up the ladder x% toward the loser of the fight.
Goto 2
This is relatively unstable in the beginning, but should improve over time. It never ceases to fluctuate though.
Hope I could help at least a little.

As for flickchart, I've been playing around with it a little bit, and I think the rating system is pretty unsophisticated. In pseudo-code, my guess is that it looks something like this:
if rank(loser) == null and rank(winner) == null
insert loser at position estimated from global rank
insert winner at position estimated from global rank
else if rank(winner) == null or rank(winner) < rank(loser)
then advance winner to loser's position and demote loser and all following by 1
Why do I think this? First, I'm completely convinced that their Bayesian priors are not based on a careful mining of my previous choices. They seem to have no way to guess that because I like Return of the Jedi that I like The Empire Strikes Back. In fact, they can't figure out that because I've seen Home Alone 2 that I may have seen Home Alone 1. After hundreds of ratings, the choice hasn't come up.
Second of all, if you look at the above code you might find a little bug, which you will definitely notice on the site. You may notice that sometimes you will make a choice and the winner will slide by one. This seems to only happen when the loser wasn't previously added. My guess is that what is happening is that the loser is being added higher than the winner.
Other than that, you will notice that rankings do not change at all unless a lower ranked movie beats a higher ranked movie directly. I don't think any real scores are being kept: the site seems to be entirely memoryless except for the ordinal rank of each movie and your most recent rating.

Or you might want to use a variant of PageRank see prof. Wilf's cool description.

After having thought things through, the best solution for this film ranking is as follows.
Required data:
The number of votes taken on each pairing of films.
And also a sorted version of this data grouped like in radix sort
How many times each film was voted for in each pairing of films
Optional data:
How many times each film has been involved in a vote for each user
How to select a vote for a user:
Pick out a vote selection from the sorted list in the lowest used radix group (randomly)
Optional: use the user's personal voting stats to filter out films they've been asked to vote on too many times, possibly moving onto higher radix buckets if there's nothing suitable.
How to calculate the ranking score for a film:
Start the score at 0
Go through each other film in the system
Add voteswon / votestaken versus this film to the score
If no votes have been taken between these two films, add 0.5 instead (This is of course assuming you want new films to start out as average in the rankings)
Note: The optional stuff is just there to stop the user getting bored, but may be useful for other statistics also, especially if you include how many times they voted for that film over another.
Making sure that newly added films have statistics colleted on them ASAP and very evenly distributed votes across all existing films is vital to keeping stats correct for the rest of the films. It may be worth staggering the entry of a bunch of new films to the system to avoid temporary glitches in the rankings (though not immediate nor severe).
===THIS IS THE ORIGINAL ANSWER===
The problem is actually very easy. I am assuming here that you want to order by preference to vote for the film i.e. the #1 ranked film is the film that is most likely to be chosen in the vote. If you make it so that in each vote, you choose two films completely at random you can calculate this with simple maths.
Firstly each selection of two films to vote on is equally likely, so results from each vote can just be added together for a score (saves multiplying by 1/nC2 on everything). And obviously the probability of someone voting for one specific film against another specific film is just votesforthisfilm / numberofvotes.
So to calculate the score for one film, you just sum votesforthisfilm / numberofvotes for every film it can be matched against.
There is a little trouble here if you add a new film which hasn't had a considerable number of votes against all the other films, so you probably want to leave it out of the rankings until a number of votes has built up.
===WHAT FOLLOWS IS MOSTLY WRONG AND IS MAINLY HERE FOR HISTORICAL CONTEXT===
This scoring method is derived from a Markov chain of your voting system, assuming that all possible vote questions were equally likely. [This first sentence is wrong because making all vote questions have to be equally likely in the Markov chain to get meaningful results] Of course, this is not the case, and actually you can fix this as well, since you know how likely each vote question was, it's just the number of votes that have been done on that question! [The probability of getting a particular vote question is actually irrelevant so this doesn't help] In this way, using the same graph but with the edges weighted by votes done...
Probability of getting each film given that it was included in the vote is the same as probability of getting each film and it being in the vote divided by the probability it was included in the vote. This comes to sumoverallvotes((votesforthisfilm / numberofvotes) * numberofvotes) / totalnumberofvotes divided by sumoverallvotes(numberofvotes) / totalnumberofvotes. With much cancelling this comes to votesforthisfilmoverallvotes / numberofvotesinvolvingthisfilm. Which is really simple!

http://en.wikipedia.org/wiki/Maximize_Affirmed_Majorities?
(Or the BestThing voting algorithm, originally called the VeryBlindDate voting algorithm)

I believe this kind of 1 vs. 1 scenario might be a type of conjoint analysis called Discrete Choice. I see these fairly often in web surveys for market research. The customer is generally asked to choose between two+ different sets of features that they would prefer the most. Unfortunately it is fairly complicated (for a non-statistics guy like myself) so you may have difficulty understanding it.

I heartily recommend the book Programming Collective Intelligence for all sorts of interesting algorithms and data analysis along these lines.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string