1 vs 1 vote: calculate ratings (Flickchart.com) - statistics

Instead of rating items with grades from 1 to 10, I would like to have 1 vs 1 "fights". Two items are displayed beside each other and you pick the one which you like more. Based on these "fight" results, an algorithm should calculate ratings for each item.
You can see this approach on Flickchart.com where movies are rated using this approach.
It looks like this:
As you can see, items are pushed upwards if they win a "fight". The ranking is always changing based on the "fight" results. But this can't be based only on the win percentage (here 54%), since it's harder to win against "Titanic" than against "25th Hour", say.
There are a few things which are quite unclear for me:
- How are the ratings calculated? How do you decide which film takes first place in the ranking? You have to consider both how often an item wins and how good the beaten items are.
- How do you choose which items have a "fight"?
Of course, you can't tell me exactly how Flickchart does all this. But maybe you can tell me how it could be done. Thanks in advance!

This might not be exactly what Flickchart is doing, but you could use a variant of the Elo rating system used in chess (and other sports), since these are essentially fights/games that items win or lose.
Basically, all movies start off with 0 wins/losses, and every time they win a fight they get a certain number of points. The exchange usually averages around 20 points (but any number will do): beating a movie with the same rating as yourself gives exactly those 20 points, beating a worse movie gives maybe around 10 points, and beating a better movie might give you 30 points. The other way around, losing to a good movie only costs you about 10 points, but losing to a bad movie costs you 30.
The specifics of the algorithm are in the Wikipedia article on the Elo rating system.
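Not necessarily what Flickchart does, but here is a minimal sketch of a standard Elo update in Python. The 400-point scale is the usual chess convention, and K = 40 is chosen so that a win over an equally rated item is worth exactly 20 points, as described above; the starting ratings are arbitrary:

def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(winner, loser, k=40):
    """Return the new (winner, loser) ratings after one fight."""
    gain = k * (1.0 - expected_score(winner, loser))
    return winner + gain, loser - gain

# A mid-rated movie (1500) beats a stronger one (1600) and gains about 25 points;
# beating a much weaker movie would gain far less.
print(elo_update(1500, 1600))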

How are the ratings calculated? How do you decide which film takes first place in the ranking? You have to consider both how often an item wins and how good the beaten items are.
What you want is a weighted rating, also called a Bayesian estimate.
I think IMDb's Top 250 movies list is a better starting point for building a ranking website. Some movies have 300,000+ votes while others have fewer than 50,000. IMDb uses a Bayesian estimate to rank movies against one another without unfairly weighting popular movies. The algorithm is given at the bottom of the page:
weighted rating (WR) = (v ÷ (v + m)) × R + (m ÷ (v + m)) × C
where:
    R = average rating for the movie (mean)
    v = number of votes for the movie
    m = minimum votes required to be listed in the Top 250 (currently 3000)
    C = the mean vote across the whole report (currently 6.9)
For the Top 250, only votes from regular voters are considered.
I don't know how IMDB chose 3000 as their minimum vote. They could have chosen 1000 or 10000, and the list would have been more or less the same. Maybe they're using "average number of votes after 6 weeks in the box office" or maybe they're using trial and error.
In any case, it doesn't really matter. The formula above is pretty much the standard for normalizing votes on ranking websites, and I'm almost certain Flickchart uses something similar in the background.
The formula works so well because it "pulls" ratings toward the mean, so ratings above the mean are slightly decreased, ratings below the mean are slightly increased. However, the strength of the pull is inversely proportional to the number of votes a movie has. So movies with few votes are pulled more aggressively toward the mean than movies with lots of votes. Here are two data points to demonstrate the property:
Rank  Movie                       Votes     Avg Rating  Weighted Rating
----  --------------------------  --------  ----------  ---------------
219   La Strada                   15,000+   8.2         8.0
221   Pirates of the Caribbean 2  210,000+  8.0         8.0
Both movies' ratings are pulled down, but the pull on La Strada is more dramatic since it has fewer votes, so its average rating is less representative than the one for Pirates of the Caribbean 2.
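For example, plugging the numbers above into the formula (the vote counts are approximate, so the results are too), a quick Python check reproduces the published weighted ratings:

def weighted_rating(R, v, m=3000, C=6.9):
    """IMDb-style Bayesian estimate: pull R toward C, weighted by vote count."""
    return (v / (v + m)) * R + (m / (v + m)) * C

print(weighted_rating(8.2, 15000))    # La Strada                  -> ~7.98
print(weighted_rating(8.0, 210000))   # Pirates of the Caribbean 2 -> ~7.98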
For your specific case, you have two items in a "fight". You should probably design your table as follows:
Items
-----
ItemID (pk)
FightsWon (int)
FightsEngaged (int)
The average rating is FightsWon / FightsEngaged. The weighted rating is calculated using the formula above.
When a user chooses a winner in a fight, increase the winning item's FightsWon field by 1 and increase both items' FightsEngaged fields by 1.
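A hedged sketch of how that could look in Python. Here R is the raw win rate, C is assumed to be the average win rate across all items (0.5 if every fight produces one winner and one loser), and m is an arbitrary minimum number of fights; none of these are values Flickchart is known to use:

def fight_weighted_rating(fights_won, fights_engaged, m=20, C=0.5):
    """Bayesian estimate of an item's win rate; m and C are assumed values."""
    if fights_engaged == 0:
        return C                        # no data yet: fall back to the prior
    R = fights_won / fights_engaged     # raw average rating
    v = fights_engaged
    return (v / (v + m)) * R + (m / (v + m)) * C

def record_fight(items, winner_id, loser_id):
    """Update the FightsWon / FightsEngaged counters after one fight."""
    items[winner_id]["FightsWon"] += 1
    items[winner_id]["FightsEngaged"] += 1
    items[loser_id]["FightsEngaged"] += 1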
Hope this helps!
- Juliet

I've been toying with the problem of ranking items by means of pair-wise comparison for some time myself, and wanted to take the time to describe the ideas I came up with so far.
For now I'm simply sorting by <fights won> / <total fights>, highest first. This works fine if you're the only one voting, or if there are a lot of people voting. Otherwise it can quickly become inaccurate.
One problem here is how to choose which two items should fight. One thing that does seem to work well (subjectively) is to let the item that has had the fewest fights so far fight against a random item. This leads to a relatively uniform number of fights across items (-> accuracy), at the cost of possibly being boring for the voter(s). They will often be comparing the newest item against something else, which is kinda boring. To alleviate that, you can take the n items with the lowest fight count and choose one of those randomly as the first contender.
You mentioned that you want to make victories against strong opponents count more than against weak ones. As mentioned in other posts above, rating systems used for chess and the like (Elo, Glicko) may work. Personally I would love to use Microsoft's TrueSkill, as it seems to be the most accurate and also provides a good way to pick two items to pit against each other -- the ones with the highest draw-probability as calculated by TrueSkill. But alas, my math understanding is not good enough to really understand and implement the details of the system, and it may be subject to licensing fees anyway...
Collective Choice: Competitive Ranking Systems has a nice overview of a few different rating systems if you need more information/inspiration.
Other than rating systems, you could also try various simple ladder systems. One example:
1. Randomize the list of items, so they are ranked 1 to n.
2. Pick two items at random and let them fight.
3. If the winner is ranked above the loser: do nothing.
4. If the loser is ranked above the winner:
   - If the loser is directly above the winner: swap them.
   - Else: move the winner up the ladder x% of the way toward the loser of the fight.
5. Go to 2.
This is relatively unstable in the beginning, but should improve over time. It never ceases to fluctuate though.
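A rough Python sketch of that ladder, just to make the update rule concrete; the step size x (what fraction of the distance the winner moves toward the loser) is a tuning knob, set to 50% here purely as an example:

import random

def run_one_fight(ladder, pick_winner, x=0.5):
    """ladder is a list of items, index 0 = rank 1 (best).
    pick_winner(a, b) returns whichever item the voter prefers."""
    a, b = random.sample(ladder, 2)
    winner = pick_winner(a, b)
    loser = b if winner is a else a
    wi, li = ladder.index(winner), ladder.index(loser)
    if wi < li:
        return                                           # winner already ranked above: do nothing
    if wi == li + 1:
        ladder[li], ladder[wi] = ladder[wi], ladder[li]  # loser directly above: swap them
    else:
        new_pos = wi - max(1, int((wi - li) * x))        # move winner x% of the way up toward the loser
        ladder.insert(new_pos, ladder.pop(wi))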
Hope I could help at least a little.

As for flickchart, I've been playing around with it a little bit, and I think the rating system is pretty unsophisticated. In pseudo-code, my guess is that it looks something like this:
if rank(loser) == null and rank(winner) == null
insert loser at position estimated from global rank
insert winner at position estimated from global rank
else if rank(winner) == null or rank(winner) < rank(loser)
then advance winner to loser's position and demote loser and all following by 1
Why do I think this? First, I'm completely convinced that their Bayesian priors are not based on a careful mining of my previous choices. They seem to have no way to guess that because I like Return of the Jedi, I probably like The Empire Strikes Back. In fact, they can't figure out that because I've seen Home Alone 2, I may have seen Home Alone 1. After hundreds of ratings, the choice hasn't come up.
Second, if you look at the above code you might find a little bug, which you will definitely notice on the site: sometimes you make a choice and the winner slides by one. This seems to happen only when the loser wasn't previously added. My guess is that the loser is being added higher than the winner.
Other than that, you will notice that rankings do not change at all unless a lower ranked movie beats a higher ranked movie directly. I don't think any real scores are being kept: the site seems to be entirely memoryless except for the ordinal rank of each movie and your most recent rating.

Or you might want to use a variant of PageRank; see Prof. Wilf's cool description.

After having thought things through, the best solution for this film ranking is as follows.
Required data:
- The number of votes taken on each pairing of films
  - And also a sorted version of this data, grouped into buckets as in a radix sort
- How many times each film was voted for in each pairing of films
Optional data:
- How many times each film has been involved in a vote, for each user
How to select a vote for a user:
- Randomly pick a pairing from the lowest-count radix bucket in the sorted list
- Optional: use the user's personal voting stats to filter out films they've been asked to vote on too many times, moving on to higher radix buckets if there's nothing suitable
How to calculate the ranking score for a film:
- Start the score at 0
- Go through every other film in the system
- Add votes won / votes taken versus that film to the score
- If no votes have been taken between the two films, add 0.5 instead (this assumes you want new films to start out as average in the rankings)
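A small Python sketch of that scoring loop, assuming a nested dict votes[a][b] that holds how many times film a beat film b:

def ranking_score(film, all_films, votes):
    """Sum of per-opponent win fractions; unknown pairings count as 0.5."""
    score = 0.0
    for other in all_films:
        if other == film:
            continue
        won = votes.get(film, {}).get(other, 0)
        lost = votes.get(other, {}).get(film, 0)
        taken = won + lost
        score += won / taken if taken else 0.5   # new pairings start out as average
    return score

# Example: A beat B twice, lost to B once, never faced C -> 2/3 + 0.5
votes = {"A": {"B": 2}, "B": {"A": 1}}
print(ranking_score("A", ["A", "B", "C"], votes))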
Note: The optional stuff is just there to stop the user getting bored, but may be useful for other statistics also, especially if you include how many times they voted for that film over another.
Making sure that newly added films have statistics collected on them ASAP, with votes distributed evenly across all existing films, is vital to keeping the stats correct for the rest of the films. It may be worth staggering the entry of a batch of new films into the system to avoid temporary glitches in the rankings (though these would be neither immediate nor severe).
===THIS IS THE ORIGINAL ANSWER===
The problem is actually very easy. I am assuming here that you want to order films by preference in the vote, i.e. the #1 ranked film is the film that is most likely to be chosen in a vote. If you make it so that in each vote the two films are chosen completely at random, you can calculate this with simple maths.
Firstly, each selection of two films to vote on is equally likely, so the results from each vote can just be added together for a score (this saves multiplying everything by 1/nC2). And obviously the probability of someone voting for one specific film over another specific film is just votesforthisfilm / numberofvotes.
So to calculate the score for one film, you just sum votesforthisfilm / numberofvotes for every film it can be matched against.
There is a little trouble here if you add a new film which hasn't had a considerable number of votes against all the other films, so you probably want to leave it out of the rankings until a number of votes has built up.
===WHAT FOLLOWS IS MOSTLY WRONG AND IS MAINLY HERE FOR HISTORICAL CONTEXT===
This scoring method is derived from a Markov chain of your voting system, assuming that all possible vote questions were equally likely. [This first sentence is wrong because all vote questions would have to be equally likely in the Markov chain to get meaningful results.] Of course, this is not the case, and actually you can fix this as well, since you know how likely each vote question was: it's just the number of votes that have been done on that question! [The probability of getting a particular vote question is actually irrelevant, so this doesn't help.] In this way, using the same graph but with the edges weighted by votes done...
Probability of getting each film given that it was included in the vote is the same as probability of getting each film and it being in the vote divided by the probability it was included in the vote. This comes to sumoverallvotes((votesforthisfilm / numberofvotes) * numberofvotes) / totalnumberofvotes divided by sumoverallvotes(numberofvotes) / totalnumberofvotes. With much cancelling this comes to votesforthisfilmoverallvotes / numberofvotesinvolvingthisfilm. Which is really simple!

http://en.wikipedia.org/wiki/Maximize_Affirmed_Majorities?
(Or the BestThing voting algorithm, originally called the VeryBlindDate voting algorithm)

I believe this kind of 1 vs. 1 scenario might be a type of conjoint analysis called Discrete Choice. I see these fairly often in web surveys for market research. The customer is generally asked to choose which of two or more different sets of features they would prefer most. Unfortunately it is fairly complicated (for a non-statistics guy like myself), so you may have difficulty understanding it.

I heartily recommend the book Programming Collective Intelligence for all sorts of interesting algorithms and data analysis along these lines.

Related

How to compare different groups with different sample size?

I am plotting students' data from different schools to see the difference between male and female student numbers in some majors. I am using Python. I have already plotted the data for some schools and, as I expected, male numbers are indeed higher. Then I realized that each school has a different total number of students. Does my work make any sense when the sample sizes are different? If not, may I have some suggestions for changes?
Now I see the issue. Look: you have two classes, where the first has 2 men and the second has 20 men, and you have their marks. The 2 men both scored 90/100, while the 20 marks in the second class range from 40 to 80. Would it be correct to say "the first class did much better on the test than the second"? Of course not.
To deal with this, just look at the minimum of the sample sizes. If it looks too small, drop the analysis, because you don't have enough data to say anything. And show the total sample size via a legend, a text annotation, or the title. Either way, it will show the reliability of your results.
This question is not about programming but rather about statistics, but I will try to answer.
An important question I didn't see answered: what are you doing this for? If you are asking something like "Hmm... are there more men than women in the population (in this case, the population = all persons in the major programme)?", then the individual schools aren't important to you, and you can pool the samples and work with them as one (but don't forget to gather them all).
But you may instead ask: "Is there any difference between the schools in my samples?" In that case, pooling is not correct. For this purpose I highly recommend a barh plot with stacked=True for each school, and for normalization just use percentages. Then the difference between sample sizes won't be a problem.
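Something along these lines is what I mean; a minimal pandas/matplotlib sketch with made-up school names and counts:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical counts of students per school and gender.
df = pd.DataFrame(
    {"male": [120, 45, 300], "female": [80, 55, 150]},
    index=["School A", "School B", "School C"],
)

# Normalize each school to percentages so different totals are comparable.
pct = df.div(df.sum(axis=1), axis=0) * 100
pct.plot.barh(stacked=True)
plt.xlabel("Share of students (%)")
plt.title("Gender split per school")
plt.show()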
Please, if you ask a question, include some code. Three rows of data and one plot from a sample would be very helpful...

Solr: how to manage irrelevant results when not sorting by relevance?

Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When this is sorted by relevance this is fine, as the most relevant results are listed on the first page.
However, when sorting by another field (e.g. user rating) the results on the first page are full of hardly-relevant results, which is a problem for our client. Somehow we need to only show the 'relevant' results with highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on the Solr side if the relevancy score is under a threshold. I'm not sure how to do this, and from what I've read it isn't a good idea anyway; e.g. if a query returns only 10 listings I would want to display them all instead of filtering any out. It seems impossible to determine a threshold that would work across the board. If anyone can show me otherwise, please show me how!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that now I can't implement pagination, because I have no way to determine the total number of filtered results without returning the whole set, which would affect performance/bandwidth etc. It also has the same problems as the first point.
3 - Create a sort of 'combined' sort that aggregates a score between relevancy and user rating, which the results will then be sorted on. Firstly I'm not sure if this is even possible, and secondly it would be weird for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks
If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define "what is relevant enough", since scores aren't really comparable between queries and don't say anything absolute like "this was X% relevant".
You'll have to decide why the documents that are included aren't relevant and exclude them based on those criteria. Then you can either use the rating as a boost on top of the relevance score (if you want the results to appear organic / ordered by relevance), or exclude the irrelevant documents and just sort by user rating. But remember that user rating, as an experience for the user, is usually a harder problem to make feel relevant than simply ordering by the average of the votes.
Usually the client can choose different ordering options, for example by relevance or by rating. But you are right that ordering by rating alone is probably not useful enough. What you could do is take the rating into account in the relevance scoring, for example by multiplying an "organic" score with the rating transformed into a small boost. In Solr you could do this with Function Queries. It is not hard science, and some magic is involved. Much is common sense. And it requires some very good evaluation and testing to see what works best.
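As a rough illustration only (the core and field names are made up, and the exact boost function is just one possible shape that would need evaluation), an edismax query that multiplies the organic score by a mild rating-based function might be built like this:

import requests

# Hypothetical core and field names; "rating" is assumed to be a numeric field.
params = {
    "q": "user search terms",
    "defType": "edismax",
    "qf": "title description",
    "boost": "log(sum(rating,1))",   # multiply the organic score by a mild rating boost
    "fl": "id,title,rating,score",
}
response = requests.get("http://localhost:8983/solr/products/select", params=params)
print(response.json()["response"]["docs"])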
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users do filtering of the results by rating. Let users help themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content similarity scoring is not the only thing that constitutes relevancy. Many Information Retrieval researchers and engineers agree that contextual information should be used in addition to content similarity. This opens up a plethora of possibilities for defining a retrieval model. For example, Learning to Rank (LTR) approaches have become popular, where different features are learnt from search logs to deliver more relevant documents to users given their user profiles and prior search behavior. Solr offers this as a module.

Identifying wrong raters of items

I coded a program in which people rate different products. For each rating people get a bonus point, and the more bonus points people get, the more reputation they get. But my issue is that people sometimes give ratings not to genuinely rate, but just to earn bonus points. Is there a mathematical way to identify fake raters?
Absolutely. Search for "shilling recommender systems" in Google Scholar or elsewhere. There has been a decent amount of scholarly work identifying bad actors in recommender systems. Generally there's a focus on preventing robot actions (which doesn't seem to be your concern) as well as finding humans who rate differently than the norm (i.e., rating distributions, time-of-rating distributions).
https://scholar.google.com/scholar?hl=en&q=shilling+recommender+systems
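As a very rough first cut, and no substitute for the shilling-detection literature above, you could flag users whose rating behaviour deviates strongly from the population, e.g. by z-scoring each user's mean rating:

from statistics import mean, pstdev

def suspicious_raters(ratings_by_user, z_threshold=2.5):
    """ratings_by_user: {user_id: [ratings...]}.
    Flags users whose average rating sits far from the population average;
    real shilling detection uses richer features (timing, item overlap, ...)."""
    user_means = {u: mean(r) for u, r in ratings_by_user.items() if r}
    overall = mean(user_means.values())
    spread = pstdev(user_means.values()) or 1.0
    return [u for u, m in user_means.items()
            if abs(m - overall) / spread > z_threshold]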

Effect of randomness on search results

I am currently working on a search ranking algorithm which will be applied to elastic search queries (domain: e-commerce). It assigns scores on several entities returned and finally sorts them based on the score assigned.
My question is: has anyone ever tried to introduce a certain level of randomness into a search algorithm and experienced a positive effect from it? I am thinking that it might be useful to reduce bias and promote lower-ranking items, to give them a chance to be seen more easily and get popular if they deserve it. I know that some machine learning algorithms introduce randomization to reduce bias, so I thought it might be applied to search as well.
The closest I can get here is this, but it's not exactly what I am hoping to get answers for:
Randomness in Artificial Intelligence & Machine Learning
I don't see this mentioned in your post... Elasticsearch offers a random scoring feature: https://www.elastic.co/guide/en/elasticsearch/guide/master/random-scoring.html
As the owner of the website, you want to give your advertisers as much exposure as possible. With the current query, results with the same _score would be returned in the same order every time. It would be good to introduce some randomness here, to ensure that all documents in a single score level get a similar amount of exposure.
We want every user to see a different random order, but we want the same user to see the same order when clicking on page 2, 3, and so forth. This is what is meant by consistently random.
The random_score function, which outputs a number between 0 and 1, will produce consistently random results when it is provided with the same seed value, such as a user’s session ID
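In query-DSL terms that is a function_score query with a seeded random_score. A sketch of the request body as a Python dict (the index, base query and session handling are assumptions, and the exact random_score options differ between Elasticsearch versions):

session_id = "abc123"   # hypothetical per-user session ID used as the seed

query = {
    "query": {
        "function_score": {
            "query": {"match": {"category": "advertisement"}},   # made-up base query
            "functions": [
                {"random_score": {"seed": session_id, "field": "_seq_no"}}
            ],
            "boost_mode": "replace",   # order purely by the consistently random score
        }
    }
}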
Your intuition is right: randomization can help surface results that get a lower score than they deserve due to uncertainty in the estimation. Empirically, Google search ads seem to have sometimes been randomized, and e.g. this paper hints at it (see Section 6).
This problem describes an instance of a class of problems called Explore/Exploit algorithms, or Multi-Armed Bandit problems; see e.g. http://en.wikipedia.org/wiki/Multi-armed_bandit. There is a large body of mathematical theory and algorithmic approaches. A general idea is to not always order by expected, "best" utility, but by an optimistic estimate that takes the degree of uncertainty into account. A readable, motivating blog post can be found here.
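A tiny UCB1-style sketch of that "optimistic estimate" idea, where the reward could be, say, a click on a search result (the exploration constant c is a tuning knob):

import math

def ucb_score(successes, trials, total_trials, c=1.4):
    """Mean reward plus an optimism bonus that shrinks as an item gathers trials."""
    if trials == 0:
        return float("inf")            # always try unseen items first
    return successes / trials + c * math.sqrt(math.log(total_trials) / trials)

# Item A has a solid estimate; item B has few trials, so it gets a bigger bonus
# and may be ranked above A despite a similar observed rate.
total = 1000 + 20
print(ucb_score(300, 1000, total))
print(ucb_score(7, 20, total))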

How to change to use Story Points for estimations in Scrum [closed]

Having used "days" as the unit for estimation of tasks in Scrum I find it hard to change to using Story Points. I believe story points should be used as they are more comparable to each other - being less dependent on the qualifications of whoever addresses the task etc. However, it isn't easy to make a team start using Story Points when they're used to estimating in days.
So, how to make a team change to Story Points? What should motivate the team members to do so, and how should we apply the switch?
When I switched to points, I decided to do it only if I could meet the following two conditions: 1) find an argument that justifies the switch and will convince the team, and 2) find an easy method to use it.
Convincing
It took me a lot of reading on the subject, but I finally found the argument that convinced me and my team: it's nearly impossible to find two programmers who will agree on the time a task will take, but the same two programmers will almost always agree on which task is the biggest when shown two different tasks.
This is the only skill you need to ‘estimate’ your backlog. Here I use the word ‘estimate’ but at this early stage it’s more like ordering the backlog from tough to easy.
Putting Points in the Backlog
This step is done with the participation of the entire scrum team.
Start dropping the stories one by one in a new spreadsheet while keeping the following order: the biggest story at the top and the smallest at the bottom. Do that until all the stories are in the list.
Now it's time to put points on those stories. Personally I use the Planning Poker scale (1/2, 1, 2, 3, 5, 8, 13, 20, 40, 100), so this is what I will use for this example. At the bottom of that list you'll probably have micro tasks (things that take 4 hours or less to do). Give every micro task the value of 1/2. Then continue up the list, giving the value 1 (next in the scale) to the stories until it is clear that a story is much bigger (2 instead of 1, so twice as big). Now using the value 2, continue up the list until you find a story that should clearly have a 3 instead of a 2. Continue this process all the way to the top of the list.
NOTE: Try to keep the vast majority of the points between 1 and 13. The first time, you might have a bunch of big stories (20, 40 and 100) and you'll have to break them down into chunks smaller than or equal to 13.
That is it for the points and the original backlog. If you ever have a new story, compare it to that list to see where it fits (bigger/smaller process) and give it the value of its neighbors.
Velocity & Estimation
To estimate how long it will take you to get through that backlog, do the first sprint planning. Add up the points attached to the stories the team picked and voilà, that's your first velocity measure. You can then divide the total points in the backlog by that velocity to know how many sprints will be needed.
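For example, with made-up numbers: if the stories the team picks for the first sprint are worth 3 + 5 + 8 + 5 = 21 points, the initial velocity is 21 points per sprint, and a backlog totalling 210 points would then need roughly 210 / 21 = 10 sprints.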
That velocity will change and settle over the first 2-3 sprints, so it's always good to keep an eye on that value.
If you want to change to using story points instead of duration, you just have to start estimating in story points. (I'm assuming here you have the authority to make that decision for your team.)
Pick a scale: it could be small/medium/large, the Fibonacci sequence, or 1 to 5. Whatever it is, pick one and use it for several sprints; this will give you your velocity. If you start switching from one scale to another, velocities between scales won't be comparable (i.e. don't do it). These estimates should involve the whole Scrum team.
Having said that, you still need an idea of how much this is going to cost you. There aren't many accountants who will accept the answer "I'll tell you how much this is going to cost in 6 months". So you still need to estimate the project in duration as well; this will give you the cost. This estimate is probably going to be done by a senior person on the team.
Then every month your velocity will tell you and the accountants how accurate that first cost estimate was and you can adapt accordingly.
Start by making one day equal one point (or some strict ratio). It is a good way to get started. After a couple of sprints you can start encouraging them to use more relative points (ie. how big is this compared to that thing).
The problem is that story points define effort.
Days are duration.
The two have an almost random relationship: duration = f(effort). That function is based on the skill of the person actually doing the work.
A person knows how long they will take to do the work. That's duration. In days.
They don't know this abstract 'effort' thing. They don't know how long a hypothetical person of average skills will require to do it.
The best you can do is both -- story points (effort) and days (duration).
You can't replace one with the other. If you try to use only effort, then you'll eventually need to get to days for planning purposes. You'll have to apply a person to the story points and compute a duration from the effort.
