Given a list of reviews (10000+), such as:
has great pizzas, price is low and customer service is average
Customer service was horrible, there was a long wait during lunch, food was ok
has amazing pizzas and I highly recommend it, they also have deals/specials weekly. Very upscale, and the atmosphere is great
etc.
The goal is to find the most significant reviews (around 20) out of all. The review should encapsulate as much information about the merchant as possible. (Food satisfaction, Price, Wait Time, etc)
I have been looking at some ways of doing this, chunking/collocation/idf but not sure if any of them are viable.
You can do a multi-label classification task for each review, then:
You can retrieve the reviews with more tags (order by count(tags) desc)
Give a weight (positive or negative) to each label and retrieve the reviews with the highest sum(weight) (order by sum(weight) desc)
in both cases you can exclude labels or reviews with certain labels
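A minimal sketch of the two orderings above, assuming every review has already been tagged by some multi-label classifier. The tag names, the weights, and the function names are all hypothetical and only there for illustration.

```python
from typing import AbstractSet, Dict, Iterable, List, Set, Tuple

# Hypothetical classifier output: (review text, set of predicted tags).
TaggedReview = Tuple[str, Set[str]]


def rank_by_tag_count(reviews: Iterable[TaggedReview],
                      excluded: AbstractSet[str] = frozenset()) -> List[TaggedReview]:
    """Option 1: reviews covering the most tags first (excluded tags are ignored)."""
    return sorted(reviews, key=lambda r: len(r[1] - excluded), reverse=True)


def rank_by_weight(reviews: Iterable[TaggedReview],
                   weights: Dict[str, float],
                   excluded: AbstractSet[str] = frozenset()) -> List[TaggedReview]:
    """Option 2: reviews with the highest summed tag weight first."""
    def score(review: TaggedReview) -> float:
        return sum(weights.get(tag, 0.0) for tag in review[1] - excluded)
    return sorted(reviews, key=score, reverse=True)


r1 = ("great pizzas, low price, average service", {"food", "price", "service"})
r2 = ("food was ok", {"food"})
r3 = ("amazing pizzas, weekly deals, great atmosphere",
      {"food", "price", "atmosphere", "recommendation"})
weights = {"food": 1.0, "price": 2.0, "service": 3.0,
           "atmosphere": 0.5, "recommendation": 0.5}

top_by_count = rank_by_tag_count([r1, r2, r3])[0]
top_by_weight = rank_by_weight([r1, r2, r3], weights)[0]
```

Taking the first 20 entries of either list gives the "most significant" reviews under that criterion.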
Absolute data-science/statistics beginner here, please explain like I'm 15.
I have a database with the following:
food-related products with 0-n user-defined tags (such as "vegetarian", "veal", "fish", "low-carb")
purchase history of customers (customer_id, product_id, amount, time, ...)
I want to know which tags (or the meaning behind the tags, to be precise, since customers do not actually see the tags) have a measurable impact on the buy-decision of a group of customers (meaning more than 1, less than n). For example, I would imagine that a vegetarian customer would predominantly buy vegetarian products (meaning that yes, there is a group of people for the "vegetarian" tag, where the buy-decision is influenced).
My first idea is that there would be a multimodal distribution in the number of products bought by a customer (because vegetarians would buy more vegetarian products on average), but how would I know to which tag that correlates to if there are multiple?
I am new to DDD and decided to practice it with a dental clinic system I'm developing, but I'm struggling with modeling the domain so an extra pair of eyes will be greatly appreciated.
For this dentistry system, the domain expert told me that a patient holds only one medical history. The medical history must have a Record Number which is unique in the system. The medical history holds dental treatments the patient could have (like planned treatments) as well as treatments that the patient already had. Every treatment has a price, and so the medical history contains a Total price (based on planned/applied treatments). Whenever a patient gets a treatment done, he/she has to pay at least 50% of that treatment's price, meaning he/she will eventually pay the rest of it at future appointments (if no treatment plan exists, he/she has to pay 100% of the price). Finally, this clinic gives patients the option to pay in different currencies, because sometimes a patient who comes in for the day has only euros, but then decides he wants a plan and will pay for future appointments in pounds.
Based on all this, and my beginner knowledge of DDD, my first thinking is that I have these entities:
Patient
Treatment
Dentist
I will have several value objects, but the most important ones might be:
Money (for prices and currency)
Signature (for applied treatments)
Tooth or Teeth (used on Treatment entity)
And I can only find one aggregate which is Medical history since it puts together patient info, as well as treatments (planned and applied). But this will mean that whenever I update a Medical History, I will have to update patient info and treatments, even if one of those never changed. Patients could change their personal information, which will be reflected in medical history, but it doesn't affect treatments.
I am a bit confused on how to model this. Please help!
Remember that Aggregates, and by extension Bounded Contexts (BC), are a grouping of data and business logic that belong together (and most likely things that need to change transactionally). The data that an aggregate contains is there because the business logic needs it, not because some application screen needs it. This is very important to clear up some confusion and to free you of some constraints in order to design your aggregates.
For example, when you display the Medical History to the user, you might want to show the Patient's name, address, age and so on, and also the treatments prices, but if you think about it, you don't need any of this to manage the Medical History. From what you say, the Medical History has a Record Number, a PatientId, and a list of TreatmentIds with maybe the Dates when they were done.
When you want to display the Medical History to the user, you can use UI Composition. So, you get the Medical History (which is mostly a bunch of Ids and dates). Then from the Medical History's PatientId, you can get the Patient's information from the BC that owns it. From the TreatmentIds, you can get the Treatment descriptions from the BC that owns them, and their prices from the BC they belong to.
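To make the idea concrete, here is a rough sketch of UI Composition in Python. It assumes the Medical History only stores Ids and dates, and that each owning BC exposes some lookup function; the class, the function names, and the lookups are hypothetical stand-ins, not a recommended design.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Dict, List, Tuple


@dataclass
class MedicalHistory:
    """Holds only identifiers and dates, no patient details and no prices."""
    record_number: str
    patient_id: str
    treatments: List[Tuple[str, date]] = field(default_factory=list)  # (treatment_id, date)


def compose_history_view(history: MedicalHistory,
                         get_patient_name: Callable[[str], str],
                         get_description: Callable[[str], str],
                         get_price: Callable[[str], str]) -> Dict:
    """Build the screen model by asking each (hypothetical) BC for its own data."""
    return {
        "record_number": history.record_number,
        "patient": get_patient_name(history.patient_id),      # e.g. Patients Care BC
        "treatments": [
            {
                "description": get_description(treatment_id),  # e.g. Marketing BC
                "price": get_price(treatment_id),              # e.g. Finances BC
                "date": done_on.isoformat(),
            }
            for treatment_id, done_on in history.treatments
        ],
    }


history = MedicalHistory("MH-001", "p1", [("t1", date(2024, 1, 15))])
view = compose_history_view(
    history,
    {"p1": "Ann"}.get,          # in-memory stand-ins for calls to other BCs
    {"t1": "Filling"}.get,
    {"t1": "80.00 EUR"}.get,
)
```

Changing the patient's name or a treatment's price then never touches the Medical History itself.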
So, based on that, you can build your aggregates not based on the "relevant names" on your domain like Patient, Treatment or Dentist, but by the business logic they implement.
This is just wild guessing, but I can think of:
BC Marketing (for lack of a better name): Contains the descriptions of all treatments, information about the Dentists, Information about the rooms and materials, etc. So, texts, pictures and other details.
BC Finances: Contains information about the prices of each treatment, payment records of each payment, credits and debits of each patient, etc. In charge of keeping track of all these things. For example, it could know when a treatment starts/ends and depending on the Patient's record, require 50 or 100% payment. There's no need of direct relation to the Medical History here, it only needs to know if it's the first treatment or not.
BC Scheduling: In charge of scheduling new treatments and keeping track of when they start and finish. This could contain the History, or it could potentially be somewhere else if necessary.
BC Medical: In charge of keeping all the medical records, allergies, medical details of the status of the teeth, etc.
BC Patients Care: In charge of tracking patients' information, name, nationality, contact details, etc.
Once you have an idea of the Bounded Context you can define the aggregates. There can be one or more per BC. Also, some things might not be an aggregate. For example, the Medical History might not require an actual aggregate if it's basically a record of treatment Ids and the dates they were made and there's no business logic associated (the history is not going to deny a treatment, have opinions on when a treatment should happen and so on, it's just a history).
Don't take this as a recommended design, but just as a thought process to come up with your own solution.
Entities have an Id, whereas Value Objects have structural equality, which means that two value objects with the same values are considered the same.
In case of Money, there is no difference between two $5 bills, so it can be a value object.
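As an illustration of the difference, here is a small Python sketch. The `Money` and `Patient` classes and their fields are assumptions made for the example; the point is only that the value object compares by its values while the entity compares by its Id.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Money:
    """Value object: immutable, equal when amount and currency are equal."""
    amount: Decimal
    currency: str


@dataclass(eq=False)
class Patient:
    """Entity: equal only when the Id is equal, whatever the other attributes say."""
    patient_id: str
    name: str

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Patient) and self.patient_id == other.patient_id

    def __hash__(self) -> int:
        return hash(self.patient_id)


five_a = Money(Decimal("5"), "USD")
five_b = Money(Decimal("5"), "USD")
same_bills = five_a == five_b                                     # True: same values

same_patient = Patient("p1", "Ann") == Patient("p1", "Ann Smith")  # True: same Id
other_patient = Patient("p1", "Ann") == Patient("p2", "Ann")       # False: different Id
```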
You have not described the role and attributes of Tooth and Signature.
In case of Tooth, does it matter whose tooth it is? Can you replace a patient's tooth with any other tooth which has the same attributes? If it does matter, then Tooth requires an Id and is therefore an entity.
In case of Signature, how are you going to compare two signatures? Do you have an image recognition software that can compare the look of two Signatures and decide that they are the same? You might have two patients with similar looking signatures, should their signature be treated as the same?
If you choose Medical history to be an Aggregate, then you should treat it as one object. Do you want to load the entire Medical history, in order to add a new Treatment to it? Can a Treatment be associated with another Entity, such as Dentist? If you can use a portion of Medical history (such as Treatment) individually then it is not an aggregate.
Some good tutorials:
Entity vs Value Object by Vladimir Khorikov
Entities, Value Objects, Aggregates and Roots by Jimmy Bogard
I use Solr for product filtering on our website,
for example you can have a product filter where you can filter a database of televisions by size, price, company, etc. I found Solr+FilterQuery to be very efficient for such functionality. I have a separate core that has the product info of all TVs in our DB.
I have another Core for product reviews. The review can be on a specific product type or company. So someone can write a review on a Samsung TV or Samsung customer service. So when someone searches for a text (for example "Samsung TV review" or "Samsung customer service"), I search this core.
Now I want to merge the results from the above cores. So when someone searches for 'samsung 46 lcd contrast ratio review', I essentially want to filter the TVs by Company (Samsung), then by size (46") and then find reviews that have the text "contrast ratio review". I have no clue how to do this. Basically I want to merge the results by document ID and add additional columns from result 2 to result 1.
I have seen suggestions to flatten out the data, but I want to use the reviews index with a lot of other filters, so I am not sure if that's a good idea. Moreover, if new reviews start coming in I don't want to reindex all the product cores (even delta reindexing will touch a lot of products).
Any ideas on how to achieve this?
If I got your question right what you are looking for is JOIN functionality.
http://www.slideshare.net/lucenerevolution/grouping-and-joining-in-lucenesolr
http://wiki.apache.org/solr/Join
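A rough sketch of what such a request could look like, built with Python's standard library. The core names (`products`, `reviews`), the field names (`id`, `product_id`, `company`, `size`, `text`) and the host are assumptions about your schema, not something taken from your setup. As far as I understand the join parser, it only returns documents from the "to" side (here: products) and does not copy review fields into them, so check the links above before relying on it.

```python
from urllib.parse import urlencode


def build_join_query(company: str, size: int, review_text: str,
                     solr_base: str = "http://localhost:8983/solr") -> str:
    """Build a select URL on the (hypothetical) products core that keeps only
    products having at least one matching document in the reviews core."""
    params = [
        ("q", "*:*"),
        ("fq", f"company:{company}"),
        ("fq", f"size:{size}"),
        # Join: take reviews matching the text, map reviews.product_id -> products.id
        ("fq", f'{{!join from=product_id to=id fromIndex=reviews}}text:"{review_text}"'),
    ]
    return f"{solr_base}/products/select?{urlencode(params)}"


url = build_join_query("Samsung", 46, "contrast ratio")
```

To show the review text next to each product you would still need a second query against the reviews core, filtered by the product IDs returned here.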
Instead of rating items with grades from 1 to 10, I would like to have 1 vs 1 "fights". Two items are displayed beside each other and you pick the one which you like more. Based on these "fight" results, an algorithm should calculate ratings for each item.
You can see this approach on Flickchart.com where movies are rated using this approach.
It looks like this:
As you can see, items are pushed upwards if they win a "fight". The ranking is always changing based on the "fight" results. But this can't be based only on the win rate (here 54%), since it's harder to win against "Titanic" than against "25th Hour" or so.
There are a few things which are quite unclear for me:
- How are the ratings calculated? How do you decide which film is in first place in the ranking? You have to consider how often an item wins and how good the beaten items are.
- How to choose which items have a "fight"?
Of course, you can't tell me how Flickchart exactly does this all. But maybe you can tell me how it could be done. Thanks in advance!
This might not be exactly what Flickchart is doing, but you could use a variant of the Elo algorithm used in chess (and other sports), since these are essentially fights/games that they win/lose.
Basically, all movies start off with 0 wins/losses and every time they get a win they get a certain amount of points. You usually have an average around 20 (but any number will do) and winning against a movie with the same rating as yourself will give exactly that 20. Winning against a bad movie will maybe give around 10 points, while winning against a better movie might give you 30 points. The other way around, losing to a good movie you only lose 10 points, but if you lose to a bad movie, you lose 30 points.
The specifics of the algorithm are in the Wikipedia link.
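For illustration, a minimal sketch of the standard Elo update in Python. The K-factor of 40 is an assumption, chosen so that a win between equally rated movies is worth exactly the 20 points mentioned above; the starting rating of 1500 in the example is also just a common convention.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B according to the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner: float, loser: float, k: float = 40.0) -> Tuple[float, float]:
    """Return the new (winner, loser) ratings after one fight."""
    gain = k * (1.0 - expected_score(winner, loser))
    return winner + gain, loser - gain


even = update_elo(1500, 1500)    # equal ratings: winner gains 20, loser drops 20
upset = update_elo(1400, 1600)   # beating a stronger movie is worth more than 20
easy = update_elo(1600, 1400)    # beating a weaker movie is worth less than 20
```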
How are the ratings calculated? How do you decide which film is in first place in the ranking? You have to consider how often an item wins and how good the beaten items are.
What you want is a weighted rating, also called a Bayesian estimate.
I think IMDB's Top 250 movies is a better starting point to make a ranking website. Some movies have 300,000+ votes while others have fewer than 50,000. IMDB uses a Bayesian estimate to rank movies against one another without unfairly weighting popular movies. The algorithm is given at the bottom of the page:
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

where:
R = average for the movie (mean) = (Rating)
v = number of votes for the movie = (votes)
m = minimum votes required to be listed in the Top 250 (currently 3000)
C = the mean vote across the whole report (currently 6.9)

For the Top 250, only votes from regular voters are considered.
I don't know how IMDB chose 3000 as their minimum vote. They could have chosen 1000 or 10000, and the list would have been more or less the same. Maybe they're using "average number of votes after 6 weeks in the box office" or maybe they're using trial and error.
In any case, it doesn't really matter. The formula above is pretty much the standard for normalizing votes on ranking websites, and I'm almost certain Flickchart uses something similar in the background.
The formula works so well because it "pulls" ratings toward the mean, so ratings above the mean are slightly decreased, ratings below the mean are slightly increased. However, the strength of the pull is inversely proportional to the number of votes a movie has. So movies with few votes are pulled more aggressively toward the mean than movies with lots of votes. Here are two data points to demonstrate the property:
Rank  Movie                        Votes     Avg Rating  Weighted Rating
----  ---------------------------  --------  ----------  ---------------
219   La Strada                    15,000+   8.2         8.0
221   Pirates of the Caribbean 2   210,000+  8.0         8.0
Both movies' ratings are pulled down, but the pull on La Strada is more dramatic since it has fewer votes and therefore is not as representative as ratings for PotC.
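Here is the formula as a small Python function, checked against the two data points above. The defaults for m and C are the IMDB values quoted earlier; the vote counts are the rounded "15,000+" and "210,000+" figures, so the results are approximate.

```python
def weighted_rating(R: float, v: int, m: int = 3000, C: float = 6.9) -> float:
    """Bayesian estimate: WR = (v / (v + m)) * R + (m / (v + m)) * C."""
    return (v / (v + m)) * R + (m / (v + m)) * C


la_strada = weighted_rating(R=8.2, v=15_000)    # roughly 7.98
pirates = weighted_rating(R=8.0, v=210_000)     # roughly 7.98

# The pull toward the mean is stronger for the movie with fewer votes.
pull_la_strada = 8.2 - la_strada
pull_pirates = 8.0 - pirates
```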
For your specific case, you have two items in a "fight". You should probably design your table as follows:
Items
-----
ItemID (pk)
FightsWon (int)
FightsEngaged (int)
The average rating is FightsWon / FightsEngaged. The weighted rating is calculated using the formula above.
When a user chooses a winner in a fight, increase the winning item's FightsWon field by 1 and increase both items' FightsEngaged fields by 1.
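A sketch of that bookkeeping in Python, using an in-memory object in place of the table row. The values m = 10 and C = 0.5 are assumptions (a prior of "wins half of its fights"), not something prescribed above; the weighted form is the same formula rearranged so it also works for an item with zero fights.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: int
    fights_won: int = 0
    fights_engaged: int = 0


def record_fight(winner: Item, loser: Item) -> None:
    """Winner gets +1 FightsWon; both get +1 FightsEngaged."""
    winner.fights_won += 1
    winner.fights_engaged += 1
    loser.fights_engaged += 1


def weighted_win_rate(item: Item, m: int = 10, C: float = 0.5) -> float:
    """(v/(v+m))*R + (m/(v+m))*C with R = won/v, rewritten as (won + m*C)/(v + m)."""
    return (item.fights_won + m * C) / (item.fights_engaged + m)


a, b, newcomer = Item(1), Item(2), Item(3)
record_fight(winner=a, loser=b)
```

After a single fight, item `a` has a raw win rate of 100% but a weighted rate only slightly above 0.5, which is the intended damping.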
Hope this helps!
- Juliet
I've been toying with the problem of ranking items by means of pair-wise comparison for some time myself, and wanted to take the time to describe the ideas I came up with so far.
For now I'm simply sorting by <fights won> / <total fights>, highest first. This works fine if you're the only one voting, or if there are a lot of people voting. Otherwise it can quickly become inaccurate.
One problem here is how to choose which two items should fight. One thing that does seem to work well (subjectively) is to let the item that has the fewest fights so far fight against a random item. This leads to a relatively uniform number of fights for the items (-> accuracy), at the cost of possibly being boring for the voter(s). They will often be comparing the newest item against something else, which is kinda boring. To alleviate that, you can take the n items with the lowest fight-count and choose one of those randomly as the first contender.
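A quick sketch of that selection rule in Python. The dictionary layout and the default n = 5 are assumptions for the example.

```python
import random
from typing import Dict, Hashable, Tuple


def pick_fight(fight_counts: Dict[Hashable, int], n: int = 5,
               rng=random) -> Tuple[Hashable, Hashable]:
    """Pick the first contender at random from the n least-fought items,
    then a random opponent from all remaining items."""
    if len(fight_counts) < 2:
        raise ValueError("need at least two items to stage a fight")
    least_fought = sorted(fight_counts, key=fight_counts.get)[:n]
    first = rng.choice(least_fought)
    second = rng.choice([item for item in fight_counts if item != first])
    return first, second


counts = {"a": 10, "b": 0, "c": 7, "d": 1}
first, second = pick_fight(counts, n=2, rng=random.Random(0))
```

With n = 1 this degenerates to "always the least-fought item", which is the boring variant described above.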
You mentioned that you want to make victories against strong opponents count more than against weak ones. As mentioned in other posts above, rating systems used for chess and the like (Elo, Glicko) may work. Personally I would love to use Microsoft's TrueSkill, as it seems to be the most accurate and also provides a good way to pick two items to pit against each other -- the ones with the highest draw-probability as calculated by TrueSkill. But alas, my math understanding is not good enough to really understand and implement the details of the system, and it may be subject to licensing fees anyway...
Collective Choice: Competitive Ranking Systems has a nice overview of a few different rating systems if you need more information/inspiration.
Other than rating systems, you could also try various simple ladder systems. One example:
1. Randomize the list of items, so they are ranked 1 to n
2. Pick two items at random and let them fight
3. If the winner is ranked above the loser: Do nothing
4. If the loser is ranked above the winner:
   - If the loser is directly above the winner: Swap them
   - Else: Move the winner up the ladder x% toward the loser of the fight.
5. Goto 2
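The ladder steps above could be sketched like this in Python. How exactly "x% toward the loser" is rounded is my assumption here (floor of the distance times x, but at least one position); index 0 is the top of the ladder.

```python
import math
from typing import List


def apply_fight(ladder: List[str], winner: str, loser: str, x: float = 0.5) -> None:
    """Update the ladder in place after one fight (index 0 = rank 1)."""
    w, l = ladder.index(winner), ladder.index(loser)
    if w < l:
        return                                   # winner already ranked above: do nothing
    if w - l == 1:
        ladder[l], ladder[w] = ladder[w], ladder[l]   # loser directly above: swap
        return
    steps = max(1, math.floor((w - l) * x))      # move x (as a fraction) of the gap
    ladder.insert(w - steps, ladder.pop(w))


ladder = ["a", "b", "c", "d", "e"]
apply_fight(ladder, winner="e", loser="a", x=0.5)   # "e" climbs 2 of the 4 places
after_climb = list(ladder)
apply_fight(ladder, winner="b", loser="a")          # adjacent: swap
after_swap = list(ladder)
```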
This is relatively unstable in the beginning, but should improve over time. It never ceases to fluctuate though.
Hope I could help at least a little.
As for flickchart, I've been playing around with it a little bit, and I think the rating system is pretty unsophisticated. In pseudo-code, my guess is that it looks something like this:
if rank(loser) == null and rank(winner) == null
    insert loser at position estimated from global rank
    insert winner at position estimated from global rank
else if rank(winner) == null or rank(winner) < rank(loser)
    advance winner to loser's position and demote loser and all following by 1
Why do I think this? First, I'm completely convinced that their Bayesian priors are not based on a careful mining of my previous choices. They seem to have no way to guess that because I like Return of the Jedi that I like The Empire Strikes Back. In fact, they can't figure out that because I've seen Home Alone 2 that I may have seen Home Alone 1. After hundreds of ratings, the choice hasn't come up.
Second of all, if you look at the above code you might find a little bug, which you will definitely notice on the site. You may notice that sometimes you will make a choice and the winner will slide by one. This seems to only happen when the loser wasn't previously added. My guess is that what is happening is that the loser is being added higher than the winner.
Other than that, you will notice that rankings do not change at all unless a lower ranked movie beats a higher ranked movie directly. I don't think any real scores are being kept: the site seems to be entirely memoryless except for the ordinal rank of each movie and your most recent rating.
Or you might want to use a variant of PageRank; see Prof. Wilf's cool description.
After having thought things through, the best solution for this film ranking is as follows.
Required data:
The number of votes taken on each pairing of films.
And also a sorted version of this data grouped like in radix sort
How many times each film was voted for in each pairing of films
Optional data:
How many times each film has been involved in a vote for each user
How to select a vote for a user:
Pick out a vote selection from the sorted list in the lowest used radix group (randomly)
Optional: use the user's personal voting stats to filter out films they've been asked to vote on too many times, possibly moving onto higher radix buckets if there's nothing suitable.
How to calculate the ranking score for a film:
Start the score at 0
Go through each other film in the system
Add voteswon / votestaken versus this film to the score
If no votes have been taken between these two films, add 0.5 instead (This is of course assuming you want new films to start out as average in the rankings)
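The scoring steps above, sketched in Python. The two dictionaries are an assumed in-memory layout for the "required data": `votes_won[(a, b)]` is how often a was picked over b, and `votes_taken[frozenset((a, b))]` is the number of votes held on that pairing.

```python
from typing import Dict, FrozenSet, Iterable, Tuple


def film_score(film: str, films: Iterable[str],
               votes_won: Dict[Tuple[str, str], int],
               votes_taken: Dict[FrozenSet[str], int]) -> float:
    """Sum, over every other film, the share of votes this film won against it."""
    score = 0.0
    for other in films:
        if other == film:
            continue
        taken = votes_taken.get(frozenset((film, other)), 0)
        if taken == 0:
            score += 0.5        # unseen pairing: assume the film is average
        else:
            score += votes_won.get((film, other), 0) / taken
    return score


films = ["A", "B", "C"]
votes_won = {("A", "B"): 3, ("B", "A"): 1}
votes_taken = {frozenset(("A", "B")): 4}
scores = {f: film_score(f, films, votes_won, votes_taken) for f in films}
ranking = sorted(films, key=scores.get, reverse=True)
```

Film C has no votes yet, so it lands in the middle of the ranking, which is the "new films start out as average" behaviour described above.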
Note: The optional stuff is just there to stop the user getting bored, but may be useful for other statistics also, especially if you include how many times they voted for that film over another.
Making sure that newly added films have statistics collected on them ASAP, with votes distributed very evenly across all existing films, is vital to keeping stats correct for the rest of the films. It may be worth staggering the entry of a bunch of new films to the system to avoid temporary glitches in the rankings (though these would be neither immediate nor severe).
===THIS IS THE ORIGINAL ANSWER===
The problem is actually very easy. I am assuming here that you want to order by preference to vote for the film i.e. the #1 ranked film is the film that is most likely to be chosen in the vote. If you make it so that in each vote, you choose two films completely at random you can calculate this with simple maths.
Firstly each selection of two films to vote on is equally likely, so results from each vote can just be added together for a score (saves multiplying by 1/nC2 on everything). And obviously the probability of someone voting for one specific film against another specific film is just votesforthisfilm / numberofvotes.
So to calculate the score for one film, you just sum votesforthisfilm / numberofvotes for every film it can be matched against.
There is a little trouble here if you add a new film which hasn't had a considerable number of votes against all the other films, so you probably want to leave it out of the rankings until a number of votes has built up.
===WHAT FOLLOWS IS MOSTLY WRONG AND IS MAINLY HERE FOR HISTORICAL CONTEXT===
This scoring method is derived from a Markov chain of your voting system, assuming that all possible vote questions were equally likely. [This first sentence is wrong because all vote questions have to be equally likely in the Markov chain to get meaningful results] Of course, this is not the case, and actually you can fix this as well, since you know how likely each vote question was: it's just the number of votes that have been done on that question! [The probability of getting a particular vote question is actually irrelevant, so this doesn't help] In this way, using the same graph but with the edges weighted by votes done...
Probability of getting each film given that it was included in the vote is the same as probability of getting each film and it being in the vote divided by the probability it was included in the vote. This comes to sumoverallvotes((votesforthisfilm / numberofvotes) * numberofvotes) / totalnumberofvotes divided by sumoverallvotes(numberofvotes) / totalnumberofvotes. With much cancelling this comes to votesforthisfilmoverallvotes / numberofvotesinvolvingthisfilm. Which is really simple!
http://en.wikipedia.org/wiki/Maximize_Affirmed_Majorities?
(Or the BestThing voting algorithm, originally called the VeryBlindDate voting algorithm)
I believe this kind of 1 vs. 1 scenario might be a type of conjoint analysis called Discrete Choice. I see these fairly often in web surveys for market research. The customer is generally asked to choose between two+ different sets of features that they would prefer the most. Unfortunately it is fairly complicated (for a non-statistics guy like myself) so you may have difficulty understanding it.
I heartily recommend the book Programming Collective Intelligence for all sorts of interesting algorithms and data analysis along these lines.
I thought I would ask the SO community for help with a project that I am currently working on. I need to model the price for a widget in a market situation. The price for the widget should be a result of the current supply and demand. Users will be able to buy and sell the widget at the fixed price. As users buy the widget the demand will go up along with the price. Conversely as users sell the widget the supply will go up and the price will go down. The quantity and current price of the widget will be stored in a database along with the total number of buys and sells for the widget.
Protrade.com has an excellent example of buying and trading widgets (players and teams), I would want to model my system in a similar fashion.
Are there any good programming libraries that will accurately model a market based on supply and demand?
Unfortunately I do not know of any libraries, but perhaps you can tap into Excel's statistics functions.
My opinion follows.
This is why economics is so boring, everything is supply/demand.
Something along the lines of the following should work as a start:
ListPrice = (Cost + Profit) * (demand/supply * economic-factor)
where economic-factor is some determined constant.
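As a starting point, the formula translates directly into Python. The parameter names and the guard against zero supply are my additions; how you measure demand and supply (for example, total buys and total sells) is left open.

```python
def list_price(cost: float, profit: float, demand: float, supply: float,
               economic_factor: float = 1.0) -> float:
    """ListPrice = (Cost + Profit) * (demand / supply * economic-factor)."""
    if supply <= 0:
        raise ValueError("supply must be positive")
    return (cost + profit) * (demand / supply * economic_factor)


high_demand = list_price(cost=8.0, profit=2.0, demand=150, supply=100)   # price rises
low_demand = list_price(cost=8.0, profit=2.0, demand=50, supply=100)     # price falls
```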
If you have some historical data, e.g. daily supply/demand ratios, you could factor it in, perhaps using some time-based scale.