In the secretary problem, the optimal strategy is to reject the first n/e candidates and then select the first candidate who is better than the best of those rejected. For example, if the number of candidates is 100, we reject the first 37 and then pick the first one who beats the best of them. This is the simplest form of the theory, but what if the number of candidates is not known? What is the best strategy then, or in other words, when should we start selecting/accepting candidates?
When the number of candidates is unknown, that stopping rule is no longer valid; the answer depends on a prior probability distribution over the number of candidates, or you can use the 1/e law of best choice. Please see Unknown Number of Participants & Best Choice.
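For background, the fixed-n cutoff rule described above is easy to sanity-check with a quick simulation; the candidate "scores" below are just indices, purely for illustration:

```python
import math
import random

def secretary_trial(n: int) -> bool:
    """One trial of the classic cutoff rule; True if the overall best candidate is chosen."""
    candidates = list(range(n))              # higher index = better candidate
    random.shuffle(candidates)
    cutoff = round(n / math.e)               # reject the first ~n/e candidates (37 for n = 100)
    best_rejected = max(candidates[:cutoff], default=-1)
    for score in candidates[cutoff:]:
        if score > best_rejected:            # first candidate beating everyone in the rejected block
            return score == n - 1
    return candidates[-1] == n - 1           # otherwise we are stuck with the last candidate

n, trials = 100, 100_000
wins = sum(secretary_trial(n) for _ in range(trials))
print(f"success rate ≈ {wins / trials:.3f}")  # comes out near 1/e ≈ 0.368
```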
First time asking a question, apologies if I've got anything wrong.
What would be the best way to approach this problem? (It's similar to the travelling salesman problem, but I'm not sure if it runs into the same issues.)
You have a list of "tasks" at certain locations (cities) and a group of "people" who can complete those tasks (salesmen). This is structured over a day, where some tasks may need to be completed before a specific time and may require specific "tools" (of which a set number are available). The difference is that the distance between any two locations is the same in all circumstances, but everyone has to return to the start. Therefore, rather than trying to minimise the distance travelled, you instead want to maximise the time each salesman stays at the initial starting node instead of moving. This also gives you pre-defined requirements.
The program doesn't need to find an optimal solution, just an acceptable one (greater than a certain value). Would you just bash out each case? If so, what would be the best language to use for bashing out the solutions?
Thanks
EDIT - Just to confirm, the prerequisite that all the cities are the same distance from each other is just a simplification of the problem, not reflective of real life.
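To clarify what I mean by "bashing out each case", I imagine something roughly like this sketch; the task data, timings, and scoring below are completely made up, and tool limits and smarter task ordering are left out:

```python
import itertools

# Hypothetical data: each task has a deadline (hour of day).
tasks = {
    "fix_sign":   {"deadline": 12},
    "paint_wall": {"deadline": 17},
    "clean_yard": {"deadline": 15},
}
people = ["alice", "bob"]
TRAVEL_TIME = 1          # every location is the same distance from every other and from the start
TASK_TIME = 2            # assume every task takes the same amount of time
DAY_START, DAY_END = 8, 18

def idle_time(assignment):
    """assignment: one person per task (in dict order). Returns total time the
    people spend back at the start node, or None if a deadline or the day is missed."""
    total = 0
    for person in people:
        t = DAY_START
        for task_name, assignee in zip(tasks, assignment):
            if assignee != person:
                continue
            t += TRAVEL_TIME + TASK_TIME     # travel out and do the task
            if t > tasks[task_name]["deadline"]:
                return None
            t += TRAVEL_TIME                 # return to the start node
        if t > DAY_END:
            return None
        total += DAY_END - t
    return total

THRESHOLD = 8            # "acceptable" = at least this much time spent at the start node
for assignment in itertools.product(people, repeat=len(tasks)):
    score = idle_time(assignment)
    if score is not None and score >= THRESHOLD:
        print(dict(zip(tasks, assignment)), "idle time:", score)
        break
```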
I have an object with many fields. Each field has a different range of values. I want to use Hypothesis to generate different instances of this object.
Is there a limit to the number of combinations of field values Hypothesis can handle? Or what does the search tree Hypothesis creates look like? I don't need all the combinations, but I want to make sure that I get a fair number of combinations in which I test many different values for each field. I want to make sure Hypothesis is not doing a DFS until it hits the max number of examples to generate.
TLDR: don't worry, this is a common use-case and even a naive strategy works very well.
The actual search process used by Hypothesis is complicated (as in, "lead author's PhD topic"), but it's definitely not a depth-first search! Briefly, it's a uniform distribution layered on a pseudo-random number generator, with a coverage-guided fuzzer biasing that towards less-explored code paths, with strategy-specific heuristics on top of that.
In general, I trust this process to pick good examples far more than I trust my own judgement, or that of anyone without years of experience in QA or testing research!
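For example, a naive strategy for an object with several fields can be written with st.builds; the Widget class and its ranges below are just stand-ins for your own object:

```python
from dataclasses import dataclass

from hypothesis import given, strategies as st

@dataclass
class Widget:                      # stand-in for "an object with many fields"
    width: int
    height: int
    label: str
    ratio: float

widgets = st.builds(
    Widget,
    width=st.integers(min_value=1, max_value=100),
    height=st.integers(min_value=1, max_value=100),
    label=st.text(min_size=0, max_size=20),
    ratio=st.floats(min_value=0.0, max_value=1.0, allow_nan=False),
)

@given(widgets)
def test_widget_roundtrip(w: Widget):
    # Hypothesis draws diverse combinations of field values across examples;
    # it does not enumerate the cross product or walk it depth-first.
    assert 1 <= w.width <= 100 and 0.0 <= w.ratio <= 1.0
```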
So, a bit of explanation to preface the question. In variants of Open Face Chinese poker, you are dealt cards one at a time, which are to be placed into three different rows, and the goal is to make each row increasingly better, and of course to get the best hands possible. A difference from normal poker is that the top row only contains three cards, so three of a kind is the best possible hand you can get there. In a variant of this called Pineapple, which is what I'm working on a bot for, you are dealt three cards at a time after the initial 5, and you discard one of those three cards each round.
Now, there's a special rule called Fantasyland: if you get a pair of queens or better in the top row, and still manage to get successively better hands in the middle and bottom rows, your next round becomes a Fantasyland round. This is a round in which you are dealt 15 cards at the same time and are free to construct the best three rows possible (rows of 3, 5, and 5 cards, discarding 2 of them). Each row yields a certain number of points (royalties, as they're called) depending on which hand is constructed, and each successive row needs better and better hands to yield the same amount of points.
Trying to optimize solutions for this seemed like a natural starting point, and one of the most interesting parts as well, so I started working on it. My first attempt, which is also where I'm stuck, was to use simulated annealing to do local search optimization. The energy/evaluation function is the number of points, and at first I tried a move/neighbor function that simply swaps two cards at random, having first placed them as they were drawn. This worked decently, managing a mean of around 6 points per hand, which isn't bad, but I often noticed that I could spot better solutions by swapping more than one pair of cards at the same time. So I changed the move/neighbor function to swap several pairs of cards at once, and also tried swapping a random number of pairs (between 1 and anywhere from 3 to 5), which yielded slightly better results, but I still often spot better solutions just by looking.
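Roughly, the annealing loop and neighbor function look like this (heavily simplified; royalty_score() is the evaluation function described above, and the two discards can be treated as a fourth "row" so they take part in swaps):

```python
import math
import random

def neighbor(rows, num_swaps):
    """Swap `num_swaps` random pairs of cards between/within rows at once."""
    new_rows = [list(r) for r in rows]
    positions = [(i, j) for i, r in enumerate(new_rows) for j in range(len(r))]
    for _ in range(num_swaps):
        (r1, c1), (r2, c2) = random.sample(positions, 2)
        new_rows[r1][c1], new_rows[r2][c2] = new_rows[r2][c2], new_rows[r1][c1]
    return new_rows

def anneal(initial_rows, royalty_score, steps=20000, t_start=5.0, t_end=0.01):
    """royalty_score(rows) is assumed to return total points; higher is better."""
    current, current_score = initial_rows, royalty_score(initial_rows)
    best, best_score = current, current_score
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / steps)   # geometric cooling schedule
        candidate = neighbor(current, num_swaps=random.randint(1, 3))
        cand_score = royalty_score(candidate)
        # accept improvements always, and worse moves with a temperature-dependent probability
        if cand_score >= current_score or random.random() < math.exp((cand_score - current_score) / t):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score
```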
If anyone is reading this and understands the problem: any ideas on how to better optimize this search? Should I use a different move/neighbor function, different annealing parameters, or perhaps a different local search method, or even some kind of non-local search? All ideas are welcome and deeply appreciated.
You haven't indicated a performance requirement, so I'll assume that this should work quickly enough to be usable in a game with human players. It can't take an hour to find the solution, but you don't need it in a millisecond, either.
I'm wondering if simulated annealing is the right method. This might be an opportunity for brute force.
One can make a very fast algorithm for evaluating poker hands. Consider an encoding of the cards where 13 bits encode the card value and 4 bits encode the suit. OR together the cards in the hand and you can quickly identify pairs, triples, straights, and flushes.
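A rough sketch of that kind of encoding (the exact bit layout and helper names here are just illustrative assumptions):

```python
RANKS = "23456789TJQKA"
SUITS = "cdhs"

def encode(card: str) -> int:
    """Encode e.g. 'Ah' as a 17-bit int: 13 rank bits followed by 4 suit bits."""
    rank_bit = 1 << RANKS.index(card[0])
    suit_bit = 1 << SUITS.index(card[1])
    return (rank_bit << 4) | suit_bit

def hand_features(cards):
    """OR the encoded cards together to read off flushes, straights, and multiples."""
    rank_or, suit_or = 0, 0
    rank_counts = [0] * 13
    for c in cards:
        enc = encode(c)
        rank_or |= enc >> 4
        suit_or |= enc & 0b1111
        rank_counts[RANKS.index(c[0])] += 1
    is_flush = bin(suit_or).count("1") == 1          # every card shares the same suit bit
    distinct = bin(rank_or).count("1")               # pairs/trips show up as fewer distinct rank bits
    is_straight = (distinct == 5 and any((rank_or >> lo) == 0b11111 for lo in range(9))) \
                  or rank_or == 0b1000000001111      # the wheel: A,2,3,4,5
    return is_flush, is_straight, max(rank_counts)   # max count 2 = pair, 3 = trips, 4 = quads
```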
At first glance, there would seem to be 15! (1,307,674,368,000) possible arrangements of all the cards which are dealt, but there are other symmetries and restrictions that reduce the meaningful combinations and limit the space that must be explored.
One important constraint is that the bottom row must have a higher score than the middle row and that the middle row must have a higher score than the top row.
There are 3003 sets of bottom cards, COMBINATIONS(15 cards, 5 at a time) = (15!)/(5!(15-5)!) = 3003. For each set of possible bottom cards, there are COMBINATIONS(10 cards, 5 at a time) = (10!)/(5!(10-5)!) = 252 sets of middle cards. The top row has COMBINATIONS(5 cards, 3 at a time) = (5!)/(3!(5-3)!) = 10. With no further optimization, a brute force approach would require evaluating 3003*252*10 = 7,567,560 positions. I suspect that this can be evaluated within an acceptable response time.
A further optimization uses the constraint that each row must be worth less than the row below. If the middle row is worth more than the bottom row, the top row can be ignored by pruning the tree at that point, which removes a factor of 10 for those cases.
Also, since the bottom row must be worth more than the middle and top rows, there may be some minimum score the bottom row must achieve before it is worth trying middle rows. Rejecting a bottom row prunes 2520 cases from the tree.
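Putting those counts and pruning rules together, the enumeration itself is short. Here score_row() stands in for whatever hand evaluator you use; hand strength (for the ordering constraint) and royalty points are conflated into one number purely to keep the sketch small:

```python
from itertools import combinations

def best_arrangement(cards, score_row):
    """cards: the 15 dealt cards. score_row: assumed evaluator returning a single number.
    Returns (best_total, (bottom, middle, top))."""
    best_total, best_rows = -1, None
    card_set = set(cards)
    for bottom in combinations(cards, 5):            # 3003 choices
        b_score = score_row(bottom)
        rest1 = card_set - set(bottom)
        for middle in combinations(rest1, 5):        # 252 choices
            m_score = score_row(middle)
            if m_score > b_score:                    # prune: middle may not beat bottom
                continue
            rest2 = rest1 - set(middle)
            for top in combinations(rest2, 3):       # 10 choices; the remaining 2 cards are discarded
                t_score = score_row(top)
                if t_score > m_score:                # prune: top may not beat middle
                    continue
                total = b_score + m_score + t_score
                if total > best_total:
                    best_total, best_rows = total, (bottom, middle, top)
    return best_total, best_rows
```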
I understand that there is a way to use simulated annealing for estimating solutions for discrete problems. My use of simulated annealing has been limited to continuous problems with edge constraints. I don't have a good intuition for how to apply SA to discrete problems. Many discrete problems lend themselves to an exhaustive search, provided the search space can be trimmed by exploiting symmetries and constraints in the particular problem.
I'd love to know the solution you choose and your results.
I am currently working on a search ranking algorithm which will be applied to Elasticsearch queries (domain: e-commerce). It assigns scores to the entities returned and finally sorts them by the assigned score.
My question is: has anyone ever tried to introduce a certain level of randomness into a search algorithm and experienced a positive effect from it? I am thinking that it might be useful for reducing bias and promoting lower-ranking items, to give them a chance to be seen more easily and become popular if they deserve it. I know that some machine learning algorithms introduce randomization to reduce bias, so I thought it might apply to search as well.
The closest I can get here is this, but it's not exactly what I am hoping to get answers for:
Randomness in Artificial Intelligence & Machine Learning
I don't see this mentioned in your post... Elasticsearch offers a random scoring feature: https://www.elastic.co/guide/en/elasticsearch/guide/master/random-scoring.html
As the owner of the website, you want to give your advertisers as much exposure as possible. With the current query, results with the same _score would be returned in the same order every time. It would be good to introduce some randomness here, to ensure that all documents in a single score level get a similar amount of exposure.
We want every user to see a different random order, but we want the same user to see the same order when clicking on page 2, 3, and so forth. This is what is meant by consistently random.
The random_score function, which outputs a number between 0 and 1, will produce consistently random results when it is provided with the same seed value, such as a user's session ID.
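As a concrete illustration, the query body for that feature looks roughly like this ("products" and "title" are made-up index/field names, and exact syntax details vary a little between Elasticsearch versions):

```python
import json

def randomized_query(query_text: str, session_id: int) -> dict:
    """Build a function_score query that adds consistently random noise to the relevance score.
    Pass the resulting body to the Elasticsearch client or the _search REST endpoint."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {"title": query_text}},
                "functions": [
                    # Same seed => same "random" order for the same user across pages.
                    {"random_score": {"seed": session_id, "field": "_seq_no"}}
                ],
                "boost_mode": "sum",  # add noise to the relevance score rather than replacing it
            }
        }
    }

print(json.dumps(randomized_query("wireless headphones", session_id=42), indent=2))
```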
Your intuition is right - randomization can help surface results that get a lower score than they deserve due to uncertainty in the estimation. Empirically, Google search ads seem to have sometimes been randomized, and e.g. this paper hints at it (see Section 6).
This problem describes an instance of a class of problems called Explore/Exploit algorithms, or Multi-Armed Bandit problems; see e.g. http://en.wikipedia.org/wiki/Multi-armed_bandit. There is a large body of mathematical theory and algorithmic approaches. A general idea is to not always order by expected, "best" utility, but by an optimistic estimate that takes the degree of uncertainty into account. A readable, motivating blog post can be found here.
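A toy UCB1-style example of that "optimistic estimate" idea, with made-up click/impression counts:

```python
import math

# item -> (clicks, impressions); hypothetical data
stats = {"item_a": (40, 1000), "item_b": (2, 30), "item_c": (0, 1)}

def ucb_score(clicks: int, impressions: int, total_impressions: int) -> float:
    """Mean click-through rate plus an optimism bonus that grows with uncertainty."""
    if impressions == 0:
        return float("inf")                      # always try unseen items once
    mean = clicks / impressions
    bonus = math.sqrt(2 * math.log(total_impressions) / impressions)
    return mean + bonus

total = sum(n for _, n in stats.values())
ranked = sorted(stats, key=lambda it: ucb_score(*stats[it], total), reverse=True)
print(ranked)   # rarely-shown items rank optimistically high until their CTR is better known
```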
I have a python app with a database of businesses and I want to be able to search for businesses by name (for autocomplete purposes).
For example, consider the names "best buy", "mcdonalds", "sony" and "apple".
I would like "app" to return "apple"; "appel" and "ple" should return it as well.
"Mc'donalds" should return "mcdonalds".
"bst b" and "best-buy" should both return "best buy".
Which algorithm am I looking for, and does it have a python implementation?
Thanks!
The Levenshtein distance should do.
Look around - there are implementations in many languages.
Levenshtein distance will do this.
Note: this is a distance, so you have to calculate it against every string in your database, which can be a big problem if you have a lot of entries.
If you have this problem, then record all the typos users make (typo = no direct match) and build an offline correction database containing all the typo -> fix mappings. Some companies do this even more cleverly, e.g. Google watches how users correct their own typos and learns the mappings from that.
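To make the "compute it against every entry" point concrete, here is a minimal Levenshtein implementation plus a brute-force lookup over the name list:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

names = ["best buy", "mcdonalds", "sony", "apple"]
query = "appel"
print(min(names, key=lambda n: levenshtein(query.lower(), n)))   # -> "apple"
```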
Soundex or Metaphone might work.
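For instance, a simple (American) Soundex fits in a few lines; this sketch ignores some edge cases of the full specification:

```python
def soundex(name: str) -> str:
    """Simple American Soundex: first letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6"}
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    out = [name[0].upper()]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:        # collapse adjacent letters with the same code
            out.append(code)
        if ch not in "hw":               # h and w do not reset the previous code
            prev = code
    return ("".join(out) + "000")[:4]

print(soundex("mcdonalds"), soundex("macdonalds"))  # both map to the same code
```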
I think what you are looking for is a huge field of Data Quality and Data Cleansing. I fear if you could find a python implementation regarding this as it has to be something which cleanses considerable amount of data in db which could be of business value.
Levenshtein distance goes in the right direction, but only half the way. There are several tricks to get it to use the half matches as well.
One would be to use subsequence dynamic time warping (DTW is actually a generalization of Levenshtein distance). For this you relax the start and end cases when calculating the cost matrix. If you only relax one of the conditions, you get autocompletion with spell checking. I am not sure if there is a Python implementation available, but if you want to implement it yourself it should not be more than 10-20 LOC.
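For example, relaxing only the end condition gives a prefix edit distance, which is exactly the autocomplete-with-spell-checking case (a rough sketch, not a tuned implementation):

```python
def prefix_edit_distance(query: str, candidate: str) -> int:
    """Edit distance between `query` and the best-matching *prefix* of `candidate`.
    Only the end condition of the usual Levenshtein recurrence is relaxed."""
    prev = list(range(len(candidate) + 1))           # distance of "" to candidate[:j]
    for i, cq in enumerate(query, 1):
        curr = [i]                                   # distance of query[:i] to ""
        for j, cc in enumerate(candidate, 1):
            curr.append(min(prev[j] + 1,             # delete from candidate
                            curr[j - 1] + 1,         # insert into candidate
                            prev[j - 1] + (cq != cc)))  # substitute
        prev = curr
    return min(prev)                                 # relaxed end: best prefix, not the full string

names = ["best buy", "mcdonalds", "sony", "apple"]
for q in ["app", "bst b", "mcdonalds"]:
    print(q, "->", min(names, key=lambda n: prefix_edit_distance(q, n)))
```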
The other idea would be to use a trie for speed-up, which can run DTW/Levenshtein on multiple words simultaneously (a huge speedup if your database is large). There is a paper on Levenshtein on tries at IEEE, so you can find the algorithm there. Again, for this you would need to relax the final boundary condition so you get partial matches. However, since you step down the trie, you just need to check when you have fully consumed the input and then return all leaves.
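A compact version of that trie idea, sharing one Levenshtein row per trie node and returning everything under any prefix that matches the input within a given edit budget (simplified from the approach in the paper, so treat it as a sketch):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None              # set at the node where a whole name ends

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def trie_search(root, query, max_cost):
    """Walk the trie, carrying one Levenshtein row per node (shared by all words
    with the same prefix). Returns {word: cost} for words whose prefix matches
    the fully consumed query within max_cost."""
    results = {}
    first_row = list(range(len(query) + 1))

    def collect(node, cost):
        if node.word is not None:
            results[node.word] = min(cost, results.get(node.word, cost))
        for child in node.children.values():
            collect(child, cost)

    def recurse(node, ch, prev_row):
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(row[j - 1] + 1,
                           prev_row[j] + 1,
                           prev_row[j - 1] + (query[j - 1] != ch)))
        if row[-1] <= max_cost:       # relaxed boundary: query consumed here within budget,
            collect(node, row[-1])    # so every word below this node is a completion candidate
        if min(row) <= max_cost:      # otherwise prune this whole subtree
            for next_ch, child in node.children.items():
                recurse(child, next_ch, row)

    for ch, child in root.children.items():
        recurse(child, ch, first_row)
    return results

trie = build_trie(["best buy", "mcdonalds", "sony", "apple"])
print(trie_search(trie, "app", max_cost=1))    # apple, matched via its "app" prefix
print(trie_search(trie, "bst b", max_cost=2))  # best buy, despite the missing letter
```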
Check this one: http://docs.python.org/library/difflib.html
It should help you.
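For instance, difflib.get_close_matches does fuzzy matching against a list of candidates out of the box:

```python
import difflib

names = ["best buy", "mcdonalds", "sony", "apple"]
print(difflib.get_close_matches("appel", names, n=3, cutoff=0.6))      # ['apple']
print(difflib.get_close_matches("best-buy", names, n=3, cutoff=0.6))   # ['best buy']
```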