Framework for minimizing time complexity of generalized search

I have training in pure math but not in statistics, computer science, or information theory, so I am a bit lost here and would really appreciate any guidance.
I am looking for helpful ways to frame a general search approach that would minimize the time complexity of the search.
For example, let's say I was playing a modified version of 20 questions with a friend. The friend has thought of a human, presently alive in the US, and I can ask up to 20 questions to uncover the truth. I want to ask as few questions as possible on average to win the game. We will play this game repeatedly, and I want to develop a strategy that minimizes my average win time (as measured by the number of questions asked).
Sample Space: 329.5 million humans currently alive in the US
Rule: Ask any question. The question can have a yes-or-no answer or even a descriptive answer. So, for instance, it is allowed to ask for the person's first name.
Intuitively, it seems to me that immediately (as a first question) asking a question like "Is it Barack Obama?" is a terrible question because it splits the sample space (or search space) into two sets: one with 1 person, namely the former US President, and the second containing the rest of the US population.
Asking what their sex (or, old school, gender) is may be a better question, as it will split the yes and no answers into sets of roughly equal size.
Instead of asking a binary question, asking an n-ary question is likely better because it will split the sample space into n sub-spaces of varying sizes, and if the sizes are similar then that's fantastic. For instance, the question could be: what is the first letter of their last name? There are 26 possible answers, although we know that people in the US are much more likely to have a last name beginning with "J" than with "X".
Of course, I can conceivably ask a 329.5 million-ary question whereby I'll have the answer in one-shot.
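To quantify that intuition, here is a minimal Python sketch (my own illustration, assuming a uniform prior over the population and perfectly balanced splits, both idealizations) comparing the guess-one-person-at-a-time strategy against balanced n-ary splitting:

import math

N = 329_500_000  # sample space: people currently alive in the US

# Strategy 1: "Is it Barack Obama?"-style guessing, one person per question.
# Under a uniform prior, the expected number of questions is about N/2.
expected_guessing = (N + 1) / 2

# Strategy 2: n-ary questions that split the remaining set into n roughly
# equal parts; each answer shrinks the set by a factor of n, so the worst
# case is about log_n(N) questions.
def questions_needed(n: int, size: int = N) -> int:
    return math.ceil(math.log(size, n))

print(f"one-at-a-time guessing: ~{expected_guessing:,.0f} questions")
for n in (2, 3, 10, 26):
    print(f"balanced {n}-ary splits: {questions_needed(n)} questions")
    # n=2 -> 29, n=3 -> 18, n=10 -> 9, n=26 -> 7

The gap between roughly 165 million expected questions and 29 questions is exactly the difference the intuition above is gesturing at.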
My questions for you guys are as follows:
If we fix "n", so asking only binary or ternary or fixed-n-ary questions, it seems to me that the efficient approach would be to ask questions which divide the sample space into "n" roughly equal parts. How can I prove this? What is the right approach or mathematical framework for proving it, assuming I am only minimizing time complexity, i.e., the average number of questions I need to ask to get to the solution?
If we don't fix "n", then what would be a general way to frame this mathematically? Now I have two variables over which I am operating, "n" and "the relative sizes of the subsets into which the answer to an n-ary question splits the sample space", to minimize the time complexity. How can I frame this problem mathematically?
Is my intuition even correct? Or are there faster ways to approach this?
What I am describing sounds an awful lot like a Classification Decision Tree in Machine Learning. Is minimizing Entropy the right way to frame my question? (A sketch of this framing appears below.)
Who would know or think about this type of stuff? Information theorists? Computer Scientists? Statisticians? Probability Theorists? Machine Learning folks? Someone else?
What's the right forum on the internet to get help on this question? Reddit? Some specific stackexchange? Anything else?
Thx
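One standard way to frame the first two questions (and the entropy question) is information-theoretic; the following is a sketch of that standard framing, not anything from the original post. Let X be the unknown person and p_i the prior probability that person i was chosen. The entropy of X in bits is

H(X) = -\sum_i p_i \log_2 p_i .

Any strategy of n-ary questions is exactly an n-ary prefix code for X (the sequence of answers is the codeword), so by Shannon's source-coding theorem the expected number of questions L satisfies

\frac{H(X)}{\log_2 n} \le L < \frac{H(X)}{\log_2 n} + 1 ,

where the upper bound is attained by an n-ary Huffman code. Greedily asking questions whose n possible answers are as close to equally probable as possible (the intuition above, essentially Shannon-Fano coding) comes close to this bound but is not always exactly optimal; Huffman's bottom-up construction is. With a uniform prior over N = 329.5 million people, H(X) = \log_2 N \approx 28.3 bits, so about 29 balanced yes/no questions suffice. This is also the same objective that classification decision trees greedily optimize via information gain.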

Related

Why can't we use the arithmetic mean in planning poker? [closed]

Why is that? I really can't understand it. Why can we only select from the numbers proposed by the players?
The numbers used are spaced far apart on purpose (typically from the Fibonacci sequence). If you get numbers from all across the board from 1 to 23, you're supposed to ask why the person who voted 1 gave it such a low score ("Did you think about testing and deployment? What about these other acceptance criteria?") and why the person who voted 23 gave it such a high score ("Are you aware we already have an existing code base? Did you know that Karen knows a lot about this and can pair up with you?") and then re-vote. If you're really stuck because half the team says 8 and the other half says 13, you can take the 13 and move on with your lives.
The extra precision isn't necessary when your accuracy is not great. My team goes for even less precision and buckets stories into "small" (one person can do a bunch in an iteration), "medium" (one person can handle a few of these), "large" (one person a week or more), and "extra large" (too big and needs to be split).
You can do what you want to do. However, the thinking behind the exact numbers that are proposed is that as the numbers grow, you cannot estimate small details reliably. That's why the gaps between the numbers become larger as the numbers grow.
Once you start giving detailed numbers (like one person estimating 8, the next 13, and choosing 11 as the mean), people assume this actually is a detailed estimate. It's not. The method is meant to give a rough guess.
Behind the idea that people should agree on one number is the principle that everybody should have the same understanding of the story.
If people pick very different numbers, they have a different understanding of how much work is needed to complete the story or of how difficult it will be. The different numbers should then start discussions and finally lead to a shared view of the story.
You should think of the numbers as symbols with no arithmetic meaning, except for a (partial) order relation, because they are estimates (of the effort needed to complete a user story).
If you use math to model an estimate you should provide a way:
to represent certainty
to represent uncertainty
to operate with that representations
to define an average as a function of certainty and uncertainty
If you use some kind of average which operates on estimates modeled as single numbers, you are supposing that certainty and uncertainty can be handled in the same way, and I guess that's a bad assumption.
I think the spirit of a planning poker session is achieving a team-shared estimate by discussion among human beings, not using arithmetic on human beings' estimates.
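To make the certainty/uncertainty list above concrete, here is a minimal Python sketch; the three-point (PERT-style) representation and its rules of thumb are my own illustrative choice, not part of planning poker:

from dataclasses import dataclass

@dataclass
class Estimate:
    low: float     # optimistic bound
    likely: float  # most likely value
    high: float    # pessimistic bound

    def mean(self) -> float:
        # PERT expected value: weights the most likely value 4x.
        return (self.low + 4 * self.likely + self.high) / 6

    def stdev(self) -> float:
        # PERT rule of thumb: the spread carries the uncertainty.
        return (self.high - self.low) / 6

# Averaging the single numbers 8 and 13 to 10.5 hides how much the two
# estimators disagree; carrying the spread keeps the uncertainty visible.
a = Estimate(low=5, likely=8, high=20)
b = Estimate(low=10, likely=13, high=15)
print(a.mean(), a.stdev())  # 9.5  2.5
print(b.mean(), b.stdev())  # ~12.83  ~0.83

The point of the sketch is only that a single number collapses the last two lines into one, which is exactly the assumption the answer above objects to.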

NLP: alternate spelling identification

Help by editing my question title and tags is greatly appreciated!
Sometimes one participant in my corpus of "conversations" will refer to another participant using a nickname, usually an abbreviation or misspelling, but hereafter I'll just say "nicknames". Let's say I'm willing to manually tell my software whether or not I think various possible nicknames are in fact nicknames, but I want the software to come up with a list of possible matches between the handles that identify people and the potential nicknames. How would I go about doing that?
Background on me and then my corpus: I have no experience doing natural language processing, but I'm a competent data analyst with R. My data is produced by 70 teams, each forecasting the likelihood of 100 distinct events occurring some time in the future. The result is that I have 70 x 100 = 7000 text files, containing the stream of forecasts participants make and the comments they include with their forecasts. I'll paste a very short snip of one of these text files below; this one had to do with whether the Malian government would enter talks with the MNLA:
02/12/2013 20:10: past_returns answered Yes: (50%)
I hadn't done a lot of research when I put in my previous
placeholder... I'm bumping up a lot due to DougL's forecast
02/12/2013 19:31: DougL answered Yes: (60%)
Weak President Traore wants talks if MNLA drops territorial claims.
Mali's military may not want talks. France wants talks. MNLA sugggests
it just needs autonomy. But in 7 weeks?
02/12/2013 10:59: past_returns answered No: (75%)
placeholder forecast...
http://www.irinnews.org/Report/97456/What-s-the-way-forward-for-Mali
My initial thoughts: Obviously I can start by providing the names I'm looking to match things up with... in the above example they would be past_returns and DougL (though there is no use of nicknames in the above). I wouldn't think it'd be that hard to get a computer to guess at minor misspellings (though I wouldn't personally know where to start). I can imagine that other tricks could be used, like assuming that a string is more likely to be a nickname if it is used much, much more by one team than by other teams. A nickname is more likely to refer to someone who spoke recently than to someone who spoke long ago, or not at all, regarding this question. And nicknames should be used in sentences in a manner similar to the way the full name/screen name is typically used in the corpus. But I'm interested to hear about simple approaches, as well as ones that try to consider more sophisticated techniques.
This could get about as complicated as you want to make it. From the semi-linguistic side of things, research topics would include Levenshtein Distance (for detecting minor misspellings of known names/nicknames) and Named Entity Recognition (for the task of detecting names/nicknames in the first place). Actually, NER's worth reading about, but existing systems might not help you much in your domain of forum handles and nicknames.
The first rough idea that comes to mind is that you could run a tokenized version of your corpus against an English dictionary (perhaps a dataset compiled from Wiktionary or something like WordNet) to find words that are candidates for names, then filter those through some heuristics (do they start with the same letters as known full names? Do they have a low Levenshtein distance from known names? Are they used more than once?).
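For the misspelling heuristics, a minimal sketch (in Python rather than R; difflib is in the standard library, and the 0.6 threshold and the handle list are placeholders):

from difflib import SequenceMatcher

KNOWN_HANDLES = ["past_returns", "DougL"]  # per-team, from your data

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; related to, but not identical to, edit distance.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_matches(token: str, threshold: float = 0.6):
    # Return (handle, score) pairs for tokens that look like nicknames:
    # either fuzzy-similar to a handle, or a prefix of one.
    scores = [(h, similarity(token, h)) for h in KNOWN_HANDLES]
    return [(h, round(s, 2)) for h, s in scores
            if s >= threshold or h.lower().startswith(token.lower())]

print(candidate_matches("Doug"))     # [('DougL', 0.89)]
print(candidate_matches("pastret"))  # [('past_returns', 0.74)]

A proper Levenshtein distance (a short dynamic program, or an off-the-shelf package) would behave similarly; the threshold is where your manual yes/no labeling would come in.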
You could also try some clustering or supervised ML algorithms against the non-word tokens. That might reveal some non-"word" tokens that often occur in the same threads as a given username; again, heuristics could help rule out some false positives.
Good luck; sounds like a fun problem - hope I mentioned at least one thing you hadn't already thought of.

Best Algorithmic Approach to Sentiment Analysis [closed]

My requirement is taking in news articles and determining if they are positive or negative about a subject. I am taking the approach outlined below, but I keep reading that NLP may be of use here. Everything I have read points at NLP distinguishing opinion from fact, which I don't think would matter much in my case. I'm wondering two things:
1) Why wouldn't my algorithm work, and/or how can I improve it? (I know sarcasm would probably be a pitfall, but again I don't see that occurring much in the type of news we will be getting.)
2) How would NLP help, and why should I use it?
My algorithmic approach (I have dictionaries of positive, negative, and negation words; a sketch in code follows the steps):
1) Count the number of positive and negative words in the article.
2) If a negation word is found within 2 or 3 words of the positive or negative word (e.g., "NOT the best"), negate the score.
3) Multiply the scores by weights that have been manually assigned to each word. (1.0 to start)
4) Add up the totals for positive and negative to get the sentiment score.
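A direct sketch of the four steps, in Python; the word lists here are tiny placeholders standing in for the real dictionaries:

POSITIVE = {"good", "best", "gain", "strong"}
NEGATIVE = {"bad", "worst", "loss", "weak"}
NEGATION = {"not", "never", "no"}
WEIGHTS = {}  # manually assigned per-word weights; default 1.0

def sentiment_score(text: str, window: int = 3) -> float:
    words = text.lower().split()
    score = 0.0
    for i, w in enumerate(words):
        if w not in POSITIVE and w not in NEGATIVE:
            continue  # step 1: only sentiment words contribute
        polarity = 1.0 if w in POSITIVE else -1.0
        # Step 2: flip polarity if a negation word appears within
        # `window` words before the sentiment word.
        if any(x in NEGATION for x in words[max(0, i - window):i]):
            polarity = -polarity
        # Step 3: apply the manual weight (1.0 to start).
        score += polarity * WEIGHTS.get(w, 1.0)
    return score  # step 4: the sign gives the overall sentiment

print(sentiment_score("not the best quarter and a heavy loss"))  # -2.0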
I don't think there's anything particularly wrong with your algorithm; it's a fairly straightforward and practical way to go, but there are a lot of situations where it will make mistakes:
1) Ambiguous sentiment words - "This product works terribly" vs. "This product is terribly good"
2) Missed negations - "I would never in a million years say that this product is worth buying"
3) Quoted/indirect text - "My dad says this product is terrible, but I disagree"
4) Comparisons - "This product is about as useful as a hole in the head"
5) Anything subtle - "This product is ugly, slow and uninspiring, but it's the only thing on the market that does the job"
I'm using product reviews for examples instead of news stories, but you get the idea. In fact, news articles are probably harder because they will often try to show both sides of an argument and tend to use a certain style to convey a point. The final example is quite common in opinion pieces, for example.
As far as NLP helping you with any of this: word sense disambiguation (or even just part-of-speech tagging) may help with (1), syntactic parsing might help with the long-range dependencies in (2), and some kind of chunking might help with (3). It's all research-level work though; there's nothing that I know of that you can directly use. Issues (4) and (5) are a lot harder; I throw up my hands and give up at this point.
I'd stick with the approach you have and look at the output carefully to see if it is doing what you want. Of course, that then raises the issue of what you understand the definition of "sentiment" to be in the first place...
My favorite example is "just read the book". It contains no explicit sentiment word, and it is highly dependent on context. If it appears in a movie review, it means the-movie-sucks-it's-a-waste-of-your-time-but-the-book-is-good. However, if it is in a book review, it delivers a positive sentiment.
And what about "this is the smallest [mobile] phone on the market"? Back in the '90s, it was great praise. Today it may indicate that the phone is way too small.
I think this is the place to start in order to get the complexity of sentiment analysis: http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html (by Lillian Lee of Cornell).
You may find the OpinionFinder system and the papers describing it useful.
It is available at http://www.cs.pitt.edu/mpqa/ with other resources for opinion analysis.
It goes beyond polarity classification at the document level and tries to find individual opinions at the sentence level.
I believe the best answer to all of the questions you mentioned is reading the book titled "Sentiment Analysis and Opinion Mining" by Professor Bing Liu. This book is the best of its kind in the field of sentiment analysis; it is amazing. Just take a look at it and you will find the answers to all your 'why' and 'how' questions!
Machine-learning techniques are probably better.
Whitelaw, Garg, and Argamon have a technique that achieves 92% accuracy, using a technique similar to yours for dealing with negation, and support vector machines for text classification.
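For flavor, here is a generic scikit-learn sketch of the SVM-for-text-classification family; this is not the Whitelaw, Garg, and Argamon system, and the four training documents are obviously placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["great results this quarter", "profits are up strongly",
         "shares plunged after the scandal", "a disappointing earnings miss"]
labels = ["pos", "pos", "neg", "neg"]  # stand-in labeled articles

# TF-IDF unigrams/bigrams as features, linear SVM as the classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)
print(model.predict(["profits are up this quarter"]))  # ['pos']

With a real corpus you would hold out a test set and measure accuracy rather than trusting a single prediction.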
Why don't you try something similar to how the SpamAssassin spam filter works? There's really not much difference between intention mining and opinion mining.

Agile/XP estimating [closed]

In the context of pair programming, how do you estimate? Say a 5-point story is split into 3 tasks, each task being swarmed by 2 members. Does that mean it is finishable in half the time?
You estimate using points. The number of points achievable by a pair is called velocity. Check this: http://en.wikipedia.org/wiki/Planning_poker
I always stress story sizing over estimating. If you can minimize the variation in size between stories, then estimates become much less interesting. That is, estimation activities lose their value when the difference in size between any two stories is small, which helps fast teams recoup the cost of estimation and put it toward something productive (like building product).
With that in mind, I would suggest proactively pruning the backlog and splitting five point stories into smaller ones (still thin vertical slices) whenever possible. Until your team is experienced, I'd suggest you continue to have estimation parties, but prompt a default estimate of 1 point to each card, with a quick consensus check or discussion as to why any justify a bump to a 2 or 3. For anything that's clearly bigger than a 3, I'd suggest challenging that one of two problems is present: the baseline value of "1" is too small or the story should be split (either made more specific or tracked as an epic).
As the team establishes a decent velocity, the activity can hopefully shift its mindset from estimation to mere story vetting. That is, the question during planning of "how big is this story?" becomes "is this story abnormal?" The latter question takes significantly less time to answer.
Most Agile methodologies suggest that you should estimate in points. However, there are many successful teams out there - including several advanced and highly productive Kanban teams - who estimate in hours. Points come with their own games, perverse incentives and problems. YMMV. Anyway...
I've heard figures of 25% more dev-hours for a pair completing a task. So, the task would be finished in 62.5% of the time, using two developers instead of one. However, quality and knowledge-sharing often increase too. Since bugs = rework, and rework takes longer than doing it the right way the first time, pair programming usually pays for itself. This differs for different tasks and levels of skill, e.g., simple bug fixes, novice programmers, etc.
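To spell out the arithmetic behind those two figures (the 25% overhead is the answer's assumption, not a hard number): if a task takes T dev-hours for a solo developer, then

\text{pair dev-hours} = 1.25\,T, \qquad \text{wall-clock time} = \frac{1.25\,T}{2} = 0.625\,T,

which is where "finished in 62.5% of the time" comes from.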
In my experience 2/3 of the time is a pretty good ballpark figure. It's longer than 1/2 but less than it would be with just 1 person.
Sallyann Freudenberg is a good name to look up regarding pair programming research. You could also check out the refs on the Wikipedia page:
http://en.wikipedia.org/wiki/Pair_programming
The figure is mostly borne out by the data in this report by Alistair Cockburn and Laurie Williams: http://collaboration.csc.ncsu.edu/laurie/Papers/XPSardinia.PDF
The pair estimates how long it is going to take them. Any estimate is based on experience: the longer the experience the team has with the project, with the technology, and with working together in pairs, the better the estimates will be. Arbitrary percentages like "add 25% for pairs" are of use only at the very beginning - with a new project and a new team on a new technology - where you have nothing else to base the estimate on yet. As soon as experience starts building, the estimates will improve.
Remember, though, that they are just that - estimates: our best guess of the future, derived with the help of experience and knowledge from our understanding of the present. It's like a weather forecast - the more data we have and the more experience the forecasters have, the better it is, but it is still a prediction, not reality.
Which is why points are so great: they help you estimate the one parameter you can - how "big" the tasks at hand are.
1) A simple, concrete way to think about estimating units is the number of check-ins it would take to complete the story.
If you're doing TDD, continuous integration, and refactoring, you'll be working in small chunks of work, keeping the build green and checking in regularly, so under those conditions single check-ins can be a meaningful unit of estimation.
2) Alternatively, think about blocks of uninterrupted pairing time in a day, e.g., stand-up to coffee break, coffee break to lunch, lunch to mid-afternoon break, mid-afternoon to going-home time - say 4 periods a day, so 4 units is a single day. That gives you a bound on what you could expect to do in an iteration...
Personally I go for the number of check-ins, because I can sketch out roughly the tasks involved and get an idea of check-in numbers.
The great thing about number of check-ins is that it doesn't matter if you're pairing or not - you're just tracking what you can get done.

Agile - User Story Definitions [closed]

I'm writing a small app for my friend's business, and thought I'd take the opportunity to brush up on some Agile Project Management training I did at the start of the year.
I (and I think, my current organisation!) have always struggled with gathering requirements in the form of User Stories, which take the form:
As a [User Type] I want [feature] so that [some benefit]
I'm always tempted to miss out the beginning and end, and just leave the feature - but this then just becomes requirements gathering the old way!
But I don't want to just make it fit so that I can say "I'm doing Agile"... For example, if I know that the user is to be presented with a list of items, then the reason is self-evident, is it not?
e.g.
As a [Store Manager] I want [to see a list of Stock Items] so that ... ?
Is it normal practice to leave out the [so that] clause?
We used to miss it out as well. And by leaving it out we missed a lot.
To understand the feature properly - to not just do the thing right but DO THE RIGHT THING - it is key to know WHY the feature exists, and for that the next key is WHO (the role).
In DDD terms, the stakeholder. Stakeholders can be different - everyone who cares, from programmers and DB admins to all the types of users.
So, first understand who the stakeholder is; then you know 50% of WHY they care; then the benefit; and then it is already almost obvious WHAT to implement.
Try to not just write "as a user". Specify: "as a store manager", or even "as the lead of the shift responsible for closing the day", I need... so that...
Maybe you can implement something different which will give the same stakeholder an even better benefit!
Try: To Achieve [Business Value] As [User] I need [Feature].
The goal is to focus on the value the feature delivers. It helps you think in vertical slices, which reduces pure "technical tasks" that aren't visible. It's not an easy transition, but when you start thinking vertically you start really being able to reduce the waste in your process.
Another way is to think of the acceptance tests that your customer could write to ensure the feature works. It's a short jump to then using something like FitNesse to automate those tests.
No, it's actually not obvious - there are a lot of reasons to want to see a list, and a lot of things you might want to do with it: scan it for some info, get an overview, print it, copy and paste it into a Word document, etc. And what exactly it is will give you valuable hints on reasonable implementation details - formatting of the list, exact content - or even a hint that a different feature might be a better idea to satisfy that need. Don't be surprised to find out that the reason actually is "so that I can count the number of entries"...
Of course, this might in fact not apply to you. My actual point is that there are reasons that people came up with this template - and there are also reasons that a lot of experienced people don't actually use it. And when you are new to the practice, you are not in a good position to assess all the pros and cons of following it, so I'd highly recommend simply trying to follow it closely for some time. You might be surprised by its usefulness - or not, in which case you still learned something and can drop it with a clear conscience... :)
User Stories are another way of saying that you need to interview your users to find out what they want and what problems they are trying to solve. That is the heart of having this in agile development. If the form is not working for you, then take a step back and try a different approach that feels more natural to you or better suited to your capabilities as a writer.
In short, don't feel like you have to be in a straitjacket. The important thing is that you follow the spirit of the methodology.
In this specific case you want to get a list of what problems the user has, why they are problems, and what they think will help them.
I think you should really try to get a reason defined, even if it may seem obvious. If you can't come up with a reason then why build the feature in the first place? Also the reason may point out other deficiencies in the design that could trigger improvements in other areas.
I often categorize my stories by the user/persona that it primarily relates to, thus I don't put the user's identity in the story title. My stories also are bigger than some agile methodologies suggest. Usually, I start with a title. I use it for planning purposes. Once I get close to actually working on that story, I flesh it out with some details -- basic idea, constraints, assumptions, related stories -- so that I capture more of the information that I know about it. I also keep my stories in a wiki, not on note cards. I understand the trade-off -- i.e., I may spend too much time on details before I need them, but I am able to capture and share it with, typically, off-site customers easily.
The bottom line for me is that Agile is a philosophy, rather than a specification. There are particular implementations that may (strongly) suggest that you do things a certain way and may be non-negotiable on some items. For example, it's hard to say you're doing XP if you don't pair program. In general, though, I would say that most agilists would say that you ought to do those things that work for you, in the way that they work for you -- as long as they are consistent with the general principles, you can still call yourself agile. The general principles would include things like release early/release often, unit testing, short iterations, acknowledge that change will happen, delay detailed planning until you are ready to implement, ...
Bottom line for me: if the stories work for you without the user and rationale -- as long as you understand who the user is and why they want something -- do it however you want. Just don't require a complete specification before you start implementing.
