Interview Question: What is a hashmap? [closed]

I was asked this in an interview: "Tell me everything you know about hashmaps."
I proceeded to do just that: it's a data structure that stores key-value pairs; a hash function is used to locate elements; hash collisions can be resolved in several ways; and so on.
After I was done, they asked: "OK, now explain everything you just said to a 5-year-old. You can't use technical terms, especially hashing and mapping."
I have to say it took me by surprise and I didn't give a good answer. How would you answer?

Rules. Kids know rules. Kids know that certain items belong in certain places. A HashMap is like a set of rules that say that, given an item (your shoes, your favorite book, or your clothes), there is a specific place it should go (the shoe rack, the bookshelf, or the closet).
So if you want to know where to look for your shoes, you know to look in the shoe rack.
But wait: what happens if the shoe rack is already full? There are a few options.
1) For each item, there's a list of places you can try. Try putting them next to the door. But wait, there's something there already: where else can we put them? Try the closet. If we need to find our shoes, we follow the same list until we find them. (probing sequences; see the sketch after this list)
2) Buy a bigger house, with a bigger shoe rack. (dynamic resizing)
3) Stack the shoes on top of the rack, ignoring the fact that it makes it a real pain to find the right pair, because we have to go through all of the shoes in the pile to find them. (chaining).
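In code, the probing idea from option 1 might look like this minimal Python sketch (a toy with made-up names like ProbingTable, not any real library's API; a real table would also resize, as in option 2):

    # A toy hash table using linear probing: if a slot is taken,
    # try the next one (wrapping around) until a free slot turns up.
    class ProbingTable:
        def __init__(self, size=8):
            self.slots = [None] * size  # each slot holds a (key, value) pair or None

        def _probe(self, key):
            start = hash(key) % len(self.slots)  # where the item "belongs"
            for i in range(len(self.slots)):     # the list of places to try
                yield (start + i) % len(self.slots)

        def put(self, key, value):
            for idx in self._probe(key):
                if self.slots[idx] is None or self.slots[idx][0] == key:
                    self.slots[idx] = (key, value)
                    return
            raise RuntimeError("table full; a real table would resize (option 2)")

        def get(self, key):
            for idx in self._probe(key):
                if self.slots[idx] is None:  # an empty slot means the key was never stored
                    raise KeyError(key)
                if self.slots[idx][0] == key:
                    return self.slots[idx][1]
            raise KeyError(key)

    rack = ProbingTable()
    rack.put("shoes", "shoe rack")
    print(rack.get("shoes"))  # shoe rack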

Let's take the big word book, or dictionary, and try to find the word zebra. We can easily guess that zebras will be near the end of the book, just like the letter "Z" is at the end of the alphabet. Now let's say that we can always find where the zebra is inside the big word book. This is the way that we can quickly find zebras, or elephants, or any other type of thing we can think of in the big word book. Sometimes two words will be on the same page, like apple and ant. We are sure which page we want to look at, but we aren't sure how close apple and ant are to each other until we get to the page. Sometimes apple and ant can be on the same page and sometimes they might not be; some big word books have bigger words.
That's how I would have done it.

Speaking as a parent, if I had to explain a hashmap to a 5-year-old, I'd say exactly what you said while waving around a chocolate cupcake.
Seriously, questions like this ought to mean "can you explain the concept in plain English?", which is a good heuristic for how well you've internalized your understanding of it. Since it sounds like you get that, the question seems a bit silly.

The pieces of data the map holds can be looked up by some related information, much like how pages can be looked up by the words on them in the index of a book.
The key advantage to using a HashMap is that like an index in a book, it's much quicker to look up the page a word is on in the index than it would be to start searching page by page for that word.
(I'm giving you a serious answer because the interviewer might have been trying to see how well you can explain technical concepts to non-techies like project managers and customers. Maybe a hashmap directly isn't so useful, but it's probably as fair an indication as any of translation skills.)

You have a book of blank but numbered pages and a special decoder ring that generates a page number when something is entered into it.
To assign a value:
You get a ID (key) and a message (value).
You put the ID into the special decoder ring and it spits out the page number.
Open your book to that page. If the ID is on the page, cross out the ID/message.
Now write the ID and message on the page. If there are already one or more other IDs with messages, just write the new one below them.
To retrieve a value:
You are given just an ID (key).
You put the ID into the special decoder ring and it spits out the page number.
Open your book to that page. If the ID is on the page, read the message (value) that follows it.
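That decoder-ring procedure is a hash table with chaining: the ring is the hash function and the pages are buckets. A minimal Python sketch of the same steps (Book, assign, and retrieve are made-up names mirroring the description above):

    class Book:
        def __init__(self, pages=16):
            self.pages = [[] for _ in range(pages)]  # each page holds (ID, message) pairs

        def _page_number(self, key):
            return hash(key) % len(self.pages)  # the "special decoder ring"

        def assign(self, key, message):
            page = self.pages[self._page_number(key)]
            for i, (k, _) in enumerate(page):
                if k == key:                  # ID already on the page:
                    page[i] = (key, message)  # cross it out and rewrite
                    return
            page.append((key, message))       # otherwise write it below the others

        def retrieve(self, key):
            page = self.pages[self._page_number(key)]
            for k, message in page:
                if k == key:
                    return message
            raise KeyError(key)

    book = Book()
    book.assign("alice", "meet at noon")
    print(book.retrieve("alice"))  # meet at noon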

Related

Framework for minimizing time complexity of generalized search

I have training in pure math but not in statistics, computer science, or information theory, so I am a bit lost here and would really appreciate any guidance.
I am looking for some helpful ways to frame a general search approach which would minimize the time complexity of the search.
For example, let's say I was playing a modified version of 20-questions with a friend. The friend has thought of a human, presently alive in the US, and I can ask up to 20 questions to uncover the truth. I want to ask as few questions as possible on average to win the game. We will play this game repeatedly and I want to develop a strategy that would minimize my average win time (as measured by the number of questions asked).
Sample Space: 329.5 million humans currently alive in the US
Rule: Ask any question. The question can have yes or no answer or even a descriptive answer. So for instance, it is allowed to ask the first name of the person.
Intuitively, it seems to me that immediately (as a first question) asking a question like "Is it Barack Obama?" is a terrible question because it splits the sample space (or search space) into two sets, one with 1 person, namely the former US President, and the second containing the rest of the US population.
Asking, what is their sex (or old school gender) may be a better question as it will split the yes and no answers into sets of roughly equal sizes.
Instead of asking a binary question, asking an n-ary question is likely better because it will split the sample space into n sub-spaces of varying sizes and if the sizes are similar then that's fantastic. For instance, the question could be, what is the first letter of their last name? There are 26 possible answers, although we know that people in the US are much more likely to have their last name begin with "J" rather than "X".
Of course, I can conceivably ask a 329.5 million-ary question whereby I'll have the answer in one-shot.
My questions for you guys are as follows:
If we fix "n", so asking only binary or ternary or fixed-n-ary questions, it seems to me that the efficient approach would be to ask questions which would divide the sample space into "n" roughly equal parts, if I am minimizing time complexity. How can I prove this? What is the right approach or mathematical framework to prove this? Assume that I am only minimizing time complexity, i.e., the average number of questions I need to ask to get to the solution.
If we don't fix "n", then what would be a general way to frame this mathematically? Now I have two variables over which I am operating, "n" and the relative sizes of the subsets into which the answer to an n-ary question splits the sample space, to minimize the time complexity. How can I frame this problem mathematically?
Is my intuition even correct? Or are there faster ways to approach this?
What I am describing sounds an awful lot like a Classification Decision Tree in Machine Learning. Is minimizing Entropy the right way to frame my question?
Who would know or think about this type of stuff? Information theorists? Computer Scientists? Statisticians? Probability Theorists? Machine Learning folks? Someone else?
What's the right forum on the internet to get help on this question? Reddit? Some specific stackexchange? Anything else?
Thx
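For what it's worth, the intuition is right, and information theory is the standard framework: a questioning strategy is exactly an n-ary prefix code for the sample space, so by the source coding theorem the average number of n-ary questions is at least H(X)/log2(n), where H(X) = -sum_i p_i log2(p_i) is the entropy of the distribution over people, and a Huffman-style strategy (greedily keeping the splits balanced, which is also what a decision tree grown on information gain does) gets within one question of that bound. A small back-of-envelope sketch in Python (the population figure is from the question; everything else is illustrative):

    import math

    N = 329_500_000  # people alive in the US, per the question

    # Strategy 1: guess one person per question ("Is it Barack Obama?").
    # Under a uniform prior the expected number of guesses is about N/2.
    print(f"one-at-a-time: ~{N / 2:.3g} questions on average")

    # Strategy 2: each yes/no question splits the remaining candidates in half,
    # which takes about log2(N) questions.
    print(f"balanced halving: ~{math.log2(N):.1f} questions")  # about 28.3

    # Balanced n-ary questions need about log_n(N); e.g. a 26-way split
    # (first letter of the last name, if letters were equally likely):
    print(f"balanced 26-way: ~{math.log(N, 26):.1f} questions")  # about 6

    # What a question buys you is the entropy of its answer distribution,
    # not its branch count, so a lopsided 1-vs-rest split is nearly worthless:
    p = 1 / N  # P(yes) for "Is it Barack Obama?"
    h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    print(f"1-vs-rest question: {h:.1e} bits (a 50/50 question gives 1.0)")

So for a fixed n, roughly equal parts are optimal because they maximize the entropy of each answer; once n is also free, the trade-off is between branch count and how evenly the questions you can actually ask split the population.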

How do I get the context of a sentence?

There is a questionnaire that we use to evaluate students' knowledge level (we do this manually, as with a test paper). It consists of the following parts:
Multiple choice
Comprehension questions (e.g., Is a spider an insect?)
Now I have been given a task to make an expert system that will automate this. So basically we have a proper answer for this. But my problem is the "comprehension questions". I need to compare the context of their answer to the context of the correct answer.
I already did an initial search for an answer, but it seems like it's really a big task. What I have found so far is that I can do this through NLP, which is really new to me. Also, if I'm not mistaken, it seems that I have to find a dictionary of all the words that an examinee could possibly use in an answer.
Am I on the right track? If not, please suggest what I should do (what should I study?) or give me some links to the materials that I need. Also, should I make my own dictionary? The words that I will be using are in the Filipino language.
Update: Comprehension question
The comprehension section of the questionnaire contains one paragraph explaining a certain scenario. The questions are fairly simple. Here is an example:
Bonnie's uncle told her to pick apples from the tree. Picking up a stick, she poked the fruits so they would fall. In the middle of doing this, a strong gust of wind blew. Due to her fear of the fruits falling on top of her head, she stopped what she was doing. After this, though, she noticed that the wind had caused apples to fall from the tree. These fallen apples were what she brought home to her uncle.
The questions are:
What did Bonnie's uncle tell her to do?
What caused Bonnie to stop picking apples from the tree?
Is Bonnie a good fruit picker? Please explain your answer.
The possible answers that the answer key states are:
For number 1:
1.1 Bonnie's uncle told her to pick apples from the tree
1.2 Get apples
For number 2:
2.1 A strong gust of wind blew
2.2 She might get hit in the head by the fruits
For number 3:
3.1 No, because the apples she got were already on the ground
3.2 No, because the wind was what caused the fruits to fall
3.3 Yes, because it is difficult to pick fruits when it's windy.
3.4 Yes, because at least she tried
Now there are answers that were given to me. The system's job is to compare the context of the student's answer to the context of the right answer so that it can grade the student's answer.
One simplistic way of doing this that I can think of (off the top of my head) is to use a string similarity metric like cosine or jaccard to identify whether certain keywords appear in a test answer and the known correct answer.
Extracting these keywords automatically could be done with part of speech tagging using NLP. For example, you could extract all nouns (and possibly verbs). Then, representing each answer as a vector of keywords, you could compare the test vector with the known correct vector.
For example, in the second question, the vector for the two possible answers could be
gust, wind, blew
hit, head, fruits
An answer like "she picked up a stick" with the keywords: picked, stick would have a very low score as compared to something like "afraid of fruit falling on her head" with keywords: fruit, falling, head.
Notes:
This can detect only wildly wrong answers. Wrong answers containing the right keywords would not be detected by this technique. :)
I'm not sure about non-English sentences. If that is what you're dealing with, you might want to take every word in the answer as a keyword (removing stopwords). This question might help as well.
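A rough sketch of the keyword-overlap idea in Python (the stopword list is made up for illustration; a real system would want a proper tokenizer, a Filipino stopword list, and stemming so that "fruit" and "fruits" match):

    def keywords(text, stopwords):
        # Lowercase, split on whitespace, strip punctuation, drop stopwords.
        words = (w.strip(".,!?\"'") for w in text.lower().split())
        return {w for w in words if w and w not in stopwords}

    def jaccard(a, b):
        # Set overlap: |intersection| / |union|, from 0.0 to 1.0.
        return len(a & b) / len(a | b) if a | b else 0.0

    STOPWORDS = {"a", "an", "the", "of", "on", "by", "in",
                 "she", "her", "up", "might", "get"}

    correct = keywords("She might get hit in the head by the fruits", STOPWORDS)
    good = keywords("afraid of fruit falling on her head", STOPWORDS)
    wrong = keywords("she picked up a stick", STOPWORDS)

    print(jaccard(correct, good))   # ~0.17: shares "head" (stemming would also match fruit/fruits)
    print(jaccard(correct, wrong))  # 0.0: no overlap at all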

nlp: alternate spelling identification

Help by editing my question title and tags is greatly appreciated!
Sometimes one participant in my corpus of "conversations" will refer to another participant using a nickname, usually an abbreviation or misspelling, but hereafter I'll just say "nicknames". Let's say I'm willing to manually tell my software whether or not I think various possible nicknames are in fact nicknames, but I want the software to come up with a list of possible matches between the handles that identify people and the potential nicknames. How would I go about doing that?
Background on me and then my corpus: I have no experience doing natural language processing, but I'm a competent data analyst with R. My data is produced by 70 teams, each forecasting the likelihood of 100 distinct events occurring some time in the future. The result is that I have 70 x 100 = 7000 text files, containing the stream of forecasts participants make and the comments they include with their forecasts. I'll paste a very short snip of one of these text files below; this one had to do with whether the Malian government would enter talks with the MNLA:
02/12/2013 20:10: past_returns answered Yes: (50%)
I hadn't done a lot of research when I put in my previous
placeholder... I'm bumping up a lot due to DougL's forecast
02/12/2013 19:31: DougL answered Yes: (60%)
Weak President Traore wants talks if MNLA drops territorial claims.
Mali's military may not want talks. France wants talks. MNLA sugggests
it just needs autonomy. But in 7 weeks?
02/12/2013 10:59: past_returns answered No: (75%)
placeholder forecast...
http://www.irinnews.org/Report/97456/What-s-the-way-forward-for-Mali
My initial thoughts: Obviously I can start by providing the names I'm looking to match things up with... in the above example they would be past_returns and DougL (though there is no use of nicknames in the above). I wouldn't think it'd be that hard to get a computer to guess at minor misspellings (though I wouldn't personally know where to start). I can imagine that other tricks could be used, like assuming that a string is more likely to be a nickname if it is used much more by one team than by other teams. A nickname is more likely to refer to someone who spoke recently than to someone who spoke long ago, or not at all, on this question. And nicknames should be used in sentences in a manner similar to the way the full name/screen name is typically used in the corpus. But I'm interested to hear about simple approaches, as well as ones that try to consider more sophisticated techniques.
This could get about as complicated as you want to make it. From the semi-linguistic side of things, research topics would include Levenshtein Distance (for detecting minor misspellings of known names/nicknames) and Named Entity Recognition (for the task of detecting names/nicknames in the first place). Actually, NER's worth reading about, but existing systems might not help you much in your domain of forum handles and nicknames.
The first rough idea that comes to mind is that you could run a tokenized version of your corpus against an English dictionary (perhaps a dataset compiled from Wiktionary or something like WordNet) to find words that are candidates for names, then filter those through some heuristics (do they start with the same letters as known full names? Do they have a low Levenshtein distance from known names? Are they used more than once?).
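As a concrete starting point for those heuristics, here's a short Python sketch (the handles are from the question; the prefix length and distance threshold are made-up knobs to tune):

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance, one row at a time.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def nickname_candidates(token, handles, max_dist=2):
        # A token is a candidate nickname for a handle if it shares a short
        # prefix with it or sits within a small edit distance of it.
        t = token.lower()
        return [h for h in handles
                if h.lower().startswith(t[:3]) or levenshtein(t, h.lower()) <= max_dist]

    handles = ["past_returns", "DougL"]
    print(nickname_candidates("Dougie", handles))  # ['DougL'] ("Dougie" is a hypothetical token)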
You could also try some clustering or supervised ML algorithms against the non-word tokens. That might reveal some non-"word" tokens that often occur in the same threads as a given username; again, heuristics could help rule out some false positives.
Good luck; sounds like a fun problem - hope I mentioned at least one thing you hadn't already thought of.

How to categorize and tabularize free-form answers to a question in a survey?

I want to analyze answers to a web survey (Git User's Survey 2008, if anyone is interested). Some of the questions were free-form questions, like "How did you hear about Git?". With more than 3,000 replies, analyzing them entirely by hand is out of the question (especially since there are quite a few free-form questions in this survey).
How can I group those replies (probably based on the key words used in response) into categories at least semi-automatically (i.e. program can ask for confirmation), and later how to tabularize (count number of entries in each category) those free-form replies (answers)? One answer can belong to more than one category, although for simplicity one can assume that categories are orthogonal / exclusive.
What I'd like to know is at least keyword to search for, or an algorithm (a method) to use. I would prefer solutions in Perl (or C).
Possible solution No 1. (partial): Bayesian categorization
(added 2009-05-21)
One solution I thought about would be to use something like the algorithm (and the mathematical method behind it) used for Bayesian spam filtering, only instead of one or two categories ("spam" and "ham") there would be more, and the categories themselves would be created adaptively / interactively.
Text::Ngrams + Algorithm::Cluster
Generate some vector representation for each answer (e.g. word count) using Text::Ngrams.
Cluster the vectors using Algorithm::Cluster to determine the groupings and also the keywords which correspond to the groups.
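Not Perl, but as a sketch of the same two steps in Python with scikit-learn, if that helps (the replies and cluster count here are invented; TfidfVectorizer stands in for the n-gram counts):

    from collections import Counter
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    replies = [  # stand-ins for the free-form survey answers
        "read about it on a coworker's blog",
        "a coworker told me about git",
        "saw a blog post about git",
        "my coworker uses it at work",
    ]

    # Step 1: a vector representation of each answer.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(replies)

    # Step 2: cluster the vectors, then pull out each cluster's top terms
    # as candidate keywords to show a human for confirmation.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    terms = vectorizer.get_feature_names_out()
    for c in range(2):
        top = [terms[i] for i in km.cluster_centers_[c].argsort()[::-1][:3]]
        print(f"cluster {c}: {top}")

    # Tabularize: count the replies landing in each category.
    print(Counter(km.labels_))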
You are not going to like this. But: If you do a survey and you include lots of free-form questions, you better be prepared to categorize them manually. If that is out of the question, why did you have those questions in the first place?
I've brute forced stuff like this in the past with quite large corpora. Lingua::EN::Tagger, Lingua::Stem::En. Also the Net::Calais API is (unfortunately, as Thomson Reuters are not exactly open source friendly) pretty useful for extracting named entities from text. Of course once you've cleaned up the raw data with this stuff, the actual data munging is up to you. I'd be inclined to suspect that frequency counts and a bit of Mechanical Turk cross-validation of the output would be sufficient for your needs.
Look for common words as keywords, but throw away meaningless ones like "the", "a", etc. After that you get into natural language stuff that is beyond me.
It just dawned on me that the perfect solution for this is AAI (Artificial Artificial Intelligence). Use Amazon's Mechanical Turk. The Perl bindings are Net::Amazon::MechanicalTurk. At one penny per reply with a decent overlap (say three humans per reply) that would come to about $90 USD.

Agile - User Story Definitions [closed]

I'm writing a small app for my friend's business, and thought I'd take the opportunity to brush up on some Agile Project Management training I did at the start of the year.
I (and I think, my current organisation!) have always struggled with gathering requirements in the form of User Stories, which take the form:
As a [User Type] I want [feature] so that [some benefit]
I'm always tempted to miss out the beginning and end, and just leave the feature - but this then just becomes requirements gathering the old way!
But I don't want to just make it fit, so that I can say 'I'm doing Agile'.... for example, if I know that the user is to be presented with a list of items, then the reason is self-evident, is it not?
e.g.
As a [Store Manager] I want [to see a list of Stock Items] so that ... ?
Is it normal practice to leave out the [so that] clause?
We used to miss it out as well. And by leaving it out we missed a lot.
To understand the feature properly, and not just do the thing right but DO THE RIGHT THING, it is key to know WHY the feature is wanted, and for that the next key is WHO (the role).
In DDD terms, the stakeholder. Stakeholders can be all sorts of people, everyone who cares: from programmers and DB admins to all the types of users.
So, first understand who the stakeholder is; then you know 50% of WHY they care; then the benefit; and then it is already almost obvious WHAT to implement.
Try to not just write "as a user". Specify: "as a store manager", or even "as the lead of the shift responsible for closing the day", I need ... so that ...
Maybe you can implement something different which will give the same stakeholder an even better benefit!!!
Try: "To Achieve [Business Value] As [User] I need [Feature]."
The goal is to focus on the value the feature delivers. It helps you think in vertical slices, which reduces pure "technical tasks" that aren't visible. It's not an easy transition, but when you start thinking vertically you start really being able to reduce the waste in your process.
Another way is to think of the acceptance tests that your customer could write to ensure the feature would work. It's a short jump to then using something like FitNesse to automate those tests.
No, it's actually not obvious - there are a lot of reasons to want to see a list, and a lot of things you might want to do with it - scan it for some info, get an overview, print it, copy and paste it into a Word document, etc. And what exactly it is will give you valuable hints on reasonable implementation details - formatting of the list, exact content; or even a hint that a different feature might be a better idea to satisfy that need. Don't be surprised to find out that the reason actually is "so that I can count the number of entries"...
Of course, this might in fact not apply to you. My actual point is that there are reasons that people came up with this template - and there are also reasons that a lot of experienced people don't actually use it. And when you are new to the practice, you are not in a good position to assess all the pros and cons of following it, so I'd highly recommend simply trying to follow it closely for some time. You might be surprised by its usefulness - or not, in which case you still learned something and can drop it with a clear conscience... :)
User Stories are another way of saying you need to interview your users to find out what they want and what problems they are trying to solve. That's the heart of having this in agile development. If the form is not working for you, then take a step back and try a different approach that feels more natural to you or better suited to your capabilities as a writer.
In short, don't feel like you have to be in a straitjacket. The important thing is that you follow the spirit of the methodology.
In this specific case you want to get a list of what problems the user has, why they are problems, and what they think will help them.
I think you should really try to get a reason defined, even if it may seem obvious. If you can't come up with a reason then why build the feature in the first place? Also the reason may point out other deficiencies in the design that could trigger improvements in other areas.
I often categorize my stories by the user/persona that it primarily relates to, thus I don't put the user's identity in the story title. My stories also are bigger than some agile methodologies suggest. Usually, I start with a title. I use it for planning purposes. Once I get close to actually working on that story, I flesh it out with some details -- basic idea, constraints, assumptions, related stories -- so that I capture more of the information that I know about it. I also keep my stories in a wiki, not on note cards. I understand the trade-off -- i.e., I may spend too much time on details before I need them, but I am able to capture and share it with, typically, off-site customers easily.
The bottom line for me is that Agile is a philosophy, rather than a specification. There are particular implementations that may (strongly) suggest that you do things a certain way and may be non-negotiable on some items. For example, it's hard to say you're doing XP if you don't pair program. In general, though, I would say that most agilists would say that you ought to do those things that work for you, in the way that they work for you -- as long as they are consistent with the general principles, you can still call yourself agile. The general principles would include things like release early/release often, unit testing, short iterations, acknowledge that change will happen, delay detailed planning until you are ready to implement, ...
Bottom line for me: if the stories work for you without the user and rationale -- as long as you understand who the user is and why they want something -- do it however you want. Just don't require a complete specification before you start implementing.

Resources