Hash table search and Insert/Remove in O(1) - hashmap

How can it be O(1)? If you have an empty hash table it is obviously constant, but when the number of elements increases and collisions start, wouldn't that also increase the complexity?
I mean, search time will increase for tables with more elements, so how is this constant?

The extremely simple toy "static" hash table implementations often used to introduce students to hash tables do have the problems you pointed out.
In practice, many popular hash algorithm implementations are "dynamic" -- rather than allowing the table to fill up and the hash chains to get longer and longer, leading to the problems you pointed out, during an insert() they pro-actively recognize when there are "too many" collisions, and then do something about it.
There are two different ways that a hash table can get lots of collisions, and both of them have known solutions. Often an implementation doesn't bother trying to figure out which one is the real problem -- i.e., it doesn't know or care whether the cause is (a) or (c) -- it just goes ahead and does both (b) and (d):
(a) If there are a lot of data items and not very many buckets, there will be collisions. A (separate chaining) hash table with 100 slots, after trying to insert 150 items, will inevitably have dozens of collisions. (Because of the birthday effect, you'll probably get dozens of collisions a lot sooner than that.) To avoid this problem,
(b) resize the hash table, which guarantees there is plenty of empty space to insert more items -- i.e., the algorithm maintains a good load factor.
(c) With the particular items it needs to store, the particular hash function currently in use "coincidentally" sends many items to the same bucket, many more than one would expect from the birthday effect. To avoid this problem,
(d) switch to a different randomly-picked hash function from a universal family of hash functions. For any particular set of data -- even if the data has been carefully selected so the current hash function sends most of the data to the same bucket -- most of the other hash functions in a universal family will spread that particular set of data out much more evenly.
In order to prevent the problems you pointed out,
practical in-RAM hash table implementations have code in the insert() function that occasionally triggers some "extra work" -- a resize that copies all the entries, requiring O(n+k) time and performing (b) and (d) -- but this happens so rarely that n inserts require a total of O(n) time, so the average insert time is O(1).
Many hash table implementations -- implementations of separate chaining, linear probing, etc. -- resize often enough that hash table lookups require O(1) comparisons on average, although they have no guarantees on the worst-case hash-table lookup time.
A few hash table implementations -- implementations of hopscotch hashing, cuckoo hashing, dynamic perfect hashing, etc. -- do even more "extra work" in the form of (b) and repeating (d) however many times is necessary to guarantee that hash table lookups require O(1) comparisons even in the worst case.
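To make that "extra work during insert()" concrete, here is a minimal sketch of a separate-chaining table that resizes itself; the 0.75 load-factor threshold and the doubling policy are illustrative choices, not anything prescribed by the answer above.

class ChainedHashTable:
    """Toy separate-chaining table that resizes to keep the load factor bounded."""

    def __init__(self, num_buckets=8, max_load=0.75):
        self.buckets = [[] for _ in range(num_buckets)]
        self.count = 0
        self.max_load = max_load

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)          # overwrite an existing key
                return
        bucket.append((key, value))
        self.count += 1
        if self.count > self.max_load * len(self.buckets):
            self._resize(2 * len(self.buckets))   # the occasional O(n) "extra work"

    def search(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self, new_size):
        old_entries = [item for bucket in self.buckets for item in bucket]
        self.buckets = [[] for _ in range(new_size)]
        self.count = 0
        for k, v in old_entries:
            self.insert(k, v)

Each resize copies every entry, but because the table doubles each time, that cost amortizes out to O(1) extra work per insert. Note that this sketch only implements fix (b); a table that also implemented fix (d) would pick a fresh hash function from a universal family whenever it rehashes.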

Related

Data Structure Question: Is there a link between the size of a list in a chaining implementation of hash maps and its load factor?

For example, if I have n keys and m slots in the hash map, the average size of a linked list starting from a slot would be n/m. Am I correct in thinking this? Again, I'm talking about an average. Thanks in advance!
I'm trying to learn data structures.
As you say, the average size of a single list is generally going to be the table's load factor; but this assumes that the "Simple Uniform Hashing Assumption" (SUHA) holds for your hash table (more specifically, for its hash function(s) and expected input keys): simply put, we assume that the hash function distributes elements to buckets uniformly, as well as independently of one another.
To expand a little, and in different words:
We assume that if we choose a new item randomly (imagine sampling an item from the probability distribution that characterizes our inputs), then there is an equal chance that the item we end up with will be mapped to any of the m buckets. (A chance of 1/m.)
Furthermore, that this probability is unaffected given the presence (or absence) of any other elements in any of the buckets.
This is helpful because from this we can conclude that the probability for an item to be sorted into a given bucket is always 1/m, regardless of any other circumstances; from this it directly follows that the expected (average) length of a single bucket's list will be n/m (we insert n elements into the table, and for each one, sort it into this given list at a probability of 1/m).
To see that this is important, we might imagine a case in which it doesn't hold: for instance, if we're facing some kind of "attack" and our inputs are engineered to all hash into the same bucket, or even just with a high probability. In this case SUHA no longer holds, and clearly neither does the link you've asked about between the length of a list and the load factor.
This is part of the reason that it is important to choose a good hash function for your use case: without it, the assumption may not hold which could have a harmful effect on your lookup times.
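As a quick empirical check of the n/m claim (my own sketch, not part of the answer above), you can simulate a hash function that satisfies SUHA and measure the expected length of one particular bucket's chain:

import random

def expected_chain_length(n, m, trials=500):
    # Under SUHA each of the n keys lands in a uniformly random bucket,
    # independently of the others; count how many land in bucket 0.
    total = 0
    for _ in range(trials):
        total += sum(1 for _ in range(n) if random.randrange(m) == 0)
    return total / trials

n, m = 1000, 128
print(expected_chain_length(n, m))  # comes out close to n/m
print(n / m)                        # 7.8125

If instead every key were engineered to hash to bucket 0 (the "attack" case described above), the measured chain length would be n rather than n/m, and lookups in that bucket would degrade to a linear scan.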

Hash Table that tries to hash Strings uniformly?

I am currently in a Data Structures course nearing the end of the semester, and have been assigned a project in which we are implementing a Linked Hash Table to store and retrieve keys. We have been given a pretty large amount of freedom with how we are going to design our hash table implementation, but for bonus points we were told to try and find a hash function that distributes our keys (unique strings) close to uniformly and randomly throughout the table.
I have chosen to use the ELF hash, seen here http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
My question is as follows: With this hash function an integer is returned, but I am having trouble seeing how this can be used to specify an index to put my key at in the hash table. I could simply do: index = ELFhash(String key) % tableSize, but does this defeat the purpose of using the ELF hash in the first place?
Also I have chosen my collision resolution strategy to be double hashing. Is there a good way to determine an appropriate secondary hashing function to find your jumps? My hash table is not going to be a constant size (sets of strings will be added and removed from the set of data I am hashing, and I will be rehashing them after each iteration of adding and removing to have a load factor of .75), so it is hard for me to just do something like k % n where n is a number that is relatively prime with my table size.
Thanks for taking the time to read my question, and let me know what you think!
You're correct to think about "wrapping bias," but for most practical purposes, it's not going to be a problem.
If the hash table is of size N and the hash value is in the range [0..M), then let k = floor(M/N). Any hash value in the range [0..k*N) is a "good" one in that, using mod N as a map, each hash bucket is mapped by exactly k hash values. The hash values in [k*N..M) are "bad" in that if you use them, the corresponding M - k*N lowest hash buckets each map from one additional hash value. Even if the hash function is perfect, these buckets have a higher probability of receiving a given value.
The question, though, is "How much higher?" That depends on M and N. If the hash value is an unsigned int in [0..2^32), and -- having read Knuth and others -- you decide to pick a prime number of buckets around a thousand, say 1009, what happens?
floor(2^32 / 1009) = 4256657
The number of "bad" values is
2^32 - 4256657 * 1009 = 383
Consequently, all buckets are mapped from 4256657 "good" values, and 383 of them get one additional unwanted "bad" value, for 4256658 in total. Thus the "bias" is 1/4,256,657.
It's very unlikely you'll find a hash function where a 1 in 4 million probability difference between buckets will be noticeable.
Now if you redo the calculation with a million buckets instead of a thousand, then things look a bit different. In that case if you're a bit OC, you might want to switch to a 64-bit hash.
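If you want to redo that arithmetic yourself, a few lines suffice; the first case reproduces the numbers above, and the roughly-one-million-bucket case is my own pick, just to see how the bias changes.

M = 2 ** 32                      # number of possible 32-bit hash values
for N in (1009, 1_000_003):      # ~a thousand and ~a million buckets
    k = M // N                   # "good" hash values mapped to every bucket
    bad = M - k * N              # buckets that receive one extra hash value
    print(N, k, bad, 1 / k)      # 1/k is the resulting bias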
One additional thing: The ELF hash is pretty unlikely to give absolutely terrible results, and it's quite fast, but there are much better hash functions. A reasonably well-regarded one you might want to give a try is Murmur 32. (The Wiki article mentions that the original algorithm has some weaknesses that can be exploited for DoS attacks, but for your application it will be fine.) I'm sure your prof doesn't want you to copy code, but the Wikipedia page has it complete. It would be interesting to implement the ELF hash yourself and try it against Murmur to see how they compare.
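For reference, here is a Python transcription of the ELF hash as it is usually presented in C (worth double-checking against the tutorial linked in the question), together with the mod-N step the question asks about; taking the hash modulo the table size does not defeat the purpose of the hash, it only introduces the small "wrapping bias" discussed above.

def elf_hash(key):
    # Classic ELF/PJW string hash, kept to 32 bits like the C version.
    h = 0
    for byte in key.encode("utf-8"):
        h = ((h << 4) + byte) & 0xFFFFFFFF
        g = h & 0xF0000000
        if g:
            h ^= g >> 24
        h &= ~g & 0xFFFFFFFF
    return h

def bucket_index(key, table_size):
    # The usual way to turn the 32-bit hash into an array index.
    return elf_hash(key) % table_size

print(bucket_index("Hello, world", 1009))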

Iterate over hash function though it reduces search space

I was reading this article regarding the number of times you should hash your password.
A salt is added to password before the password is hashed to safeguard against dictionary attacks and rainbow table attacks.
The commenters in the answer by ORIP stated
hashing a hash is not something you should do, as the possibility of
hash collision increase with each iteration which may reduce the
search space (salt doesn't help), but this is irrelevant for
password-based cryptography. To reach the 256-bit search space of this
hash you'd need a completely random password, 40 characters long, from
all available keyboard characters (log2(94^40))
The answer by erickson recommended
With pre-computation off the table, an attacker has to compute the hash
on each attempt. How long it takes to find a password now depends
entirely on how long it takes to hash a candidate. This time is
increased by iteration of the hash function. The number of iterations is
generally a parameter of the key derivation function; today, a lot of
mobile devices use 10,000 to 20,000 iterations, while a server might
use 100,000 or more. (The bcrypt algorithm uses the term "cost
factor", which is a logarithmic measure of the time required.)
My questions are
1) Why do we iterate over the hash function, since each iteration reduces the search space and hence makes it easier to crack the password?
2) What does "search space" mean?
3) Why is the reduction of search space irrelevant for password-based cryptography?
4) When is reduction of search space relevant?
Let's start with the basic question: What is a search space?
A search space is the set of all values that must be searched in order to find the one you want. In the case of AES-256, the total key space is 2^256. This is a really staggeringly large number. This is the number that most people are throwing around when they say that AES cannot be brute forced.
The search space of "8-letter sequences of lowercase letters" is 26^8, or about 200 billion (~2^37), which from a cryptographic point of view is a tiny, insignificant number that can be searched pretty quickly. It's less than 3 days at 1,000,000 checks per second. Real passwords are chosen out of much smaller sets, since most people don't type 8 totally random letters. (You can up this with upper case and numbers and symbols, but people pick from a tiny set of those, too.)
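To sanity-check those figures (just the arithmetic from the paragraph above, nothing more):

import math

space = 26 ** 8                      # 8-letter lowercase passwords
print(space)                         # 208827064576, "about 200 billion"
print(math.log2(space))              # ~37.6, i.e. roughly 2^37
print(space / 1_000_000 / 86_400)    # ~2.4 days at 1,000,000 checks per second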
OK, so people like to type short, easy passwords, but we want to make them hard to brute-force. So we need a way to convert "easy to guess passwords" into "hard to guess key." We call this a Key Derivation Function (KDF). We need two things for it:
The KDF must be "computationally indistinguishable from random." This means that there is no inverse of the hash function that can be computed more quickly than a brute force search.
The KDF should take non-trivial time to compute, so that brute forcing the tiny password space is still very difficult. Ideally it should be made as difficult as brute forcing the entire key space, but it is rare to push it that far.
The first point is the answer to your question of "why don't we care about collisions?" It is because collisions, while they could possibly exist, cannot be predicted in a computationally efficient manner. If collisions could be efficiently predicted, then your KDF function is not indistinguishable from random.
A KDF is not the same as just "repeated hashing." Repeated hashing can be distinguished from random, and is subject to significant attacks (most notably length-extension attacks).
PBKDF2, as a specific KDF example, is proven to be computationally indistinguishable from random, as long as it is provided with a pseudorandom function (PRF). A PRF is defined as itself being computationally indistinguishable from random. PBKDF2 uses HMAC, which is proven to be a PRF as long as it is provided a hashing function that is at least weakly collision resistant (the requirement is actually a bit weaker than even that).
Note the word "proven" here. Good cryptography lives on top of mathematical security proofs. It is not just "tie a lot of knots and hope it holds."
So that's a little tiny bit of the math behind why we're not worried about collisions, but let's also consider some intuition about it.
The total number of 16-character (absurdly long) passwords that can be easily typed on a common English keyboard is about 95^16 or 2^105 (that doesn't count the 15, 14, 13, etc length passwords, but since 95^16 is almost two orders of magnitude larger than 95^15, it's close enough). Now, consider that for each password, we're going to randomly map it to 10,000 intermediate keys (via 10,000 iterations of PBKDF2). That gets us up to 2^118 random choices that we hope never collide in our hash. What are the chances?
Well, 2^256 (our total space) divided by 2^118 (our keys) is 2^138. That means we're using much less than 10^-41 of the space for all passwords that could even be remotely likely. If we're picking these randomly (and the definition of a PRF says we are), the chances of two colliding are, um, small. And if two somehow did, no attacker would ever be able to predict it.
Take away lesson: Use PBKDF2 (or another good KDF like scrypt or bcrypt) to convert passwords into keys. Use a lot of iterations (10,000-100,000 at a minimum). Do not worry about the collisions.
You may be interested in a little more discussion of this in Brute-Forcing Passwords.
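As a concrete illustration of that take-away, here is a minimal sketch using the PBKDF2 implementation in Python's standard library (hashlib.pbkdf2_hmac); the 16-byte salt and the 100,000-iteration count are illustrative values, not recommendations from the answer above.

import hashlib, os

def derive_key(password, salt=None, iterations=100_000):
    """Stretch a password into a 256-bit key with PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)              # random per-password salt
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, key           # store all three next to the hash

# Verification simply re-runs the KDF with the stored salt and iteration count.
salt, n, key = derive_key("correct horse battery staple")
assert derive_key("correct horse battery staple", salt, n)[2] == key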
1) As the second snippet said, each iteration makes each "guess" a hacker makes take longer, therefore increasing the total time it will take them to crack an average password.
2) Search space is all the possible hashes for a password after however many iterations you are using. Each iteration decreases the search space.
3) Because of #1, as the size of the search space decreases, the time to check each possibility increases, balancing out that negative effect.
4) According to the second snippet, and answers #1 and #3, it actually isn't.
I hope this makes sense, it's a very complicated topic.
The reason to iterate is to make it harder for an attacker to brute force the hash. If you have a single round of hashing for a value, then in order to precompute a table for cracking that hash, you need to do 1 * keyspace hashes. If you do 1000 hashes of the value, then it would require the work of 1000 * keyspace.
Search space generally refers to the total number of combinations of characters that could make up a password.
I would say that the reduction of search space is irrelevant because passwords are generally not cracked by attempting 0000000, then 0000001, etc. They are instead attempted to be cracked by using dictionaries and combinatorics. There is essentially a realm of passwords that are likely to get cracked (like "password", "abcdef1", "goshawks", etc.), but creating a larger work factor will make it much more difficult for an attacker to hit all of the likely passwords in the space. Combining that with a salt, means they have to do all of the work for those likely passwords, for every hash they want to crack.
The reduction in search space becomes relevant if you are trying to crack something that is random and could take up any value in the search space.

Why is it called rainbow table?

Does anyone know why it is called a rainbow table? I just remembered we learned about an attack called a "dictionary attack". Why is it not called a dictionary?
Because it contains the entire "spectrum" of possibilities.
A dictionary attack is a brute-force technique of just trying possibilities. Like this (Python pseudo-code):
mypassworddict = ["password", "123456", "letmein", "qwerty"]  # candidate passwords
for password in mypassworddict:
    trypassword(password)
However, a rainbow table works differently, because it's for inverting hashes. A high level overview of a hash is that it has a number of bins:
bin1, bin2, bin3, bin4, bin5, ...
Which correspond to binary parts of the output string - that's how the string ends up the length it is. As the hash proceeds, it affects differing parts of the bins in different ways. So the first byte of input (or whatever input unit is accepted) affects, say, simplistically, bins 3 and 4. The next input affects 2 and 6. And so on.
A rainbow table is a computation of all the possibilities of a given bin, i.e. all the possible inverses of that bin, for every bin... that's why it ends up so large. If the first bin value is 0x1 then you need to have a lookup list of all the values of bin2 and all the values of bin3 working backwards through the hash, which eventually gives you a value.
Why isn't it called a dictionary attack? Because it isn't.
As I've seen your previous question, let me expand on the detail you're looking for there. A cryptographically secure hash needs to be safe ideally from smallish input sizes up to whole files. To precompute the values of a hash for an entire file would take forever. So a rainbow table is designed on a small well understood subset of outputs, for example the permutations of all the characters a-z over a field of say 10 characters.
This is why password advice for defeating dictionary attacks works here. The more subsets of the whole possible set of inputs you put into your input for the hash, the more a rainbow table needs to contain to search it. The data sizes required end up stupidly big and so does the time to search. So, think about it:
If you have an input that is [a-z] for 5-8 characters, that's not too bad a rainbow table.
If you increase the length to 42 characters, that's a massive rainbow table. Each input affects the hash and so the bins of said hash.
If you throw numbers into your search requirement [a-z][0-9] you've got even more searching to do.
Likewise [A-Za-z0-9]. Finally, stick in [\w] i.e. any printable character you can think of, and again, you're looking at a massive table.
So, making passwords long and complicated makes rainbow tables start taking Blu-ray-sized discs of data. Then, as per your previous question, you start adding in salting and key derivation functions and you make a general solution to hash cracking hard(er).
The goal here is to stay ahead of the computational power available.
Rainbow is a variant of a dictionary attack (a pre-computed dictionary attack, to be exact), but it takes less space than a full dictionary (at the price of the time needed to find a key in the table). The other end of this space-time tradeoff is full search (brute force attack = zero precomputation, a lot of time).
In a rainbow table the precomputed dictionary of key-ciphertext pairs is compressed into chains. Every step in a chain is done using a different reduction (compression) function. And the table has a lot of chains, so it looks like a rainbow.
In the accompanying picture (not reproduced here), the different reduction functions K1, K2, K3 have colors like a rainbow.
The table stored in the file contains only the first and last columns, as the middle columns can be recomputed.
I don't know where the name comes from, but the differences are:
A dictionary contains a few selected items (e.g. english words), while a rainbow table contains every possible combination.
A dictionary only contains the input, while the rainbow table contains both the input and the output.
A dictionary is used to test different inputs to see if the output is valid, while a rainbow table is used for a reverse lookup, i.e. to find which input gives a specific output.
Unfortunately some of the statements above are not correct. Contrary to what has been posted, rainbow tables DO NOT contain all the possibilities for a given keyspace -- well, not the ones generated for use that I've seen. They can be generated to cover 99.9%, but due to the randomness of a hash function there is no guarantee that EVERY plaintext is covered.
Each chain is made up of links or steps, and each step is made of a hashing and a reduction function. If your chain was 100 links long you would apply the hash/reduction step that many times, then discard everything in between except the start and end points.
To find the plaintext for a given hash you simply perform the reduction/hash steps up to the length of your chain. You run the step once and check against the endpoints; if it's a miss you repeat... until you have stepped through the entire length of your chain. If there is a match you can then regenerate the chain from the start point, and you may be able to find the plaintext. If after the regeneration it is not correct, then this is a false alarm. This happens due to collisions caused by the reduction function. Since the table contains many chains, you can do a large lookup against all the chain endpoints at each step; this is essentially where the magic happens, allowing speed. It also leads to false alarms, but since you only need to regenerate chains which have matches, you save lots of time by skipping unnecessary chains.
They do not contain dictionaries... well, not the traditional tables; there are variants of rainbow tables which incorporate the use of dictionaries, though.
That's about it. There are many ways in which this process has been optimized, including removing merged/duplicate chains to create perfect tables, and storing them in different packings to save space and loading time.
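To make the chain mechanics described above concrete, here is a toy sketch of my own (not from the answer): MD5 over a 4-digit PIN keyspace with a deliberately crude, step-dependent reduction function. Real tables use far larger keyspaces, carefully designed reduction functions, and on-disk endpoint storage.

import hashlib

CHAIN_LEN = 200
KEYSPACE = 10_000                       # toy keyspace: the PINs "0000".."9999"

def H(plain):
    return hashlib.md5(plain.encode()).hexdigest()

def R(step, digest):
    # Step-dependent reduction: map a hash back into the PIN keyspace.
    return "%04d" % ((int(digest, 16) + step) % KEYSPACE)

def build_table(starts):
    table = {}                          # chain endpoint -> chain start
    for start in starts:
        plain = start
        for step in range(CHAIN_LEN):
            plain = R(step, H(plain))
        table[plain] = start
    return table

def lookup(table, target_hash):
    for i in range(CHAIN_LEN - 1, -1, -1):
        # Pretend the target sits at position i in some chain and walk to its end.
        plain = R(i, target_hash)
        for step in range(i + 1, CHAIN_LEN):
            plain = R(step, H(plain))
        if plain in table:              # possible hit: regenerate from the start
            candidate = table[plain]
            for step in range(CHAIN_LEN):
                if H(candidate) == target_hash:
                    return candidate
                candidate = R(step, H(candidate))
            # Falling through here is a false alarm caused by a reduction collision.
    return None

table = build_table("%04d" % i for i in range(0, KEYSPACE, 20))  # arbitrary chain starts
print(lookup(table, H("4242")))   # prints 4242 if a chain happens to cover it, else None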

What's Up with O(1)?

I have been noticing some very strange usage of O(1) in discussion of algorithms involving hashing and types of search, often in the context of using a dictionary type provided by the language system, or using dictionary or hash-array types accessed via array-index notation.
Basically, O(1) means bounded by a constant time and (typically) fixed space. Some pretty fundamental operations are O(1), although using intermediate languages and special VMs tends to distort one's thinking here (e.g., how does one amortize the garbage collector and other dynamic processes over what would otherwise be O(1) activities?).
But ignoring amortization of latencies, garbage-collection, and so on, I still don't understand how the leap to assumption that certain techniques that involve some kind of searching can be O(1) except under very special conditions.
Although I have noticed this before, an example just showed up in the Pandincus question, "Proper collection to use to obtain items in O(1) time in C# .NET?".
As I remarked there, the only collection I know of that provides O(1) access as a guaranteed bound is a fixed-bound array with an integer index value. The presumption is that the array is implemented by some mapping to random access memory that uses O(1) operations to locate the cell having that index.
For collections that involve some sort of searching to determine the location of a matching cell for a different kind of index (or for a sparse array with integer index), life is not so easy. In particular, if there are collisions and congestion is possible, access is not exactly O(1). And if the collection is flexible, one must recognize and amortize the cost of expanding the underlying structure (such as a tree or a hash table) to relieve congestion (e.g., high collision incidence or tree imbalance).
I would never have thought to speak of these flexible and dynamic structures as O(1). Yet I see them offered up as O(1) solutions without any identification of the conditions that must be maintained to actually have O(1) access be assured (as well as have that constant be negligibly small).
THE QUESTION: All of this preparation is really for a question. What is the casualness around O(1) and why is it accepted so blindly? Is it recognized that even O(1) can be undesirably large, even though near-constant? Or is O(1) simply the appropriation of a computational-complexity notion to informal use? I'm puzzled.
UPDATE: The Answers and comments point out where I was casual about defining O(1) myself, and I have repaired that. I am still looking for good answers, and some of the comment threads are rather more interesting than their answers, in a few cases.
The problem is that people are really sloppy with terminology. There are 3 important but distinct classes here:
O(1) worst-case
This is simple - all operations take no more than a constant amount of time in the worst case, and therefore in all cases. Accessing an element of an array is O(1) worst-case.
O(1) amortized worst-case
Amortized means that not every operation is O(1) in the worst case, but for any sequence of N operations, the total cost of the sequence is O(N) in the worst case. This means that even though we can't bound the cost of any single operation by a constant, there will always be enough "quick" operations to make up for the "slow" operations such that the running time of the sequence of operations is linear in the number of operations.
For example, the standard Dynamic Array which doubles its capacity when it fills up requires O(1) amortized time to insert an element at the end, even though some insertions require O(N) time - there are always enough O(1) insertions that inserting N items always takes O(N) time total.
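For concreteness, a bare-bones sketch of such a dynamic array (Python lists already behave this way internally; the sketch just makes the doubling step visible):

class DynamicArray:
    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.slots = [None] * self.capacity

    def append(self, item):
        if self.size == self.capacity:      # the occasional O(N) step
            self.capacity *= 2
            new_slots = [None] * self.capacity
            new_slots[:self.size] = self.slots
            self.slots = new_slots
        self.slots[self.size] = item        # the common O(1) step
        self.size += 1

Appending N items copies at most 1 + 2 + 4 + ... + N/2 < N elements across all the resizes combined, which is why the cost per append amortizes to O(1).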
O(1) average-case
This one is the trickiest. There are two possible definitions of average-case: one for randomized algorithms with fixed inputs, and one for deterministic algorithms with randomized inputs.
For randomized algorithms with fixed inputs, we can calculate the average-case running time for any given input by analyzing the algorithm and determining the probability distribution of all possible running times and taking the average over that distribution (depending on the algorithm, this may or may not be possible due to the Halting Problem).
In the other case, we need a probability distribution over the inputs. For example, if we were to measure a sorting algorithm, one such probability distribution would be the distribution that has all N! possible permutations of the input equally likely. Then, the average-case running time is the average running time over all possible inputs, weighted by the probability of each input.
Since the subject of this question is hash tables, which are deterministic, I'm going to focus on the second definition of average-case. Now, we can't always determine the probability distribution of the inputs because, well, we could be hashing just about anything, and those items could be coming from a user typing them in or from a file system. Therefore, when talking about hash tables, most people just assume that the inputs are well-behaved and the hash function is well behaved such that the hash value of any input is essentially randomly distributed uniformly over the range of possible hash values.
Take a moment and let that last point sink in - the O(1) average-case performance for hash tables comes from assuming all hash values are uniformly distributed. If this assumption is violated (which it usually isn't, but it certainly can and does happen), the running time is no longer O(1) on average.
See also Denial of Service via Algorithmic Complexity Attacks. In this paper, the authors discuss how they exploited some weaknesses in the default hash functions used by two versions of Perl to generate large numbers of strings with hash collisions. Armed with this list of strings, they generated a denial-of-service attack on some webservers by feeding them these strings, which resulted in the worst-case O(N) behavior in the hash tables used by the webservers.
My understanding is that O(1) is not necessarily constant; rather, it is not dependent on the variables under consideration. Thus a hash lookup can be said to be O(1) with respect to the number of elements in the hash, but not with respect to the length of the data being hashed or ratio of elements to buckets in the hash.
The other element of confusion is that big O notation describes limiting behavior. Thus, a function f(N) for small values of N may indeed show great variation, but you would still be correct to say it is O(1) if the limit as N approaches infinity is constant with respect to N.
O(1) means constant time and (typically) fixed space
Just to clarify these are two separate statements. You can have O(1) in time but O(n) in space or whatever.
Is it recognized that even O(1) can be undesirably large, even though near-constant?
O(1) can be impractically HUGE and it's still O(1). It is often neglected that if you know you'll have a very small data set the constant is more important than the complexity, and for reasonably small data sets, it's a balance of the two. An O(n!) algorithm can out-perform an O(1) one if the constants and sizes of the data sets are of the appropriate scale.
O() notation is a measure of the complexity - not the time an algorithm will take, or a pure measure of how "good" a given algorithm is for a given purpose.
I can see what you're saying, but I think there are a couple of basic assumptions underlying the claim that look-ups in a Hash table have a complexity of O(1).
The hash function is reasonably designed to avoid a large number of collisions.
The set of keys is pretty much randomly distributed, or at least not purposely designed to make the hash function perform poorly.
The worst case complexity of a Hash table look-up is O(n), but that's extremely unlikely given the above 2 assumptions.
A hashtable is a data structure that supports O(1) search and insertion.
A hashtable usually holds key and value pairs, where the key is used as the parameter to a function (a hash function) which will determine the location of the value in its internal data structure, usually an array.
As insertion and search only depends upon the result of the hash function and not on the size of the hashtable nor the number of elements stored, a hashtable has O(1) insertion and search.
There is one caveat, however. As the hashtable becomes more and more full, there will be hash collisions where the hash function returns an element of the array which is already occupied. This will necessitate a collision resolution in order to find another empty element.
When a hash collision occurs, a search or insertion cannot be performed in O(1) time. However, good collision resolution algorithms can reduce the number of tries to find another suitable empty spot, or increasing the hashtable size can reduce the number of collisions in the first place.
So, in theory, only a hashtable backed by an array with an infinite number of elements and a perfect hash function would be able to achieve O(1) performance, as that is the only way to avoid the hash collisions that drive up the number of required operations. Therefore, any finite-sized array will at one time or another perform worse than O(1) due to hash collisions.
Let's take a look at an example. Let's use a hashtable to store the following (key, value) pairs:
(Name, Bob)
(Occupation, Student)
(Location, Earth)
We will implement the hashtable back-end with an array of 100 elements.
The key will be used to determine an element of the array to store the (key, value) pair. In order to determine the element, the hash_function will be used:
hash_function("Name") returns 18
hash_function("Occupation") returns 32
hash_function("Location") returns 74.
From the above result, we'll assign the (key, value) pairs into the elements of the array.
array[18] = ("Name", "Bob")
array[32] = ("Occupation", "Student")
array[74] = ("Location", "Earth")
The insertion only requires the use of a hash function, and does not depend on the size of the hashtable nor its elements, so it can be performed in O(1) time.
Similarly, searching for an element uses the hash function.
If we want to look up the key "Name", we'll perform a hash_function("Name") to find out which element in the array the desired value resides.
Also, searching does not depend on the size of the hashtable nor the number of elements stored, therefore an O(1) operation.
All is well. Let's try to add an additional entry of ("Pet", "Dog"). However, there is a problem, as hash_function("Pet") returns 18, which is the same hash for the "Name" key.
Therefore, we'll need to resolve this hash collision. Let's suppose that the hash collision resolving function we used found that the new empty element is 29:
array[29] = ("Pet", "Dog")
Since there was a hash collision in this insertion, our performance was not quite O(1).
This problem will also crop up when we try to search for the "Pet" key, as trying to find the element containing the "Pet" key by performing hash_function("Pet") will always return 18 initially.
Once we look up element 18, we'll find the key "Name" rather than "Pet". When we find this inconsistency, we'll need to resolve the collision in order to retrieve the correct element which contains the actual "Pet" key. Resolving a hash collision is an additional operation which makes the hashtable not perform at O(1) time.
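Here is the same example as a runnable sketch, with two stand-ins for things the answer leaves abstract: Python's built-in hash plays the role of hash_function (so the indices will not literally be 18, 32 and 74), and linear probing is used as one possible collision-resolution strategy.

TABLE_SIZE = 100
table = [None] * TABLE_SIZE                 # the 100-element backing array

def hash_function(key):
    return hash(key) % TABLE_SIZE           # stand-in for the answer's hash_function

def insert(key, value):
    i = hash_function(key)                  # O(1) when the slot is free
    while table[i] is not None and table[i][0] != key:
        i = (i + 1) % TABLE_SIZE            # collision: probe the next slot
    table[i] = (key, value)

def search(key):
    i = hash_function(key)
    while table[i] is not None:
        if table[i][0] == key:
            return table[i][1]
        i = (i + 1) % TABLE_SIZE            # collision: keep probing
    raise KeyError(key)

insert("Name", "Bob")
insert("Occupation", "Student")
insert("Location", "Earth")
insert("Pet", "Dog")
print(search("Pet"))                        # "Dog", possibly after a few extra probes

To keep the sketch short it never resizes and assumes the table never fills up; a real implementation would grow the array well before that happens.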
I can't speak to the other discussions you've seen, but there is at least one hashing algorithm that is guaranteed to be O(1).
Cuckoo hashing maintains an invariant so that there is no chaining in the hash table. Insertion is amortized O(1), retrieval is always O(1). I've never seen an implementation of it, it's something that was newly discovered when I was in college. For relatively static data sets, it should be a very good O(1), since it calculates two hash functions, performs two lookups, and immediately knows the answer.
Mind you, this is assuming the hash calculation is O(1) as well. You could argue that for length-K strings, any hash is minimally O(K). In reality, you can bound K pretty easily, say K < 1000. O(K) ~= O(1) for K < 1000.
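For the curious, here is a rough sketch of the two-table, two-hash-function idea (my own illustration, not a published implementation; the salted use of Python's hash, the eviction limit of 32, and the grow-and-rehash policy are arbitrary choices):

import random

class CuckooTable:
    """Every key lives in one of exactly two slots, one per table."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.tables = [[None] * capacity, [None] * capacity]
        self.salts = (random.random(), random.random())

    def _slot(self, which, key):
        return hash((self.salts[which], key)) % self.capacity

    def get(self, key):
        # Worst-case O(1): at most two probes, one per table.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        raise KeyError(key)

    def put(self, key, value):
        # Amortized O(1): may have to kick existing entries to their other slot.
        for which in (0, 1):
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)    # overwrite existing key
                return
        entry = (key, value)
        for _ in range(32):                             # bounded eviction sequence
            for which in (0, 1):
                i = self._slot(which, entry[0])
                self.tables[which][i], entry = entry, self.tables[which][i]
                if entry is None:
                    return
        self._rehash(entry)                             # probable cycle: rebuild

    def _rehash(self, pending):
        entries = [e for t in self.tables for e in t if e is not None] + [pending]
        self.__init__(self.capacity * 2)                # bigger tables, fresh salts
        for k, v in entries:
            self.put(k, v)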
There may be a conceptual error as to how you're understanding Big-Oh notation. What it means is that, given an algorithm and an input data set, the upper bound for the algorithm's run time depends on the value of the O-function when the size of the data set tends to infinity.
When one says that an algorithm takes O(n) time, it means that the runtime for an algorithm's worst case depends linearly on the size of the input set.
When an algorithm takes O(1) time, the only thing it means is that, given a function T(f) which calculates the runtime of a function f(n), there exists a natural positive number k such that T(f) < k for any input n. Essentially, it means that the upper bound for the run time of an algorithm is not dependent on the size of the input, and has a fixed, finite limit.
Now, that does not mean in any way that the limit is small, just that it's independent of the size of the input set. So if I artificially define a bound k for the size of a data set, then its complexity will be O(k) == O(1).
For example, searching for an instance of a value on a linked list is an O(n) operation. But if I say that a list has at most 8 elements, then O(n) becomes O(8) becomes O(1).
In this case, if we used a trie data structure as a dictionary (a tree of characters, where the leaf node contains the value for the string used as key), and the key is bounded, then its lookup time can be considered O(1) (if I define a character field as having at most k characters in length, which is a reasonable assumption for many cases).
For a hash table, as long as you assume that the hashing function is good (randomly distributed) and sufficiently sparse so as to minimize collisions, and rehashing is performed when the data structure is sufficiently dense, you can indeed consider it an O(1) access-time structure.
In conclusion, O(1) time may be overrated for a lot of things. For large data structures the complexity of an adequate hash function may not be trivial, and sufficient corner cases exist where the amount of collisions leads it to behave like an O(n) data structure, and rehashing may become prohibitively expensive. In which case, an O(log(n)) structure like an AVL tree or a B-tree may be a superior alternative.
In general, I think people use them comparatively without regard to exactness. For example, hash-based data structures are O(1) (average) look up if designed well and you have a good hash. If everything hashes to a single bucket, then it's O(n). Generally, though one uses a good algorithm and the keys are reasonably distributed so it's convenient to talk about it as O(1) without all the qualifications. Likewise with lists, trees, etc. We have in mind certain implementations and it's simply more convenient to talk about them, when discussing generalities, without the qualifications. If, on the other hand, we're discussing specific implementations, then it probably pays to be more precise.
HashTable look-ups are O(1) with respect to the number of items in the table, because no matter how many items you add to the table, the cost of hashing a single item is pretty much the same, and computing the hash will tell you the address of the item.
To answer why this is relevant: the OP asked about why O(1) seemed to be thrown around so casually when in his mind it obviously could not apply in many circumstances. This answer explains that O(1) time really is possible in those circumstances.
Hash table implementations are in practice not "exactly" O(1); if you test one, you'll find they average around 1.5 lookups to find a given key across a large dataset (due to the fact that collisions DO occur, and upon colliding, a different location must be assigned).
Also, in practice, HashMaps are backed by arrays with an initial size that is "grown" to double size when it reaches 70% fullness on average, which gives a relatively good addressing space. After 70% fullness, collision rates grow faster.
Big O theory states that if you have a O(1) algorithm, or even an O(2) algorithm, the critical factor is the degree of the relation between input-set size and steps to insert/fetch one of them. O(2) is still constant time, so we just approximate it as O(1), because it means more or less the same thing.
In reality, there is only 1 way to have a "perfect hashtable" with O(1), and that requires:
A Global Perfect Hash Key Generator
An Unbounded addressing space.
(Exception case: if you can compute in advance all the permutations of permitted keys for the system, and your target backing store address space is defined to be the size where it can hold all keys that are permitted, then you can have a perfect hash, but it's a "domain limited" perfection.)
Given a fixed memory allocation, it is not plausible in the least to have this, because it would assume that you have some magical way to pack an infinite amount of data into a fixed amount of space with no loss of data, and that's logistically impossible.
So retrospectively, getting O(1.5) which is still constant time, in a finite amount of memory with even a relatively Naïve hash key generator, I consider pretty damn awesome.
A suffix note: I use O(1.5) and O(2) here. These don't actually exist in big-O. They are merely what people who don't know big-O assume the rationale is.
If something takes 1.5 steps to find a key, or 2 steps to find that key, or 1 step to find that key, but the number of steps never exceeds 2 and whether it takes 1 step or 2 is completely random, then it is still Big-O of O(1). This is because no matter how many items you add to the dataset, it still maintains the <2 steps. If for all tables > 500 keys it takes 2 steps, then you can assume those 2 steps are in fact one step with 2 parts, ... which is still O(1).
If you can't make this assumption, then you're not thinking in Big-O terms at all, because then you must use the number which represents the number of finite computational steps required to do everything, and "one step" is meaningless to you. Just get it into your head that there is NO direct correlation between Big-O and the number of execution cycles involved.
O(1) means, exactly, that the algorithm's time complexity is bounded by a fixed value. This doesn't mean it's constant, only that it is bounded regardless of input values. Strictly speaking, many allegedly O(1) time algorithms are not actually O(1) and just go so slowly that they are bounded for all practical input values.
Yes, garbage collection does affect the asymptotic complexity of algorithms running in the garbage collected arena. It is not without cost, but it is very hard to analyze without empirical methods, because the interaction costs are not compositional.
The time spent garbage collecting depends on the algorithm being used. Typically modern garbage collectors toggle modes as memory fills up to keep these costs under control. For instance, a common approach is to use a Cheney-style copying collector when memory pressure is low, because it pays a cost proportional to the size of the live set in exchange for using more space, and to switch to a mark and sweep collector when memory pressure becomes greater, because, even though it pays a cost proportional to the live set for marking and to the whole heap or dead set for sweeping, it avoids the extra space a copying collector needs. By the time you add card-marking and other optimizations, etc., the worst case costs for a practical garbage collector may actually be a fair bit worse, picking up an extra logarithmic factor for some usage patterns.
So, if you allocate a big hash table, even if you access it using O(1) searches for all time during its lifetime, if you do so in a garbage collected environment, occasionally the garbage collector will traverse the entire array, because it is of size O(n) and you will pay that cost periodically during collection.
The reason we usually leave it off of the complexity analysis of algorithms is that garbage collection interacts with your algorithm in non-trivial ways. How bad of a cost it is depends a lot on what else you are doing in the same process, so the analysis is not compositional.
Moreover, above and beyond the copy vs. compact vs. mark and sweep issue, the implementation details can drastically affect the resulting complexities:
Incremental garbage collectors that track dirty bits, etc. can all but make those larger re-traversals disappear.
It depends on whether your GC works periodically based on wall-clock time or runs proportional to the number of allocations.
Whether a mark and sweep style algorithm is concurrent or stop-the-world
Whether it marks fresh allocations black or leaves them white until it drops them into a black container.
Whether your language admits modification of pointers, which can determine whether some garbage collectors can work in a single pass.
Finally, when discussing an algorithm, we are discussing a straw man. The asymptotics will never fully incorporate all of the variables of your environment. Rarely do you ever implement every detail of a data structure as designed. You borrow a feature here and there, you drop a hash table in because you need fast unordered key access, you use a union-find over disjoint sets with path compression and union by rank to merge memory-regions over there because you can't afford to pay a cost proportional to the size of the regions when you merge them or what have you. These structures are thought primitives and the asymptotics help you when planning overall performance characteristics for the structure 'in-the-large' but knowledge of what the constants are matters too.
You can implement that hash table with perfectly O(1) asymptotic characteristics, just don't use garbage collection; map it into memory from a file and manage it yourself. You probably won't like the constants involved though.
I think when many people throw around the term "O(1)" they implicitly have in mind a "small" constant, whatever "small" means in their context.
You have to take all this big-O analysis with context and common sense. It can be an extremely useful tool or it can be ridiculous, depending on how you use it.
