Do I need a HashSet or a Set? - haskell

I want to initialise a set of all Ints from 1 to n (n<20000). Then I want to remove them one by one and meanwhile check if certain elements are still in it until the set is empty.
Which data structure is suited best for this task?

If you want to stick to immutable data structures, I would recommend IntSet. It's carefully optimized for precisely this kind of thing. A Set Int is a balanced binary search tree of Ints, which takes a lot of space and a good bit of time. A HashSet Int is an array-mapped trie of Ints, which is likely faster and more compact, but still pretty mediocre. An IntSet is a PATRICIA tree whose leaves are bitsets. So it's pretty compact (a little over twice the size of an unboxed immutable array when full), but much more efficient to modify.
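As a minimal sketch of the IntSet approach (the names fullSet and probeWhileRemoving are made up for illustration, and the probing rule is only an assumption about what "check if certain elements are still in it" means):

```haskell
import Data.IntSet (IntSet)
import qualified Data.IntSet as IntSet

-- Build the set {1..n}. The input list is already sorted and duplicate-free,
-- so fromDistinctAscList can be used.
fullSet :: Int -> IntSet
fullSet n = IntSet.fromDistinctAscList [1 .. n]

-- Hypothetical driver: remove the elements in the given order and, just
-- before each removal, record whether `probe` is still in the set.
probeWhileRemoving :: Int -> [Int] -> IntSet -> [Bool]
probeWhileRemoving probe = go
  where
    go [] _ = []
    go (x : xs) s = IntSet.member probe s : go xs (IntSet.delete x s)
```

For example, probeWhileRemoving 3 [1,2,3,4] (fullSet 4) evaluates to [True,True,True,False].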
Initializing an IntSet with all Ints from 1 to n takes O(n) time. If you're only initializing once, or once in a while, and n < 20000, then that shouldn't cause any performance trouble. If, however, you need to initialize often (especially if you sometimes only remove a few elements before discarding the set), or n turns out to be much larger (e.g., hundreds of millions) and you want to cut down on initialization time, you can use IntSet to represent the complement of the set you want to store.
data CompSet = CompSet
  { initialMax :: !Int
  , size :: !Int
  , missingElements :: !IntSet
  }
A CompSet stores the initial maximum (n) and an IntSet recording which elements of [1..initialMax] are no longer in the set. The size field starts at initialMax and is decremented on every removal, so you can tell in O(1) time whether the set is empty (size = 0, which is the same condition as IntSet.size missingElements = initialMax).
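The operations on such a complement set could look roughly like this. This is only a sketch: the function names full, member, delete and isEmpty are my own, and it assumes n >= 0.

```haskell
import Data.IntSet (IntSet)
import qualified Data.IntSet as IntSet

data CompSet = CompSet
  { initialMax      :: !Int
  , size            :: !Int     -- number of elements still present
  , missingElements :: !IntSet  -- elements of [1..initialMax] already removed
  }

-- The full set {1..n}, built in O(1) because nothing is missing yet.
full :: Int -> CompSet
full n = CompSet n n IntSet.empty

-- An element is present if it is in range and has not been removed.
member :: Int -> CompSet -> Bool
member x (CompSet mx _ missing) =
  x >= 1 && x <= mx && not (IntSet.member x missing)

-- Removing an element records it as missing and decrements the size.
-- Removing something that is not present leaves the set unchanged.
delete :: Int -> CompSet -> CompSet
delete x s@(CompSet mx sz missing)
  | member x s = CompSet mx (sz - 1) (IntSet.insert x missing)
  | otherwise  = s

-- O(1) emptiness test via the cached size.
isEmpty :: CompSet -> Bool
isEmpty s = size s == 0
```

The trade-off is that the structure grows as elements are removed, so it pays off mainly when sets are often discarded after only a few removals.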

Use a bitset (a.k.a. Integer). A 1 bit represents a value still in the set; a 0 bit represents one that just ain't there. For example, the Integer that represents having all the numbers from 1 to n would be bit (n+1) - 2 (assuming you plan to use 0-indexing, as seems sensible to me); to check whether a number is in the set, use testBit; to remove a number, use clearBit.
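A small sketch of the Integer-as-bitset idea, using only functions from Data.Bits (the wrapper names fullBits, stillPresent and remove are made up here):

```haskell
import Data.Bits (bit, clearBit, testBit)

-- Bits 1..n set, bit 0 clear: 2^(n+1) - 2.
fullBits :: Int -> Integer
fullBits n = bit (n + 1) - 2

-- Is the number i still in the set?
stillPresent :: Integer -> Int -> Bool
stillPresent = testBit

-- Remove the number i from the set.
remove :: Integer -> Int -> Integer
remove = clearBit
```

For n = 3, fullBits 3 is 14 (binary 1110), and remove (fullBits 3) 2 is 10 (binary 1010). The set is empty exactly when the Integer equals 0. Keep in mind that an Integer is immutable, so each clearBit has to build a new number.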
An alternate implementation strategy for the same underlying idea would be to use an unboxed array of Bool, either mutable or immutable as needed. The unboxed versions do the appropriate bit-packing. The only downside would be possibly having to resize the array if you need to add numbers to the set later that are larger than you originally allocated space for.
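For the unboxed-array variant, here is a sketch using the array package that ships with GHC. The helper withoutElems is hypothetical; it builds the whole membership table in one ST pass instead of removing elements one at a time.

```haskell
import Data.Array.ST (newArray, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, (!))

-- Membership table for {1..n} with the listed elements removed.
-- True means "still in the set". Out-of-range removals are ignored.
withoutElems :: Int -> [Int] -> UArray Int Bool
withoutElems n removed = runSTUArray (do
  present <- newArray (1, n) True
  mapM_ (\i -> writeArray present i False)
        (filter (\i -> i >= 1 && i <= n) removed)
  pure present)

-- Membership test on the frozen table.
isPresent :: UArray Int Bool -> Int -> Bool
isPresent table i = table ! i
```

With table = withoutElems 5 [2,4], isPresent table 2 is False and isPresent table 3 is True.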

Related

Haskell Massiv Array Size Limit

If Massiv, as well as other array libraries, uses Int for indexing, then how does one construct and index arrays larger than 2^29 elements? Int can only be as large as 2^29. I noticed in the source code that linear indexing is used in array operations as well, so I would assume that just writing a vector as a two-dimensional array would still have the same issue.
Is there a solution to this within Massiv or is there another array library suitable for arrays with more than 2^29 elements?
Edit: @Thomas just mentioned that the maxBound of Int is machine dependent. However, I would still like to know how to index arrays with a number of elements greater than the maxBound of Int.
There is no way to create a list that contains more than maxBound :: Int elements in memory, because the size of an Int is generally expected to be sufficient to cover the full addressable memory space. A hypothetical list or array of length greater than maxBound :: Int on your system therefore would not fit in addressable memory and could not be stored, thus there is no need for a mechanism by which one could index into such a structure.
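To check what Int actually looks like on a given machine, something like the following can be used. The names intBits and meetsReportMinimum are made up; the 2^29 - 1 figure is the minimum range the Haskell report requires, and real implementations are free to offer more.

```haskell
import Data.Bits (finiteBitSize)

-- Width of Int in bits on this platform (64 on typical 64-bit GHC builds).
intBits :: Int
intBits = finiteBitSize (0 :: Int)

-- The Haskell report only guarantees that Int covers [-2^29, 2^29 - 1].
meetsReportMinimum :: Bool
meetsReportMinimum = toInteger (maxBound :: Int) >= 2 ^ (29 :: Int) - 1
```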

Is the time complexity of HashMap get() and put() operations O(1) at all times? [duplicate]

We are used to saying that HashMap get/put operations are O(1). However it depends on the hash implementation. The default object hash is actually the internal address in the JVM heap. Are we sure it is good enough to claim that the get/put are O(1)?
Available memory is another issue. As I understand from the javadocs, the HashMap load factor should be 0.75. What if we do not have enough memory in JVM and the load factor exceeds the limit?
So, it looks like O(1) is not guaranteed. Does it make sense or am I missing something?
It depends on many things. It's usually O(1), with a decent hash which itself is constant time... but you could have a hash which takes a long time to compute, and if there are multiple items in the hash map which return the same hash code, get will have to iterate over them calling equals on each of them to find a match.
In the worst case, a HashMap has an O(n) lookup due to walking through all entries in the same hash bucket (e.g. if they all have the same hash code). Fortunately, that worst case scenario doesn't come up very often in real life, in my experience. So no, O(1) certainly isn't guaranteed - but it's usually what you should assume when considering which algorithms and data structures to use.
In JDK 8, HashMap has been tweaked so that if keys can be compared for ordering, then any densely-populated bucket is implemented as a tree, so that even if there are lots of entries with the same hash code, the complexity is O(log n). That can cause issues if you have a key type where equality and ordering are different, of course.
And yes, if you don't have enough memory for the hash map, you'll be in trouble... but that's going to be true whatever data structure you use.
It has already been mentioned that hashmaps are O(n/m) on average, if n is the number of items and m is the size. It has also been mentioned that in principle the whole thing could collapse into a singly linked list with O(n) query time. (This all assumes that calculating the hash is constant time.)
However, what isn't often mentioned is that with probability at least 1-1/n (so for 1000 items that's a 99.9% chance) the largest bucket won't be filled more than O(log n)! Hence matching the average complexity of binary search trees. (And the constant is good, a tighter bound is (log n)*(m/n) + O(1)).
All that's required for this theoretical bound is that you use a reasonably good hash function (see Wikipedia: Universal Hashing. It can be as simple as a*x>>m). And of course that the person giving you the values to hash doesn't know how you have chosen your random constants.
TL;DR: With Very High Probability the worst case get/put complexity of a hashmap is O(log n).
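The "a*x>>m" hash mentioned above is usually called multiply-shift hashing. Here is a rough sketch for 64-bit keys; it assumes that the shift amount is the word size minus the number of output bits, and that the multiplier a is a random odd number chosen by the caller. The function name multiplyShift is mine.

```haskell
import Data.Bits (shiftR, (.|.))
import Data.Word (Word64)

-- Multiply-shift hashing into 2^bitsOut buckets, for 1 <= bitsOut <= 64.
-- Word64 multiplication wraps modulo 2^64, and the top bitsOut bits of
-- the product are kept. The multiplier is forced to be odd.
multiplyShift :: Word64 -> Int -> Word64 -> Word64
multiplyShift a bitsOut x = ((a .|. 1) * x) `shiftR` (64 - bitsOut)
```

With bitsOut = 10 every key lands in one of 1024 buckets. The probabilistic bound only holds if whoever supplies the keys does not know a.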
I'm not sure the default hashcode is the address - I read the OpenJDK source for hashcode generation a while ago, and I remember it being something a bit more complicated. Still not something that guarantees a good distribution, perhaps. However, that is to some extent moot, as few classes you'd use as keys in a hashmap use the default hashcode - they supply their own implementations, which ought to be good.
On top of that, what you may not know (again, this is based on reading source - it's not guaranteed) is that HashMap stirs the hash before using it, to mix entropy from throughout the word into the bottom bits, which is where it's needed for all but the hugest hashmaps. That helps deal with hashes that specifically don't do that themselves, although I can't think of any common cases where you'd see that.
Finally, what happens when the table is overloaded is that it degenerates into a set of parallel linked lists - performance becomes O(n). Specifically, the number of links traversed will on average be half the load factor.
I agree with:
the general amortized complexity of O(1)
a bad hashCode() implementation could result in multiple collisions, which means that in the worst case every object goes to the same bucket, thus O(N) if each bucket is backed by a List.
since Java 8, HashMap dynamically replaces the Nodes (linked list) used in each bucket with TreeNodes (a red-black tree, when a list gets bigger than 8 elements), resulting in a worst-case performance of O(log N).
But this is not the full truth if we want to be 100% precise. The implementation of hashCode() and the type of the key Object (immutable/cached, or being a Collection) might also affect the real time complexity in strict terms.
Let's assume the following three cases:
HashMap<Integer, V>
HashMap<String, V>
HashMap<List<E>, V>
Do they have the same complexity? Well, the amortised complexity of the 1st one is, as expected, O(1). But, for the rest, we also need to compute hashCode() of the lookup element, which means we might have to traverse arrays and lists in our algorithm.
Let's assume that the size of all of the above arrays/lists is k.
Then, HashMap<String, V> and HashMap<List<E>, V> will have O(k) amortised complexity and similarly, O(k + log N) worst case in Java 8.
*Note that using a String key is a more complex case, because it is immutable and Java caches the result of hashCode() in a private variable hash, so it's only computed once.
/** Cache the hash code for the string */
private int hash; // Default to 0
But the above has its own worst case, because Java's String.hashCode() implementation checks whether hash == 0 before computing the hash code. There are non-empty Strings whose hash code is zero, such as "f5a5a608", in which case memoization does not help.
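To see why the cost is O(k) and why a zero hash is possible, here is a sketch of the polynomial that String.hashCode() computes (s[0]*31^(k-1) + ... + s[k-1] in 32-bit wrapping arithmetic), written in Haskell to match the rest of this page. The name javaStringHash is made up, and the sketch assumes the string contains only characters that fit in one UTF-16 code unit.

```haskell
import Data.Char (ord)
import Data.Int (Int32)
import Data.List (foldl')

-- One pass over all k characters, hence O(k) per uncached call.
-- Int32 arithmetic wraps around on overflow, like Java's int.
javaStringHash :: String -> Int32
javaStringHash = foldl' step 0
  where
    step h c = 31 * h + fromIntegral (ord c)
```

javaStringHash "ab" is 97*31 + 98 = 3105, and javaStringHash "f5a5a608" wraps around to exactly 0, which is indistinguishable from "not computed yet" for a cache that uses 0 as its sentinel.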
HashMap performance depends on the hashCode implementation. In the ideal scenario, with a good hash implementation that provides a unique hash code for every object (no hash collisions), the best, worst and average case would all be O(1).
Now consider a scenario where a bad implementation of hashCode always returns 1, or some other hash that always collides. In this case the time complexity would be O(n).
As for the second part of the question, about memory: yes, the memory constraint is handled by the JVM.
In practice, it is O(1), but this actually is a terrible and mathematically nonsensical simplification. The O() notation says how the algorithm behaves when the size of the problem tends to infinity. Hashmap get/put works like an O(1) algorithm for a limited size. The limit is fairly large from the computer memory and from the addressing point of view, but far from infinity.
When one says that hashmap get/put is O(1), it should really say that the time needed for the get/put is more or less constant and does not depend on the number of elements in the hashmap, so far as the hashmap can be represented on the actual computing system. If the problem goes beyond that size and we need larger hashmaps then, after a while, the number of bits describing one element will certainly also increase, as we run out of distinct describable elements. For example, if we used a hashmap to store 32-bit numbers and later we increase the problem size so that we have more than 2^32 elements in the hashmap, then the individual elements will have to be described with more than 32 bits.
The number of bits needed to describe the individual elements is log(N), where N is the maximum number of elements, therefore get and put are really O(log N).
If you compare it with a tree set, which is O(log n), then a hash set is O(log(max(n))), and we simply feel that this is O(1) because on a given implementation max(n) is fixed and does not change (the size of the objects we store, measured in bits) and the algorithm calculating the hash code is fast.
Finally, if finding an element in any data structure were O(1), we would create information out of thin air. Given a data structure of n elements, I can select one element in n different ways. With that, I can encode log(n) bits of information. If I could encode that in zero bits (that is what O(1) means), then I would have created an infinitely compressing ZIP algorithm.
In simple words: if each bucket contains only a single node, then the time complexity is O(1). If a bucket contains more than one node, then the time complexity is O(size of that bucket's linked list), which is never worse than O(n).
Hence we can say that the average-case time complexity of the put(K,V) function is:
nodes(n)/buckets(N) = λ (lambda)
Example: 16/16 = 1
The time complexity will be O(1).
Java HashMap time complexity
--------------------------------
get(key), contains(key), remove(key)                Best case   Worst case
HashMap before Java 8, using LinkedList buckets     O(1)        O(n)
HashMap after Java 8, using LinkedList buckets      O(1)        O(n)
HashMap after Java 8, using Binary Tree buckets     O(1)        O(log n)

put(key, value)                                     Best case   Worst case
HashMap before Java 8, using LinkedList buckets     O(1)        O(1)
HashMap after Java 8, using LinkedList buckets      O(1)        O(1)
HashMap after Java 8, using Binary Tree buckets     O(1)        O(log n)
Hints:
Before Java 8, HashMap uses LinkedList buckets.
Since Java 8, HashMap uses either LinkedList buckets or Binary Tree buckets, depending on the bucket size.
if (bucket size > TREEIFY_THRESHOLD[8]):
    treeifyBin: the bucket becomes a balanced red-black tree
if (bucket size <= UNTREEIFY_THRESHOLD[6]):
    untreeify: the bucket goes back to a LinkedList (plain mode)

Are sequences faster than vectors for searching in haskell?

I am kind of new to using data structures in Haskell other than lists. My goal is to choose one container among Data.Vector, Data.Sequence, Data.List, etc. My problem is the following:
I have to create a sequence (mathematically speaking). The sequence starts at 0. In each iteration two new elements are generated, but only one should be appended, based on whether the first element is already in the sequence. So in each iteration there is a call to the elem function (see the pseudo-code below).
appendNewItem :: [Integer] -> [Integer]
appendNewItem acc = let firstElem = someFunc
                        secondElem = someOtherFunc
                        newElem = if firstElem `elem` acc
                                    then secondElem
                                    else firstElem
                    in acc `append` newElem
sequenceUptoN :: Int -> [Integer]
sequenceUptoN n = (iterate appendNewItem [0]) !! n
Where the append and iterate functions vary depending on which collection you use (I am using lists in the type signature for simplicity).
The question is: which data structure should I use? Is Data.Sequence faster for this task because of its finger tree inner structure?
Thanks a lot!!
No, sequences are not faster for searching. A Vector is just a flat chunk of memory, which gives generally the best lookup performance. If you want to optimise searching, use Data.Vector.Unboxed. (The normal, “boxed” variant is also pretty good, but it actually contains only references to the elements in the flat memory-chunk, so it's not quite as fast for lookups.)
However, because of the flat memory layout, Vectors are not good for (pure-functional) appending: basically, whenever you add a new element, the whole array must be copied so as to not invalidate the old one (which somebody else might still be using). If you need to append, Seq is a pretty good choice, although it's not as fast as destructive appending: for maximum performance, you'll want to pre-allocate an uninitialized Data.Vector.Unboxed.Mutable.MVector of the required size, populate it using the ST monad, and freeze the result. But this is much more fiddly than purely-functional alternatives, so unless you need to squeeze out every bit of performance, Data.Sequence is the way to go. If you only want to append, but not look up elements, then a plain old list in reverse order would also do the trick.
I suggest using Data.Sequence in conjunction with Data.Set. The Sequence to hold the sequence of values and the Set to track the collection.
Sequence, List, and Vector are all structures for working with values where the position in the structure has primary importance when it comes to indexing. In lists we can manipulate elements at the front efficiently, in sequences we can manipulate elements in time logarithmic in the distance to the closer end, and in vectors we can access any element in constant time. Vectors, however, are not that useful if the length keeps changing, so that rules out their use here.
However, you also need to look up a certain value within the list, which these structures don't help with. You have to search the whole of a list/sequence/vector to be certain that a new value isn't present. Data.Map and Data.Set are two of the structures for which you define an index value based on Ord, and which let you look up and insert in O(log n). So, at the cost of memory usage, you can look up the presence of firstElem in your Set in O(log n) time and then add newElem to the end of the sequence in constant time. Just make sure to keep these two structures in sync when adding or removing elements.
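A sketch of the Seq-plus-Set combination follows. The question does not say what someFunc and someOtherFunc are, so the candidates function below is a stand-in (a Recamán-style rule: previous element minus or plus the current index), and the extra non-negativity test belongs to that stand-in, not to the question. The Acc type and its field names are made up as well.

```haskell
import Data.Foldable (toList)
import Data.Sequence (Seq, (|>))
import qualified Data.Sequence as Seq
import Data.Set (Set)
import qualified Data.Set as Set

-- The sequence itself plus a set of everything already in it.
data Acc = Acc
  { accItems :: !(Seq Integer)
  , accSeen  :: !(Set Integer)
  }

-- Hypothetical stand-in for someFunc / someOtherFunc.
candidates :: Acc -> (Integer, Integer)
candidates (Acc items _) =
  case Seq.viewr items of
    Seq.EmptyR -> (0, 0)
    _ Seq.:> lastItem ->
      let n = fromIntegral (Seq.length items)
      in (lastItem - n, lastItem + n)

-- O(log n) membership test in the Set, cheap append to the Seq.
appendNewItem :: Acc -> Acc
appendNewItem acc@(Acc items seen) =
  let (firstElem, secondElem) = candidates acc
      newElem
        | firstElem < 0 || firstElem `Set.member` seen = secondElem
        | otherwise = firstElem
  in Acc (items |> newElem) (Set.insert newElem seen)

sequenceUptoN :: Int -> [Integer]
sequenceUptoN n =
  toList (accItems (iterate appendNewItem start !! n))
  where
    start = Acc (Seq.singleton 0) (Set.singleton 0)
```

With this stand-in rule, sequenceUptoN 9 gives [0,1,3,6,2,7,13,20,12,21]. Both structures are updated in the same function, which is the simplest way to keep them in sync.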

How can natural numbers be represented to offer constant time addition?

Cirdec's answer to a largely unrelated question made me wonder how best to represent natural numbers with constant-time addition, subtraction by one, and testing for zero.
Why Peano arithmetic isn't good enough:
Suppose we use
data Nat = Z | S Nat
Then we can write
Z + n = n
S m + n = S(m+n)
We can calculate m+n in O(1) time by placing m-r debits (for some constant r), one on each S constructor added onto n. To get O(1) isZero, we need to be sure to have at most p debits per S constructor, for some constant p. This works great if we calculate a + (b + (c+...)), but it falls apart if we calculate ((...+b)+c)+d. The trouble is that the debits stack up on the front end.
One option
The easy way out is to just use catenable lists, such as the ones Okasaki describes, directly. There are two problems:
O(n) space is not really ideal.
It's not entirely clear (at least to me) that the complexity of bootstrapped queues is necessary when we don't care about order the way we would for lists.
As far as I know, Idris (a dependently-typed purely functional language which is very close to Haskell) deals with this in a quite straightforward way. The compiler is aware of Nats and Fins (upper-bounded Nats) and replaces them with machine integer types and operations whenever possible, so the resulting code is pretty efficient. However, that's not true for custom types (even isomorphic ones), nor for the compilation stage (there were some code samples using Nats for type checking which resulted in exponential growth in compile time; I can provide them if needed).
In the case of Haskell, I think a similar compiler extension could be implemented. Another possibility is to write TH macros which would transform the code. Of course, neither option is easy.
My understanding is that in basic computer programming terminology the underlying problem is you want to concatenate lists in constant time. The lists don't have cheats like forward references, so you can't jump to the end in O(1) time, for example.
You can use rings instead, which you can merge in O(1) time, regardless if a+(b+(c+...)) or ((...+c)+b)+a logic is used. The nodes in the rings don't need to be doubly linked, just a link to the next node.
Subtraction is the removal of any node, O(1), and testing for zero (or one) is trivial. Testing for n > 1 is O(n), however.
If you want to reduce space, then at each operation you can merge the nodes at the insertion or deletion points and weight the remaining ones higher. The more operations you do, the more compact the representation becomes! I think the worst case will still be O(n), however.
We know that there are two "extremal" solutions for efficient addition of natural numbers:
Memory efficient: the standard binary representation of natural numbers, which uses O(log n) memory and requires O(log n) time for addition. (See also the chapter "Binary Representations" in Okasaki's book.)
CPU efficient: a representation which uses just O(1) time. (See the chapter "Structural Abstraction" in the book.) However, this solution uses O(n) memory, as we'd represent the natural number n as a list of n copies of ().
I haven't done the actual calculations, but I believe for the O(1) numerical addition we won't need the full power of O(1) FIFO queues, it'd be enough to bootstrap standard list [] (LIFO) in the same way. If you're interested, I could try to elaborate on that.
The problem with the CPU efficient solution is that we need to add some redundancy to the memory representation so that we can spare enough CPU time. In some cases, adding such a redundancy can be accomplished without compromising the memory size (like for O(1) increment/decrement operation). And if we allow arbitrary tree shapes, like in the CPU efficient solution with bootstrapped lists, there are simply too many tree shapes to distinguish them in O(log n) memory.
So the question is: Can we find just the right amount of redundancy so that sub-linear amount of memory is enough and with which we could achieve O(1) addition? I believe the answer is no:
Let's have a representation+algorithm that has O(1) time addition. Let's then have a number of the magnitude of m bits, which we compute as a sum of 2^k numbers, each of them of the magnitude of (m-k) bits. To represent each of those summands we need (regardless of the representation) a minimum of (m-k) bits of memory, so at the beginning we start with (at least) (m-k) 2^k bits of memory. Now at each of those 2^k additions we are allowed to perform a constant amount of operations, so we are able to process (and ideally remove) a total of C 2^k bits. Therefore at the end, the lower bound for the number of bits we need to represent the outcome is (m-k-C) 2^k bits. Since k can be chosen arbitrarily, our adversary can set k=m-C-1, which means the total sum will be represented with at least 2^(m-C-1) = 2^m/2^(C+1) ∈ Ω(2^m) bits. So a natural number n will always need Ω(n) bits of memory!
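Written out step by step, with the same symbols m, k and C as in the paragraph above, the counting argument is roughly:

```latex
\begin{align*}
\text{bits held by the } 2^k \text{ summands at the start} &\ge (m-k)\,2^k \\
\text{bits that } 2^k \text{ constant-time additions can remove} &\le C\,2^k \\
\text{bits left in the final result} &\ge (m-k)\,2^k - C\,2^k = (m-k-C)\,2^k \\
\text{choosing } k = m-C-1:\quad (m-k-C)\,2^k &= 1 \cdot 2^{\,m-C-1} = \frac{2^m}{2^{C+1}}
\end{align*}
```

Since C is a constant, the last line grows proportionally to 2^m, i.e. linearly in the value of the number being represented.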

What advantages do StableNames have over reallyUnsafePtrEquality#, and vice versa?

data StableName a
Stable names have the following property:
If sn1 :: StableName and sn2 :: StableName and sn1 == sn2 then sn1 and sn2 were created by calls to makeStableName on the same object.
The reverse is not necessarily true: if two stable names are not equal, then the objects they name may still be equal.
reallyUnsafePtrEquality# :: a -> a -> Int#
reallyUnsafePtrEquality# returns whether two objects on the GHC heap are the same object. It's really unsafe because the garbage collector moves things around, closures, etc. To the best of my knowledge, it can return false negatives (it says two objects aren't the same, but they are), but not false positives (saying they're the same when they aren't).
Both of them seem to do the same basic thing: they can tell you whether two objects are definitely the same, but not whether they're definitely not.
The advantages I can see for StableNames are:
They can be hashed.
They are less nonportable.
Their behaviour is well-defined and supported.
They don't have reallyUnsafe as part of their name.
The advantages I can see for reallyUnsafePtrEquality#:
It can be called directly on the objects in question, instead of having to create separate StableNames.
You don't have to go through an IO function to create the StableNames.
You don't have to keep StableNames around, so memory usage is lower.
The RTS doesn't have to do whatever magic it does to make the StableNames work, so performance is presumably better.
It has reallyUnsafe in the name and a # at the end. Hardcore!
My questions are:
Did I miss anything?
Is there any use case where the fact that StableNames are separate from the objects they name is an advantage?
Is either one more accurate (less likely to return false negatives) than the other?
If you don't need hashing, don't care about portability, and aren't bothered by using something called reallyUnsafe, is there any reason to prefer StableNames over reallyUnsafePtrEquality#?
Holding the StableName of an object doesn't prevent it from being garbage collected, whereas holding the object itself around (to use with reallyUnsafePtrEquality# later) does. Sure, you can use System.Mem.Weak, but at that point, why not just use a StableName? (In fact, weak pointers were added with StableNames.)
Being able to hash them is the main motivator for StableNames, as the documentation says:
We can't build a hash table using the address of the object as the key, because objects get moved around by the garbage collector, meaning a re-hash would be necessary after every garbage collection.
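As a rough illustration of the StableName API from System.Mem.StableName (the helper names sameObject and objectHash are made up, and forcing the arguments with seq first is my own precaution, since a name made before evaluation may differ from one made after):

```haskell
import System.Mem.StableName (eqStableName, hashStableName, makeStableName)

-- True means the two arguments are definitely the same heap object.
-- False does not prove that they are different values.
sameObject :: a -> a -> IO Bool
sameObject x y = do
  nameX <- x `seq` makeStableName x
  nameY <- y `seq` makeStableName y
  pure (nameX `eqStableName` nameY)

-- A hash that stays valid when the garbage collector moves the object,
-- which is what makes StableNames usable as hash table keys.
objectHash :: a -> IO Int
objectHash x = hashStableName <$> (x `seq` makeStableName x)
```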
In general, if StableNames will work for your purposes, I'd use them, even if you need to use unsafePerformIO; if you really need reallyUnsafePtrEquality#, you'll know. The only example I can think of where reallyUnsafePtrEquality# would work and StableNames wouldn't is speeding up an expensive Eq instance:
x == y =
  x `seq` y `seq`
  case reallyUnsafePtrEquality# x y of
    1# -> True
    _ -> slowEq x y
There are probably other examples I just haven't thought of, but they're not common.
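A self-contained version of that shortcut could look like the sketch below. It needs the MagicHash extension; fastEq is a made-up name, and the ordinary (==) stands in for slowEq.

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (isTrue#, reallyUnsafePtrEquality#)

-- Try pointer equality first and fall back to the real comparison.
-- A pointer match is only used as a shortcut to True, so a false
-- negative costs time but never changes the answer.
fastEq :: Eq a => a -> a -> Bool
fastEq x y =
  x `seq` y `seq`
    (isTrue# (reallyUnsafePtrEquality# x y) || x == y)
```

This is only sound for types where an object always compares equal to itself. A type like Double, where NaN /= NaN, would get a different answer from the shortcut.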
