`Integer` vs `Int64` vs `Word64` - haskell

I have some data which can be represented by an unsigned Integral type and its biggest value requires 52 bits. AFAIK only Integer, Int64 and Word64 satisfy these requirements.
All the information I could find about those types was that Integer is signed and has unbounded size, while Int64 and Word64 are fixed-size, signed and unsigned respectively. What I couldn't find was information on the actual implementation of those types:
How many bits will a 52-bit value actually occupy if stored as an Integer?
Am I correct that Int64 and Word64 let you store 64-bit data and occupy exactly 64 bits for any value?
Are any of those types more performant or preferable for reasons other than size, e.g. native code implementations or optimizations using direct processor instructions?
And just in case: which one would you recommend for storing a 52-bit value in an extremely performance-sensitive application?

How many bits will a 52-bit value actually occupy if stored as an Integer?
This is implementation-dependent. With GHC, values that fit inside a machine word are stored directly in a constructor of Integer, so if you're on a 64-bit machine, it should take the same amount of space as an Int. This corresponds to the S# constructor of Integer:
data Integer = S# Int#
             | J# Int# ByteArray#
Larger values (i.e. those represented with J#) are stored with GMP.
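As a rough illustration (the constructor names above come from the older integer-gmp representation; newer GHCs use different constructor names, but the small/big split is the same):

```haskell
-- Both values below are Integers, but on a 64-bit machine the first
-- fits in a single machine word, while the second is stored as a
-- GMP bignum on the heap.
main :: IO ()
main = do
  let small = 2 ^ 52  :: Integer  -- fits in a machine word (S#-style)
      big   = 2 ^ 100 :: Integer  -- needs a GMP limb array (J#-style)
  print small
  print big
```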
Am I correct that Int64 and Word64 let you store 64-bit data and occupy exactly 64 bits for any value?
Not quite: they're boxed. An Int64 is actually a pointer to either an unevaluated thunk or a heap object consisting of a one-word header (a pointer to an info table) plus the 64-bit integer value itself. (See the GHC commentary for more information.)
If you really want something that's guaranteed to be 64 bits, no exceptions, then you can use an unboxed type like Int64#, but I would strongly recommend profiling first; unboxed values are quite painful to use. For instance, you can't use unboxed types as arguments to type constructors, so you can't have a list of Int64#s. You also have to use operations specific to unboxed integers. And, of course, all of this is extremely GHC-specific.
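To give a flavour of what unboxed code looks like, here's a minimal sketch using Int# (which is 64 bits wide on a 64-bit machine); it requires the MagicHash extension and is, as noted, entirely GHC-specific:

```haskell
{-# LANGUAGE MagicHash #-}
module Main where

import GHC.Exts (Int (I#), Int#, (+#))

-- Unbox the arguments, add with the primitive (+#), and rebox the result.
addUnboxed :: Int -> Int -> Int
addUnboxed (I# x) (I# y) = I# (x +# y)

main :: IO ()
main = print (addUnboxed 40 2)  -- prints 42
```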
If you're looking to store a lot of 52-bit integers, you might want to use vector or repa (built on vector, with fancy things like automatic parallelism); they store the values unboxed under the hood, but let you work with them in boxed form. (Of course, each individual value you take out will be boxed.)
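For example, with the vector package, Data.Vector.Unboxed stores the Int64s in one flat unboxed buffer (this sketch assumes vector is installed):

```haskell
import           Data.Int            (Int64)
import qualified Data.Vector.Unboxed as VU

main :: IO ()
main = do
  -- The values live unboxed in one contiguous buffer...
  let v = VU.generate 10 (\i -> fromIntegral i * 3 :: Int64)
  -- ...but each individual element comes out boxed when you index it.
  print (VU.sum v)  -- prints 135
  print (v VU.! 4)  -- prints 12
```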
Are any of those types more performant or preferable for reasons other than size, e.g. native code implementations or optimizations using direct processor instructions?
Yes; using Integer incurs a branch for every operation, since it has to distinguish the machine-word and bignum cases; and, of course, it has to handle overflow. Fixed-size integral types avoid this overhead.
And just in case: which one would you recommend for storing a 52-bit value in an extremely performance-sensitive application?
If you're using a 64-bit machine: Int64 or, if you must, Int64#.
If you're using a 32-bit machine: Probably Integer, since on 32-bit Int64 is emulated with FFI calls to GHC functions that are probably not very highly optimised, but I'd try both and benchmark it. With Integer, you'll get the best performance on small integers, and GMP is heavily-optimised, so it'll probably do better on the larger ones than you might think.
You could select between Int64 and Integer at compile-time using the C preprocessor (enabled with {-# LANGUAGE CPP #-}); I think it would be easy to get Cabal to control a #define based on the word width of the target architecture. Beware, of course, that they are not the same; you will have to be careful to avoid "overflows" in the Integer code, and e.g. Int64 is an instance of Bounded but Integer is not. It might be simplest to just target a single word width (and thus type) for performance and live with the slower performance on the other.
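A sketch of how that compile-time selection might look; WORD_SIZE_IN_BITS comes from GHC's MachDeps.h header, and the alias name Value is made up for illustration:

```haskell
{-# LANGUAGE CPP #-}
#include "MachDeps.h"

module Main where

#if WORD_SIZE_IN_BITS >= 64
import Data.Int (Int64)
type Value = Int64    -- fast fixed-size type on 64-bit targets
#else
type Value = Integer  -- avoid emulated Int64 on 32-bit targets
#endif

main :: IO ()
main = print (4503599627370495 :: Value)  -- a 52-bit value (2^52 - 1)
```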
I would suggest creating your own Int52 type as a newtype wrapper over Int64, or a Word52 wrapper over Word64 — just pick whichever matches your data better, there should be no performance impact; if it's just arbitrary bits I'd go with Int64, just because Int is more common than Word.
You can define all the instances to handle wrapping automatically (try :info Int64 in GHCi to find out which instances you'll want to define), and provide "unsafe" operations that just apply directly under the newtype for performance-critical situations where you know there won't be any overflow.
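Here's a sketch of such a wrapper; the names Int52, toInt52, and unsafeAdd are illustrative, masking treats the value as an unsigned 52-bit quantity, and real code would also define Num, Bounded, etc. to wrap automatically:

```haskell
module Int52 (Int52, toInt52, fromInt52, unsafeAdd) where

import Data.Bits (shiftL, (.&.))
import Data.Int  (Int64)

newtype Int52 = Int52 Int64
  deriving (Eq, Ord, Show)

-- Low 52 bits set.
mask52 :: Int64
mask52 = (1 `shiftL` 52) - 1

-- Smart constructor: wrap the input into the 52-bit range.
toInt52 :: Int64 -> Int52
toInt52 x = Int52 (x .&. mask52)

fromInt52 :: Int52 -> Int64
fromInt52 (Int52 x) = x

-- "Unsafe" operation: applies (+) directly under the newtype, for
-- performance-critical code where overflow past 52 bits cannot happen.
unsafeAdd :: Int52 -> Int52 -> Int52
unsafeAdd (Int52 a) (Int52 b) = Int52 (a + b)
```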
Then, if you don't export the newtype constructor, you can always swap the implementation of Int52 later, without changing any of the rest of your code. Don't worry about the overhead of a separate type — the runtime representation of a newtype is completely identical to the underlying type; they only exist at compile-time.

Related

How can I bit-convert between Int and Word quickly?

The Haskell base documentation says that "A Word is an unsigned integral type, with the same size as Int."
How can I take an Int and cast its bit representation to a Word, so I get a Word value with the same bit representation as the original Int (even though the number values they represent will be different)?
I can't use fromIntegral because that will change the bit representation.
I could loop through the bits with the Bits class, but I suspect that will be very slow - and I don't need to do any kind of bit manipulation. I want some kind of function that will be compiled down to a no-op (or close to it), because no conversion is done.
Motivation
I want to use IntSet as a fast integer set implementation - however, what I really want to store in it are Words. I feel that I could create a WordSet which is backed by an IntSet, by converting between them quickly. The trouble is, I don't want to convert by value, because I don't want to truncate the top half of Word values: I just want to keep the bit representation the same.
int2Word#/word2Int# in GHC.Prim perform bit casting. You can implement wrapper functions which cast between boxed Int/Word using them easily.
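For example (GHC-specific, and requires the MagicHash extension; the primops just reinterpret the bits, so the compiled code is essentially a no-op):

```haskell
{-# LANGUAGE MagicHash #-}
module Main where

import GHC.Exts (Int (I#), Word (W#), int2Word#, word2Int#)

-- Reinterpret the bits of an Int as a Word, and vice versa.
intToWord :: Int -> Word
intToWord (I# i) = W# (int2Word# i)

wordToInt :: Word -> Int
wordToInt (W# w) = I# (word2Int# w)

main :: IO ()
main = do
  print (intToWord (-1) == maxBound)  -- True: all bits stay set
  print (wordToInt maxBound)          -- prints -1
```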

What does Int use three bits for? [duplicate]

Why is GHC's Int type not guaranteed to use exactly 32 bits of precision? This document claims it has at least 30-bit signed precision. Is it somehow related to fitting Maybe Int or similar into 32 bits?
It is to allow implementations of Haskell that use tagging. When using tagging you need a few bits as tags (at least one, two is better). I'm not sure there currently are any such implementations, but I seem to remember Yale Haskell used it.
Tagging can somewhat avoid the disadvantages of boxing, since you no longer have to box everything; instead the tag bit will tell you if it's evaluated etc.
The Haskell language definition states that the type Int covers at least the range [−2^29, 2^29 − 1].
There are other compilers/interpreters that use this property to boost the execution time of the resulting program.
All internal references to (aligned) Haskell data point to memory addresses that are multiples of 4 (8) on 32-bit (64-bit) systems. So references need only 30 bits (61 bits), which leaves 2 (3) bits free for "pointer tagging".
In the case of data, GHC uses those tags to store information about the referenced data, i.e. whether the value is already evaluated and, if so, which constructor it has.
In case of 30-bit Ints (so, not GHC), you could use one bit to decide if it is either a pointer to an unevaluated Int or that Int itself.
Pointer tagging could also be used for one-bit reference counting, which can speed up garbage collection. That is useful when a direct one-to-one producer-consumer relationship exists at runtime: memory can be reused directly instead of waiting for the garbage collector.
So, using 2 bits for pointer tagging, there could be some wild combination of intense optimisation...
In case of Ints I could imagine these 4 tags:
a singular reference to an unevaluated Int
one of many references to the same possibly still unevaluated Int
30 bits of that Int itself
a reference (of possibly many references) to an evaluated 32-bit Int.
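The bit arithmetic behind pointer tagging is simple. Here's a purely illustrative sketch of extracting and clearing tag bits from a 64-bit address (this is not how you would touch real GHC pointers from Haskell code):

```haskell
import Data.Bits (complement, (.&.))
import Data.Word (Word64)

-- On a 64-bit system, heap objects are 8-byte aligned, so the low
-- 3 bits of any aligned address are zero and free to hold a tag.
tagOf :: Word64 -> Word64
tagOf addr = addr .&. 7

-- Recover the real (aligned) address by clearing the tag bits.
untag :: Word64 -> Word64
untag addr = addr .&. complement 7

main :: IO ()
main = do
  let tagged = 0x7f0040 + 3          -- an aligned address carrying tag 3
  print (tagOf tagged)               -- prints 3
  print (untag tagged == 0x7f0040)   -- prints True
```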
I think this is because of early ways to implement GC and all that stuff. If you have 32 bits available and you only need 30, you could use those two spare bits to implement interesting things, for instance using a zero in the least significant bit to denote a value and a one for a pointer.
Today the implementations don't use those bits so an Int has at least 32 bits on GHC. (That's not entirely true. IIRC one can set some flags to have 30 or 31 bit Ints)

Why is Data.Word in Haskell called that?

Why is an unsigned integral type in Haskell called "Word"? Does the "ord" stand for "ordinal"?
Here is the documentation on Data.Word:
http://hackage.haskell.org/package/base-4.6.0.1/docs/Data-Word.html#t:Word
This is a very hard question to google!
From wikipedia:
The term 'word' is used for a small group of bits which are handled simultaneously by processors of a particular architecture. The size of a word is thus CPU-specific. Many different word sizes have been used, including 6-, 8-, 12-, 16-, 18-, 24-, 32-, 36-, 39-, 48-, 60-, and 64-bit. Since it is architectural, the size of a word is usually set by the first CPU in a family, rather than the characteristics of a later compatible CPU. The meanings of terms derived from word, such as longword, doubleword, quadword, and halfword, also vary with the CPU and OS.
In short, a word is a fixed-length group of bits that the CPU can process. We normally work with words that are powers of two, since modern CPUs handle those very well. A word in particular is not really a number, although we treat it as such for most purposes. Instead, just think of it as a fixed number of bits in RAM that you can manipulate. A common use for something like Word8 is to implement ASCII C-style strings, for example. Haskell's implementation treats WordN types as unsigned integers that implement Num, among other type classes.
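A small sketch of that Word8 use case, encoding an ASCII string as raw bytes (the function names here are made up for illustration):

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode an ASCII string as raw bytes, one Word8 per character.
asciiBytes :: String -> [Word8]
asciiBytes = map (fromIntegral . ord)

-- And decode it back.
fromAsciiBytes :: [Word8] -> String
fromAsciiBytes = map (chr . fromIntegral)

main :: IO ()
main = do
  print (asciiBytes "Hi")              -- prints [72,105]
  putStrLn (fromAsciiBytes [72, 105])  -- prints Hi
```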
There is a module called Data.Ord where "Ord" stands for "Ordering". It has extra functions for working with comparisons of datatypes and is where the Ordering datatype and Ord typeclass are defined. It is unrelated to Data.Word.


What performance can I expect from Int32 and Int64?

I often see programs like this, where Int64 is an absolute performance killer on 32-bit platforms. My question is:
If I need a specific word length for my task (in my case a RNG), is Int64 efficient on 64-bit platforms or will it still use the C-calls? And how efficient is converting an Int64 to an Int?
On a 64-bit system Int64 should be fine, though I don't know for sure.
More importantly, if you're doing crypto or random-number generation, you MUST use the data type that the algorithm specifies, and be careful about signedness. If you don't, you will get wrong results, which might mean that your cryptography is not secure or your random number generator is not really random (RNGs are hard, and many look random but aren't).
For any other kind of work, use Integer wherever you can, or better yet, make your program polymorphic using the Integral type class. Then, if you think your program is slower than it should be, profile it to determine where to concentrate when speeding it up. If you use the Integral type class, changing from Integer to Int is easy. Haskell should be smart enough to specialise (most) polymorphic code to avoid the overhead.
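For example, here is a polymorphic function; the SPECIALIZE pragma asks GHC to also generate a monomorphic copy, avoiding dictionary-passing overhead when it's called at Int:

```haskell
module Main where

-- Polymorphic over any Integral type: works with Int, Int64, Integer, ...
sumSquares :: Integral a => a -> a
sumSquares n = sum [k * k | k <- [1 .. n]]
{-# SPECIALIZE sumSquares :: Int -> Int #-}

main :: IO ()
main = do
  print (sumSquares (10 :: Int))      -- prints 385
  print (sumSquares (10 :: Integer))  -- prints 385
```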
There's an interesting article on 64-bit performance here:
Isn’t my code going to be faster on 64-bit???
As the article states, the big bottleneck isn't the processor, it's the cache and memory I/O.
