Haskell Massiv Array Size Limit - haskell

If Massiv, as well as other array libraries, uses Int for indexing, then how does one construct and index arrays larger than 2^29 elements? Int can only be as large as 2^29. I noticed in the source code that linear indexing is used in array operations as well, so I would assume that just writing a vector as a two-dimensional array would still have the same issue.
Is there a solution to this within Massiv or is there another array library suitable for arrays with more than 2^29 elements?
Edit: @Thomas just mentioned that the maxBound of Int is machine dependent. However, I would still like to know how to index arrays with a number of elements greater than the maxBound of Int.

There is no way to create a list that contains more than maxBound :: Int elements in memory, because the size of an Int is generally expected to be sufficient to cover the full addressable memory space. A hypothetical list or array of length greater than maxBound :: Int on your system therefore would not fit in addressable memory and could not be stored, so there is no need for a mechanism by which one could index into such a structure.
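For scale, on a typical 64-bit GHC installation the bound is already far larger than 2^29 (the 2^29 figure is only the minimum range the Haskell Report guarantees for Int); a quick check in GHCi, assuming a 64-bit machine:
ghci> maxBound :: Int
9223372036854775807
ghci> 2^29 :: Int
536870912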

Related

Why isn't bitwise trie a popular implementation of associative array

I have to store billions of entries with Int64 keys in an ordered map.
If I use a usual BST, then each search operation costs log(N) pointer dereferences (20-30 for millions to billions of entries),
however a bitwise trie with bitmap reduces this to just Ceil(64/6) = 11 pointer dereferences.
This comes at the cost of an array for all 64 children in each trie node, but I think that applying the usual array list growth strategy to this array and reusing previously allocated but discarded arrays will mitigate some of the problems with space wastage.
I'm aware that a variant of this data structure is called a HAMT and is used as an effective persistent data structure, but this question is about a usual ordered map like std::map in C++; besides, I need no deletion of entries.
However, there are a few implementations of this data structure on GitHub.
Why aren't bitwise tries as popular as binary search trees?
Disclaimer: I'm the author of https://en.wikipedia.org/wiki/Bitwise_trie_with_bitmap
Why aren't bitwise tries as popular as binary search trees?
Good question, and I don't know the answer. Making bitwise tries more popular is exactly the reason for publishing the Wikipedia article.
This comes at the cost of an array for all 64 children in each trie node, but I think that applying the usual array list growth strategy to this array and reusing previously allocated but discarded arrays will mitigate some of the problems with space wastage.
Nope: that's exactly where the bitmap comes in, to avoid having an array sized for all 64 possible children.
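As a rough illustration (a hedged sketch, not code from any particular implementation): each node keeps a 64-bit bitmap with one bit per possible child, and the number of set bits below a child's position gives its slot in a compressed child array.
import Data.Bits (popCount, testBit, (.&.), bit)
import Data.Word (Word64)

-- Hypothetical node layout: a bitmap plus storage for only the children
-- whose bits are set (a plain list stands in for the child array here).
data Node = Node { bitmap :: !Word64, children :: [Node] }

-- Slot of a 6-bit key chunk within the compressed child array:
-- count the set bits below its position in the bitmap.
childSlot :: Word64 -> Int -> Maybe Int
childSlot bm chunk
  | testBit bm chunk = Just (popCount (bm .&. (bit chunk - 1)))
  | otherwise        = Nothing   -- no child stored for this chunk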

Do I need a HashSet or a Set?

I want to initialise a set of all Ints from 1 to n (n < 20000). Then I want to remove them one by one, meanwhile checking whether certain elements are still in it, until the set is empty.
Which data structure is suited best for this task?
If you want to stick to immutable data structures, I would recommend IntSet. It's carefully optimized for precisely this kind of thing. A Set Int is a balanced binary search tree of Ints, which takes a lot of space and a good bit of time. A HashSet Int is an array-mapped trie of Ints, which is likely faster and more compact, but still pretty mediocre. An IntSet is a PATRICIA tree whose leaves are bitsets. So it's pretty compact (a little over twice the size of an unboxed immutable array when full), but much more efficient to modify.
Initializing an IntSet with all Ints from 1 to n takes O(n) time. If you're only initializing once, or once in a while, and n < 20000, then that shouldn't cause any performance trouble. If, however, you need to initialize often (especially if you sometimes only remove a few elements before discarding the set), or n turns out to be much larger (e.g., hundreds of millions) and you want to cut down on initialization time, you can use IntSet to represent the complement of the set you want to store.
data CompSet = CompSet
  { initialMax :: !Int
  , size :: !Int
  , missingElements :: !IntSet
  }
A CompSet stores the initial maximum (n), and an IntSet indicating which elements in [1..initialMax] are no longer in the set. The size of the CompSet is initialized to initialMax and lets you know in O(1) time whether the set is empty (the set is empty exactly when size missingElements == initialMax).
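To make the idea concrete, here is one possible way to fill in the operations (a hedged sketch; the function names are mine, and the record is repeated so the snippet stands alone):
import qualified Data.IntSet as IntSet
import Data.IntSet (IntSet)

data CompSet = CompSet
  { initialMax      :: !Int
  , size            :: !Int
  , missingElements :: !IntSet
  }

-- The full set [1..n]: nothing is missing yet.
newCompSet :: Int -> CompSet
newCompSet n = CompSet n n IntSet.empty

-- x is present iff it is in range and not recorded as missing.
member :: Int -> CompSet -> Bool
member x (CompSet n _ missing) =
  x >= 1 && x <= n && not (x `IntSet.member` missing)

-- Removing an element just records it in the (small) complement.
delete :: Int -> CompSet -> CompSet
delete x cs@(CompSet n sz missing)
  | member x cs = CompSet n (sz - 1) (IntSet.insert x missing)
  | otherwise   = cs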
Use a bitset (a.k.a. Integer). A 1 bit represents a value still in the set; a 0 bit represents one that just ain't there. For example, the Integer that represents having all the numbers from 1 to n would be bit (n+1) - 2 (assuming you plan to use 0-indexing, as seems sensible to me); to check whether a number is in the set, use testBit; to remove a number, use clearBit.
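A minimal sketch of this Integer-as-bitset idea using Data.Bits (the function names are mine):
import Data.Bits (bit, testBit, clearBit)

-- All numbers from 1 to n present: bits 1..n are set, bit 0 is unused.
fullSet :: Int -> Integer
fullSet n = bit (n + 1) - 2

stillIn :: Int -> Integer -> Bool
stillIn x s = testBit s x

remove :: Int -> Integer -> Integer
remove x s = clearBit s x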
An alternate implementation strategy for the same underlying idea would be to use an unboxed array of Bool, either mutable or immutable as needed. The unboxed versions do the appropriate bit-packing. The only downside would be possibly having to resize the array if you need to add numbers to the set later that are larger than you originally allocated space for.

Are sequences faster than vectors for searching in haskell?

I am kind of new to using data structures in Haskell besides lists. My goal is to choose one container among Data.Vector, Data.Sequence, Data.List, etc. My problem is the following:
I have to create a sequence (mathematically speaking). The sequence starts at 0. In each iteration two new elements are generated, but only one should be appended, based on whether the first element is already in the sequence. So in each iteration there is a call to the elem function (see the pseudo-code below).
appendNewItem :: [Integer] -> [Integer]
appendNewItem acc =
  let firstElem  = someFunc
      secondElem = someOtherFunc
      newElem    = if firstElem `elem` acc
                     then secondElem
                     else firstElem
  in acc `append` newElem

sequenceUptoN :: Int -> [Integer]
sequenceUptoN n = (iterate appendNewItem [0]) !! n
Where the append and iterate functions vary depending on which collection you use (I am using lists in the type signatures for simplicity).
The question is: which data structure should I use? Is Data.Sequence faster for this task because of its finger tree inner structure?
Thanks a lot!!
No, sequences are not faster for searching. A Vector is just a flat chunk of memory, which gives generally the best lookup performance. If you want to optimise searching, use Data.Vector.Unboxed. (The normal, “boxed” variant is also pretty good, but it actually contains only references to the elements in the flat memory-chunk, so it's not quite as fast for lookups.)
However, because of the flat memory layout, Vectors are not good for (pure-functional) appending: basically, whenever you add a new element, the whole array must be copied so as to not invalidate the old one (which somebody else might still be using). If you need to append, Seq is a pretty good choice, although it's not as fast as destructive appending: for maximum performance, you'll want to pre-allocate an uninitialized Data.Vector.Unboxed.Mutable.MVector of the required size, populate it using the ST monad, and freeze the result. But this is much more fiddly than purely-functional alternatives, so unless you need to squeeze out every bit of performance, Data.Sequence is the way to go. If you only want to append, but not look up elements, then a plain old list in reverse order would also do the trick.
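For reference, a minimal sketch of that pre-allocate / fill / freeze pattern (it just fills the vector with squares; unsafeFreeze is O(1) provided the mutable vector is not touched afterwards):
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as M

squares :: Int -> U.Vector Int
squares n = runST $ do
  mv <- M.new n                                   -- allocate a vector of length n
  mapM_ (\i -> M.write mv i (i * i)) [0 .. n - 1] -- fill it destructively
  U.unsafeFreeze mv                               -- O(1); mv must not be reused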
I suggest using Data.Sequence in conjunction with Data.Set. The Sequence to hold the sequence of values and the Set to track the collection.
Sequence, List, and Vector are all structures for working with values where the position in the structure has primary importance when it comes to indexing. In lists we can manipulate elements at the front efficiently, in sequences we can manipulate elements based on the log of the distance to the closest end, and in vectors we can access any element in constant time. Vectors, however, are not that useful if the length keeps changing, so that rules out their use here.
However, you also need to look up a certain value within the list, which these structures don't help with. You have to search the whole of a list/sequence/vector to be certain that a new value isn't present. Data.Map and Data.Set are two of the structures that define an index value based on Ord and let you look up/insert in log(n). So, at the cost of memory usage, you can check the presence of firstElem in your Set in log(n) time and then add newElem to the end of the sequence in constant time. Just make sure to keep these two structures in sync when adding or removing elements.
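A rough sketch of that combination (the two candidate values are passed in as arguments here, standing in for the question's someFunc / someOtherFunc placeholders):
import Data.Sequence (Seq, (|>))
import qualified Data.Set as Set

-- Keep the values in a Seq (for order) and mirror them in a Set
-- (for log(n) membership tests).
appendNewItem :: Integer -> Integer
              -> (Seq Integer, Set.Set Integer)
              -> (Seq Integer, Set.Set Integer)
appendNewItem firstElem secondElem (acc, seen) =
  let newElem = if firstElem `Set.member` seen then secondElem else firstElem
  in  (acc |> newElem, Set.insert newElem seen)   -- O(1) append, O(log n) insert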

Minimal and maximal magnitude in Fortran

I'm trying to rewrite the minpack Fortran 77 library in Java (for my own needs), and I came across this in the minpack.f source code:
integer mcheps(4)
integer minmag(4)
integer maxmag(4)
double precision dmach(3)
equivalence (dmach(1),mcheps(1))
equivalence (dmach(2),minmag(1))
equivalence (dmach(3),maxmag(1))
...
data dmach(1) /2.22044604926d-16/
data dmach(2) /2.22507385852d-308/
data dmach(3) /1.79769313485d+308/
dpmpar = dmach(i)
return
What are the minmag and maxmag functions, and why do dmach(2) and dmach(3) have these values?
There is an explanation in comments:
c dpmpar(1) = b**(1 - t), the machine precision,
c dpmpar(2) = b**(emin - 1), the smallest magnitude,
c dpmpar(3) = b**emax*(1 - b**(-t)), the largest magnitude.
What are the smallest and largest magnitudes? There must be a way to compute these values at runtime; hard-coding machine constants in source code is bad style.
EDIT:
I suppose that static fields Double.MIN_VALUE and Double.MAX_VALUE are those values I looked for.
minmag and maxmag (and mcheps too) are not functions; they are declared to be rank 1 integer arrays with 4 elements each. Likewise, dmach is a rank 1, 3-element array of double precision values. It is very likely, but not certain, that each integer value occupies 4 bytes and each d-p value 8 bytes. Bear this in mind as the answer progresses.
So an expression such as mcheps(1) is not a function call but a reference to the 1st element of an array.
equivalence is an old FORTRAN feature, now deprecated both by language standards and by software engineering practices. A statement such as
equivalence (dmach(1),mcheps(1))
states that the first element of dmach is located, in memory, at the same address as the first element of mcheps. By implication, this also means that the 24 bytes of dmach occupy the same addresses as the 16 bytes of mcheps, and another 8 bytes too. I'll leave you to draw a picture of what is going on. Note that it is conceivable that the code originally (and perhaps still) uses 8 byte integers so that the elements of the equivalenced arrays match 1:1.
Note that equivalence gives, essentially, more than one name, and more than one interpretation, to the same memory locations. mcheps(1) is the name of an integer stored in 4 bytes of memory which form part of the storage for dmach(1). Equivalencing used to be used to implement all sorts of 'clever' tricks back in the days when every byte was precious.
Then the data statements assign values to the elements of dmach. To me those values look to be just what the comment tells us they are.
EDIT: The comment indicates that those magnitudes are the smallest and largest representable double precision numbers on the platform for which the code was last compiled. I think that in Java they are probably called doubles. I don't know Java, so I don't know what facilities it has for returning the values of the largest and smallest doubles; if you don't know either, hit the 'net or ask another SO question -- to which you'll probably get responses along the lines of "search the net".
Most of this you should be able to ignore entirely. As you write, a better approach would be to find out those values at run time by enquiry using intrinsic functions. Fortran 90 (and later) has such functions; I imagine Java has too, but that's your domain, not mine.
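Since this page is Haskell-flavoured, here is a hedged sketch of the same run-time-enquiry idea in Haskell: the three dpmpar constants computed from the RealFloat class rather than hard-coded (on IEEE-754 hardware they should match the dmach data above):
-- d is only used as a proxy for the Double type.
d :: Double
d = 0

eps, tiny, huge :: Double
eps  = encodeFloat 1 (1 - floatDigits d)                  -- b**(1-t), machine precision
tiny = encodeFloat 1 (fst (floatRange d) - 1)             -- b**(emin-1), smallest magnitude
huge = encodeFloat (2 ^ floatDigits d - 1)
                   (snd (floatRange d) - floatDigits d)   -- b**emax * (1 - b**(-t)), largest magnitude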

`Integer` vs `Int64` vs `Word64`

I have some data which can be represented by an unsigned Integral type and its biggest value requires 52 bits. AFAIK only Integer, Int64 and Word64 satisfy these requirements.
All the information I could find out about those types was that Integer is signed and has an unlimited bit-size, and that Int64 and Word64 are fixed-size, signed and unsigned respectively. What I couldn't find was information on the actual implementation of those types:
How many bits will a 52-bit value actually occupy if stored as an Integer?
Am I correct that Int64 and Word64 allow you to store 64 bits of data and weigh exactly 64 bits for any value?
Are any of those types more performant or preferable for any reasons other than size, e.g. native code implementations or optimizations related to direct processor instructions?
And just in case: which one would you recommend for storing a 52-bit value in an application extremely sensitive in terms of performance?
How many bits will a 52-bit value actually occupy if stored as an Integer?
This is implementation-dependent. With GHC, values that fit inside a machine word are stored directly in a constructor of Integer, so if you're on a 64-bit machine, it should take the same amount of space as an Int. This corresponds to the S# constructor of Integer:
data Integer = S# Int#
             | J# Int# ByteArray#
Larger values (i.e. those represented with J#) are stored with GMP.
Am I correct that Int64 and Word64 allow you to store 64 bits of data and weigh exactly 64 bits for any value?
Not quite: they're boxed. An Int64 is actually a pointer to either an unevaluated thunk or a heap object consisting of a one-word info-table pointer plus the 64-bit integer value. (See the GHC commentary for more information.)
If you really want something that's guaranteed to be 64 bits, no exceptions, then you can use an unboxed type like Int64#, but I would strongly recommend profiling first; unboxed values are quite painful to use. For instance, you can't use unboxed types as arguments to type constructors, so you can't have a list of Int64#s. You also have to use operations specific to unboxed integers. And, of course, all of this is extremely GHC-specific.
If you're looking to store a lot of 52-bit integers, you might want to use vector or repa (built on vector, with fancy things like automatic parallelism); they store the values unboxed under the hood, but let you work with them in boxed form. (Of course, each individual value you take out will be boxed.)
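For example, a small hedged sketch with Data.Vector.Unboxed (the names are mine); the values live in one flat, unboxed buffer:
import qualified Data.Vector.Unboxed as U
import Data.Int (Int64)

values :: U.Vector Int64
values = U.generate 1000000 fromIntegral   -- one flat, unboxed buffer of Int64s

total :: Int64
total = U.sum values                       -- operations like sum work on the unboxed data directly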
Are any of those types more performant or preferable for any reasons other than size, e.g. native code implementations or optimizations related to direct processor instructions?
Yes; using Integer incurs a branch for every operation, since it has to distinguish the machine-word and bignum cases; and, of course, it has to handle overflow. Fixed-size integral types avoid this overhead.
And just in case: which one would you recommend for storing a 52-bit value in an application extremely sensitive in terms of performance?
If you're using a 64-bit machine: Int64 or, if you must, Int64#.
If you're using a 32-bit machine: Probably Integer, since on 32-bit Int64 is emulated with FFI calls to GHC functions that are probably not very highly optimised, but I'd try both and benchmark it. With Integer, you'll get the best performance on small integers, and GMP is heavily-optimised, so it'll probably do better on the larger ones than you might think.
You could select between Int64 and Integer at compile-time using the C preprocessor (enabled with {-# LANGUAGE CPP #-}); I think it would be easy to get Cabal to control a #define based on the word width of the target architecture. Beware, of course, that they are not the same; you will have to be careful to avoid "overflows" in the Integer code, and e.g. Int64 is an instance of Bounded but Integer is not. It might be simplest to just target a single word width (and thus type) for performance and live with the slower performance on the other.
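A sketch of what that compile-time selection might look like, assuming GHC's MachDeps.h header (which defines WORD_SIZE_IN_BITS); the module and alias names are mine:
{-# LANGUAGE CPP #-}
#include "MachDeps.h"
module WordSize where

import Data.Int (Int64)

#if WORD_SIZE_IN_BITS >= 64
type FastInt = Int64     -- native 64-bit arithmetic
#else
type FastInt = Integer   -- avoid the emulated Int64 path on 32-bit targets
#endif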
I would suggest creating your own Int52 type as a newtype wrapper over Int64, or a Word52 wrapper over Word64; just pick whichever matches your data better, as there should be no performance impact. If it's just arbitrary bits, I'd go with Int64, simply because Int is more common than Word.
You can define all the instances to handle wrapping automatically (try :info Int64 in GHCi to find out which instances you'll want to define), and provide "unsafe" operations that just apply directly under the newtype for performance-critical situations where you know there won't be any overflow.
Then, if you don't export the newtype constructor, you can always swap the implementation of Int52 later, without changing any of the rest of your code. Don't worry about the overhead of a separate type: the runtime representation of a newtype is completely identical to that of the underlying type; newtypes only exist at compile time.
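A hedged sketch of that newtype approach (the wrapping-by-masking convention and the function names are mine; a real version would define the instances that :info Int64 lists):
module Int52 (Int52, add, unsafeAdd) where  -- constructor deliberately not exported

import Data.Int  (Int64)
import Data.Bits ((.&.))

newtype Int52 = Int52 Int64
  deriving (Eq, Ord, Show)

-- "Safe" addition that wraps the result back into 52 bits
-- (part of what a full Num instance would do).
add :: Int52 -> Int52 -> Int52
add (Int52 a) (Int52 b) = Int52 ((a + b) .&. (2 ^ 52 - 1))

-- "Unsafe" addition for performance-critical code where the caller
-- knows the result still fits in 52 bits.
unsafeAdd :: Int52 -> Int52 -> Int52
unsafeAdd (Int52 a) (Int52 b) = Int52 (a + b)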
