Using Haskell, I am doing exercises on HackerRank in order to familiarize myself with the language. For the particular problem I am currently doing, I will have to do a matrix multiply. Unlike in Python, where I could just use NumPy, I've checked on Ideone and it seems Haskell does not have any linear algebra packages plugged in, so I am going to do it by hand. If I were doing this problem in F# I would just use a plain array, but in Haskell I am not sure, as it has various array types. I am looking for some advice on what I should be looking into here, as I have a total of three days' experience in the language so far.
I am also wondering whether tuples are stack or heap allocated in Haskell as I might have to use them to encode (index,value) positions.
To answer my own question, the go-to module in Haskell for plain arrays is Data.Vector.Unboxed. Haskell distinguishes between boxed and unboxed arrays, and even though I knew that, it still somehow surprised me that in a vector of vectors, the outer vector has to be a boxed type.
Also regarding tuples: per the documentation, for efficiency a vector of tuples is represented as a tuple of vectors, which means the elements do get allocated to contiguous areas on the heap.
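For reference, here is a minimal sketch of a hand-rolled multiply over flat, row-major unboxed vectors; matMul and its dimension arguments are my own names, and bounds checks are omitted:

```haskell
import qualified Data.Vector.Unboxed as U

-- Multiply an n×m matrix by an m×p matrix, both stored row-major
-- in flat unboxed vectors. A sketch; assumes the sizes match.
matMul :: Int -> Int -> Int
       -> U.Vector Double -> U.Vector Double -> U.Vector Double
matMul n m p a b = U.generate (n * p) $ \k ->
  let (i, j) = k `divMod` p
  in  sum [ a U.! (i * m + l) * b U.! (l * p + j) | l <- [0 .. m - 1] ]
```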
Related
I'm told that I should use unboxed vectors for heavy scientific computations (simulations that run for hours and even days) instead of lists, or even boxed vectors.
Is this true?
Are there other data structures than lists, boxed vectors, or unboxed vectors that are common?
Can you explain the difference between boxed and unboxed vectors?
A couple of things to understand. First, boxed vs unboxed:
If you have a "boxed" number, basically you have a pointer to a number. (Or possibly a pointer to an unevaluated expression that will produce a number if it ever gets evaluated.)
If you have an "unboxed" number, you have literally the number itself.
For example, on a 64-bit platform, Char is a 64-bit pointer to somewhere on the Haskell heap that holds either a 32-bit Unicode codepoint, or an arbitrarily large unevaluated expression that produces such. On the other hand, Char# is literally just a 32-bit integer. It could be on the heap somewhere, or it could be in a CPU register or something.
Basically Char# is what C would think of as an integer, whereas Char is what C would think of as a pointer to... actually a data structure that tells you if it's evaluated or not, and if it is evaluated, the integer is in there. Which is quite a bit more complicated.
An important thing to notice is that a boxed value is potentially unevaluated; the pointer can point to the result, or the expression for producing the result. An unboxed value can never be unevaluated. For example, a 32-bit integer cannot possibly store an arbitrarily huge expression for calculating an integer; it can only store the integer itself. So boxed vs unboxed is unavoidably tangled up with lazy vs strict.
Lazy evaluation can allow you to avoid calculating results you don't actually need. (Infinite lists and so forth.) But if you actually do always need all the results, strict evaluation is actually faster.
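A small sketch to make the distinction concrete, using the vector package (Data.Vector is boxed, Data.Vector.Unboxed is unboxed):

```haskell
import qualified Data.Vector as V          -- boxed: array of pointers
import qualified Data.Vector.Unboxed as U  -- unboxed: array of raw values

main :: IO ()
main = do
  -- A boxed vector may contain unevaluated thunks; the error below
  -- never fires because we only force the first element.
  let boxed = V.fromList [1, error "never forced", 3] :: V.Vector Int
  print (V.head boxed)

  -- An unboxed vector stores the machine values themselves,
  -- so every element is evaluated when the vector is built.
  let unboxed = U.fromList [1, 2, 3] :: U.Vector Int
  print (U.sum unboxed)
```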
Next, lists vs vectors:
A Haskell list [Double] (or whatever) is a singly-linked list of pointers to double-precision floats. Each float could be unevaluated, and each link in the list could also be unevaluated. As you can imagine, this has no cache locality whatsoever!
A vector is an array of stuff. This means no infinite vectors; the size of the vector must be known at creation time. It also means that to modify an immutable vector, you must copy the entire thing, which is very inefficient in time and space. Otherwise you must use mutable vectors, which negates some of the benefits of functional programming. On the other hand, vectors have awesome cache locality!
Now, a boxed vector is basically an array of pointers to the actual data. An unboxed vector is an array of the actual data. Wanna take a guess which of those has the best cache behaviour? As a side effect, an unboxed vector is also strict, which — if you need the whole vector — is going to be faster.
So you see, unboxed vectors place certain restrictions on you, but potentially give the best performance.
Having just said all that, GHC performs all kinds of tricky optimisations that can radically alter the performance of the code versus what it "appears" to be doing. GHC may turn lazy code into strict code, and it may perform "list fusion", where chains of functions that loop over lists get turned into a single tight loop. But then again, chains of vector operations get fused as well, so... in reality, actual performance rather depends on what you're trying to do.
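For instance, a pipeline like the following should fuse into a single loop under -O2 (a sketch; what GHC actually produces depends on optimisation settings):

```haskell
import qualified Data.Vector.Unboxed as U

-- With -O2, stream fusion turns this filter/map/sum pipeline
-- into one tight loop; no intermediate vectors are built.
sumOfEvenSquares :: Int -> Int
sumOfEvenSquares n =
  U.sum . U.map (^ 2) . U.filter even $ U.enumFromTo 1 n
```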
I am looking for a Haskell data structure that stores an ordered list of elements and that is time-efficient at swapping pairs of elements at arbitrary locations within the list. It's not [a], obviously. It's not Vector because swapping creates new vectors. Which data structure is efficient at this?
The most efficient implementations of persistent data structures, with effectively constant-time updates (as well as appending, prepending, counting and slicing), are based on the Array Mapped Trie algorithm. The Vector data structures of Clojure and Scala are based on it, for instance. The only Haskell implementation of that data structure that I know of is provided by the "persistent-vector" package.
This algorithm is quite young; it was first presented in the year 2000, which might be why not many people have heard of it. But it turned out to be such a universal solution that it was soon adapted for hash tables as well. The adapted version of the algorithm is called the Hash Array Mapped Trie. It is likewise used in Clojure and Scala to implement the Set and Map data structures, and it is also used in Haskell, with packages like "unordered-containers" and "stm-containers" built around it.
To learn more about the algorithm I recommend the following links:
http://blog.higher-order.net/2009/02/01/understanding-clojures-persistentvector-implementation.html
http://lampwww.epfl.ch/papers/idealhashtrees.pdf
Data.Sequence from the containers package would likely be a not-terrible data structure to start with for this use case.
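As a sketch, a swap in terms of Data.Sequence's O(log n) index and update might look like this (swap is my own name, and it assumes both indices are in bounds):

```haskell
import qualified Data.Sequence as Seq
import           Data.Sequence (Seq)

-- Swap the elements at positions i and j in O(log n).
-- Assumes both indices are in bounds.
swap :: Int -> Int -> Seq a -> Seq a
swap i j s =
  let xi = Seq.index s i
      xj = Seq.index s j
  in  Seq.update i xj (Seq.update j xi s)
```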
Haskell is a (nearly) pure functional language, so any data structure you update will need to make a new copy of the structure, and re-using the data elements is close to the best you can do. Also, the new structure is lazily evaluated, and typically only the spine needs to be created until you need the data. If the number of updates is small compared to the number of elements, you could layer a sparse set of updates over the original vector, checking the updates first and only then looking in the original vector, as sketched below.
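Here is one way such an overlay might look; everything here (Patched, readAt, writeAt) is a hypothetical sketch, not a library API:

```haskell
import qualified Data.IntMap.Strict as IM
import qualified Data.Vector as V

-- A vector with a sparse overlay of updates: reads check the
-- overlay first and fall back to the original vector.
data Patched a = Patched (IM.IntMap a) (V.Vector a)

readAt :: Patched a -> Int -> a
readAt (Patched ups base) i = IM.findWithDefault (base V.! i) i ups

writeAt :: Int -> a -> Patched a -> Patched a
writeAt i x (Patched ups base) = Patched (IM.insert i x ups) base
```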
When I see Haskell's Vector, that's not a physics or linear algebra vector, is it? A Vector in Java (as I recall) is an array one can dynamically add to. But that's not usable as a science vector, is it? How could I have a science-vector data structure in Haskell?
I assume you are referring to Data.Vector, and yes, you're right. It's not the vector the mathematical parts of us know. For a "legit" vector and any other linear algebra on free vector spaces there is the linear package.
Data.Vector is pretty much a memory array (very optimized, very good API).
Data.Vector really is a misleading name, and I assume it was directly or indirectly inspired by C++'s standard library vector class template, whose name is misleading too.
From here, in C++'s standard library...
It's called a vector because Alex Stepanov, the designer of the Standard Template Library, was looking for a name to distinguish it from built-in arrays. He admits now that he made a mistake, because mathematics already uses the term 'vector' for a fixed-length sequence of numbers.
...
Alex's lesson: be very careful every time you name something.
I guess you want hmatrix, a Haskell interface to LAPACK and BLAS.
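A minimal sketch with hmatrix, assuming a reasonably recent version where the matrix product is <> (which clashes with Prelude's <>, hence the hiding) and #> is the matrix-vector product:

```haskell
import Numeric.LinearAlgebra
import Prelude hiding ((<>))

main :: IO ()
main = do
  let a = (2><3) [1 .. 6] :: Matrix Double  -- 2×3 matrix, row-major
      b = (3><2) [1 .. 6] :: Matrix Double  -- 3×2 matrix
  print (a <> b)                  -- 2×2 matrix product
  print (a #> vector [1, 0, 1])   -- matrix–vector product
```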
I'll get straight to it - is there a way to have a dynamically sized constant-time access data-structure in Haskell, much like an array in any other imperative language?
I'm sure there is a module somewhere that does this for us magically, but I'm hoping for a general explanation of how one would do this in a functional manner :)
As far as I'm aware, Map uses a binary tree representation so it has O(log(n)) access time, and lists of course have O(n) access time.
Additionally, if we made it so that it was immutable, it would be pure, right?
Any ideas how I could go about this (beyond something like Array = Array { one :: Int, two :: Int, three :: Int ...} in template Haskell or the like)?
If your key is isomorphic to Int then you can use IntMap as most of its operations are O(min(n,W)), where n is the number of elements and W is the number of bits in Int (usually 32 or 64), which means that as the collection gets large the cost of each individual operation converges to a constant.
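For instance, a quick sketch:

```haskell
import qualified Data.IntMap.Strict as IM

main :: IO ()
main = do
  let m = IM.fromList (zip [0 ..] "haskell")
  print (IM.lookup 3 m)        -- Just 'k'
  print (IM.insert 7 '!' m)    -- "update" by building a new map
```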
a dynamically sized constant-time access data-structure in Haskell,
Data.Array
Data.Vector
etc etc.
For associative structures you can choose between:
Log-N tree and trie structures
Hash tables
Mixed hash mapped tries
With various different log-complexities and constant factors.
All of these are on hackage.
In addition to the other good answers, it might be useful to say that:
When restricted to Algebraic Data Types and purity, all dynamically sized data structures must have at least logarithmic worst-case access time.
Personally, I like to call this the price of purity.
Haskell offers you three main ways around this:
Change the problem: Use hashes or prefix trees.
For constant-time reads use pure Arrays or the more recent Vectors; they are not ADTs and need compiler support / hidden IO inside. Constant-time writes are not possible since purity forbids modifying the original data structure.
For constant-time writes use the IO or ST monad, preferring ST when you can to avoid externally visible side effects. These monads are implemented in the compiler.
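For example, a pure function can use constant-time writes internally and freeze the result; a sketch with runSTUArray:

```haskell
import Data.Array.ST (newArray, writeArray, runSTUArray)
import Data.Array.Unboxed (UArray)

-- Build an array with O(1) writes inside ST, then freeze it;
-- the mutation is never visible from outside the function.
squares :: Int -> UArray Int Int
squares n = runSTUArray $ do
  arr <- newArray (0, n - 1) 0
  mapM_ (\i -> writeArray arr i (i * i)) [0 .. n - 1]
  return arr
```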
It's true that you can't have constant time access arrays in Haskell without compiler/runtime magic.
However, this isn't (just) because Haskell is functional. Arrays in Java and C# also require runtime magic. In Rust you might be able to implement them in unsafe code, but not in safe Rust.
The truth is, any language that doesn't let you allocate memory of dynamic size, or doesn't let you use raw pointers, is going to require runtime magic to implement arrays.
That excludes any safe language, whether object oriented, or functional.
The only difference between Haskell and e.g. Java with respect to arrays is that arrays are far less useful in Haskell than in Java; but in Java arrays are so core to everything we do that we don't even notice that they're magic.
There is one way, though, in which Haskell requires more magic for arrays than e.g. Java does.
With Java you can initialise an empty array (which requires magic) and then fill it up with values (which doesn't).
With Haskell this would obviously go against immutability. So any array would have to be initialised with its values. Thus the compiler magic doesn't just stretch to giving you an empty chunk of memory to index into. It also requires giving you a way to initialise the array with values. So creation and initialisation of the array has to be a single step, entirely handled by the compiler.
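This is why the pure array APIs take the values up front. For example, generating each element from its index, or from a list (a sketch):

```haskell
import qualified Data.Vector.Unboxed as U
import Data.Array (Array, listArray)

-- Creation and initialisation happen in one step: there is
-- never an "empty but allocated" array visible to pure code.
v :: U.Vector Int
v = U.generate 10 (\i -> i * i)

a :: Array Int Char
a = listArray (0, 4) "hello"
```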
I'm playing with Haskell and Project Euler's 23rd problem. After solving it with lists I went here where I saw some array work. This solution was much faster than mine.
So here's the question. When should I use arrays in Haskell? Is their performance better than lists' and in which cases?
The most obvious difference is the same as in other languages: arrays have O(1) lookup and lists have O(n). Attaching something to the head of a list (:) takes O(1); appending (++) takes O(n).
Arrays have some other potential advantages as well. You can have unboxed arrays, which means the entries are just stored contiguously in memory without having a pointer to each one (if I understand the concept correctly). This has several benefits--it takes less memory and could improve cache performance.
Modifying immutable arrays is difficult--you'd have to copy the entire array which takes O(n). If you're using mutable arrays, you can modify them in O(1) time, but you'd have to give up the advantages of having a purely functional solution.
Finally, lists are just much easier to work with if performance doesn't matter. For small amounts of data, I would always use a list.
And if you're doing much indexing as well as much updating, you can:
use Maps (or IntMaps): O(log size) indexing and update, good enough for most uses, and the code is easy to get right;
or, if Maps are too slow, use mutable (unboxed) arrays (STUArray from Data.Array.ST, or STVector from the vector package): O(1) indexing and update, but the code is easier to get wrong and generally not as nice.
For specific uses, functions like accumArray give very good performance too (uses mutable arrays under the hood).
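For instance, building a small histogram in one pass (a sketch):

```haskell
import Data.Array (Array, accumArray)

-- accumArray fills a fresh array with an accumulating function;
-- internally it mutates, but the interface stays pure.
histogram :: [Int] -> Array Int Int
histogram xs = accumArray (+) 0 (0, 9) [ (x, 1) | x <- xs, x >= 0, x <= 9 ]
```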
Arrays have O(1) indexing (this used to be part of the Haskell definition), whereas lists have O(n) indexing. On the other hand, any kind of modification of arrays is O(n) since it has to copy the array.
So if you're going to do a lot of indexing but little updating, use arrays.