Handle dependency resolution functionally in Haskell

Handle dependency resolution functionally in Haskell - haskell

I'm implementing something similar to a Spreadsheet engine in Haskell.
There are ETables, which have rows of cells containing expressions in the form of ASTs (e.g. BinOp + 2 2), which can contain references to other cells of ETables.
The main function should convert these ETables into VTables, which contain a fully resolved value in the cell (e.g. the cell BinOp + 2 2 should be resolved to IntValue 4). This is pretty easy when cells have no external references, because you can just build the value bottom up from the expression AST of the cell (e.g. eval (BinOpExpr op l r) = IntValue $ (eval l) op (eval r), minus unboxing and typechecking) all the way to the table (evalTable = (map . map) eval rows)
However, I can't think of a "natural" way of handling this when external references are thrown into the mix. Am I correct to assume that I can't just call eval on the referenced cell and use its value, because Haskell is not smart enough to cache the result and re-use it when that cell is independently evaluated?
The best thing I came up with is using a State [VTable] which is progressively filled, so the caching is explicit (each eval call updates the state with the return value before returning). This should work, however it feels "procedural". Is there a more idiomatic approach available that I'm missing?

Haskell doesn't memoization by default because that would generally take up too much memory, so you can't just rely on eval doing the right thing. However, the nature of lazy evaluation means that data structures are, in a sense, memoized: each thunk in a large lazy structure is only evaluated once. This means that you can memoize a function by defining a large lazy data structure internally and replacing recursive calls with accesses into the structure—each part of the structure will be evaluated at most once.
I think the most elegant approach to model your spreadsheet would be a large, lazy directed graph with the cells as nodes and references as edges. Then you would need to define the VTable graph in a recursive way such that all recursion goes through the graph itself, which will memoize the result in the way I described above.
There are a couple of handy ways to model a graph. One option would be to use an explicit map with integers as node identifiers—IntMap or even an array of some sort could work. Another option is to use an existing graph library; this will save you some work and ensure you have a nice graph abstraction, but will take some effort up front to understand. I'm a big fan of the fgl, the "functional graph library", but it does take a bit of up-front reading and thinking to understand. The performance isn't going to be very different because it's also implemented in terms of IntMap.
Tooting my own horn a bit, I've written a couple of blog posts expanding on this answer: one about memoization with lazy structures (with pictures!) and one about the functional graph library. Putting the two ideas together should get you what you want, I believe.

Related

How can I obtain constant time access (like in an array) in a data structure in Haskell?

I'll get straight to it - is there a way to have a dynamically sized constant-time access data-structure in Haskell, much like an array in any other imperative language?
I'm sure there is a module somewhere that does this for us magically, but I'm hoping for a general explanation of how one would do this in a functional manner :)
As far as I'm aware, Map uses a binary tree representation so it has O(log(n)) access time, and lists of course have O(n) access time.
Additionally, if we made it so that it was immutable, it would be pure, right?
Any ideas how I could go about this (beyond something like Array = Array { one :: Int, two :: Int, three :: Int ...} in template Haskell or the like)?

If your key is isomorphic to Int then you can use IntMap as most of its operations are O(min(n,W)), where n is the number of elements and W is the number of bits in Int (usually 32 or 64), which means that as the collection gets large the cost of each individual operation converges to a constant.

a dynamically sized constant-time access data-structure in Haskell,
Data.Array
Data.Vector
etc etc.
For associative structures you can choose between:
Log-N tree and trie structures
Hash tables
Mixed hash mapped tries
With various different log-complexities and constant factors.
All of these are on hackage.

In addition to the other good answers, it might be useful to say that:
When restricted to Algebraic Data Types and purity, all dynamically
sized data structure must have at least logarithmic worst-case access
time.
Personally, I like to call this the price of purity.
Haskell offers you three main ways around this:
Change the problem: Use hashes or prefix trees.
For constant-time reads use pure Arrays or the more recent Vectors; they are not ADTs and need compiler support / hidden IO inside. Constant-time writes are not possible since purity forbids the original data structure to be modified.
For constant-time writes use the IO or ST monad, preferring ST when you can to avoid externally visible side effects. These monads are implemented in the compiler.

It's true that you can't have constant time access arrays in Haskell without compiler/runtime magic.
However, this isn't (just) because Haskell is functional. Arrays in Java and C# also require runtime magic. In Rust you might be able to implement them in unsafe code, but not in safe Rust.
The truth is any language that doesn't allow you to allocate memory of dynamic size, or that doesn't allow you to use pointers is going to require runtime magic to implement arrays.
That excludes any safe language, whether object oriented, or functional.
The only difference between Haskell and eg. Java with respect to Arrays, is that arrays are far less useful in Haskell than in Java, but in Java arrays are so core to everything we do that we don't even notice that they're magic.
There is one way though that Haskell requires more magic for arrays than eg. Java.
With Java you can initialise an empty array (which requires magic) and then fill it up with values (which doesn't).
With Haskell this would obviously go against immutability. So any array would have to be initialised with its values. Thus the compiler magic doesn't just stretch to giving you an empty chunk of memory to index into. It also requires giving you a way to initialise the array with values. So creation and initialisation of the array has to be a single step, entirely handled by the compiler.

Is there an object-identity-based, thread-safe memoization library somewhere?

I know that memoization seems to be a perennial topic here on the haskell tag on stack overflow, but I think this question has not been asked before.
I'm aware of several different 'off the shelf' memoization libraries for Haskell:
The memo-combinators and memotrie packages, which make use of a beautiful trick involving lazy infinite data structures to achieve memoization in a purely functional way. (As I understand it, the former is slightly more flexible, while the latter is easier to use in simple cases: see this SO answer for discussion.)
The uglymemo package, which uses unsafePerformIO internally but still presents a referentially transparent interface. The use of unsafePerformIO internally results in better performance than the previous two packages. (Off the shelf, its implementation uses comparison-based search data structures, rather than perhaps-slightly-more-efficient hash functions; but I think that if you find and replace Cmp for Hashable and Data.Map for Data.HashMap and add the appropraite imports, you get a hash based version.)
However, I'm not aware of any library that looks answers up based on object identity rather than object value. This can be important, because sometimes the kinds of object which are being used as keys to your memo table (that is, as input to the function being memoized) can be large---so large that fully examining the object to determine whether you've seen it before is itself a slow operation. Slow, and also unnecessary, if you will be applying the memoized function again and again to an object which is stored at a given 'location in memory' 1. (This might happen, for example, if we're memoizing a function which is being called recursively over some large data structure with a lot of structural sharing.) If we've already computed our memoized function on that exact object before, we can already know the answer, even without looking at the object itself!
Implementing such a memoization library involves several subtle issues and doing it properly requires several special pieces of support from the language. Luckily, GHC provides all the special features that we need, and there is a paper by Peyton-Jones, Marlow and Elliott which basically worries about most of these issues for you, explaining how to build a solid implementation. They don't provide all details, but they get close.
The one detail which I can see which one probably ought to worry about, but which they don't worry about, is thread safety---their code is apparently not threadsafe at all.
My question is: does anyone know of a packaged library which does the kind of memoization discussed in the Peyton-Jones, Marlow and Elliott paper, filling in all the details (and preferably filling in proper thread-safety as well)?
Failing that, I guess I will have to code it up myself: does anyone have any ideas of other subtleties (beyond thread safety and the ones discussed in the paper) which the implementer of such a library would do well to bear in mind?
UPDATE
Following #luqui's suggestion below, here's a little more data on the exact problem I face. Let's suppose there's a type:
data Node = Node [Node] [Annotation]
This type can be used to represent a simple kind of rooted DAG in memory, where Nodes are DAG Nodes, the root is just a distinguished Node, and each node is annotated with some Annotations whose internal structure, I think, need not concern us (but if it matters, just ask and I'll be more specific.) If used in this way, note that there may well be significant structural sharing between Nodes in memory---there may be exponentially more paths which lead from the root to a node than there are nodes themselves. I am given a data structure of this form, from an external library with which I must interface; I cannot change the data type.
I have a function
myTransform : Node -> Node
the details of which need not concern us (or at least I think so; but again I can be more specific if needed). It maps nodes to nodes, examining the annotations of the node it is given, and the annotations its immediate children, to come up with a new Node with the same children but possibly different annotations. I wish to write a function
recursiveTransform : Node -> Node
whose output 'looks the same' as the data structure as you would get by doing:
recursiveTransform Node originalChildren annotations =
myTransform Node recursivelyTransformedChildren annotations
where
recursivelyTransformedChildren = map recursiveTransform originalChildren
except that it uses structural sharing in the obvious way so that it doesn't return an exponential data structure, but rather one on the order of the same size as its input.
I appreciate that this would all be easier if say, the Nodes were numbered before I got them, or I could otherwise change the definition of a Node. I can't (easily) do either of these things.
I am also interested in the general question of the existence of a library implementing the functionality I mention quite independently of the particular concrete problem I face right now: I feel like I've had to work around this kind of issue on a few occasions, and it would be nice to slay the dragon once and for all. The fact that SPJ et al felt that it was worth adding not one but three features to GHC to support the existence of libraries of this form suggests that the feature is genuinely useful and can't be worked around in all cases. (BUT I'd still also be very interested in hearing about workarounds which will help in this particular case too: the long term problem is not as urgent as the problem I face right now :-) )
1 Technically, I don't quite mean location in memory, since the garbage collector sometimes moves objects around a bit---what I really mean is 'object identity'. But we can think of this as being roughly the same as our intuitive idea of location in memory.

If you only want to memoize based on object identity, and not equality, you can just use the existing laziness mechanisms built into the language.
For example, if you have a data structure like this
data Foo = Foo { ... }
expensive :: Foo -> Bar
then you can just add the value to be memoized as an extra field and let the laziness take care of the rest for you.
data Foo = Foo { ..., memo :: Bar }
To make it easier to use, add a smart constructor to tie the knot.
makeFoo ... = let foo = Foo { ..., memo = expensive foo } in foo
Though this is somewhat less elegant than using a library, and requires modification of the data type to really be useful, it's a very simple technique and all thread-safety issues are already taken care of for you.

It seems that stable-memo would be just what you needed (although I'm not sure if it can handle multiple threads):
Whereas most memo combinators memoize based on equality, stable-memo does it based on whether the exact same argument has been passed to the function before (that is, is the same argument in memory).
stable-memo only evaluates keys to WHNF.
This can be more suitable for recursive functions over graphs with cycles.
stable-memo doesn't retain the keys it has seen so far, which allows them to be garbage collected if they will no longer be used. Finalizers are put in place to remove the corresponding entries from the memo table if this happens.
Data.StableMemo.Weak provides an alternative set of combinators that also avoid retaining the results of the function, only reusing results if they have not yet been garbage collected.
There is no type class constraint on the function's argument.
stable-memo will not work for arguments which happen to have the same value but are not the same heap object. This rules out many candidates for memoization, such as the most common example, the naive Fibonacci implementation whose domain is machine Ints; it can still be made to work for some domains, though, such as the lazy naturals.

Ekmett just uploaded a library that handles this and more (produced at HacPhi): http://hackage.haskell.org/package/intern. He assures me that it is thread safe.
Edit: Actually, strictly speaking I realize this does something rather different. But I think you can use it for your purposes. It's really more of a stringtable-atom type interning library that works over arbitrary data structures (including recursive ones). It uses WeakPtrs internally to maintain the table. However, it uses Ints to index the values to avoid structural equality checks, which means packing them into the data type, when what you want are apparently actually StableNames. So I realize this answers a related question, but requires modifying your data type, which you want to avoid...

Ordering of parameters to make use of currying

I have twice recently refactored code in order to change the order of parameters because there was too much code where hacks like flip or \x -> foo bar x 42 were happening.
When designing a function signature what principles will help me to make the best use of currying?

For languages that support currying and partial-application easily, there is one compelling series of arguments, originally from Chris Okasaki:
Put the data structure as the last argument
Why? You can then compose operations on the data nicely. E.g. insert 1 $ insert 2 $ insert 3 $ s. This also helps for functions on state.
Standard libraries such as "containers" follow this convention.
Alternate arguments are sometimes given to put the data structure first, so it can be closed over, yielding functions on a static structure (e.g. lookup) that are a bit more concise. However, the broad consensus seems to be that this is less of a win, especially since it pushes you towards heavily parenthesized code.
Put the most varying argument last
For recursive functions, it is common to put the argument that varies the most (e.g. an accumulator) as the last argument, while the argument that varies the least (e.g. a function argument) at the start. This composes well with the data structure last style.
A summary of the Okasaki view is given in his Edison library (again, another data structure library):
Partial application: arguments more likely to be static usually appear before other arguments in order to facilitate partial application.
Collection appears last: in all cases where an operation queries a single collection or modifies an existing collection, the collection argument will appear last. This is something of a de facto standard for Haskell datastructure libraries and lends a degree of consistency to the API.
Most usual order: where an operation represents a well-known mathematical function on more than one datastructure, the arguments are chosen to match the most usual argument order for the function.

Place the arguments that you are most likely to reuse first. Function arguments are a great example of this. You are much more likely to want to map f over two different lists, than you are to want to map many different functions over the same list.

I tend to do what you did, pick some order that seems good and then refactor if it turns out that another order is better. The order depends a lot on how you are going to use the function (naturally).

Represent Flowchart-specified Algorithms in Haskell

I'm confronted with the task of implementing algorithms (mostly business logic style) expressed as flowcharts. I'm aware that flowcharts are not the best algorithm representation due to its spaghetti-code property (would this be a use-case for CPS?), but I'm stuck with the specification expressed as flowcharts.
Although I could transform the flowcharts into more appropriate equivalent representations before implementing them, that could make it harder to "recognize" the orginal flow-chart in the resulting implementation, so I was hoping there is some way to directly represent flowchart-algorithms as (maybe monadic) EDSLs in Haskell, so that the semblance to the original flowchart-specification would be (more) obvious.

One possible representation of flowcharts is by using a group of mutually tail-recursive functions, by translating "go to step X" into "evaluate function X with state S". For improved readability, you can combine into a single function both the action (an external function that changes the state) and the chain of if/else or pattern matching that helps determine what step to take next.
This is assuming, of course, that your flowcharts are to be hardcoded (as opposed to loaded at runtime from an external source).

Sounds like Arrows would fit exactly what you describe. Either do a visualization of arrows (should be quite simple) or generate/transform arrow code from flow-graphs if you must.

Assuming there's "global" state within the flowchart, then that makes sense to package up into a state monad. At least then, unlike how you're doing it now, each call doesn't need any parameters, so can be read as a) modify state, b) conditional on current state, jump.

Explanation of “tying the knot”

In reading Haskell-related stuff I sometimes come across the expression “tying the knot”, I think I understand what it does, but not how.
So, are there any good, basic, and simple to understand explanations of this concept?

Tying the knot is a solution to the problem of circular data structures. In imperative languages you construct a circular structure by first creating a non-circular structure, and then going back and fixing up the pointers to add the circularity.
Say you wanted a two-element circular list with the elements "0" and "1". It would seem impossible to construct because if you create the "1" node and then create the "0" node to point at it, you cannot then go back and fix up the "1" node to point back at the "0" node. So you have a chicken-and-egg situation where both nodes need to exist before either can be created.
Here is how you do it in Haskell. Consider the following value:
alternates = x where
x = 0 : y
y = 1 : x
In a non-lazy language this will be an infinite loop because of the unterminated recursion. But in Haskell lazy evaluation does the Right Thing: it generates a two-element circular list.
To see how it works in practice, think about what happens at run-time. The usual "thunk" implementation of lazy evaluation represents an unevaluated expression as a data structure containing a function pointer plus the arguments to be passed to the function. When this is evaluated the thunk is replaced by the actual value so that future references don't have to call the function again.
When you take the first element of the list 'x' is evaluated down to a value (0, &y), where the "&y" bit is a pointer to the value of 'y'. Since 'y' has not been evaluated this is currently a thunk. When you take the second element of the list the computer follows the link from x to this thunk and evaluates it. It evaluates to (1, &x), or in other words a pointer back to the original 'x' value. So you now have a circular list sitting in memory. The programmer doesn't need to fix up the back-pointers because the lazy evaluation mechanism does it for you.

It's not quite what you asked for, and it's not directly related to Haskell, but Bruce McAdam's paper That About Wraps It Up goes into this topic in substantial breadth and depth. Bruce's basic idea is to use an explicit knot-tying operator called WRAP instead of the implicit knot-tying that is done automatically in Haskell, OCaml, and some other languages. The paper has lots of entertaining examples, and if you are interested in knot-tying I think you will come away with a much better feel for the process.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string