Why are bacon.js observables classified as lazy sequences?

My understanding of lazy sequences is that they don't load data into memory until it's accessed by the program. So I can see how this would make sense if there were a large list of numbers waiting to be consumed, but the sequence only pulled in the data from the producer when the iterator called the next method.
But observables append the item to themselves whenever the producer pushes it to them. So it's not that the sequence loads the data when the consumer asks for it; it loads it whenever the producer sends it. So in what way are observables lazy?

There are 2 kinds of laziness in Bacon.js Observables:
Observables don't register to their underlying data source (for example, an AJAX fetch) until there is at least one Observer. This laziness practically gives you automatic resource management in the sense that connections to data sources are automatically opened and closed based on demand. Also if there are multiple Observers, only a single connection to the data source is used and the results are shared.
Observables don't evaluate the functions passed to map, combine etc. until the value is actually used. So if you do an expensive calculation in the function you give to map and only sample the stream once per second, only the sampled values will actually be evaluated.
What you don't get is backpressure management. So if your data source, say observable a, produces an infinite sequence of values, you can either
process them immediately using a.onValue
take the first 1000 of them using a.take(1000).onValue
take values while some condition holds using a.takeWhile((x) -> x <= 1000).onValue
But you cannot affect the rate that the source produces the values, because Bacon.js provides no way to tell the source that "I'm interested in more values later, I'll tell you when". I've been thinking about adding something like this, but it would complicate matters a lot.
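For contrast, here is the kind of consumer-driven (pull-based) demand that an ordinary lazy list gives you in Haskell. This is a sketch of the concept, not Bacon.js code: the producer does no work until the consumer asks for the next element.

-- The consumer controls the rate: `take` demands 1000 results, so only
-- as many numbers as are needed to produce them are ever generated.
firstThousandEvens :: [Integer]
firstThousandEvens = take 1000 (filter even [1..])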
As the bottom line, I'd say that Bacon.js is not your ideal library for working with infinite lists, for instance, in the Haskell style. And AFAIK, neither is any other Javascript FRP lib.

A lazy sequence is a sequence that evaluates its elements as late as possible. What's useful about them is that you can pass the sequence around and even perform operations on it without evaluating its contents.
Since you don't evaluate the sequence, you can even create an infinite sequence; just make sure that you don't evaluate the whole of it. For example, the following Haskell program creates an infinite sequence of natural numbers, then lazily multiplies each element by 2, producing an infinite sequence of even numbers, then takes the first 5 elements, evaluating them (and only them):
take 5 (map (*2) [1..])
-- [2,4,6,8,10]
Basically, with lazy sequences and a set of functions that work over them, like map, you can write programs that create and process potentially infinite streams of data in a composable way.
Which, incidentally, is exactly what (functional) reactive programming is about. An observable can be seen as a potentially infinite stream of pairs (value, timestamp). An operation on an observable, such as map, simply creates another stream based on the original one. Attaching a consumer to an observable evaluates the elements of it that "already happened", while leaving the rest un-evaluated.
For example, Bacon's observable.map can be implemented as a function that lazily applies a function to the 'value' part of the stream (again in Haskell):
map f = Prelude.map (\(value, tstamp) -> (f value, tstamp))
while observable.delay adds delay to the 'timestamp' part:
delay dt = Prelude.map (\(value, tstamp) -> (value, tstamp + dt))
where Prelude.map is just a regular "map" function over lazy sequences.
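Putting those two together over an (infinite) timestamped stream, only the consumed prefix is ever evaluated. A small sketch; streamMap and streamDelay are just the functions above renamed to avoid clashing with the Prelude:

type Event a = (a, Double)          -- (value, timestamp), as above

streamMap :: (a -> b) -> [Event a] -> [Event b]
streamMap f = map (\(value, tstamp) -> (f value, tstamp))

streamDelay :: Double -> [Event a] -> [Event a]
streamDelay dt = map (\(value, tstamp) -> (value, tstamp + dt))

-- An infinite "observable": one value per second.
ticks :: [Event Int]
ticks = [(n, fromIntegral n) | n <- [0..]]

-- Only the five consumed events are ever evaluated.
demo :: [Event Int]
demo = take 5 (streamDelay 0.5 (streamMap (*2) ticks))
-- [(0,0.5),(2,1.5),(4,2.5),(6,3.5),(8,4.5)]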
Hopefully I haven't confused you any further!

Related

Data.Map: how do I tell if I "need value-strict maps"?

When choosing between Data.Map.Lazy and Data.Map.Strict, the docs tell us for the former:
API of this module is strict in the keys, but lazy in the values. If you need value-strict maps, use Data.Map.Strict instead.
and for the latter likewise:
API of this module is strict in both the keys and the values. If you need value-lazy maps, use Data.Map.Lazy instead.
How do more seasoned Haskellers than me tend to intuit this "need"? Use-case in point, in a run-and-done (ie. not daemon-like/long-running) command-line tool: readFileing a simple lines-based custom config file where many (not all) lines define key:value pairs to be collected into a Map. Once done, we rewrite many values in it depending on other values in it that were read later (thanks to immutability, in this process we create a new Map and discard the initial incarnation).
(Although in practice this file likely won't often or ever reach even a 1000 lines, let's just assume for the sake of learning that for some users it will before long.)
Any given run of the tool will perhaps lookup some 20-100% of the (rewritten on load, although with lazy-eval I'm never quite sure "when really") key:value pairs, anywhere between once and dozens of times.
How do I reason about the differences between "value-strict" and "value-lazy" Data.Maps here? What happens "under the hood", in terms of mainstream computing if you will?
Fundamentally, such hash-maps are of course about "storing once, looking up many times" --- but then, what in computing isn't, "fundamentally". And furthermore the whole concept of lazy-eval's thunks seems to boil down to this very principle, so why not always stay value-lazy?
How do I reason about the differences between "value-strict" and "value-lazy" Data.Maps here?
Value-lazy is the norm in Haskell. This means that not just plain values but thunks (i.e. recipes for how to compute the value) are stored. For example, let's say you extract the value from a line like this:
tail (dropWhile (/= ':') line)
Then a value-strict map would actually extract the value upon insert, while a lazy one would happily just remember how to get it. That thunk is then also what you get back on a lookup.
Here are some pros and cons:
lazy values may need more memory, not only for the thunk itself but also for the data it references (here, line).
strict values may need more memory. In our case this could happen when the string gets interpreted to yield some memory-hungry structure like lists, JSON or XML.
using lazy values may need less CPU if your code doesn't need every value.
too-deeply nested thunks may cause stack overflows when the value is finally needed.
there is also a semantic difference: in lazy mode, you may get away with extraction code that would fail (like the code above, which fails if there isn't a ':' on the line), as long as you only check whether the key is present. In strict mode, your program crashes upon insert (a small sketch of this difference follows right after this list).
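Concretely, reusing the extraction function from above (member only checks for the key and never forces the value):

import qualified Data.Map.Lazy   as ML
import qualified Data.Map.Strict as MS

-- The extraction from above: fails if there is no ':' on the line.
getValue :: String -> String
getValue line = tail (dropWhile (/= ':') line)

lazyDemo :: Bool
lazyDemo = ML.member "broken" m     -- True: the bad thunk is never forced
  where m = ML.insert "broken" (getValue "no colon here") ML.empty

strictDemo :: Bool
strictDemo = MS.member "broken" m   -- throws: the value is forced on insert
  where m = MS.insert "broken" (getValue "no colon here") MS.empty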
As always, there are no fixed measures like: "If your evaluated value needs less than 20 bytes and takes less than 30µs to compute, use strict, else use lazy."
Normally, you just go with one and when you notice extreme runtimes/memory usage you try the other.
Here's a small experiment that shows a difference between Data.Map.Lazy and Data.Map.Strict. This code exhausts the heap:
import Data.Foldable
import qualified Data.Map.Lazy as M
main :: IO ()
main = print $ foldl' (\kv i -> M.adjust (+i) 'a' kv)
               (M.fromList [('a',0)])
               (cycle [0])
(Better to compile with a small maximum heap, like ghc Main.hs -with-rtsopts="-M20m".)
The foldl' keeps the map in WHNF as we iterate over the infinite list of zeros. However, thunks accumulate in the modified value until the heap is exhausted.
The same code with Data.Map.Strict simply loops forever. In the strict variant, the values are in WHNF whenever the map is in WHNF.
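For comparison, the strict variant differs only in the import; adjust then evaluates the new value before storing it, so thunks never pile up and the program just loops, as described:

import Data.Foldable
import qualified Data.Map.Strict as M

-- Same fold as above, but each adjusted value is forced to WHNF,
-- so no chain of (+0) thunks accumulates in the map.
main :: IO ()
main = print $ foldl' (\kv i -> M.adjust (+i) 'a' kv)
               (M.fromList [('a',0)])
               (cycle [0])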

How does lazy-evaluation allow for greater modularization?

In his article "Why Functional Programming Matters," John Hughes argues that "Lazy evaluation is perhaps the most powerful tool for modularization in the functional programmer's repertoire." To do so, he provides an example like this:
Suppose you have two functions, "infiniteLoop" and "terminationCondition." You can do the following:
terminationCondition(infiniteLoop input)
Lazy evaluation, in Hughes' words "allows termination conditions to be separated from loop bodies." This is definitely true, since "terminationCondition" using lazy evaluation here means this condition can be defined outside the loop -- infiniteLoop will stop executing when terminationCondition stops asking for data.
But couldn't higher-order functions achieve the same thing as follows?
infiniteLoop(input, terminationCondition)
How does lazy evaluation provide modularization here that's not provided by higher-order functions?
Yes, you could use a passed-in termination check, but for that to work the author of infiniteLoop would have had to foresee the possibility of wanting to terminate the loop with that sort of condition, and hardwire a call to the termination condition into their function.
And even if the specific condition can be passed in as a function, the "shape" of it is predetermined by the author of infiniteLoop. What if they give me a termination condition "slot" that is called on each element, but I need access to the last several elements to check some sort of convergence condition? Maybe for a simple sequence generator you could come up with "the most general possible" termination condition type, but it's not obvious how to do so and remain efficient and easy to use. Do I repeatedly pass the entire sequence so far into the termination condition, in case that's what it's checking? Do I force my callers to wrap their simple termination conditions up in a more complicated package so they fit the most general condition type?
The callers certainly have to know exactly how the termination condition is called in order to supply a correct condition. That could be quite a bit of dependence on this specific implementation. If they switch to a different implementation of infiniteLoop written by another third party, how likely is it that exactly the same design for the termination condition would be used? With a lazy infiniteLoop, I can drop in any implementation that is supposed to produce the same sequence.
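To make that concrete, here is roughly the Newton-Raphson square-root example from the same Hughes paper, spelled with modern names (the details are a sketch): the producer is an ordinary infinite list of approximations that knows nothing about stopping, and the termination condition is a separate consumer that works with any producer of the same sequence.

-- The "infinite loop": an endless stream of better and better guesses.
approximations :: Double -> [Double]
approximations x = iterate improve 1.0
  where improve g = (g + x / g) / 2

-- The "termination condition": stop when successive guesses are close.
within :: Double -> [Double] -> Double
within eps (a:b:rest)
  | abs (a - b) <= eps = b
  | otherwise          = within eps (b:rest)
within _ xs = last xs              -- not reached for the infinite input

squareRoot :: Double -> Double
squareRoot x = within 1e-12 (approximations x)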
And what if infiniteLoop isn't a simple sequence generator, but actually generates a more complex infinite data structure, like a tree? If all the branches of the tree are independently recursively generated (think of a move tree for a game like chess) it could make sense to cut different branches at different depths, based on all sorts of conditions on the information generated thus far.
If the original author didn't prepare (either specifically for my use case or for a sufficiently general class of use cases), I'm out of luck. The author of the lazy infiniteLoop can just write it the natural way, and let each individual caller lazily explore what they want; neither has to know much about the other at all.
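As a sketch of that tree case (the names are hypothetical, nothing chess-specific): the producer builds the whole game tree the natural, lazy way, and each caller decides afterwards how much of it to force.

data Tree a = Node a [Tree a]

-- The producer: the full (conceptually infinite) tree of positions,
-- written with no idea of how anyone will cut it off.
gameTree :: (pos -> [pos]) -> pos -> Tree pos
gameTree moves p = Node p (map (gameTree moves) (moves p))

-- One caller prunes uniformly at a fixed depth...
prune :: Int -> Tree a -> Tree a
prune 0 (Node x _)  = Node x []
prune n (Node x ts) = Node x (map (prune (n - 1)) ts)

-- ...another cuts branches based on a per-position test, using
-- information only the caller has. The producer never changes.
pruneWhere :: (a -> Bool) -> Tree a -> Tree a
pruneWhere keep (Node x ts)
  | keep x    = Node x (map (pruneWhere keep) ts)
  | otherwise = Node x []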
Furthermore, what if the decision to stop lazily exploring the infinite output is actually interleaved with (and dependent on) the computation the caller is doing with that output? Think of the chess move tree again; how far I want to explore one branch of the tree could easily depend on my evaluation of the best option I've found in other branches of the tree. So either I do my traversal and calculation twice (once in the termination condition to return a flag telling infinteLoop to stop, and then once again with the finite output so I can actually have my result), or the author of infiniteLoop had to prepare for not just a termination condition, but a complicated function that also gets to return output (so that I can push my entire computation inside the "termination condition").
Taken to extremes, I could explore the output and calculate some results, display them to a user and get input, and then continue exploring the data structure (without recalling infiniteLoop based on the user's input). The original author of the lazy infiniteLoop need have no idea that I would ever think of doing such a thing, and it will still work. If we've got purity enforced by the type system, then that would be impossible with the passed-in termination condition approach unless the whole infiniteLoop was allowed to have side effects if the termination condition needs to (say by giving the whole thing a monadic interface).
In short, getting the same flexibility you'd have with lazy evaluation out of a strict infiniteLoop that takes higher-order functions to control it can mean a large amount of extra complexity for both the author of infiniteLoop and its caller (unless a variety of simpler wrappers are exposed, and one of them matches the caller's use case). Lazy evaluation can allow producers and consumers to be almost completely decoupled, while still giving the consumer the ability to control how much output the producer generates. Everything you can do that way you can do with extra function arguments as you say, but it requires the producer and consumer to essentially agree on a protocol for how the control functions work; and that protocol is almost always either specialised to the use case at hand (tying the consumer and producer together) or so complicated in order to be fully general that the producer and consumer end up tied to that particular protocol, which is unlikely to be recreated elsewhere, so they're still tied together.

Accumulator factory in Haskell

Now, at the start of my adventure with programming I have some problems understanding basic concepts. Here is one related to Haskell or perhaps generally functional paradigm.
Here is a general statement of accumulator factory problem, from
http://rosettacode.org/wiki/Accumulator_factory
[Write a function that]
Takes a number n and returns a function (let's call it g), that takes a number i, and returns n incremented by the accumulation of i from every call of function g(i).
Works for any numeric type-- i.e. can take both ints and floats and returns functions that can take both ints and floats. (It is not enough simply to convert all input to floats. An accumulator that has only seen integers must return integers.) (i.e., if the language doesn't allow for numeric polymorphism, you have to use overloading or something like that)
Generates functions that return the sum of every number ever passed to them, not just the most recent. (This requires a piece of state to hold the accumulated value, which in turn means that pure functional languages can't be used for this task.)
Returns a real function, meaning something that you can use wherever you could use a function you had defined in the ordinary way in the text of your program. (Follow your language's conventions here.)
Doesn't store the accumulated value or the returned functions in a way that could cause them to be inadvertently modified by other code. (No global variables or other such things.)
with, as I understand, a key point being:
"[...] creating a function that [...]
Generates functions that return the sum of every number ever passed to them, not just the most recent. (This requires a piece of state to hold the accumulated value, which in turn means that pure functional languages can't be used for this task.)"
We can find a Haskell solution on the same website and it seems to do just what the quote above says.
Here
http://rosettacode.org/wiki/Category:Haskell
it is said that Haskell is purely functional.
What is then the explanation of the apparent contradiction? Or maybe there is no contradiction and I simply lack some understanding? Thanks.
The Haskell solution does not actually quite follow the rules of the challenge. In particular, it violates the rule that the function "Returns a real function, meaning something that you can use wherever you could use a function you had defined in the ordinary way in the text of your program." Instead of returning a real function, it returns an ST computation that produces a function that itself produces more ST computations. Within the context of an ST "state thread", you can create and use mutable references (STRef), arrays, and vectors. However, it's impossible for this mutable state to "leak" outside the state thread to contaminate pure code.
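For reference, the shape of such a solution looks roughly like this (a sketch in the same spirit, not the exact Rosetta Code listing):

import Control.Monad.ST
import Data.STRef

-- The "factory": note the result is not a plain function a -> a but an
-- ST action that yields a function returning further ST actions.
accumulator :: Num a => a -> ST s (a -> ST s a)
accumulator n = do
  total <- newSTRef n
  pure $ \i -> do
    modifySTRef' total (+ i)
    readSTRef total

demo :: Integer
demo = runST $ do
  acc <- accumulator 1
  _   <- acc 5
  acc 10            -- 16; the STRef never escapes the state thread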

Repa performance versus lists

In the Numeric Haskell Repa Tutorial Wiki, there is a passage that reads (for context):
10.1 Fusion, and why you need it
Repa depends critically on array fusion to achieve fast code. Fusion is a fancy name for the combination of inlining and code transformations performed by GHC when it compiles your program. The fusion process merges the array filling loops defined in the Repa library with the "worker" functions that you write in your own module. If the fusion process fails, then the resulting program will be much slower than it needs to be, often 10x slower than an equivalent program using plain Haskell lists. On the other hand, provided fusion works, the resulting code will run as fast as an equivalent cleanly written C program. Making fusion work is not hard once you understand what's going on.
The part that I don't understand is this:
"If the fusion process fails, then the
resulting program will be much slower than it needs to be, often 10x
slower an equivalent program using plain Haskell lists."
I understand why it would run slower if stream fusion fails, but why does it run that much slower than lists?
Thanks!
Typically, because lists are lazy, and Repa arrays are strict.
If you fail to fuse a lazy list traversal, e.g.
map f . map g
you pay O(1) cost per value for leaving the intermediate (lazy) cons cell there.
If you fail to fuse the same traversal over a strict sequence, you pay for allocating and filling an entire strict intermediate array before the next stage can consume it.
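To make the contrast concrete, here is a rough sketch with Data.Vector.Unboxed standing in for a strict array type (the specific types are illustrative; the point is what happens when the two maps do not fuse):

import qualified Data.Vector.Unboxed as U

-- Unfused list pipeline: each intermediate cons cell is consumed by the
-- next stage and garbage-collected almost immediately, so only O(1) of
-- the intermediate list is ever live.
listPipeline :: [Int] -> Int
listPipeline = sum . map (+ 1) . map (* 2)

-- Unfused strict pipeline: the entire intermediate array must be
-- allocated and filled before the next stage can start.
arrayPipeline :: U.Vector Int -> Int
arrayPipeline = U.sum . U.map (+ 1) . U.map (* 2)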
Also, since fusion rewrites your code into an unrecognizable Stream data type to improve analysis, a failed fusion can leave you with code that has too many constructors and other overheads.
Edit: This is not correct--see Don Nelson's comment (and his answer--he knows a lot more about the library than I do).
Immutable arrays cannot share components; disregarding fusion, any modification to an immutable array must reallocate the entire array. By contrast, while list operations are non-destructive, they can share parts: f i (h:t) = i:t, for example, replaces the head of a list in constant time by creating a new list in which the first cell points to the second cell of the original list. Moreover, because lists can be built incrementally, such functions as generators that build a list by repeated calls to a function can still run in O(n) time, while the equivalent function on an immutable array without fusion would need to reallocate the array with every call to the function, taking O(n^2) time.
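To illustrate the sharing argument, a sketch using the plain immutable arrays from Data.Array (the function names are illustrative):

import Data.Array

-- Replacing the head of a list shares the whole tail: O(1) allocation.
replaceHead :: a -> [a] -> [a]
replaceHead x (_:t) = x : t
replaceHead x []    = [x]

-- "Modifying" one slot of an immutable array copies every element: O(n).
replaceAt :: Int -> a -> Array Int a -> Array Int a
replaceAt i x arr = arr // [(i, x)]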

Ordering of parameters to make use of currying

I have twice recently refactored code in order to change the order of parameters because there was too much code where hacks like flip or \x -> foo bar x 42 were happening.
When designing a function signature what principles will help me to make the best use of currying?
For languages that support currying and partial-application easily, there is one compelling series of arguments, originally from Chris Okasaki:
Put the data structure as the last argument
Why? You can then compose operations on the data nicely. E.g. insert 1 $ insert 2 $ insert 3 $ s. This also helps for functions on state.
Standard libraries such as "containers" follow this convention.
Alternate arguments are sometimes given to put the data structure first, so it can be closed over, yielding functions on a static structure (e.g. lookup) that are a bit more concise. However, the broad consensus seems to be that this is less of a win, especially since it pushes you towards heavily parenthesized code.
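A small sketch of why the collection-last order composes well (the names here are made up): each partially applied update is a plain Map -> Map function, so updates chain with (.) or ($).

import qualified Data.Map as M

-- Collection last (the "containers" order): partially applied updates
-- are Map -> Map functions and compose directly.
withDefaults :: M.Map String Int -> M.Map String Int
withDefaults = M.insert "width" 80 . M.insert "height" 24

-- With a hypothetical collection-first insert' :: M.Map k v -> k -> v -> M.Map k v,
-- the same thing needs flip or a lambda:
--   \m -> insert' (insert' m "height" 24) "width" 80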
Put the most varying argument last
For recursive functions, it is common to put the argument that varies the most (e.g. an accumulator) last, and the argument that varies the least (e.g. a function argument) first. This composes well with the data-structure-last style.
A summary of the Okasaki view is given in his Edison library (again, another data structure library):
Partial application: arguments more likely to be static usually appear before other arguments in order to facilitate partial application.
Collection appears last: in all cases where an operation queries a single collection or modifies an existing collection, the collection argument will appear last. This is something of a de facto standard for Haskell datastructure libraries and lends a degree of consistency to the API.
Most usual order: where an operation represents a well-known mathematical function on more than one datastructure, the arguments are chosen to match the most usual argument order for the function.
Place the arguments that you are most likely to reuse first. Function arguments are a great example of this. You are much more likely to want to map f over two different lists than you are to want to map many different functions over the same list.
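A tiny illustration (nothing beyond the Prelude): because map takes the function first, the partially applied form is the reusable piece.

-- map (*2) is a reusable [Int] -> [Int] transformer...
doubleAll :: [Int] -> [Int]
doubleAll = map (* 2)

-- ...that you can apply to as many different lists as you like:
-- doubleAll [1,2,3] == [2,4,6]
-- doubleAll [10,20] == [20,40]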
I tend to do what you did, pick some order that seems good and then refactor if it turns out that another order is better. The order depends a lot on how you are going to use the function (naturally).
