Example of a data structure with lazy spine and strict leaves - Haskell

One of the performance tricks mentioned here is this:
As a safe default: lazy in the spine, strict in the leaves.
I'm having trouble imagining such a data structure.
If I take lists as an example and make them strict in the leaves, then won't the spine be automatically strict?
Is there an example of a data structure where the spine is lazy and the leaves are strict?

"Lazy in the spine, strict in the leaves" is a property of the API, not (just) a property of the data structure. Here's an example of how it might look for lists:
module StrictList (StrictList, runStrictList, nil, cons, uncons, repeat) where

import Prelude hiding (repeat)  -- this module defines its own repeat

-- The constructor itself is not exported, so every element that ends up
-- inside a StrictList has been forced by one of the functions below.
newtype StrictList a = StrictList { runStrictList :: [a] }

nil :: StrictList a
nil = StrictList []

cons :: a -> StrictList a -> StrictList a
cons x (StrictList xs) = x `seq` StrictList (x:xs)

uncons :: StrictList a -> Maybe (a, StrictList a)
uncons (StrictList []) = Nothing
uncons (StrictList (x:xs)) = Just (x, StrictList xs)

repeat :: a -> StrictList a
repeat x = x `seq` StrictList (let xs = x:xs in xs)
Note that compared to built-in lists, this API is quite impoverished -- that's just to keep the illustration small, not for a fundamental reason. The key point here is that you can still support things like repeat, where the spine is necessarily lazy (it's infinite!) but all the leaves are evaluated before anything else happens. Many of the other list operations that can produce infinite lists can be adapted to leaf-strict versions (though not all, as you observe).
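For instance (a usage sketch under the definitions above, importing the module qualified to avoid the clash with Prelude's repeat):
import qualified StrictList as SL

-- The spine of SL.repeat 5 is infinite, but its single leaf 5 was already
-- forced by the seq in repeat before we ever got the list back.
firstThree :: [Int]
firstThree = take 3 (SL.runStrictList (SL.repeat 5))   -- [5,5,5]

-- By contrast, forcing SL.repeat undefined throws right away, because the
-- leaf is evaluated up front even though the spine never terminates.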
You should also notice that it is not necessarily possible to take a leaf-lazy, spine-lazy structure and turn it into a leaf-strict, spine-lazy one in a natural way; e.g. one could not write a generic fromList :: [a] -> StrictList a such that:
fromList (repeat x) = repeat x and
runStrictList (fromList xs) = xs for all finite-length xs.
(Forgive my punning, I'm a repeat offender).

This bit of advice mixes up two related, but distinct, ideas. Haskell programmers are often sloppy about the distinction, but it matters here.
Strict vs. non-strict
This is a semantic distinction. A function f is strict if f _|_ = _|_, and non-strict otherwise.
Eager (call by value) vs. lazy (call by need)
This is a matter of implementation, and can have major performance implications. Lazy evaluation is one way to implement non-strict semantics.
What the claim really means
It actually means that the data structure should be both: strict in the leaves (a semantic guarantee, usually obtained by making insertion eager) and lazily evaluated in the spine (an operational choice). The right amount of laziness in the spine of a data structure can be very helpful. Sometimes it gives asymptotic improvements in performance. It can also improve cache utilization and cut garbage collection costs. On the other hand, too much laziness (even in the spine, in some cases!) can lead to a harmful accumulation of deferred computations. From an API standpoint, it can be very helpful to ensure that insertion operations are eager (and therefore strict), so that you know that everything stored in the structure has been forced.
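One way to see the same point directly in a data declaration (an illustrative sketch, not taken from the answer above): a strictness annotation on the element field makes the leaves strict, while the recursive tail field stays an ordinary lazy field, so the spine can still be built incrementally or even be infinite.
-- Leaves are strict: the !a field is forced whenever a Cons cell is built.
-- The spine stays lazy: the tail field is a normal field, so the list can
-- be consumed incrementally and may even be infinite.
data LeafStrictList a = Nil | Cons !a (LeafStrictList a)

onesLS :: LeafStrictList Int
onesLS = Cons 1 onesLS   -- infinite spine, but every leaf is already evaluated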

Related

Mutually Exclusive Events (Probability Theory)

Data.Set and Data.List do not have a function for intersections of more than two sets.
As I was interested in mutually exclusive events, I wrote the following function.
Is it correct? Is it efficient?
import qualified Data.Set as S
import Data.Maybe (isJust)

mutuallyExclusiveEvents :: (Foldable t, Ord a) => t (S.Set a) -> Bool
mutuallyExclusiveEvents xss =
  isJust $ foldr (\xs acc -> case acc of
                    Nothing -> Nothing
                    Just s  -> if any (`S.member` s) xs
                                 then Nothing
                                 else Just (S.union xs s))
                 (Just S.empty) xss
EXAMPLES
mutuallyExclusiveEvents [S.fromList [1,3,5], S.fromList [2,4,6], S.fromList [10,12,1]] == False
mutuallyExclusiveEvents [S.fromList [1,3,5], S.fromList [2,4,6], S.fromList [10,12]] == True
It's a really good start. I think we can identify two goals:
we don't want our algorithm to be quadratic in the number of sub-sets
we want our loop over the input set of sets to truly short-circuit as soon as we determine there is overlap
We would fail at (1) if we tried the naive solution of checking the intersection of each pair of sub-sets in turn. In your solution you've recognized or intuited that if s1 intersects with s2 then it also intersects with the union of s2 and s3, so you can accumulate a union and check for intersection in one pass and save work.
You also partially succeed at (2) in that you avoid doing meaningful work as soon as you find an intersection. The only deficiency is that you still have to traverse the entire list. We'd like mutuallyExclusiveEvents to truly short-circuit, that is, it should work on infinite lists. A good way to test this while you're developing is using undefined:
*Main S> mutuallyExclusiveEvents' ([S.fromList [1,3,5], S.fromList [10,12,1]] ++ undefined )
*** Exception: Prelude.undefined
CallStack (from HasCallStack):
error, called at libraries/base/GHC/Err.hs:79:14 in base:GHC.Err
undefined, called at <interactive>:95:93 in interactive:Ghci3
Due to laziness, foldr (and other functions implemented in terms of it, such as scanr) can truly short-circuit in the way that we want, e.g.:
*Main S> foldr (&&) True ([True, False] ++ undefined )
False
The trick is that if you want to be able to short-circuit when some condition holds, the function :: a -> b -> b you pass to foldr must be able to return a result without inspecting the second argument (of type b), i.e. it must be lazy in its second argument. Vice versa for foldl.
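To make that concrete, one could write a genuinely short-circuiting fold directly (an illustrative sketch, not the solution below; it assumes the same qualified import S for Data.Set and uses a made-up name). The step function only invokes the continuation for the rest of the list when the current set is disjoint from everything seen so far:
mutuallyExclusiveEventsSC :: Ord a => [S.Set a] -> Bool
mutuallyExclusiveEventsSC xss = foldr step (const True) xss S.empty
  where
    -- step is lazy in k, the "carry on with the rest of the list"
    -- continuation: on overlap it returns False without touching k,
    -- so an overlap near the front never demands the tail of the list.
    step xs k seen
      | any (`S.member` seen) xs = False
      | otherwise                = k (S.union xs seen)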
Here's the solution I came up with:
mutuallyExclusiveEvents :: (Ord a) => [S.Set a] -> Bool
mutuallyExclusiveEvents xss =
  all nonOverlapping $ zip xss $ scanl S.union S.empty xss
  where
    nonOverlapping (s1, s2) = S.null $ S.intersection s1 s2
One way of looking at the functions above: scanl and zip are both productive (inspecting the head of the resulting list only requires one step of evaluation), while all short-circuits in the way we've just been talking about.
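With this version the undefined test from earlier behaves the way we wanted: all hits the first overlapping pair and returns False before the rest of the spine is ever demanded (expected GHCi behaviour, assuming the definition above is loaded):
*Main S> mutuallyExclusiveEvents ([S.fromList [1,3,5], S.fromList [10,12,1]] ++ undefined)
False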
Note it's less general than yours, working only over lists. I thought about rewriting it without zip, using scanl1 instead, but was surprised to find that scanl1 is not polymorphic in Traversable (there may be a good reason).
EDIT: also, as you probably know, you can get very sophisticated with your approach to this (and related) problems if you want to, especially if false positives or approximations are okay, e.g. https://en.wikipedia.org/wiki/HyperLogLog

Would the ability to detect cyclic lists in Haskell break any properties of the language?

In Haskell, some lists are cyclic:
ones = 1 : ones
Others are not:
nums = [1..]
And then there are things like this:
more_ones = f 1 where f x = x : f x
This denotes the same value as ones, and certainly that value is a repeating sequence. But whether it's represented in memory as a cyclic data structure is doubtful. (An implementation could do so, but this answer explains that "it's unlikely that this will happen in practice".)
Suppose we take a Haskell implementation and hack into it a built-in function isCycle :: [a] -> Bool that examines the structure of the in-memory representation of the argument. It returns True if the list is physically cyclic and False if the argument is of finite length. Otherwise, it will fail to terminate. (I imagine "hacking it in" because it's impossible to write that function in Haskell.)
Would the existence of this function break any interesting properties of the language?
Would the existence of this function break any interesting properties of the language?
Yes it would. It would break referential transparency (see also the Wikipedia article). A Haskell expression can always be replaced by its value. In other words, it depends only on the passed arguments and nothing else. If we had
isCycle :: [a] -> Bool
as you propose, expressions using it would not satisfy this property any more. They could depend on the internal memory representation of values. In consequence, other laws would be violated. For example the identity law for Functor
fmap id === id
would not hold any more: you'd be able to distinguish between ones and fmap id ones, as the latter would be acyclic. And compiler optimizations such as applying the above law would no longer preserve program properties.
However, another question would be having a function
isCycleIO :: [a] -> IO Bool
as IO actions are allowed to examine and change anything.
A pure solution could be to have a data type that internally distinguishes the two:
import qualified Data.Foldable as F

data SmartList a = Cyclic [a] | Acyclic [a]

instance Functor SmartList where
    fmap f (Cyclic xs)  = Cyclic (map f xs)
    fmap f (Acyclic xs) = Acyclic (map f xs)

instance F.Foldable SmartList where
    foldr f z (Acyclic xs) = F.foldr f z xs
    foldr f _ (Cyclic xs)  = let r = F.foldr f r xs in r
Of course it wouldn't be able to recognize if a generic list is cyclic or not, but for many operations it'd be possible to preserve the knowledge of having Cyclic values.
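For example (illustrative values and helper, not part of the answer), a knot-tied Cyclic value behaves like an endless repetition of its block when folded:
ones' :: SmartList Int
ones' = Cyclic [1]

-- For the Cyclic case, F.foldr ties the knot, so folding (:) over it yields
-- the block repeated forever; take then extracts a finite prefix.
firstFew :: Int -> SmartList a -> [a]
firstFew n = take n . F.foldr (:) []

-- firstFew 5 ones'           == [1,1,1,1,1]
-- firstFew 3 (Acyclic [7,8]) == [7,8]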
In the general case, no, you can't identify a cyclic list. However, if the list is being generated by an unfold operation then you can. Data.List contains this:
unfoldr :: (b -> Maybe (a, b)) -> b -> [a]
The first argument is a function that takes a "state" argument of type "b" and may return an element of the list and a new state. The second argument is the initial state. "Nothing" means the list ends.
If the state ever recurs then the list will repeat from the point of the last state. So if we instead use a different unfold function that returns a list of (a, b) pairs we can inspect the state corresponding to each element. If the same state is seen twice then the list is cyclic. Of course this assumes that the state is an instance of Eq or something.
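Here is a sketch of that idea (the function name is made up and the stronger Ord constraint on the state is an assumption, chosen so the states seen so far can live in a Set): an unfold that remembers every state it has produced can report a cycle as soon as a state recurs.
import qualified Data.Set as Set

-- Like unfoldr, but also reports whether the generator's state ever
-- recurred; a recurring state means the produced list repeats from that
-- point on.
unfoldrDetect :: Ord b => (b -> Maybe (a, b)) -> b -> ([a], Bool)
unfoldrDetect step = go Set.empty
  where
    go seen s
      | s `Set.member` seen = ([], True)        -- state repeated: cyclic
      | otherwise = case step s of
          Nothing      -> ([], False)           -- generator stopped: finite
          Just (x, s') -> let (xs, cyclic) = go (Set.insert s seen) s'
                          in  (x : xs, cyclic)

-- unfoldrDetect (\n -> Just (n, (n + 1) `mod` 3)) 0  ==  ([0,1,2], True)
-- unfoldrDetect (\n -> if n < 3 then Just (n, n + 1) else Nothing) 0
--                                                    ==  ([0,1,2], False)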

Rewriting as a practical optimization technique in GHC: Is it really needed?

I was reading the paper authored by Simon Peyton Jones, et al. named “Playing by the Rules: Rewriting as a practical optimization technique in GHC”. In the second section, namely “The basic idea” they write:
Consider the familiar map function, that applies a function to each element of a list. Written in Haskell, map looks like this:
map f [] = []
map f (x:xs) = f x : map f xs
Now suppose that the compiler encounters the following call of map:
map f (map g xs)
We know that this expression is equivalent to
map (f . g) xs
(where “.” is function composition), and we know that the latter expression is more efficient than the former because there is no intermediate list. But the compiler has no such knowledge.
One possible rejoinder is that the compiler should be smarter --- but the programmer will always know things that the compiler cannot figure out. Another suggestion is this: allow the programmer to communicate such knowledge directly to the compiler. That is the direction we explore here.
My question is, why can't we make the compiler smarter? The authors say that “but the programmer will always know things that the compiler cannot figure out”. However, that's not a valid answer because the compiler can indeed figure out that map f (map g xs) is equivalent to map (f . g) xs, and here is how:
map f (map g xs)
map g xs unifies with map f [] = [].
Hence map g [] = [].
map f (map g []) = map f [].
map f [] unifies with map f [] = [].
Hence map f (map g []) = [].
map g xs unifies with map f (x:xs) = f x : map f xs.
Hence map g (x:xs) = g x : map g xs.
map f (map g (x:xs)) = map f (g x : map g xs).
map f (g x : map g xs) unifies with map f (x:xs) = f x : map f xs.
Hence map f (map g (x:xs)) = f (g x) : map f (map g xs).
Hence we now have the rules:
map f (map g []) = []
map f (map g (x:xs)) = f (g x) : map f (map g xs)
As you can see, f (g x) is just (f . g) x, and map f (map g xs) is being called recursively. This is exactly the definition of map (f . g) xs. The algorithm for this automatic conversion seems to be pretty simple. So why not implement this instead of rewrite rules?
Aggressive inlining can derive many of the equalities that rewrite rules are short-hand for.
The difference is that inlining is "blind", so you don't know in advance if the result will be better or worse, or even if it will terminate.
Rewrite rules, however, can do completely non-obvious things, based on much higher level facts about the program. Think of rewrite rules as adding new axioms to the optimizer. By adding these you have a richer rule set to apply, making complicated optimizations easier to apply.
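For reference, the map/map transformation discussed in the question is exactly the kind of axiom a RULES pragma states; a sketch of how such a rule looks in GHC:
{-# RULES
  "map/map"  forall f g xs.  map f (map g xs) = map (f . g) xs
  #-}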
Stream fusion, for example, changes the data type representation. This cannot be expressed through inlining, as it involves a representation type change (we reframe the optimization problem in terms of the Stream ADT). Easy to state in rewrite rules, impossible with inlining alone.
Something in that direction was investigated in a Bachelor’s thesis of Johannes Bader, a student of mine: Finding Equations in Functional Programs (PDF file).
To some degree it is certainly possible, but
it is quite tricky. Finding such equations is in a sense as hard as finding proofs in a theorem prover, and
it is not often very useful, because it tends to find equations that the programmer would rarely write directly.
It is however useful to clean up after other transformations such as inlining and various form of fusion.
This could be viewed as a trade-off between optimizing for the specific case and optimizing for the general case. That trade-off can lead to funny situations where you know how to make something faster, but it is better for the language in general if you don't.
In the specific case of maps in the structure you give, the compiler could find the optimization. However, what about related structures? What if the function isn't map? What if there's an additional layer of indirection, such as a function that returns map? In those cases, the compiler cannot optimize easily. This is the general-case problem.
Now, if you do optimize the special case, one of two outcomes occurs:
Nobody relies on it, because they aren't sure if it is there or not. In this case, articles like the one you quote get written.
People do start relying on it, and now every developer is forced to remember "maps done in this configuration get automatically converted to the fast version for me, but if I do it in this configuration I don't." This starts to manipulate the way people use the language, and can actually reduce readability!
Given the need for developers to think about such optimizations in the general case, we expect to see developers doing these optimizations in the simple case, decreasing the need for the optimization in the first place!
Now, if it turns out that the particular case you are interested in accounts for something massive like 2% of the world's Haskell codebase, there would be a much stronger argument for applying your special-case optimization.

Benefit of DiffList

Learn You a Haskell demonstrates the DiffList concept:
*Main Control.Monad.Writer> let f = \xs -> "dog" ++ ("meat" ++ xs)
*Main Control.Monad.Writer> f "foo"
"dogmeatfoo"
Is the primary benefit of the DiffList that the list gets constructed from left to right?
The DList package lists some of the asymptotics: https://hackage.haskell.org/package/dlist-0.5/docs/Data-DList.html
You'll note lots of things only take O(1), including cons, snoc, and append. However, note that inspecting the list needs to force lots of operations each time, so if you are doing more inspecting than construction, or interleaving the two, the DList approach won't necessarily be a win.
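The underlying trick fits in a few lines (an illustrative sketch with made-up names; the real package's module is Data.DList): a difference list is a function from lists to lists, so appending is just function composition and costs O(1), and converting back to an ordinary list means applying the function to [].
newtype DList a = DList ([a] -> [a])

fromList :: [a] -> DList a
fromList xs = DList (xs ++)

toList :: DList a -> [a]
toList (DList f) = f []

-- Appending is function composition: O(1) no matter how long the lists are.
append :: DList a -> DList a -> DList a
append (DList f) (DList g) = DList (f . g)

-- Adding a single element at the right end is also O(1), unlike xs ++ [x].
snoc :: DList a -> a -> DList a
snoc (DList f) x = DList (f . (x:))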

Counting number of elements in a list that satisfy the given predicate

Does the Haskell standard library have a function that, given a list and a predicate, returns the number of elements satisfying that predicate? Something with type (a -> Bool) -> [a] -> Int. My Hoogle search didn't return anything interesting. Currently I am using length . filter pred, which I don't find to be a particularly elegant solution. My use case seems common enough to have a better library solution than that. Is that the case, or is my intuition wrong?
The length . filter p implementation isn't nearly as bad as you suggest. In particular, it has only constant overhead in memory and speed, so yeah.
For things that use stream fusion, like the vector package, length . filter p will actually be optimized so as to avoid creating an intermediate vector. Lists, however, use what's called foldr/build fusion at the moment, which is not quite smart enough to optimize length . filter p without creating linearly large thunks that risk stack overflows.
For details on stream fusion, see this paper. As I understand it, the reason that stream fusion is not currently used in the main Haskell libraries is that (as described in the paper) about 5% of programs perform dramatically worse when implemented on top of stream-based libraries, while foldr/build optimizations can never (AFAIK) make performance actively worse.
No, there is no predefined function that does this, but I would say that length . filter pred is, in fact, an elegant implementation; it's as close as you can get to expressing what you mean without just invoking the concept directly, which you can't do if you're defining it.
The only alternatives would be a recursive function or a fold, which IMO would be less elegant, but if you really want to:
import Data.List (foldl')

foo :: (a -> Bool) -> [a] -> Int
foo p = foldl' (\n x -> if p x then n + 1 else n) 0
This is basically just inlining length into the definition. As for naming, I would suggest count (or perhaps countBy, since count is a reasonable variable name).
Haskell is a high-level language. Rather than provide one function for every possible combination of circumstances you might ever encounter, it provides you with a smallish set of functions that cover all of the basics, and you then glue these together as required to solve whatever problem is currently at hand.
In terms of simplicity and conciseness, this is as elegant as it gets. So yes, length . filter pred is absolutely the standard solution. As another example, consider elem, which (as you may know) tells you whether a given item is present in a list. The standard reference implementation for this is actually
elem :: Eq x => x -> [x] -> Bool
elem x = foldr (||) False . map (x ==)
In other words, compare every element in the list to the target element, creating a new list of Bools. Then fold the logical-OR function over this new list.
If this seems inefficient, try not to worry about it. In particular,
The compiler can often optimise away temporary data structures created by code like this. (Remember, this is the standard way to write code in Haskell, so the compiler is tuned to deal with it.)
Even if it can't be optimised away, laziness often makes such code fairly efficient anyway.
(In this specific example, the OR function will terminate the loop as soon as a match is seen - just like what would happen if you hand-coded it yourself.)
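You can check that short-circuiting yourself with the undefined trick from earlier (expected behaviour under the reference definition above):
elem 3 (3 : undefined)   -- True: (||) never demands the rest of the list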
As a general rule, write code by gluing together pre-existing functions. Change this only if performance isn't good enough.
This is my amateurish solution to a similar problem. Count the number of negative integers in a list l
nOfNeg l = length (filter (< 0) l)
main = print (nOfNeg [0,-1,-2,1,2,3,4])  -- 2
No, there isn't!
As of 2020, there is indeed no such function in the Haskell standard library yet! One could (and should), however, introduce an idiom howMany (resembling good old any):
howMany :: (a -> Bool) -> [a] -> Int
howMany p xs = sum [ 1 | x <- xs, p x ]
-- howMany = (length .) . filter
main = print $ howMany (/= 0) [0..9]
Try howMany = (length .) . filter
I'd do it manually:
howmany :: (a -> Bool) -> [a] -> Int
howmany _    []     = 0
howmany pred (x:xs) = if pred x then 1 + howmany pred xs
                                else howmany pred xs

Resources