Universal memoization in Haskell [duplicate]

Universal memoization in Haskell [duplicate] - haskell

I can't figure out why m1 is apparently memoized while m2 is not in the following:
m1 = ((filter odd [1..]) !!)
m2 n = ((filter odd [1..]) !! n)
m1 10000000 takes about 1.5 seconds on the first call, and a fraction of that on subsequent calls (presumably it caches the list), whereas m2 10000000 always takes the same amount of time (rebuilding the list with each call). Any idea what's going on? Are there any rules of thumb as to if and when GHC will memoize a function? Thanks.

GHC does not memoize functions.
It does, however, compute any given expression in the code at most once per time that its surrounding lambda-expression is entered, or at most once ever if it is at top level. Determining where the lambda-expressions are can be a little tricky when you use syntactic sugar like in your example, so let's convert these to equivalent desugared syntax:
m1' = (!!) (filter odd [1..]) -- NB: See below!
m2' = \n -> (!!) (filter odd [1..]) n
(Note: The Haskell 98 report actually describes a left operator section like (a %) as equivalent to \b -> (%) a b, but GHC desugars it to (%) a. These are technically different because they can be distinguished by seq. I think I might have submitted a GHC Trac ticket about this.)
Given this, you can see that in m1', the expression filter odd [1..] is not contained in any lambda-expression, so it will only be computed once per run of your program, while in m2', filter odd [1..] will be computed each time the lambda-expression is entered, i.e., on each call of m2'. That explains the difference in timing you are seeing.
Actually, some versions of GHC, with certain optimization options, will share more values than the above description indicates. This can be problematic in some situations. For example, consider the function
f = \x -> let y = [1..30000000] in foldl' (+) 0 (y ++ [x])
GHC might notice that y does not depend on x and rewrite the function to
f = let y = [1..30000000] in \x -> foldl' (+) 0 (y ++ [x])
In this case, the new version is much less efficient because it will have to read about 1 GB from memory where y is stored, while the original version would run in constant space and fit in the processor's cache. In fact, under GHC 6.12.1, the function f is almost twice as fast when compiled without optimizations than it is compiled with -O2.

m1 is computed only once because it is a Constant Applicative Form, while m2 is not a CAF, and so is computed for each evaluation.
See the GHC wiki on CAFs: http://www.haskell.org/haskellwiki/Constant_applicative_form

There is a crucial difference between the two forms: the monomorphism restriction applies to m1 but not m2, because m2 has explicitly given arguments. So m2's type is general but m1's is specific. The types they are assigned are:
m1 :: Int -> Integer
m2 :: (Integral a) => Int -> a
Most Haskell compilers and interpreters (all of them that I know of actually) do not memoize polymorphic structures, so m2's internal list is recreated every time it's called, where m1's is not.

I'm not sure, because I'm quite new to Haskell myself, but it appears that it's beacuse the second function is parametrized and the first one is not. The nature of the function is that, it's result depends on input value and in functional paradigm especailly it depends ONLY on the input. Obvious implication is that a function with no parameters returns always the same value over and over, no matter what.
Aparently there's an optimizing mechanizm in GHC compiler that exploits this fact to compute the value of such a function only once for whole program runtime. It does it lazily, to be sure, but does it nonetheless. I noticed it myself, when I wrote the following function:
primes = filter isPrime [2..]
where isPrime n = null [factor | factor <- [2..n-1], factor `divides` n]
where f `divides` n = (n `mod` f) == 0
Then to test it, I entered GHCI and wrote: primes !! 1000. It took a few seconds, but finally I got the answer: 7927. Then I called primes !! 1001 and got the answer instantly. Similarly in an instant I got the result for take 1000 primes, because Haskell had to compute the whole thousand-element list to return 1001st element before.
Thus if you can write your function such that it takes no parameters, you probably want it. ;)

Related

Can any partial function be converted to a total version in Haskell?

So far I have seen numerous "Maybe" versions of certain partial functions that would potentially result in ⊥, like readMaybe for read and listToMaybe for head; sometimes I wonder if we can generalise the idea and work out such a function safe :: (a -> b) -> (a -> Maybe b) to convert any partial function into their safer total alternative that returns Nothing on any instances where error stack would have been called in the original function. As till now I have not found a way to implement such safe function or existing implementations of a similar kind, and I come to doubt if this idea is truly viable.

There are two kinds of bottom actually, non-termination and error. You cannot catch non-termination, for obvious reasons, but you can catch errors. Here is a quickly thrown-together version (I am not an expert so there are probably better ways)
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Exception
import System.IO.Unsafe
safe f = unsafePerformIO $ do
z <- try (evaluate f)
let r = case z of
Left (e::SomeException) -> Nothing
Right k -> Just k
return r
Here are some examples
*Main > safe (head [42])
Just 42
*Main > safe (head [])
Nothing
*Main λ safe (1 `div` 0)
Nothing
*Main λ safe (1 `div` 2)
Just 0

No, it's not possible. It violates a property called "monotonicity", which says that a value cannot become more defined as you process it. You can't branch on bottoms - attempting to process one always results in bottom.
Or at least, that's all true of the domain theory Haskell evaluation is based on. But Haskell has a few extra features domain theory doesn't... Like executing IO actions being a different thing than evaluation, and unsafePerformIO letting you hide execution inside evaluation. The spoon library packages all of these ideas together as well as can be done. It's not perfect. It has holes, because this isn't something you're supposed to be able to do. But it does the job in a bunch of common cases.

Consider the function
collatz :: Integer -> ()
collatz 1 = ()
collatz n
| even n = collatz $ n`div`2
| otherwise = collatz $ 3*n + 1
(Let's pretend Integer is the type of positive whole numbers for simplicity)
Is this a total function? Nobody knows! For all we know, it could be total, so your proposed safe-guard can't ever yield Nothing. But neither has anybody found a proof that it is total, so if safe just always gives back Just (collatz n) then this may still be only partial.

GHC: Are there consistent rules for memoization for calls with fixed values?

In my quest to understand and harness GHC automatic memoization, I've hit a wall: when pure functions are called with fixed values like fib 42, they are sometimes fast and sometimes slow when called again. It varies if they're called plainly like fib 42 or implicitly through some math, e.g. (\x -> fib (x - 1)) 43. The cases have no seeming rhyme or reason, so I'll present them with the intention of asking what the logic is behind the behavior.
Consider a slow Fibonacci implementation, which makes it obvious when the memoization is working:
slow_fib :: Int -> Integer
slow_fib n = if n < 2 then 1 else (slow_fib (n - 1)) + (slow_fib (n - 2))
I tested three basic questions to see if GHC (version 8.2.2) will memoize calls with fixed args:
Can slow_fib access previous top-level calls to slow_fib?
Are previous results memoized for later non-trivial (e.g. math) top-level expressions?
Are previous results memoized for later identical top-level expressions?
The answers seem to be:
No
No
Yes [??]
The fact that the last case works is very confusing to me: if I can reprint the result for example, then I should expect to be able to add them. Here's the code that shows this:
main = do
-- 1. all three of these are slow, even though `slow_fib 37` is
-- just the sum of the other two results. Definitely no memoization.
putStrLn $ show $ slow_fib 35
putStrLn $ show $ slow_fib 36
putStrLn $ show $ slow_fib 37
-- 2. also slow, definitely no memoization as well.
putStrLn $ show $ (slow_fib 35) + (slow_fib 36) + (slow_fib 37)
putStrLn $ show $ (slow_fib 35) + 1
-- 3. all three of these are instant. Huh?
putStrLn $ show $ slow_fib 35
putStrLn $ show $ slow_fib 36
putStrLn $ show $ slow_fib 37
Yet stranger, doing math on the results worked when it's embedded in a recursive function: this fibonacci variant that starts at Fib(40):
let fib_plus_40 n = if n <= 0
then slow_fib 40
else (fib_plus_40 (n - 1)) + (fib_plus_40 (n - 2))
Shown by the following:
main = do
-- slow as expected
putStrLn $ show $ fib_plus_40 0
-- instant. Why?!
putStrLn $ show $ fib_plus_40 1
I can't find any reasoning for this in any explanations for GHC memoization, which typically incriminate explicit variables (e.g. here, here, and here). This is why I expected fib_plus_40 to fail to memoize.

To elaborate in case it wasn't clear from #amalloy's answer, the problem is that you're conflating two things here -- the implicit memoization-like-behavior (what people mean when they talk about Haskell's "automatic memoization", though it is not true memoization!) that results directly from thunk-based lazy evaluation, and a compiler optimization technique that's basically a form of common subexpression elimination. The former is predictable, more or less; the latter is at the whim of the compiler.
Recall that real memoization is a property of the implementation of a function: the function "remembers" results calculated for certain combinations of arguments, and may reuse those results instead of recalculating them from scratch when called multiple times with the same arguments. When GHC generates code for functions, it does not automatically generate code to perform this kind of memoization.
Instead, the GHC code generates to implement function application is unusual. Instead of actually applying the function to arguments to generate the final result as a value, a "result" is immediately constructed in the form of a thunk, which you can view as a suspended function call or a "promise" to deliver a value at a later time.
When, at some future point, the actual value is needed, the thunk is forced (which actually causes the original function call to take place), and the thunk is updated with the value. If that same value is needed again later, the value is already available, so the thunk doesn't need to be forced a second time. This is the "automatic memoization". Note that it takes place at the "result" level rather than the "function" level -- the result of a function application remembers its value; a function does not remember the results it previously produced.
Now, normally the concept of the result of a function application remembering its value would be ridiculous. In strict languages, we don't worry that after x = sqrt(10), reusing x will cause multiple sqrt calls because x hasn't "memoized" its own value. That is, in strict languages, all function application results are "automatically memoized" in the same sense they are in Haskell.
The difference is lazy evaluation, which allows us to write something like:
stuff = map expensiveComputation [1..10000]
which returns a thunk immediately without performing any expensive computations. Afterwards:
f n = stuff !! n
magically creates a memoized function, not because GHC generates code in the implementation of f to somehow memoize the call f 1000, but because f 1000 forces (a bunch of list constructor thunks and then) a single expensiveComputation whose return value is "memoized" as the value at index 1000 in the list stuff -- it was a thunk, but after being forced, it remembers its own value, just like any value in a strict language would.
So, given your definition of slow_fib, none of your examples are actually making use of Haskell's automatic memoization, in the usual sense people mean. Any speedups you're seeing are the result of various compiler optimizations that are (or aren't) recognizing common subexpressions or inlining / unwrapping short loops.
To write a memoized fib, you need to do it as explicitly as you would in a strict language, by creating a data structure to hold the memoized values, though lazy evaluation and mutually recursive definitions can sometimes make it seem like it's "automatic":
import qualified Data.Vector as V
import Data.Vector (Vector,(!))
fibv :: Vector Integer
fibv = V.generate 1000000 getfib
where getfib 0 = 1
getfib 1 = 1
getfib i = fibv ! (i-1) + fibv ! (i-2)
fib :: Int -> Integer
fib n = fibv ! n

All of the examples you link at the end exploit the same technique: instead of implementing function f directly, they first introduce a list whose contents are all the calls to f that could ever be made. That list is computed only once, lazily; and then a simple lookup in that list is used as the implementation of the user-facing function. So, they are not relying on any caching from GHC.
Your question is different: you hope that calling some function will be automatically cached for you, and in general that does not happen. The real question is why any of your results are fast. I'm not sure, but I think it is to do with Constant Applicative Forms (CAFs), which GHC may share between multiple use sites, at its discretion.
The most relevant feature of a CAF here is the "Constant" part: GHC will only introduce such a cache for an expression whose value is constant throughout the entire run of the program, not just for some particular scope. So, you can be sure that f x <> f x will never reuse the result of f x (at least not due to CAF folding; maybe GHC can find some other excuse to memoize this for some functions, but typically it does not).
The two things in your program that are not CAFs are the implementation of slow_fib, and the recursive case of fib_plus_40. GHC definitely cannot introduce any caching of the results of those expressions. The base case for fib_plus_40 is a CAF, as are all of the expressions and subexpressions in main. So, GHC can choose to cache/share any of those subexpressions, and not share any of them, as it pleases. Perhaps it sees that slow_fib 40 is "obviously" simple enough to save, but it's not so sure about whether the slow_fib 35 expressions in main should be shared. Meanwhile, it sounds like it does decide to share the IO action putStrLn $ show $ slow_fib 35 for whatever reason. Seems like a weird choice to you and me, but we're not compilers.
The moral here is that you cannot count on this at all: if you want to ensure you compute a value only once, you need to save it in a variable somewhere, and refer to that variable instead of recomputing it.
To confirm this, I took luqui's advice and looked at the -ddump-simpl output. Here are some snippets showing the explicit caching:
-- RHS size: {terms: 2, types: 0, coercions: 0}
lvl1_r4ER :: Integer
[GblId, Str=DmdType]
lvl1_r4ER = $wslow_fib_r4EP 40#
Rec {
-- RHS size: {terms: 21, types: 4, coercions: 0}
Main.main_fib_plus_40 [Occ=LoopBreaker] :: Integer -> Integer
[GblId, Arity=1, Str=DmdType <S,U>]
Main.main_fib_plus_40 =
\ (n_a1DF :: Integer) ->
case integer-gmp-1.0.0.1:GHC.Integer.Type.leInteger#
n_a1DF Main.main7
of wild_a2aQ { __DEFAULT ->
case GHC.Prim.tagToEnum# # Bool wild_a2aQ of _ [Occ=Dead] {
False ->
integer-gmp-1.0.0.1:GHC.Integer.Type.plusInteger
(Main.main_fib_plus_40
(integer-gmp-1.0.0.1:GHC.Integer.Type.minusInteger
n_a1DF Main.main4))
(Main.main_fib_plus_40
(integer-gmp-1.0.0.1:GHC.Integer.Type.minusInteger
n_a1DF lvl_r4EQ));
True -> lvl1_r4ER
}
}
end Rec }
This doesn't tell us why GHC is choosing to introduce this cache - remember, it's allowed to do what it wants. But it does confirm the mechanism, that it introduces a variable to hold the repeated calculation. I can't show you core for your longer main involving smaller numbers, because when I compile it I get more sharing: the expressions in section 2 are cached for me as well.

tail recursion recognition

I'm trying to learn Haskell and I stumbled upon the following:
myAdd (x:xs) = x + myAdd xs
myAdd null = 0
f = let n = 10000000 in myAdd [1 .. n]
main = do
putStrLn (show f)
When compiling with GHC, this yields a stack overflow. As a C/C++ programmer, I would have expected the compiler to do tail call optimization.
I don't like that I would have to "help" the compiler in simple cases like these, but what options are there? I think it is reasonable to require that the calculation given above be done without using O(n) memory, and without deferring to specialized functions.
If I cannot state my problem naturally (even on a toy problem such as this), and expect reasonable performance in terms of time & space, much of the appeal of Haskell would be lost.

Firstly, make sure you're compiling with -O2. It makes a lot of performance problems just go away :)
The first problem I can see is that null is just a variable name there. You want []. It's equivalent here because the only options are x:xs and [], but it won't always be.
The issue here is simple: when you call sum [1,2,3,4], it looks like this:
1 + (2 + (3 + (4 + 0)))
without ever reducing any of these additions to a number, because of Haskell's non-strict semantics. The solution is simple:
myAdd = myAdd' 0
where myAdd' !total [] = total
myAdd' !total (x:xs) = myAdd' (total + x) xs
(You'll need {-# LANGUAGE BangPatterns #-} at the top of your source file to compile this.)
This accumulates the addition in another parameter, and is actually tail recursive (yours isn't; + is in tail position rather than myAdd). But in fact, it's not quite tail recursion we care about in Haskell; that distinction is mainly relevant in strict languages. The secret here is the bang pattern on total: it forces it to be evaluated every time myAdd' is called, so no unevaluated additions build up, and it runs in constant space. In this case, GHC can actually figure this out with -O2 thanks to its strictness analysis, but I think it's usually best to be explicit about what you want strict and what you don't.
Note that if addition was lazy, your myAdd definition would work fine; the problem is that you're doing a lazy traversal of the list with a strict operation, which ends up causing the stack overflow. This mostly comes up with arithmetic, which is strict for the standard numeric types (Int, Integer, Float, Double, etc.).
This is quite ugly, and it would be a pain to write something like this every time we want to write a strict fold. Thankfully, Haskell has an abstraction ready for this!
myAdd = foldl' (+) 0
(You'll need to add import Data.List to compile this.)
foldl' (+) 0 [a, b, c, d] is just like (((0 + a) + b) + c) + d, except that at each application of (+) (which is how we refer to the binary operator + as a function value), the value is forced to be evaluated. The resulting code is cleaner, faster, and easier to read (once you know how the list folds work, you can understand any definition written in terms of them easier than a recursive definition).
Basically, the problem here is not that the compiler can't figure out how to make your program efficient — it's that making it as efficient as you like could change its semantics, which an optimisation should never do. Haskell's non-strict semantics certainly pose a learning curve to programmers in more "traditional" languages like C, but it gets easier over time, and once you see the power and abstraction that Haskell's non-strictness offers, you'll never want to go back :)

Expanding the example ehird hinted at in the comments:
data Peano = Z | S Peano
deriving (Eq, Show)
instance Ord Peano where
compare (S a) (S b) = compare a b
compare Z Z = EQ
compare Z _ = LT
compare _ _ = GT
instance Num Peano where
Z + n = n
(S a) + n = S (a + n)
-- omit others
fromInteger 0 = Z
fromInteger n
| n < 0 = error "Peano: fromInteger requires non-negative argument"
| otherwise = S (fromInteger (n-1))
instance Enum Peano where
succ = S
pred (S a) = a
pred _ = error "Peano: no predecessor"
toEnum n
| n < 0 = error "toEnum: invalid argument"
| otherwise = fromInteger (toInteger n)
fromEnum Z = 0
fromEnum (S a) = 1 + fromEnum a
enumFrom = iterate S
enumFromTo a b = takeWhile (<= b) $ enumFrom a
-- omit others
infinity :: Peano
infinity = S infinity
result :: Bool
result = 3 < myAdd [1 .. infinity]
result is True by the definition of myAdd, but if the compiler transformed into a tail-recursive loop, it wouldn't terminate. So that transformation is not only a change in efficiency, but also in semantics, hence a compiler must not do it.

A little funny example regarding "The issue is why the compiler is unable to optimize something that appears to be rather trivial to optimize."
Let's say I'm coming from Haskell to C++. I used to write foldr because in Haskell foldr is usually more effective than foldl because of laziness and list fusion.
So I'm trying to write a foldr for a (single-linked) list in C and complaining why it's grossly inefficient:
int foldr(int (*f)(int, node*), int base, node * list)
{
return list == NULL
? base
: f(a, foldr(f, base, list->next));
}
It is inefficient not because the C compiler in question is an unrealistic toy tool developed by ivory tower theorists for their own satisfaction, but because the code in question is grossly non-idiomatic for C.
It is not the case that you cannot write an efficient foldr in C: you just need a doubly-linked list. In Haskell, similarly, you can write an efficient foldl, you need strictness annotations for foldl to be efficient. The standard library provides both foldl (without annotations) and foldl' (with annotations).
The idea of left folding a list in Haskell is the same kind of perversion as a desire to iterate a singly-linked list backwards using recursion in C. Compiler is there to help normal people, not perverts lol.
As your C++ projects probably don't have code iterating singly-linked lists backwards, my HNC project contains only 1 foldl I incorrectly wrote before I mastered Haskell enough. You hardly ever need to foldl in Haskell.
You must unlearn that forward iteration is natural and fastest, and learn that backward iteration is. The forward iteration (left folding) does not do what you intend, until you annotate: it does three passes - list creation, thunk chain buildup and thunk evaluation, instead of two (list creation and list traversal). Note that in immutable world lists can be only efficiently created backwards: a : b is O(1), and a ++ [b] is O(N).
And the backward iteration doesn't do what you intend either. It does one pass instead of three as you might expect from your C background. It doesn't create a list, traverse it to the bottom and then traverse it backwards (2 passes) - it traverses the list as it creates it, that is, 1 pass. With optimizations on, it is just a loop - no actual list elements are created. With optimizations off, it is still O(1) space operation with bigger constant overhead, but explanation is a bit longer.

So there are two things I will address about your problem, firstly the performance problem, and then secondly the expressive problem, that of having to help the compiler with something that seems trivial.
The performance
The thing is that your program is in fact not tail recursive, that is, there is no single call to a function that can replace the recursion. Lets have a look at what happens when we expand myAdd [1..3]:
myAdd [1,2,3]
1 + myAdd [2,3]
1 + 2 + myAdd [3]
As you can see, at any given step, we cannot replace the recursion with a function call, we could simplify the expression by reducing 1 + 2 to 3, but that is not what tail recursion is about.
So here is a version that is tail recursive:
myAdd2 = go 0
where go a [] = a
go a (x:xs) = go (a + x) xs
Lets have a look at how go 0 [1,2,3] is evaluated:
go 0 [1,2,3]
go (1+0) [2,3]
go (2 + 1 + 0) [3]
As you see, at every step, we only need to keep track of
one function call, and as long the first parameter is
evaluated strictly we should not get an exponential space
blow up, and in fact, if you compile with optimization (-O1 or -O2)
ghc is smart enough to figure that out on its own.
Expressiveness
Alright so it is a bit harder to reason about performance in haskell, but most of the time you don't have to. The thing is that you can use combinators that ensure efficiency. This particular pattern above is captured by foldl (and its strict cousin foldl') so myAdd can be written as:
myAdd = foldl (+) 0
and if you compile that with optimization it will not give you an exponential space blowup!

Extent of GHC's optimization

I am not very familiar with the degree that Haskell/GHC can optimize code. Below I have a pretty "brute-force" (in the declarative sense) implementation of the n queens problem. I know it can be written more efficiently, but thats not my question. Its that this got me thinking about the GHC optimizations capabilities and limits.
I have expressed it in what I consider a pretty straightforward declarative sense. Filter permutations of [1..n] that fulfill the predicate For all indices i,j s.t j<i, abs(vi - vj) != j-i I would hope this is the kind of thing that can be optimized, but it also kind of feels like asking a lot of compiler.
validQueens x = and [abs (x!!i - x!!j) /= j-i | i<-[0..length x - 2], j<-[i+1..length x - 1]]
queens n = filter validQueens (permutations [1..n])
oneThru x = [1..x]
pointlessQueens = filter validQueens . permutations . oneThru
main = do
n <- getLine
print $ pointlessQueens $ (read :: String -> Int) n
This runs fairly slow and grows quickly. n=10 takes about a second and n=12 takes forever. Without optimization I can tell the growth is factorial (# of permutations) multiplied by quadratic (# of differences in the predicate to check). Is there any way this code can perform better thru intelligent compilation? I tried the basic ghc options such has -O2 and didn't notice a significant difference, but I don't know the finer points (just added the flagS)
My impression is that the function i call queens can not be optimized and must generate all permutations before filter. Does the point-free version have a better chance? On the one hand I feel like a smart function comprehension between a filter and a predicate might be able to knock off some obviously undesired elements before they are even fully generated, but on the other hand it kind of feels like a lot to ask.
Sorry if this seems rambling, i guess my question is
Is the pointfree version of above function more capable of being optimized?
What steps could I take at make/compile/link time to encourage optimization?
Can you briefly describe some possible (and contrast with the impossible!) means of optimization for the above code? At what point in the process do these occur?
Is there any particular part of ghc --make queensN -O2 -v output I should be paying attention to? Nothing stands out to me. Don't even see much difference in output due to optimization flags
I am not overly concerned with this code example, but I thought writing it got me thinking and it seems to me like a decent vehicle for discussing optimization.
PS - permutations is from Data.List and looks like this:
permutations :: [a] -> [[a]]
permutations xs0 = xs0 : perms xs0 []
where
perms [] _ = []
perms (t:ts) is = foldr interleave (perms ts (t:is)) (permutations is)
where interleave xs r = let (_,zs) = interleave' id xs r in zs
interleave' _ [] r = (ts, r)
interleave' f (y:ys) r = let (us,zs) = interleave' (f . (y:)) ys r
in (y:us, f (t:y:us) : zs)

At a more general level regarding "what kind of optimizations can GHC do", it may help to break the idea of an "optimization" apart a little bit. There are conceptual distinctions that can be drawn between aspects of a program that can be optimized. For instance, consider:
The intrinsic logical structure of the algorithm: You can safely assume in almost every case that this will never be optimized. Outside of experimental research, you're not likely to find a compiler that will replace a bubble sort with a merge sort, or even an insertion sort, and extremely unlikely to find one that would replace a bogosort with something sensible.
Non-essential logical structure of the algorithm: For instance, in the expression g (f x) (f x), how many times will f x be computed? What about an expression like g (f x 2) (f x 5)? These aren't intrinsic to the algorithm, and different variations can be interchanged without impacting anything other than performance. The difficulties in performing optimization here are essentially recognizing when a substitution can in fact be done without changing the meaning, and predicting which version will have the best results. A lot of manual optimizations fall into this category, along with a great deal of GHC's cleverness.
This is also the part that trips a lot of people up, because they see how clever GHC is and expect it to do even more. And because of the reasonable expectation that GHC should never make things worse, it's not uncommon to have potential optimizations that seem obvious (and are, to the programmer) that GHC can't apply because it's nontrivial to distinguish cases where the same transformation would significantly degrade performance. This is, for instance, why memoization and common subexpression elimination aren't always automatic.
This is also the part where GHC has a huge advantage, because laziness and purity make a lot of things much easier, and is I suspect what leads to people making tongue-in-cheek remarks like "Optimizing compilers are a myth (except perhaps in Haskell).", but also to unrealistic optimism about what even GHC can do.
Low-level details: Things like memory layout and other aspects of the final code. These tend to be somewhat arcane and highly dependent on implementation details of the runtime, the OS, and the processor. Optimizations of this sort are essentially why we have compilers, and usually not something you need to worry about unless you're writing code that is very computationally demanding (or are writing a compiler yourself).
As far as your specific example here goes: GHC isn't going to significantly alter the intrinsic time complexity of your algorithm. It might be able to remove some constant factors. What it can't do is apply constant-factor improvements that it can't be sure are correct, particularly ones that technically change the meaning of the program in ways that you don't care about. Case in point here is #sclv's answer, which explains how your use of print creates unnecessary overhead; there's nothing GHC could do about that, and in fact the current form would possibly inhibit other optimizations.

There's a conceptual problem here. Permutations is generating streaming permutations, and filter is streaming too. What's forcing everything prematurely is the "show" implicit in "print". Change your last line to:
mapM print $ pointlessQueens $ (read :: String -> Int) n
and you'll see that results are generated in a streaming fashion much more rapidly. That fixes, for large result sets, a potential space leak, and other than that just lets things be printed as computed rather than all at once at the end.
However, you shouldn't expect any order of magnitude improvements from ghc optimizations (there are a few, obvious ones that you do get, mostly having to do with strictness and folds, but its irritating to rely on them). What you'll get are constant factors, generally.
Edit: As luqui points out below, show is also streaming (or at least show of [Int] is), but the line buffering nonetheless makes it harder to see the genuine speed of computation...

It should be noted, although you do express that it is not part of your question, that the big problem with your code is that you do not do any pruning.
In the case of your question, it feels foolish to talk about possible/impossible optimization, compiler flags, and how to best formulate it etc. when an improvement of the algorithm is staring us so blatantly in the face.
One of the first things that will be tried is the permutations starting with the first queen in position 1 and the second queen in position 2 ([1,2...]). This is of course not a solution and we will have to move one of the queens. However, in your implementation, all permutations involving this combination of the two first queens will be tested! The search should stop there and instantly move to the permutations involving [1,3,...].
Here is a version that does this sort of pruning:
import Data.List
import Control.Monad
main = getLine >>= mapM print . queens . read
queens :: Int -> [[Int]]
queens n = queens' [] n
queens' xs n
| length xs == n = return xs
| otherwise = do
x <- [1..n] \\ xs
guard (validQueens (x:xs))
queens' (x:xs) n
validQueens x =
and [abs (x!!i - x!!j) /= j-i | i<-[0..length x - 2], j<-[i+1..length x - 1]]

I understand that your question was about compiler optimization but as the discussion has shown pruning is necessary.
The first paper that I know of about how to do this for the n queens problem in a lazy functional language is Turner's paper "Recursion Equations as a Programming Language" You can read it in Google Books here.
In terms of your comment about a pattern worth remembering, this problem introduces a very powerful pattern. A great paper on this idea is Philip Wadler's paper, "How to Replace Failure by a List of Successes", which can be read in Google Books here
Here is a pure, non-monadic, implementation based on Turner's Miranda implementation. In the case of n = 12 (queens 12 12) it returns the first solution in .01 secs and will compute all 14,200 solutions in under 6 seconds. Of course printing those takes much longer.
queens :: Int -> Int -> [[Int]]
queens n boardsize =
queensi n
where
-- given a safe arrangement of queens in the first n - 1 rows,
-- "queensi n" returns a list of all the safe arrangements of queens
-- in the first n rows
queensi :: Int -> [[Int]]
queensi 0 = [[]]
queensi n = [ x : y | y <- queensi (n-1) , x <- [1..boardsize], safe x y 1]
-- "safe x y n" tests whether a queen at column x would be safe from previous
-- queens in y where the first element of y is n rows away from x, the second
-- element is (n+1) rows away from x, etc.
safe :: Int -> [Int] -> Int -> Bool
safe _ [] _ = True
safe x (c:y) n = and [ x /= c , x /= c + n , x /= c - n , safe x y (n+1)]
-- we only need to check for queens in the same column, and the same diagonals;
-- queens in the same row are not possible by the fact that we only pick one
-- queen per row

When is memoization automatic in GHC Haskell?

I can't figure out why m1 is apparently memoized while m2 is not in the following:
m1 = ((filter odd [1..]) !!)
m2 n = ((filter odd [1..]) !! n)
m1 10000000 takes about 1.5 seconds on the first call, and a fraction of that on subsequent calls (presumably it caches the list), whereas m2 10000000 always takes the same amount of time (rebuilding the list with each call). Any idea what's going on? Are there any rules of thumb as to if and when GHC will memoize a function? Thanks.

GHC does not memoize functions.
It does, however, compute any given expression in the code at most once per time that its surrounding lambda-expression is entered, or at most once ever if it is at top level. Determining where the lambda-expressions are can be a little tricky when you use syntactic sugar like in your example, so let's convert these to equivalent desugared syntax:
m1' = (!!) (filter odd [1..]) -- NB: See below!
m2' = \n -> (!!) (filter odd [1..]) n
(Note: The Haskell 98 report actually describes a left operator section like (a %) as equivalent to \b -> (%) a b, but GHC desugars it to (%) a. These are technically different because they can be distinguished by seq. I think I might have submitted a GHC Trac ticket about this.)
Given this, you can see that in m1', the expression filter odd [1..] is not contained in any lambda-expression, so it will only be computed once per run of your program, while in m2', filter odd [1..] will be computed each time the lambda-expression is entered, i.e., on each call of m2'. That explains the difference in timing you are seeing.
Actually, some versions of GHC, with certain optimization options, will share more values than the above description indicates. This can be problematic in some situations. For example, consider the function
f = \x -> let y = [1..30000000] in foldl' (+) 0 (y ++ [x])
GHC might notice that y does not depend on x and rewrite the function to
f = let y = [1..30000000] in \x -> foldl' (+) 0 (y ++ [x])
In this case, the new version is much less efficient because it will have to read about 1 GB from memory where y is stored, while the original version would run in constant space and fit in the processor's cache. In fact, under GHC 6.12.1, the function f is almost twice as fast when compiled without optimizations than it is compiled with -O2.

m1 is computed only once because it is a Constant Applicative Form, while m2 is not a CAF, and so is computed for each evaluation.
See the GHC wiki on CAFs: http://www.haskell.org/haskellwiki/Constant_applicative_form

There is a crucial difference between the two forms: the monomorphism restriction applies to m1 but not m2, because m2 has explicitly given arguments. So m2's type is general but m1's is specific. The types they are assigned are:
m1 :: Int -> Integer
m2 :: (Integral a) => Int -> a
Most Haskell compilers and interpreters (all of them that I know of actually) do not memoize polymorphic structures, so m2's internal list is recreated every time it's called, where m1's is not.

I'm not sure, because I'm quite new to Haskell myself, but it appears that it's beacuse the second function is parametrized and the first one is not. The nature of the function is that, it's result depends on input value and in functional paradigm especailly it depends ONLY on the input. Obvious implication is that a function with no parameters returns always the same value over and over, no matter what.
Aparently there's an optimizing mechanizm in GHC compiler that exploits this fact to compute the value of such a function only once for whole program runtime. It does it lazily, to be sure, but does it nonetheless. I noticed it myself, when I wrote the following function:
primes = filter isPrime [2..]
where isPrime n = null [factor | factor <- [2..n-1], factor `divides` n]
where f `divides` n = (n `mod` f) == 0
Then to test it, I entered GHCI and wrote: primes !! 1000. It took a few seconds, but finally I got the answer: 7927. Then I called primes !! 1001 and got the answer instantly. Similarly in an instant I got the result for take 1000 primes, because Haskell had to compute the whole thousand-element list to return 1001st element before.
Thus if you can write your function such that it takes no parameters, you probably want it. ;)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string