Programmatic proofs for monad laws

In Haskell, users do not have to prove that their monads satisfy the monad laws:
return a >>= k = k a
m >>= return = m
m >>= (\x -> k x >>= h) = (m >>= k) >>= h
If I understand correctly, even if they wanted to, there is no way for the compiler to read such a proof.
Questions
What technology is missing in Haskell for there to be a way to write such a proof so that the compiler can check it?
Which functional languages support such functionality (i.e. can check whether a claimed monad satisfies the monad laws)?

In practice, laws usually aren't proven in Haskell, but they may very well be tested. If you throw lots of random inputs at the expressions on both sides of the equation for your monad, and the result always comes out the same on both sides, that may not guarantee anything, but it does make it quite likely that any law-violating behaviour would be caught. That is, provided you generate the inputs in a sufficiently representative way. QuickCheck is usually pretty good at this.
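To make the testing idea concrete, here is a minimal sketch that needs nothing beyond base. Instead of QuickCheck's random generation it simply enumerates a small fixed set of inputs, which is a much weaker stand-in. The sample functions halve and positive are made up for illustration, and the monad under test is Maybe.

```haskell
-- Spot checks of the three monad laws for Maybe over a fixed input set.
-- This enumerates inputs by hand; QuickCheck would generate them randomly.

-- Hypothetical sample Kleisli arrows, chosen only for illustration.
halve :: Int -> Maybe Int
halve n
  | even n    = Just (n `div` 2)
  | otherwise = Nothing

positive :: Int -> Maybe Int
positive n
  | n > 0     = Just n
  | otherwise = Nothing

sampleValues :: [Int]
sampleValues = [-4 .. 8]

sampleActions :: [Maybe Int]
sampleActions = Nothing : map Just sampleValues

-- return a >>= k  =  k a
leftIdentity :: Bool
leftIdentity = and [ (return a >>= halve) == halve a | a <- sampleValues ]

-- m >>= return  =  m
rightIdentity :: Bool
rightIdentity = and [ (m >>= return) == m | m <- sampleActions ]

-- m >>= (\x -> k x >>= h)  =  (m >>= k) >>= h
associativity :: Bool
associativity =
  and [ (m >>= (\x -> halve x >>= positive)) == ((m >>= halve) >>= positive)
      | m <- sampleActions ]

main :: IO ()
main = print (leftIdentity, rightIdentity, associativity)
```

A passing run only says that no counterexample turned up among these inputs; it is evidence, not a proof.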
If you do want to prove laws then, well, Haskell isn't really the right tool. You'd want the proof to be checked at compile time, but Haskell makes it rather difficult to express complicated values at the type level. If you do it at runtime instead, then first of all, it's no good if the deployed executable crashes because of a mistake. But more importantly, since Haskell isn't total you could “prove” any proposition by just giving undefined as the result, or some other ⊥ value; more typically this might be some subtle infinite loop.
The right tool is a dependently typed language. The most popular are Coq and Lean, which resemble ML more than Haskell, and Agda. These are primarily intended to be proof assistants rather than general-purpose programming languages that also allow you to formulate theorems; Idris goes more in that direction.
All that said, modern Haskell is now also somewhat capable of dependently typed programming. The key tools are to express your functions as type families, to use singletons to get value-level stand-ins for the type-level values, and then to use either GADTs or constrained CPS to pass around the proofs.
It's still really awkward to use this to specify laws for a type class, but it can be used quite nicely to Curry-Howard-express concrete theorems. The singletons-base package contains a lot of standard functions in type-lifted form, thus suitable for proving stuff about. For example, here's how you could formulate that the list concatenation operator is associative:
{-# LANGUAGE TypeFamilies, DataKinds, KindSignatures, PolyKinds, TypeOperators #-}
{-# LANGUAGE RankNTypes, UnicodeSyntax #-}
import Data.List.Singletons
listConcatAssoc :: ∀ k l m ρ . Sing k -> Sing l -> Sing m
    -> (((k++l)++m ~ k++(l++m)) => ρ) -> ρ
listConcatAssoc SNil SNil SNil φ = φ
...
The complete proof will be quite annoying to write, but TBH proofs are annoying to write even in Coq, though that is specifically its job. Coq does make it a lot nicer to really express typeclasses with laws etc., though.
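For a feel of what a finished proof in this style looks like, here is a smaller self-contained analogue that uses only base. It proves associativity of addition on unary naturals rather than of list concatenation, and it passes the evidence as a propositional equality (:~:) instead of the constrained-CPS form above. Nat, SNat, Plus and plusAssoc are names made up for this sketch, not taken from singletons-base.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeFamilies, TypeOperators #-}
import Data.Kind (Type)
import Data.Proxy (Proxy (..))
import Data.Type.Equality ((:~:) (Refl))

-- Unary naturals, promoted to the type level by DataKinds.
data Nat = Z | S Nat

-- Singleton: a run-time witness for a type-level Nat.
data SNat :: Nat -> Type where
  SZ :: SNat 'Z
  SS :: SNat n -> SNat ('S n)

-- Type-level addition, recursing on the first argument.
type family Plus (a :: Nat) (b :: Nat) :: Nat where
  Plus 'Z     b = b
  Plus ('S a) b = 'S (Plus a b)

-- Associativity of Plus, by induction on the first summand.
-- The proxies only fix the type variables b and c.
plusAssoc :: SNat a -> proxy b -> proxy c
          -> Plus (Plus a b) c :~: Plus a (Plus b c)
plusAssoc SZ     _ _ = Refl
plusAssoc (SS n) b c = case plusAssoc n b c of
  Refl -> Refl

main :: IO ()
main =
  case plusAssoc (SS (SS SZ)) (Proxy :: Proxy ('S 'Z)) (Proxy :: Proxy 'Z) of
    Refl -> putStrLn "proof evaluated to Refl"
```

Matching on Refl forces the proof term, which matters for the reason given earlier: a proof that is really ⊥ typechecks just as well, so it only counts once it has actually been evaluated.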

Related

Why does Haskell contain so many equivalent functions

It seems like there are a lot of functions that do the same thing, particularly relating to Monads, Functors, and Applicatives.
Examples (from most to least generic):
fmap == liftA == liftM
(<*>) == ap
liftA[2345] == liftM[2345]
pure == return
(*>) == (>>)
An example not directly based on the FAM class tree:
fmap == map
(I thought there were quite a few more with List, Foldable, Traversable, but it looks like most were made more generic some time ago, as I only see the old, less generic type signatures in old Stack Overflow / message board questions.)
I personally find this annoying, as it means that if I need to do x, and some function such as liftM allows me to do x, then I will have made my function less generic than it could have been. I am only going to notice that kind of thing by thoroughly reasoning about the differences between types (such as FAM, or perhaps List, Foldable, Traversable combinations as well), which is not beginner friendly at all: simply using those types isn't all that hard, but reasoning about their properties and laws requires a lot more mental effort.
I am guessing a lot of these equivalencies come from the Applicative Monad Proposal. If that is the reason for them (and not some other reason I am missing for having less generic functions available for confusion), are they going to be deprecated / deleted ever? I can understand waiting a long time to delete them, due to breaking existing code, but surely deprecation is a good idea?

The short answers are "history" and "regularity".
Originally "map" was defined for lists. Then type-classes were introduced, with the Functor type class, so the generalised version of "map" for any functor had to be called something different, otherwise existing code would be broken. Hence "fmap".
Then monads came along. Instances of monads did not need to be functors, so "liftM" was created, along with "liftM2", "liftM3" etc. Of course if a type is an instance of both Monad and Functor then fmap = liftM.
Monads also have "ap", used in expressions like f `ap` arg1 `ap` arg2. This was very handy, but then Applicative Functors were added. (<*>) did the same job for applicative functors as 'ap', but because many applicative functors are not monads it had to be called something different. Likewise liftAx versus liftMx and "pure" versus "return".
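A quick way to see these pairs agree on types that are instances of all three classes is to compare them directly. This is only a sketch on a few hand-picked values (Maybe and lists), not a proof of equivalence.

```haskell
import Control.Applicative (liftA, liftA2)
import Control.Monad (ap, liftM, liftM2)

xs :: [Int]
xs = [1, 2, 3]

-- Each entry pairs a claimed equivalence with a check on sample values.
equivalences :: [(String, Bool)]
equivalences =
  [ ( "fmap == liftA == liftM"
    , fmap (+ 1) xs == liftA (+ 1) xs && liftA (+ 1) xs == liftM (+ 1) xs )
  , ( "(<*>) == ap"
    , (Just (+ 1) <*> Just 2) == (Just (+ 1) `ap` Just (2 :: Int)) )
  , ( "liftA2 == liftM2"
    , liftA2 (+) (Just 1) (Just 2) == liftM2 (+) (Just 1) (Just (2 :: Int)) )
  , ( "pure == return"
    , (pure 5 :: Maybe Int) == return 5 )
  , ( "(*>) == (>>)"
    , (Just 'a' *> Just 'b') == (Just 'a' >> Just 'b') )
  , ( "fmap == map (lists only)"
    , fmap (+ 1) xs == map (+ 1) xs )
  ]

main :: IO ()
main = mapM_ print equivalences
```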

They aren't equivalent, though. Equivalent things in Haskell can be interchanged with no difference at all in functionality. Consider, for example, pure and return.
EDIT: I wrote some examples down, but they were really bad since they involved Maybe a, a type that is both an applicative and a monad, so the functions could be used pretty interchangeably.
There are types that are applicatives but not monads though (see this question for examples), and by studying the type of the following expression, we can see that this could lead to some roadbumps:
pure 1 >>= pure :: (Monad m, Num b) => m b

"I personally find this annoying, as it means that if I need to do x, and some function such as liftM allows me to do x, then I will have made my function less generic than it could have been"
This logic is backwards.
Normally you know in advance the type of the thing you want to write, be it IO String or (Foldable f, Monoid t, Monad m) => f (m t) -> m t or whatever. Let's take the first case, getLineCapitalized :: IO String. You could write it as
getLineCapitalized = liftM (map toUpper) getLine
or
getLineCapitalized = fmap (fmap toUpper) getLine
Is the former "less generic" because it uses the specialized functions liftM and map? Of course not. This is intrinsically an IO action that produces a list. It cannot become "more generic" by changing it to the second version since those fmaps will have their types fixed to IO and [] anyways. So, there is no advantage to the second version.
By writing the first version, you provide contextual information to the reader for free. In liftM (map foo) bar, the reader knows that bar is going to be an action in some monad that returns a list. In fmap (fmap foo) bar, it could be any sort of doubly-nested structure whatsoever. If bar is something complicated rather than just getLine, then this kind of information is helpful for understanding more easily what is going on in bar.
In general, you should write a function in two steps.
1. Decide what the type of the function should be. Make it as general or as specific as you want. The more general the type of the function, the stronger guarantees you get on its behavior from parametricity.
2. Once you have decided on the type of your function, implement it using the most specific available functions. By doing so, you are providing the most information to the reader of your function. You never lose any generality or parametricity guarantees by doing so, since those only depend on the type, which you already determined in step 1.
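As a small illustration of step 2, here are the two spellings from above side by side, under one signature fixed in advance. The names capitalizeViaLiftM and capitalizeViaFmap are made up for this sketch, and the functions take the input action as an argument (instead of hard-coding getLine) so they can be run without stdin.

```haskell
import Control.Monad (liftM)
import Data.Char (toUpper)

-- Step 1: the type is decided first and is the same for both versions.
capitalizeViaLiftM, capitalizeViaFmap :: IO String -> IO String

-- Step 2a: specific functions. The reader sees "monadic action" and "list".
capitalizeViaLiftM = liftM (map toUpper)

-- Step 2b: generic functions. Same type, same behaviour, fewer hints.
capitalizeViaFmap = fmap (fmap toUpper)

main :: IO ()
main = do
  a <- capitalizeViaLiftM (pure "hello")
  b <- capitalizeViaFmap (pure "hello")
  print (a, b)
```

Both definitions are checked against the same signature, so neither is more or less general than the other.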
Edit in response to comments: I was reminded of the biggest reason to use the most specific function available, which is catching bugs. The type length :: [a] -> Int is essentially the entire reason that I still use GHC 7.8. It's never happened that I wanted to take the length of an unknown Foldable structure. On the other hand, I definitely do not want to ever accidentally take the length of a pair, or take the length of foo bar baz which I think has type [a], but actually has type Maybe [a].
In the use cases for Foldable that are not already covered by the rest of the Haskell standard, lens is a vastly more powerful alternative. If I want the "length" of a Maybe t, lengthOf _Just :: Maybe t -> Int expresses my intent clearly, and the compiler can check that the program actually matches my intent; and I can go on to write lengthOf _Nothing, lengthOf _Left, etc. Explicit is better than implicit.

There are some "redundant" functions like liftM, ap, and liftA that have a very real use and taking them out would cause loss of functionality --- you can use liftM, ap, and liftA to implement your Functor or Applicative instances if all you've written is a Monad instance. It lets you be lazy and do, say:
instance Monad Foo where
  return = ...
  (>>=) = ...
Now you've done all of the rewarding work of defining a Monad instance, but this won't compile. Why? Because you also need a Functor and Applicative instance.
So, because you're quickly prototyping, or lazy, or can't think of a better way, you can just get a free Functor and Applicative instance:
instance Functor Foo where
  fmap = liftM
instance Applicative Foo where
  pure = return
  (<*>) = ap
In fact, you can just copy-and-paste that chunk of code everywhere you need to quickly define a Functor or Applicative instance when you already have a Monad instance defined.
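Here is a complete, compilable version of that recipe, as a sketch. The Counter type and tick are invented for the example. One deliberate difference from the chunk above: pure is defined directly in the Applicative instance and return is left at its default, because recent GHC versions warn about the non-canonical pure = return.

```haskell
import Control.Monad (ap, liftM)

-- Hypothetical example type: a computation that threads an Int counter.
newtype Counter a = Counter (Int -> (a, Int))

runCounter :: Counter a -> Int -> (a, Int)
runCounter (Counter f) = f

-- Functor and Applicative come "for free" from the Monad instance.
instance Functor Counter where
  fmap = liftM

instance Applicative Counter where
  pure x = Counter (\n -> (x, n))
  (<*>)  = ap

-- The only instance that takes real work.
instance Monad Counter where
  Counter f >>= k =
    Counter (\n -> let (a, n') = f n in runCounter (k a) n')

-- Return the current counter value and increment it.
tick :: Counter Int
tick = Counter (\n -> (n, n + 1))

main :: IO ()
main = print (runCounter ((,) <$> tick <*> fmap (* 10) tick) 0)
-- prints ((0,10),2)
```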
The same goes for fmapDefault from Data.Traversable. If you've implemented Traversable, you can also implement Foldable and Functor:
instance Functor Bar where
  fmap = fmapDefault
No extra work required!
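A runnable sketch of the same trick, using a made-up Pair type in place of Bar. Note that traverse has to be defined directly here: fmapDefault and foldMapDefault are themselves built on traverse, so falling back to the class default for traverse (which uses fmap) would loop.

```haskell
import Data.Traversable (fmapDefault, foldMapDefault)

-- Hypothetical container with exactly two elements.
data Pair a = Pair a a
  deriving (Show, Eq)

-- Both superclass instances are derived from traverse.
instance Functor Pair where
  fmap = fmapDefault

instance Foldable Pair where
  foldMap = foldMapDefault

-- The only hand-written logic.
instance Traversable Pair where
  traverse f (Pair x y) = Pair <$> f x <*> f y

main :: IO ()
main = do
  print (fmap (+ 1) (Pair 1 2))  -- prints Pair 2 3
  print (sum (Pair 1 2))         -- prints 3
```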
There are some redundant functions, however, that really have no use other than being historical accidents from a time when Functor was not a superclass of Monad. These have literally zero point in existing today, and include things like liftM2, liftM3, etc., and (>>) and friends.

Why not be dependently typed?

I have seen several sources echo the opinion that "Haskell is gradually becoming a dependently-typed language". The implication seems to be that with more and more language extensions, Haskell is drifting in that general direction, but isn't there yet.
There are basically two things I would like to know. The first is, quite simply, what does "being a dependently-typed language" actually mean? (Hopefully without being too technical about it.)
The second question is... what's the drawback? I mean, people know we're heading that way, so there must be some advantage to it. And yet, we're not there yet, so there must be some downside stopping people going all the way. I get the impression that the problem is a steep increase in complexity. But, not really understanding what dependent typing is, I don't know for sure.
What I do know is that every time I start reading about a dependently-typed programming language, the text is utterly incomprehensible... Presumably that's the problem. (?)

Dependently Typed Haskell, Now?
Haskell is, to a small extent, a dependently typed language. There is a notion of type-level data, now more sensibly typed thanks to DataKinds, and there is some means (GADTs) to give a run-time
representation to type-level data. Hence, values of run-time stuff effectively show up in types, which is what it means for a language to be dependently typed.
Simple datatypes are promoted to the kind level, so that the values
they contain can be used in types. Hence the archetypal example
data Nat = Z | S Nat
data Vec :: Nat -> * -> * where
  VNil :: Vec Z x
  VCons :: x -> Vec n x -> Vec (S n) x
becomes possible, and with it, definitions such as
vApply :: Vec n (s -> t) -> Vec n s -> Vec n t
vApply VNil VNil = VNil
vApply (VCons f fs) (VCons s ss) = VCons (f s) (vApply fs ss)
which is nice. Note that the length n is a purely static thing in
that function, ensuring that the input and output vectors have the
same length, even though that length plays no role in the execution of
vApply. By contrast, it's much trickier (i.e., impossible) to
implement the function which makes n copies of a given x (which
would be pure to vApply's <*>)
vReplicate :: x -> Vec n x
because it's vital to know how many copies to make at run-time. Enter
singletons.
data Natty :: Nat -> * where
  Zy :: Natty Z
  Sy :: Natty n -> Natty (S n)
For any promotable type, we can build the singleton family, indexed
over the promoted type, inhabited by run-time duplicates of its
values. Natty n is the type of run-time copies of the type-level n
:: Nat. We can now write
vReplicate :: Natty n -> x -> Vec n x
vReplicate Zy x = VNil
vReplicate (Sy n) x = VCons x (vReplicate n x)
So there you have a type-level value yoked to a run-time value:
inspecting the run-time copy refines static knowledge of the
type-level value. Even though terms and types are separated, we can
work in a dependently typed way by using the singleton construction as
a kind of epoxy resin, creating bonds between the phases. That's a
long way from allowing arbitrary run-time expressions in types, but it ain't nothing.
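Collected into one module, the definitions so far compile and run as follows. This is a sketch with two additions that are not in the text above: Type from Data.Kind is used in place of *, and a helper vToList is added purely so the result can be printed.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}
import Data.Kind (Type)

data Nat = Z | S Nat

-- Length-indexed vectors.
data Vec :: Nat -> Type -> Type where
  VNil  :: Vec 'Z x
  VCons :: x -> Vec n x -> Vec ('S n) x

-- Singleton family: run-time copies of type-level Nats.
data Natty :: Nat -> Type where
  Zy :: Natty 'Z
  Sy :: Natty n -> Natty ('S n)

-- The shared index n rules out mismatched lengths statically.
vApply :: Vec n (s -> t) -> Vec n s -> Vec n t
vApply VNil         VNil         = VNil
vApply (VCons f fs) (VCons s ss) = VCons (f s) (vApply fs ss)

-- Inspecting the singleton tells us how many copies to make.
vReplicate :: Natty n -> x -> Vec n x
vReplicate Zy     _ = VNil
vReplicate (Sy n) x = VCons x (vReplicate n x)

-- Hypothetical helper, only for displaying results.
vToList :: Vec n x -> [x]
vToList VNil         = []
vToList (VCons x xs) = x : vToList xs

main :: IO ()
main = do
  let three = Sy (Sy (Sy Zy))
      fs    = VCons (+ 1) (VCons (* 2) (VCons negate VNil))
  print (vToList (vApply fs (vReplicate three (10 :: Int))))
  -- prints [11,20,-10]
```

Passing a two-element singleton to vReplicate here would be rejected at compile time, since vApply demands equal lengths.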
What's Nasty? What's Missing?
Let's put a bit of pressure on this technology and see what starts
wobbling. We might get the idea that singletons should be manageable a
bit more implicitly
class Nattily (n :: Nat) where
  natty :: Natty n
instance Nattily Z where
  natty = Zy
instance Nattily n => Nattily (S n) where
  natty = Sy natty
allowing us to write, say,
instance Nattily n => Applicative (Vec n) where
  pure = vReplicate natty
  (<*>) = vApply
That works, but it now means that our original Nat type has spawned
three copies: a kind, a singleton family and a singleton class. We
have a rather clunky process for exchanging explicit Natty n values
and Nattily n dictionaries. Moreover, Natty is not Nat: we have
some sort of dependency on run-time values, but not at the type we
first thought of. No fully dependently typed language makes dependent
types this complicated!
Meanwhile, although Nat can be promoted, Vec cannot. You can't
index by an indexed type. Full on dependently typed languages impose
no such restriction, and in my career as a dependently typed show-off,
I've learned to include examples of two-layer indexing in my talks,
just to teach folks who've made one-layer indexing
difficult-but-possible not to expect me to fold up like a house of
cards. What's the problem? Equality. GADTs work by translating the
constraints you achieve implicitly when you give a constructor a
specific return type into explicit equational demands. Like this.
data Vec (n :: Nat) (x :: *)
  = n ~ Z => VNil
  | forall m. n ~ S m => VCons x (Vec m x)
In each of our two equations, both sides have kind Nat.
Now try the same translation for something indexed over vectors.
data InVec :: x -> Vec n x -> * where
  Here :: InVec z (VCons z zs)
  After :: InVec z ys -> InVec z (VCons y ys)
becomes
data InVec (a :: x) (as :: Vec n x)
  = forall m z (zs :: Vec m x). (n ~ S m, as ~ VCons z zs) => Here
  | forall m y z (ys :: Vec m x). (n ~ S m, as ~ VCons y ys) => After (InVec z ys)
and now we form equational constraints between as :: Vec n x and
VCons z zs :: Vec (S m) x where the two sides have syntactically
distinct (but provably equal) kinds. GHC core is not currently
equipped for such a concept!
What else is missing? Well, most of Haskell is missing from the type
level. The language of terms which you can promote has just variables
and non-GADT constructors, really. Once you have those, the type family machinery allows you to write type-level programs: some of
those might be quite like functions you would consider writing at the
term level (e.g., equipping Nat with addition, so you can give a
good type to append for Vec), but that's just a coincidence!
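The append example mentioned in passing can be sketched like this. Plus, vAppend and vToList are names chosen for the sketch, and Type from Data.Kind stands in for *.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeFamilies #-}
import Data.Kind (Type)

data Nat = Z | S Nat

data Vec :: Nat -> Type -> Type where
  VNil  :: Vec 'Z x
  VCons :: x -> Vec n x -> Vec ('S n) x

-- A type-level program: addition on promoted Nats.
type family Plus (m :: Nat) (n :: Nat) :: Nat where
  Plus 'Z     n = n
  Plus ('S m) n = 'S (Plus m n)

-- The term-level recursion mirrors the type-level one clause for clause,
-- which is why the typechecker can verify the lengths by computation alone.
vAppend :: Vec m x -> Vec n x -> Vec (Plus m n) x
vAppend VNil         ys = ys
vAppend (VCons x xs) ys = VCons x (vAppend xs ys)

-- Hypothetical helper, only for displaying results.
vToList :: Vec n x -> [x]
vToList VNil         = []
vToList (VCons x xs) = x : vToList xs

main :: IO ()
main = print (vToList (vAppend (VCons 1 (VCons 2 VNil)) (VCons 3 VNil)))
-- prints [1,2,3]
```

Had Plus recursed on its second argument instead, the same vAppend would no longer typecheck without an explicit proof, which is the sense in which the resemblance between the two programs is doing real work.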
Another thing missing, in practice, is a library which makes
use of our new abilities to index types by values. What do Functor
and Monad become in this brave new world? I'm thinking about it, but
there's a lot still to do.
Running Type-Level Programs
Haskell, like most dependently typed programming languages, has two
operational semanticses. There's the way the run-time system runs
programs (closed expressions only, after type erasure, highly
optimised) and then there's the way the typechecker runs programs
(your type families, your "type class Prolog", with open expressions). For Haskell, you don't normally mix
the two up, because the programs being executed are in different
languages. Dependently typed languages have separate run-time and
static execution models for the same language of programs, but don't
worry, the run-time model still lets you do type erasure and, indeed,
proof erasure: that's what Coq's extraction mechanism gives you;
that's at least what Edwin Brady's compiler does (although Edwin
erases unnecessarily duplicated values, as well as types and
proofs). The phase distinction may not be a distinction of syntactic category
any longer, but it's alive and well.
Dependently typed languages, being total, allow the typechecker to run
programs free from the fear of anything worse than a long wait. As
Haskell becomes more dependently typed, we face the question of what
its static execution model should be. One approach might be to
restrict static execution to total functions, which would allow us the
same freedom to run, but might force us to make distinctions (at least
for type-level code) between data and codata, so that we can tell
whether to enforce termination or productivity. But that's not the only
approach. We are free to choose a much weaker execution model which is
reluctant to run programs, at the cost of making fewer equations come
out just by computation. And in effect, that's what GHC actually
does. The typing rules for GHC core make no mention of running
programs, but only of checking evidence for equations. When
translating to the core, GHC's constraint solver tries to run your type-level programs,
generating a little silvery trail of evidence that a given expression
equals its normal form. This evidence-generation method is a little
unpredictable and inevitably incomplete: it fights shy of
scary-looking recursion, for example, and that's probably wise. One
thing we don't need to worry about is the execution of IO
computations in the typechecker: remember that the typechecker doesn't have to give
launchMissiles the same meaning that the run-time system does!
Hindley-Milner Culture
The Hindley-Milner type system achieves the truly awesome coincidence
of four distinct distinctions, with the unfortunate cultural
side-effect that many people cannot see the distinction between the
distinctions and assume the coincidence is inevitable! What am I
talking about?
terms vs types
explicitly written things vs implicitly written things
presence at run-time vs erasure before run-time
non-dependent abstraction vs dependent quantification
We're used to writing terms and leaving types to be inferred...and
then erased. We're used to quantifying over type variables with the
corresponding type abstraction and application happening silently and
statically.
You don't have to veer too far from vanilla Hindley-Milner
before these distinctions come out of alignment, and that's no bad thing. For a start, we can have more interesting types if we're willing to write them in a few
places. Meanwhile, we don't have to write type class dictionaries when
we use overloaded functions, but those dictionaries are certainly
present (or inlined) at run-time. In dependently typed languages, we
expect to erase more than just types at run-time, but (as with type
classes) that some implicitly inferred values will not be
erased. E.g., vReplicate's numeric argument is often inferable from the type of the desired vector, but we still need to know it at run-time.
Which language design choices should we review because these
coincidences no longer hold? E.g., is it right that Haskell provides
no way to instantiate a forall x. t quantifier explicitly? If the
typechecker can't guess x by unifying t, we have no other way to
say what x must be.
More broadly, we cannot treat "type inference" as a monolithic concept
that we have either all or nothing of. For a start, we need to split
off the "generalisation" aspect (Milner's "let" rule), which relies heavily on
restricting which types exist to ensure that a stupid machine can
guess one, from the "specialisation" aspect (Milner's "var" rule)
which is as effective as your constraint solver. We can expect that
top-level types will become harder to infer, but that internal type
information will remain fairly easy to propagate.
Next Steps For Haskell
We're seeing the type and kind levels grow very similar (and they
already share an internal representation in GHC). We might as well
merge them. It would be fun to take * :: * if we can: we lost
logical soundness long ago, when we allowed bottom, but type
soundness is usually a weaker requirement. We must check. If we must have
distinct type, kind, etc levels, we can at least make sure everything
at the type level and above can always be promoted. It would be great
just to re-use the polymorphism we already have for types, rather than
re-inventing polymorphism at the kind level.
We should simplify and generalise the current system of constraints by
allowing heterogeneous equations a ~ b where the kinds of a and
b are not syntactically identical (but can be proven equal). It's an
old technique (in my thesis, last century) which makes dependency much
easier to cope with. We'd be able to express constraints on
expressions in GADTs, and thus relax restrictions on what can be
promoted.
We should eliminate the need for the singleton construction by
introducing a dependent function type, pi x :: s -> t. A function
with such a type could be applied explicitly to any expression of type s which
lives in the intersection of the type and term languages (so,
variables, constructors, with more to come later). The corresponding
lambda and application would not be erased at run-time, so we'd be
able to write
vReplicate :: pi n :: Nat -> x -> Vec n x
vReplicate Z x = VNil
vReplicate (S n) x = VCons x (vReplicate n x)
without replacing Nat by Natty. The domain of pi can be any
promotable type, so if GADTs can be promoted, we can write dependent
quantifier sequences (or "telescopes" as de Bruijn called them)
pi n :: Nat -> pi xs :: Vec n x -> ...
to whatever length we need.
The point of these steps is to eliminate complexity by working directly with more general tools, instead of making do with weak tools and clunky encodings. The current partial buy-in makes the benefits of Haskell's sort-of dependent types more expensive than they need to be.
Too Hard?
Dependent types make a lot of people nervous. They make me nervous,
but I like being nervous, or at least I find it hard not to be nervous
anyway. But it doesn't help that there's quite such a fog of ignorance
around the topic. Some of that's due to the fact that we all still
have a lot to learn. But proponents of less radical approaches have
been known to stoke fear of dependent types without always making sure
the facts are wholly with them. I won't name names. These "undecidable typechecking", "Turing incomplete", "no phase distinction", "no type erasure", "proofs everywhere", etc, myths persist, even though they're rubbish.
It's certainly not the case that dependently typed programs must
always be proven correct. One can improve the basic hygiene of one's
programs, enforcing additional invariants in types without going all
the way to a full specification. Small steps in this direction quite
often result in much stronger guarantees with few or no additional
proof obligations. It is not true that dependently typed programs are
inevitably full of proofs, indeed I usually take the presence of any
proofs in my code as the cue to question my definitions.
For, as with any increase in articulacy, we become free to say foul
new things as well as fair. E.g., there are plenty of crummy ways to
define binary search trees, but that doesn't mean there isn't a good way. It's important not to presume that bad experiences cannot be
bettered, even if it dents the ego to admit it. Design of dependent
definitions is a new skill which takes learning, and being a Haskell
programmer does not automatically make you an expert! And even if some
programs are foul, why would you deny others the freedom to be fair?
Why Still Bother With Haskell?
I really enjoy dependent types, but most of my hacking projects are
still in Haskell. Why? Haskell has type classes. Haskell has useful
libraries. Haskell has a workable (although far from ideal) treatment
of programming with effects. Haskell has an industrial strength
compiler. The dependently typed languages are at a much earlier stage
in growing community and infrastructure, but we'll get there, with a
real generational shift in what's possible, e.g., by way of
metaprogramming and datatype generics. But you just have to look
around at what people are doing as a result of Haskell's steps towards
dependent types to see that there's a lot of benefit to be gained by
pushing the present generation of languages forwards, too.

Dependent typing is really just the unification of the value and type levels, so you can parametrize values on types (already possible with type classes and parametric polymorphism in Haskell) and you can parametrize types on values (not, strictly speaking, possible yet in Haskell, although DataKinds gets very close).
Edit: Apparently, from this point forward, I was wrong (see @pigworker's comment). I'll preserve the rest of this as a record of the myths I've been fed. :P
The issue with moving to full dependent typing, from what I've heard, is that it would break the phase restriction between the type and value levels that allows Haskell to be compiled to efficient machine code with erased types. With our current level of technology, a dependently typed language must go through an interpreter at some point (either immediately, or after being compiled to dependently-typed bytecode or similar).
This is not necessarily a fundamental restriction, but I'm not personally aware of any current research that looks promising in this regard but that has not already made it into GHC. If anyone else knows more, I would be happy to be corrected.

John, that's another common misconception about dependent types: that they don't work when data is only available at run-time. Here's how you can do the getLine example:
data Some :: (k -> *) -> * where
  Like :: p x -> Some p
fromInt :: Int -> Some Natty
fromInt 0 = Like Zy
fromInt n = case fromInt (n - 1) of
  Like n -> Like (Sy n)
withZeroes :: (forall n. Vec n Int -> IO a) -> IO a
withZeroes k = do
  Like n <- fmap (fromInt . read) getLine
  k (vReplicate n 0)
*Main> withZeroes print
5
VCons 0 (VCons 0 (VCons 0 (VCons 0 (VCons 0 VNil))))
Edit: Hm, that was supposed to be a comment to pigworker's answer. I clearly fail at SO.

pigworker gives an excellent discussion of why we should be headed towards dependent types: (a) they're awesome; (b) they would actually simplify a lot of what Haskell already does.
As for the "why not?" question, there are a couple of points, I think. The first point is that while the basic notion behind dependent types is easy (allow types to depend on values), the ramifications of that basic notion are both subtle and profound. For example, the distinction between values and types is still alive and well; but discussing the difference between them becomes far more nuanced than in yer Hindley-Milner or System F. To some extent this is due to the fact that dependent types are fundamentally hard (e.g., first-order logic is undecidable). But I think the bigger problem is really that we lack a good vocabulary for capturing and explaining what's going on. As more and more people learn about dependent types, we'll develop a better vocabulary and so things will become easier to understand, even if the underlying problems are still hard.
The second point has to do with the fact that Haskell is growing towards dependent types. Because we're making incremental progress towards that goal, but without actually making it there, we're stuck with a language that has incremental patches on top of incremental patches. The same sort of thing has happened in other languages as new ideas became popular. Java didn't use to have (parametric) polymorphism; and when they finally added it, it was obviously an incremental improvement with some abstraction leaks and crippled power. Turns out, mixing subtyping and polymorphism is inherently hard; but that's not the reason why Java Generics work the way they do. They work the way they do because of the constraint to be an incremental improvement to older versions of Java. Ditto for further back in the day, when OOP was invented and people started writing "objective" C (not to be confused with Objective-C), etc. Remember, C++ started out under the guise of being a strict superset of C. Adding new paradigms always requires defining the language anew, or else ending up with some complicated mess. My point in all of this is that adding true dependent types to Haskell is going to require a certain amount of gutting and restructuring the language, if we're going to do it right. But it's really hard to commit to that kind of an overhaul, whereas the incremental progress we've been making seems cheaper in the short term. Really, there aren't that many people who hack on GHC, but there's a goodly amount of legacy code to keep alive. This is part of the reason why there are so many spinoff languages like DDC, Cayenne, Idris, etc.

What is a monad in FP, in categorical terms?

Every time someone promises to "explain monads", my interest is piqued, only to be replaced by frustration when the alleged "explanation" is a long list of examples terminated by some off-hand remark that the "mathematical theory" behind the "esoteric ideas" is "too complicated to explain at this point".
Now I'm asking for the opposite. I have a solid grasp on category theory and I'm not afraid of diagram chasing, Yoneda's lemma or derived functors (and indeed on monads and adjunctions in the categorical sense).
Could someone give me a clear and concise definition of what a monad is in functional programming? The fewer examples the better: sometimes one clear concept says more than a hundred timid examples. Haskell would do nicely as a language for demonstration though I'm not picky.

This question has some good answers: Monads as adjunctions
More to the point, Derek Elkins' "Calculating Monads with Category Theory" article in TMR #13 should have the sort of constructions you're looking for: http://www.haskell.org/wikiupload/8/85/TMR-Issue13.pdf
Finally, and perhaps this is really the closest to what you're looking for, you can go straight to the source and look at Moggi's seminal papers on the topic from 1988-91: http://www.disi.unige.it/person/MoggiE/publications.html
See in particular "Notions of computation and monads".
My own (I'm sure too condensed and imprecise) take:
Begin with a category Hask whose objects are Haskell types, and whose morphisms are functions. Functions are also objects in Hask, as are products. So Hask is Cartesian closed. Now introduce an arrow mapping every object in Hask to MHask which is a subset of the objects in Hask. Unit!
Next introduce an arrow mapping every arrow on Hask to an arrow on MHask. This gives us map, and makes MHask a covariant endofunctor. Now introduce an arrow mapping every object in MHask which is generated from an object in MHask (via unit) to the object in MHask which generates it. Join! And from that, MHask is a monad (and a monoidal endofunctor to be more precise).
I'm sure there is a reason why the above is deficient, which is why I'd really direct you, if you're looking for formalism, to the Moggi papers in particular.

As a complement to Carl's answer, a Monad in Haskell is (theoretically) this:
class Monad m where
    join :: m (m a) -> m a
    return :: a -> m a
    fmap :: (a -> b) -> m a -> m b
Note that "bind" (>>=) can be defined as
x >>= f = join (fmap f x)
According to the Haskell Wiki
A monad in a category C is a triple (F : C → C, η : Id → F, μ : F ∘ F → F)
...with some axioms. For Haskell, fmap, return, and join line up with F, η, and μ, respectively. (fmap in Haskell defines a Functor). If I'm not mistaken, Scala calls these map, pure, and join respectively. (Scala calls bind "flatMap")
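To make the join-based presentation concrete, here is a minimal sketch. The class name JoinMonad and the names unit, joinM, bindViaJoin and safeHalf are made up for illustration (they are not in base); unit and joinM stand in for return and join so that nothing clashes with the Prelude.

```haskell
-- Hypothetical class: a Functor equipped with unit (eta) and join (mu).
-- JoinMonad, unit, joinM and bindViaJoin are names invented for this sketch.
class Functor m => JoinMonad m where
  unit  :: a -> m a          -- eta : Id -> F
  joinM :: m (m a) -> m a    -- mu  : F . F -> F

-- Bind recovered from fmap and join: x >>= f = join (fmap f x).
bindViaJoin :: JoinMonad m => m a -> (a -> m b) -> m b
bindViaJoin x f = joinM (fmap f x)

instance JoinMonad Maybe where
  unit = Just
  joinM (Just inner) = inner
  joinM Nothing      = Nothing

instance JoinMonad [] where
  unit x = [x]
  joinM  = concat

-- A sample Kleisli arrow in Maybe: halving only succeeds on even numbers.
safeHalf :: Int -> Maybe Int
safeHalf n
  | even n    = Just (n `div` 2)
  | otherwise = Nothing
```

With these definitions, bindViaJoin (Just 8) safeHalf gives Just 4 and bindViaJoin (Just 3) safeHalf gives Nothing, which agrees with the Prelude's >>= for Maybe.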

Ok, using Haskell terminology and examples...
A monad, in functional programming, is a composition pattern for data types with the kind * -> *.
class Monad (m :: * -> *) where
    return :: a -> m a
    (>>=) :: m a -> (a -> m b) -> m b
(There's more to the class than that in Haskell, but those are the important parts.)
A data type is a monad if it can implement that interface while satisfying three conditions in the implementation. These are the "monad laws", and I'll leave it to those long-winded explanations for the full explanation. I summarize the laws as "(>>= return) is an identity function, and (>>=) is associative." It's really not more than that, even if it can be expressed more precisely.
And that's all a monad is. If you can implement that interface while preserving those behavioral properties, you have a monad.
That explanation is probably shorter than you expected. That's because the monad interface really is very abstract. The incredible level of abstraction is part of why so many different things can be modeled as monads.
What's less obvious is that, as abstract as the interface is, it allows generically modeling any control-flow pattern, regardless of the actual monad implementation. This is why the Control.Monad module in GHC's base library has combinators like when, forever, etc. And this is why the ability to explicitly abstract over any monad implementation is powerful, especially with support from a type system.
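As a rough illustration of both points, the sketch below states the three laws as Boolean checks for Maybe at sample arguments, and then defines one control-flow combinator purely against the interface. The names k, h and iterateM are hypothetical, chosen for this example; passing checks are only evidence, not a proof.

```haskell
-- Hypothetical sample Kleisli arrows used to exercise the laws.
k, h :: Int -> Maybe Int
k x = if x > 0 then Just (x - 1) else Nothing
h x = Just (x * 2)

-- return a >>= k  ==  k a
leftIdentity :: Int -> Bool
leftIdentity a = (return a >>= k) == k a

-- m >>= return  ==  m
rightIdentity :: Maybe Int -> Bool
rightIdentity m = (m >>= return) == m

-- m >>= (\x -> k x >>= h)  ==  (m >>= k) >>= h
associativity :: Maybe Int -> Bool
associativity m = (m >>= (\x -> k x >>= h)) == ((m >>= k) >>= h)

-- A combinator written once against the Monad interface:
-- apply a Kleisli arrow n times, threading the result through.
iterateM :: Monad m => Int -> (a -> m a) -> a -> m a
iterateM n step x
  | n <= 0    = return x
  | otherwise = step x >>= iterateM (n - 1) step
```

The same iterateM short-circuits in Maybe (iterateM 3 k 2 is Nothing) and branches in the list monad (iterateM 2 (\x -> [x, x + 1]) 0 is [0,1,1,2]), without knowing anything about either implementation.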

You should read Eugenio Moggi's paper "Notions of computation and monads", which explains the then-proposed role of monads in structuring the denotational semantics of effectful languages.
Also there is a related question:
References for learning the theory behind pure functional languages such as Haskell?
As you don't want hand-waving, you have to read scientific papers, not forum answers or tutorials.

A monad is a monoid in the category of endofunctors, what's the problem?
Humor aside, I personally believe that monads, as they are used in Haskell and functional programming, are better understood from the monads-as-an-interface point of view (as in Carl's and Dan's answers) instead of from the monads-as-the-term-from-category-theory point of view. I have to confess that I only really internalized the whole monad thing when I had to use a monadic library from another language in a real project.
You mention that you didn't like all the "lots of examples" tutorials. Has anyone ever pointed you to the Awkward Squad paper? It focuses mainly on the IO monad, but the introduction gives a good technical and historical explanation of why the monad concept was embraced by Haskell in the first place.

I don't really know what I'm talking about, but here's my take:
Monads are used to represent computations. You can think of a normal procedural program, which is basically a list of statements, as a bunch of composed computations. Monads are a generalization of this concept, allowing you to define how the statements get composed. Each computation has a value (it could just be ()); the monad just determines how the value strung through a series of computations behaves.
Do notation is really what makes this clear: it's basically a special sort of statement-based language that lets you define what happens between statements. It's as if you could define how ";" worked in C-like languages.
In this light all of the monads I've used so far make sense: State doesn't affect the value but updates a second value which is passed along from computation to computation in the background; Maybe short-circuits the value if it ever encounters a Nothing; List lets you have a variable number of values passed through; IO lets you have impure values passed through in a safe way. The more specialized monads I've used like Gen and Parsec parsers are also similar.
Hopefully this is a clear explanation which isn't completely off-base.
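A small sketch of the "programmable semicolon" idea: the do block below is written once, and what happens between its statements depends on which monad it is run in. The name addBoth is made up for this example, and only Maybe and List are shown because they need nothing outside the Prelude.

```haskell
-- One do block, written against the Monad interface only.
-- What happens "between the statements" is decided by the monad m.
addBoth :: Monad m => m Int -> m Int -> m Int
addBoth mx my = do
  x <- mx
  y <- my
  return (x + y)

-- In Maybe, a Nothing anywhere short-circuits the whole block.
maybeOk, maybeFail :: Maybe Int
maybeOk   = addBoth (Just 1) (Just 2)
maybeFail = addBoth (Just 1) Nothing

-- In the list monad, every combination of values is passed through.
listAll :: [Int]
listAll = addBoth [1, 2] [10, 20]
```

Here maybeOk is Just 3, maybeFail is Nothing, and listAll is [11,21,12,22].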

Since you understand monads in the category-theoretic sense I am interpreting your question as being about the presentation of monads in functional programming.
Thus my answer avoids any explanation of what a monad is, or any intuition about its meaning or use.
Answer: In Haskell a monad is presented, in an internal language for some category, as the (internalised) maps of a Kleisli triple.
Explanation:
It is hard to be precise about the properties of the "Hask category", and these properties are largely irrelevant for understanding Haskell's presentation of monads.
Instead, for this discussion, it is more useful to understand Haskell as an internal language for some category C. Haskell functions define morphisms in C and Haskell types are objects in C, but the particular category in which these definitions are made is unimportant.
Parametric data types, e.g. data F a = ..., are object mappings, e.g. F : |C| -> |C|.
The usual description of a monad in Haskell is in Kleisli triple (or Kleisli extension) form:
class Monad m where
    return :: a -> m a
    (>>=) :: m a -> (a -> m b) -> m b
where:
m is the object mapping m :|C| -> |C|
return is the unit operation on objects
>>= (pronounced "bind" by Haskellers) is the extension operation on morphisms but with its first two parameters swapped (cf. usual signature of extension (-)* : (a -> m b) -> m a -> m b)
(These maps are themselves internalised as families of morphisms in C, which is possible since m :|C| -> |C|).
Haskell's do-notation (if you have come across this) is therefore an internal language for Kleisli categories.
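For readers who want to see the Kleisli-triple reading in code, here is a minimal sketch. The names ext, kleisli, safeRecip, safeSqrt, viaExt, viaFish and viaDo are made up for this example; ext is just bind with its arguments swapped (base already provides it as =<<), and >=> is the Kleisli composition operator from Control.Monad.

```haskell
import Control.Monad ((>=>))

-- Kleisli extension (-)* : bind with its first two arguments swapped.
-- The name ext is invented here; base exports the same function as (=<<).
ext :: Monad m => (a -> m b) -> m a -> m b
ext f m = m >>= f

-- Composition in the Kleisli category, defined from extension.
kleisli :: Monad m => (a -> m b) -> (b -> m c) -> a -> m c
kleisli f g = ext g . f

-- Two sample Kleisli arrows in Maybe.
safeRecip :: Double -> Maybe Double
safeRecip 0 = Nothing
safeRecip x = Just (1 / x)

safeSqrt :: Double -> Maybe Double
safeSqrt x
  | x < 0     = Nothing
  | otherwise = Just (sqrt x)

-- The same composite arrow written three ways:
-- via extension, via (>=>), and via do-notation.
viaExt, viaFish, viaDo :: Double -> Maybe Double
viaExt  = kleisli safeRecip safeSqrt
viaFish = safeRecip >=> safeSqrt
viaDo x = do
  r <- safeRecip x
  safeSqrt r
```

All three definitions denote the same morphism in the Kleisli category of Maybe; the do block is only surface syntax for the composite.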

The Haskell wikibook page has a good basic explanation.

Why should I use applicative functors in functional programming?

I'm new to Haskell, and I'm reading about functors and applicative functors. Ok, I understand functors and how I can use them, but I don't understand why applicative functors are useful and how I can use them in Haskell. Can you explain to me with a simple example why I need applicative functors?

Applicative functors are a construction that sits midway between functors and monads: they are more widespread than monads and more useful than plain functors. Normally you can just map a function over a functor. Applicative functors allow you to take a "normal" function (taking non-functorial arguments) and use it to operate on several values that are in functor contexts. As a corollary, this gives you effectful programming without monads.
A nice, self-contained explanation full of examples can be found here. You can also read a practical parsing example developed by Bryan O'Sullivan, which requires no prior knowledge.

Applicative functors are useful when you need sequencing of actions, but don't need to name any intermediate results. They are thus weaker than monads, but stronger than functors (they do not have an explicit bind operator, but they do allow running arbitrary functions inside the functor).
When are they useful? A common example is parsing, where you need to run a number of actions that read parts of a data structure in order, then glue all the results together. This is like a general form of function composition:
f a b c d
where you can think of a, b and so on as the arbitrary actions to run, and f as the function to apply to the results.
f <$> a <*> b <*> c <*> d
I like to think of them as overloaded 'whitespace'. Or, that regular Haskell functions are in the identity applicative functor.
See "Applicative Programming with Effects"
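A small sketch of the f <$> a <*> b shape, using Maybe as the effect. Point, parsePoint and plainViaIdentity are hypothetical names for this example; readMaybe comes from Text.Read and Identity from Data.Functor.Identity, both in base.

```haskell
import Data.Functor.Identity (Identity (..))
import Text.Read (readMaybe)

data Point = Point Int Int deriving (Eq, Show)

-- Each argument is an "action" that may fail; Point is the pure function
-- applied to their results. No intermediate result is ever named.
parsePoint :: String -> String -> Maybe Point
parsePoint sx sy = Point <$> readMaybe sx <*> readMaybe sy

-- Ordinary application seen as the identity applicative:
-- this is just Point a b, routed through Identity.
plainViaIdentity :: Int -> Int -> Point
plainViaIdentity a b = runIdentity (Point <$> Identity a <*> Identity b)
```

parsePoint "3" "4" gives Just (Point 3 4), while parsePoint "3" "x" gives Nothing because the second parse fails.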

Conor McBride and Ross Paterson's Functional Pearl on the style has several good examples. It's also responsible for popularizing the style in the first place. They use the term "idiom" for "applicative functor", but other than that it's pretty understandable.

It is hard to come up with examples where you need applicative functors. I can understand why an intermediate Haskell programmer would ask themselves that question, since most introductory texts present instances derived from Monads, using Applicative Functors only as a convenient interface.
The key insight, as mentioned both here and in most introductions to the subject, is that Applicative Functors are between Functors and Monads (even between Functors and Arrows). All Monads are Applicative Functors but not all Functors are Applicative.
So necessarily, sometimes we can use applicative combinators for something that we can't use monadic combinators for. One such thing is ZipList (see also this SO question for some details), which is just a wrapper around lists in order to have a different Applicative instance than the one derived from the Monad instance of list. The Applicative documentation uses the following line to give an intuitive notion of what ZipList is for:
f <$> ZipList xs1 <*> ... <*> ZipList xsn = ZipList (zipWithn f xs1 ... xsn)
As pointed out here, it is possible to make quirky Monad instances that almost work for ZipList.
There are other Applicative Functors that are not Monads (see this SO question) and they are easy to come up with. Having an alternative Interface for Monads is nice and all, but sometimes making a Monad is inefficient, complicated, or even impossible, and that is when you need Applicative Functors.
Disclaimer: making Applicative Functors might also be inefficient, complicated, and impossible. When in doubt, consult your local category theorist for correct usage of Applicative Functors.
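To see the difference between the two Applicative instances for lists, here is a short sketch using ZipList from Control.Applicative. The names allPairs and zipped are made up for this example.

```haskell
import Control.Applicative (ZipList (..))

-- The Applicative that agrees with the list Monad: every combination.
allPairs :: [Int]
allPairs = (+) <$> [1, 2, 3] <*> [10, 20, 30]

-- ZipList's Applicative: position by position, like zipWith.
zipped :: [Int]
zipped = getZipList ((+) <$> ZipList [1, 2, 3] <*> ZipList [10, 20, 30])
```

allPairs is [11,21,31,12,22,32,13,23,33], whereas zipped is [11,22,33].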

In my experience, Applicative functors are great for the following reasons:
Certain kinds of data structures admit powerful types of compositions, but cannot really be made monads. In fact, most of the abstractions in functional reactive programming fall into this category. While we might technically be able to make e.g. Behavior (aka Signal) a monad, it typically cannot be done efficiently. Applicative functors allow us to still have powerful compositions without sacrificing efficiency (admittedly, it is a bit trickier to use an applicative than a monad sometimes, just because you don't have quite as much structure to work with).
The lack of data-dependence in an applicative functor allows you to e.g. traverse an action looking for all the effects it might produce without having the data available. So you could imagine a "web form" applicative, used like so:
userData = User <$> field "Name" <*> field "Address"
and you could write an engine which would traverse to find all the fields used and display them in a form, then when you get the data back run it again to get the constructed User. This cannot be done with a plain functor (because it combines two forms into one), nor a monad, because with a monad you could express:
userData = do
    name <- field "Name"
    address <- field $ name ++ "'s address"
    return (User name address)
which cannot be rendered, because the name of the second field cannot be known without already having the response from the first. I'm pretty sure there's a library that implements this forms idea -- I've rolled my own a few times for this and that project.
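A minimal sketch of how such a form applicative could look, not taken from any particular library. The Form type, its fields fieldNames and runForm, and this definition of field are all assumptions made for illustration: a form carries a static list of field names next to a function that builds the result from the submitted data (modelled here as an association list).

```haskell
-- Hypothetical "web form" applicative: a static list of field names plus a
-- function that builds the result once the submitted values are available.
data Form a = Form
  { fieldNames :: [String]
  , runForm    :: [(String, String)] -> Maybe a
  }

instance Functor Form where
  fmap f (Form names run) = Form names (fmap f . run)

instance Applicative Form where
  pure x = Form [] (const (Just x))
  -- Field names are combined without looking at any submitted value.
  Form ns1 run1 <*> Form ns2 run2 =
    Form (ns1 ++ ns2) (\env -> run1 env <*> run2 env)

-- A single text field, looked up by name in the submitted data.
field :: String -> Form String
field name = Form [name] (lookup name)

data User = User String String deriving (Eq, Show)

userData :: Form User
userData = User <$> field "Name" <*> field "Address"
```

fieldNames userData yields ["Name","Address"] before any data exists, and runForm userData can be applied later to the response. A Monad instance is not possible for this representation, because the field names of m >>= f would depend on a value that is only known after submission.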
The other nice thing about applicative functors is that they compose. More precisely, the composition functor:
newtype Compose f g x = Compose (f (g x))
is applicative whenever f and g are. The same cannot be said for monads, which has created the whole monad transformer story, which is complicated in some unpleasant ways. Applicatives are super clean this way, and it means you can build up the structure of a type you need by focusing on small composable components.
Recently the ApplicativeDo extension has appeared in GHC, which allows you to use do notation with applicatives, easing some of the notational complexity, as long as you don't do any monady things.
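As a quick illustration of the composition point, the sketch below uses the Compose newtype that ships in base (Data.Functor.Compose) rather than the hand-written one above. The names combined and missing are made up for this example.

```haskell
import Data.Functor.Compose (Compose (..))

-- Maybe and [] are both Applicative, so Compose Maybe [] is too,
-- with no transformer or extra instance needed.
combined :: Compose Maybe [] Int
combined = (+) <$> Compose (Just [1, 2]) <*> Compose (Just [10, 20])

-- The outer effect (Maybe) still short-circuits as usual.
missing :: Compose Maybe [] Int
missing = (+) <$> Compose Nothing <*> Compose (Just [10, 20])
```

getCompose combined is Just [11,21,12,22] and getCompose missing is Nothing: the outer Maybe and the inner list each keep their own applicative behaviour.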

One good example: applicative parsing.
See [real world haskell] ch16 http://book.realworldhaskell.org/read/using-parsec.html#id652517
This is the parser code with do-notation:
-- file: ch16/FormApp.hs
p_hex :: CharParser () Char
p_hex = do
    char '%'
    a <- hexDigit
    b <- hexDigit
    let ((d, _):_) = readHex [a,b]
    return . toEnum $ d
Using applicative functors makes it much shorter:
-- file: ch16/FormApp.hs
a_hex = hexify <$> (char '%' *> hexDigit) <*> hexDigit
    where hexify a b = toEnum . fst . head . readHex $ [a,b]
"Lifting" can hide the underlying details of some repeated code, so you can use fewer words to tell the exact and precise story.

I would also suggest taking a look at this.
At the end of the article there is an example:
import Control.Applicative
hasCommentA blogComments =
    BlogComment <$> lookup "title" blogComments
                <*> lookup "user" blogComments
                <*> lookup "comment" blogComments
Which illustrates several features of applicative programming style.

Relax ordering constraints in monadic computation

here is some food for thought.
When I write monadic code, the monad imposes ordering on the operations done. For example, If I write in the IO monad:
do a <- doSomething
   b <- doSomethingElse
   return (a + b)
I know doSomething will be executed before doSomethingElse.
Now, consider the equivalent code in a language like C:
return (doSomething() + doSomethingElse());
The semantics of C don't actually specify what order these two function calls will be evaluated, so the compiler is free to move things around as it pleases.
My question is this: How would I write monadic code in Haskell that also leaves this evaluation order undefined? Ideally, I would reap some benefits when my compiler's optimizer looks at the code and starts moving things around.
Here are some possible techniques that don't get the job done, but are in the right "spirit":
Write the code in functorial style, that is, write plus doSomething doSomethingElse and let plus schedule the monadic calls. Drawbacks: You lose sharing on the results of the monadic actions, and plus still makes a decision about when things end up being evaluated.
Use lazy IO, that is, unsafeInterleaveIO, which defers the scheduling to the demands of lazy evaluation. But lazy is different from strict with undefined order: in particular I do want all of my monadic actions to get executed.
Lazy IO, combined with immediately seq'ing all of the arguments. In particular, seq does not impose ordering constraints.
In this sense, I want something more flexible than monadic ordering but less flexible than full-out laziness.

This problem of over-sequentializing monad code is known as the "commutative monads problem".
Commutative monads are monads for which the order of actions makes no difference (they commute), that is, when the following code:
do a <- f x
   b <- g y
   m a b
is the same as:
do b <- g y
   a <- f x
   m a b
There are many monads that commute (e.g. Maybe, Random). If the monad is commutative, then the operations captured within it can be computed in parallel, for example. They are very useful!
However, we don't have a good syntax for monads that commute, though a lot of people have asked for such a thing -- it is still an open research problem.
As an aside, applicative functors do give us such freedom to reorder computations, however, you have to give up on the notion of bind (as suggestions for e.g. liftM2 show).
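A small sketch of what "commutes" means in practice. The names orderAB, orderBA and viaApplicative are made up for this example: the first two run a pair of actions in opposite orders but return the results in the same positions, and the third expresses the same pairing with liftA2, where neither action can depend on the other's result.

```haskell
import Control.Applicative (liftA2)

-- Run ma first, then mb.
orderAB :: Monad m => m a -> m b -> m (a, b)
orderAB ma mb = do
  a <- ma
  b <- mb
  return (a, b)

-- Run mb first, then ma; the result tuple keeps the same layout.
orderBA :: Monad m => m a -> m b -> m (a, b)
orderBA ma mb = do
  b <- mb
  a <- ma
  return (a, b)

-- The applicative formulation: no bind, so no data dependency
-- between the two actions can even be written down.
viaApplicative :: Applicative f => f a -> f b -> f (a, b)
viaApplicative = liftA2 (,)
```

In Maybe the two orders always agree, so Maybe is commutative. In the list monad they do not: orderAB [1,2] "ab" is [(1,'a'),(1,'b'),(2,'a'),(2,'b')], while orderBA [1,2] "ab" is [(1,'a'),(2,'a'),(1,'b'),(2,'b')].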

This is a deep dirty hack, but it seems like it should do the trick to me.
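{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Unorder where
import GHC.Types (IO (..))
-- Both actions receive the same RealWorld token, so there is no data
-- dependency forcing one of them to run before the other.
unorder :: IO a -> IO b -> IO (a, b)
unorder (IO f) (IO g) = IO $ \rw# ->
    let (# _, x #) = f rw#
        (# _, y #) = g rw#
    in (# rw#, (x, y) #)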
Since this puts non-determinism in the hands of the compiler, it should behave "correctly" (i.e. nondeterministically) with regards to control flow issues (i.e. exceptions) as well.
On the other hand, we can't pull the same trick with most standard monads such as State and Either a, since we're really relying on spooky action at a distance available via messing with the RealWorld token. To get the right behavior, we'd need some annotation available to the optimizer that indicated we were fine with nondeterministic choice between two non-equivalent alternatives.

The semantics of C don't actually specify what order these two function calls will be evaluated, so the compiler is free to move things around as it pleases.
But what if doSomething() causes a side-effect that will change the behavior of doSomethingElse()? Do you really want the compiler to mess with the order? (Hint: no) The fact that you are in a monad at all suggests that such may be the case. Your note that "you lose sharing on the results" also suggests this.
However, note that monadic does not always mean sequenced. It's not exactly what you describe, but you may be interested in the Par monad, which allows you to run your actions in parallel.
You are interested in leaving the order undefined so that the compiler can magically optimize it for you. Perhaps instead you should use something like the Par monad to indicate the dependencies (some things inevitably have to happen before others) and then let the rest run in parallel.
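This is not the Par monad, but a rough stand-in for the same idea using only plain threads from base: two independent IO actions are started without fixing which runs first, and the caller waits for both results. The name bothUnordered is made up for this sketch, and it deliberately ignores exceptions (if either action throws, the waiting thread never receives its result).

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Run two independent actions in separate threads and collect both results.
-- The relative order of their side effects is left to the scheduler.
-- Assumption: neither action throws; error handling is omitted on purpose.
bothUnordered :: IO a -> IO b -> IO (a, b)
bothUnordered ioa iob = do
  va <- newEmptyMVar
  vb <- newEmptyMVar
  _ <- forkIO (ioa >>= putMVar va)
  _ <- forkIO (iob >>= putMVar vb)
  a <- takeMVar va   -- blocks until the first action has finished
  b <- takeMVar vb   -- blocks until the second action has finished
  return (a, b)
```

Any dependency between the two actions has to be expressed explicitly, for example by sequencing them in an ordinary do block before handing the rest to bothUnordered.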
Side note: don't mistake Haskell's return for anything like C's return.
