Why doesn't Haskell support mutually recursive modules?

Haskell supports mutually recursive let-bindings, which is great. Haskell doesn't support mutually recursive modules, which is sometimes terrible. I know that GHC has its .hs-boot mechanism, but I think that's a bit of a hack.
As far as I know, transparent support for mutually recursive modules should be relatively "simple", and it can be done exactly like mutually recursive let-bindings: instead of taking each separate module as a compilation unit, I would take every strongly connected component of the module dependency graph as a compilation unit.
Am I missing something here? Is there any non-trivial reason why Haskell doesn't support mutually recursive modules in this way?
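For readers who haven't run into it, here is a minimal sketch of what the .hs-boot workaround looks like in current GHC (module and function names are invented for illustration): A calls into B, B calls into A, and the cycle is broken by a hand-written B.hs-boot stub containing only the declarations A needs.

    -- B.hs-boot: a hand-maintained stub that breaks the import cycle.
    module B where
    g :: Int -> Int

    -- A.hs: imports the stub instead of the real module.
    module A where
    import {-# SOURCE #-} B (g)

    f :: Int -> Int
    f 0 = 0
    f n = g (n - 1)

    -- B.hs: imports A normally; its declarations must match B.hs-boot.
    module B where
    import A (f)

    g :: Int -> Int
    g 0 = 1
    g n = f (n - 1)

With ghc --make the boot file is compiled first, then A, then the real B; having to keep the stub in sync with B by hand is the part that feels like a hack.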

This 6-year-old feature request ticket contains a fair amount of discussion, which you may have already seen. The gist of it is that it's not entirely a simple change as far as GHC is concerned. A few specific issues raised:
GHC currently has a lot of baked-in assumptions about how modules are processed during compilation, and the work needed to change those assumptions would vastly outweigh the benefits of transparent support for mutually recursive modules.
Lumping groups of modules together means they have to be compiled together, which means more recompilation and awkwardness with generating separate .hi and .o files.
Backward compatibility with existing builds that use hs-boot files.
You have the potential for mutually-recursive bindings that cross module boundaries in a mutually-recursive module group, which raises issues with anything that involves implicit, module-level scope (such as defaulting, and possibly type class instances).
And of course, the potential for unknown, unanticipated bugs, as with anything that alters long-standing assumptions in GHC. Even without massive changes to the compilation process, many things are currently assumed to be compiled on a per-module basis.
A lot of people would like to see this supported, but so far nobody has either produced a possible implementation or worked out a detailed, well-specified design that handles all the fiddly corner cases of the sort mentioned above.

Related

Are code reloading and laziness two incompatible features in Haskell?

One of Haskell's best-known features is its laziness, which makes it possible to write more elegant and reusable code. Internally, laziness means computations are built up in memory as thunks and executed only when needed, which can happen much later, at another place in the code. Also, the GHC API can reload new versions of modules, a feature that some programs use, most notably GHCi.
Now suppose that a lazy computation holds a reference to a function in a module, and the module is reloaded. Which version of the function will be executed when the lazy computation is evaluated to its full extent?
It seems natural that the old version should be executed, because its semantics is what was intended at the time the lazy computation was created, so running the new version instead seems unreasonable. However, if the old version fails, subsequent debugging will be problematic, because the code that failed no longer exists in the system, making it hard to track down the cause of the failure.
All this reasoning leads me to think that laziness and code reloading (i.e. hot upgrades) are incompatible at a conceptual level. Is this really so, or is there some design decision in Haskell or GHC that solves this conceptual incompatibility?
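As a thought experiment, the sketch below (no real GHC API reloading, just an IORef standing in for a reloadable slot; all names are invented) shows why the old version running is the natural outcome: a thunk closes over the function value it was built with, so swapping what the name points to afterwards does not affect it.

    import Data.IORef

    main :: IO ()
    main = do
      let oldVersion x = x + 1            -- stands in for the old module's function
      slot <- newIORef oldVersion
      f <- readIORef slot
      let pending = f 41                  -- lazy: a thunk capturing the old closure
      writeIORef slot (\x -> x * 100)     -- "reload": swap in a new version
      print pending                       -- 42: the thunk still runs the old code
      newF <- readIORef slot
      print (newF 41)                     -- 4100: fresh lookups see the new version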

GHC Partial Evaluation and Separate Compilation

Whole-program compilers like MLton create optimized binaries thanks in part to their ability to use the entire source of the program to perform partial evaluation: aggressively inlining constants and evaluating expressions until they get stuck, all during compilation!
This has been explored publicly a bit in the Haskell space by Gabriel Gonzalez's Morte.
Now, my understanding is that Haskell does not do very much of this, if any at all. The reason I have seen cited is that it is antithetical to separate compilation. That makes sense as a reason to prohibit partial evaluation across source-file boundaries, but it seems like in-file partial evaluation would still be an option.
As far as I know, in-file partial evaluation is still not performed, though.
My question is: is this true? If so, what are the tradeoffs for performing in-file partial evaluation? If not, what is an example file where one can improve compiled performance by putting more functionality into the same file?
(Edit: To clarify the above, I know there are a lot of questions as to what the best set of reductions to perform is; many are undecidable! I'd like to know the tradeoffs made in an "industrial strength" compiler with separate compilation that sit at a level above choosing the right equational theory, if there are any interesting things to talk about there. Things like compilation speed or file bloat are more within the scope I'm interested in. Another question in the same space might be: "Why can't MLton get separate compilation just by compiling each module separately, leaving the API exposed, and then linking them all together?")
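As a concrete illustration of the kind of in-file reduction being asked about (a hypothetical example, not a claim about what GHC actually does):

    -- The classic partial-evaluation example: the exponent is known at
    -- compile time, so an aggressive partial evaluator could unfold the
    -- recursion and reduce `cube` to \x -> x * (x * (x * 1)).
    power :: Int -> Double -> Double
    power 0 _ = 1
    power n x = x * power (n - 1) x

    cube :: Double -> Double
    cube = power 3

GHC will inline and simplify some of this under optimisation, but in general it will not unfold a recursive definition like this all the way.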
This is definitely an optimization that a small set of people are interested in and are pursuing. The Google search term to find information on it is "supercompilation". I believe there are at least two approaches floating about at the moment.
It seems one of the big tradeoffs is compilation-time resources (time and memory both), and at the moment the performance wins of paying these costs appear to be somewhat unpredictable. There's quite some work left. A few links:
A page on the GHC wiki
Neil Mitchell's Supero
Max Bolingbroke's Supercompilation by evaluation
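To give a flavour of what supercompilation aims for (an illustrative sketch only, not output from any of the tools linked above): by evaluating definitions symbolically, a supercompiler can fuse away intermediate structures.

    -- Before: two traversals and an intermediate list.
    doubleThenInc :: [Int] -> [Int]
    doubleThenInc xs = map (+ 1) (map (* 2) xs)

    -- Roughly what a supercompiler could derive: one traversal, no intermediate list.
    doubleThenInc' :: [Int] -> [Int]
    doubleThenInc' []       = []
    doubleThenInc' (x : xs) = (x * 2 + 1) : doubleThenInc' xs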

Create orphan instances or add spurious dependencies?

I'm working on updating my ReadArgs package. I had a request to add Arguable instances for Data.Text and FileSystem.Path.FilePath. The former is no big deal, since text is already in the Haskell Platform, but the latter requires a dependency on system-filepath.
So I could release a ReadArgs-ext package, chock full of orphan instances, or I could update the ReadArgs package with an additional external dependency. Which option makes more sense?
My usual rule of thumb is to tend towards adding the instances for packages that are in the Haskell Platform, but don't involve less portable elements such as graphics. This covers both filepath and text. Since you are already dealing with the outside world for command line arguments, neither one of those seems like a particularly egregious addition.
Orphans can lead to pretty terrible problems.
I don't use them in 95% of my packages, and I go out of my way to avoid packages that use them.
The two exceptions I have at this point are a few missing monoids in reducers and a package full of vector-instances I picked up because I wasn't willing to make my entire hierarchy of packages depend on vector, downgrading everything from Safe to Trustworthy.
I find when I'm tempted to add an orphan instance, I can usually work around it by providing some kind of WrappedMonad-like newtype wrapper for lifting or lowering another class.
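A minimal sketch of that newtype trick (the class below is invented to stand in for one defined in a package you don't own): rather than writing the orphan instance for the foreign type directly, wrap the type in a newtype you do own and give the instance to the wrapper.

    import qualified Data.Text as T

    -- Pretend this class lives in someone else's package; an instance for
    -- T.Text written here would then be an orphan.
    class Pretty a where
      pretty :: a -> String

    -- The newtype lives in this package, so the instance is not an orphan.
    newtype WrappedText = WrappedText { unwrapText :: T.Text }

    instance Pretty WrappedText where
      pretty = T.unpack . unwrapText

Callers pay the small cost of wrapping and unwrapping, but nobody downstream can be broken by a duplicate instance.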

Haskell module naming conventions

How should I name my Haskell modules for a program, not a library, and organize them in a hierarchy?
I'm making a ray tracer called Luminosity. First I had these modules:
Vector, Colour, Intersect, Trace, Render, Parse, Export
Each module was fine on its own, but I felt like this lacked organization.
First, I put every module under Luminosity, so for example Vector became Luminosity.Vector (I assume this is standard for a Haskell program?).
Then I thought: Vector and Colour are independent and could be reused, so they should be separated. But they're way too small to turn into libraries.
Where should they go? There are already Data.Vector and Data.Colour modules on Hackage, so should I put mine there? Or would that cause confusion (even if I import them grouped with my other local imports)? If not there, should it be Luminosity.Data.Vector or Data.Luminosity.Vector? I'm pretty sure I've seen both used, although maybe I just happened to look at projects with an unconventional structure.
I also have a simple TGA image exporter (Export) which can be independent from Luminosity. It appears the correct location would be Codec.Image.TGA, but again, should Luminosity be in there somewhere and if so, where?
It would be nice if Structure of a Haskell project or some other wiki explained this.
Unless your program is really big, don't organize the modules in a hierarchy. Why not? Because although computers are good at hierarchy, people aren't. People are good at meaningful names. If you choose good names you can easily handle 150 modules in a flat name space.
I felt like [a flat name space] lacked organization.
Hierarchical organization is not an end in itself. To justify splitting modules up into a hierarchy, you need a reason. Good reasons tend to have to do with information hiding or reuse. When you bring in information hiding, you are halfway to a library design, and when you are talking about reuse, you are effectively building a library. To morph a big program into "smaller program plus library" is a good strategy for software evolution, but it looks like you're just starting, and your program isn't yet big enough to evolve that way.
These issues are largely independent of the programming language you are using. I recommend reading some of David Parnas's work on product lines and program families, and also Matthias Blume's underappreciated paper Hierarchical Modularity. These works will give you some more concrete ideas about when hierarchy starts to serve a purpose.
First of all I put every module under Luminosity
I think this was a good move. It makes clear to anyone reading the code that these modules were made specifically for the Luminosity project.
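Concretely, the prefixed layout might look like this (file paths, exports and definitions are illustrative):

    -- src/Luminosity/Vector.hs
    module Luminosity.Vector
      ( Vector(..)
      , dot
      ) where

    data Vector = Vector !Double !Double !Double

    dot :: Vector -> Vector -> Double
    dot (Vector x1 y1 z1) (Vector x2 y2 z2) = x1 * x2 + y1 * y2 + z1 * z2

and elsewhere in the project, say in Luminosity.Trace, you would write import Luminosity.Vector (Vector(..), dot).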
If you write a module with the intent of simulating or improving upon an existing library, or of filling a gap where you believe a particular generic library is missing, then in that rare case, drop the prefix and name it generically. For an example of this, see how the pipes package exports Control.Monad.Trans.Free, because the author was, for whatever reason, not satisfied with existing implementations of Free monads.
Then I thought, Vector and Colour are pretty much independent and could be reused, so they should be separated. But they're way too small to separate off into a library (125 and 42 lines respectively). Where should they go?
If you don't make a separate library, then probably leave them at Luminosity.Vector and Luminosity.Colour. If you do make separate libraries, then try emailing the target audience of those libraries and see how other people think these libraries should be named and categorized. Whether or not you split these out into separate libraries is entirely up to you and how much benefit you think these separate libraries might provide for other people.

Expression trees vs IL.Emit for runtime code specialization

I recently learned that it is possible to generate C# code at runtime, and I would like to put this feature to use. I have code that does some very basic geometric calculations, like computing line-plane intersections, and many of these calculations are performed for the same plane or the same line over and over again. By generating code specialized for a particular plane or line, I think I should be able to gain some performance.
The problem is that I'm not sure where to begin. From reading a few blog posts and browsing MSDN documentation, I've come across two possible strategies for generating code at runtime: expression trees and IL.Emit. Using expression trees seems much easier, because there is no need to learn anything about OpCodes and various other MSIL-related intricacies, but I'm not sure whether expression trees are as fast as manually generated MSIL. So are there any suggestions on which method I should go with?
The performance of both is generally the same: expression trees are internally traversed and emitted as IL using the same underlying facilities you would be using yourself. It is theoretically possible to emit more efficient IL by hand with the low-level functions, but I doubt there would be any practically important performance gain. That would depend on the task, but I have not come across any practical case where hand-emitted IL beat the IL produced from expression trees.
I highly suggest getting the tool ILSpy, which decompiles CLR assemblies. With it you can look at the code that actually traverses expression trees and emits the IL.
Finally, a caveat. I have used expression trees in a language parser where function calls are bound to grammar rules that are compiled from a file at runtime. Compiled is the key word here: for many of the problems I came across, when what you want to achieve is known at compile time, you would not gain much from runtime code generation. Some CLR JIT optimizations might also be unavailable to dynamic code. This is only an opinion from my own practice, and your domain may be different, but if performance is critical I would sooner look at native, highly optimized libraries. Some of the work I have done would be unbearably slow without LAPACK/MKL. But that is advice you didn't ask for, so take it with a grain of salt.
If I were in your situation, I would try the alternatives from high level to low level, in order of increasing time and effort and decreasing reusability, and I would stop as soon as the performance is good enough for the time being, i.e.:
first, I'd check whether Math.NET, LAPACK or some similar numeric library already has the functionality I need, or whether I can adapt or extend its code to my needs;
second, I'd try Expression Trees;
third, I'd check the Roslyn project (even though it is still in prerelease);
fourth, I'd think about writing common routines with unsafe C code;
[fifth, I'd think about quitting and starting a new career in a different profession :) ],
and only if none of these worked out would I be desperate enough to try emitting IL at runtime.
But perhaps I'm biased against low level approaches; your expertise, experience and point of view might be different.
