Dead code and/or how to generate a cross reference from Haskell source

I've got some unused functionality in my codebase, but it's hard to identify. The code has evolved over the last year as I've explored the problem space and possible solutions. What I need to do is find that unused code so I can get rid of it. I'm happy if it deals with the problem on an exported-name basis, since GHC's warnings already cover non-exported unused code. Any tools specific to this task would be of interest.
However, I'm also curious about a comprehensive cross-referencing tool, since I could find the unused code myself with such a tool. Years ago, when I was working in C and assembler, I found that a good xref was a pretty handy tool, useful for many different purposes.
I'm getting nowhere with googling. Apparently, in Haskell, the dominant meaning of "cross-reference" is within literate programming, though maybe something there would be useful.

I don’t know of such a tool, so in the past I have done a bit of a hack instead.
If you have a comprehensive test suite, you can run it with GHC's code coverage instrumentation enabled: compile with -fhpc and use hpc markup to generate annotated source. This gives you the union of unused code and untested code, both of which you would probably like to address anyway.
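The workflow is roughly the following (exact flags vary a little between GHC and Cabal versions, so treat this as a sketch):

ghc -fhpc Main.hs -o myprog                  # or: cabal build --enable-coverage
./myprog                                     # running the instrumented program writes myprog.tix
hpc report myprog.tix                        # per-module coverage percentages
hpc markup myprog.tix --destdir=coverage     # annotated HTML; never-executed code is highlighted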
SourceGraph can give you a bunch of information which you may also find useful.

There is now a tool for this very purpose: https://hackage.haskell.org/package/weeder
It's been around since 2017, and while it has limitations, it definitely helps with large codebases.
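For what it's worth, recent weeder versions work from GHC's .hie files: you build with -fwrite-ide-info, declare your entry points in a weeder.toml, and run weeder; anything unreachable from the declared roots is reported. A minimal sketch of the config (the root pattern is just an example):

# weeder.toml
roots = [ "^Main.main$" ]
type-class-roots = true

Add ghc-options: -fwrite-ide-info to the package description, rebuild, and run weeder from the project root.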


Is it common to have example values for compile-time checking and where should they go in the code?

I have a reasonably complex structure of data types and records which, for someone not familiar with the codebase, is not easy to make sense of just by looking at the production code.
To make sense of it, I created these kinds of dummy values, which have two advantages: 1. they're checked at compile time, and 2. they serve as a kind of documentation, showcasing an example of how the overall structure of data types and records fits together:
-- benefit 1: a newcomer can quickly make sense of the type system
-- benefit 2: easier to keep track of how the types evolve because of compile-time checking
exampleValue1 :: ApiResponseContent
exampleValue1 = ApiOnlineResponseContent
  [ OnlineResultRow (EntityId 10) [Just (FvInt 1), Just (FvFloat 1.5), Nothing]
  , OnlineResultRow (EntityId 20) [Just (FvInt 2), Nothing, Just (FvBool True)]
  ]
The only thing that bothers me a bit is that they feel a bit awkward sitting in the production code, as they're clearly dead code. However, they're not tests either; they're just compile-time-checked examples of how values can be assembled from complex nested types. So they clearly don't belong in the production code, but they don't quite belong in the tests either.
Is it common practice to have this kind of compile-time examples? And where should they be placed within the codebase?
Have you considered including the examples as part of Haddock documentation?
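For instance, something along these lines; the types here are a cut-down, hypothetical version of the ones in the question, with the example value shown in a Haddock code block on the type it illustrates:

module Api.Types where  -- hypothetical module name

newtype EntityId = EntityId Int
data FieldValue = FvInt Int | FvFloat Double | FvBool Bool
data OnlineResultRow = OnlineResultRow EntityId [Maybe FieldValue]

-- | Content of an API response.
--
-- A typical value looks like:
--
-- @
-- ApiOnlineResponseContent
--   [ OnlineResultRow (EntityId 10) [Just (FvInt 1), Nothing]
--   , OnlineResultRow (EntityId 20) [Just (FvBool True)]
--   ]
-- @
newtype ApiResponseContent = ApiOnlineResponseContent [OnlineResultRow]

The caveat is that a plain @ block is rendered but not compile-checked; pairing it with doctest, or keeping a named example value alongside it, restores the compile-time guarantee the question cares about.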
Otherwise, there's a school of thought that tries to reframe tests as examples. You can find a brief mention of this in Gerard Meszaros's work on unit testing. Dan North has also repeatedly used similar language, but it can often be difficult to track down his many iterations of such ideas. He tends to 'think in public' - here's one such reflection:
"I call this development, using example-guided design"
I understand that the kind of example being asked about isn't quite like that. Still, I think that it better belongs with test code than with production code.
If you put the examples in the production code, you essentially make it part of the API of your library (if you're shipping a library). This means that changing the examples would constitute a breaking change. That doesn't seem right to me.
In the BDD/DDD community, there's a lot of emphasis on tests as examples, also in the sense that automated tests serve as documentation. If Haddock documentation isn't an option, I'd consider putting the examples in the test code, as documentation. I sometimes do that by simply putting such 'vacuous tests' in a file called Examples, perhaps with a little comment at the top to explain that the code in that file assists learning rather than verifies behaviour.
It dodges the risk of introducing redundant breaking changes in the production code, and seems conceptually like a better fit.
I disagree that these aren't tests. Even “does this example compile” could be seen as a test, but probably you could also use them to actually test some functions, while you're at it.
So, put these definitions in your test suite, and use them for unit testing the functions that will actually be dealing with such values.
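A hedged sketch of what that could look like with hspec, assuming the question's types and exampleValue1 are exported from some production or examples module (the module name here is made up):

module ApiResponseSpec (spec) where

import Test.Hspec
import Api.Types (ApiResponseContent (..), exampleValue1)  -- hypothetical module and exports

spec :: Spec
spec = describe "example API responses" $
  it "exampleValue1 has two result rows" $
    case exampleValue1 of
      ApiOnlineResponseContent rows -> length rows `shouldBe` 2

Even a minimal assertion like this keeps the example compiling while giving it a home in the test suite rather than in the production code.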
The main convention I’ve seen for housing such examples is to include ….Tutorial modules, such as Dhall.Tutorial and Clash.Tutorial, or parallel …-tutorial packages such as lens-tutorial.
These contain Haddock documentation, organised so as to read in linear order, alongside example definitions like yours.
With good examples, I find this convention very helpful for understanding and experimenting with a new package in addition to types and reference docs. It’s also fairly discoverable when browsing packages on Hackage, especially if you ensure that the reference and README link to the tutorial for longer explanations.
These modules can be just compiled as basic validation, but having more machine validation for docs is incredibly valuable, so I think it’s even better to link them into the test suite (e.g. hspec) as examples for unit tests (HUnit) or property tests (QuickCheck/hedgehog), or organise them as documentation tests (doctest).
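For example, a doctest-checked example in a tutorial (or any other) module looks roughly like this; the function is invented purely for illustration:

-- | Sum of squares.
--
-- >>> sumSquares [1, 2, 3]
-- 14
sumSquares :: [Int] -> Int
sumSquares = sum . map (^ 2)

Running doctest over the module extracts the >>> lines, evaluates them in GHCi, and fails if the output no longer matches the documentation.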
In addition to GHC’s code coverage tools, weeder can help identify examples that aren’t being tested.

How to create a language these days?

I need to get around to writing that programming language I've been meaning to write. How do you kids do it these days? I've been out of the loop for over a decade; are you doing it any differently now than we did back in the pre-internet, pre-windows days? You know, back when "real" coders coded in C, used the command line, and quibbled over which shell was superior?
Just to clarify, I mean, not how do you DESIGN a language (that I can figure out fairly easily) but how do you build the compiler and standard libraries and so forth? What tools do you kids use these days?
One consideration that's new since the punched card era is the existence of virtual machines already bountifully provided with "standard libraries." Targeting the JVM or the .NET CLR instead of ye olde "language walled garden" saves you a lot of bootstrapping. If you're creating a compiled language, you may also find Java byte code or MSIL an easier compile target than machine code (of course, if you're in this for the fun of creating a tight optimising compiler then you'll see this as a bug rather than a feature).
On the negative side, the idioms of the JVM or CLR may not be what you want for your language, so you may still end up building "standard libraries" just to provide idiomatic interfaces over the platform facilities. (An example is that every language and its dog seems to provide its own method for writing to the console, rather than leaving users to manually call System.out.println or Console.WriteLine.) Nevertheless, it enables incremental development of the idiomatic libraries, and means that the more obscure libraries for which you never get round to building idiomatic interfaces are still accessible, even if only in an ugly way.
If you're considering an interpreted language, .NET also has support for efficient interpretation via the Dynamic Language Runtime (DLR). (I don't know if there's an equivalent for the JVM.) This should help free you up to focus on the language design without having to worry so much about the optimisation of the interpreter.
I've written two compilers now in Haskell for small domain-specific languages, and have found it to be an incredibly productive experience. The parsec library makes playing with syntax easy, and interpreters are very simple to write over a Haskell data structure. There is a description of writing a Lisp interpreter in Haskell that I found helpful.
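To give a feel for it, here is a minimal sketch (not taken from the answer's actual compilers) of a parsec parser plus interpreter for a toy addition-only language:

import Text.Parsec
import Text.Parsec.String (Parser)

-- Toy expression language: integer literals combined with '+'.
data Expr = Lit Int | Add Expr Expr
  deriving Show

literal :: Parser Expr
literal = Lit . read <$> many1 digit

expr :: Parser Expr
expr = foldl1 Add <$> literal `sepBy1` (spaces *> char '+' <* spaces)

-- The interpreter is a plain fold over the Haskell data structure.
eval :: Expr -> Int
eval (Lit n)   = n
eval (Add a b) = eval a + eval b

main :: IO ()
main = print (fmap eval (parse expr "<demo>" "1 + 2 + 3"))  -- Right 6

Changing the syntax is then mostly a matter of editing expr, which is what makes experimenting with it so cheap.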
If you are interested in a high-performance backend, I recommend LLVM. It has a concise and elegant byte-code and the best x86/amd64 generating backend you can find. There is an optional garbage collector, and some experimental backends that target the JVM and CLR.
You can write a compiler in any language that produces LLVM bytecode. If you are adventurous enough to learn Haskell but want LLVM, there is a set of Haskell LLVM bindings.
What has changed considerably but hasn't been mentioned yet is IDE support and interoperability:
Nowadays we pretty much expect IntelliSense, step-by-step execution and state inspection "right in the editor window", new types that tell the debugger how to treat them, and rather helpful diagnostic messages. The old "compile .x -> .y" executable is not enough to create a language anymore. The environment isn't something to focus on first, but it affects willingness to adopt.
Also, libraries have become much more powerful; no one wants to implement all of that in yet another language. Try to borrow: make it easy to call existing code, and make it easy to be called by other code.
Targeting a VM - as itowlson suggested - is probably a good way to get started. If that turns out to be a problem, it can still be replaced by a native compiler later.
I'm pretty sure you do what's always been done.
Write some code, and show your results to the world.
As compared to the olden times, there are some tools to make your job easier though. Might I suggest ANTLR for parsing your language grammar?
Speaking as someone who just built a very simple assembly-like language and interpreter, I'd start out with the .NET framework or similar. Nothing can beat the powerful syntax of C# plus the backing of the entire .NET community when attempting to write most things. From there I designed a simple bytecode format and assembly syntax and proceeded to write my interpreter and assembler.
Like I said, it was a very simple language.
You should not accept wimpy solutions like using the latest tools. You should bootstrap the language by writing a minimal compiler in Visual Basic for Applications or a similar language, then write all the compilation tools in your new language and then self-compile it using only the language itself.
Also, what is the proposed name of the language?
I think recently there have not been languages with ALL CAPITAL LETTER names like COBOL and FORTRAN, so I hope you will call it something like MIKELANG with all capital letters.
Not so much an implementation as a design decision that affects implementation: if you make every statement of your language have a unique parse tree without context, you'll get something for which it's easy to hand-code a parser, and which doesn't require large amounts of work to provide syntax highlighting for. Similarly, simple things like using a different symbol for module namespaces and object namespaces (unlike Java, which uses . for both package and class namespaces) mean you can parse the code without loading every module it refers to.
Standard libraries - include the equivalent of everything in the C99 standard library other than setjmp. Add whatever else you need for your domain. Work out an easy way to do this, either something like SWIG or an in-line FFI such as Ruby's [can't remember module name] and Python's ctypes.
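As a point of comparison for what an in-line FFI can look like, here is a minimal sketch using Haskell's built-in foreign import mechanism to call libc's sin with no binding generator involved:

{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Types (CDouble)

-- Bind directly to the C function; no wrapper code is generated or written by hand.
foreign import ccall "math.h sin"
  c_sin :: CDouble -> CDouble

main :: IO ()
main = print (c_sin (pi / 2))  -- 1.0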
Building as much of the language in the language itself is an option, but projects that start out doing that either give up (Rubinius moved to using C++ for parts of its standard library) or exist only for research purposes (Mozilla Narcissus).
I am actually a kid, haha. I've never written an actual compiler before or designed a language, but I have finished The Red Dragon Book, so I suppose I have somewhat of an idea (I hope).
It would depend firstly on the grammar. If it's LR or LALR I suppose tools like Bison/Flex would work well. If it's more LL, I'd use Spirit, which is a component of Boost. It allows you to write the language's grammar in C++ in an EBNF-like syntax, so no muddling around with code generators; the C++ compiler compiles the grammar for you. If any of these fail, I'd write an EBNF grammar on paper, and then proceed to do some heavy recursive descent parsing, which seems to work; if C++ can be parsed pretty well using RDP (as GCC does it), then I suppose with enough unit tests and patience you could write entire compilers using RDP.
Once I have a parser running and some sort of intermediate representation, it then depends on how it runs. If it's some bytecode or native code compiler, I'll use LLVM or libJIT to process it. LLVM is more suited for general compilation, but I like the libJIT API and documentation better. Alternatively, if I'm really lazy, I'll generate C code and let GCC do the actual compilation. Another alternative is to target an existing VM, like Parrot or the JVM or the CLR. Parrot is the VM being designed for Perl. If it's just an interpreter, I'll walk the syntax tree.
A radical alternative is to use Prolog, which has syntax features which remarkably simulate EBNF. I have no experience with it though, and if I am not wrong (which I am almost certainly going to be), Prolog would be quite slow if used to parse heavy duty programming languages with a lot of syntactical constructs and quirks (read: C++ and Perl).
All this I'll do in C++, if only because I am more used to writing in it than C. I'd stay away from Java/Python or anything of that sort for the actual production code (writing compilers in C/C++ helps to make them portable), but I could see myself using them as a prototyping language, especially Python, which I am partial towards. Of course, I've never actually done any of this before, so I'm not one to say.
On lambda-the-ultimate there's a link to Create Your Own Programming Language by Marc-André Cournoyer, which appears to describe how to leverage some modern tools for creating little languages.
Just to clarify, I mean, not how do you DESIGN a language (that I can figure out fairly easily)
Just a hint: look at some quite different languages first, before designing a new language (i.e. languages with a very different evaluation strategy). Haskell and Oz come to mind, though you should also know Prolog and Scheme. A year ago I too thought "hey, let's design a language that behaves exactly as I want", but fortunately I looked at those other languages first (or you could also say unfortunately, because now I don't know how I want a language to behave anymore...).
Before you start creating a language you should read this:
Hanspeter Moessenboeck, The Art of Niklaus Wirth
ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe00b.pdf
There's a big shortcut to implementing a language that I don't see in the other answers here. If you use one of Łukasiewicz's "unparenthesized" forms (i.e. forward Polish or reverse Polish) you don't need a parser at all! With reverse Polish, the dependencies go right-to-left, so you simply execute each token as it's scanned. With forward Polish, it's the reverse of that, so you actually execute the program "backwards", simplifying subexpressions until reaching the starting token.
To understand why this works, you should investigate the three primary tree-traversal algorithms: pre-order, in-order, post-order. These three traversals are the inverse of the parsing task that a language reader (i.e. parser) has to perform. Only the in-order notation "requires" a recursive descent to reconstruct the expression tree. With the other two, you can get away with just a stack.
This may require more "thinking" and less "implementing".
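To make that concrete, here is a small sketch of a reverse Polish evaluator in Haskell: each token is executed as it is scanned, and a plain stack does the work a parse tree would otherwise do:

import Text.Read (readMaybe)

-- Evaluate a reverse Polish expression, e.g. "3 4 + 2 *" => 14.
evalRPN :: String -> Maybe Double
evalRPN = go [] . words
  where
    go [x]     []       = Just x
    go (y:x:s) ("+":ts) = go (x + y : s) ts
    go (y:x:s) ("-":ts) = go (x - y : s) ts
    go (y:x:s) ("*":ts) = go (x * y : s) ts
    go (y:x:s) ("/":ts) = go (x / y : s) ts
    go s       (t:ts)   = case readMaybe t of   -- anything else is a numeric literal to push
                            Just n  -> go (n : s) ts
                            Nothing -> Nothing   -- unknown token
    go _       _        = Nothing                -- leftover operands or empty input

main :: IO ()
main = print (evalRPN "3 4 + 2 *")  -- Just 14.0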
BTW, if you've already found an answer (this question is a year old), you can post that and accept it.
Real coders still code in C. Just that it's a little sharper.
Hmmm... language design? or writing a compiler?
If you want to write a compiler, you'd use Flex + Bison. (google)
Not an easy answer, but..
You essentially want to define a set of rules written in text (tokens) and then some parser that checks these rules and assembles them into fragments.
http://www.mactech.com/articles/mactech/Vol.16/16.07/UsingFlexandBison/
People can spend years on this. The above article talks about using two tools (Flex and Bison) that can be used to turn text into code you can feed to a compiler.
First I spent a year or so actually thinking about how the language should look. At the same time I helped develop Ioke (www.ioke.org) to learn about language internals.
I have chosen Objective-C as the implementation platform, as it's a fast (enough), simple, and rich language. It also provides a test framework, so an agile approach is a go, and it has a rich standard library I can build upon.
Since my language is simple on syntactic level (no keywords, only literals, operators and messages) I could go with Ragel (http://www.complang.org/ragel/) for building scanner. It's fast as hell and simple to use.
Now I have a working object model, scanner and simple operator shuffling, plus standard library bootstrap code. I can even run simple programs - as long as they fit in one file, that is :)
Of course older techniques are still common (e.g. using Flex and Bison), but many newer language implementations combine the lexing and parsing phases by using a parser based on a parsing expression grammar (PEG). This works for recursive descent parsers created using combinators, or for memoizing packrat parsers. Many compilers are also built using the ANTLR framework.
Use Bison/Flex, which are the GNU versions of yacc/lex. This book is extremely helpful.
The reason to use Bison is that it catches any conflicts in the language. I used it and it made my life much easier (OK, so I'm on my second year, but the first six months were a few years ago, writing it in C++, and the parsing/conflicts/results were terrible! :( ).
If you want to write a compiler obviously you need to read the Dragon Book ;)
Here is another good book that I have just read. It is practical and easier to understand than the Dragon Book:
http://www.amazon.co.uk/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=language+implementation+patterns&x=0&y=0
Mike --
If you're interested in an efficient native-code-generating compiler for Windows so you can get your bearings -- without wading through all the unnecessary widgets, gadgets, and other nonsense that clutter today's machines -- I recommend the Osmosian Order's Plain English development system. It includes a unique interface, a simplified file manager, a friendly text editor, a handy hexadecimal dumper, the compiler/linker (of course), and a wysiwyg page-layout application for documentation. Written entirely in Plain English, it is a quick download (less than a megabyte), small enough to understand in short order (about 25,000 lines of Plain English code, with just 4,000 in the compiler/linker), yet powerful enough to reproduce itself on a bottom-of-the-line Dell in less than three seconds. Really: three seconds. And it's free to all who write and ask for a copy, including the source code and a rather humorous tongue-in-cheek 100-page manual. See www.osmosian.com for details on how to get a copy, or write to me directly with questions or comments: Gerry.Rzeppa#pobox.com

Resources for learning a new language quickly?

The title may seem slightly self-contradictory, and I accept that you can't really learn a language quickly. However, an experienced programmer who already knows a few languages and different styles (functional, OO, imperative, etc.) often wants to get started quickly. I've seen a few websites doing effective "translations" in the form of "just show me syntax equivalence". I can't remember the sites now, but for related languages (e.g. Perl/PHP) it's quite common.
Is there a better resource that covers more languages? Is there a resource that covers idioms as well as syntax? I think this would be incredibly useful for doing small amounts of work on existing code bases where you are not familiar with the language. Looking at the existing code, as we know, is not always a good indicator of quality. Likewise, for a "learn by doing" weekend project I always have the urge to write reasonably idiomatic, clean code from the start. Such a resource could also link to known good example projects of varying sizes for those who prefer to learn by reading. Reading a well-written medium-sized code base can also be much more practical when access to development environments might be limited.
I think it's possible to find tutorials and summaries for individual languages that provide some of this functionality in disparate web locations but I'm hoping there is a good, centralised, comparative place that the busy programmer can turn to.
You generally have two main things to overcome:
Syntax
Reference
Syntax you can pick up fairly quickly with a language tutorial and a stack of sample code.
Reference (library/API calls) you need to find a proper guide to; perhaps the language reference, or perhaps google...
With those two in place, following a walkthrough (to get you used to using the development environment) will have you pretty much ready - you'll be able to look up what you want to say (reference), and know how to say it (syntax).
This, of course, applies principally to procedural/oop languages; languages that require a paradigm switch (ML/Haskell) you should go to lectures for ;)
(and for the weirder moments, there's SO!)
In the past my favourite was "learning by doing". So, for example, I knew a little bit of C++ and a lot of C#/.NET, but I had to write an FTP tool in Python.
So I sat for an hour or so over the syntax differences with a tutorial, then I developed the form itself and looked at the generated code. Then I searched for an open-source Python FTP client and took pieces of code (not copy and paste - write it yourself to see, feel and remember the code!).
After a few hours I got it.
So: the mix is the best. A book, a piece of good code, the willingness to learn, and a free night with plenty of coffee.
At the risk of sounding cheesy, I would start with the language's website tutorial and/or FAQ, followed by asking more specific questions here. SO is my centralized location for programming knowledge.
I remember when I learned Perl. I was asked to modify some Perl code at work and I'd never seen the language before. I had experience with several other languages, however, so it wasn't hard to figure out the syntax with the online Perl docs in one window and the code in another, side-by-side. I don't know that solely reading existing code is necessarily the best way to learn. In my case, I didn't know Perl but I could tell that the person who originally wrote the code didn't know Perl either. I'm not sure I could've distinguished between good Perl and really confusing Perl. It would've been nice to be able to ask questions here at the time.
Language isn't important. What is important is learning your way around designing algorithms and the proper application of design patterns. Focus on the technique, not the language that implements a certain technique. Once you understand the proper development techniques, any programming language will become easy, no matter how obscure it is...
When you put a focus on a language, you're restricting your own knowledge.
http://devcheatsheet.com/ seems to be a step in the right direction: it aggregates cheat sheets/quick references and they are (somewhat) manually reviewed. It's also wide-ranging. It still comes up short a bit in terms of "idiomatic" quick reference: for example, the page on Ruby doesn't mention yield.
Rosetta Code appears to be an excellent resource that includes hints on coding idiomatically and moves from simple (like for-loops) to things like drawing. I haven't checked out how comprehensive it is, but there are a large number of languages and tasks listed. The drawbacks re: original question are:
Some of the linking is not accurate (navigating Python -> ForLoop will take you to the top of the ForLoop page, not the Python section); it's a wiki, so this can be improved. Ideally you could "slice" the wiki however you chose, to see e.g. the top 20 tasks for two languages side-by-side.
http://hyperpolyglot.org/ seems to be an almost perfect match for what I was looking for. The quality is not always there, or idiom can be lacking, but it has the same intention and is pretty comprehensive.

Should Programmers Use Decompilers?

Lately I've been listening to Jeff Atwood and Joel Spolsky's radio show, and they have been talking about dogfooding (the process of reusing your own code; see Jeff Atwood's blog post). So my question is: should programmers use decompilers to see how another programmer's code is implemented and works, to make sure it won't break your code? Or should you just trust that programmer's code and adapt to it, because using decompilers goes against everything we as programmers have ever learned about hiding data (well, OO programmers at least)?
Note: I wasn't sure which tags this would go under so feel free to retag it.
Edit: Just to clarify I was asking about decompilers as a last resort, say you can't get the source code for some reason. Sorry, I should have supplied this in the original question.
Yes, it can be useful to use the output of a decompiler, but not for what you suggest. The output of a compiler doesn't ever look much like what a human would write (except when it does). It can't tell you why the code does what it does, or what a particular variable should mean. It's unlikely to be worth the trouble to do this unless you already have the source.
If you do have the source, then there are lots of good reasons to use a decompiler in your development process.
Most often, the reason for using the output of a decompiler is to better optimize code. Sometimes, with high optimization settings, a compiler will just get it wrong. This can be almost impossible to sort out in some cases without comparing the output of the compiler at different levels of optimization.
Other times, when trying to squeeze the most performance out of a very hot code path, a developer can try arranging their code in a few different ways and compare the compiled results. As a last resort, this may be the simplest way to start when implementing a code block in assembly language, by duplicating the compiler's output.
Dogfooding is the process of using the code that you write, not necessarily re-using code.
However, code reuse typically means you have the source - hence 'code reuse'; otherwise it's just using a library supplied by someone else.
Decompiling is hard to get right, and the output is typically very hard to follow.
You should use a decompiler if it is the tool that's required to get the job done. However, I don't think it's the proper use of a decompiler to get an idea of how well the code which is being decompiled was written. Depending on the language you use, the decompiled code can be very different from the code which was actually written. If you want to see some real code, look at open source code. If you want to see the code of some particular product, it's probably better to try to get access to the actual code through some legal means.
I'm not sure what exactly it is you are asking, what you expect "decompilers" to show you, what this has to do with Atwood and Spolsky, or what the question is exactly. If you're programming to public interfaces, then why would you need to see the original source of the third-party code to see if it will "break" your code? You could more effectively build tests in order to determine this. As well, what the "decompiler" will tell you largely depends on the language/platform the software was written in, whether it is Java, .NET, C, and so forth. It's not the same as having the original source to read, even in the case of .NET assemblies. Anyway, if you are worried about third-party code not working for you, then you should really be doing typical kinds of unit tests against the code rather than trying to "decompile" it. As far as whether you "should": if you mean "should" in some sense other than what would be the best use of your time, then I'm not sure what you mean.
Should Programmers Use Decompilers?
Use the right tool for the right job. Decompilers don't often produce results that are easy to understand, but sometimes they are what's needed.
should programmers use decompilers to see how that programmer's code is implemented and works, to make sure it won't break your code.
No, not unless you find a problem and need support. In general you don't use it if you don't trust it, and if you have to use it even when you don't trust it, you develop tests to prove the functionality and verify that later upgrades still work as expected.
Don't use functionality you don't test, unless you have very good support or a relationship of trust.
-Adam
Or should you just trust that programmers code and adapt to it because using decompilers go against everything we as programmers have ever learn about hiding data (well OO programmers at least)?
This is not true at all. You would use a decompiler not because you want to get around any sort of abstraction or encapsulation, or to defeat OO principles, but because you want to better understand why the code is behaving the way it is.
Sometimes you need to use a decompiler (or in the Java world, a bytecode viewer) when you are troubleshooting an annoying bug with a 3rd party library where an exception is thrown with no useful error message, no logging, etc.
Use of a decompiler has nothing to do with OO principles.
The short answer to this... Program to a public and documented specification, not to an implementation. Relying on implementation specifics and side-effects will burn you.
Decompilation is not a tool to help you program correctly, though it might, in a pinch, assist you in understanding a problem with someone else's code for which you don't have source.
Also, beware of the possible legal risk of decompiling; many software companies have no-decompile clauses which could expose you and your employer to legal consequences.

Good APIs for scope analyzers

I'm working on some code generation tools, and a lot of complexity comes from doing scope analysis.
I frequently find myself wanting to know things like
What are the free variables of a function or block?
Where is this symbol declared?
What does this declaration mask?
Does this usage of a symbol potentially occur before initialization?
Does this variable potentially escape?
and I think it's time to rethink my scoping kludge.
I can do all this analysis but am trying to figure out a way to structure APIs so that it's easy to use, and ideally, possible to do enough of this work lazily.
What tools like this are people familiar with, and what did they do right and wrong in their APIs?
I'm a bit surprised at the question, as I've done tons of code generation and the question of scoping rarely comes up (except occasionally the desire to generate unique names).
To answer your example questions requires serious program analysis well beyond scoping. Escape analysis by itself is nontrivial. Use-before-initialization can be trivial or nontrivial depending on the target language.
In my experience, APIs for program analysis are difficult to design and frequently language-specific. If you're targeting a low-level language you might learn something useful from the Machine SUIF APIs.
In your place I would be tempted to steal someone else's framework for program analysis. George Necula and his students built CIL, which seems to be the current standard for analyzing C code. Laurie Hendren's group have built some nice tools for analyzing Java.
If I had to roll my own I'd worry less about APIs and more about a really good representation for abstract-syntax trees.
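For the free-variable question at least, the analysis has a compact recursive form over the AST. Here is a minimal sketch over a toy lambda-calculus representation (the constructors are invented for illustration; a real code generator's AST would be richer):

import Data.Set (Set)
import qualified Data.Set as Set

-- A toy AST standing in for whatever representation the code generator uses.
data Expr
  = Var String
  | Lam String Expr
  | App Expr Expr

-- Free variables: walk the tree, dropping a name wherever its binder is crossed.
freeVars :: Expr -> Set String
freeVars (Var x)   = Set.singleton x
freeVars (Lam x e) = Set.delete x (freeVars e)
freeVars (App f a) = Set.union (freeVars f) (freeVars a)

main :: IO ()
main = print (freeVars (Lam "x" (App (Var "x") (Var "y"))))  -- fromList ["y"]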
In the very limited domain of dataflow analysis (which includes the uninitialized-variable question), João Dias and I have adapted some nice work by Sorin Lerner, David Grove, and Craig Chambers. Only our preliminary results are published.
Finally if you want to generate code in multiple languages this is a complete can of worms. I have done it badly several times. If you create something you like, publish it!
