What would change if a JVM Language Compilation process had an STG phase like Haskell?

I had a friend say:
For me the most interesting thing about Haskell is not the language and the types. It is the Spineless Tagless Graph Machine behind it.
Because Haskell people talk about types all the time, this quote really caught my attention. Now we can look at the Haskell compilation process like this:
Parsing
Type checking
Desugaring + a few bits and bobs
Translation to core
Lion's share of optimization
Translation to STG language
STG language to C-- (Cmm)
C-- to assembly or LLVM
Which we can simplify down to:
.. front end stuff ..
Translate IL to STG language
Compile STG language to C/ASM/LLVM/Javascript
I.e. there is an intermediate 'graph language' that Haskell is compiled to, and various optimisations happen there, prior to it being compiled to LLVM/C etc.
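(To make that picture concrete, here is a minimal sketch of how those intermediate stages can be inspected with GHC itself; the module is made up for illustration, and the exact dump-flag spellings vary between GHC versions, so treat the command in the comments as indicative rather than definitive.)

    -- Example.hs: a tiny module whose intermediate forms you can inspect.
    -- A hedged sketch: compile with something like
    --
    --     ghc -O2 -fforce-recomp -ddump-simpl -ddump-stg-final Example.hs
    --
    -- to print the optimised Core and the final STG for this definition.
    -- (Dump-flag names vary across GHC versions; older releases spell the
    -- STG dump -ddump-stg.)
    module Example where

    -- A small, lazy, higher-order definition; the Core/STG dumps show how
    -- GHC turns it into explicit closures, thunks and case expressions.
    sumSquares :: Int -> Int
    sumSquares n = sum (map (\x -> x * x) [1 .. n])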
This contrasts to a potential JVM Language compilation process that looks like this:
Convert JVM Language Code to Java bytecode inside a class.
Run the Bytecode on a Java Virtual Machine.
Assuming it were possible to add an intermediate STG compilation step to the Java compilation process, I'm wondering what impact this change would have. What would change about the compiled code?
(I'm aware that you need a pure functional language to get the most use out of the Spineless Tagless G-machine, so if it is helpful to answer the question, assume we're compiling Frege [Haskell for the JVM].)
My question is: What would change if the JVM Language Compilation process had an STG phase like Haskell?

You need to clarify if you mean Java the language or some language running on the JVM.
My knowledge of Java the language is limited to having read the specification, and I know nothing about the Haskell IR you're talking about. However, Java is, by spec, a dynamic language, and it would be illegal to perform any ahead-of-time (AOT) transformation that relies on information from outside the class file being compiled.
Of course, a project that doesn't rely on those dynamic features could choose to break these rules.

Related

GHC internals: is there C implementation of the type system?

I'm looking into the internals of GHC and I find all the parsing and the type system written completely in Haskell. The low-level core of the language is provided by the RTS. The question is: which one of the following is true?
The RTS contains a C implementation of the type system and other basic parts of Haskell (I didn't find it; the RTS is mainly GC and threading).
Everything is implemented in Haskell itself. But that seems quite tricky, because building GHC already requires GHC.
Could you explain the development logic of the compiler? For example, Python's internals provide an opaque implementation of everything in C.
As others have noted in the comments, GHC is written almost entirely
in Haskell (plus select GHC extensions) and is intended to be compiled with itself. In fact, the only program in the world that can compile the GHC compiler is the GHC compiler! In particular,
parsing and type inference are implemented in Haskell code, and you
won't find a C implementation hidden in there anywhere.
The best source for understanding the internal structure of the
compiler (and what's implemented how) is the GHC Developer Wiki
and specifically the "GHC Commentary" link. If you have a fair bit of spare time, the video
series from the
Portland 2006 GHC Hackathon is absolutely fascinating.
Note that the idea of a compiler being written in the language it
compiles is not unusual. Many compilers are "self-hosting" meaning
that they are written in the language they compile and are intended to
compile themselves. See, for example, this question on another Stack
Exchange sister site: Why are self-hosting compilers considered a
rite of passage for new languages?, or simply Google for
"self-hosting compiler".
As you say, this is "tricky", because you need a way to get the
process started. Some approaches are:
You can write the first compiler in a different language that
already has a compiler (or write it in assembly language); then,
once you have a running compiler, you can port it to the same
language it compiles. According to this Quora answer, the
first C compiler was written this way. It was written in "NewB"
whose compiler was written in "B", a self-hosting compiler that
had originally been written in assembly and then rewritten in
itself.
If the language is popular enough to have another compiler, write
the compiler in its own language and compile it in phases, first
with the other compiler, then with itself (as compiled by the
other compiler), then again with itself (as compiled by itself).
The last two compiler executables can be compared as a sort of
massive test that the compiler is correct. The GNU C Compiler can
be compiled this way (and this certainly used to be the standard way to install it from source, using the vendor's [inferior!] C compiler to get started).
If an interpreter written in another language already exists or is
easy to write, the compiler can be run by the interpreter to
compile its own source code, and thereafter the compiled compiler
can be used to compile itself. The first LISP compiler is
claimed to be the first compiler to bootstrap itself this way.
The bootstrapping process can often be simplified by writing the compiler (at least initially) in a restricted core of the language, even though the compiler itself is capable of compiling the full language. Then, a sub-par existing compiler or a simplified bootstrapping compiler or interpreter can get the process started.
According to the Wikipedia entry for GHC, the original GHC compiler was written in 1989 in Lazy ML, then rewritten in Haskell later the same year. These days, new versions of GHC with all their shiny new features are compiled on older versions of GHC.
The situation for the Python interpreter is a little different. An
interpreter can be written in the language it interprets, of course,
and there are many examples in the Lisp world of writing Lisp
interpreters in Lisp (for fun, or in developing a new Lisp dialect, or
because you're inventing Lisp), but it can't be interpreters all
the way down, so eventually you'd need either a compiler or an
interpreter implemented in another language. As a result, most
interpreters aren't self-hosting: the mainstream interpreters for
Python, Ruby, and PHP are written in C. (Though, PyPy is an alternate
implementation of the Python interpreter that's written in Python,
so...)

GHC pipeline: Core, STG - ASTs or text?

In the pipeline of GHC there is a stage of translating Haskell source code to Core and then (not necessarily as an immediate next step) translating Core to STG.
However, one issue escapes my understanding: when do we have "normal" code (i.e. plain text), and when do we have something actually living in memory, like abstract syntax trees (ASTs)?
And to make my question a bit more precise, I'll divide it into parts:
1) in the parsing of a Haskell source file, do we immediately construct ASTs of the Core language? If not, then it seems to me that we have to construct ASTs of full Haskell (which seems strange) and then either transform them into ASTs of Core, or first into a textual representation in Core and invoke parsing again to obtain Core's ASTs.
2) almost the same question applies to the Core to STG transition (but in this case I think I can assume that what we have is Core's ASTs - correct?)
The Haskell source is first parsed into an AST of full Haskell, which is then typechecked.
From then on, it gets desugared to Core, translated to STG, and from there to Cmm and then to either assembly or LLVM code. All these phases are built on ASTs; there is no textual representation at any of these stages until the assembly/LLVM code, which is then written to a file and compiled using external tools.
It’s not strange to have an AST of full Haskell. In fact, it is a requirement to give type errors in terms of the code the user wrote, instead of detecting type errors only at the level of Core.
You can find the AST for Haskell in the HsSyn modules and the AST of Core in CoreSyn.
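To illustrate the "ASTs, not text" point, here is a hedged, much-simplified sketch of what a Core-like representation could look like as a Haskell data type; GHC's real types in CoreSyn are far richer, and the pass below is only a toy:

    -- Illustrative sketch only: GHC's real Core type (see CoreSyn, or
    -- GHC.Core in recent releases) is richer than this, but the idea is
    -- the same: every stage is an in-memory data type, never text.
    module MiniCore where

    type Var = String

    data Expr
      = EVar Var             -- variables
      | ELit Integer         -- literals (just integers here)
      | EApp Expr Expr       -- application
      | ELam Var Expr        -- lambda abstraction
      | ELet Var Expr Expr   -- non-recursive let binding
      deriving Show

    -- A compiler phase is then just a function between such trees,
    -- e.g. a toy pass that drops lets whose bound variable is never used:
    dropDeadLet :: Expr -> Expr
    dropDeadLet (ELet v e body)
      | v `occursIn` body = ELet v (dropDeadLet e) (dropDeadLet body)
      | otherwise         = dropDeadLet body
    dropDeadLet (EApp f x) = EApp (dropDeadLet f) (dropDeadLet x)
    dropDeadLet (ELam v b) = ELam v (dropDeadLet b)
    dropDeadLet e          = e

    occursIn :: Var -> Expr -> Bool
    occursIn v (EVar w)     = v == w
    occursIn _ (ELit _)     = False
    occursIn v (EApp f x)   = occursIn v f || occursIn v x
    occursIn v (ELam w b)   = v /= w && occursIn v b
    occursIn v (ELet w e b) = occursIn v e || (v /= w && occursIn v b)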

Are features of programming languages a concept in semantics, syntax or something else?

When talking about features of programming languages, such as in Programming Language Comparison and D Language Feature Comparison Table, I was wondering what aspect of languages the concept "features" belong to or are discussed under?
Semantics,
syntax
or something else?
Thanks and regards!
This is just a gut feeling, I'm not a language theory guy or anything. I'd say adding a feature to a programming language means both
adding semantics for certain circumstances or constructions (e.g. "Is-expressions return a boolean according to whether the type of a template argument matches some type according to the following fifty rules: ...")
defining a syntax that belongs to it (e.g. adding IsExpr : "is" "(" someKindOfExpression ")" in the grammar)
It depends entirely on what you mean by a "feature," and how it's implemented. Some features, like Java's generics, are nothing but syntactic sugar - so that's a "syntax feature." The generated bytecode is unaffected by using Java's generics due to type erasure. This allows for backwards compatibility with pre-generics (pre-Java 5) bytecode.
Other language features go much deeper than the syntactic level, like C#'s generics, which are implemented using reification to provide "first-class" generic objects.
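A Haskell-flavoured analogue of a purely syntactic feature (hedged, since the answer above talks about Java and C#): do-notation is sugar that the compiler desugars into ordinary (>>=) and (>>) calls, so both definitions below behave identically.

    -- Both definitions behave identically; the do-block is rewritten into
    -- the explicit (>>) / (>>=) form early in compilation.
    module SugarDemo where

    greetDo :: IO ()
    greetDo = do
      putStrLn "What is your name?"
      name <- getLine
      putStrLn ("Hello, " ++ name)

    greetDesugared :: IO ()
    greetDesugared =
      putStrLn "What is your name?" >>
        (getLine >>= \name ->
          putStrLn ("Hello, " ++ name))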
I don't think there is a clean separation for the concept of programming language "features", as many features, like garbage collection (Java) or pattern matching (Haskell), are provided by the runtime environment. So, generally, I would say that the programming language - the grammar - per se provides no features. It just determines the rules of the language (syntax). As the behaviour is determined by how the code (produced by obeying the grammar's rules) is interpreted, programming language features are a semantic aspect.

What are the different programming language concepts and which languages show them in a pure way

I am no language expert but I'm recently into languages and trying to get an overview of major concepts and "their" languages. This is similar to another question about books. So first, what are the major programming language concepts, e.g.
structured
procedural
object orientated
object orientated - prototype based (e.g. JavaScript)
functional (e.g. Haskell)
logic orientated (e.g. Prolog)
meta (if a pure concept of its own?)
stack based (e.g. Forth)
math based/array oriented (e.g. APL)
declarative
concatenative (e.g. PostScript)
(definitely incomplete list...)
and second, to get a good grasp of these concepts, what would be the programming language that's based on/implementing its core concept most naturally and purely?
For example Java is OO, but it's not a good example because it's not pure OO due to its primitive types.
Lisp is known to be a functional language, but it's multi-paradigm, so it's not pure. But Lisp may be a pure implementation of "list-based" (if that counts as a concept).
Is there a language that's structured (no GOTO) but not procedural? (Maybe XSLT v1.x)
The term you're looking for here is "programming paradigm" and there are a whole lot of them out there. You can get a list of languages which support each from that Wikipedia page and its follow-up links.
For "pure" renditions of any of these, that's harder because it depends on what level of purity you're looking for.
For pure structured (under any sufficiently-loose definition of "pure" here) you can look, for instance, at Modula-2.
For pure object-orientation you're looking primarily at Smalltalk and its ilk if you want absolutely everything to be uniformly treated (not actually necessary under the most common definitions!) or you're looking at languages like Java and Eiffel if you'll accept primitive types under that heading.
For functional you're looking most likely at Haskell.
For logic programming the archetypical language is Prolog, but it's not really pure. The only (mostly-)pure logic language I know of is Mercury, and that only if you view its functional chunks as being essentially compatible with its logical chunks.
...and so on and so on. You get the idea.
I think Pascal is the canonical procedural language.
I also think Lisp (ironically not ML) is the canonical "meta" language.
For one, a macro is a program fragment which modifies a data structure that represents a program fragment, so you use the language to tweak the language. Secondly, it's considered common practice to write self-hosting interpreters, traditionally called metacircular evaluators: they are programs which read programs and run them.
Of course, any other language can do that. In Python you have access to the Python compiler, and PyPy is a Python implementation written in Python. But Lisp has, I think, the strongest tradition of doing this.
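As a hedged, non-Lisp sketch of the same "programs as data" idea (Haskell is not homoiconic, so this only approximates what Lisp gives you for free):

    -- A tiny arithmetic language represented as a data structure, plus an
    -- evaluator that runs it. In Lisp the program *is* the s-expression
    -- data, which is what makes macros and metacircular evaluators natural.
    module TinyEval where

    data Exp
      = Num Integer
      | Add Exp Exp
      | Mul Exp Exp
      deriving Show

    eval :: Exp -> Integer
    eval (Num n)   = n
    eval (Add a b) = eval a + eval b
    eval (Mul a b) = eval a * eval b

    -- ghci> eval (Mul (Num 3) (Add (Num 1) (Num 4)))
    -- 15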
But I'm a Lisp outsider, so what do I know... 'hope-this-helps ;-)
Thanks to JUST MY correct OPINION's answer I was pointed in the right direction. Here is the list of paradigms together with their pure languages, as far as I have found out so far:
imperative
non-structured --- early BASIC, Assembly
structured --- ?
procedural --- ?
modular --- Modula-2, maybe Pascal
object-oriented
class-based --- Smalltalk
prototype-based --- Self, maybe JavaScript, Lua
declarative --- SQL, Regular Expressions, CSS
logic --- Mercury, maybe Prolog
functional --- Scheme, Haskell
tacit/point-free
concatenative --- Joy, Cat
On a different "axis" we have
scalar --- most of them
array --- APL
Don't know where to put it:
stack based --- Forth, Postscript

Why are most S-Expression languages dynamically typed?

How come most Lisps and Schemes are dynamically typed?
Does static typing not mix with some of their common features?
Typing and s-expressions can be made to work together; see Typed Scheme.
Partly it is a historical coincidence that s-expression languages are dynamically typed. These languages tend to rely more heavily on macros, and the ease of parsing and pattern-matching on s-expressions makes macro processing much easier. Most research on sophisticated macros happens in s-expression languages.
Typed Hygienic Macros are hard.
When Lisp was invented in the years from 1958 to 1960 it introduced a lot of features both as a language and an implementation (garbage collection, a self-hosting compiler, ...). Some features were inherited (with some improvements) from other languages (list processing, ...). The language implemented computation with functions. The s-expressions were more an implementation detail (at that time) than a language feature. A type system was not part of the language. Using the language in an interactive way was also an early implementation feature.
The useful type systems for functional languages had not yet been invented at that time. Even today it is relatively difficult to use statically typed languages in an interactive way. There are many implementations of statically typed languages which also provide some interactive interface - but mostly they don't offer the same level of support for interactive use as a typical Lisp system. Programming in an interactive Lisp system means that many things can be changed on the fly, and it could be problematic if type changes had to be propagated through whole programs and data in such an interactive Lisp system. Note that some Schemers have a different view about these things; R6RS is mostly a batch language, generally not that much in the spirit of Lisp...
The functional languages that were invented later with static type systems also got a non-s-expression syntax - they did not offer support for macros or related features. Later, some of these languages/implementations used a preprocessor for syntactic extensions.
Static typing is lexical: it means that all information about types can be inferred from reading the source code, without evaluating any expressions or computing anything (conditionals being the most important case here). A statically typed language is designed so that this can happen; a better term would be 'lexically typed', as in, a compiler can prove from reading the source alone that no type errors will occur.
In the case of Lisp, this is awkwardly different because Lisp's source code itself is not static: Lisp is homoiconic, it uses data as code and can, to some extent, dynamically edit its own running source.
Lisp was the first dynamically typed language, and probably for this reason, program code itself is no longer lexical in Lisp.
Edit: a far more powerful reason: in the case of static typing you'd have to type lists. You can either have extremely complex types for each list which account for all elements, or demand that each element has the same type and type it as a list of that. The former option will produce hell with lists of lists. The latter option demands that the source code only contains the same type for each datum; this means that you can't even build expressions, as a list is in any case a different type than an integer.
So I dare say that it is completely and utterly infeasible to realize.
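A hedged Haskell sketch of that point about typing lists: a list must be homogeneous, so a "list that can hold anything", like an s-expression, has to wrap every element in a single sum type rather than mixing raw types directly (the type names below are made up for illustration):

    module SExpr where

    xs :: [Int]
    xs = [1, 2, 3]           -- fine: every element has the same type

    -- ys = [1, "two", 3.0]  -- rejected: the elements have different types

    -- The usual statically typed answer: one wrapper type covering every
    -- s-expression shape, so the list elements are all of type SExpr.
    data SExpr
      = Atom String
      | IntLit Integer
      | List [SExpr]
      deriving Show

    expr :: SExpr
    expr = List [Atom "+", IntLit 1, List [Atom "*", IntLit 2, IntLit 3]]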
