Let C be a compiler, an abstract machine consisting of the following procedures:
Preprocessing
Syntactic analysis
Semantic analysis
Intermediate representation
Optimization
Code generation
that compiles a programming language P, that is Turing-complete by definition. The output generated by C is some low level Turing-complete language, for practical purposes the output is x86_64 or something like it.
Consider the decision problem D to be a problem that receives as an input a compiler C that compiles P, and a string S. D outputs 1 if C is able to finish and 0 otherwise.
My question then being, is D decidable?
Important notes:
The compiler is not more powerful than a Turing machine, but it is as powerful as one
The compiler works under the specification of P, not a subset of it like compilers in practice works
From more practical examples, imagine P as being C/C++, C#, Java, Python, etc.
Related
AFAIU the object file generated by Haskell compiler should be machine code. So does that object file have a representation of the original AST and reduces it at run time or does this reduction happen at compile time and only final WHNF values are converted to the corresponding machine code.
I understand the compilation time of latter would be a function of the time complexity of program itself which I think is less likely.
Can someone give a clear explanation of what happens at run time and what happens at compile time in the case of Haskell (GHC)?
A compiler could perform its job performing all the reduction at runtime. That is, the resulting executable could have a (large) data section, where the whole program AST is encoded, and a (small) text/code section with a generic WHNF reducer which operates on the AST.
Note that the above approach would work in any language. E.g. a Python compiler could also generate an executable file comprising AST data and generic reducer.
The reducer would follow the so-called small-step semantics of the language, which is a very well known notion in computer science (more specifically, in programming languages theory).
However, the performance of such approach would be quite poor.
Researchers in programming languages worked on finding better approaches, resulting in the definition of abstract machines. Essentially, an abstract machine is an algorithm for running an high level program in a lower level setting. Usually, it exploits a few data structures (e.g. stacks) to make the process more efficient. In the case of functional languages such as Haskell, well known abstract machines include:
Categorical Abstract Machine (Caml originally used this, I think)
Krivine Machine
SECD
Spineless Tagless G-machine (GHC uses this one)
The problem itself is far from being trivial. There has been, and I'd say there still is, research on making WHNF reduction more efficient.
Each Haskell definition, after GHC compilation, becomes a sequence of assembly instructions, which manipulate the state of the STG machine. There is no AST around, only code which manipulates data / closures / etc.
One could say that it is very important to use such advanced techniques to improve performance, coupled with heavy optimizations. A negative consequence, though, is that it becomes hard to understand the performance of the resulting code from the original one, since one needs to take into account how the abstract machine works (which is non trivial) and the optimizations (which are quite complex nowadays). To some lesser extent, this is also the case for heavily optimizing C or C++ compilers, where it becomes harder to know when the optimization was triggered or not.
Ultimately, an experienced programmer (in Haskell, C, C++, or anything else) will come to understand the basic optimizations of their compiler, and the basic mechanisms abstract machine which is being used. However, this is not something which is easy to master, I think.
In the question, it is mentioned that WHNF reduction could be performed at compile time. This is only partially true, since the value of the variables originating from IO actions can not be known until runtime, so reduction can only happen at runtime when it involves those values. Further, performing reduction can also make performance worse! E.g.
let x = complex computation in x + x
-- vs
complex computation + complex computation
The latter is the result of reducing the former, but it duplicates work! Indeed, most abstract machines use a lazy reduction approach that makes x to be computed only once in such cases.
I'm looking into internals of GHC and I find all the parsing and type system written completely in Haskell. Low-level core of the language is provided by RTS. The question is which one of the following is true?
RTS contains C implementation of the type system and other basic parts of Haskell (I didn't find it, RTS is mainly GC and threading)
Everything is implemented in Haskell itself. But it seems quite tricky because building GHC already requires GHC.
Could you explain development logic of the compiler? For example Python internals provide an opaque implementation of everything in C.
As others have noted in the comments, GHC is written almost entirely
in Haskell (plus select GHC extensions) and is intended to be compiled with itself. In fact, the only program in the world that can compile the GHC compiler is the GHC compiler! In particular,
parsing and type inference are implemented in Haskell code, and you
won't find a C implementation hidden in there anywhere.
The best source for understanding the internal structure of the
compiler (and what's implemented how) is the GHC Developer Wiki
and specifically the "GHC Commentary" link. If you have a fair bit of spare time, the video
series from the
Portland 2006 GHC Hackathon is absolutely fascinating.
Note that the idea of a compiler being written in the language it
compiles is not unusual. Many compilers are "self-hosting" meaning
that they are written in the language they compile and are intended to
compile themselves. See, for example, this question on another Stack
Exchange sister site: Why are self-hosting compilers considered a
rite of passage for new languages?, or simply Google for
"self-hosting compiler"
As you say, this is "tricky", because you need a way to get the
process started. Some approaches are:
You can write the first compiler in a different language that
already has a compiler (or write it in assembly language); then,
once you have a running compiler, you can port it to the same
language it compiles. According to this Quora answer, the
first C compiler was written this way. It was written in "NewB"
whose compiler was written in "B", a self-hosting compiler that
had originally been written in assembly and then rewritten in
itself.
If the language is popular enough to have another compiler, write
the compiler in its own language and compile it in phases, first
with the other compiler, then with itself (as compiled by the
other compiler), then again with itself (as compiled by itself).
The last two compiler executables can be compared as a sort of
massive test that the compiler is correct. The Gnu C Compiler can
be compiled this way (and this certainly used to be the standard way to install it from source, using the vendor's [inferior!] C compiler to get started).
If an interpreter written in another language already exists or is
easy to write, the compiler can be run by the interpreter to
compile its own source code, and thereafter the compiled compiler
can be used to compile itself. The first LISP compiler is
claimed to be the first compiler to bootstrap itself this way.
The bootstrapping process can often be simplified by writing the compiler (at least initially) in a restricted core of the language, even though the compiler itself is capable of compiling the full language. Then, a sub-par existing compiler or a simplified bootstrapping compiler or interpreter can get the process started.
According to the Wikipedia entry for GHC, the original GHC compiler was written in 1989 in Lazy ML, then rewritten in Haskell later the same year. These days, new versions of GHC with all their shiny new features are compiled on older versions of GHC.
The situation for the Python interpreter is a little different. An
interpreter can be written in the language it interprets, of course,
and there are many examples in the Lisp world of writing Lisp
interpreters in Lisp (for fun, or in developing a new Lisp dialect, or
because you're inventing Lisp), but it can't be interpreters all
the way down, so eventually you'd need either a compiler or an
interpreter implemented in another language. As a result, most
interpreters aren't self-hosting: the mainstream interpreters for
Python, Ruby, and PHP are written in C. (Though, PyPy is an alternate
implementation of the Python interpreter that's written in Python,
so...)
I had a friend say:
For me the most interesting thing about Haskell is not the language and the types. It is the Spineless Tagless Graph Machine behind it.
Because Haskell people talk about types all the time, this quote really caught my attention. Now we can look at the Haskell compilation process like this:
Parsing
Type checking
Desugaring + a few bobs and bits
Translation to core
Lion share of optimization
Translation to STG language
STG language to C–
C– to assembly or llvm
Which we can simplify down to:
.. front end stuff ..
Translate IL to STG language
Compile STG language to C/ASM/LLVM/Javascript
Ie - there is an intermediate 'graph language' that Haskell is compiled to, and various optimisations happen there, prior to it being compiled to LLVM/C etc.
This contrasts to a potential JVM Language compilation process that looks like this:
Convert JVM Language Code to Java bytecode inside a class.
Run the Bytecode on a Java Virtual Machine.
Assuming it were possible to add an intermediate STG Compilation step to the Java Compilation process, I'm wondering what impact would this change have? What would change about the compiled code?
(I'm aware that you need a pure functional language to get the most use out of the spineless tagless graph machine, so if it is helpful to answer the question, assume we're compiling Frege [Haskell for the JVM].)
My question is: What would change if the JVM Language Compilation process had an STG phase like Haskell?
You need to clarify if you mean Java the language or some language running on the JVM.
My knowledge of Java the language is limited to having read the specification, and I know nothing about the Haskell IR you're talking about. However, Java is, by spec, a dynamic language, and it would be illegal to perform any AOT xform which uses any information outside of each end classfile.
Of course a project that doesn't use these features could break these rules.
why most of the scripting languages are loosely typed ? for example
javascript , python , etc ?
First of all, there are some issues with your terminology. There is no such thing as a loosely typed language and the term scripting language is vague too, most commonly referring to so called dynamic programming languges.
There is weak typing vs. strong typing about how rigorously is distinguished between different types (i.e. if 1 + "2" yields 3 or an error).
And there is dynamic vs. static typing, which is about when type information is determined - while or before running.
So now, what is a dynamic language? A language that is interpreted instead of compiled? Surely not, since the way a language is run is never some inherent characteristic of the language, but a pure implementation detail. In fact, there can be interpreters and compilers for one-and-the-same language. There is GHC and GHCi for Haskell, even C has the Ch interpreter.
But then, what are dynamic languges? I'd like to define them through how one works with them.
In a dynamic language, you like to rapidly prototype your program and just get it work somehow. What you don't want to do is formally specifying the behaviour of your programs, you just want it to behave like intended.
Thus if you write
foo = greatFunction(42)
foo.run()
in a scripting language, you'll simply assume that there is some greatFunction taking a number that will returns some object you can run. You don't prove this for the compiler in any way - no predetmined types, no IRunnable ... . This automatically gets you in the domain of dynamic typing.
But there is type inference too. Type inference means that in a statically-typed language, the compiler does automatically figure out the types for you. The resulting code can be extremely concise but is still statically typed. Take for example
square list = map (\x -> x * x) list
in Haskell. Haskell figures out all types involved here in advance. We have list being a list of numbers, map some function that applies some other function to any element of a list and square that produces a list of numbers from another list of numbers.
Nonetheless, the compiler can prove that everything works out in advance - the operations anything supports are formally specified. Hence, I'd never call Haskell a scripting language though it can reach similar levels of expressiveness (if not more!).
So all in all, scripting languages are dynamically typed because that allows you to prototype a running system without specifying, but assuming every single operation involved exists, which is what scripting languages are used for.
I don't quite understand your question. Apart from PHP, VBScript, COMMAND.COM and the Unix shell(s) I can't really think of any loosely typed scripting languages.
Some examples of scripting languages which are not loosely typed are Python, Ruby, Mondrian, JavaFXScript, PowerShell, Haskell, Scala, ELisp, Scheme, AutoLisp, Io, Ioke, Seph, Groovy, Fantom, Boo, Cobra, Guile, Slate, Smalltalk, Perl, …
I've read the Wikipedia article on concatenative languages, and I am now more confused than I was when I started. :-)
What is a concatenative language in stupid people terms?
In normal programming languages, you have variables which can be defined freely and you call methods using these variables as arguments. These are simple to understand but somewhat limited. Often, it is hard to reuse an existing method because you simply can't map the existing variables into the parameters the method needs or the method A calls another method B and A would be perfect for you if you could only replace the call to B with a call to C.
Concatenative language use a fixed data structure to save values (usually a stack or a list). There are no variables. This means that many methods and functions have the same "API": They work on something which someone else left on the stack. Plus code itself is thought to be "data", i.e. it is common to write code which can modify itself or which accepts other code as a "parameter" (i.e. as an element on the stack).
These attributes make this languages perfect for chaining existing code to create something new. Reuse is built in. You can write a function which accepts a list and a piece of code and calls the code for each item in the list. This will now work on any kind of data as long it's behaves like a list: results from a database, a row of pixels from an image, characters in a string, etc.
The biggest problem is that you have no hint what's going on. There are only a couple of data types (list, string, number), so everything gets mapped to that. When you get a piece of data, you usually don't care what it is or where it comes from. But that makes it hard to follow data through the code to see what is happening to it.
I believe it takes a certain set of mind to use the languages successfully. They are not for everyone.
[EDIT] Forth has some penetration but not that much. You can find PostScript in any modern laser printer. So they are niche languages.
From a functional level, they are at par with LISP, C-like languages and SQL: All of them are Turing Complete, so you can compute anything. It's just a matter of how much code you have to write. Some things are more simple in LISP, some are more simple in C, some are more simple in query languages. The question which is "better" is futile unless you have a context.
First I'm going to make a rebuttal to Norman Ramsey's assertion that there is no theory.
Theory of Concatenative Languages
A concatenative language is a functional programming language, where the default operation (what happens when two terms are side by side) is function composition instead of function application. It is as simple as that.
So for example in the SKI Combinator Calculus (one of the simplest functional languages) two terms side by side are equivalent to applying the first term to the second term. For example: S K K is equivalent to S(K)(K).
In a concatenative language S K K would be equivalent to S . K . K in Haskell.
So what's the big deal
A pure concatenative language has the interesting property that the order of evaluation of terms does not matter. In a concatenative language (S K) K is the same as S (K K). This does not apply to the SKI Calculus or any other functional programming language based on function application.
One reason this observation is interesting because it reveals opportunities for parallelization in the evaluation of code expressed in terms of function composition instead of application.
Now for the real world
The semantics of stack-based languages which support higher-order functions can be explained using a concatenative calculus. You simply map each term (command/expression/sub-program) to be a function that takes a function as input and returns a function as output. The entire program is effectively a single stack transformation function.
The reality is that things are always distorted in the real world (e.g. FORTH has a global dictionary, PostScript does weird things where the evaluation order matters). Most practical programming languages don't adhere perfectly to a theoretical model.
Final Words
I don't think a typical programmer or 8 year old should ever worry about what a concatenative language is. I also don't find it particularly useful to pigeon-hole programming languages as being type X or type Y.
After reading http://concatenative.org/wiki/view/Concatenative%20language and drawing on what little I remember of fiddling around with Forth as a teenager, I believe that the key thing about concatenative programming has to do with:
viewing data in terms of values on a specific data stack
and functions manipulating stuff in terms of popping/pushing values on the same the data stack
Check out these quotes from the above webpage:
There are two terms that get thrown
around, stack language and
concatenative language. Both define
similar but not equal classes of
languages. For the most part though,
they are identical.
Most languages in widespread use today
are applicative languages: the central
construct in the language is some form
of function call, where a function is
applied to a set of parameters, where
each parameter is itself the result of
a function call, the name of a
variable, or a constant. In stack
languages, a function call is made by
simply writing the name of the
function; the parameters are implicit,
and they have to already be on the
stack when the call is made. The
result of the function call (if any)
is then left on the stack after the
function returns, for the next
function to consume, and so on.
Because functions are invoked simply
by mentioning their name without any
additional syntax, Forth and Factor
refer to functions as "words", because
in the syntax they really are just
words.
This is in contrast to applicative languages that apply their functions directly to specific variables.
Example: adding two numbers.
Applicative language:
int foo(int a, int b)
{
return a + b;
}
var c = 4;
var d = 3;
var g = foo(c,d);
Concatenative language (I made it up, supposed to be similar to Forth... ;) )
push 4
push 3
+
pop
While I don't think concatenative language = stack language, as the authors point out above, it seems similar.
I reckon the main idea is 1. We can create new programs simply by joining other programs together.
Also, 2. Any random chunk of the program is a valid function (or sub-program).
Good old pure RPN Forth has those properties, excluding any random non-RPN syntax.
In the program 1 2 + 3 *, the sub-program + 3 * takes 2 args, and gives 1 result. The sub-program 2 takes 0 args and returns 1 result. Any chunk is a function, and that is nice!
You can create new functions by lumping two or more others together, optionally with a little glue. It will work best if the types match!
These ideas are really good, we value simplicity.
It is not limited to RPN Forth-style serial language, nor imperative or functional programming. The two ideas also work for a graphical language, where program units might be for example functions, procedures, relations, or processes.
In a network of communicating processes, every sub-network can act like a process.
In a graph of mathematical relations, every sub-graph is a valid relation.
These structures are 'concatenative', we can break them apart in any way (draw circles), and join them together in many ways (draw lines).
Well, that's how I see it. I'm sure I've missed many other good ideas from the concatenative camp. While I'm keen on graphical programming, I'm new to this focus on concatenation.
My pragmatic (and subjective) definition for concatenative programming (now, you can avoid read the rest of it):
-> function composition in extreme ways (with Reverse Polish notation (RPN) syntax):
( Forth code )
: fib
dup 2 <= if
drop 1
else
dup 1 - recurse
swap 2 - recurse +
then ;
-> everything is a function, or at least, can be a function:
( Forth code )
: 1 1 ; \ define a function 1 to push the literal number 1 on stack
-> arguments are passed implicitly over functions (ok, it seems to be a definition for tacit-programming), but, this in Forth:
a b c
may be in Lisp:
(c a b)
(c (b a))
(c (b (a)))
so, it's easy to generate ambiguous code...
you can write definitions that push the xt (execution token) on stack and define a small alias for 'execute':
( Forth code )
: <- execute ; \ apply function
so, you'll get:
a b c <- \ Lisp: (c a b)
a b <- c <- \ Lisp: (c (b a))
a <- b <- c <- \ Lisp: (c (b (a)))
To your simple question, here's a subjective and argumentative answer.
I looked at the article and several related web pages. The web pages say themselves that there isn't a real theory, so it's no wonder that people are having a hard time coming up with a precise and understandable definition. I would say that at present, it is not useful to classify languages as "concatenative" or "not concatenative".
To me it looks like a term that gives Manfred von Thun a place to hang his hat but may not be useful for other programmers.
While PostScript and Forth are worth studying, I don't see anything terribly new or interesting in Manfred von Thun's Joy programming language. Indeed, if you read Chris Okasaki's paper on Techniques for Embedding Postfix Languages in Haskell you can try out all this stuff in a setting that, relative to Joy, is totally mainstream.
So my answer is there's no simple explanation because there's no mature theory underlying the idea of a concatenative language. (As Einstein and Feynman said, if you can't explain your idea to a college freshman, you don't really understand it.) I'll go further and say although studying some of these languages, like Forth and PostScript, is an excellent use of time, trying to figure out exactly what people mean when they say "concatenative" is probably a waste of your time.
You can't explain a language, just get one (Factor, preferably) and try some tutorials on it. Tutorials are better than Stack Overflow answers.