Conditionally adding custom code in ANTLR grammar

Conditionally adding custom code in ANTLR grammar - antlr4

I am writing a grammar that needs some custom code written in its target language. It is fairly easy to add e.g.
#parser::members {
}
The problem is that I am targeting multiple languages, and I haven't found a way to target multiple languages without copy+pasting the entire grammar.
Is there a way without resorting to copy+paste or external preprocessors?

I'm afraid there is no solution. Action code is by definition written in the target language, as it is directly copied from the grammar to the generated files. If you have target languages that all can handle #ifdef #endif (say, C, C++ and Obj-C) then you could use that to separate individual code parts. Otherwise you could use a base grammar with placeholders and process that in a pre-compilation step (where you generate your parsers/lexers) and replace the placeholders with the real target code. That even makes the grammar cleaner.

Related

How to implement source map in a compiler?

I'm implementing a compiler compiling a source language to a target language (assembly like) in Haskell.
For debugging purpose, a source map is needed to map target language assembly instruction to its corresponding source position (line and column).
I've searched extensively compiler implementation, but none includes a source map.
Can anyone please point me in the right direction on how to generate a source map?
Code samples, books, etc. Haskell is preferred, other languages are also welcome.

Details depend on a compilation technique you're applying.
If you're doing it via a sequence of transforms of intermediate languages, as most sane compilers do these days, your options are following:
Annotate all intermediate representation (IR) nodes with source location information. Introduce special nodes for preserving variable names (they'll all go after you do, say, an SSA-transform, so you need to track their origins separately)
Inject tons of intrinsic function calls (see how it's done in LLVM IR) instead of annotating each node
Do a mixture of the above
The first option can even be done nearly automatically - if each transform preserves source location of an original node in all nodes it creates from it, you'd only have to manually change some non-trivial annotations.
Also you must keep in mind that some optimisations may render your source location information absolutely meaningless. E.g., a value numbering would collapse a number of similar expressions into one, probably preserving a source location information for one random origin. Same for rematerialisation.
With Haskell, the first approach will result in a lot of boilerplate in your ADT definitions and pattern matching, even if you sugar coat it with something like Scrap Your Boilerplate (SYB), so I'd recommend the second approach, which is extensively documented and nicely demonstrated in LLVM IR.

Given an antlr4 grammar, can I build up an expression tree?

So I have written my grammar in antlr4 syntax. Then I setup codegeneration, and now I can parse source files in my own defined language. This works great!
The next step I took is to create an object model from the expression tree. This is also working well.
However, now I want to generate an expression from my object model.
Can I generate code using the generated language parser objects API? Obviously, I can write methods that hand-generates strings. But I want to use a geenrated API based on the grammar to achieve some level of type safety and to detect errors when I make a grammar change.
I'm using the latest antlr4: antlr 4.7.1.

There's no generated solution. You have to wire this all up manually.

System Verilog 2012 dependency analysis

I am in the process of adapting the System Verilog LRM into Antlr4. This is a huge overkill for what I really need, however. Basically I need dependency analysis similar to the -M switch in gcc. This problem has been surprisingly difficult to solve, and my current regex based solution is incomplete, buggy and constantly breaks when exposed to new code, even though it has been patched many times. I have tried to use various freely available parsers, but none of them seem to handle code that conforms to the latest Systemverilog (2012) standard.
I think I need a parser based approach, and I think I am stuck building my own parser. But I am very interested to hear any other suggestions about this. I can't be the only one who has this problem.
Here is my Antlr question: I am attempting to use the "Island in the stream" approach where the Antlr grammar will ignore most of the details and complexity of the Systemverilog language and only parse code where modules are being instanced or headers are being referenced. Obviously the difficulty here is determining how to distinguish between code I care about and code I don't. Has anyone used Antlr this way (not necessarily for Systemverilog)? I am hoping to get a strategy about how to write the "catch all" rule that matches everything that is not related to module instances.
Thanks.

The idiomatic strategy is to match what is wanted and let everything else be consumed by an 'other' rule. So the basic structure of the parser will be:
verilog : statement+ EOF ;
statement : header
| module
| <<etc>>
| other
;
header : INCLUDE filePathspec SEMI ;
filePathspec: <<whatever>> ;
module : MODULE <<whatever>> SEMI ;
other : . ; // consume a single, uninteresting token at a time
The only requirement is to make the statement rules sufficiently detailed to uniquely match their statements. The Verilog syntax gives you that explicitly.
UPDATE
Take a look at the example Verilog grammar is in the grammar achieve.

Is it possible, in any language, to implement rules that will affect every instance of an object?

For example, could I implement a rule that would change every string that followed the pattern '1..4' into the array [1,2,3,4]? In JavaScript:
//here you create a rule that changes every string that matches /$([0-9]+)_([0-9]+)*/
//ever created into range($1,$2) (imagine a b are the results of the regexp)
var a = '1..4';
console.log(a);
>> output: [1,2,3,4];
Of course, I'm pretty confident that would be impossible in most languages. My question is: is there any language in which that would be possible? Or have anyone ever proposed something like that? Does this thing have a 'name' for which I can google to read more about?

Modifying the language from whithin itself falls under the umbrell of reflection and metaprogramming. It is referred as behavioral reflection. It differs from structural reflection that opperates at the level of the application (e.g. classes, methods) and not the language level. Support for behavioral reflection varies greatly across languages.
We can broadly categorize language changes in two categories:
changes that modify the semantics (i.e. the rules) of the language itself (e.g. redefine the method lookup algorithm),
changes that modify the syntax (e.g. your syntax '1..4' to create arrays).
For case 1, certain languages expose the structure of the application (structural reflection) and the inner working of their implementation (behavioral reflection) to the application itself via special object, called meta-objects. Meta-objects are reifications of otherwise implicit aspects, that become then explicitely manipulable: the application can modify the meta-objects to redefine part of its structure, or part of the language. When it comes to langauge changes, the focus is usually on modifiying message sending / method invocation since it is the core mechanism of object-oriented language. But the same idea could be applied to expose other aspects of the language, e.g. field accesses, synchronization primitives, foreach enumeration, etc. depending on the language.
For case 2, the program must be representated in a suitable data structure to be modified. For languages of the lisp family, the program manipulates lists, and the program can be itself represented as lists. This is called homoiconicity and is handy for metaprogramming, hence the flexibility of lisp-like languages. For other languages, their representation is usually an AST. Transforming the representation of the program, or rewriting it, is possible with macro, preprocessors, or hooks during compilation or class loading.
The line between 1 and 2 is however blurry. Syntactical changes can appear to modify the semantics of the language. For instance, I can rewrite all fields accesses with proper getter and setter and perform additional logic there, say to implement transactional memory. Did I perform a semantical change of what a field access is, or merely a syntax change?
Also, there are other constructs the fall bewten the lines. For instance, proxies and #doesNotUnderstand trap are popular techniques to simulate the reification of message sends to some extent.
Lisp and Smalltalk have been very influencial in the field of metaprogramming, and I think the two following projects/platform are interesting to look at for a representative of each of these:
Racket, a lisp-like language focused on growing languages from within the langauge
Helvetia, a Smalltalk extension to embed new languages into the host language by leveraging the AST of the host environment.
I hope you enjoyed this even if I did not really address your question ;)
Your desired change require modifying the way literals are created. This is AFAIK not usually exposed to the application. The closed work that I can think of is Virtual Values for Language Extension, that tackled Javascript.

Yes. Common Lisp (and certain other lisps) have "reader macros" which allow the user to reprogram (incrementally) the mapping between the input stream and the actual language construct as parsed.
See http://dorophone.blogspot.com/2008/03/common-lisp-reader-macros-simple.html
If you want to operate on the level of objects, you will want to use a debugging/memory management framework that keeps track of all objects, and processes the rules on each evaluation step (nasty). This seems like the kind of thing you might be able to shoehorn into smalltalk.

CLOS (Common Lisp Object System) allows redefinition of live objects.
Ultimately you need two things to implement this:
Access to the running system's AST (Abstract Syntax Tree), and
Access to the running system's objects.
You'll want to study meta-object protocols and the languages that use them, then the implementations of both the MOPs and the environment within which these programs are executed.
Image-based systems will be the easiest to modify (e.g., Lisp, potentially Smalltalk).
(Image-based systems store a snapshot of a running system, allowing complete shutdown and restarts, redefinitions, etc. of a complete environment, including existing objects, and their definitions.)

Ruby allows you to extend classes. For instance, this example adds functionality to the String class. But you can do more than add methods to classes. You can also overwrite methods, but defining a method that's already been defined. You may want to preserve access to the original method using alias_method.
Putting all this together, you can overload a constructor in Ruby, but in your case, there's a catch: It sounds like you want the constructor to return a different type. Constructors by definition return instances of their class. If you just want it to return the string "[1,2,3,4]", that's simple enough:
class string
alias_method :initialize :old_constructor
def initialize
old_constructor
# code that applies your transformation
end
end
But there's no way to make it return an Array if that's what you want.

What programming languages will let me manipulate the sequence of instructions in a method?

I have an upcoming project in which a core requirement will be to mutate the way a method works at runtime. Note that I'm not talking about a higher level OO concept like "shadow one method with another", although the practical effect would be similar.
The key properties I'm after are:
I must be able to modify the method in such a way that I can add new expressions, remove existing expressions, or modify any of the expressions that take place in it.
After modifying the method, subsequent calls to that method would invoke the new sequence of operations. (Or, if the language binds methods rather than evaluating every single time, provide me a way to unbind/rebind the new method.)
Ideally, I would like to manipulate the atomic units of the language (e.g., "invoke method foo on object bar") and not the assembly directly (e.g. "pop these three parameters onto the stack"). In other words, I'd like to be able to have high confidence that the operations I construct are semantically meaningful in the language. But I'll take what I can get.
If you're not sure if a candidate language meets these criteria, here's a simple litmus test:
Can you write another method called clean which:
accepts a method m as input
returns another method m2 that performs the same operations as m
such that m2 is identical to m, but doesn't contain any calls to the print-to-standard-out method in your language (puts, System.Console.WriteLn, println, etc.)?
I'd like to do some preliminary research now and figure out what the strongest candidates are. Having a large, active community is as important to me as the practicality of implementing what I want to do. I am aware that there may be some unforged territory here, since manipulating bytecode directly is not typically an operation that needs to be exposed.
What are the choices available to me? If possible, can you provide a toy example in one or more of the languages that you recommend, or point me to a recent example?
Update: The reason I'm after this is that I'd like to write a program which is capable of modifying itself at runtime in response to new information. This modification goes beyond mere parameters or configurable data, but full-fledged, evolved changes in behavior. (No, I'm not writing a virus. ;) )

Well, you could always use .NET and the Expression libraries to build up expressions. That I think is really your best bet as you can build up representations of commands in memory and there is good library support for manipulating, traversing, etc.

Well, those languages with really strong macro support (in particular Lisps) could qualify.
But are you sure you actually need to go this deeply? I don't know what you're trying to do, but I suppose you could emulate it without actually getting too deeply into metaprogramming. Say, instead of using a method and manipulating it, use a collection of functions (with some way of sharing state, e.g. an object holding state passed to each).

I would say Groovy can do this.
For example
class Foo {
void bar() {
println "foobar"
}
}
Foo.metaClass.bar = {->
prinltn "barfoo"
}
Or a specific instance of foo without effecting other instances
fooInstance.metaClass.bar = {->
println "instance barfoo"
}
Using this approach I can modify, remove or add expression from the method and Subsequent calls will use the new method. You can do quite a lot with the Groovy metaClass.

In java, many professional framework do so using the open source ASM framework.
Here is a list of all famous java apps and libs including ASM.
A few years ago BCEL was also very much used.

There are languages/environments that allows a real runtime modification - for example, Common Lisp, Smalltalk, Forth. Use one of them if you really know what you're doing. Otherwise you can simply employ an interpreter pattern for an evolving part of your code, it is possible (and trivial) with any OO or functional language.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string