What does an Interpreter contain? - python-3.x

I'm using ANTLR4 to create an interpreter, a lexer and a parser. The GUI they will be used in is built around QScintilla2.
Since QScintilla does not need a parser and provides a CustomLexer module, will the (ANTLR4-built, Python 3 target) interpreter be enough?
I'm not asking for opinions but factual guidance. Thanks.

What does an Interpreter contain
An interpreter must have some way to parse the code and then some way to run it. Usually the "way to parse the code" would be handled by a lexer+parser, but lexerless parsing is also possible. Either way, the parser will create some intermediate representation of the code such as a tree or bytecode. The "way to run it" will then be a phase that iterates over the generated tree or bytecode and executes it. JIT-compilation (i.e. generating machine code from the tree or bytecode and then executing that) is also possible, but more advanced. You can also run various analyses between parsing and execution (for example, you can check whether any undefined variables are used anywhere, or you could do static type checking - though the latter is uncommon in interpreted languages).
When using ANTLR, ANTLR will generate a lexer and parser for you, the latter of which produces a parse tree as a result, which you can iterate over using the generated listener or visitor. At that point you proceed as you see fit with your own code. For example, you could generate bytecode from the parse tree and execute that, translate the parse tree into a simplified tree (an AST) and execute that, or execute the parse tree directly in a visitor.
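As a toy illustration of the "build a tree, then walk it" idea, here is a minimal hand-rolled tree-walking evaluator in Python. The Num and BinOp node classes are made up for illustration and stand in for whatever tree your parser (ANTLR-generated or otherwise) produces:

# A minimal tree-walking evaluator sketch (hypothetical node types, not ANTLR output).
from dataclasses import dataclass

@dataclass
class Num:                 # literal number node
    value: int

@dataclass
class BinOp:               # binary operation node
    op: str
    left: object
    right: object

def evaluate(node):
    """Walk the tree and execute it: the 'way to run it' phase."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, BinOp):
        lhs, rhs = evaluate(node.left), evaluate(node.right)
        return {"+": lhs + rhs, "-": lhs - rhs, "*": lhs * rhs}[node.op]
    raise ValueError(f"unknown node: {node!r}")

# (1 + 2) * 3  -- the tree a parser would have produced
tree = BinOp("*", BinOp("+", Num(1), Num(2)), Num(3))
print(evaluate(tree))      # prints 9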
QScintilla is about displaying the language and is not linked to the interpreter. In an IDE, the console is where the interpreter comes into play, along with running the script (from a 'Run' button, for example). The only thing common to QScintilla and the interpreter is the script file; the interpreter is not connected or linked to QScintilla. Does this make basic sense?
Yes, that makes sense, but it doesn't have to be entirely like that. That is, it can make sense to reuse certain parts of your interpreter to implement certain features in your editor/IDE, but you don't have to.
You've specifically mentioned the "Run" button and as far as that is concerned, the implementation of the interpreter (and whether or not it uses ANTLR) is of absolutely no concern. In fact it doesn't even matter which language the interpreter is written in. If your interpreter is named mylangi and you're currently editing a file named foo.mylang, then hitting the "Run" button should simply execute subprocess.run(["mylangi", "foo.mylang"]) and display the result in some kind of tab or window.
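A minimal sketch of such a "Run" handler, assuming the hypothetical mylangi interpreter is on the PATH and output_view is a Qt text widget (e.g. a QPlainTextEdit):

import subprocess

def on_run_clicked(source_path, output_view):
    # Run the interpreter exactly as a user would on the command line.
    result = subprocess.run(
        ["mylangi", source_path],          # hypothetical interpreter binary
        capture_output=True, text=True,
    )
    # Show stdout and stderr in the editor's output tab/window.
    output_view.setPlainText(result.stdout + result.stderr)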
Same if you want to have a "console" or "REPL" window where you can interact with the interpreter: You simply invoke the interpreter as a subprocess and connect it to the tab or subwindow that displays the console. Again the implementation of the interpreter is irrelevant for this - you treat it like any other command line application.
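A sketch of that console wiring, again with mylangi as a placeholder and a hypothetical --interactive flag; reading a single reply line keeps the example short, whereas a real console would read the interpreter's output asynchronously:

import subprocess

# Start the interpreter once and keep it running behind the console tab.
repl = subprocess.Popen(
    ["mylangi", "--interactive"],          # hypothetical REPL flag
    stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    text=True, bufsize=1,                  # line-buffered text pipes
)

def send_line(line):
    """Called when the user presses Enter in the console widget."""
    repl.stdin.write(line + "\n")
    repl.stdin.flush()
    return repl.stdout.readline()          # the interpreter's reply, shown in the tab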
Now other features that IDEs and code editors have are syntax highlighting, auto-completion and error highlighting.
For syntax highlighting you need some code that goes through the source and tells the editor which parts of the code should have which color (or boldness etc.). With QScintilla, you accomplish this by providing a lexer class that does this. You can define such a class by writing the code to detect the token types by hand, but you can also re-use the lexer generated by ANTLR. So that's one way in which the implementation of your interpreter could be re-used in the editor/IDE. However, since a syntax highlighter is usually fairly straightforward to write by hand, you don't have to do it this way.
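For illustration, here is a skeletal custom lexer using PyQt5's QsciLexerCustom. The keyword set, the style numbers and the naive word-splitting are placeholders, and byte/character offsets are glossed over for non-ASCII text; a real implementation could feed the tokens produced by the ANTLR-generated lexer into setStyling instead:

import re
from PyQt5.Qsci import QsciLexerCustom
from PyQt5.QtGui import QColor

class MyLangLexer(QsciLexerCustom):
    # Style numbers are arbitrary; QScintilla just maps them to the colours set below.
    Default, Keyword, Number = 0, 1, 2
    KEYWORDS = {"if", "else", "while", "func"}      # hypothetical keyword set

    def __init__(self, editor=None):
        super().__init__(editor)
        self.setColor(QColor("#000000"), self.Default)
        self.setColor(QColor("#00007f"), self.Keyword)
        self.setColor(QColor("#007f00"), self.Number)

    def language(self):
        return "MyLang"

    def description(self, style):
        names = {self.Default: "Default", self.Keyword: "Keyword", self.Number: "Number"}
        return names.get(style, "")

    def styleText(self, start, end):
        # Naive word-by-word colouring of the requested range.
        text = self.parent().text()[start:end]
        self.startStyling(start)
        for word in re.split(r"(\W+)", text):
            if word in self.KEYWORDS:
                style = self.Keyword
            elif word.isdigit():
                style = self.Number
            else:
                style = self.Default
            self.setStyling(len(bytearray(word, "utf-8")), style)

Attaching it is then a single call: editor.setLexer(MyLangLexer(editor)).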
For code completion you need to understand which variables and functions are defined in the file, what their scope is, and which other files are included in the current file. These days it's becoming common to implement this logic in a so-called language server, a separate tool that can be re-used from different editors and IDEs. Regardless of whether you implement this logic in such a language server or directly in your editor, you'll need a parser (and, if applicable, a type checker) to be able to answer these kinds of questions. Again, that's something you can re-use from your interpreter, and this time that's definitely a good idea, because writing a second parser would be significant additional work (and it would easily get out of sync with the interpreter's parser).
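As a toy sketch of the kind of information involved (scoping is ignored for brevity, and the FuncDef/VarDef nodes are hypothetical):

from dataclasses import dataclass, field

@dataclass
class FuncDef:                       # hypothetical AST nodes
    name: str
    params: list
    body: list = field(default_factory=list)

@dataclass
class VarDef:
    name: str

def collect_completions(nodes):
    """Gather names the editor can offer for auto-completion."""
    names = []
    for node in nodes:
        if isinstance(node, FuncDef):
            names.append(node.name + "(")        # hint that it's callable
            names.extend(collect_completions(node.body))
        elif isinstance(node, VarDef):
            names.append(node.name)
    return names

module = [VarDef("total"), FuncDef("area", ["w", "h"], [VarDef("result")])]
print(collect_completions(module))   # ['total', 'area(', 'result']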
For error highlighting you can simply invoke the interpreter in "verify only" mode (i.e. only print out syntax errors and other errors that can be detected statically, but don't actually run the file -- many interpreters have such an option) and then parse the output to find out where to draw the squiggly lines. But you can also re-use the parser (and analyses if you have any) from your interpreter instead. If you go the route of having a language server, errors and warnings would also be handled by the language server.
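A sketch of the "verify only" route, assuming a hypothetical --check flag, diagnostics printed to stderr, and the common file:line:col: message format:

import re
import subprocess

DIAG = re.compile(r"^(?P<file>[^:]+):(?P<line>\d+):(?P<col>\d+):\s*(?P<msg>.*)$")

def check_file(path):
    """Run the interpreter in verify-only mode and collect (line, col, message) tuples."""
    proc = subprocess.run(["mylangi", "--check", path],     # hypothetical flag
                          capture_output=True, text=True)
    errors = []
    for raw in proc.stderr.splitlines():
        m = DIAG.match(raw)
        if m:
            errors.append((int(m["line"]), int(m["col"]), m["msg"]))
    return errors   # hand these to the editor to draw squiggles or annotations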

Related

Create a C and C++ preprocessor using ANTLR

I want to create a tool that can analyze C and C++ code and detect unwanted behaviors, based on a config file. I thought about using ANTLR for this task, as I already created a simple compiler with it from scratch a few years ago (variables, conditions, loops, and functions).
I grabbed C.g4 and CPP14.g4 from the ANTLR grammars repository. However, I noticed that they don't handle preprocessing, as that's a separate step in the compilation.
I tried to find a grammar that does the preprocessing part (updated to ANTLR4) with no luck. Moreover, I also understood that if I go with two-step parsing I won't be able to retain the original location of each character, as I'd have already modified the input stream.
I wonder if there's a good ANTLR grammar or program (preferably Python, but I can deal with other languages as well) that can help me preprocess the C code. I also thought about using gcc -E, but then I won't be able to inspect the macro definitions (for example, I want to warn if a user used a #pragma GCC; some students at my university, for whom I'm writing this program, used this to bypass some of the course coding-style restrictions). Moreover, gcc -E will include library header contents, which I don't want to process.
My question is, therefore, whether you can recommend a grammar/program that I can use to preprocess C and C++ code. Alternatively, if you can guide me on how to create such a grammar myself, that would be perfect. I was able to write the basic #define, #pragma etc. processing, but I'm unable to deal with conditionals and macro functions, as I'm unsure how to handle them.
Thanks in advance!
This question is almost off-topic as it asks for an external resource. However, it also bears a part that deserves some attention.
The term "preprocessor" already indicates what the handling of macros etc. is about. The parser never sees the disabled parts of the input, which also means it can be anything, which might not be part of the actual language to parse. Hence a good approach for parsing C-like languages is to send the input through a preprocessor (which can be a specialized input stream) to strip out all preprocessing constructs, to resolve macros and remove disabled text. The parse position is not a problem, because you can push the current token position before you open a new input stream and restore that when you are done with it. Store reported errors together with your input stream stack. This way you keep the correct token positions. I have used exactly this approach in my Windows resource file parser.

What do I need to learn to build an interpreter?

For my AQA A2-level Computing project, I've decided to create a basic interpreted programming language, outputting to the console. I don't know how to build an interpreter. I have a copy of the purple dragon book, which is all about compiler design; user166390 said in an answer to this question that the initial steps of building a compiler are the same as for building an interpreter. My question is: is this true?
Can I use the techniques described in the dragon book to write an interpreter? And if so, which steps do I need to use and learn how to use?
Do I need to write a lexical analyser, a syntax analyser, a semantic analyser and an intermediate code generator, for example?
Could I get away with writing a basic parser that reads each line of the source code, parses it, and executes the instruction straight away, or is that a notoriously bad idea?
Yes, you can use the techniques described in the dragon book to write an interpreter.
You need a lexical analyzer and a parser regardless.
As others have pointed out, you do need to write the code to do actual execution -- but for a simple interpreter, this can be essentially the same as the syntax-directed translation described in the dragon book.
Everything else is optional.
If you want to skip straight from the parser to execution, you can. That will leave you with a very simple language, which can be both good and bad -- look at Tcl for an example of such a language.
If you want to interpret each line as you parse it, you can do that, too; this is what most command-line interpreters (Unix shells, Microsoft's cmd.exe and PowerShell) do, as well as the interactive REPLs (Read-Eval-Print Loops) of languages like Python and Ruby.
"Semantic analyzer" seems vague to me, but sounds like it should include most kinds of load-time consistency checks. This is also optional, but there are advantages in an interpreter that won't take any old garbage and try to execute it as a program...
"Intermediate code" is also kind of vague, but it is arguably optional. If you aren't executing directly from the program string (as in Tcl), you need some kind of internal representation to store your code once you've read it in. One popular option is to execute from an internal tree structure, based more or less closely on your parse tree, which is arguably distinct from producing "intermediate code". On the other hand, if your "intermediate code" could be written out more or less directly from your internal tree structure, you might as well count the internal structure as your "intermediate code".
There are important issues that you haven't addressed; one that stands out is: how do you want to handle names? Presumably you will want the programmer to be able to define and use his own names (e.g., for variables, functions, and so forth), so you will need to implement some kind of mechanism for that.
Exactly how names are handled is a big design decision, with major implications for the usability and implementability of your language. The simplest option for implementation is to use a single, global hash map to implement a single, global namespace -- but note that this choice has well-known usability problems...
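As a sketch, the "single, global hash map" really is just one dictionary; chaining such maps via a parent link is the usual next step towards lexical scoping (the Env class below is invented for illustration):

class Env:
    """A namespace. Chain instances via 'parent' for lexical scope,
    or use a single instance everywhere for the flat global-map design."""
    def __init__(self, parent=None):
        self.names = {}
        self.parent = parent

    def define(self, name, value):
        self.names[name] = value

    def lookup(self, name):
        if name in self.names:
            return self.names[name]
        if self.parent is not None:
            return self.parent.lookup(name)
        raise NameError(f"undefined name: {name}")

globals_env = Env()                 # the single global namespace
globals_env.define("x", 1)
local = Env(parent=globals_env)     # a function call's local scope
local.define("y", 2)
print(local.lookup("x"), local.lookup("y"))   # 1 2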
Could I get away with writing a basic parser that reads source code and executes the steps straight away?
You could but you'd be doing it the hard way.
Do I need to write a lexical analyser, a syntax analyser, a semantic analyser and an intermediate code generator, for example?
You can skip intermediate code generation unless you want to write a VM-based interpreter. Perl, for example, used to execute its parse graph directly; this is in contrast with Java or Python, which produce intermediate bytecode.
The interpreter part of a VM-based language is generally simpler than an interpreter that has to understand a parse graph (so each component in the system is simpler); however, the complexity of the whole interpreter stack is generally lower when you don't need to define an intermediate bytecode language. So pick your poison.
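To make the contrast concrete, here is a toy stack-based bytecode VM in Python; a tree-walking interpreter would skip this layer and evaluate the parse tree directly (the instruction set is invented for illustration):

def run(bytecode):
    """Execute a list of (opcode, argument) pairs on an operand stack."""
    stack = []
    for op, arg in bytecode:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "PRINT":
            print(stack.pop())
        else:
            raise ValueError(f"unknown opcode: {op}")

# (1 + 2) * 3, already flattened by a hypothetical code generator
run([("PUSH", 1), ("PUSH", 2), ("ADD", None),
     ("PUSH", 3), ("MUL", None), ("PRINT", None)])   # prints 9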

Debug-able Domain Specific Language

My goal is to develop a DSL for my application, but I want users to be able to put a breakpoint in their DSL code without knowing anything about the underlying language the DSL runs on; all they should see is the DSL-related syntax, stack, watch variables and so on.
How can I achieve this?
It depends on your target platform. For example, if you're implementing your DSL compiler on top of .NET, it is trivial to annotate your bytecode with debugging information (variable names, source code location for expressions and statements, etc.).
If you also provide a Visual Studio extension for your language, you'll be able to reuse a royalty-free MSVS Isolated Shell for both editing and debugging for your DSL code.
Nearly the same approach is possible with JVM (you can use Eclipse or Netbeans as a debugging frontend).
Native code generation is a little bit more complicated, but it is still possible to do some simple things, like generating C code annotated with #line directives.
You basically need to generate code for your DSL with built-in opportunities for breakpoints, each with built-in facilities for observing the internal state variables. Then your debugger has to know how to map locations in the DSL to the debug breakpoints, and at each breakpoint it simply calls the observers. (If the observers have names, e.g. variable names, you can let the user choose which ones to call.)
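A rough Python sketch of that structure for a tree- or line-interpreted DSL: the runner consults a breakpoint set before each DSL statement and hands DSL-level state to an observer callback (all names here are hypothetical):

class DslDebugger:
    def __init__(self, on_break):
        self.breakpoints = set()        # DSL line numbers chosen by the user
        self.on_break = on_break        # callback that shows variables, stack, ...

    def toggle(self, line):
        self.breakpoints ^= {line}

def run_statements(statements, debugger, variables):
    """statements: list of (dsl_line_number, callable) pairs produced by the
    DSL front end; each callable mutates the 'variables' dict."""
    for line, execute in statements:
        if line in debugger.breakpoints:
            # Pause and let the UI observe DSL-level state only.
            debugger.on_break(line, dict(variables))
        execute(variables)

dbg = DslDebugger(on_break=lambda line, vars: print(f"break at DSL line {line}: {vars}"))
dbg.toggle(2)
run_statements(
    [(1, lambda v: v.update(x=10)), (2, lambda v: v.update(y=v["x"] * 2))],
    dbg, {},
)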

Interpreted standard library

It's common for a programming language to come with a standard library implemented at least partly in the language itself.
In the case of an interpreted language, the obvious implementation is to read the library source files when the interpreter starts up, but this runs into the messy but persistent problem of making sure the interpreter knows where to find those files even when both are moved around. It would be cleaner if they could be embedded in the interpreter itself, so there is just a single executable.
I can see a simple way to do this by just translating the library source files to C literal strings, but I'm curious as to whether there are any pitfalls I'm overlooking or refinements to the method.
So my question is: which existing interpreted languages attach library source files, written in the language itself, to the interpreter?
Bytecode virtual machines often provide an answer to this: store the bytecode in files (*.pyc, *.rbc) and load the bytecoded versions of the libraries using a simpler mechanism.
Smalltalks do this by dumping the standard heap into a separate file called an "image".
As for single-file distribution: append the library file(s) to the end of the executable file and include special logic for the interpreter to read its own binary and locate the embedded program data, or alternatively build the interpreter with the program data statically included.
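A Python sketch of the append-to-the-executable variant, assuming a made-up layout in which the build step appends the library source followed by an 8-byte little-endian length footer:

import os
import struct
import sys

FOOTER = struct.Struct("<Q")   # 8-byte unsigned length at the very end of the file

def embed(interpreter_path, library_source, out_path):
    """Build step: copy the interpreter and append the library plus its length."""
    data = library_source.encode("utf-8")
    with open(interpreter_path, "rb") as src, open(out_path, "wb") as dst:
        dst.write(src.read())
        dst.write(data)
        dst.write(FOOTER.pack(len(data)))

def load_embedded_library():
    """Startup step: the interpreter reads the library back out of its own binary."""
    with open(sys.executable, "rb") as f:      # or sys.argv[0], depending on packaging
        f.seek(-FOOTER.size, os.SEEK_END)
        (length,) = FOOTER.unpack(f.read(FOOTER.size))
        f.seek(-(FOOTER.size + length), os.SEEK_END)
        return f.read(length).decode("utf-8")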

Is there a way to convert from a string to pure code in C++?

I know that it's possible to read from a .txt file and then convert various parts of it into string, char, and int values, but is it possible to take a string and use it as real code in the program?
Code:
string codeblock1 = "cout << \"This is a test\";";
string codeblock2 = "int array[5] = {0, 6, 6, 3, 5};";
int i;
cin>>i;
if(i)
{
execute(codeblock1);
}
else
{
execute(codeblock2);
}
Where execute is a function that converts from text to actual code (I don't know if there actually is a function called execute, I'm using it for the purpose of my example).
In C++ there's no simple way to do this. This feature is available in higher-level languages like Python, Lisp, Ruby and Perl (usually with some variation of an eval function). However, even in these languages this practice is frowned upon, because it can result in very unreadable code.
It's important to ask yourself (and perhaps tell us) why you want to do this.
Or do you only want to know if it's possible? If so, it is, though in a hairy way. You can write a C++ source file (generate whatever you want into it, as long as it's valid C++), then compile it and link it to your code. All of this can be done automatically, of course, as long as a compiler is available to you at runtime (you can just invoke it with system). I know someone who did this for some heavy optimization once. It's not pretty, but it can be made to work.
You can create a function that parses whatever strings you like and creates a data structure from them. This is known as a parse tree. Subsequently you can examine your parse tree and generate the necessary dynamic structures to perform the logic therein. The parse tree is subsequently converted into a runtime representation that is executed.
All compilers do exactly this. They take your code and produce machine code from it. In your particular case you want a language to write code for itself. Normally this is done in the context of a code generator, and it is part of a larger build process. If you write a program to parse your language (consider flex and bison for this operation) that generates code, you can achieve the results you desire.
Many scripting languages offer this sort of feature, going all the way back to eval in LISP - but C and C++ don't expose the compiler at runtime.
There's nothing in the spec that stops you from creating and executing some arbitrary machine language, like so:
char code[] = { 0x2f, 0x3c, 0x17, 0x43 }; // some machine code of some sort
typedef void (*FuncType)(); // define a function pointer type
FuncType func = (FuncType)code; // take the address of the code
func(); // and jump to it!
but most environments will crash if you try this, for security reasons. (Many viruses work by convincing ordinary programs to do something like this.)
In a normal environment, one thing you could do is create a complete program as text, then invoke the compiler to compile it and invoke the resulting executable.
If you want to run code in your own memory space, you could invoke the compiler to build you a DLL (or .so, depending on your platform) and then link in the DLL and jump into it.
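As an illustration of that route, here is a Python sketch of the same mechanism (generate source, invoke a compiler, load the shared library and jump into it); it assumes a C compiler is available as cc on the PATH, and the snippet and function name are made up:

import ctypes
import os
import subprocess
import tempfile

def compile_and_call(c_source, func_name):
    """Compile a C snippet to a shared library and call one of its functions."""
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "snippet.c")
    lib = os.path.join(workdir, "libsnippet.so")
    with open(src, "w") as f:
        f.write(c_source)
    subprocess.run(["cc", "-shared", "-fPIC", "-o", lib, src], check=True)
    return getattr(ctypes.CDLL(lib), func_name)()

print(compile_and_call("int answer(void) { return 42; }", "answer"))   # 42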
First, I want to say that I have never implemented something like this myself and I may be way off; however, have you tried the CodeDomProvider class in the System.CodeDom.Compiler namespace? I have a feeling the classes in System.CodeDom can provide the functionality you are looking for.
Of course, it will all be .NET code, not any other platform.
Go here for sample
Yes, you just have to build a compiler (and possibly a linker) and you're there.
Several languages such as Python can be embedded into C/C++ so that may be an option.
It's kind of sort of possible, but not with just straight C/C++. You'll need some layer underneath such as LLVM.
Check out c-repl and ccons
One way that you could do this is with Boost Python. You wouldn't be using C++ at that point, but it's a good way of allowing the user to use a scripting language to interact with the existing program. I know it's not exactly what you want, but perhaps it might help.
Sounds like you're trying to create "C++Script", which doesn't exist as far as I know. C++ is a compiled language, which means it always must be compiled to native machine code before being executed. You could wrap the code as a function, run it through a compiler, then execute the resulting DLL dynamically, but you're not going to get access to anything a compiled DLL wouldn't normally get.
You'd be better off trying to do this in Java, JavaScript, VBScript, or .NET, which are at one stage or another interpreted languages. Most of these languages either have an eval or execute function for just that, or can just be included as text.
Of course executing blocks of code isn't the safest idea - it will leave you vulnerable to all kinds of data execution attacks.
My recommendation would be to create a scripting language that serves the purposes of your application. This would give the user a limited set of instructions for security reasons, and allow you to interact with the existing program much more dynamically than a compiled external block.
Not easily, because C++ is a compiled language. Several people have pointed out roundabout ways to make it work: either execute the compiler, or incorporate a compiler or interpreter into your program. If you want to go the interpreter route, you can save yourself a lot of work by using an existing open source project such as Lua.