How does a symbol table relate to static chains and scoping?

I am taking a principles of programming languages course right now but I cannot for the life of me figure this out. This is not homework just a general concept question.
In our class we have talked about static chains and displays. I think I understand why we need these: otherwise, when we have nested methods, we cannot figure out which variable we are talking about.
My prof has also talked about a symbol table. My question is what is the symbol table used for? How does it relate to the static chains?
I will give some background (please correct me if I am wrong).
(I am going to define a few things just to make explanations easier)
Suppose we have this code:
main(){
    int i;
    int j;
    int k;
    a(){
        int i;
        int j;
        innerA(){
            int i = 5;
            print(i);
            print(j);
            print(k);
        }
    }
    b(){
        ...
    }
    ...
}
And this stack:
| innerA |
| a |
| b |
| main |
-----------
Quick description of static chains as a refresher.
Static chains are used to find which variable should be used when variables are redefined inside an inner function. In the stack shown above each frame will have a pointer to the method that contains it. So:
| innerA | \\ pointer to a
| a | \\ pointer to main
| b | \\ pointer to main
| main | \\ pointer to global variables
-----------
(Assuming static scoping, for dynamic scoping I think that every stack frame will just point to the one below it)
I think that when we execute print(<something>) inside the innerA method this will happen:
currentStackFrame = innerAStackFrame;
while(true){
    if(<something> is declared in currentStackFrame){
        print(<something>);
        break;
    } else {
        currentStackFrame = currentStackFrame.containedIn();
    }
}
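That lookup loop can be sketched as runnable code. Here is a minimal Python model of it; all the names (`Frame`, `static_link`, the frame variables) are illustrative, not from any real runtime:

```python
# A minimal sketch of looking a variable up along the static chain:
# each frame holds its locals plus a link to the frame of the
# lexically enclosing function.
class Frame:
    def __init__(self, locals_, static_link=None):
        self.locals = locals_            # name -> value
        self.static_link = static_link   # frame of the enclosing function

def lookup(frame, name):
    while frame is not None:
        if name in frame.locals:
            return frame.locals[name]
        frame = frame.static_link        # climb one lexical level
    raise NameError(name)

# Mirrors the example: main declares i, j, k; a redeclares i, j; innerA redeclares i.
main_f  = Frame({"i": 0, "j": 1, "k": 2})
a_f     = Frame({"i": 10, "j": 11}, static_link=main_f)
inner_f = Frame({"i": 5}, static_link=a_f)

print(lookup(inner_f, "i"))  # 5  (innerA's own i)
print(lookup(inner_f, "j"))  # 11 (a's j)
print(lookup(inner_f, "k"))  # 2  (main's k)
```

Note how each print climbs a different number of links, which is exactly the cost the answers below are talking about.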
Quick refresher of symbol table
I am not really sure what a symbol table is for. But this is what it looks like:
The index is the hash value;
the value is a reference.
__
| |
|--| --------------------------------------------------
| | --------------------> | link to next | name | type | scope level | other |
|--| --------------------------------------------------
| | |
|--| ---------------
| | |
|--| | --------------------------------------------------
| | -------> | link to next | name | type | scope level | other |
|--| --------------------------------------------------
| |
|--|
link to next - if more than one thing has the same hash value, this is a link to the next entry
name - name of the element (examples: i, j, a, int)
type - what the thing is (examples: variable, function, parameter)
scope level - not 100% sure how this is defined. I think that:
0 would be built-ins
1 would be globals
2 would be main method
3 would be a and b
4 would be innerA
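To make the diagram concrete, here is a minimal Python sketch of a hashed symbol table with external chaining, matching the fields shown above (the class and field names are my own invention):

```python
# Each bucket is a linked list of entries carrying name, kind, and scope level,
# like the "link to next | name | type | scope level" boxes in the diagram.
class Entry:
    def __init__(self, name, kind, scope_level, next_=None):
        self.name = name
        self.kind = kind                 # e.g. "variable", "function"
        self.scope_level = scope_level
        self.next = next_                # "link to next" on a hash collision

class SymbolTable:
    def __init__(self, size=8):
        self.buckets = [None] * size

    def insert(self, name, kind, scope_level):
        i = hash(name) % len(self.buckets)
        # Prepend, so the innermost declaration is found first.
        self.buckets[i] = Entry(name, kind, scope_level, self.buckets[i])

    def lookup(self, name):
        e = self.buckets[hash(name) % len(self.buckets)]
        while e is not None and e.name != name:
            e = e.next
        return e

st = SymbolTable()
st.insert("i", "variable", 2)      # main's i
st.insert("i", "variable", 4)      # innerA's i shadows it
print(st.lookup("i").scope_level)  # 4
```

Prepending on insert is what makes shadowing work: the most recent (innermost) declaration of a name sits at the head of its chain.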
Just to restate my questions:
What is the symbol table used for?
How does it relate to the static chains?
Why do we need static chains since the scope information is in the symbol table.

Note that "symbol table" can mean two different things: it could mean the internal structure used by the compiler to determine which alias of a variable has scope where, or it could mean the list of symbols exported by a library to its users at load time. Here, you're using the former definition.
The symbol table is used to determine which memory address a user is referring to when they employ a certain name. When you say "x", which alias of "x" do you want?
The reason you need to keep both a static chain and a symbol table is this: when the compiler needs to determine which variables are visible in a certain scope, it needs to "unmask" the variables previously aliased in the inner scope. For instance, when moving from innerA back to a, the variable i changes its memory address. The same thing happens again going from a to main. If the compiler did not keep a static chain, it would have to traverse the whole symbol table. That's expensive if you've got lots of names. With static chains, the compiler just looks at the current level, removes the last definition of each variable contained in it, and then follows the link up one scope. If, on the other hand, you didn't have the symbol table, then every variable access not in the local scope would make the compiler have to walk the static chain.
Summing up, you can reconstruct the symbol table from the static chain, and vice versa. But you really want to have both to make the common-case operations fast. If you lack the symbol table, compiling will take longer because each non-locally-scoped variable access will require climbing the static chain. If you lack the static chain, compiling will take longer because leaving a scope will require walking the symbol table to remove now-defunct entries.
Incidentally, if you're not already using Michael Scott's Programming Language Pragmatics, you should take a look at it. It's by far the best textbook on this topic I've seen.

This obviously refers to some specific class implementation, and in order to understand it I'd strongly recommend talking to somebody connected with the class.
The symbol table is what translates source code identifiers into something the compiler can use. It keeps the necessary descriptions. It tends to be used throughout the compilation process. The "type" you mention looks like it would be intended for parsing, and there would doubtless be more entries (in the "other") for later stages.
It's hard to know how it relates to the static chains, or why they're needed, since you don't even know what the "scope level" is. However, note that both a() and b() may have a variable i; you seem to think they've got the same scope level, so you need something to differentiate them.
Also, the static chain is frequently an optimization, so the compiler knows which symbol table entries to accept. Without a static chain, the compiler would have to do some lookups to reject an entry in b for something encountered in innerA.
To get anything more useful, you're going to have to explain more about what's going on (I'd strongly suggest talking to the instructor or TAs or whatever) and probably have more specific questions.

Related

ANTLR4 Custom token types C#

It's quite hard for me to explain using the title...
But essentially the situation is that I want to be able to create Classes or Types in my grammar.
I have my primitive types but other than that the only way I can think of making new "custom" types is if I allow a type to also be an IDENTIFIER, which causes issues on its own... Is there another way I am missing to look at this? And if not, how can I properly create it?
So far my type rules look something like:
var_type : type | VAR;
return_type : type | VOID;
type
: primitive_type
| IDENTIFIER
;
primitive_type
: INT_TYPE
| FLOAT_TYPE
| BOOL_TYPE
| STRING_TYPE
| DYNAMIC
;
I am quite new to using Antlr...
Edit
The difference between Var and Dynamic is the same as in C# (sorta; not really in the way it's built, but the idea). Var just sets the type to the expression's type, while Dynamic actually DOES have a type; it just allows you to change (and keeps track of) the type at runtime.
Types that are defined in the source code of the language you are parsing would be a semantic concern. That is, they are not a syntax concern (generally, the job of a parser is syntax validation).
If you introduce a constraint that types must be defined before their use, you could add a semantic predicate to check it at parse time (by collecting something like a hash map/dictionary of the types that you've seen so far). Most "modern" languages would not accept that constraint.
That leaves semantic concerns to be dealt with after the parse has produced a parse tree. ANTLR helps out by providing Visitors/Listeners to make it easy to process the parse tree for this type of thing, but it's still up to you to write code to handle this concern.
Generally, you'd write something that builds a symbol table (taking whatever scoping rules your language has into account), and then validate any type references against the symbol table for its scope. Of course, there are many ways to do this, as many languages handle it differently.
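As a rough illustration of that last step, here is a language-agnostic Python sketch, deliberately not tied to ANTLR's API, of validating type references against a scoped symbol table collected during a tree walk (every name here is hypothetical):

```python
# Scopes form a chain to their parent; a type reference resolves if it is a
# primitive or was declared in this scope or any enclosing one.
class Scope:
    def __init__(self, parent=None):
        self.parent = parent
        self.types = set()

    def declare(self, name):
        self.types.add(name)

    def resolves(self, name):
        s = self
        while s is not None:
            if name in s.types:
                return True
            s = s.parent
        return False

PRIMITIVES = {"int", "float", "bool", "string", "dynamic"}

def check_type_ref(scope, name):
    if name in PRIMITIVES or scope.resolves(name):
        return True
    raise TypeError("unknown type: " + name)

globals_ = Scope()
globals_.declare("Vec2")              # a class declaration seen during the walk
inner = Scope(parent=globals_)
print(check_type_ref(inner, "Vec2"))  # True
```

In a real ANTLR project, the `declare` calls would live in a listener's enter methods for type-declaration rules, and `check_type_ref` in the visit of each `IDENTIFIER`-typed `type` node.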

How to generate stable id for AST nodes in functional programming?

I want to substitute a specific AST node into another, and this substituted node is specified by interactive user input.
In non-functional programming, you can use mutable data structures, and each AST node has an object reference, so when I need to refer to a specific node, I can use that reference.
But in functional programming, using IORef is not recommended, so I need to generate an id for each AST node, and I want this id to be stable, which means:
when a node is not changed, the generated id will also not change.
when a child node is changed, its parent's id will not change.
And, to make it clear that it is an id instead of a hash value:
two different subnodes that compare equal but correspond to different parts of an expression should have different ids.
So, what should I do to approach this?
Perhaps you could use the path from the root to the node as the id of that node. For example, for the datatype
data AST = Lit Int
| Add AST AST
| Neg AST
You could have something like
data ASTPathPiece = AddGoLeft
| AddGoRight
| NegGoDown
type ASTPath = [ASTPathPiece]
This satisfies conditions 2 and 3 but, alas, it doesn't satisfy 1 in general. The index into a list will change if you insert a node in a previous position, for example.
If you are rendering the AST into another format, perhaps you could add hidden attributes in the result nodes that identified which ASTPathPiece led to them. Traversing the result nodes upwards to the root would let you reconstruct the ASTPath.
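To make the path-as-id idea concrete, here is a small Python sketch that mirrors the Haskell ASTPathPiece encoding above (the nested-tuple encoding of the AST is just for illustration):

```python
# An id is the list of steps taken from the root to reach the node.
# The step names mirror the ASTPathPiece constructors.
ADD_GO_LEFT, ADD_GO_RIGHT, NEG_GO_DOWN = "AddGoLeft", "AddGoRight", "NegGoDown"

def node_at(ast, path):
    """Follow a path of ASTPathPiece-style steps down the tree."""
    for step in path:
        tag = ast[0]
        if tag == "Add" and step == ADD_GO_LEFT:
            ast = ast[1]
        elif tag == "Add" and step == ADD_GO_RIGHT:
            ast = ast[2]
        elif tag == "Neg" and step == NEG_GO_DOWN:
            ast = ast[1]
        else:
            raise KeyError("path does not match tree shape")
    return ast

# Neg (Add (Lit 1) (Lit 2)) encoded as nested tuples.
tree = ("Neg", ("Add", ("Lit", 1), ("Lit", 2)))
print(node_at(tree, [NEG_GO_DOWN, ADD_GO_RIGHT]))  # ('Lit', 2)
```

Two structurally equal `("Lit", 2)` leaves in different positions would get different paths, which is exactly the id-versus-hash distinction the question asks for.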

Creating an AST and manage, at the same time, a symbol Table using Haskell's Happy parsers

I am creating a simple imperative language from scratch. I already have a working syntax tree, without much complication; it just uses bottom-up parsing to build it with a simple tree data structure. The idea now is to implement a complete LeBlanc-Cook style symbol table. Its structure is not complicated; the problem is that I don't know how to make Happy fill it while creating the tree at the same time.
The idea behind doing it all in a single pass is that this way the AST can be filled with only the minimum necessary, ignoring stuff like variable or type declarations, whose only effects are on the symbol table. Traversing the AST to fill the table is my last option.
I have the basic idea of it being some kind of global state, modified only at certain select times, like when a new block is opened or when a variable is declared, but I have no idea how to use the environment that Happy gives me together with whatever monadic structure I would create.
I know this question could be reduced to something like "how does happy work?", but anyways. Any comment is appreciated.
Here's an example to illustrate a little better my question.
%monad { MyState }
...
START: INSTRUCTIONS { (AST_Root $1, Final_Symtable_State) } -- Ideally
INSTRUCTIONS: INSTRUCTIONS INSTRUCTION { $2:$1 } -- A list of all instructions
| INSTRUCTION { [$1] }
INSTRUCTION : VARDEF {%???}
| TYPEDEF
| VARMOD
| ...
...
VARDEF: let identifier : Int {%???} -- This should modify the symtable state and not the tree
| let identifier : Int = number {%???} -- This should modify the symtable and provide a new branch for the tree
It seems like a single state might not work out, and there is a combination of monadic and non-monadic actions; what would be the best structure to work this out?
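Setting Happy aside for a moment, the single-pass idea of threading a symbol-table state through the reduction actions can be sketched in Python (every name here is illustrative; a real solution would live inside Happy's %monad actions):

```python
# Each "action" both updates a shared symbol-table state and, when the
# construct belongs in the AST, returns a node; a pure declaration
# returns None and leaves no trace in the tree.
def var_def(state, name, ty, init=None):
    state["symtable"].append((name, ty, state["level"]))  # always recorded
    # Only an initialised definition contributes an AST branch (an assignment).
    return ("assign", name, init) if init is not None else None

def parse_program(decls):
    state = {"symtable": [], "level": 0}
    ast = [node for d in decls if (node := var_def(state, *d)) is not None]
    return ast, state["symtable"]

# let x : Int       -> symbol table only
# let y : Int = 3   -> symbol table entry plus an AST branch
ast, symtable = parse_program([("x", "Int"), ("y", "Int", 3)])
print(ast)       # [('assign', 'y', 3)]
print(symtable)  # [('x', 'Int', 0), ('y', 'Int', 0)]
```

The Haskell analogue would be a State monad carrying the table, with each `{% ... }` action running in it and START returning the pair of final AST and final table.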

Converting Antlr3 Syntactic Predicates to equivalent grammar in Antlr4, when number of syntactic predicates is large

I have Antlr3 parser rule that looks like this:
ruleA:
(TOKEN_1) => TOKEN_1 ruleToken1
| (TOKEN_2) => TOKEN_2 ruleToken2
....
....
<many more such rules>
| genericRuleA
;
Assume for a moment that the tokens and rules are appropriately defined. Also, genericRuleA is defined in a way that it behaves like a "catch-all" for anything that falls through the rule elements above it.
So, for example, the subrules above genericRuleA correspond to named functions and ruleToken1 and ruleToken2 capture how those named functions are called. genericRuleA would capture any other functions that are not named functions (e.g. a user defined function).
The net effect of this grammar is that if ruleA finds a TOKEN_1, it takes the ruleToken1 path and reports an error if the rest of the input did not satisfy ruleToken1.
In Antlr4, I would end up with the following parser rule after taking out the syntactic predicates:
ruleA:
TOKEN_1 ruleToken1
| TOKEN_2 ruleToken2
....
....
<many more such rules>
| genericRuleA
;
This works well, except, if the rest of ruleToken1 fails, the parser automatically picks genericRuleA as the preferred path. So now, I have the side effect of being able to "overload" named functions. That may be useful in some situations, but in my situation, the requirement is to expressly not allow this overloading, i.e. named functions must conform to the specific structure laid out in ruleToken1, and report an error if that structure is violated. The system for which this grammar is being written does not support overloaded functions.
genericRuleA must cater to anything other than named functions.
My first question: Is there a standard way to implement this conversion?
One approach I have seen is to create a list of tokens that correspond to the named functions (TOKEN_1, TOKEN_2, etc.); construct a #parser::member function that returns true if the input token has membership in this list. So assuming a isNamedFunction() appropriately defined, the rule would look something like:
ruleA:
TOKEN_1 ruleToken1
| TOKEN_2 ruleToken2
| {!isNamedFunction()}? genericRuleA
;
This might work when you have only a few named functions, but if you have potentially hundreds (think builtin functions in TSQL for example), that list would be pretty cumbersome to build. Not to mention that as new named functions came about, I would have to keep updating that list.
So my follow up question is: Assuming the semantic predicate approach outlined above is the right way to do this, is there a way to programmatically assemble the list?
One approach that appears to hold promise is to build this from getExpectedTokens(). Here I would re-structure the rule a little bit so all named functions fall under one rule (let's say we call it namedRuleA) like so:
ruleA:
namedRuleA
| genericRuleA
;
namedRuleA:
TOKEN_1 ruleToken1
| TOKEN_2 ruleToken2
....
....
<many more such rules>
;
Then starting from ruleA's ATNState [recognizer.getATN().states.get(recognizer.getState())] I would walk down the transitions till I arrive at namedRuleA's ATNState and then use getATN().getExpectedTokens(<namedRuleA's ATNState>, null) to get an IntervalSet that corresponds to the token types of TOKEN_1, TOKEN_2, etc.
This appears to be yielding the right set of tokens, and appears to make sense since (my understanding is that) the ATNState and its transitions are known and fixed at transpile time i.e. during .g4 --> .java transpiling (or whatever your target language is).
I know sometimes the transitions are dynamically determined through closure() calls, e.g. when there are semantic predicates involved, but let's assume I can ensure no semantic predicates would be used in namedRuleA.
I just wanted to get a sense if somebody else tried this approach and if there might be a gotcha that I'm completely missing.

Resolve union structure in Rust FFI

I have a problem with resolving the C union structure XEvent.
I'm experimenting with Xlib and the X Record Extension in Rust. I generate FFI bindings with rust-bindgen. All the code is hosted on GitHub: alxkolm/rust-xlib-record.
The trouble happens on line src/main.rs:106 when I try to extract data from the XEvent structure.
let key_event: *mut xlib::XKeyEvent = event.xkey();
println!("KeyPress {}", (*key_event).keycode); // this always print 128 on any key
My program listens for key events and prints out the keycode. But it is always 128 for any key I press. I think this is a wrong conversion from the C union type to the Rust type.
The definition of XEvent starts here: src/xlib.rs:1143. It's the C union. Original C definition here.
The code from GitHub can be run with the cargo run command. It compiles without errors.
What am I doing wrong?
Beware that rust-bindgen generates bindings to C unions with as much safety as in C; as a result, when calling:
event.xkey(); // gets the C union 'xkey' field
There is no runtime check that xkey is the field currently containing a value.
This is because, since C does not have tagged unions (i.e., a union that knows which field is currently in use), developers came up with various ways of encoding this information (*), the two that I know of being:
an external supplier; typically another field of the structure right before the union
the first field of each of the structures in the union
Here, you are in the latter case: int type; is the first field of the union and each nested structure starts with int _type; to denote this. As a result, you need a two-step approach:
consult type()
depending on the value, call the correct reinterpretation
The mapping from the value of type to the actual field being used should be part of the documentation of the C library, hopefully.
I invite you to come up with a wrapper around this low-level union that will make it safer to retrieve the result. At the very least, you could check it is the right type in the accessor; the full approach being to come up with a Rust enum that would wrap proxies to all the fields and allow pattern-matching.
(*) and actually sometimes disregard it altogether; for example, in C99, a union { float f; int i; } can be used to reinterpret a float as an int.
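The two-step discipline described above (consult the tag, then reinterpret) can be illustrated outside Rust as well. Here is a Python ctypes sketch of a C-style tagged union loosely modelled on XEvent; the struct layouts, field names, and the KEY_PRESS constant are illustrative, not Xlib's real definitions:

```python
import ctypes

# Each member structure starts with the same int tag, mirroring how every
# XEvent variant begins with "int type".
class KeyEvent(ctypes.Structure):
    _fields_ = [("type", ctypes.c_int), ("keycode", ctypes.c_uint)]

class ButtonEvent(ctypes.Structure):
    _fields_ = [("type", ctypes.c_int), ("button", ctypes.c_uint)]

class Event(ctypes.Union):
    _fields_ = [("type", ctypes.c_int),   # the tag, overlaying every member
                ("xkey", KeyEvent),
                ("xbutton", ButtonEvent)]

KEY_PRESS = 2  # stand-in for X11's KeyPress constant

def keycode_of(ev):
    # Step 1: consult the tag. Step 2: only then reinterpret the union.
    if ev.type == KEY_PRESS:
        return ev.xkey.keycode
    return None

ev = Event()
ev.xkey.type = KEY_PRESS
ev.xkey.keycode = 38
print(keycode_of(ev))  # 38
```

Reading `ev.xbutton.button` here would silently reinterpret the key event's bytes, which is exactly the hazard a safe Rust wrapper (or enum with pattern matching) is meant to rule out.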
