Megaparsec: macro expansion during parsing - haskell

In a small DSL, I'm parsing macro definitions, similarly to #define C pre-processor directives (here a simplistic example):
_def mymacro(a,b) = a + b / a
When the following call is encountered by the parser
c = mymacro(pow(10,2),3)
it is expanded to
c = pow(10,2) + 3 / pow(10,2)
My current approach is:
wrap the parser in a State monad
when parsing macro definitions, store them in the state, with their body unparsed (parse it as a string)
when parsing a macro call, find the definition in the state, replace the arguments in the body text, replace the call with this body and resume the parsing.
Some code from the last step:
macrocallStmt
  = do -- capture starting position and content of old input before macro call
       oldInput <- getInput
       oldPos <- getPosition
       -- parse the call
       ret <- identifier
       symbolCS "="
       i <- identifier
       args <- parens $ commaSep anyExprStr
       -- expand the macro call
       us <- get
       let inlinedCall = replaceMacroArgs i args ret us
       -- set up new input with macro call expanded
       remainder <- getInput
       let newInput = T.append inlinedCall (T.cons '\n' remainder)
       setPosition oldPos
       setInput newInput
       -- update the expanded input script
       modify (updateExpandedInput oldInput newInput)
anyExprStr = fmap praShow expression <|> fmap praShow algexpr
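The body of replaceMacroArgs is not shown. As a rough illustration of the "replace the arguments in the body text" step, here is a minimal sketch that substitutes whole identifiers only. The names isIdentChar, chunks and expandBody are hypothetical and not taken from the question, and the identifier rule (alphanumerics and underscore) is an assumption about the DSL.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Char (isAlphaNum)
import qualified Data.Map.Strict as M
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Characters that may appear in an identifier (an assumption about the DSL).
isIdentChar :: Char -> Bool
isIdentChar c = isAlphaNum c || c == '_'

-- Split the body into alternating runs of identifier and non-identifier text.
chunks :: T.Text -> [T.Text]
chunks = T.groupBy (\x y -> isIdentChar x == isIdentChar y)

-- Replace every chunk that is exactly a formal parameter by its argument text.
-- Each chunk is looked up once, so inserted argument text is never rescanned.
expandBody :: [T.Text] -> [T.Text] -> T.Text -> T.Text
expandBody params args = T.concat . map subst . chunks
  where
    env = M.fromList (zip params args)
    subst c = M.findWithDefault c c env

main :: IO ()
main = TIO.putStrLn (expandBody ["a", "b"] ["pow(10,2)", "3"] "a + b / a")
```

Because only whole identifier chunks are replaced, a parameter named a does not touch the a inside pow. The sketch does not cover building keywords out of arguments, and it does not treat string literals in the body specially.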
This approach does the job decently. However, it has a number of drawbacks.
Parsing multiple times
Any valid DSL expression can be an argument of the macro call. Therefore, even though I only need their textual representation (to be replaced in the macro body), I need to parse them and then convert them again to string - simply looking for the next comma wouldn't work. Then the complete and customised macro will be parsed. So in practice, macro arguments get parsed twice (and also show-ed, which has its cost). Moreover, each call requires a new parsing of the (almost same) body. The reason to keep the body unparsed in memory is to allow maximum flexibility: in the body, even DSL keywords could be constructed out of the macro arguments.
Error handling
Because the expanded body is inserted in front of the unconsumed input (replacing the call), the initial and final input can be quite different. In the event of a parse error, the position where the error occurred in the expanded input is available. However, when processing the error, I only have the original, not expanded, input. So the error position won't match.
That is why, in the code snippet above, I use the state to save the expanded input, so that it is available when the parser exits with an error.
This works well, but I noticed that it becomes quite costly, with new Text arrays (the input stream is Text) being allocated for the whole stream at every expansion. Perhaps keeping the expanded input in the state as String, rather than Text, would be cheaper in this case, i.e. when a middle part needs to be replaced?
The reasons for this question are:
I would appreciate suggestions / comments on the two issues described above
Can anyone suggest a better approach altogether?

Related

Including comments in AST

I'm planning on writing a Parser for some language. I'm quite confident that I could cobble together a parser in Parsec without too much hassle, but I thought about including comments into the AST so that I could implement a code formatter in the end.
At first, adding an extra parameter to the AST types seemed like a suitable idea (this is basically what was suggested in this answer). For example, instead of having
data Expr = Add Expr Expr | ...
one would have
data Expr a = Add a (Expr a) (Expr a) | ...
and use a for whatever annotation (e.g. for comments that come after the expression).
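A minimal sketch of such a parameterised tree, assuming a hypothetical Comments annotation type and a Lit constructor that are not in the question. The recursive positions have to carry the parameter as well, and deriving Functor makes it cheap to strip or transform annotations.

```haskell
{-# LANGUAGE DeriveFunctor #-}
module Main where

-- Expression type with an annotation slot on every node.
data Expr a
  = Lit a Int
  | Add a (Expr a) (Expr a)
  deriving (Show, Eq, Functor)

-- Hypothetical annotation: the comments that follow a node.
newtype Comments = Comments [String]
  deriving (Show, Eq)

-- Read the annotation stored at the root of an expression.
annotation :: Expr a -> a
annotation (Lit a _)   = a
annotation (Add a _ _) = a

-- Drop all annotations, leaving only the shape of the tree.
strip :: Expr a -> Expr ()
strip = fmap (const ())

example :: Expr Comments
example = Add (Comments ["sum"]) (Lit (Comments []) 1) (Lit (Comments ["two"]) 2)

main :: IO ()
main = do
  print (annotation example)
  print (strip example)
```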
However, there are some not so exciting cases. The language features C-like comments (// ..., /* .. */) and a simple for loop like this:
for (i in 1:10)
{
... // list of statements
}
Now, excluding the body there are at least 10 places where one could put one (or more) comments:
/*A*/ for /*B*/ ( /*C*/ i /*E*/ in /*F*/ 1 /*G*/ : /*H*/ 10 /*I*/ ) /*J*/
{ /*K*/
...
In other words, while the for loop could previously be comfortably represented as an identifier (i), two expressions (1 & 10) and a list of statements (the body), we would now have to include at least 10 more parameters or records for annotations.
This gets ugly and confusing quite quickly, so I wondered whether there is a clearly better way to handle this. I'm certainly not the first person wanting to write a code formatter that preserves comments, so there must be a decent solution, or is writing a formatter just that messy?
You can probably capture most of those positions with just two generic comment productions:
Expr -> Comment Expr
Stmt -> Comment Stmt
This seems like it ought to capture comments A, C, F, H, J, and K for sure; possibly also G depending on exactly what your grammar looks like. That only leaves three spots to handle in the for production (maybe four, with one hidden in Range here):
Stmt -> "for" Comment "(" Expr Comment "in" Range Comment ")" Stmt
In other words: one before each literal string but the first. Seems not too onerous, ultimately.
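The generic "Comment X" production can be sketched as a Parsec combinator. This is only an illustration: Commented, blockComment, lineComment, commented and identifier are hypothetical names, and the comment syntax is assumed to be the C-like one from the question.

```haskell
module Main where

import Text.Parsec
import Text.Parsec.String (Parser)

-- A node paired with the comments that appeared directly before it.
data Commented a = Commented [String] a
  deriving (Show, Eq)

-- /* ... */ comment; returns the text between the delimiters.
blockComment :: Parser String
blockComment = try (string "/*") *> manyTill anyChar (try (string "*/")) <* spaces

-- // ... comment running to the end of the line or the end of input.
lineComment :: Parser String
lineComment = try (string "//") *> manyTill anyChar (() <$ endOfLine <|> eof) <* spaces

-- Generic production "X -> Comment X": collect leading comments, then run p.
commented :: Parser a -> Parser (Commented a)
commented p = Commented <$> many (blockComment <|> lineComment) <*> p

-- Stand-in for a real expression or statement parser.
identifier :: Parser String
identifier = many1 letter <* spaces

main :: IO ()
main = print (parse (spaces *> commented identifier <* eof) "" "/*A*/ /*B*/ i")
```

Wrapping the Expr and Stmt parsers in commented once keeps the comment slots out of the individual constructors; only the positions before literal tokens such as "(" or "in" still need explicit handling.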

Fortran CHARACTER FUNCTION without defined size [duplicate]

I am writing the following simple routine:
program scratch
    character*4 :: word
    word = 'hell'
    print *, concat(word)
end program scratch
function concat(x)
    character*(*) x
    concat = x // 'plus stuff'
end function concat
The program should be taking the string 'hell' and concatenating to it the string 'plus stuff'. I would like the function to be able to take in any length string (I am planning to use the word 'heaven' as well) and concatenate to it the string 'plus stuff'.
Currently, when I run this on Visual Studio 2012 I get the following error:
Error 1 error #6303: The assignment operation or the binary
expression operation is invalid for the data types of the two
operands. D:\aboufira\Desktop\TEMP\Visual
Studio\test\logicalfunction\scratch.f90 9
This error is for the following line:
concat = x // 'plus stuff'
It is not apparent to me why the two operands are not compatible. I have set them both to be strings. Why will they not concatenate?
High Performance Mark's comment tells you about why the compiler complains: implicit typing.
The result of the function concat is implicitly typed because you haven't declared its type otherwise. Although x // 'plus stuff' is the correct way to concatenate character variables, you're attempting to assign that new character object to an (implicitly) real function result.
Which leads to the question: "just how do I declare the function result to be a character?". Answer: much as you would any other character variable:
character(len=length) concat
[note that I use character(len=...) rather than character*.... I'll come on to exactly why later, but I'll also point out that the form character*4 is obsolete according to current Fortran, and may eventually be deleted entirely.]
The tricky part is: what is the length it should be declared as?
When declaring the length of a character function result which we don't know ahead of time there are two [1] approaches:
an automatic character object;
a deferred length character object.
In the case of this function, we know that the length of the result is 10 longer than the input. We can declare
character(len=LEN(x)+10) concat
To do this we cannot use the form character*(LEN(x)+10).
In a more general case, deferred length:
character(len=:), allocatable :: concat ! Deferred length, will be defined on allocation
where later
concat = x//'plus stuff' ! Using automatic allocation on intrinsic assignment
Using these forms adds the requirement that the function concat has an explicit interface in the main program. You'll find much about that in other questions and resources. Providing an explicit interface will also remove the problem that, in the main program, concat also implicitly has a real result.
To stress:
program
    implicit none
    character(len=[something]) concat
    print *, concat('hell')
end program
will not work when the result of concat uses one of the "length unknown at compile time" forms. Ideally the function will be an internal one, or one accessed from a module.
[1] There is a third: assumed length function result. Anyone who wants to know about this could read this separate question. Everyone else should pretend this doesn't exist. Just like the writers of the Fortran standard.

Arduino and TinyGPS++ convert lat and long to a string

I'm having a problem converting the lat and long coordinates from TinyGPS++ to a double or a string. The code that I'm using is:
String latt = ((gps.location.lat(),6));
String lngg = ((gps.location.lng(),6));
Serial.println(latt);
Serial.println(lngg);
The output that I'm getting is:
0.06
Does somebody know what I'm doing wrong? Does it have something to do with rounding (the Math.Round function in Arduino)?
Thanks!
There are two problems:
1. This does not compile:
String latt = ((gps.location.lat(),6));
The error I get is
Wouter.ino:4: warning: left-hand operand of comma has no effect
Wouter:4: error: invalid conversion from 'int' to 'const char*'
Wouter:4: error: initializing argument 1 of 'String::String(const char*)'
There is nothing in the definition of the String class that would allow this statement. I was unable to reproduce printing values of 0.06 (in your question) or 0.006 (in a later comment). Please edit your post to have the exact code that compiles, runs and prints those values.
2. You are unintentionally using the comma operator.
There are two places a comma can be used: to separate arguments to a function call, and to separate multiple expressions which evaluate to the last expression.
You're not calling a function here, so it is the latter use. What does that mean? Here's an example:
int x = (1+y, 2*y, 3+(int)sin(y), 4);
The variable x will be assigned the value of the last expression, 4. There are very few reasons that anyone would actually use the comma operator in this way. It is much more understandable to write:
int x;
1+y; // Just a calculation, result never used
2*y; // Just a calculation, result never used
3 + (int) sin(y); // Just a calculation, result never used
x = 4; // A (trivial) calculation, result stored in 'x'
The compiler will usually optimize out the first 3 statements and only generate code for the last one [1]. I usually see the comma operator in #define macros that are trying to avoid multiple statements.
For your code, the compiler sees this
((gps.location.lat(),6))
And evaluates it as a call to gps.location.lat(), which returns a double value. The compiler throws this value away, and even warns you that it "has no effect."
Next, it sees a 6, which is the actual value of this expression. The parentheses get popped, leaving the 6 value to be assigned to the left-hand side of the statement, String latt =.
If you look at the declaration of String, it does not define how to take an int like 6 and either construct a new String, or assign it 6. The compiler sees that String can be constructed from const char *, so it tells you that it can't convert a numeric 6 to a const char *.
Unlike a compiler, I think I can understand what you intended:
double latt = gps.location.lat();
double lngg = gps.location.lng();
Serial.println( latt, 6 );
Serial.println( lngg, 6 );
The 6 is intended as an argument to Serial.println. And those arguments are correctly separated by a comma.
As a further bonus, it does not use the String class, which will undoubtedly cause headaches later. Really, don't use String. Instead, hold on to numeric values, like ints and floats, and convert them to text at the last possible moment (e.g., with println).
I have often wished for a compiler that would do what I mean, not what I say. :D
[1] Depending on y's type, evaluating the expression 2*y may have side effects that cannot be optimized away. The streaming operator << is a good example of a mathematical operator (left shift) with side effects that cannot be optimized away.
And in your code, calling gps.location.lat() may have modified something internal to the gps or location classes, so the compiler may not have optimized the function call away.
In all cases, the result of the call is not assigned because only the last expression value (the 6) is used for assignment.

Haskell: Parsec trouble breaking out of pattern

For reference, here is my code: http://hpaste.org/86949
I am trying to parse the following expression: if (a[1].b[2].c.d[999].e[1+1].f > 3) { }. The method playing up is varExpr, which parses the variable member chains.
Context
In the language I am parsing, a dot can specify accessing a member variable. Since a member variable can be another object, chains can be produced, i.e. a.b.c, or essentially (a.b).c. Do not assume the dots are function composition.
Implementation
The logic is like this:
First, before <- many vocc collects all the instances of varname . and their optional array expression, leaving only a single identifier
this <- vtrm collects the remaining identifier plus array expression -- the only one not followed by a dot
Issues
I am having two issues:
Firstly, the first term [for a reason that I cannot determine] seems to always require that it be wrapped in brackets for the parser to accept it, i.e. (a[1]).b[2].c... -- subsequent terms do not require this.
Secondly, the many vocc won't stop parsing. It always expects another identifier and another dot and I am unable to terminate the expression to catch the last vtrm.
I am looking for hints or solutions that will help me solve my problem(s)/headaches. Thanks.
When varExpr runs, it checks whether the next bit of input is matched by vocc or vtrm.
varExpr = do before <- many vocc -- Zero or more occurrences
             this <- vtrm
             return undefined
The problem is that any input matched by vtrm is also matched by the first step of vocc. When varExpr runs, it runs vocc, which runs vobj, which runs vtrm.
vocc = vobj <* symbol "."
vobj = choice [try vtrm, try $ parens vtrm]
Parsing of many vocc ends when vocc fails without consuming input. This happens when both vtrm and parens vtrm fail. However, after many vocc ends, the next parser to run is vtrm—and this parser is sure to fail!
You want vocc to fail without consuming input if it doesn't find a "." in the input. For that, you need to use try.
vocc = try $ vobj <* symbol "."
Alternatively, if vobj and vtrm really should be the same syntax, you can define varExpr as vobj `sepBy1` symbol ".".
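To see the difference try makes, here is a small self-contained sketch. The vtrm and symbol definitions are simplified stand-ins (an identifier with an optional numeric index), not the code from the paste, and voccNoTry, varExprBroken, varExprFixed and varExprSep are names invented for the comparison.

```haskell
module Main where

import Text.Parsec
import Text.Parsec.String (Parser)

-- Match a literal token and skip trailing whitespace.
symbol :: String -> Parser String
symbol s = string s <* spaces

-- Simplified vtrm: an identifier with an optional numeric index.
vtrm :: Parser String
vtrm = do
  name <- many1 letter
  idx  <- option "" (between (char '[') (char ']') (many1 digit))
  spaces
  return (if null idx then name else name ++ "[" ++ idx ++ "]")

-- Without try, a vtrm that is not followed by "." fails after consuming input.
voccNoTry :: Parser String
voccNoTry = vtrm <* symbol "."

-- With try, the same failure rewinds the input, so many can stop cleanly.
vocc :: Parser String
vocc = try (vtrm <* symbol ".")

varExprBroken, varExprFixed :: Parser ([String], String)
varExprBroken = (,) <$> many voccNoTry <*> vtrm
varExprFixed  = (,) <$> many vocc <*> vtrm

-- The sepBy1 alternative.
varExprSep :: Parser [String]
varExprSep = vtrm `sepBy1` symbol "."

main :: IO ()
main = do
  let input = "a[1].b[2].c"
  print (parse (varExprBroken <* eof) "" input) -- fails: the last term was consumed by voccNoTry
  print (parse (varExprFixed <* eof) "" input)
  print (parse (varExprSep <* eof) "" input)
```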

Ungroupable line break using wl-pprint

I'm writing a pretty-printer for a simple white-space sensitive language.
I like the Leijen pretty-printer library more than I like the Wadler library, but the Leijen library has one problem in my domain: any line break I insert may be overridden by the group construct, which may compress any line, which might change the semantics of the output.
I don't think I can implement an ungroupable line in the wl-pprint (although I'd love to be wrong).
Looking a bit at the wl-pprint-extras package, I don't think that even the exposed internal interface allows me to create a line which will not be squashed by group.
Do I just have to rely on the fact that I never use group, or do I have some better option?
Given that you want to be able to group and you also need to be able to ensure some lines aren't uninserted,
why don't we use the fact that the library designers encoded the semantics in the data type,
instead of in code. This fabulous decision makes it eminently re-engineerable.
The Doc data type encodes a line break using the constructor Line :: Bool -> Doc.
The Bool represents whether to omit a space when removing a line. (Lines indent when they're there.)
Let's replace the Bool:
data LineBehaviour = OmitSpace | AddSpace | Keep
data Doc = ...
         ...
         | Line !LineBehaviour -- not Bool any more
The beautiful thing about the semantics-as-data design is that if we replace
this Bool data with LineBehaviour data, functions that didn't use it but
passed it on unchanged don't need editing. Functions that look inside at what
the Bool is break with the change - we'll rewrite exactly the parts of the code
that need changing to support the new semantics by changing the data type where
the old semantics resided. The program won't compile until we've made all the
changes we should, while we won't need to touch a line of code that doesn't
depend on line break semantics. Hooray!
For example, renderPretty uses the Line constructor, but in the pattern Line _,
so we can leave that alone.
First, we need to replace Line True with Line OmitSpace, and Line False with Line AddSpace,
line = Line AddSpace
linebreak = Line OmitSpace
but perhaps we should add our own
hardline :: Doc
hardline = Line Keep
and we could perhaps do with a binary operator that uses it
infixr 5 <->
(<->) :: Doc -> Doc -> Doc
x <-> y = x <> hardline <> y
and the equivalent of the vertical separator, for which I can't think of a better name than very vertical separator:
vvsep,vvcat :: [Doc] -> Doc
vvsep = fold (<->)
vvcat = fold (<->)
The actual removing of lines happens in the group function. Everything can stay the same except:
flatten (Line break) = if break then Empty else Text 1 " "
should be changed to
flatten (Line OmitSpace) = Empty
flatten (Line AddSpace) = Text 1 " "
flatten (Line Keep) = Line Keep
That's it: I can't find anything else to change!
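The effect of the new flatten clauses can be checked on a toy model. The Doc type below is a deliberately cut-down stand-in with only four constructors, not the library's real data type, and render is a hypothetical helper that ignores widths and indentation.

```haskell
module Main where

data LineBehaviour = OmitSpace | AddSpace | Keep
  deriving (Show, Eq)

-- Cut-down document type: just enough to show how flatten treats lines.
data Doc
  = Empty
  | Text String
  | Line LineBehaviour
  | Cat Doc Doc
  deriving (Show, Eq)

-- Remove line breaks the way group would, except for Keep lines.
flatten :: Doc -> Doc
flatten (Cat x y)        = Cat (flatten x) (flatten y)
flatten (Line OmitSpace) = Empty
flatten (Line AddSpace)  = Text " "
flatten (Line Keep)      = Line Keep
flatten other            = other

-- Naive renderer: every remaining Line becomes a newline.
render :: Doc -> String
render Empty     = ""
render (Text s)  = s
render (Line _)  = "\n"
render (Cat x y) = render x ++ render y

example :: Doc
example = Text "a" `Cat` Line AddSpace `Cat` Text "b" `Cat` Line Keep `Cat` Text "c"

main :: IO ()
main = putStrLn (render (flatten example))
```

After flattening, the AddSpace line collapses to a space while the Keep line survives as a line break.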
You do need to avoid group, yes. The library's designed to facilitate wrapping or not wrapping based on the width of the output that you specify.
Depending on the syntax of the language you're implementing, you should also be cautious about softline and softbreak and the </> and <//> operators that use them. There's no reason I can see that you can't use <$> and <$$> instead.
sep, fillSep, cat and fillCat all use group directly or indirectly (and have the indeterminate semantics/width-dependent line breaks you want to avoid). However, given your purpose, I don't think you need them:
Use vsep or hsep instead of sep or fillSep.
Use hcat or vcat instead of cat or fillCat.
You could use a line like
import Text.PrettyPrint.Leijen hiding (group,softline,softbreak,
                                       (</>),(<//>),
                                       sep,fillSep,cat,fillCat)
to make sure you don't call these functions.
I can't think of a way to ensure that functions you do use don't call group somewhere along the line, but I think those are the ones to avoid.
