Parse arithmetic/boolean expression but skip capture - Haskell

Given the following expression
x = a + 3 + b * 5
I would like to turn that into the following data structure, where I'm only interested in capturing the variables used on the RHS and keeping the string intact. I'm not interested in parsing a more specific structure, since I'm doing a transformation from language to language and not handling evaluation:
Variable "x" (Expr ["a","b"] "a + 3 + b * 5")
I've been using this tutorial as my starting point, but I'm not sure how to write an expression parser without buildExpressionParser. That doesn't seem to be the way I should approach this.

I am not sure why you want to avoid buildExpressionParser, as it hides a lot of the complexity in parsing expressions with infix operators. It is the right way to do things....
Sorry about that, but now that I got that nag out of the way, I can answer your question.
First, here is some background:
The main reason writing a parser for expressions with infix operators is hard is because of operator precedence. You want to make sure that this
x+y*z
parses as this
  +
 / \
x   *
   / \
  y   z
and not this
    *
   / \
  +   z
 / \
x   y
Choosing the correct parsetree isn't a very hard problem to solve.... But if you aren't paying attention, you can write some really bad code. Why? Performance....
The number of possible parsetrees, ignoring precedence, grows exponentially with the size of the input. For instance, if you write code to try all possibilities then throw away all but the ones with the proper precedence, you will have a nasty surprise when your parser tackles anything in the real world (remember, exponential complexity often ain't just slow, it is basically not a solution at all.... You may find that you are waiting half an hour for a simple parse; no one will use that parser).
I won't repeat the details of the "proper" solution here (a google search will give the details), except to note that the proper solution runs in O(n) time in the size of the input, and that buildExpressionParser hides all the complexity of writing such a parser for you.
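For reference, here is roughly what that looks like with Parsec's Text.Parsec.Expr (a minimal sketch; the Expr type and the parser names are made up for illustration, not taken from the question):

import Text.Parsec
import Text.Parsec.Expr
import Text.Parsec.String (Parser)

-- Made-up AST, just to show the shape of the result.
data Expr = Var String | Num Integer | BinOp String Expr Expr
  deriving Show

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

term :: Parser Expr
term = lexeme (Var <$> many1 letter)
   <|> lexeme (Num . read <$> many1 digit)

-- The table lists operators from highest to lowest precedence,
-- all left-associative here.
table = [ [ binOp "*", binOp "/" ]
        , [ binOp "+", binOp "-" ] ]
  where binOp name = Infix (BinOp name <$ lexeme (string name)) AssocLeft

expr :: Parser Expr
expr = buildExpressionParser table term

-- ghci> parse expr "" "x+y*z"
-- Right (BinOp "+" (Var "x") (BinOp "*" (Var "y") (Var "z")))

The only parts you write are the term parser and the table; buildExpressionParser turns them into a linear-time, precedence-respecting parser.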
So, back to your original question....
Do you need to use buildExpressionParser to get the variables out of the RHS, or is there a better way?
You don't need it....
Since all you care about is getting the variables used in the right side, you don't care about operator precedence. You can just make everything left associative and write a simple O(n) parser. The parsetrees will be wrong, but who cares? You will still get the same variables out. You don't even need a context free grammar for this; this regular expression basically does it:
<variable>(<operator><variable>)*
(where <variable> and <operator> are defined in the obvious way).
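If you do go that route, a minimal Parsec sketch of it could look like this (the names rhsVars, operand and so on are made up for illustration):

import Text.Parsec
import Text.Parsec.String (Parser)

-- <variable> ( <operator> <variable> )*, where an operand is either a
-- variable name (which we keep) or a numeric literal (which we drop).
rhsVars :: Parser [String]
rhsVars = do
  spaces
  first <- operand
  rest  <- many (operatorSym *> operand)
  return [v | Just v <- first : rest]

operand :: Parser (Maybe String)
operand = lexeme (Just <$> many1 letter)
      <|> lexeme (Nothing <$ many1 digit)

operatorSym :: Parser ()
operatorSym = lexeme (() <$ oneOf "+-*/")

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- ghci> parse rhsVars "" "a + 3 + b * 5"
-- Right ["a","b"]

Pairing that result with the untouched input string gives exactly the Expr ["a","b"] "a + 3 + b * 5" value from the question.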
However....
I wouldn't recommend this, because, as simple as it is, it still will be more work than using buildExpressionParser. And it will be trickier to extend (like adding parentheses). But most important, later on, you may accidentally use it somewhere where you do need a full parsetree, and be confused for a while about why the operator precedence is so completely messed up.
Another solution is that you could rewrite your grammar to remove the ambiguity (again, a google search will tell you how).... This would be good as a learning exercise, but you basically would be repeating what buildExpressionParser is doing internally.
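For the curious, a minimal sketch of such a rewritten grammar in Parsec (again with a made-up AST) uses one non-terminal per precedence level:

import Text.Parsec
import Text.Parsec.String (Parser)

-- Made-up AST; one parser per precedence level, lowest precedence outermost.
data E = V String | B Char E E
  deriving Show

expr, term, factor :: Parser E
expr   = term   `chainl1` (B <$> lexeme (oneOf "+-"))   -- lowest precedence
term   = factor `chainl1` (B <$> lexeme (oneOf "*/"))   -- binds tighter
factor = lexeme (V <$> many1 letter)
     <|> between (lexeme (char '(')) (lexeme (char ')')) expr

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- ghci> parse expr "" "x + y * z"
-- Right (B '+' (V "x") (B '*' (V "y") (V "z")))

This is, level for level, essentially the structure buildExpressionParser builds for you.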

Related

How to get the grammar production when there is an error with Ply(Yacc)?

In the yacc.py file I defined my grammar's productions and also an error handler, like this:
def p_error(p):
    if p:
        print("Error when trying to read the symbol '%s' (Token type: %s)" % (p.value, p.type))
    else:
        print("Syntax error at EOF")
    exit()
In addition to this error message, I also want to print what was the production read at the time of the error, something like:
print("Error in production: ifstat -> IF LPAREN expression RPAREN statement elsestat")
How can I do this?
Really, you can't. You particularly can't with a bottom-up parser like the one generated by Ply, but even with top-down parsers "the production read at the time" is not a very well-defined concept.
For example, consider the erroneous code:
if (x < y return 42;
in which the error is a missing parenthesis. At least, that's how I'd describe the error. But a closing parenthesis is not the only thing which could follow the y. For example, a correct program might include any of the following:
if (x < y) return 42;
if (x < y + 10) return 42;
if (x < y && give_up_early) return 42;
and many more.
So which production is the parser trying to complete when it sees the token return? Evidently, it's still trying to complete expression (which might actually have a hierarchy of different expression types, or which might be relying on precedence declarations to be a single non-terminal, or some combination of the two.) But that doesn't really help identify the error as a missing close parenthesis.
In a top-down parser, it would be possible to walk up the parser stack to get a list of partially-completed productions in inclusion order. (At least, that would be possible if the parser maintained its own stack. If it were a recursive-descent parser, examining the stack would be more complicated.)
But in a bottom-up parser, the parser state is more complicated. Bottom-up parsers are more flexible than top-down parsers precisely because they can, in effect, consider multiple productions at the same time. So there often really isn't one single partial production; the parser will decide which production it is looking at by gradually eliminating all the possibilities which don't work.
That description makes it sound like the bottom-up parser is doing a lot of work, which is misleading. The work was already done by the parser generator, which compiles a simple state transition table to guide the parse. What that means in practice is that the parser knows how to handle every possibly-correct token at each moment in the parse. So, for example, when it sees a ) following if (x < y, it immediately knows that it must finish up the expression production and proceed with the rest of the if statement.
Bison -- a C implementation of yacc -- has an optional feature which allows it to list the possible correct tokens when an error is encountered. That's not as simple as it sounds, and implementing it correctly creates a noticeable overhead in parsing time, but it is sometimes useful. (It's often not useful, though, because the list of possible tokens can be very long. In the case of the error I'm using as an example, the list would include every single arithmetic operator, as well as those tokens which could start a postfix operator. The bison extended error handler stops trying when it reaches the sixth possible token, which means that it will rarely generate an extended error message if the parse is in the middle of an expression.) In any case, Ply does not have such a feature.
Ply, like bison, does implement error recovery through the error pseudo-token. The error-recovery algorithm works best with languages which have an unambiguous resynchronisation point, as in languages with a definite statement terminator (unlike C-like languages, in which many statements do not end with ;). But you can use error productions to force the parser to pop its stack back to some containing production in order to produce a better error message. In my experience, a lot of experimentation is needed to get this strategy right.
In short, producing meaningful error messages is hard. My recommendation is to first focus on getting your parser working on correct inputs.

Haskell function with infinite list as argument [duplicate]

In The Haskell 98 Report it's said that
A floating literal must contain digits both before and after the decimal point; this ensures that a decimal point cannot be mistaken for another use of the dot character.
What other use might this be? I can't imagine any such legal expression.
(To clarify the motivation: I'm aware that many people write numbers like 9.0 or 0.7 all the time without needing to, but I can't quite get comfortable with this. I'm OK with 0.7 rather than the more compact but otherwise no better .7, but written-out trailing zeroes feel just wrong to me unless they express that some quantity is precise up to tenths, which is seldom the case on the occasions Haskell makes me write 9.0-numbers.)
I forgot it's legal to write function composition without surrounding whitespaces! That's of course a possibility, though one could avoid this problem by parsing floating literals greedily, such that replicate 3 . pred$8 ≡ ((replicate 3) . pred) 8 but replicate 3.pred$8 ≡ (replicate 3.0 pred)8.
Is there no expression where an integer literal is required to stand directly next to a ., without whitespace?
One example of other uses is a dot operator (or any other operator starting or ending with a dot): replicate 3.pred$8 (see the small example below).
Another possible use is in range expressions: [1..10].
Also, you can (almost) always write 9 instead of 9.0, thus avoiding the need for . altogether.
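To make the dot-operator point concrete, here is what a plain GHCi session (no extensions such as OverloadedRecordDot) makes of the example: because 3. is not a valid floating literal, the dot lexes as the composition operator, so both lines below mean (replicate 3 . pred) 8.

ghci> replicate 3.pred$8
[7,7,7]
ghci> (replicate 3 . pred) 8
[7,7,7]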
One of the most prominent uses of (.) is function composition. So the Haskell compiler would interpret a . 1 as composing the function a with a number and would not know what to do; analogously the other way round. Other uses of (.) can be found here.
I am not aware of other problems with .7 vs. 0.7.
I don't really see much of a problem with allowing '9.' and '.7'. I think the current design is more a reflection of the ideas of the original designers of Haskell.
While it could probably be disambiguated, I don't think there is much to be gained from allowing .7 and 7.. Code is meant to be read by people as well as machines, and it's much easier to accidentally miss a decimal point at either end of a literal than in the middle.
I'll take the extra readability over the saved byte any day.

How to run several operations back to back in a lambda in Haskell?

I'm teaching myself Haskell, and I am having difficulty understanding how I might pipeline a number of operations in the body of a lambda function without using a do block. Take the following for example:
par = (\c ->
    c + 1
    c + 2
  )
In Ruby and other imperative languages, I am used to being able to run several expressions in a lambda block by adding a linebreak between them. In Haskell, I looked for a similar construct and found that Haskell doesn't respect linebreaks in pure expressions.
So, syntactically, if I wanted to run the second line after the first without using a do block, what could I do?
Having multiple independent expressions with no side effects makes no sense in Haskell (nor in any other language, for that matter). You can use let to store the value of an expression in a name and then use it in the subsequent expression (the return value of the lambda), or a do-block, but anything else is meaningless in Haskell and thus not a valid program.
In other words: it makes little sense to ask this question solely about syntax as Haskell's syntax is all about its semantics.
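As a minimal sketch of that let-based shape (the names here are made up), the intermediate result is bound to a name and the lambda's body is a single expression that uses it:

par = \c ->
  let afterFirst = c + 1
  in  afterFirst + 2

-- ghci> par 10
-- 13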

What are some examples of where using parentheses in a program lowers readability?

I always thought that parentheses improved readability, but in my textbook there is a statement that the use of parentheses dramatically reduces the readability of a program. Does anyone have any examples?
I can find plenty of counterexamples where the lack of parentheses lowered the readability, but the only example I can think of for what the author may have meant is something like this:
if(((a == null) || (!(a.isSomething()))) && ((b == null) || (!(b.isSomething()))))
{
    // do some stuff
}
In the above case, the ( ) around the method calls are unnecessary, and this kind of code may benefit from factoring terms out into variables. With all of those close parens in the middle of the condition, it's hard to see exactly what is grouped with what.
boolean aIsNotSomething = (a == null) || !a.isSomething(); // parens for readability
boolean bIsNotSomething = (b == null) || !b.isSomething(); // ditto
if(aIsNotSomething && bIsNotSomething)
{
    // do some stuff
}
I think the above is more readable, but that's a personal opinion. That may be what the author was talking about.
Some good uses of parens:
to distinguish between order of operation when behavior changes without the parens
to distinguish between order of operation when behavior is unaffected, but someone who doesn't know the binding rules well enough is going to read your code. The good citizen rule.
to indicate that an expression within the parens should be evaluated before used in a larger expression: System.out.println("The answer is " + (a + b));
Possibly confusing use of parens:
in places where it can't possibly have another meaning, like in front of a.isSomething() above. In Java, if a is an Object, !a by itself is an error, so clearly !a.isSomething() must negate the return value of the method call.
to link together a large number of conditions or expressions that would be clearer if broken up. As in the code example up above, breaking up the large parenthetical statement into smaller chunks can allow for the code to be stepped through in a debugger more straightforwardly, and if the conditions/values are needed later in the code, you don't end up repeating expressions and doing the work twice. This is subjective, though, and obviously meaningless if you only use the expressions in one place and your debugger shows you intermediate evaluated expressions anyway.
Apparently, your textbook was written by someone who hates Lisp.
Anyway, it's a matter of taste; there is no single truth for everyone.
I think that parentheses are not the best way to improve the readability of your code. You can use a new line to emphasize, for example, the conditions in an if statement. I don't use parentheses if they are not required.
Well, consider something like this:
Result = (x * y + p * q - 1) % t and
Result = (((x * y) + (p * q)) - 1) % t
Personally I prefer the former (but that's just me), because the latter makes me think the parentheses are there to change the actual order of operations, when in fact they aren't doing that. Your textbook might also be referring to when you can split your calculations into multiple variables. For example, you'll probably have something like this when solving a quadratic ax^2+bx+c=0:
x1 = (-b + sqrt(b*b - 4*a*c)) / (2*a)
Which does look kind of ugly. This looks better in my opinion:
SqrtDelta = sqrt(b*b - 4*a*c);
x1 = (-b + SqrtDelta) / (2*a);
And this is just one simple example; when you work with algorithms that involve a lot of computations, things can get really ugly, so splitting the computations up into multiple parts will help readability more than parentheses will.
Parentheses reduce readability when they are obviously redundant. The reader expects them to be there for a reason, but there is no reason. Hence, a cognitive hiccough.
What do I mean by "obviously" redundant?
Parentheses are redundant when they can be removed without changing the meaning of the program.
Parentheses that are used to disambiguate infix operators are not "obviously redundant", even when they are redundant, except perhaps in the very special case of the multiplication and addition operators. Reason: many languages have between 10 and 15 levels of precedence, many people work in multiple languages, and nobody can be expected to remember all the rules. It is often better to disambiguate, even if parentheses are redundant.
All other redundant parentheses are obviously redundant.
Redundant parentheses are often found in code written by someone who is learning a new language; perhaps uncertainty about the new syntax leads to defensive parenthesizing.
Expunge them!
You asked for examples. Here are three examples I see repeatedly in ML code and Haskell code written by beginners:
Parentheses around the condition in if (...) then are always redundant and distracting. They make the author look like a C programmer. Just write if ... then.
Parentheses around a variable are silly, as in print(x). Parentheses are never necessary around a variable; the function application should be written print x.
Parentheses around a function application are redundant if that application is an operand in an infix expression. For example,
(length xs) + 1
should always be written
length xs + 1
Anything taken to an extreme and/or overused can make code unreadable. It wouldn't be too hard to make the same claim about comments. Anyone who has looked at code with a comment on virtually every line will tell you it was difficult to read. Or you could put whitespace around every line of code, which would make each line easy to read, but normally most people want similar, related lines (that don't warrant a breakout method) to be grouped together.
You have to go way over the top with them to really damage readability, but as a matter of personal taste, I have always found
return (x + 1);
and similar in C and C++ code to be very irritating.
If a method doesn't take parameters, why require an empty () to call method()? I believe in Groovy you don't need to do this.

In Functional Programming, is it considered a bad practice to have incomplete pattern matchings

Is it generally considered a bad practice to use non-exhaustive pattern matchings in functional languages like Haskell or F#, meaning that the specified cases don't cover all possible input cases?
In particular, should I allow code to fail with a MatchFailureException etc. or should I always cover all cases and explicitly throw an error if necessary?
Example:
let head (x::xs) = x
Or
let head list =
    match list with
    | x::xs -> x
    | _ -> failwith "Applying head to an empty list"
F# (unlike Haskell) gives a warning for the first code, since the []-case is not covered, but can I ignore it without breaking functional style conventions for the sake of succinctness? A MatchFailure does state the problem quite well after all ...
If you complete your pattern matchings with the constructor [] rather than the catch-all _, the compiler will have a chance to warn you to look at the function again the day someone adds a third constructor to lists.
My colleagues and I, working on a large OCaml project (200,000+ lines), force ourselves to avoid partial pattern-matching warnings (even if that means writing | ... -> assert false from time to time) and to avoid so-called "fragile pattern-matchings" (pattern matchings written in such a way that the addition of a constructor may not be detected) too. We consider that the maintainability benefits.
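To illustrate the difference in Haskell (the Colour type here is made up; GHC needs -Wall or -Wincomplete-patterns to report incomplete matches):

data Colour = Red | Green          -- suppose Blue is added later

describe :: Colour -> String
describe Red   = "warm"            -- every constructor written out:
describe Green = "cool"            -- adding Blue produces a warning here

describeFragile :: Colour -> String
describeFragile Red = "warm"
describeFragile _   = "cool"       -- fragile: adding Blue still compiles
                                   -- silently, and Blue is reported as "cool"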
Explicit is better than implicit (borrowed from the Zen of Python ;))
It's exactly the same as in a C switch over an enum... It's better to write all the cases (with a fall through) rather than just putting a default, because the compiler will tell you if you add new elements to the enumeration and you forgot to handle them.
I think that it depends quite a bit on the context. Are you trying to write robust, easy to debug code, or are you trying to write something simple and succinct?
If I were working on a long term project with multiple developers, I'd put in the assert to give a more useful error message. I also agree with Pascal's comment that not using a wildcard would be ideal from a software engineering perspective.
If I were working on a smaller scale project on which I was the only developer, I wouldn't think twice about using an incomplete match. If necessary, you can always check the compiler warnings.
I think it also depends a bit on the types you're matching against. Realistically, no extra union cases will be added to the list type, so you don't need to worry about fragile matching. On the other hand, in code that you control and are actively working on, there may well be types which are in flux and have additional union cases added, which means that protecting against fragile matching may be worth it.
This is a special case of a more general question, which is "should you ever create partial functions". Incomplete pattern matches are only one example of partial functions.
As a rule, total functions are preferable. When you find yourself looking at a function that just has to be partial, ask yourself if you can solve the problem in the type system first. Sometimes that is more trouble than it's worth (e.g. creating a whole type of lists with known lengths just to avoid the "head []" problem). So it's a trade-off.
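A lighter-weight version of solving it in the type system is to make the possibility of failure part of the result type, so the function stays total (a small sketch; the safeHead name is only a common convention, not from the question):

safeHead :: [a] -> Maybe a
safeHead []    = Nothing
safeHead (x:_) = Just x

-- ghci> safeHead ([] :: [Int])
-- Nothing
-- ghci> safeHead [1, 2, 3]
-- Just 1

Every caller is then forced to decide what to do with Nothing, instead of discovering the empty-list case at runtime.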
Or maybe you are just asking whether it's good practice in partial functions to say things like
head [] = error "head: empty list"
In which case the answer is YES!
The Haskell prelude (standard functions) contains many partial functions, e.g. head and tail only work on non-empty lists, but don't ask me why.
This question has two aspects.
For the user of the API, failwith... simply throws a System.Exception, which is unspecific (and therefore is sometimes considered a bad practice in itself). On the other hand, the implicitly thrown MatchFailureException can be specifically caught using a type test pattern, and therefore is preferable.
For the reviewer of the implementation code, failwith... clearly documents that the implementer has at least given some thought to the possible cases, and therefore is preferable.
As the two aspects contradict each other, the right answer depends on the circumstances (see also kvb's answer). A solution which is 100% "correct" from any point of view would have to
deal with every case explicitly,
throw a specific exception where necessary, and
clearly document the exception
Example:
/// <summary>Gets the first element of the list.</summary>
/// <exception cref="ArgumentException">The list is empty.</exception>
let head list =
    match list with
    | [] -> invalidArg "list" "The list is empty."
    | x::xs -> x
