ambiguity in parsing comma as an operator using PLY python - python-3.x

I have the following tokens (and many more, but I want to keep my question short, so I'm not including the whole code).
tokens = (
    'COMMA',
    'OP',
    'FUNC1',
    'FUNC2'
)

def t_OP(t):
    r'&|-|\||,'
    return t

def t_FUNC1(t):
    r'FUNC1'
    return t

def t_FUNC2(t):
    r'FUNC2'
    return t
Other methods:
def FUNC1(param):
    return {'a', 'b', 'c', 'd'}

def FUNC2(param, expression_result):
    return {'a', 'b', 'c', 'd'}
My grammar rules in yacc (there are a few more, but I've listed the important ones):
'expression : expression OP expression'
'expression : LPAREN expression RPAREN'
'expression : FUNC1 LPAREN PARAM RPAREN'
'expression : FUNC2 LPAREN PARAM COMMA expression RPAREN'
'expression : SET_ITEM'
In my yacc.py, below are the methods which are related to the issue:
def p_expr_op_expr(p):
    'expression : expression OP expression'
    if p[2] == '|' or p[2] == ',':
        p[0] = p[1] | p[3]
    elif p[2] == '&':
        p[0] = p[1] & p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]

def p_expr_func1(p):
    'expression : FUNC1 LPAREN PARAM RPAREN'
    Param = p[3]
    Result = ANY(Param)
    p[0] = Result

def p_expr_func2(p):
    'expression : FUNC2 LPAREN PARAM COMMA expression RPAREN'
    Param = p[3]
    expression_result = p[5]
    Result = EXPAND(Param, expression_result)
    p[0] = Result

def p_expr_set_item(p):
    'expression : SET_ITEM'
    p[0] = {p[1]}
So, the issue is:
If I give the below input expression to this grammar:
FUNC1("foo"),bar
-- it works properly and gives me the result as the union of the set returned by FUNC1("foo") and bar => {a,b,c,d} | {bar}
But if I give the below input expression, it gives a syntax error at , and ):
(I do have my parentheses defined as tokens, for those who think the brackets may not be defined.)
FUNC2("foo", FUNC1("foo"),bar)
As I understand it, this expression matches the production rule 'expression : FUNC2 LPAREN PARAM COMMA expression RPAREN',
so everything after the first comma should be treated as an expression, match 'expression : expression OP expression', and do the union when the comma is encountered as an operator.
If that's the case, then it should not work for FUNC1("foo"),bar either.
I know I can fix this issue by removing ',' from t_OP(t) and adding one more production rule, 'expression : expression COMMA expression', whose method would look like this:
def p_expr_comma_expr(p):
    'expression : expression COMMA expression'
    p[0] = p[1] | p[3]
I'm reluctant to include this rule because it introduces 4 shift/reduce conflicts.
I really want to understand why it works in one case and not in the other, and what the right way is to treat ',' as an operator.
Thanks

Ply has no way to know whether you want a given , to be the lexeme COMMA or the lexeme OP. Or, rather, it has a way, but it will always choose the same one: OP. That's because patterns in token functions are tested before patterns defined as strings.
I'm assuming you have t_COMMA = r',' somewhere in the part of your program you did not provide. It is also possible that you have a token function to recognise COMMA, in which case whichever function comes first will win. But however you do it, the order the regexes are tested is fixed, so either , is always COMMA or it is always OP. This is well explained in the Ply manual section on Specification of Tokens.
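To make that ordering concrete, here is a minimal, self-contained sketch (not the original lexer; the boilerplate such as t_error is assumed) showing that a function rule like t_OP always wins over a string rule like t_COMMA for the same character:
import ply.lex as lex

tokens = ('COMMA', 'OP')

def t_OP(t):
    r'&|-|\||,'              # function rules are tried first, in definition order
    return t

t_COMMA = r','               # string rules are only tried after all function rules
t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input(',')
print(lexer.token())         # LexToken(OP,',',1,0) -- always OP, never COMMA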
Personally, I'd suggest removing the comma from OP and modifying the grammar to use COMMA in the definition of expression. As you observed, you will get shift-reduce conflicts, so you must include COMMA in your precedence declaration (which you have also chosen to omit from your question). In fact, it seems likely that you would want different precedences for different operators, so you will probably want to separate the different operators into different tokens, since precedence is assigned per token. See the explanation in the Ply manual section on precedence declarations.
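As a rough illustration of that suggestion (a sketch only; the token names, the precedence levels, and the SET_ITEM pattern below are guesses, not taken from the question):
import ply.lex as lex
import ply.yacc as yacc

tokens = ('COMMA', 'PIPE', 'AMP', 'MINUS', 'SET_ITEM')

t_COMMA = r','
t_PIPE = r'\|'
t_AMP = r'&'
t_MINUS = r'-'
t_SET_ITEM = r'[A-Za-z_][A-Za-z0-9_]*'
t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

# One precedence level per token, so each operator can bind differently.
precedence = (
    ('left', 'COMMA'),              # binds loosest
    ('left', 'PIPE', 'MINUS'),
    ('left', 'AMP'),                # binds tightest
)

def p_expression_binop(p):
    '''expression : expression COMMA expression
                  | expression PIPE expression
                  | expression AMP expression
                  | expression MINUS expression'''
    if p[2] in (',', '|'):
        p[0] = p[1] | p[3]
    elif p[2] == '&':
        p[0] = p[1] & p[3]
    else:
        p[0] = p[1] - p[3]

def p_expression_item(p):
    'expression : SET_ITEM'
    p[0] = {p[1]}

def p_error(p):
    print("Syntax error at", p)

lexer = lex.lex()
parser = yacc.yacc()

print(parser.parse('a , b & c', lexer=lexer))   # {'a'} | ({'b'} & {'c'}) == {'a'}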

Adding one more rule like the one below solved my problem:
expression : expression COMMA expression
I added it because, as @rici said, in an expression like FUNC2("hello",FUNC1("ghost")) the first comma is always taken as an operator.
Adding the precedence declaration removed the 4 shift/reduce conflicts:
precedence = (
    ('left', 'COMMA'),
    ('left', 'OP')
)

Related

python Using variable in re.search source.error("bad escape %s" % escape, len(escape)) [duplicate]

I want to use input from a user as a regex pattern for a search over some text. It works, but how can I handle cases where the user enters characters that have special meaning in a regex?
For example, the user wants to search for Word (s): the regex engine will treat the (s) as a group. I want it to be treated as the literal string "(s)". I could run a replace on the user input and replace the ( with \( and the ) with \), but the problem is that I would need to do a replace for every possible regex symbol.
Do you know a better way?
Use the re.escape() function for this:
4.2.3 re Module Contents
escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
A simplistic example: search for any occurrence of the provided string optionally followed by 's', and return the match object.
def simplistic_plural(word, text):
    word_or_plural = re.escape(word) + 's?'
    return re.match(word_or_plural, text)
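For example, the escaped parentheses then match literally (the sample strings are made up for illustration):
# Usage (assuming `import re` and the simplistic_plural definition above):
print(simplistic_plural('Word (s)', 'Word (s) appears here'))   # <re.Match object ...>
print(simplistic_plural('Word (s)', 'something else'))          # None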
You can use re.escape():
re.escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
>>> import re
>>> re.escape('^a.*$')
'\\^a\\.\\*\\$'
If you are using a Python version < 3.7, this will escape non-alphanumerics that are not part of regular expression syntax as well.
If you are using a Python version < 3.7 but >= 3.3, this will escape non-alphanumerics that are not part of regular expression syntax, except for specifically underscore (_).
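Regardless of version, the escaped string is then safe to embed in a larger pattern; a small illustration (the sample text is made up):
import re

user_input = '^a.*$ (group)'
pattern = re.escape(user_input)                       # metacharacters are backslashed
text = 'look for ^a.*$ (group) literally in here'
print(bool(re.search(pattern, text)))                 # True: matched as plain text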
Unfortunately, re.escape() is not suited for the replacement string:
>>> re.sub('a', re.escape('_'), 'aa')
'\\_\\_'
A solution is to put the replacement in a lambda:
>>> re.sub('a', lambda _: '_', 'aa')
'__'
because the return value of the lambda is treated by re.sub() as a literal string.
Usually you escape the string that you feed into a regex so that the regex treats those characters literally. Remember that you usually type strings into your computer and the computer inserts the specific characters: when you see \n in your editor, it's not really a newline until a parser decides it is - it's two characters. Once you pass it through Python's print, it gets parsed and displayed as a new line, but in the text you see in the editor it's just the character for backslash followed by n. If you write r"\n", then Python will always interpret it as the raw thing you typed in (as far as I understand). To complicate things further, there is another syntax/grammar going on with regexes. The regex parser will interpret the string it receives differently than Python's print would. I believe this is why we are recommended to pass raw strings like r"(\n+)" - so that the regex receives what you actually typed. However, the regex will receive a parenthesis and won't match it as a literal parenthesis unless you tell it to explicitly using the regex's own syntax rules. For that you need something like r"(fun \( x : nat \) :)": here the first parens won't be matched literally, since they form a capture group due to the lack of backslashes, but the escaped ones will be matched as literal parens.
Thus we usually apply re.escape() to the parts we want to be interpreted literally, i.e. things that would otherwise be treated specially by the regex parser: parens, spaces, etc. will be escaped. For example, code I have in my app:
# Escapes non-alphanumerics to help match an arbitrary literal string; I think the reason
# this is here is to distinguish the escaped literal text from the regex syntax we insert
# around it on the next line.
__ppt = re.escape(_ppt)  # so that e.g. parentheses ( are interpreted literally, not as a group
e.g. see these strings:
_ppt
Out[4]: '(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
__ppt
Out[5]: '\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
print(rf'{_ppt=}')
_ppt='(let H : forall x : bool, negb (negb x) = x := fun x : bool =>HEREinHERE)'
print(rf'{__ppt=}')
__ppt='\\(let\\ H\\ :\\ forall\\ x\\ :\\ bool,\\ negb\\ \\(negb\\ x\\)\\ =\\ x\\ :=\\ fun\\ x\\ :\\ bool\\ =>HEREinHERE\\)'
The double backslashes, I believe, are there so that the regex receives a literal backslash.
Btw, I am surprised it printed double backslashes instead of a single one. If anyone can comment on that, it would be appreciated. I'm also curious how to match literal backslashes now in the regex. I assume it's 4 backslashes, but I honestly expected only 2 would be needed due to the raw string r construct.
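On those two points, a short sketch (the sample strings are made up): the doubled backslashes are just how repr() displays a single backslash character, and matching one literal backslash takes an escaped backslash in the pattern, written r'\\' in a raw string or '\\\\' otherwise:
import re

s = re.escape('(x)')
print(s)                # \(x\)        -- each backslash is one character
print(repr(s))          # '\\(x\\)'    -- repr doubles backslashes for display
print(len(s))           # 5

# Matching a single literal backslash in 'a\b':
print(re.search(r'\\', 'a\\b'))     # raw string: the pattern is two chars, backslash backslash
print(re.search('\\\\', 'a\\b'))    # same pattern written without the raw-string prefix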

AttributeError: 'str' object has no attribute 'title()'

first="harry"
last="potter"
print(first, first.title())
print(f"Full name: {first.title()} {last.title()}")
print("Full name: {0.title()} {1.title()}".format(first, last))
The first two statements work fine, which means the 'str' object does have the attribute title().
The third print statement gives an error. Why is that?
The str.format() syntax is different from f-string syntax. In particular, while f-strings essentially let you put any expression between the brackets, str.format() is considerably more limited. Per the documentation:
The grammar for a replacement field is as follows:
replacement_field ::= "{" [field_name] ["!" conversion] [":" format_spec] "}"
field_name ::= arg_name ("." attribute_name | "[" element_index "]")*
arg_name ::= [identifier | digit+]
attribute_name ::= identifier
element_index ::= digit+ | index_string
index_string ::= <any source character except "]"> +
conversion ::= "r" | "s" | "a"
format_spec ::= <described in the next section>
You'll note that, while attribute names (via the dot operator .) and indices (via square-brackets []) - in other words, values - are valid, actual method calls (or any other expressions) are not. I hypothesize this is because str.format() does not actually execute the text, but just swaps in an object that already exists.
Actual f-strings (your second example) share a similar syntax to the str.format() method, in that they use curly-brackets {} to indicate the areas to replace, but according to the PEP that introduced them,
F-strings provide a way to embed expressions inside string literals, using a minimal syntax. It should be noted that an f-string is really an expression evaluated at run time, not a constant value.
This is clearly different (more complex) than str.format(), which is more of a simple text replacement - an f-string is an expression and is executed as such, and allows full expressions inside its brackets (in fact, you can even nest f-strings inside each other, which is fun).
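To make the contrast concrete (reusing the first variable from the question; the nested example is just a toy):
first = "harry"

# str.format() only allows attribute access / indexing inside the braces...
print("{0.title}".format(first))        # prints the bound method, not its result
# print("{0.title()}".format(first))    # AttributeError: 'str' object has no attribute 'title()'

# ...while an f-string evaluates arbitrary expressions, including method calls.
print(f"{first.title()}")               # Harry
print(f"{f'{first.title()}'!r}")        # f-strings can even be nested: prints 'Harry'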
str.format() passes the string object into the respective placeholder, and by using '.' you can access the string's attributes. That is why {0.title()} searches for an attribute literally named title() on the string class and finds nothing.
But if you use
print("Full name: {0.title} {1.title}".format(first, last))
>> Full name: <built-in method title of str object at 0x7f5e42d09630><built-in method title of str object at 0x7f5e42d096b0>
Here you can see that you can access the built-in method of the string, but not call it.
If you want to use title() with format(), then use it like this:
print("Full name: {0} {1}".format(first.title(), last.title()))
>> Full name: Harry Potter

ANTLR4 lexer rule ensuring expression does not end with character

I have a syntax that I need to match, given the following example:
some-Text->more-Text
From this example, I need ANTLR4 lexer rules that would match 'some-Text' and 'more-Text' into one lexer rule, and the '->' as another rule.
I am using the lexer rules shown below as my starting point, but the trouble is, the '-' character is allowed in the NAMEDELEMENT rule, which causes the first NAMEDELEMENT match to become 'some-Text-', which then causes the '->' to not be captured by the EDGE rule.
I'm looking for a way to ensure that the '-' is not captured as the last character in the NAMEDELEMENT rule (or some other alternative that produces the desired result).
EDGE
: '->'
;
NAMEDELEMENT
: ('a'..'z'|'A'..'Z'|'_'|'#') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')* { _input.LA(1) != '-' && _input.LA(2) != '>' }?
;
I'm trying to use the predicate above to look ahead for a sequence of '-' and '>', but it doesn't seem to work. It doesn't seem to do anything at all, actually, as I get the same parsing results both with and without the predicate.
The parser rules are as follows, where I am matching on 'selector' rules:
selector
: namedelement (edge namedelement)*
;
edge
: EDGE
;
namedelement
: NAMEDELEMENT
;
Thanks in advance!
After messing around with this for hours, I have a syntax that works, though I fail to see how it is functionally any different than what I posted in the original question.
(I use the uncommented version so that I can put a break point in the generated lexer to ensure that the equality test is evaluating correctly.)
NAMEDELEMENT
//: [a-zA-Z_#] [a-zA-Z_-]* { String.fromCharCode(this._input.LA(1)) != ">" }?
: [a-zA-Z_#] [a-zA-Z_-]* { (function(a){
var c = String.fromCharCode(a._input.LA(1));
return c != ">";
})(this)
}?
;
My target language is JavaScript and both the commented and uncommented forms of the predicate work fine.
Try this:
NAMEDELEMENT
: [a-zA-Z_#] ( '-' {_input.LA(1) != '>'}? | [a-zA-Z0-9_] )*
;
Not sure if _input.LA(1) != '>' is OK with the JavaScript runtime, but in Java it properly tokenises "some-->more" into "some-", "->" and "more".

Pushing back tokens in Happy and Alex

I'm parsing a language which has both < and <<. My Alex definition contains something like
tokens :-
"<" { token Lt }
"<<" { token (BinOp Shl) }
so whenever I encounter <<, it gets tokenized as a left shift and not as two less-thans. This is generally a good thing, since I end up throwing out whitespace after tokenization and want to differentiate between 1 < < 2 and 1 << 2. However, there are other times I wish << had been read as two <. For example, I have things like
<<A>::B>
which I want read like
< < A > :: B >
Obviously I can try to adjust my Happy parser rules to accommodate the extra cases, but that scales badly. In other imperative parser generators, I might try to do something like push back "part" of the token (something like push_back("<") when I encountered << but only needed <).
Has anyone else had such a problem and, if so, how did you deal with it? Are there ways of "pushing back" tokens in Happy? Should I instead try to keep a whitespace token around? (I'm actually leaning towards the last alternative - although it would be a huge headache, it would let me deal with << by just making sure there is no whitespace between the two <.)
I don’t know how to express this in Happy, but you don’t need a separate “whitespace” token. You can parse < or > as a distinct “angle bracket” token when immediately followed by an operator symbol in the input, with no intervening whitespace.
Then, when you want to parse an operator, you join a sequence of angles and operators into a single token. When you want to treat them as brackets, you just deal with them separately as usual.
So a << b would be tokenised as:
identifier "a"
left angle -- joined with following operator
operator "<"
identifier "b"
When parsing an operator, you concatenate angle tokens with the following operator token, producing a single operator "<<" token.
<<A>::B> would be tokenised as:
left angle
operator "<" -- accepted as bracket
identifier "A"
right angle
operator "::"
identifier "B"
operator ">" -- accepted as bracket
When parsing angle-bracketed terms, you accept both angle tokens and </> operators.
This relies on your grammar not being ambiguous wrt. whether you should parse an operator name or a bracketed thing.
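The joining idea is language-agnostic; a rough Python sketch of the lexing step (purely illustrative, with single-character identifiers and a made-up operator set, not Jon's actual code) could look like this:
# Emit '<' or '>' as an ANGLE token when it is immediately followed by another
# operator character (no whitespace); otherwise lex a maximal run of operator
# characters as one OPERATOR token. Identifiers are single characters for brevity.
OP_CHARS = set('<>:=+-*/&|')

def lex_angles(src):
    tokens, i = [], 0
    while i < len(src):
        c = src[i]
        if c in '<>' and i + 1 < len(src) and src[i + 1] in OP_CHARS:
            tokens.append(('ANGLE', c))      # to be glued onto the following operator
        elif c in OP_CHARS:
            j = i
            while j < len(src) and src[j] in OP_CHARS:
                j += 1
            tokens.append(('OPERATOR', src[i:j]))
            i = j
            continue
        elif not c.isspace():
            tokens.append(('IDENT', c))
        i += 1
    return tokens

print(lex_angles('a << b'))
# [('IDENT', 'a'), ('ANGLE', '<'), ('OPERATOR', '<'), ('IDENT', 'b')]
print(lex_angles('<<A>::B>'))
# [('ANGLE', '<'), ('OPERATOR', '<'), ('IDENT', 'A'), ('ANGLE', '>'),
#  ('OPERATOR', '::'), ('IDENT', 'B'), ('OPERATOR', '>')]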
While I initially went with @Jon's answer, I ended up running into a variety of precedence-related issues (think precedence around expr < expr vs expr << expr) which caused me a lot of headaches. I recently (successfully) went back to lexing << as one token. The solution was twofold:
I bit the bullet and added extra rules for << (where previously I only had rules for <). For the example in the question (<<A>::B>) my rule went from something like
ty_qual_path
: '<' ty_sum '>' '::' ident
to
ty_qual_path
: '<' ty_sum '>' '::' ident
| '<<' ty_sum '>' '::' ident '>' '::' ident
(The actual rule was actually a bit more involved, but that is not for this answer).
I found a clever way to deal with tokens that start with > (these would cause problems around things like vector<i32,vector<i32>> where the last >> was a token): use a threaded lexer (section 2.5.2), exploit the {%% ... } RHS of rules which lets you reconsider the lookahead token, and add a pushToken facility to my parser monad (this turned out to be quite simple - here is exactly what I did). I then added a dummy rule - something like
gt :: { () }
  : {- empty -} {%% \tok ->
      case tok of
        Tok ">>"  -> pushToken (Tok ">")  *> pushToken (Tok ">")
        Tok ">="  -> pushToken (Tok "=")  *> pushToken (Tok ">")
        Tok ">>=" -> pushToken (Tok ">=") *> pushToken (Tok ">")
        _         -> pushToken tok
    }
And every time some other rule expected a > but there could also be any other token starting with >, I would precede the > token with gt. This has the effect of looking ahead to the next token, which could start with > without being >, and trying to convert that token into one > token and another token for the "rest" of the initial token.

parsing a file with specific format in ply (python)

I have a problem with PLY. I have to receive a file with a token list and a grammar (BNF). I wrote a grammar to recognize the input, and it is almost working (just minor issues, which we are solving). For example, this is a valid input file:
#tokens = NUM PLUS TIMES
exp : exp PLUS exp | exp TIMES exp
exp : NUM
(We don't care, in this case, about the grammar being ambiguous or whatever; this is just an example input.)
Parsing every line separately works fine, but I want to parse the whole file with these rules:
#tokens must appear only on the first line, so a #tokens declaration after the grammar is not valid
you can have 0 or more blank lines after every line of "code"
you can have as many grammar rules as you want
I tried using a loop to scan and parse every line separately, but I can't enforce the first (and really important) rule, so I tried this in my .py file:
I defined t_NLINEA (newline). I also had a problem using the \n character as a literal, and the file is opened in rU mode to avoid conflicts between \r\n and \n characters, so I added these rules:
def p_S(p):
    '''S : T N U'''
    print("OK")

def p_N(p):
    '''N : NLINEA N'''
    pass

def p_N2(p):
    '''N : '''
    pass

def p_U(p):
    '''U : R N U'''
    pass

def p_U2(p):
    '''U : '''
    pass
(As I said above, I had to use the N rule because PLY didn't accept the \n literal in my grammar; I added \n to the "literals" variable.)
T is the rule that parses the #tokens declaration and R is used to parse grammar rules. T and R work OK if I use them on a single-line string, but when I add the productions I wrote above, I get a syntax error when parsing the first grammar rule; for example, with A : B C I get a syntax error at the :.
Any suggestions?
Thanks
Ply tries to figure out a "starting rule" based on your rules. With what you have written, it will make "exp" the start rule, which says there is only one expression per string or file. If you want multiple expressions, you probably want a list of expressions:
def p_exp_list(p):
    """exp_list :
                | exp_list exp
    """
    if len(p) == 1:
        p[0] = []
    else:
        p[0] = p[1] + [p[2]]
Then your starting rule will be "exp_list". This would allow multiple expressions on each line. If you want to limit to one expression per line, then how about:
def p_line_list(p):
    """line_list :
                 | line_list line
    """
    if len(p) == 1:
        p[0] = []
    else:
        p[0] = p[1] + [p[2]]

def p_line(p):
    """line : exp NL"""
    p[0] = p[1]
I don't think you can use newline as a literal (because it might mess up the regular expressions). You probably need a more specific token rule:
t_NL = r'\r*\n'
Pretty sure this would work, but haven't tried it as there isn't enough to go on.
As for the "#token" line, you could just skip it if it doesn't appear anywhere else:
def t_COMMENT(t):
    r'\#.*'          # '#' is escaped because PLY compiles token patterns with re.VERBOSE
    pass  # ignore this token
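Putting the pieces together, here is a minimal sketch of what the file-level grammar could look like (rule and token names such as grammar_file, TOKENS_DECL and NL are illustrative guesses, not the original code):
import ply.lex as lex
import ply.yacc as yacc

# --- lexer sketch ---
tokens = ('TOKENS_DECL', 'IDENT', 'COLON', 'PIPE', 'NL')

t_COLON = r':'
t_PIPE = r'\|'
t_IDENT = r'[A-Za-z_][A-Za-z0-9_]*'
t_ignore = ' \t'

def t_TOKENS_DECL(t):
    r'\#tokens[^\n]*'      # consume the whole '#tokens ...' line as one token
    return t

def t_NL(t):
    r'\r*\n'
    return t

def t_error(t):
    t.lexer.skip(1)

# --- parser sketch: '#tokens' line first, then grammar rules, blank lines allowed ---
def p_grammar_file(p):
    'grammar_file : TOKENS_DECL newlines rule_list'
    p[0] = p[3]

def p_rule_list(p):
    '''rule_list :
                 | rule_list rule newlines'''
    p[0] = [] if len(p) == 1 else p[1] + [p[2]]

def p_rule(p):
    'rule : IDENT COLON alternatives'
    p[0] = (p[1], p[3])

def p_alternatives(p):
    '''alternatives : symbols
                    | alternatives PIPE symbols'''
    p[0] = [p[1]] if len(p) == 2 else p[1] + [p[3]]

def p_symbols(p):
    '''symbols :
               | symbols IDENT'''
    p[0] = [] if len(p) == 1 else p[1] + [p[2]]

def p_newlines(p):
    '''newlines : NL
                | newlines NL'''
    pass

def p_error(p):
    print("Syntax error at", p)

lexer = lex.lex()
parser = yacc.yacc(start='grammar_file')

text = "#tokens = NUM PLUS TIMES\nexp : exp PLUS exp | exp TIMES exp\nexp : NUM\n"
print(parser.parse(text, lexer=lexer))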
