parsing a file with specific format in ply (python) - string

I have a problem with PLY. I receive a file containing a token list and a grammar (BNF). I wrote a grammar to recognize the input, and it is almost working (only minor issues remain, which we are solving). For example, this is a valid input file:
#tokens = NUM PLUS TIMES
exp : exp PLUS exp | exp TIMES exp
exp : NUM
(In this case we don't care whether the grammar is ambiguous; this is just an example input.)
Parsing every line separately works fine, but I want to parse the whole file with these rules:
#tokens must appear only on the first line, so a #tokens declaration after the grammar is not valid
you can have 0 or more blank lines after every line of "code"
you can have as many grammar rules as you want
I tried using a loop to scan and parse every line separately, but then I can't enforce the first (and really important) rule, so I tried this in my .py file:
I defined t_NLINEA (newline). I also had problems using the \n character as a literal, and the file is opened in rU mode to avoid conflicts between \r\n and \n, so I added these rules:
def p_S(p):
    '''S : T N U'''
    print("OK")
def p_N(p):
    '''N : NLINEA N'''
    pass
def p_N2(p):
    '''N : '''
    pass
def p_U(p):
    '''U : R N U'''
    pass
def p_U2(p):
    '''U : '''
    pass
(As I said above, I had to use the N rule because PLY didn't accept the \n literal in my grammar; I had added \n to the "literals" variable.)
T is the rule that parses the #tokens declaration and R parses grammar rules. T and R work fine on a single-line string, but when I add the productions above I get a syntax error when parsing the first grammar rule. For example, for A : B C I get a syntax error at the colon (:).
Any suggestions?
Thanks

Ply tries to figure out a "starting rule" based on your rules. With what you have written, it will make "exp" the start rule, which says there is only one expression per string or file. If you want multiple expressions, you probably want a list of expressions:
def p_exp_list(p):
    """exp_list :
                | exp_list exp
    """
    if len(p) == 1:
        p[0] = []
    else:
        p[0] = p[1] + [p[2]]
Then your starting rule will be "exp_list". This would allow multiple expressions on each line. If you want to limit to one expression per line, then how about:
def p_line_list(p):
    """line_list :
                 | line_list line
    """
    if len(p) == 1:
        p[0] = []
    else:
        p[0] = p[1] + [p[2]]
def p_line(p):
    """line : exp NL"""
    p[0] = p[1]
I don't think you can use a newline as a literal (because it might mess up regular expressions). You probably need a more specific token rule:
t_NL = r'\r?\n'
Pretty sure this would work, but haven't tried it as there isn't enough to go on.
As for the "#tokens" line, you could just skip it if it doesn't appear anywhere else:
# PLY compiles patterns with re.VERBOSE, so the '#' must be escaped.
def t_COMMENT(t):
    r'\#.*'
    pass  # ignore this token
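One detail worth checking here: PLY compiles token patterns with re.VERBOSE by default, and in verbose mode an unescaped # starts a regex comment. The sketch below uses only the re module (no PLY) to show the difference. The sample line is the one from the question, and the variable names are just for illustration.

```python
import re

line = "#tokens = NUM PLUS TIMES"

# In verbose mode, everything after an unescaped '#' is a comment,
# so this pattern is effectively empty and matches the empty string.
unescaped = re.compile(r'#.*$', re.VERBOSE)

# Escaping the '#' makes it a literal character again.
escaped = re.compile(r'\#.*', re.VERBOSE)

print(repr(unescaped.match(line).group()))  # ''
print(repr(escaped.match(line).group()))    # '#tokens = NUM PLUS TIMES'
```

Writing the character as [#] should work as well, since # inside a character class is not treated as a comment marker.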

Related

How do i find/count number of variable in string using Python

Here is example of string
Hi {{1}},
The status of your leave application has changed,
Leaves: {{2}}
Status: {{3}}
See you soon back at office by Management.
Expected Result:
Variables Count = 3
I tried Python's count() with if/else, but I'm looking for a more sustainable solution.
You can use regular expressions:
import re
PATTERN = re.compile(r'\{\{\d+\}\}', re.DOTALL)
def count_vars(text: str) -> int:
    return sum(1 for _ in PATTERN.finditer(text))
PATTERN defines the regular expression. It matches one or more digits (\d+) enclosed in double curly brackets (\{\{ and \}\}). Curly brackets are special characters in regular expressions, so we must escape them with \. The re.DOTALL flag only changes how . behaves; since this pattern contains no ., the flag has no effect and can be omitted. The finditer method iterates over all matches in the text, across lines, and we simply count them.
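As a quick check, here is the function applied to the sample message from the question. The distinct_vars helper is an addition for illustration (not part of the original answer); it counts each placeholder number once, which may matter if a template repeats a variable.

```python
import re

# One or more digits inside double curly brackets; the digits are captured.
PATTERN = re.compile(r'\{\{(\d+)\}\}')

def count_vars(text: str) -> int:
    """Count every {{n}} placeholder, including repeats."""
    return sum(1 for _ in PATTERN.finditer(text))

def distinct_vars(text: str) -> int:
    """Hypothetical helper: count distinct placeholder numbers."""
    return len({m.group(1) for m in PATTERN.finditer(text)})

message = """Hi {{1}},
The status of your leave application has changed,
Leaves: {{2}}
Status: {{3}}
See you soon back at office by Management."""

print("Variables Count =", count_vars(message))  # Variables Count = 3
```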

Python ord() and chr()

I have:
txt = input('What is your sentence? ')
list = [0]*128
for x in txt:
    list[ord(x)] += 1
for x in list:
    if x >= 1:
        print(chr(list.index(x)) * x)
As I understand it, this should output every character of the sentence, repeated as many times as it occurs, like:
))
111
3333
etc.
For the string "aB)a2a2a2)" the output is correct:
))
222
B
aaaa
For the string "aB)a2a2a2" the output is wrong:
)
222
)
aaaa
I feel like all my bases are covered but I'm not sure what's wrong with this code.
When you do list.index(x), you're searching the list for the first index where that value appears. That's not what you want, though: you want the specific index of the value you just read, even if the same value also occurs earlier in the list.
The best way to get indexes alongside values from a sequence is with enumerate:
for i, x in enumerate(list):
    if x >= 1:
        print(chr(i) * x)
That should get you the output you want, but there are several other things that would make your code easier to read and understand. First of all, using list as a variable name is a very bad idea, as that will shadow the builtin list type's name in your namespace. That makes it very confusing for anyone reading your code, and you even confuse yourself if you want to use the normal list for some purpose and don't remember you've already used it for a variable of your own.
The other issue is also about variable names, but it's a bit more subtle. Your two loops both use a loop variable named x, but the meaning of the value is different each time. The first loop is over the characters in the input string, while the latter loop is over the counts of each character. Using meaningful variables would make things a lot clearer.
Here's a combination of all my suggested fixes together:
text = input('What is your sentence? ')
counts = [0]*128
for character in text:
    counts[ord(character)] += 1
for index, count in enumerate(counts):
    if count >= 1:
        print(chr(index) * count)
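As an alternative that the answer above does not mention, collections.Counter from the standard library can do the counting, which also removes the 128-entry limit for non-ASCII input. This is only a sketch; the char_runs name is made up for the example.

```python
from collections import Counter

def char_runs(text):
    """Return one string per distinct character, repeated by its count,
    ordered by code point (the same order as the index-based loop)."""
    counts = Counter(text)
    return [ch * counts[ch] for ch in sorted(counts)]

# The input that produced the wrong output in the question.
print("\n".join(char_runs("aB)a2a2a2")))
```

With the input "aB)a2a2a2" this prints ), 222, B and aaaa on separate lines.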

ambiguity in parsing comma as a operator using PLY python

I have the following tokens and many more; to keep the question short, I am not including the whole code.
tokens = (
    'COMMA',
    'OP',
    'FUNC1',
    'FUNC2'
)
def t_OP(t):
    r'&|-|\||,'
    return t
def t_FUNC1(t):
    r'FUNC1'
    return t
def t_FUNC2(t):
    r'FUNC2'
    return t
Other methods:
def FUNC1(param):
    return {'a','b','c','d'}
def FUNC2(param,expression_result):
    return {'a','b','c','d'}
My YACC grammar rules are below (there are a few more, but these are the important ones):
'expression : expression OP expression'
'expression : LPAREN expression RPAREN'
'expression : FUNC1 LPAREN PARAM RPAREN'
'expression : FUNC2 LPAREN PARAM COMMA expression RPAREN'
'expression : SET_ITEM'
In my yacc.py, below are the methods which are related to the issue:
def p_expr_op_expr(p):
    'expression : expression OP expression'
    if p[2] == '|' or p[2]== ',':
        p[0] = p[1] | p[3]
    elif p[2] == '&':
        p[0] = p[1] & p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
def p_expr_func1(p):
    'expression : FUNC1 LPAREN PARAM RPAREN'
    Param = p[3]
    Result = ANY(Param)
    p[0] = Result
def p_expr_func2(p):
    'expression : FUNC2 LPAREN PARAM COMMA expression RPAREN'
    Param = p[3]
    expression_result = p[5]
    Result = EXPAND(Param,expression_result)
    p[0] = Result
def p_expr_set_item(p):
    'expression : SET_ITEM'
    p[0] = {p[1]}
So, the issue is:
If I give the input expression below to this grammar:
FUNC1("foo"),bar
it works properly and gives me the UNION of the set returned by FUNC1("foo") and bar => {a,b,c,d} | {bar}
But if I give the input expression below, it gives a syntax error at , and ):
(My parentheses are defined as tokens, for those who think the brackets may not be defined.)
FUNC2("foo", FUNC1("foo"),bar)
As I understand it, this expression matches the production rule 'expression : FUNC2 LPAREN PARAM COMMA expression RPAREN',
so everything after the first comma should be treated as an expression; it should match 'expression : expression OP expression' and do the union when the comma is encountered as an operator.
Otherwise, it should not work for FUNC1("foo"),bar either.
I know I can fix this issue by removing ',' from t_OP(t) and adding one more production rule, 'expression : expression COMMA expression', whose method would look like this:
def p_expr_comma_expr(p):
    'expression : expression COMMA expression'
    p[0] = p[1] | p[3]
I'm reluctant to include this rule because it introduces 4 shift/reduce conflicts.
I really want to understand why it works in one case but not the other, and what the right way is to treat ',' as an operator.
Thanks
Ply has no way to know whether you want a given , to be the lexeme COMMA, or the lexeme OP. Or, rather, it has a way, but it will always choose the same one: OP. That's because patterns in token functions are tested before tokens in pattern variables.
I'm assuming you have t_COMMA = r',' somewhere in the part of your program you did not provide. It is also possible that you have a token function to recognise COMMA, in which case whichever function comes first will win. But however you do it, the order the regexes are tested is fixed, so either , is always COMMA or it is always OP. This is well explained in the Ply manual section on Specification of Tokens.
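The effect of a fixed test order can be imitated with the re module alone. This is a simplified model, not PLY's actual code: the MASTER pattern and the NAME and WS groups are made up for the illustration, and the alternatives are simply tried left to right.

```python
import re

# Alternatives are tried in order, so ',' is claimed by OP
# and the COMMA alternative can never match.
MASTER = re.compile(r'(?P<OP>&|-|\||,)|(?P<COMMA>,)|(?P<NAME>\w+)|(?P<WS>\s+)')

def token_types(text):
    """Return the token type of each lexeme, skipping whitespace."""
    return [m.lastgroup for m in MASTER.finditer(text) if m.lastgroup != 'WS']

print(token_types('a , b'))  # ['NAME', 'OP', 'NAME']
```

No matter where the comma appears in the input, the model never produces COMMA, which is why a grammar rule that requires a COMMA token cannot be satisfied.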
Personally, I'd suggest removing the comma from OP and modifying the grammar to use COMMA in the definition of expression. As you observed, you will get shift-reduce conflicts, so you must include COMMA in your precedence declaration (which you have also chosen to omit from your question). In fact, it seems likely that you would want different precedences for different operators, so you will probably want to separate the operators into different tokens, since precedence is assigned per token. See the explanation in the Ply manual section on precedence declarations.
Adding one more rule like this solved my problem:
expression : expression COMMA expression
I added it because, as @rici said, in an expression like FUNC2("hello",FUNC1("ghost")) the first comma is always taken as an operator.
Adding a precedence declaration removed the 4 shift/reduce conflicts.
precedence = (
    ('left','COMMA'),
    ('left','OP')
)

Python 3: How do I change user input into integers inside of a loop?

I want my program to ask the user to input a 3D point, and it is supposed to keep prompting the user until the user inputs the point (0,0,0). The problem I am having with this loop is being caused by the statement "point = [int(y) for y in input().split()]". Whenever the loop reaches this statement, it quits. I have tried placing this statement in different places, but it does the same thing no matter where I put it. If I take the statement out, the loop works. I need to change the coordinates inputted by the user to integers, so I cannot leave the statement out. Is there something else I can do to change the coordinates to integers that won't affect the loop?
point = ""
pointList = [[]] #pointList will be a list that contains lists
while True:
    if point == "0,0,0":
        break
    else:
        point = input("Enter a point in 3D space:")
        point = [int(y) for y in input().split()]
        pointList.append(point)
print(pointList)
From the docs:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
In short, it splits on whitespace, which doesn't include commas. What you're looking for is str.split(',').
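Applied to the loop from the question, that might look like the sketch below. The read_points name and the choice to take lines from an iterable (so it can be tried without typing) are assumptions for illustration; it also assumes the terminating point should be stored in the list.

```python
def read_points(lines):
    """Parse 'x,y,z' lines into lists of ints, stopping after 0,0,0."""
    points = []
    for line in lines:
        # int() tolerates surrounding spaces, so "4, 5, 6" also works.
        point = [int(part) for part in line.split(',')]
        points.append(point)
        if point == [0, 0, 0]:
            break
    return points

print(read_points(["1,2,3", "4, 5, 6", "0,0,0", "7,8,9"]))
# [[1, 2, 3], [4, 5, 6], [0, 0, 0]]
```

For interactive use, the loop body can call input() once per iteration instead of iterating over a list. Note that the original code calls input() twice per iteration, so the first value typed is discarded.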
I suggest making the code more robust with respect to user input. While regular expressions should not be overused, I believe they are a good fit for this situation: you can define a regular expression for all allowed separators and then use its split method. It is also more usual to represent a point as a tuple. The loop condition can test for the terminating point directly. The terminating condition could also be something other than entering a point of zeros (not shown in the example). Try the following code:
#!python3
import re
# The separator: a comma with optional surrounding whitespace, or plain whitespace.
rexsep = re.compile(r'\s*,\s*|\s+') # can be extended if needed
points = [] # the list of points
point = None # init
while point != (0, 0, 0):
    s = input('Enter a point in 3D space: ')
    try:
        # The regular expression is used for splitting thus allowing
        # more complex separators like spaces, commas, commas and spaces,
        # whatever - you never know your user ;)
        x, y, z, *rest = [int(e) for e in rexsep.split(s.strip())]
        point = (x, y, z)
        points.append(point)
    except ValueError:
        print('Some error.')
print(points)

What's wrong with Groovy multi-line String?

This Groovy script raises an error:
def a = "test"
+ "test"
+ "test"
Error:
No signature of method: java.lang.String.positive() is
applicable for argument types: () values: []
While this script works fine:
def a = new String(
"test"
+ "test"
+ "test"
)
Why?
As Groovy doesn't require an end-of-statement marker (such as ;), it gets confused if you put the operator at the start of the following line.
This would work instead:
def a = "test" +
"test" +
"test"
as the Groovy parser knows to expect something on the following line.
Groovy sees your original def as three separate statements. The first assigns "test" to a; the other two try to apply unary plus to "test" (which calls String.positive(), and this is where it fails).
With the new String constructor call, the Groovy parser is still inside the constructor (as the parenthesis hasn't yet closed), so it can logically join the three lines together into a single statement.
For true multi-line Strings, you can also use the triple quote:
def a = """test
test
test"""
This will create a String with test on three lines.
You can also make it neater with:
def a = """test
          |test
          |test""".stripMargin()
The stripMargin method trims the leading whitespace (up to and including the | char) from each line.
Similar to stripMargin(), you could also use stripIndent(), like this:
def a = """\
    test
    test
    test""".stripIndent()
Because of the rule
The line with the least number of leading spaces determines the number to remove.
you also need to indent the first "test" and not put it directly after the initial """ (the \ ensures the multi-line string does not start with a newline).
You can tell Groovy that the statement should continue past the line ending by wrapping it in a pair of parentheses ( ... ):
def a = ("test"
+ "test"
+ "test")
A second option is to use a backslash, \, at the end of each line:
def a = "test" \
+ "test" \
+ "test"
FWIW, this is identical to how Python multi-line statements work.
