Is there a way to use an ANTLR parser as a searcher, i.e. to find the first instance of a substring ss of a longer string S that matches a given rule my_rule?
Conceptually, I could accomplish this by looking for a match at position S[i], incrementing i until I successfully retrieve a match or S is exhausted.
However, in practice this doesn't work very well, because prefixes in S might coincidentally have characters that match tokens in my grammar. Depending on how this happens, a valid string ss in S might get recognized several times, or skipped over erratically, or there might be lots of errors printed about "token recognition error".
Is there an approach I haven't thought of, or an ANTLR feature I don't know about?
I'm using the Python bindings for ANTLR, if that matters.
EXAMPLE:
Given the following grammar:
grammar test ;
options { language=Python3; }
month returns [val]
    : JAN {$val = 1}
    | FEB {$val = 2}
    | MAR {$val = 3}
    | APR {$val = 4}
    | MAY {$val = 5}
    ;

day_number returns [val]
    : a=INT {$val = int($a.text)} ;

day returns [val]
    : day_number WS? {$val = int($day_number.start.text)}
    ;

month_and_day returns [val]
    : month WS day {$val = ($month.val, $day.val)}
    | day WS ('of' WS)? month {$val = ($month.val, $day.val)}
    ;

WS : [ \n\t]+ ; // whitespace is not ignored

JAN : 'jan' ('.' | 'uary')? ;
FEB : 'feb' ('.' | 'ruary')? ;
MAR : 'mar' ('.' | 'ch')? ;
APR : 'apr' ('.' | 'il')? ;
MAY : 'may' ;

INT
    : [1-9]
    | '0' [1-9]
    | '1' [0-9]
    | '2' [0-3]
    ;
and the following script to test it:
import sys
sys.path.append('gen')
from testParser import testParser
from testLexer import testLexer
from antlr4 import InputStream
from antlr4 import CommonTokenStream, TokenStream
def parse(text: str):
    date_input = InputStream(text.lower())
    lexer = testLexer(date_input)
    stream = CommonTokenStream(lexer)
    parser = testParser(stream)
    return parser.month_and_day()

for t in ['Jan 6',
          'hello Jan 6, 1984',
          'hello maybe Jan 6, 1984']:
    value = parse(t)
    print(value.val)
I get the following results:
# First input - good
(1, 6)
# Second input - errors printed to STDERR
line 1:0 token recognition error at: 'h'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'l'
line 1:3 token recognition error at: 'l'
line 1:4 token recognition error at: 'o '
line 1:11 token recognition error at: ','
(1, 6)
# Third input - prints errors and throws exception
line 1:0 token recognition error at: 'h'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'l'
line 1:3 token recognition error at: 'l'
line 1:4 token recognition error at: 'o '
line 1:9 token recognition error at: 'b'
line 1:10 token recognition error at: 'e'
line 1:12 mismatched input 'jan' expecting INT
Traceback (most recent call last):
  File "test_grammar.py", line 25, in <module>
    value = parse(t)
  File "test_grammar.py", line 19, in parse
    return parser.month_and_day()
  File "gen/testParser.py", line 305, in month_and_day
    localctx._day = self.day()
  File "gen/testParser.py", line 243, in day
    localctx.val = int((None if localctx._day_number is None else localctx._day_number.start).text)
ValueError: invalid literal for int() with base 10: 'jan'
Process finished with exit code 1
To use the incremental approach I outlined above, I'd need a way to suppress the token recognition error output and also wrap the exception in a try or similar. Feels like I'd be very much going against the grain, and it would be difficult to distinguish these parsing exceptions from other things going wrong.
(META - I could swear I already asked this question somewhere about 4 months ago, but I couldn't find anything on SO, or the ANTLR GitHub tracker, or the ANTLR Google Group.)
Is there a way to use an ANTLR parser as a searcher, i.e. to find the
first instance of a substring ss of a longer string S that matches
a given rule my_rule?
The short answer is no. ANTLR does not work as a substitute/equivalent to any of the standard regex-based tools, like sed and awk.
The longer answer is yes, but with messy caveats. ANTLR expects to parse a structured, largely unambiguous input text. Text that is of no semantic significance can be ignored by adding the lexer rule (at lowest priority/bottom position)
IGNORE : . -> skip;
That way, anything not explicitly recognized in the lexer is ignored.
The next problem is the potential semantic overlap between 'normal' text and keywords, e.g. Jan (name) vs. Jan (month abbreviation). In general, this can be handled by adding a BaseErrorListener to the parser to distinguish between real and meaningless errors. What constitutes real vs. meaningless can involve complex corner cases depending on the application.
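For instance, here is a minimal sketch of such a listener for the Python runtime (the listener class name is my own; the base class in antlr4-python3-runtime is ErrorListener, the Python counterpart of BaseErrorListener):

from antlr4.error.ErrorListener import ErrorListener

class CollectingErrorListener(ErrorListener):  # hypothetical name, for illustration
    """Collects syntax errors instead of printing them to stderr."""
    def __init__(self):
        super().__init__()
        self.errors = []

    def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
        # Record the error instead of letting the default console listener print it
        self.errors.append((line, column, msg))

# Replace the default console listeners on both the lexer and the parser:
# lexer.removeErrorListeners()
# lexer.addErrorListener(CollectingErrorListener())
# parser.removeErrorListeners()
# parser.addErrorListener(CollectingErrorListener())

The application can then inspect the collected errors and decide which ones matter.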
Finally, the rule
day_number returns [val]
: a=INT {$val = int($a.text)} ;
is returning an int value not an INT token, hence the error that is being reported. The rule should be
day_number : INT ;
The solution I've settled on, based on a variant of an idea from @grosenberg's answer, is as follows.
1) Add a fallback lexer rule to match any text that isn't already matched by existing rules. Do not ignore/skip these tokens.
OTHER : . ;
2) Add a parser alternative to match either the rule of interest, or (with lower precedence) anything else:
month_and_day_or_null returns [val]
    : month_and_day {$val = $month_and_day.val}
    | . {$val = None}
    ;
3) In the application code, look for either a None or a populated value:
def parse(text: str):
    date_input = InputStream(text.lower())
    lexer = testLexer(date_input)
    stream = CommonTokenStream(lexer)
    parser = testParser(stream)
    return parser.month_and_day_or_null()

for t in ['Jan 6',
          'hello Jan 6, 1984',
          'hello maybe Jan 6, 1984']:
    for i in range(len(t)):
        value = parse(t[i:])
        if value.val:
            print(f"Position {i}: {value.val}")
            break
This has the desired effect at match time:
Position 0: (1, 6)
Position 6: (1, 6)
Position 12: (1, 6)
Related
In my grammar, I want to allow two syntaxes for a string:
The classical way, "my \"string\"" - no problem here.
A new approach with an arbitrary escaping boundary: |"my "string"|", |x"my |"string"|x". The objective is to keep the string content without any escaping, and never have to write something like a &amp;&amp; b when a JS fragment sits in an X(HT)ML file, for example.
In spirit, I'm looking to express something like:
'|' {$Boundary} '"' {AnyCharSequenceExcept('|' $Boundary '"')} '|' {$Boundary} '"'
I understand I can't do it in standard ANTLR4. Is it possible to do it with actions?
Here's a way to do that:
lexer grammar DemoLexer;
@members {
def ahead(self, steps):
    """
    Returns the next `steps` characters ahead in the character stream, or None if
    there aren't `steps` characters ahead anymore
    """
    text = ""
    for n in range(1, steps + 1):
        next = self._input.LA(n)
        if next == Token.EOF:
            return None
        text += chr(next)
    return text

def consume_until(self, open_tag):
    """
    If we get here, it means the lexer matched an opening tag, and we now consume as
    many characters as needed until we match the corresponding closing tag
    """
    while True:
        ahead = self.ahead(len(open_tag))
        if ahead is None:
            raise Exception("missing '{}' close tag".format(open_tag))
        if ahead == open_tag:
            break
        self._input.consume()
    # Be sure to consume the close tag, which has the same character count as `open_tag`
    for n in range(0, len(open_tag)):
        self._input.consume()
}
STRING
    : '|' ~'"'* '"' {self.consume_until(self.text)}
    ;

SPACES
    : [ \t\r\n] -> skip
    ;

OTHER
    : .
    ;
If you generate the lexer from the grammar above and run the following (Python) script:
from antlr4 import *
from DemoLexer import DemoLexer
source = """
foo |x"my |"string"|x" bar
"""
lexer = DemoLexer(InputStream(source))
stream = CommonTokenStream(lexer)
stream.fill()
for token in stream.tokens[:-1]:
print("{0:<25} '{1}'".format(DemoLexer.symbolicNames[token.type], token.text))
the following will be printed to your console:
OTHER                     'f'
OTHER                     'o'
OTHER                     'o'
STRING                    '|x"my |"string"|x"'
OTHER                     'b'
OTHER                     'a'
OTHER                     'r'
I need to write a program that asks the user to enter a number n, where -6 < n < 93.

output: Enter the start number: 12
12 13 14 15 16 17 18

The numbers need to be printed using a field width of 2, right-justified. Fields need to be separated by a single space. There should be no spaces after the final field.
This is my code so far:
a = eval(input('Enter the start number : ',end='\n'))
for n in range(a,a+7):
    print("{0:>2}").format(n)
print()
But it says:
File "C:/Users/Nathan/Documents/row.py", line 5, in <module>
a = eval(input('Enter the start number : ',end='\n'))
builtins.TypeError: input() takes no keyword arguments
Please help
First of all, the input function returns a string. You should cast it to an integer.
You also have some syntax errors, to name a few:
You put .format after the print(...) call, but it should be called on the string inside print.
The input function doesn't take an end argument, which is why Python gives you this error: TypeError: input() takes no keyword arguments
The formatting pattern is not right.
This code does what you want:
a = int(input('Enter the start number between -6 and 93: '))
assert -6 < a < 93, f"number must be in (-6, 93), but got {a} instead"
# join with single spaces so the width-2, right-justified fields
# have no trailing space after the final one
print(' '.join(f"{n:>2}" for n in range(a, a + 7)))
OUTPUT:
Enter the start number between -6 and 93: 12
12 13 14 15 16 17 18
You can't pass end='\n' to input(), because input() doesn't take the end argument; only print() does.
If you want a blank line, add another print() after the input.
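For instance, a small sketch of the distinction:

# print() accepts the end keyword; input() does not
print('Enter the start number :', end='')  # prompt without a trailing newline
a = int(input())                           # read the reply and cast it
print()                                    # a bare print() emits the blank line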
Solution
# input(): reads a line from the user, in the form of a string
# rstrip(): removes trailing whitespace from the input
# split(): splits the string into a list of substrings
# map(function, iterable): applies int to every item, typecasting the list
ar = list(map(int, input().rstrip().split()))
print(ar)
output:
1 2 3 4
[1, 2, 3, 4]
I'm using pyparsing with python 3.6.5 on a mac. The following code crashes on the second parse:
from pyparsing import *
a = Word(alphas) + Literal(';')
b = Word(alphas) + Optional(Literal(';'))
bad_parser = ZeroOrMore(a) + b
b.parseString('hello;')
print("no problems yet...")
bad_parser.parseString('hello;')
print("this will not print because we're dead")
Is this logical behavior? Or is it a bug?
EDIT: Here is the full console output:
no problems yet...
Traceback (most recent call last):
File "test.py", line 9, in <module>
bad_parser.parseString('hello;')
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyparsing.py", line 1632, in parseString
raise exc
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyparsing.py", line 1622, in parseString
loc, tokens = self._parse( instring, 0 )
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyparsing.py", line 1379, in _parseNoCache
loc,tokens = self.parseImpl( instring, preloc, doActions )
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyparsing.py", line 3395, in parseImpl
loc, exprtokens = e._parse( instring, loc, doActions )
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyparsing.py", line 1379, in _parseNoCache
loc,tokens = self.parseImpl( instring, preloc, doActions )
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyparsing.py", line 2689, in parseImpl
raise ParseException(instring, loc, self.errmsg, self)
pyparsing.ParseException: Expected W:(ABCD...) (at char 6), (line:1, col:7)
This is expected behavior. Pyparsing does not do any lookahead, but is purely left-to-right. You can add lookahead to your parser, but it is something you have to do for yourself.
You can get some more insight into what is happening if you turn on debugging for a and b:
a.setName('a').setDebug()
b.setName('b').setDebug()
which will show you every place pyparsing is about to match the expression, and then if the match failed or succeeded, and if it succeeded, the matching tokens:
Match a at loc 0(1,1)
Matched a -> ['hello', ';']
Match a at loc 6(1,7)
Exception raised:Expected W:(ABCD...) (at char 6), (line:1, col:7)
Match b at loc 6(1,7)
Exception raised:Expected W:(ABCD...) (at char 6), (line:1, col:7)
Since a matches the complete input string, that matches the criterion of "zero or more". Then pyparsing proceeds to match b, but since the word and semicolon have already been read, there is no more to parse. Since b is not optional, pyparsing raises an exception that it could not be found. Even if you were to parse "hello; hello; hello;", all the strings and semis would be consumed by the
ZeroOrMore, with no trailing b left to parse.
Try this:
not_so_bad_parser = ZeroOrMore(a + ~StringEnd()) + b
By stating that you only want to read a expressions that are not at the end of the string, then parsing "hello;" will not match a, and so proceed to b, which then matches.
This is so prevalent an issue that I added the stopOn keyword to the ZeroOrMore and OneOrMore class constructors, to avoid the need to add the overt ~ (meaning NotAny). At first I thought this might work:
even_less_bad_parser = ZeroOrMore(a, stopOn=b) + b
But then, since b also matches as an a, this will effectively never match any as, and may leave unmatched text. We need to stop on b only if at the end of the string:
even_less_bad_parser = ZeroOrMore(a, stopOn=b + StringEnd()) + b
I'm not sure if that will truly satisfy your concept of "less bad"-ness, but that is why pyparsing is behaving as it is for you.
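For a quick end-to-end check, here is the suggested parser in a small script (a sketch; the expected results are noted in comments, assuming pyparsing 2.x):

from pyparsing import Word, alphas, Literal, Optional, ZeroOrMore, StringEnd

a = Word(alphas) + Literal(';')
b = Word(alphas) + Optional(Literal(';'))

# only match `a` when it does not end the string, leaving a trailing `b` to parse
not_so_bad_parser = ZeroOrMore(a + ~StringEnd()) + b

print(not_so_bad_parser.parseString('hello;'))         # -> ['hello', ';']
print(not_so_bad_parser.parseString('hello; hello;'))  # -> ['hello', ';', 'hello', ';']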
def kad(l):
    max_c=max_g=l[0]
    for i in range(0,len(l)):
        max_c=max(l[i],l[i]+max_c)
        if(max_c>max_g):
            max_g=max_c
        print(max_c)
    return max_g

t=int(input("test case")) ## TEST CASES
for k in range (0,t):
    n=int(input(" num")) # TOTAL NUMBERS IN EACH TEST CASE
    l=[float(int(input())) for i in range(0,n)]
    if(len(l)>0):
        kad(l)
    print(l)
Error Message
File "/home/dc97fc38c3d1e4a695c9d3550e8af5c1.py", line 16, in <module>
l=[float(int(input())) for i in range(0,n)]
File "/home/dc97fc38c3d1e4a695c9d3550e8af5c1.py", line 16, in <listcomp>
l=[float(int(input())) for i in range(0,n)]
ValueError: invalid literal for int() with base 10: '1 2 3'
The code works fine in a local editor (Jupyter notebook), but displays this error in the online editor.
No idea what your function tries to do: for empty lists it gives an error (which you guard against); for all-positive inputs it is a convoluted way of summing up all the numbers, adding the first value twice.
It looks like a mangled solution to some HackerRank-ish site's problem.
You should fix your input like this:
for _ in range(int(input("test case"))):  ## TEST CASES
    _ = input()  # TOTAL NUMBERS IN EACH TEST CASE
    # the number of numbers does not matter - you get it as len(l) if needed
    l = list(map(float, input().strip().split()))  # split the input and parse it to floats
    # ^^^ change float to int if you only handle ints - your error suggests so
    if len(l) > 0:
        kad(l)
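For reference, a standard Kadane maximum-subarray loop looks like this (a sketch with my own naming; it starts the scan at the second element, which avoids adding l[0] twice):

def kadane(l):
    # best sum of a subarray ending at the current element vs. best seen overall
    max_c = max_g = l[0]
    for x in l[1:]:
        max_c = max(x, max_c + x)  # extend the current run or start fresh at x
        max_g = max(max_g, max_c)
    return max_g

print(kadane([1.0, -2.0, 3.0, 4.0]))  # -> 7.0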
As the title says, I'm trying to make an expression calculator using Ply. I haven't finished the complete code yet, just a part of it, but for now I get this error:
Traceback (most recent call last):
  File "EX3.py", line 69, in <module>
    parser = yacc.yacc()
  File "/Users/mostafa.osama2/anaconda3/lib/python3.6/site-packages/ply/yacc.py", line 3317, in yacc
    raise YaccError('Unable to build parser')
ply.yacc.YaccError: Unable to build parser
Here is my code:
import ply.lex as lex
import ply.yacc as yacc
import sys

tokens = ['INT', 'FLOAT', 'NAME', 'PLUS', 'MINUS', 'DIVIDE', 'MULTIPLY',
          'EQUALS']
# list of tokens, for grammar checking

t_PLUS = r'\+'
t_MINUS = r'\-'
t_MULTIPLY = r'\*'
t_DIVIDE = r'\/'
t_EQUALS = r'\='
t_ignore = r' '  # used for ignoring spaces between numbers and operators

# has to match the name of the token
def t_FLOAT(t):
    r'\d+.\d+'  # 1.2 is a float, 1.(any number) is a float
    t.value = float(t.value)
    return t

def t_INT(t):
    r'\d+'
    t.value = int(t.value)
    return t  # t is our token object

def t_NAME(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'  # star means 0 or more; first char is a-zA-Z, following chars are a-zA-Z0-9
    t.type = 'NAME'
    return t

def t_error(t):
    print("Illegal characters!")
    t.lexer.skip(1)  # skips 1 character onwards

lexer = lex.lex()

def p_calc(p):  # p is a tuple
    '''
    calc : expression
         | empty
    '''
    print(p[1])

def p_expression(p):
    '''
    expression : expression PLUS expression
               | expression DIVIDE expression
               | expression PLUS expression
               | expression MINUS expression
    '''
    p[0] = (p[2], p[1], p[3])

def p_expression(p):
    '''
    expression : INT
               | FLOAT
    '''
    p[0] = p[1]

def p_empty(p):
    '''
    empty:
    '''
    p[0] = None

parser = yacc.yacc()

while True:
    try:
        s = input('')
    except EOFError:  # when you press Control-D on the keyboard
        break
    parser.parse(s)
Ply insists that parsing rules have whitespace between the name of the non-terminal and the colon. So empty: is not valid; you must write empty :
Also, as reported by Ply, you define two functions named p_expression. Ply requires that all parsing functions have different names (otherwise it has no way to call them), but it doesn't care what the names are, as long as they start with p_. So change one of the names.
Finally, you have two rules for addition, and no rule for multiplication. Ply will complain about the duplicate addition rule (after you fix the other problems). It will also complain that you are missing a p_error function.
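Putting those fixes together, the parser rules might look like this (a sketch; the function names are my own, since Ply only cares about the p_ prefix):

def p_expression_binop(p):
    '''
    expression : expression PLUS expression
               | expression MINUS expression
               | expression MULTIPLY expression
               | expression DIVIDE expression
    '''
    p[0] = (p[2], p[1], p[3])

def p_expression_number(p):
    '''
    expression : INT
               | FLOAT
    '''
    p[0] = p[1]

def p_empty(p):
    '''
    empty :
    '''
    p[0] = None

def p_error(p):
    print("Syntax error at", p)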