Lark: parsing special characters

I'm starting with Lark and got stuck on an issue with parsing special characters.
I have expressions given by a grammar. For example, these are valid expressions: Car{_}, Apple3{3+}, Dog{a_7}, r2d2{A3*}, A{+}... More formally, they have the form name{feature}, where
name: CNAME
feature: (DIGIT|LETTER|"+"|"-"|"*"|"_")+
The definition of constants can be found here.
The problem is that the special characters are not present in the produced tree (see the example below). I have seen this answer, but it did not help me. I tried placing ! before the special characters and escaping them. I also enabled keep_all_tokens, but this is not desired because then the characters { and } are also present in the tree. Any ideas how to solve this problem? Thank you.
from lark import Lark
grammar = r"""
start: object
object : name "{" feature "}" | name
feature: (DIGIT|LETTER|"+"|"-"|"*"|"_")+
name: CNAME
%import common.LETTER
%import common.DIGIT
%import common.CNAME
%import common.WS
%ignore WS
"""
parser = Lark(grammar, parser='lalr',
              lexer='standard',
              propagate_positions=False,
              maybe_placeholders=False
              )
def test():
    test_str = '''
    Apple_3{3+}
    '''
    j = parser.parse(test_str)
    print(j.pretty())
if __name__ == '__main__':
    test()
The output looks like this:
start
  object
    name Apple_3
    feature 3
instead of
start
  object
    name Apple_3
    feature
      3
      +

You said you tried placing ! before the special characters. As I understand the question you linked, the ! has to be placed before the rule:
!feature: (DIGIT|LETTER|"+"|"-"|"*"|"_")+
This produces your expected result for me:
start
  object
    name Apple_3
    feature
      3
      +

Related

Prevent shlex from splitting with colon (:)

I'm having trouble dealing with colons (:) in shlex. I need the following behaviour:
Sample input
text = 'hello:world ("my name is Max")'
s = shlex.shlex(instream=text, punctuation_chars=True)
s.get_token()
s.get_token()
...
Desired output
hello:world
(
"my name is Max"
)
Current output
hello
:
world
(
"my name is Max"
)
shlex puts the colon in a separate token and I don't want that. The documentation doesn't say very much about the colon. I've tried adding it to the wordchars attribute, but that messes everything up and separates the words between commas. I've also tried setting punctuation_chars to a custom list containing only parentheses, ["(", ")"], but it makes no difference. I need the punctuation_chars option set to get the parentheses as separate tokens (or any other option that achieves this output).
Does anyone know how I could get this to work? Any help will be greatly appreciated.
I'm using python 3.6.9, could upgrade to python 3.7.X if necessary.

To make shlex treat : as a word char, you need to add : to wordchars:
>>> import shlex
>>> text = 'hello:world ("my name is Max")'
>>> s = shlex.shlex(instream=text, punctuation_chars=True)
>>> s.wordchars += ':'
>>> while True:
...     tok = s.get_token()
...     if not tok: break
...     print(tok)
...
hello:world
(
"my name is Max"
)
I tested that with Python 3.6.9 and 3.8.0. You need at least Python 3.6, which is where the punctuation_chars initialization parameter was added.
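
As a sketch of the same idea outside the interactive prompt, the snippet below wraps the tokenizer in a small helper. The function name tokenize and its extra_wordchars parameter are made up for this illustration; only shlex.shlex, punctuation_chars and wordchars come from the standard library.

```python
import shlex


def tokenize(text, extra_wordchars=":"):
    """Split text shell-style, treating extra_wordchars as part of a word."""
    # punctuation_chars=True keeps ( and ) as separate tokens.
    lexer = shlex.shlex(instream=text, punctuation_chars=True)
    # Characters listed in wordchars are accumulated into the current word.
    lexer.wordchars += extra_wordchars
    # A shlex object is iterable and stops at end of input.
    return list(lexer)


tokens = tokenize('hello:world ("my name is Max")')
for token in tokens:
    print(token)
```

Other characters that should stay inside a word could be passed through extra_wordchars in the same way, though each one is worth testing against real input first.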

How to get the content after a string using regex in python

I have a string as follows:
A5697[2:10] = {ravi, rageev, raghav, smith};
I want the content after "A5697[2:10] =". So, my output should be:
{ravi, rageev, raghav, smith};
This is my code:
print(re.search(r'(?<=A\d+\[.*\] =\s).*', line).group())
But this raises an error:
sre_constants.error: look-behind requires fixed-width pattern
Can anyone help to solve this issue? I would prefer to use regex.

You can try re.sub, as shown below. Since you have given only one data point, I am assuming all the other data points follow a similar pattern.
import re
text = "A5697[2:10] = {ravi, rageev, raghav, smith}"
re.sub(r'(A\d+\[\d+:\d+\]\s+=\s+)(.+)', r'\2', text)
returns,
'{ravi, rageev, raghav, smith}'
re.sub substitutes the entire regex match with the second capturing group, which captures everything after '= '.
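
The error in the question comes from Python's re module accepting only fixed-width look-behind patterns. A capturing group avoids the look-behind entirely. The following is a minimal sketch, assuming every line follows the same name[range] = value shape as the single sample; the helper name content_after_assignment is hypothetical.

```python
import re

# Prefix: "A", digits, a bracketed part, then "=". The group captures the rest.
ASSIGNMENT = re.compile(r"A\d+\[[^\]]*\]\s*=\s*(.*)")


def content_after_assignment(line):
    """Return the text after 'A<digits>[...] =', or None if the line does not match."""
    match = ASSIGNMENT.search(line)
    return match.group(1) if match else None


line = "A5697[2:10] = {ravi, rageev, raghav, smith};"
print(content_after_assignment(line))
```

Returning None for non-matching lines avoids the AttributeError that calling .group() directly on a failed search would raise.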

Simply replace the bits you don't want:
print(re.sub(r'A\d[^=]*= *', '', line))
See demo here: https://rextester.com/NSG17655

regex question with partial solution already found

string = "My QUIZZING codes is GREATLY bad so quizzing number is the integer 94.4; I don't like any other BuzzcuT except 1.\n"
From this string of gibberish, I want to pull out the words QUIZZING, GREATLY and BuzzcuT, leaving their capitalization (or lack thereof) as is.
caps = re.findall(r'([A-Z]+(?:(?!\s?[A-Z][a-z])\s?[A-Z])+)', string)
print(caps)
The code above results in ['QUIZZING', 'GREATLY'], but I am hoping to get ['QUIZZING', 'GREATLY', 'BuzzcuT'].
Although it's gibberish, the point is that the various alphanumeric combinations make it a challenge.

The regex below finds the 3 patterns in your example string.
import re
string = "My QUIZZING codes is GREATLY bad so quizzing number is the integer 94.4; I don't like any other BuzzcuT except 1.\n"
# The regex contains 2 patterns
# \b[A-Z]{3,}\S*\b -- will match QUIZZING and GREATLY
# \b[A-Z]{1}[a-z]\S*[A-Z]\b -- will match BuzzcuT
#
# You could use a single pattern -- [A-Z]{1,}\S*[A-Z]
# to match all 3 words
#
word_pattern = re.compile(r'\b[A-Z]{3,}\S*\b|\b[A-Z]{1}[a-z]\S*[A-Z]\b')
find_words = re.findall(word_pattern, string)
if find_words:
    print(find_words)
# output
['QUIZZING', 'GREATLY', 'BuzzcuT']
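
The single-pattern variant mentioned in the comments can be tried on its own. This sketch assumes the goal is "any run of non-space characters that starts and ends with an uppercase letter", which is one reading of the question rather than a confirmed requirement; the \b anchors are an addition to the pattern from the comments.

```python
import re

string = ("My QUIZZING codes is GREATLY bad so quizzing number is the "
          "integer 94.4; I don't like any other BuzzcuT except 1.\n")

# One uppercase letter, any non-space characters, then a final uppercase letter.
# The word needs at least two characters, so "I" alone does not match,
# and "My" fails because it does not end in an uppercase letter.
single_pattern = re.compile(r"\b[A-Z]\S*[A-Z]\b")

words = single_pattern.findall(string)
print(words)
```

Because \S also matches punctuation, a token such as "A.b,C" would match as well; tightening \S to [A-Za-z] may be preferable if only letters should count.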

An Elegant Solution to Python's Multiline String?

I was trying to log the completion of a scheduled task that I run in Django. To keep the code presentable, I used a multiline string, instead of one long line, for the logger output inside the handle method of a management Command class. The code looks like this:
# the usual imports...
# ....
import logging
import textwrap
logger = logging.getLogger(__name__)
class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        # some codes here
        # ....
        final_statement = f'''\
            this is the final statements \
            with multiline string to have \
            a neater code.'''
        dedented_text = textwrap.dedent(final_statement)
        logger.info(dedented_text.replace('  ', ''))
I have tried a few methods I found, but most of the quick and easy ones still left big chunks of spaces in the terminal output, as shown here:
this is the final statements             with multiline string to have             a neater code.
So I came up with a creative solution to my problem, using:
dedented_text.replace('  ', '')
This replaces every pair of spaces with nothing, so the normal single spaces between words survive. It finally produced:
this is the final statements with multiline string to have a neater code.
Is this an elegant solution, or did I miss something on the internet?

You could use a regex to replace each newline, together with the whitespace that follows it, with a single space. Wrapping it in a function also leads to less repetitive code, so let's do that.
import re
def single_line(string):
    # Collapse every newline plus the indentation after it into one space,
    # then drop the space left over from the leading newline.
    return re.sub(r"\n\s+", " ", string).strip()
final_statement = single_line(f'''
    this is the final statements
    with multiline string to have
    a neater code.''')
print(final_statement)
Alternatively, if you wish to avoid this particular problem (and don't mind the development overhead), you could store the strings in a file, such as JSON, so you can quickly edit prompts while keeping your code clean.
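
A regex-free way to get the same single-line result is the str.split / str.join idiom, shown here as a sketch rather than a drop-in replacement. The function name single_line is reused only for continuity. Note that it collapses every run of whitespace, including any intentional double spaces inside the text.

```python
def single_line(text):
    """Collapse all runs of whitespace (spaces, tabs, newlines) into single spaces."""
    # split() with no arguments splits on any whitespace run and drops
    # leading and trailing whitespace, so join() yields one clean line.
    return " ".join(text.split())


final_statement = single_line('''
    this is the final statements
    with multiline string to have
    a neater code.''')
print(final_statement)
```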

Thanks to Neil's suggestion, I have come up with a more elegant solution: a function that replaces every pair of spaces with nothing.
def single_line(string):
    return string.replace('  ', '')
final_statement = '''\
    this is a much neater
    final statement
    to present my code
'''
print(single_line(final_statement))
Adapted from Neil's solution, this drops the regex import. That's one line less of code!
Also, making it a function improves readability, as the whole print statement reads like English: "Print single line final statement".
Any better ideas?

The issue with both Neil's and Wong Siwei's answers is that they don't work if your multiline string contains lines that are more indented than others:
my_string = """\
  this is my
  string and
    it has various
      indentation
  levels"""
What you want in the case above is to remove the two-space indentation, not every space at the beginning of a line.
The solution below should work in all cases:
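import re
def dedent(s):
    indent_level = None
    # re.MULTILINE makes ^ match at the start of every line, not only the first
    for m in re.finditer(r"^ +", s, re.MULTILINE):
        line_indent_level = len(m.group())
        if indent_level is None or indent_level > line_indent_level:
            indent_level = line_indent_level
    if not indent_level:
        return s
    # strip the common indentation from each line, keeping the newlines
    return re.sub(r"^ {%d}" % indent_level, "", s, flags=re.MULTILINE)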
It first scans the whole string to find the lowest indentation level then uses that information to dedent all lines of it.
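
For comparison, the standard library's textwrap.dedent also removes only the whitespace common to all lines, so deeper indentation is kept relative to the rest. The sketch below applies it to the example string from this answer; it is an alternative to a hand-written function, not a claim about what the answer's author intended.

```python
import textwrap

my_string = """\
  this is my
  string and
    it has various
      indentation
  levels"""

# dedent() strips the two spaces shared by every line and nothing more.
dedented = textwrap.dedent(my_string)
print(dedented)
```

This only works when the string's lines are real lines. With backslash continuations, as in the question, the whole text is a single line and the inner indentation is not "leading" whitespace any more.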
If you only care about making your code easier to read, you may instead use C-like string "concatenation":
my_string = (
    "this is my string"
    " and I write it on"
    " multiple lines"
)
print(repr(my_string))
# => 'this is my string and I write it on multiple lines'
You may also want to make it explicit with +s:
my_string = "this is my string" + \
            " and I write it on" + \
            " multiple lines"

Gitlab CI: Set dynamic variables

For a gitlab CI I'm defining some variables like this:
variables:
  PROD: project_package
  STAGE: project_package_stage
  PACKAGE_PATH: /opt/project/build/package
  BUILD_PATH: /opt/project/build/package/bundle
  CONTAINER_IMAGE: registry.example.com/project/package:e2e
I would like to set those variables a bit more dynamically, as there are mainly only two parts: project and package. Everything else depends on those values, that means I have to change only two values to get all other variables.
So I would expect something like
variables:
  PROJECT: project
  PACKAGE: package
  PROD: $PROJECT_$PACKAGE
  STAGE: $PROD_stage
  PACKAGE_PATH: /opt/$PROJECT/build/$PACKAGE
  BUILD_PATH: /opt/$PROJECT/build/$PACKAGE/bundle
  CONTAINER_IMAGE: registry.example.com/$PROJECT/$PACKAGE:e2e
But it looks like this is not the right way to do it...

I don't know where your expectation comes from, but it is trivial to check that $, _ and / have no special meaning in YAML, and neither does : unless it is followed by a space. There might be special handling in GitLab, but I strongly doubt that it works the way you expect.
To formalize your expectation, you assume that any key (from the same mapping) preceded by a $ and terminated by the end of the scalar, by _ or by / is going to be "expanded" to that key's value. The _ has to be such a terminator, otherwise $PROJECT_$PACKAGE would not expand correctly.
Now consider adding a key-value pair:
  BREAKING_TEST: $PACKAGE_PATH
Is this supposed to expand to:
  BREAKING_TEST: /opt/project/build/package
or follow the rule you implied, that _ is a terminator, and just expand to:
  BREAKING_TEST: package_PATH
To prevent this kind of ambiguity, programs like bash use quoting around the variable name to be expanded ("$PROJECT"_PATH vs. $PROJECT_PATH), but the saner, more modern solution is to use clamping begin and end characters (e.g. { and }, or $% and %), with some special rule for using the clamping character as normal text.
So this is not going to work the way you wrote it; the approach itself is the problem.
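
The terminator ambiguity is easy to reproduce with Python's string.Template, which follows a similar $NAME / ${NAME} convention. This is only an illustration of the delimiter problem in general and says nothing about how GitLab itself treats these variables; the values dict simply mirrors the two keys from the question.

```python
from string import Template

values = {"PROJECT": "project", "PACKAGE": "package"}

# Braces mark exactly where each name ends, so "_" is plain text here.
braced = Template("${PROJECT}_${PACKAGE}").substitute(values)
print(braced)

# Without braces, "_" is a legal identifier character, so Template
# looks for a variable called "PROJECT_" and fails with KeyError.
try:
    Template("$PROJECT_$PACKAGE").substitute(values)
    missing_name = None
except KeyError as exc:
    missing_name = exc.args[0]
print(missing_name)
```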
It is not too hard to pre-process a YAML file, and it can be done with e.g. Python (but watch out: { has special meaning in YAML), possibly with the help of jinja2: load the variables, and then expand the original text using the variables until no more replacements can be made.
But it all starts with choosing the delimiters intelligently. Also keep in mind that although your "variables" seem to be ordered in the YAML text, there is no such guarantee once they are constructed as a dict/hash/mapping in your program.
You could e.g. use << and >>:
variables:
  PROJECT: project
  PACKAGE: package
  PROD: <<PROJECT>>_<<PACKAGE>>
  STAGE: <<PROD>>_stage
  PACKAGE_PATH: /opt/<<PROJECT>>/build/<<PACKAGE>>
  BUILD_PATH: /opt/<<PROJECT>>/build/<<PACKAGE>>/bundle
  CONTAINER_IMAGE: registry.example.com/<<PROJECT>>/<<PACKAGE>>:e2e
which, with the following program (that doesn't deal with escaping << to keep its normal meaning) generates your original, expanded, YAML exactly.
import sys
from ruamel import yaml
def expand(s, d):
    max_recursion = 100
    # keep substituting until no << remains (values may contain placeholders too)
    while '<<' in s:
        res = ''
        max_recursion -= 1
        if max_recursion < 0:
            raise NotImplementedError('max recursion exceeded')
        for idx, chunk in enumerate(s.split('<<')):
            if idx == 0:
                res += chunk  # first chunk is before <<, just append
                continue
            try:
                var, rest = chunk.split('>>', 1)
            except ValueError:
                raise NotImplementedError('delimiters have to balance "{}"'.format(chunk))
            if var not in d:
                res += '<<' + chunk  # unknown name: leave the placeholder as is
            else:
                res += d[var] + rest
        s = res
    return s
with open('template.yaml') as fp:
    yaml_str = fp.read()
variables = yaml.safe_load(yaml_str)['variables']
data = yaml.round_trip_load(expand(yaml_str, variables))
yaml.round_trip_dump(data, sys.stdout, indent=2)
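
The expansion step can also be tried without any YAML library by working on an already-loaded mapping. The sketch below uses only the standard library; expand_all, PLACEHOLDER and max_passes are hypothetical names for this illustration, and, like the program above, it does not handle escaping << to keep its normal meaning.

```python
import re

# A placeholder is a name made of word characters between << and >>.
PLACEHOLDER = re.compile(r"<<(\w+)>>")


def expand_all(variables, max_passes=100):
    """Return a copy of variables with <<NAME>> placeholders substituted."""
    result = dict(variables)
    for _ in range(max_passes):
        changed = False
        for key, value in result.items():
            # Unknown names are left untouched by returning the original match.
            new_value = PLACEHOLDER.sub(
                lambda m: result.get(m.group(1), m.group(0)), value)
            if new_value != value:
                result[key] = new_value
                changed = True
        if not changed:
            return result
    raise ValueError("placeholders did not settle after %d passes" % max_passes)


variables = {
    "PROJECT": "project",
    "PACKAGE": "package",
    "PROD": "<<PROJECT>>_<<PACKAGE>>",
    "STAGE": "<<PROD>>_stage",
    "BUILD_PATH": "/opt/<<PROJECT>>/build/<<PACKAGE>>/bundle",
}
expanded = expand_all(variables)
for name, value in expanded.items():
    print(name, value)
```

Repeating passes until nothing changes means the result does not depend on the order in which the keys happen to be stored.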
