I am trying to parse a long string of 'objects' enclosed in quotes and delimited by commas. For example:
s='"12345","X","description of x","X,Y",,,"345355"'
output=['"12345"','"X"','"description of x"','"X,Y"','','','"345355"']
I am using split to delimit by commas:
s = '"12345","X","description of x","X,Y",,,"345355"'
s.split(',')
This almost works, but the segment ...,"X,Y",... ends up being split into "X and Y". I need the split to ignore commas inside quotes.
Is there a way I can delimit by commas except inside quotes?
I tried using a regex, but it ignores the ...,,,... in the data because blank fields in the file I'm parsing have no quotes. I am not an expert with regex; I adapted this sample from Python split string on quotes. I understand what that example is doing, but I'm not sure how to modify it to also handle data that is not enclosed in quotes.
Thanks!
Split by " (quote) instead of by , (comma). This splits the string into a list containing extra comma-only elements, which you can then remove:
s='"12345","X","description of x","X,Y",,,"345355"'
temp = s.split('"')
print(temp)
#> ['', '12345', ',', 'X', ',', 'description of x', ',', 'X,Y', ',,,', '345355', '']
values_to_remove = ['', ',', ',,,']
result = list(filter(lambda val: val not in values_to_remove, temp))
print(result)
#> ['12345', 'X', 'description of x', 'X,Y', '345355']
this should work:
In [1]: import re
In [2]: s = '"12345","X","description of x","X,Y",,,"345355"'
In [3]: pattern = r"(?<=[\",]),(?=[\",])"
In [4]: re.split(pattern, s)
Out[4]: ['"12345"', '"X"', '"description of x"', '"X,Y"', '', '', '"345355"']
Explanation:
(?<=...) is a "positive lookbehind assertion". It causes your pattern (in this case, just a comma, ",") to match commas in the string only if they are preceded by the pattern given by .... Here, ... is [\",], which means "either a quotation mark or a comma".
(?=...) is a "positive lookahead assertion". It causes your pattern to match commas in the string only if they are followed by the pattern specified as ... (again, [\",]: either a quotation mark or a comma).
Since both of these assertions must be satisfied for the pattern to match, it will still work correctly if any of your 'objects' begin or end with commas as well.
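As an aside (not part of the original answers): Python's standard csv module handles this kind of input natively. It keeps commas inside quoted fields and turns the unquoted blanks into empty strings, though, like the answers above, it strips the surrounding quotes:

```python
import csv
from io import StringIO

s = '"12345","X","description of x","X,Y",,,"345355"'
# csv.reader understands quoted fields: the comma inside "X,Y" is
# preserved, and the unquoted blank fields become empty strings
row = next(csv.reader(StringIO(s)))
print(row)
# ['12345', 'X', 'description of x', 'X,Y', '', '', '345355']
```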
You can also scan the string character by character, stripping the quotes and rejoining the fields with a separator that cannot appear inside them:
s = '"12345","X","description of x","X,Y",,,"345355"'
n = ''
i = 0
while i < len(s):
    if s[i] == '"':
        i += 1
        while i < len(s) and s[i] != '"':
            n += s[i]
            i += 1
        i += 1
    if i < len(s) and s[i] == ",":
        n += ", "
        i += 1
print(n.split(", "))
output: ['12345', 'X', 'description of x', 'X,Y', '', '', '345355']
Is there a way to use an ANTLR parser as a searcher, i.e. to find the first instance of a substring ss of a longer string S that matches a given rule my_rule?
Conceptually, I could accomplish this by looking for a match at position S[i], incrementing i until I successfully retrieve a match or S is exhausted.
However, in practice this doesn't work very well, because prefixes in S might coincidentally have characters that match tokens in my grammar. Depending on how this happens, a valid string ss in S might get recognized several times, or skipped over erratically, or there might be lots of errors printed about "token recognition error".
Is there an approach I haven't thought of, or an ANTLR feature I don't know about?
I'm using the Python bindings for ANTLR, if that matters.
EXAMPLE:
Given the following grammar:
grammar test ;
options { language=Python3; }
month returns [val]
: JAN {$val = 1}
| FEB {$val = 2}
| MAR {$val = 3}
| APR {$val = 4}
| MAY {$val = 5}
;
day_number returns [val]
: a=INT {$val = int($a.text)} ;
day returns [val]
: day_number WS? {$val = int($day_number.start.text)}
;
month_and_day returns [val]
: month WS day {$val = ($month.val, $day.val)}
| day WS ('of' WS)? month {$val = ($month.val, $day.val)}
;
WS : [ \n\t]+ ; // whitespace is not ignored
JAN : 'jan' ('.' | 'uary')? ;
FEB : 'feb' ('.' | 'ruary')? ;
MAR : 'mar' ('.' | 'ch')? ;
APR : 'apr' ('.' | 'il')? ;
MAY : 'may' ;
INT
: [1-9]
| '0' [1-9]
| '1' [0-9]
| '2' [0-3]
;
and the following script to test it:
import sys
sys.path.append('gen')
from testParser import testParser
from testLexer import testLexer
from antlr4 import InputStream
from antlr4 import CommonTokenStream, TokenStream
def parse(text: str):
    date_input = InputStream(text.lower())
    lexer = testLexer(date_input)
    stream = CommonTokenStream(lexer)
    parser = testParser(stream)
    return parser.month_and_day()

for t in ['Jan 6',
          'hello Jan 6, 1984',
          'hello maybe Jan 6, 1984']:
    value = parse(t)
    print(value.val)
I get the following results:
# First input - good
(1, 6)
# Second input - errors printed to STDERR
line 1:0 token recognition error at: 'h'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'l'
line 1:3 token recognition error at: 'l'
line 1:4 token recognition error at: 'o '
line 1:11 token recognition error at: ','
(1, 6)
# Third input - prints errors and throws exception
line 1:0 token recognition error at: 'h'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'l'
line 1:3 token recognition error at: 'l'
line 1:4 token recognition error at: 'o '
line 1:9 token recognition error at: 'b'
line 1:10 token recognition error at: 'e'
line 1:12 mismatched input 'jan' expecting INT
Traceback (most recent call last):
File "test_grammar.py", line 25, in <module>
value = parse(t)
File "test_grammar.py", line 19, in parse
return parser.month_and_day()
File "gen/testParser.py", line 305, in month_and_day
localctx._day = self.day()
File "gen/testParser.py", line 243, in day
localctx.val = int((None if localctx._day_number is None else localctx._day_number.start).text)
ValueError: invalid literal for int() with base 10: 'jan'
Process finished with exit code 1
To use the incremental approach I outlined above, I'd need a way to suppress the token recognition error output and also wrap the exception in a try or similar. Feels like I'd be very much going against the grain, and it would be difficult to distinguish these parsing exceptions from other things going wrong.
(META - I could swear I already asked this question somewhere about 4 months ago, but I couldn't find anything on SO, or the ANTLR GitHub tracker, or the ANTLR Google Group.)
Is there a way to use an ANTLR parser as a searcher, i.e. to find the
first instance of a substring ss of a longer string S that matches
a given rule my_rule?
The short answer is no. ANTLR does not work as a substitute/equivalent to any of the standard regex-based tools, like sed and awk.
The longer answer is yes, but with messy caveats. ANTLR expects to parse a structured, largely unambiguous input text. Text that is of no semantic significance can be ignored by adding the lexer rule (at lowest priority/bottom position)
IGNORE : . -> skip;
That way, anything not explicitly recognized in the lexer is ignored.
The next problem is the potential semantic overlap between 'normal' text and keywords, e.g. Jan (a name) vs. Jan (the month abbreviation). In general, this can be handled by adding a BaseErrorListener to the parser to distinguish between real and meaningless errors. What constitutes real vs. meaningless can involve complex corner cases, depending on the application.
Finally, the rule
day_number returns [val]
: a=INT {$val = int($a.text)} ;
is returning an int value not an INT token, hence the error that is being reported. The rule should be
day_number : INT ;
The solution I've settled on, based on a variant of an idea from @grosenberg's answer, is as follows.
1) Add a fallback lexer rule to match any text that isn't already matched by existing rules. Do not ignore/skip these tokens.
OTHER : . ;
2) Add a parser alternative to match either the rule of interest, or (with lower precedence) anything else:
month_and_day_or_null returns [val]
: month_and_day {$val = $month_and_day.val}
| . {$val = None}
;
3) In the application code, look for either a None or a populated value:
def parse(text: str):
    date_input = InputStream(text.lower())
    lexer = testLexer(date_input)
    stream = CommonTokenStream(lexer)
    parser = testParser(stream)
    return parser.month_and_day_or_null()

for t in ['Jan 6',
          'hello Jan 6, 1984',
          'hello maybe Jan 6, 1984']:
    for i in range(len(t)):
        value = parse(t[i:])
        if value.val:
            print(f"Position {i}: {value.val}")
            break
This has the desired effect at match time:
Position 0: (1, 6)
Position 6: (1, 6)
Position 12: (1, 6)
--*--
-***-
--*--
(the dashes stand for blanks)
print('', '*', ' \n', '***', ' \n', '', '*', '')
This is what I tried and it doesn't work. I thought '' was a blank, and since there's a comma there would be one more blank, so there should be two blanks as a result?
Anyway, what should I do using only one print() call?
Just put it in as a single string:
print(' * \n***\n * ')
Output:
 *
***
 *
You can do this because Python treats \n as a newline character, and it will not interfere with the rest of the text, even if it "touches" it. Putting it in a single string makes it more readable. There is no reason to fragment the whole statement with commas when you can do it all in one string.
Basically:
'' --> empty string
' ' --> one space char (or blank)
So, modifying your print:
Only change the first argument from '' to ' '
print(' ', '*', ' \n', '***', ' \n', '', '*', '')
You can also simplify it passing only 1 argument:
print(' * \n *** \n * ')
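For completeness, the stray blanks in the multi-argument versions come from print()'s sep parameter, which defaults to a single space between arguments. Passing sep='' makes the spacing fully explicit; a small sketch:

```python
# print() joins its arguments with sep, which defaults to ' ';
# with sep='' only the characters you pass are printed
print(' ', '*', ' \n', '***', ' \n', ' ', '*', sep='')
```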
TL;DR - I need a line number to repeat in my output, but it doesn't.
I've looked to see if there's anything that answers my specific question; I've gotten a lot of help and nearly have this solved, but I have a question about what my code is returning.
I have a file with a list of people's names. It contains columns for Given Name and Surname, and from these I can obtain the full name. What I am trying to do is ascertain whether a non-ASCII character is in the name, which character that is, and on what line number in the file that name can be found.
Here's a snippet of my code:
with open('testFile.txt', 'r') as myFile:
    for l in lastName:
        if 0 <= ord(l) <= 127:
            pass
        else:
            for num, line in enumerate(myFile, start=1):
                if lastName in line:
                    print('Line number:', num)
            print('Unicode Character:', l, '\n')
    for f in firstName:
        if 0 <= ord(f) <= 127:
            pass
        else:
            for num, line in enumerate(myFile, start=1):
                if firstName in line:
                    print('Line number:', num)
            print('Unicode Character:', f, '\n')
The results work 'okay' but they're not complete. For example, if my file had three names:
Hélen Duçére
Mike Johnson
Aïda Flannery
My results look like this:
Line Number: 1
Unicode Character: é
Unicode Character: ç
Unicode Character: é
Line Number: 3
Unicode Character: ï
Is there something obvious in my code that explains why the line number isn't repeated for that ç or the second é character? Is there a simpler way to write this?
This code is a little more compact.
re.sub replaces every match of its first argument (a pattern) in its third argument with its second argument. The pattern [a-zA-Z ] matches a single ASCII letter or a blank, so the sub deletes the ASCII letters and blanks; whatever remains on the line is the non-ASCII (or other unexpected) content.
import re

with open('will.txt') as will:
    for n, line in enumerate(will):
        remaining = re.sub(r'[a-zA-Z ]', '', line.rstrip())
        if remaining:
            print('Line number:', n+1, 'non-ascii', remaining)
Edit: Making use of KyrSt's comment, the regex should contain some other characters, including, for instance, "'" and "-".
Edit 2: After exhaustive discussion with KyrSt, I've come to the conclusion that he's right, the regex should be [\x00-\x7F]
I think there is a simpler way to approach this, including the parsing of the names. Provided you set an appropriate value for the variable delimiter below, the code you want could look like this:
line_number = 0
with open('testFile.txt', 'r') as myFile:
    line = myFile.readline().replace('\n', '')
    while line != '':
        line_number += 1
        firstName, lastName = line.split(delimiter)
        for l in firstName:
            if ord(l) > 127 or ord(l) < 0:
                print('Line number:', line_number)
                print('Unicode Character:', l, '\n')
        for l in lastName:
            if ord(l) > 127 or ord(l) < 0:
                print('Line number:', line_number)
                print('Unicode Character:', l, '\n')
        line = myFile.readline().replace('\n', '')
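Both answers reduce to reading the input once and checking every character. For illustration (using a list in place of the file, and str.isascii(), available since Python 3.7), the same idea can be sketched as:

```python
# stand-in for the file contents from the question
names = ['Hélen Duçére', 'Mike Johnson', 'Aïda Flannery']

for line_number, line in enumerate(names, start=1):
    for ch in line:
        # str.isascii() is True exactly for code points 0-127
        if not ch.isascii():
            print('Line number:', line_number)
            print('Unicode Character:', ch, '\n')
```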
I want to generate sequential file names that take the last 2+ digits
from current buffer's name and count upwards from there. Like this:
08a01 > 08a02 > 08a03 > ....
The snippet I use (thanks for initial advice, Ingo Karkat!) leaves out the zeros,
producing sequences like 08a01 > 08a2 > 08a3 > ....
if b:current_buffer_name =~ '\d\+$'
  let lastDigit = matchstr(b:current_buffer_name, '\d\+$')
  let newDigit = lastDigit + 1
  let s:new_file_name = substitute(b:current_buffer_name, '\d\+$', newDigit, '')
else
  let s:new_file_name = b:current_buffer_name . '01'
endif
How can I tell Vim in a function that it should count upwards "with
zeros"? I tried adding let &nrformats-=octal before the
if-condition (as suggested here), but that didn't work.
Thanks for any explanations!
Try changing this line:
let newDigit = lastDigit + 1
into:
let newDigit = printf("%02d", str2nr(lastDigit) + 1)
I didn't test it, but from reading your code, it should work.
Note that this hardcodes a width of 2; if your string were foobar0000001, it wouldn't work. In that case, you need to get len(lastDigit) and use it in the printf() format.
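The width-preserving increment is easy to prototype outside Vim. For comparison, a Python sketch of the same idea (pad the incremented number back to the original digit width):

```python
import re

def next_name(name):
    """Increment the trailing digits of a name, preserving zero padding."""
    m = re.search(r'\d+$', name)
    if m is None:
        return name + '01'   # no trailing digits: start a new sequence
    digits = m.group()
    # zfill pads back to the original width, so '01' -> '02'
    return name[:m.start()] + str(int(digits) + 1).zfill(len(digits))

print(next_name('08a01'))          # 08a02
print(next_name('foobar0000001'))  # foobar0000002
```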
I don't know how to make Vim perform the addition without interpreting a number with leading zeros as octal; I tried set nrformats-=octal, but that didn't work either. Here is my workaround: extract the number in two parts, the leading zeroes on one side and the remaining digits on the other, and compute the total width for printf():
let last_digits = matchlist(bufname('%'), '\(0\+\)\?\(\d\+\)$')
echo printf('%0' . (len(last_digits[1]) + len(last_digits[2])) . 'd', last_digits[2] + 1)
Some tests:
With a buffer named 08a004562, last_digits will be a list like:
['004562', '00', '4562', '', '', '', '', '', '', '']
and the result will be:
004563
And with a buffer named 8a9, last_digits will be:
['9', '', '9', '', '', '', '', '', '', '']
and the result:
10