Substitute special characters for a text in Python - python-3.x

I am trying to come up with a script that substitutes a few special characters in text in Python. Here is the list of characters I want to replace:
: < > ? * / " | \
My code works well as long as I don't add \ to the list of characters to replace:
import re
subj='Test/ US: Paper* Packaging'
chars_to_remove = [':','<','>','?','*','/','"','|']
rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
re.sub(rx, '', subj)
However, when I add \ to my chars_to_remove list, it gives me the error: SyntaxError: EOL while scanning string literal
import re
chars_to_remove = [':','<','>','?','*','/','"','|','\']
rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
re.sub(rx, '', subj)
I know \ is used for escapes such as newline in Python, but how can I let my code know that I mean the literal \ character rather than an escape?
Thanks
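The SyntaxError comes from '\' not being a complete string literal: the backslash escapes the closing quote, so Python never finds the end of the string. A minimal sketch of one way around it, writing the backslash as '\\' (the sample subject with a backslash in it is made up for illustration and is not from the post):

```python
import re

# Sample text; the backslash inside it is an assumption added for the demo.
subj = 'Test/ US: Paper* Pack\\aging'

# '\\' is a one-character string holding a single backslash.
chars_to_remove = [':', '<', '>', '?', '*', '/', '"', '|', '\\']

# re.escape makes every character safe to place inside the character class.
rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
cleaned = re.sub(rx, '', subj)
print(cleaned)
```

A raw string such as r'\' does not help here, because a raw string literal cannot end in a single backslash either.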

Related

Bash script to replace matched substrings within larger substring

I'm trying to write a bash script to replace the newline characters and *s from comments, but only if that comment contains a particular substring.
// file.txt
/**
* Here is a multiline
* comment that contains substring
* rest of it
*/
/**
* Here is a multiline
* comment that does not contain subNOTstring
* rest of it
*/
I would like the final result to be:
// file.txt
/** Here is a multiline comment that contains substring rest of it */
/**
* Here is a multiline
* comment that does not contain subNOTstring
* rest of it
*/
I have a regex that matches multiline comments: \/\*([^*]|[\r\n]|(\*+([^*\/]|[\r\n])))*\*\/ but I can't figure out the second part: matching only the comments that contain the substring, and then replacing every \n * with just a space.
To make sure my question is articulated correctly:
Match a block within the file, i.e. a comment.
Make sure that the match includes the substring.
Replace every occurrence of one string within that match with another string, i.e. \n * with a space.
If Python is an option, you could try:
#!/usr/bin/python
import re # use regex module
with open('file.txt') as f: # open "file.txt" to read
    str = f.read() # assign "str" to the lines of the file
for i in re.split(r'(/\*.*?\*/)', str, flags=re.DOTALL): # split the file on the comment including the comment in the result
    if re.match(r'/\*.*substring', i, flags=re.DOTALL): # if the comment includes the keyword "substring"
        i = re.sub(r'\n \* |\n (?=\*/)', ' ', i) # then replace the newline and the asterisk with a whitespace
    print(i, end='') # print the element without adding newline
re.split(r'(/\*.*?\*/)', str, flags=re.DOTALL) splits "str" on the comments,
keeping each comment in the resulting list.
The flags=re.DOTALL option makes a dot match newline characters as well.
The for i in .. syntax loops over the list, assigning each element to "i".
re.match(r'/\*.*substring', i, flags=re.DOTALL) matches an element
that is a comment containing the keyword "substring".
re.sub(r'\n \* |\n (?=\*/)', ' ', i) replaces a newline followed by
the " * " on the next line with a single space.
\n (?=\*/) uses a positive lookahead to match a newline followed
by " */". It matches the start of the last line of the comment block while leaving the
"*/" as is.
[Edit]
If you want to embed the python script in bash, would you please try:
#!/bin/bash
infile="file.txt" # modify according to your actual filename
tmpfile=$(mktemp /tmp/temp.XXXXXX) # temporary file to output
# start of python script
python3 -c "
import re, sys
filename = sys.argv[1]
with open(filename) as f:
    str = f.read()
for i in re.split(r'(/\*.*?\*/)', str, flags=re.DOTALL):
    if re.match(r'/\*.*substring', i, flags=re.DOTALL):
        i = re.sub(r'\n \* |\n (?=\*/)', ' ', i)
    print(i, end='')
" "$infile" > "$tmpfile"
# end of python script
mv -f -- "$infile" "$infile".bak # backup the original file
mv -f -- "$tmpfile" "$infile" # replace the input file with the output
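
To try the comment-collapsing logic from this answer without touching any file, here is a small sketch that runs the same three regular expressions on an in-memory string. The function name collapse_comments, its keyword parameter and the SAMPLE text are assumptions made for the demo, and the sample assumes the comment lines are indented as " * ", which is what the \n \* pattern expects.

```python
import re

# In-memory stand-in for file.txt; the " * " indentation is an assumption.
SAMPLE = """// file.txt
/**
 * Here is a multiline
 * comment that contains substring
 * rest of it
 */
/**
 * Here is a multiline
 * comment that does not contain subNOTstring
 * rest of it
 */
"""

def collapse_comments(text, keyword="substring"):
    """Join the lines of every /* ... */ comment that contains keyword."""
    out = []
    # The capturing group keeps each comment in the list returned by split.
    for part in re.split(r'(/\*.*?\*/)', text, flags=re.DOTALL):
        if re.match(r'/\*.*' + re.escape(keyword), part, flags=re.DOTALL):
            part = re.sub(r'\n \* |\n (?=\*/)', ' ', part)
        out.append(part)
    return ''.join(out)

result = collapse_comments(SAMPLE)
print(result, end='')
```

Only the first comment is rewritten; the second one contains "subNOTstring" rather than "substring", so it passes through unchanged.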

Lexer rule to handle escape of quote with quote or backslash in ANTLR4?

I'm trying to expand the answer to How do I escape an escape character with ANTLR 4? to work when the " can be escaped both with " and \. I.e. both
"Rob ""Commander Taco"" Malda is smart."
and
"Rob \"Commander Taco\" Malda is smart."
are both valid and equivalent. I've tried
StringLiteral : '"' ('""'|'\\"'|~["])* '"';
but it fails to match
"Entry Flag for Offset check and for \"don't start Chiller Water Pump Request\""
with the tokenizer consuming more characters than intended, i.e. consumes beyond \""
Does anyone know how to define the lexer rule?
A bit more detail...
"" succeeds
"""" succeeds
"\" " succeeds
"\"" succeeds (at EOF)
"\""\n"" fails (it greedily pulls in the \n and the next ")
Example: (test.txt)
""
""""
"\" "
"\""
""
grun test tokens -tokens < test.txt
line 5:1 token recognition error at: '"'
[#0,0:1='""',<StringLiteral>,1:0]
[#1,2:2='\n',<'
'>,1:2]
[#2,3:6='""""',<StringLiteral>,2:0]
[#3,7:7='\n',<'
'>,2:4]
[#4,8:12='"\" "',<StringLiteral>,3:0]
[#5,13:13='\n',<'
'>,3:5]
[#6,14:19='"\""\n"',<StringLiteral>,4:0]
[#7,21:20='<EOF>',<EOF>,5:2]
\"" and """ at the end of a StringLiteral are not being handled the same.
Here's the ATN for that rule:
From this diagram it's not clear why they should be handled differently. They appear to be parallel constructs.
More research
Test Grammar (small change to simplify ATN):
grammar test
;
start: StringLiteral (WS? StringLiteral)+;
StringLiteral: '"' ( (('\\' | '"') '"') | ~["])* '"';
WS: [ \t\n\r]+;
The ATN for StringLiteral in this grammar:
OK, let's walk through this ATN with the input "\""\n"
unconsumed input    transition
"\""\n"             1 -ε-> 5
"\""\n"             5 -"-> 11
\""\n"              11 -ε-> 9
\""\n"              9 -ε-> 6
\""\n"              6 -\-> 7
""\n"               7 -"-> 10
"\n"                10 -ε-> 13
"\n"                13 -ε-> 11
"\n"                11 -ε-> 12
"\n"                12 -ε-> 14
"\n"                14 -"-> 15
\n"                 15 -ε-> 2
We should reach State 2 with the " before the \n, which would be the desired behavior.
Instead, we see it continue on to consume the \n and the next "
line 2:1 token recognition error at: '"'
[#0,0:5='"\""\n"',<StringLiteral>,1:0]
[#1,7:6='<EOF>',<EOF>,2:2]
In order for this to be valid, there must be a path from state 11 to state 2 that consumes a \n and a " (and I'm not seeing it)
Maybe I'm missing something, but it's looking more and more like a bug to me.
The problem is handling the \ properly.
Bart found the path through the ATN that I missed, the one that allows it to match the extra \n". The \ is matched as a ~["] and then the rule comes back through and matches the " to terminate the string.
We could disallow \ in the "everything but a " alternative (~["\\]), but then we have to allow a stand-alone \ to be acceptable. We'd want to add an alternative that allows a \ followed by anything other than a ". You'd think that '\\' ~["] does that, and you'd be right, to a point, but it also consumes the character following the \, which is a problem if you want a string like "test \\" string": since it has consumed the second \, you can't match the \" alternative. What you're looking for is a lookahead (i.e. consume the \ if it's not followed by a ", but don't consume the following character). But ANTLR lexer rules don't offer a lookahead construct in the grammar syntax itself.
You'll notice that most grammars that allow \" as an escape sequence in strings also require a bare \ to be escaped (\\), and frequently treat any other \ (other character) sequence as just the "other character".
If escaping the \ character is acceptable, the rule could be simplified to:
StringLiteral: '"' ('\\' . | '""' | ~["\\])* '"';
"Flag for \\"Chiller Water\\"" would not parse correctly, but "Flag for \\\"Chiller Water\\\"" would. Without lookahead, I'm not seeing a way to lex the first version.
Also, note that if you don't escape the \, then you have an ambiguous interpretation of \"". Is it \" followed by a " to terminate the string, or \ followed by "" allowing the string to continue? ANTLR will take whichever interpretation consumes the most input, so we see it using the second interpretation and pulling in characters until it finds a ".
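To check which inputs the simplified rule accepts as a single literal, the rule can be approximated with a Python regular expression. This is only a sketch of the token shape: Python's re module backtracks and is not ANTLR's longest-match lexer, and the names STRING_LITERAL and is_string_literal are made up for the demo.

```python
import re

# Regex counterpart of: StringLiteral: '"' ('\\' . | '""' | ~["\\])* '"';
# re.DOTALL lets "." match a newline, as "." does in an ANTLR lexer rule.
STRING_LITERAL = re.compile(r'"(?:\\.|""|[^"\\])*"', re.DOTALL)

def is_string_literal(text):
    """Return True if the whole text is exactly one string literal."""
    return STRING_LITERAL.fullmatch(text) is not None

unescaped = r'"Flag for \\"Chiller Water\\""'    # \\ then a bare quote
escaped = r'"Flag for \\\"Chiller Water\\\""'    # \\ then \" each time
doubled = '"Rob ""Commander Taco"" Malda is smart."'

print(is_string_literal(unescaped))  # False
print(is_string_literal(escaped))    # True
print(is_string_literal(doubled))    # True
```

In the first input the \\ pair is consumed as an escaped backslash, so the quote that follows it ends the literal early and the rest of the text is left over.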
I cannot reproduce it. Given the grammar:
grammar T;
parse
: .*? EOF
;
StringLiteral
: '"' ( '""' | '\\"' | ~["] )* '"'
;
Other
: . -> skip
;
The following code:
String source =
        "\"Rob \"\"Commander Taco\"\" Malda is smart.\"\n" +
        "\"Rob \\\"Commander Taco\\\" Malda is smart.\"\n" +
        "\"Entry Flag for Offset check and for \\\"don't start Chiller Water Pump Request\\\"\"\n";
TLexer lexer = new TLexer(CharStreams.fromString(source));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();
for (Token t : stream.getTokens()) {
    System.out.printf("%-20s '%s'\n",
            TLexer.VOCABULARY.getSymbolicName(t.getType()),
            t.getText().replace("\n", "\\n"));
}
produces the following output:
StringLiteral '"Rob ""Commander Taco"" Malda is smart."'
StringLiteral '"Rob \"Commander Taco\" Malda is smart."'
StringLiteral '"Entry Flag for Offset check and for \"don't start Chiller Water Pump Request\""'
Tested with ANTLR 4.9.3 and 4.10.1: both produce the same output.

how to write antlr4 rule for string

I have the following rules for string and comment:
Double_quoted_string : '"' ( ~[\n\r] )* '"' ;
SL_Comment : '//' .*? '\r'? '\n' -> channel(HIDDEN) ;
But I see that for the following input:
printf("Hello \"something "); //printf("Bye ");
the string token getting generated is:
"Hello \"something "); //printf("Bye "
i.e. greedily the longest match is taken, without applying the rule for the comment.
I would like the string only to be "Hello \"something ". How should the rules be modified for this?
Like this
Double_quoted_string
: '"' ( ~[\\"\n\r] | '\\' [\\"] )* '"'
;
Short explanation of the inner ( ... )*:
~[\\"\n\r] matches any char except \, ", \n and \r
'\\' [\\"] matches \\ or \" *
* if you want to escape more, simply add them to the character class: '\\' [\\"'tbnrf] would match \\, \", \', \t, \b, \n, \r and \f
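
The effect of the rule can be previewed outside ANTLR with an equivalent Python regular expression. This is a rough sketch that covers only the string rule, not the comment rule or the rest of the lexer; the name DOUBLE_QUOTED is made up for the demo.

```python
import re

# Regex counterpart of: '"' ( ~[\\"\n\r] | '\\' [\\"] )* '"'
DOUBLE_QUOTED = re.compile(r'"(?:[^\\"\n\r]|\\[\\"])*"')

line = r'printf("Hello \"something "); //printf("Bye ");'

# The first match stops at the first unescaped quote.
first = DOUBLE_QUOTED.search(line).group()
print(first)
```

Because a bare " can no longer appear inside the string body, the match ends at the first unescaped quote instead of running on to the last quote on the line.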

Cut a special char in c#

Does anyone know how to split a string with the Split method?
I have no idea how to do it when the separator is ' \ '.
identitiname = HttpContext.Current.User.Identity.Name; // it has the value FAMILY\ANDRES
string[] usuario = identitiname.Split( '\' );
It gives me an error.
Regards
To split a string on \ you have to write '\\', because \ is the escape character in C# character and string literals:
string[] usuario = identitiname.Split( '\\' );

Parsing quoted string with escape chars

I'm having a problem parsing a list of lines of the following format in ANTLR4:
* this is a string
* "first" this is "quoted"
* this is "quoted with \" "
I want to build a parse tree like
(list
(line * (value (string this is a string)))
(line * (value (parameter first) (string this is) (parameter quoted)))
(line * (value (string this is) (parameter quoted with " )))
)
I have an antlr4 grammar of this format
grammar List;
list : line+;
line : '*' (WS)+ value* NEWLINE;
value : string
| parameter
;
string : ((WORD) (WS)*)+;
parameter : '"'((WORD) (WS)*)+ '"';
WORD : (~'\n')+;
WS : '\t' | ' ';
NEWLINE : '\n';
But this is failing in the first character recognition of '*' itself, which baffles me.
line 1:0 mismatched input '* this is a string' expecting '*'
The problem is that your lexer is too greedy. The rule
WORD : (~'\n')+;
matches almost everything. This causes the lexer to produce the following tokens for your input:
token 1: WORD (* this is a string)
token 2: NEWLINE
token 3: WORD (* "first" this is "quoted")
token 4: NEWLINE
token 5: WORD (* this is "quoted with \" ")
Yes, that is correct: only WORD and NEWLINE tokens. ANTLR's lexer tries to construct tokens with as many characters as possible; it does not "listen" to what the parser is trying to match.
The error message:
line 1:0 mismatched input '* this is a string' expecting '*'
is telling you this: on line 1, index 0 the token with text '* this is a string' (type WORD) is encountered, but the parser is trying to match the token: '*'
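The greedy behaviour can be reproduced outside ANTLR with a plain regular expression. The sketch below is only an analogy (Python's re module is not ANTLR's lexer), and the narrower token patterns in its second half are hypothetical ones chosen for the comparison.

```python
import re

line = '* this is a string'

# Counterpart of WORD : (~'\n')+ ; it takes every character up to the newline.
greedy_word = re.compile(r'[^\n]+')
first_token = greedy_word.match(line).group()
print(repr(first_token))

# Hypothetical narrower patterns: a bullet, a word without spaces, quotes
# or asterisks, and whitespace that is dropped from the token list.
narrow = re.compile(r'(?P<BULLET>\*)|(?P<WORD>[^ \t\r\n"*]+)|(?P<SPACE>[ \t]+)')
tokens = [(m.lastgroup, m.group())
          for m in narrow.finditer(line)
          if m.lastgroup != 'SPACE']
print(tokens)
```

With the greedy pattern the asterisk is buried inside one long token, so nothing is left for a separate '*' token; with the narrower patterns it comes out as a token of its own.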
Try something like this instead:
grammar List;
parse
: NEWLINE* list* NEWLINE* EOF
;
list
: item (NEWLINE item)*
;
item
: '*' (STRING | WORD)*
;
BULLET : '*';
STRING : '"' (~[\\"] | '\\' [\\"])* '"';
WORD : ~[ \t\r\n"*]+;
NEWLINE : '\r'? '\n' | '\r';
SPACE : [ \t]+ -> skip;
which parses your example input as follows:
(parse
(list
(item
* this is a string) \n
(item
* "first" this is "quoted") \n
(item
* this is "quoted with \" "))
\n
<EOF>)
