I have an assignment to create a lexical analyser and I've got everything working except for one bit.
I need to create a string that will accept a new line, and the string is delimited by double quotes.
The string accepts any number, letter, some specified punctuation, backslashes and double quotes within the delimiters.
I can't seem to figure out how to escape a new line character.
Is there a certain way of escaping characters like new line and tab?
Here's some of my code that might help
< STRING : ( < QUOTE> (< QUOTE > | < BACKSLASH > | < ID > | < NUM > | " " )* <QUOTE>) >
< #QUOTE : "\"" >
< #BACKSLASH : "\\" >
So my string should allow for a quote, then any of the following characters like a backslash, a whitespace, a number etc, and then followed by another quote.
The newline char like "\n" is what's not working.
Thanks in advance!
For string literals, JavaCC borrows the syntax of Java. So, a single-character literal comprising a carriage return is escaped as "\r", and a single-character literal comprising a line feed is escaped as "\n".
However, the processed string value is just a single character; it is not the escape itself. So, suppose you define a token for line feed:
< LF : "\n" >
A match of the token <LF> will be a single line-feed character. When substituting the token in the definition of another token, the single character is effectively substituted. So, suppose you have the higher-level definition:
< STRING : "\"" ( <LF> ) "\"" >
A match of the token <STRING> will be three characters: a quotation mark, followed by a line feed, followed by a quotation mark. What you seem to want instead is for the escape sequence to be recognized:
< STRING : "\"" ( "\\n" ) "\"" >
Now a match of the token <STRING> will be four characters: a quotation mark, followed by an escape sequence representing a line feed, followed by a quotation mark.
In your current definition, I see that other often-escaped metacharacters like quotation mark and backslash are also being recognized literally, rather than as escape sequences.
Related
Are there any "quote" symbols that you can use if you run out of them?
I know one can use " and ' (and, perhaps a combination of them) like "''" (somehow)
A example of a "additional" "quote" symbol would be
\" inside quotes.
If I make a python script, and "used up" both " and ', is there any more chars that I can use to indicate a quote?
No. From what I can tell, ' and " are the only quotes than can be used in a string literal. The Lexical Analysis page contains this information:
shortstring ::= "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring ::= "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
In plain English: Both types of literals can be enclosed in matching single quotes (') or double quotes ("). They can also be enclosed in matching groups of three single or double quotes (these are generally referred to as triple-quoted strings).
Which suggests that those are the only two options.
No, " and ' are the only quote characters in Python.
If you run out of quote symbols you can "escape" quotes with a backslash(\), which will ignore the quotes and just make them part of the string.
For example, this would be a valid string:
print("hello \"bob\"")
This would return 'hello "bob"'
No, their are no other quote chracters in Python for right now
As far as I know ' and " are tho only quote characters in python
but if you count a multi line string then it would look like this:
multiline = """
string
string
string
"""
and to make a list of them you would use:
multiline.splitlines()
How can I include quotes for string and characters as part of the string. Example is "This is a \" string" which should result in one string instead of "This is a \" as one string and string" as an error in this case. The same goes for the characters. Example is '\'', but
in my case it's only '\'.
This is my current solution which works only without quotes.
CHARACTER
: '\'' ~('\'')+ '\''
;
STRING
: '"' ~('"')+ '"'
;
Your string/char rules don't handle escape sequences correctly. For the character it should be:
CHARACTER: '\'' '\\'? '.' '\'';
Here we make the escape char (backshlash) be part of the rule and require an additional char (whatever it is) follow it. Similar for the string:
STRING: '"' ('\\'? .)+? '"';
By using +? we are telling ANTLR4 to match in a non-greedy manner, stopping at the first non-escaped quote char after the initial one.
I have a large (4 GB) Windows .csv text file (each lines end in "\r\n") in a Linux environment that was supposed to have been a csv delimited file (delimiter = '|', text qualifier = '"') with each field separated by a pipe and enclosed in double quotes. Any narrative text field with embedded double quotes was supposed to have the double quote escaped with a second double quote (ie. " the quick "brown" fox" was supposed to have been represented as "the quick ""brown"" fox"). Unfortunately escaping the embedded double quotes did not occur. Further the text fields may include embedded new lines (i.e. Windows CR (\r\n)) which need to be retained.
Sample lines might look as follows:
"1234567890123456"|"2016-07-30"|"2016-08-01"|"123"|"456"|"789"|"text narrative field starts\r\n
with text lines that may have embedded double quotes "For example"\r\n
and may include measurements such as 1/2" x 2" with \r\n
the text continuing and includes embedded line breaks \r\n
which will finally be terminated with a double quote"\r\n
"9876543210654321"|"2017-01-31"|"2018-08-01"|"123"|"456"|"789"|"text narrative field"\r\n
"2345678901234567"|"...."\r\n
with the objective to have the output appear as follows:
~1234567890123456~|~2016-07-30~|~2016-08-01~|~123~|~456~|~789~|~text narrative field starts\r\n
with text lines that may have embedded double quotes ""For example""\r\n
and may include measurements such as 1/2"" x 2"" with \r\n
the text continuing and includes embedded line breaks \r\n
which will finally be terminated with a double quote~\r\n
~9876543210654321~|~2017-01-31~|~2018-08-01~|~123~|~456~|~789~|~text narrative field~\r\n
~2345678901234567~|~....~\r\n
The solution I was attempting to implement was to:
SUCCESSFUL: change all the "|" sequences to ~|~
SUCCESSFUL: change the double quote (")at the start of the first line and end of the last line to a tilde (~)
change the ending and starting double quotes to tildes for any lines ending in a double quote at the end of the first line and terminated with a CR (\r\n) (eg. ..."\r\n) and the next line begins with a double quote, followed by 16 digit number and a tilde (eg. "1234567890123456~...) (i.e. it is the start of a new record)
convert all remaining double quote characters to two successive double quotes (change " to "")
then reverse the first 3 steps above changing all ~ back to double quotes.
I started by using sed to replace all strings with double quote, followed by a pipe, followed by a double quote (i.e. "|") with a tilde, pipe, tilde (i.e. ~|~). I then manually replaced the first and last doublequote in the file with a tilde.
This is where I ran into issues as I tried to count the number of occurrences where a line ends with a doublequote(") and the start of the next line begins with a doublequote followed by a 16 digit number and a "~" which will tell me the actual number of csv records in the file (minus one) as opposed to the number of lines. I attempted to do this using grep: grep '"\r\n"\d{16}~' | wc -l but that didn't work
I then need to replace those double quotes wherein a double quote ends a record and the succeeding record begins with a double quote followed by a 16 digit number and a "~" leaving everything else intact.
I tried to use sed: sed 's/"\r\n"(\d{16}~)/~\r\n~\1' windows_file.txt but it is not working as hoped.
I would welcome any recommendations as to how to accomplish the above.
The script below does what you expect using awk, except for the very last line in the file since it does not know where that record ends.
It could be fixed counting lines in the file but would be impractical since it's a big file.
Looking at data structure records are separated by "\r\n" and fields by "|" let's use that with awk.
gawk 'BEGIN{
RS="\"\r\n\"" # input record separator RS, 2 double quotes with a DOS line ending in the middle
FS="\"\\|\"" # input field separator FS, 2 double quotes with a pipe in the middle
ORS="~\r\n~" # your record separator
OFS="~|~" # your field separator
} {
$1=$1 # trick awk into believing something has changed
if (NR == 1){ # first record, replace first character
print "~" substr($0,2)
}else{
print $0
}
} ' test.txt
Result (assuming lines end with \r\n):
~1234567890123456~|~2016-07-30~|~2016-08-01~|~123~|~456~|~789~|~text narrative field starts
with text lines that may have embedded double quotes "For example"
and may include measurements such as 1/2" x 2" with
the text continuing and includes embedded line breaks
which will finally be terminated with a double quote~
~9876543210654321~|~2017-01-31~|~2018-08-01~|~123~|~456~|~789~|~text narrative field~
~10654321~|~2018-09-31~|~2018-08-01~|~123~|~456~|~789~|~asdasdasdasdad asasda"
~
~
PS: will break if a field contains a line that starts with " and the preceding line within the same ends with "\r\n since the pattern will match the proposed RS.
"10654321"|"2018-09-31"|"2018-08-01"|"123"|"456"|"789"|"asdasdasdasdad asasda"\r\n
"some more"\r\n
"22222"|".... (another record)
I am trying to make a string variable containing :
"C:\Program Files\Sublime Text 3\sublime_text.exe" C:\Users\User\Desktop\Guess.py
Unfortunately I am not succeeding in doing so. Is there a way to put the text about as is into a variable, double quotes and all?
In your example string you have characters that need escaping: " and \
fmt.Println("\"C:\\Program Files\\Sublime Text 3\\sublime_text.exe\" C:\\Users\\User\\Desktop\\Guess.py")
You can also use back quotes to create what is called a raw string which doesn't require escaping those characters.
fmt.Println(`"C:\Program Files\Sublime Text 3\sublime_text.exe" C:\Users\User\Desktop\Guess.py`)
List of escapes:
\a U+0007 alert or bell
\b U+0008 backspace
\f U+000C form feed
\n U+000A line feed or newline
\r U+000D carriage return
\t U+0009 horizontal tab
\v U+000b vertical tab
\\ U+005c backslash
\' U+0027 single quote (valid escape only within rune literals)
\" U+0022 double quote (valid escape only within string literals)
See the official docs.
I'm trying to implement a parser using ANTLRv4 for a language that accepts both "" and \" as a way escaping " characters in " delimited strings.
The answers to this question show how to do it for "" escaping. However when I try to extend it to also cover the \" case, it almost works but becomes too greedy when two strings are on the same line.
Here is my grammar:
grammar strings;
strings : STRING (',' STRING )* ;
STRING
: '"' (~[\r\n"] | '""' | '\"' )* '"'
;
Here is my input of three strings:
"This is ""my string\"",
"cat","fish"
This correctly recognises "This is ""my string\"", but thinks that "cat","fish" is all one string.
If I move "fish" down on to the next line it works correctly.
Can anyone figure out how to make it work if "cat" and "fish" are on the same line?
Make your STRING rule non greedy to stop at the first quote char it encounters, instead of trying to get as much as possible:
STRING
: '"' (~[\r\n"] | '""' | '\"' )*? '"'
;
I've found what I need to do to get this to work as I wanted, though to be honest I'm still not entirely sure why Antlr was doing what it did.
Simply by adding another backslash character to the '\"' clause it works!
So my final STRINGS definition is : '"' (~[\r\n"] | '""' | '\\"' )* '"'
Going back to first principles, I hand drew a state transition diagram of the problem and then realised that the two escaping mechanism sequences are not the same and cannot be treated similarly. Then trying to implement the two patterns in AntlrWorks it became apparent that I needed to add the second backslash at which point it all started working.
Does a single backslash followed by some arbitrary character simply mean that character?