Parsing block comments with Megaparsec using symbols for start and end - haskell

I want to parse text similar to this in Haskell using Megaparsec.
# START SKIP
def foo(a,b):
    c = 2*a # Foo
    return a + b
# END SKIP
where # START SKIP and # END SKIP mark the start and end of the block of text to parse.
Compared to skipBlockComment, I want the parser to return the lines between the start and end markers.
This is my parser.
skip :: Parser String
skip = s >> manyTill anyChar e
  where s = string "# START SKIP"
        e = string "# END SKIP"
The skip parser works as intended.
To allow for a variable amount of white space within the start and end markers (for example "#  START   SKIP", with extra spaces), I've tried the following:
skip' :: Parser String
skip' = s >> manyTill anyChar e
  where s = symbol "#" >> symbol "START" >> symbol "SKIP"
        e = symbol "#" >> symbol "END" >> symbol "SKIP"
Using skip' to parse the above text gives the following error.
3:15:
unexpected 'F'
expecting "END", space, or tab
I would like to understand the cause of this error and how I can fix it.

As Alec already commented, the problem is that as soon as e encounters '#', it counts as a consumed character. And the way parsec and its derivatives work is that as soon as you've consumed any characters, you're committed to that parsing branch – i.e. the manyTill anyChar alternative is then not considered anymore, even though e ultimately fails here.
You can easily request backtracking though, by wrapping the end delimiter in try:
skip' :: Parser String
skip' = s >> manyTill anyChar e
  where s = symbol "#" >> symbol "START" >> symbol "SKIP"
        e = try $ symbol "#" >> symbol "END" >> symbol "SKIP"
This sets a "checkpoint" before consuming '#'; when e fails later on (in your example, at "Foo"), the parser acts as if no characters had been matched at all.
In fact, traditional parsec would give this same behaviour even for skip. It's just that looking for a fixed string, and succeeding only if it matches entirely, is such a common task that megaparsec's string is implemented like try . string: if the failure occurs within that fixed string, it always backtracks.
However, compound parsers still don't backtrack by default, like they do in attoparsec. The main reason is that if anything can backtrack to any point, you can't really get a clear point of failure to show in the error message.
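To see it all in action, here is a minimal self-contained sketch of the fixed parser. It assumes megaparsec >= 7 (where anyChar was replaced by anySingle) and builds symbol from Text.Megaparsec.Char.Lexer over a space consumer that eats only spaces and tabs; the Parser alias, sc and main are illustrative additions, not from the question.
import Control.Monad (void)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- Space consumer: only spaces and tabs, no comment syntax.
sc :: Parser ()
sc = L.space (void (some (oneOf " \t"))) empty empty

-- A symbol is a verbatim string followed by optional white space.
symbol :: String -> Parser String
symbol = L.symbol sc

skip' :: Parser String
skip' = s >> manyTill anySingle e
  where s = symbol "#" >> symbol "START" >> symbol "SKIP"
        e = try (symbol "#" >> symbol "END" >> symbol "SKIP")

main :: IO ()
main = parseTest skip' "# START SKIP\ndef foo(a,b):\n    c = 2*a # Foo\n    return a + b\n# END SKIP"
parseTest should print the text captured between the two markers; without the try around e, the parse would fail at the "# Foo" comment with exactly the "unexpected 'F'" error from the question.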

Related

How to deal with file ending '\' in strings haskell

import Data.Char (isAlpha)
import Data.List (elemIndex)
import Data.Maybe (fromJust)

helper = ['a'..'z'] ++ ['a'..'z'] ++ ['A'..'Z'] ++ ['A'..'Z']

rotate :: Char -> Char
rotate x | '\' = '\'
         | isAlpha(x) = helper !! (fromJust (elemIndex x helper) + 13)
         | otherwise = x

rot13 :: String -> String
rot13 "" = ""
rot13 s = map rotate s

main = do
    print $ rot13( "Hey fellow warriors" )
    print $ rot13( "This is a test")
    print $ rot13( "This is another test" )
    print $ rot13("\604099\159558\705559&\546452\390142")
    n <- getLine
    print $ rot13( show n)
This is my code for ROT13, and there is an error when I try to pass the file-ending '\' directly:
rot13.hs:8:15: error:
lexical error in string/character literal at character ' '
|
8 | rotate x | '\' = '\'
There is also an error even if I don't replace the character and just use isAlpha to filter.
How to deal with this?
As in many languages, backslash is the escape character. It's used to introduce characters that are hard or impossible to include in strings in other ways. For example, strings can't span multiple lines*, so it's impossible to include a literal newline in a string literal; and double-quotes end the string, so it's normally impossible to include a double quote in a string literal. The \n and \" escapes, respectively, cover those:
> putStrLn "before\nmiddle\"after"
before
middle"after
>
Since \ introduces escape codes, it always expects to be followed by something. If you want a literal backslash to be included at that spot, you can use a second backslash. For example:
> putStrLn "before\\after"
before\after
>
The Report, Section 2.6 is the final word on what escapes are available and what they mean.
Literal characters have a similar (though not quite identical) collection of escapes to strings. So the fix to your syntax looks like this:
rotate x | '\\' = '\\'
This will let your code parse, though there are further errors to fix once you get past that.
* Yes, yes, string gaps. I know. Doesn't actually change the point, since the newline in the gap isn't included in the resulting string.
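The answer mentions there are further errors to fix; for reference, here is one way the complete program might look once they are addressed (a sketch, not the asker's final code). The key remaining fix is that a guard must be a Bool, so the backslash case becomes an explicit comparison.
import Data.Char (isAlpha)
import Data.List (elemIndex)
import Data.Maybe (fromJust)

helper :: String
helper = ['a'..'z'] ++ ['a'..'z'] ++ ['A'..'Z'] ++ ['A'..'Z']

-- Guards must be Bool expressions; compare x against the escaped backslash.
rotate :: Char -> Char
rotate x | x == '\\' = '\\'
         | isAlpha x = helper !! (fromJust (elemIndex x helper) + 13)
         | otherwise = x

rot13 :: String -> String
rot13 = map rotate

main :: IO ()
main = do
    print (rot13 "Hey fellow warriors")   -- "Url sryybj jneevbef"
    print (rot13 "before\\after")          -- backslash passes through unchanged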

Parsec csv parser parsing extra line

I have defined the following Parsec parser for parsing csv files into a table of strings, i.e. [[String]]:
-- A csv parser is some rows separated, and possibly ended, by a newline character
csvParser = sepEndBy row (char '\n')

-- A row is some cells separated by a comma character
row = sepBy cell (char ',')

-- A cell is either a quoted cell or a normal cell
cell = qcell <|> ncell

-- A normal cell is a series of characters which are neither ',' nor newline.
-- It might also contain an escape character
ncell = many (escChar <|> noneOf ",\n")

-- A quoted cell is a " followed by some characters which are either escape
-- characters or normal characters except for "
qcell = do
    char '"'
    res <- many (escChar <|> noneOf "\"")
    char '"'
    return res

-- An escape character is a \ followed by any character. The \ will be discarded.
escChar = char '\\' >> anyChar
I don't really know if the comments are too much and annoying, or if they are helping. As a Parsec noob they would help me, so I thought I would add them.
It works pretty well, but there is a problem. It creates an extra, empty row in the table. So if I for example have a csv file with 10 rows (that is, only 10 lines, no empty lines at the end*), the [[String]] structure will have length 11 and the last list of Strings will contain one element: an empty String (at least this is how it appears when printing it using show).
My main question is: Why does this extra row appear, and what can I do to stop it?
Another thing I have noted is that if there are empty lines after the data in the csv files, these will end up as rows containing only an empty String in the table. I thought that using sepEndBy instead of sepBy would make the extra empty lines be ignored. Is this not the case?
*After looking at the text file in a hex editor, it seems that it indeed actually ends in a newline character, even though vim doesn't show it...
If you want each row to have at least one cell, you can use sepBy1 instead of sepBy. This should also stop empty rows being parsed as a row. The difference between sepBy and sepBy1 is the same as the difference between many and many1: the 1 version only parses sequences of at least one element. So row becomes this:
row = sepBy1 cell (char ',')
Also, the usual style is to use sepBy1 in infix: cell `sepBy1` char ','. This reads more naturally: you have a "cell separated by a comma" rather than "separated by cell a comma".
EDIT: If you don't want to accept empty cells, you have to specify that ncell has at least one character using many1:
ncell = many1 (escChar <|> noneOf ",\n")
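Putting the two changes together, here is a minimal runnable sketch (assuming classic Parsec's Text.Parsec over String; the type signatures and main are illustrative additions):
import Text.Parsec
import Text.Parsec.String (Parser)

-- Rows separated, and possibly ended, by a newline
csvParser :: Parser [[String]]
csvParser = row `sepEndBy` char '\n'

-- sepBy1: a row must contain at least one cell, so the file's final
-- newline no longer produces a phantom empty row
row :: Parser [String]
row = cell `sepBy1` char ','

cell :: Parser String
cell = qcell <|> ncell

-- many1: a normal cell must contain at least one character
ncell :: Parser String
ncell = many1 (escChar <|> noneOf ",\n")

qcell :: Parser String
qcell = between (char '"') (char '"') (many (escChar <|> noneOf "\""))

escChar :: Parser Char
escChar = char '\\' >> anyChar

main :: IO ()
main = parseTest csvParser "a,b\n\"c,d\",e\n"
At end of input, cell now fails without consuming anything, so sepBy1 fails cleanly and sepEndBy stops instead of producing an extra [""] row.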

Lexical analysis of string token using Parsec

I have this parser for string parsing using the Haskell Parsec library.
myStringLiteral = lexeme (
    do { str <- between (char '\'')
                        (char '\'' <?> "end of string")
                        (many stringChar)
       ; return (U.replace "''" "'" (foldr (maybe id (:)) "" str))
       }
    <?> "literal string"
    )
Strings in my language are defined as alpha-num characters inside '' (example: 'this is my string'), but these strings can also contain ' inside (in this case the ' must be escaped by another ', e.g. 'this is my string with '' inside of it').
What I need to do is look ahead when ' appears during parsing of the string and decide whether there is another ' after it (if not, report end of string). But I don't know how to do it. Any ideas? Thanks!
If the syntax is as simple as it seems, you can make a special case for the escaped single quote,
escapeOrStringChar :: Parser Char
escapeOrStringChar = try (string "''" >> return '\'') <|> stringChar
and use that in
myStringLiteral = lexeme $ do
    char '\''
    str <- many escapeOrStringChar
    char '\'' <?> "end of string"
    return str
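To try this in isolation, here is a self-contained sketch. Note it substitutes noneOf "'" for the questioner's stringChar and a plain spaces for the token-style lexeme, so it only approximates the original setup:
import Text.Parsec
import Text.Parsec.String (Parser)

-- An escaped quote '' yields a single '; anything other than ' is an ordinary character
escapeOrStringChar :: Parser Char
escapeOrStringChar = try (string "''" >> return '\'') <|> noneOf "'"

myStringLiteral :: Parser String
myStringLiteral = do
    char '\''
    str <- many escapeOrStringChar
    char '\'' <?> "end of string"
    spaces                          -- crude stand-in for the original lexeme
    return str

main :: IO ()
main = parseTest myStringLiteral "'this is my string with '' inside of it'"
The try is what makes this work: after one ', the parser looks for a second one, and if it is absent it backtracks so the closing char '\'' can match instead.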
You can use stringLiteral for that.
Parsec deals only with LL(1) languages. It means the parser can look only one symbol ahead at a time. Your language is LL(2). You can write your own FSM for parsing your language. Or you can transform the text before parsing to make it LL(1).
In fact, Parsec is designed for syntactic analysis, not lexical. A good idea is to do the lexical analysis with another tool and then use Parsec for parsing the sequence of lexemes instead of the sequence of chars.

Am I going crazy with the Perl string comparison operator? Debug log included

I must be missing something about variable assignment or string comparison. I have a script that's going through a tab-separated file. Unless one particular value in a row is "P", I want to skip to the next line. The code looks like:
1 print "Processing inst_report file...\n";
2 foreach (@inst_report_file) {
3     @line = split(/\t/);
4     ($line[13] ne "P") && next;
5     $inst_report{$line[1]}++;
6 }
For some reason, the script would never get to Line 5 even though there were clearly lines with "P" in it.
So debug time!
# Continuing to the breakpoint.
DB<13> c
main::(count.pl:27): ($line[13] ne "P") && next;
# Proving that this particular array element is indeed "P" with no leading or trailing characters.
DB<13> p "--$line[13]--\n";
--P--
# Proving that I'm not crazy and the Perl string comparison operator really works.
DB<14> p ("P" eq "P");
1
# Now since we've shown that $line[13] eq P, let's run that Boolean again.
DB<15> p ($line[13] eq "P")
# (Blank means FALSE) Whaaaat?
# Let's manually set $line[13]
DB<16> $line[13]="P"
# Now let's try that comparison again...
DB<17> p ($line[13] eq "P")
1
DB<18>
# Now it works. Why?
I can work around this by prefiltering the input file, but it bothers me why this doesn't work. Am I missing something obvious?
---loren---
Find out what your string really is using:
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
print(Dumper($line[13]));
[ On further review, the guesses below are most likely incorrect. ]
I suspect you have a trailing newline, in which case you want chomp.
You could also have trailing spaces. s/\s+\z// will remove both trailing spaces and a trailing newline.
Have you tried printing out the string characters with ord?
say ord for (split //, $line[13]);
If, for example, you have a \0 in there, it might not show up in a regular print. With the string P\0, I get:
$ perl -wE '$a="P\0"; say "--$a--"; say ord for (split //, $a);'
--P--
80
0
Unless there are unprintable characters in the input, it's not clear why your code doesn't work. Having said that, I would still write that statement as:
next unless $line[13] eq "P";
or
next unless $line[13] =~ /^P$/; (Theoretically this could be faster.)
You will not need to pre-filter the data.
Are you sure $line[13] isn't supposed to be $line[12]?

Wrapping strings, but not substrings in quotes, using R

This question is related to my question about Roxygen.
I want to write a new function that does word wrapping of strings, similar to strwrap or stringr::str_wrap, but with the following twist: Any elements (substrings) in the string that are enclosed in quotes must not be allowed to wrap.
So, for example, using the following sample data
test <- "function(x=123456789, y=\"This is a long string argument\")"
cat(test)
function(x=123456789, y="This is a long string argument")
strwrap(test, width=40)
[1] "function(x=123456789, y=\"This is a long"
[2] "string argument\")"
I want the desired output of a newWrapFunction(x, width=40, ...) to be:
desired <- c("function(x=123456789, ", "y=\"This is a long string argument\")")
desired
[1] "function(x=123456789, "
[2] "y=\"This is a long string argument\")"
identical(desired, newWrapFunction(test, width=40))
[1] TRUE
Can you think of a way to do this?
PS. If you can help me solve this, I will propose this code as a patch to roxygen2. I have identified where this patch should be applied and will acknowledge your contribution.
Here's what I did to get strwrap to stop breaking single-quoted sections on spaces:
A) Pre-process the "even" sections after splitting by the single-quotes by substituting "~|~" for the spaces:
Define new function strwrapqt
....
zz <- strsplit(x, "\'")  # will only be working on even-numbered sections
for (i in seq_along(zz)) {
    for (evens in seq(2, length(zz[[i]]), by=2)) {
        zz[[i]][evens] <- gsub("[ ]", "~|~", zz[[i]][evens])
    }
}
zz <- unlist(zz)
.... insert just before
z <- lapply(strsplit) ...........
Then at the end replace all the "~|~" with spaces. It might be necessary to do a lot more thinking about the other sorts of whitespace "events" to get a fully regular treatment.
....
y <- gsub("~\\|~", " ", y)
....
Edit: Tested @joran's suggestion. Matching single and double quotes would be a difficult task with the methods I am using, but if one were willing to consider any quote as equally valid as a separator target, one could just use zz <- strsplit(x, "\'|\"") as the splitting criterion in the code above.
