Lexical analysis of string tokens using Parsec - Haskell

I have this parser for string parsing using Haskell Parsec library.
myStringLiteral = lexeme (
    do { str <- between (char '\'')
                        (char '\'' <?> "end of string")
                        (many stringChar)
       ; return (U.replace "''" "'" (foldr (maybe id (:)) "" str))
       }
    <?> "literal string"
  )
Strings in my language are defined as alphanumeric characters inside single quotes (example: 'this is my string'), but these strings can also contain ' inside them; in that case the ' must be escaped by another ', e.g. 'this is my string with '' inside of it'.
What I need to do is look ahead when ' appears while parsing a string and decide whether another ' follows (if not, it is the end of the string). But I don't know how to do it. Any ideas? Thanks!

If the syntax is as simple as it seems, you can make a special case for the escaped single quote,
escapeOrStringChar :: Parser Char
escapeOrStringChar = try (string "''" >> return '\'') <|> stringChar
and use that in
myStringLiteral = lexeme $ do
    char '\''
    str <- many escapeOrStringChar
    char '\'' <?> "end of string"
    return str
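For reference, here is a self-contained version of the above that can be loaded into GHCi. The stringChar and lexeme helpers are hypothetical stand-ins (any non-quote character, and trailing-space skipping, respectively), since the question doesn't show their definitions:

import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical stand-in: any character except the single quote.
stringChar :: Parser Char
stringChar = noneOf "'"

-- Hypothetical stand-in: run a parser, then skip trailing spaces.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

escapeOrStringChar :: Parser Char
escapeOrStringChar = try (string "''" >> return '\'') <|> stringChar

myStringLiteral :: Parser String
myStringLiteral = lexeme $ do
    char '\''
    str <- many escapeOrStringChar
    char '\'' <?> "end of string"
    return str

main :: IO ()
main = parseTest myStringLiteral "'this is my string with '' inside of it'"
-- expected output: "this is my string with ' inside of it"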

You can use stringLiteral for that.
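A sketch of what that might look like, using the token-parser machinery from Text.Parsec.Token; note, as a caveat, that stringLiteral parses Haskell-style double-quoted strings with backslash escapes, not the single-quoted, doubled-quote-escaped strings from the question:

import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as Tok
import Text.Parsec.Language (haskellDef)

lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser haskellDef

-- Parses a double-quoted string literal, handling standard escapes.
stringLit :: Parser String
stringLit = Tok.stringLiteral lexer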

Parsec by default handles LL(1) grammars: without try, the parser commits after one symbol of lookahead, and your string syntax needs two (LL(2)). You can write your own FSM for parsing your language, or you can transform the text before parsing to make it LL(1).
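For illustration, the transform-first idea could be as simple as a prepass that rewrites the doubled quote to a sentinel character assumed not to occur in the input (here '\0'), ignoring edge cases such as the empty literal '':

-- Rewrite the escaped quote '' to a sentinel before parsing, so the
-- parser no longer needs two symbols of lookahead at each quote.
preprocess :: String -> String
preprocess ('\'' : '\'' : rest) = '\0' : preprocess rest
preprocess (c : rest)           = c    : preprocess rest
preprocess []                   = []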
In fact, Parsec is designed for syntactic analysis, not lexical analysis. A good approach is to do the lexical analysis with another tool and then use Parsec to parse the sequence of lexemes instead of the sequence of characters.

Related

How to deal with file ending '\' in strings in Haskell

import Data.Char (isAlpha)
import Data.List (elemIndex)
import Data.Maybe (fromJust)

helper = ['a'..'z'] ++ ['a'..'z'] ++ ['A'..'Z'] ++ ['A'..'Z']

rotate :: Char -> Char
rotate x | '\' = '\'
         | isAlpha(x) = helper !! (fromJust (elemIndex x helper) + 13)
         | otherwise = x

rot13 :: String -> String
rot13 "" = ""
rot13 s = map rotate s

main = do
    print $ rot13( "Hey fellow warriors" )
    print $ rot13( "This is a test")
    print $ rot13( "This is another test" )
    print $ rot13("\604099\159558\705559&\546452\390142")
    n <- getLine
    print $ rot13( show n)
This is my code for ROT13, and there is an error when I try to handle the '\' character directly:
rot13.hs:8:15: error:
lexical error in string/character literal at character ' '
|
8 | rotate x | '\' = '\'
There is also an error even if I don't handle '\' at all and just use isAlpha to filter.
How to deal with this?
As in many languages, backslash is the escape character. It's used to introduce characters that are hard or impossible to include in strings in other ways. For example, strings can't span multiple lines*, so it's impossible to include a literal newline in a string literal; and double quotes end the string, so it's normally impossible to include a double quote in a string literal. The \n and \" escapes, respectively, cover those:
> putStrLn "before\nmiddle\"after"
before
middle"after
>
Since \ introduces escape codes, it always expects to be followed by something. If you want a literal backslash to be included at that spot, you can use a second backslash. For example:
> putStrLn "before\\after"
before\after
>
The Report, Section 2.6 is the final word on what escapes are available and what they mean.
Literal characters have a similar (though not quite identical) collection of escapes to strings. So the fix to your syntax looks like this:
rotate x | '\\' = '\\'
This will let your code parse, though there are further errors to fix once you get past that.
* Yes, yes, string gaps. I know. Doesn't actually change the point, since the newline in the gap isn't included in the resulting string.
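For completeness, the guard itself also needs a Boolean condition (one of the "further errors" mentioned above). One possible final form, with the imports and helper from the question in scope:

rotate :: Char -> Char
rotate x | x == '\\' = '\\'   -- compare against the escaped backslash
         | isAlpha x = helper !! (fromJust (elemIndex x helper) + 13)
         | otherwise = x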

What does "?\s" mean in Elixir?

In the Elixir documentation covering comprehensions I ran across the following example:
iex> for <<c <- " hello world ">>, c != ?\s, into: "", do: <<c>>
"helloworld"
I sort of understand the whole expression now, but I can't figure out what the "?\s" means.
I know that it somehow matches and thus filters out the spaces, but that's where my understanding ends.
Edit: I have now figured out that it resolves to 32, which is the character code of a space, but I still don't know why.
Erlang has char literals, denoted by a dollar sign:
Erlang/OTP 22 [erts-10.6.1] [...]
Eshell V10.6.1 (abort with ^G)
1> $\s == 32.
%%⇒ true
In the same way, Elixir has char literals that, according to the code documentation, act exactly like Erlang char literals:
This is exactly what Erlang does with Erlang char literals ($a).
Basically, ?\s is exactly the same as ? followed by a space (question mark, then a literal space character):
#               ⇓ space here
iex|1 ▶ ?\s == ? 
warning: found ? followed by code point 0x20 (space), please use ?\s instead
true
There is nothing special with ?\s, as you might see:
for <<c <- " hello world ">>, c != ?o, into: "", do: <<c>>
#⇒ " hell wrld "
Ruby also uses the ?c notation for char literals:
main> ?\s == ' '
#⇒ true
? is a literal that gives you the following character's codepoint (see https://elixir-lang.org/getting-started/binaries-strings-and-char-lists.html#utf-8-and-unicode). For characters that cannot be written literally (space is just one of them; there are more: tab, carriage return, ...) the escape sequence should be used instead. So ?\s gives you the codepoint for space:
iex> ?\s
32

Read A String Exactly As It Is in Haskell

My program is something like that:
func = do
    text <- getLine
    return text
If I read the line \123\456, the result is, naturally, \\123\\456.
How can I obtain \123\456 as the result?
Based on the discussion in comments, it looks like you want to parse the string as if it was a string literal, except that it is not surrounded by quotes.
We can make use of read :: Read a => String -> a here, which for a string parses it as if it were a string literal. The only problem is that such a string literal is surrounded by double quotes (").
We can thus add these quotes, and work with:
read ('"' : text ++ "\"") :: String
Not every string text is a valid string literal per se, however, so the above might fail; for example, if text itself contains a double quote that is not directly preceded by a backslash (\).
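A minimal sketch of that approach, using readMaybe from Text.Read so that an invalid literal yields Nothing instead of a runtime error:

import Text.Read (readMaybe)

-- Interpret the raw input as the body of a string literal by wrapping
-- it in double quotes and parsing it with the Read instance for String.
unescape :: String -> Maybe String
unescape text = readMaybe ('"' : text ++ "\"")

main :: IO ()
main = do
    text <- getLine
    case unescape text of
        Just s  -> putStrLn s
        Nothing -> putStrLn "input is not a valid string literal body"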

Parsing block comments with Megaparsec using symbols for start and end

I want to parse text similar to this in Haskell using Megaparsec.
# START SKIP
def foo(a,b):
    c = 2*a # Foo
    return a + b
# END SKIP
, where # START SKIP and # END SKIP mark the start and end of the block of text to parse.
Compared to skipBlockComment I want the parser to return the lines between the start and end markers.
This is my parser.
skip :: Parser String
skip = s >> manyTill anyChar e
  where
    s = string "# START SKIP"
    e = string "# END SKIP"
The skip parser works as intended.
To allow for a variable amount of white space within the start and end markers, for example #   START   SKIP, I've tried the following:
skip' :: Parser String
skip' = s >> manyTill anyChar e
  where
    s = symbol "#" >> symbol "START" >> symbol "SKIP"
    e = symbol "#" >> symbol "END" >> symbol "SKIP"
Using skip' to parse the above text gives the following error.
3:15:
unexpected 'F'
expecting "END", space, or tab
I would like to understand the cause of this error and how I can fix it.
As Alec already commented, the problem is that as soon as e encounters '#', it counts as a consumed character. And the way parsec and its derivatives work is that as soon as you've consumed any characters, you're committed to that parsing branch – i.e. the manyTill anyChar alternative is then not considered anymore, even though e ultimately fails here.
You can easily request backtracking though, by wrapping the end delimiter in try:
skip' :: Parser String
skip' = s >> manyTill anyChar e
  where
    s = symbol "#" >> symbol "START" >> symbol "SKIP"
    e = try $ symbol "#" >> symbol "END" >> symbol "SKIP"
This will then set a “checkpoint” before consuming '#', and when e fails later on (in your example, at "Foo") it will act as if no characters had matched at all.
In fact, traditional Parsec would give the same (failing) behaviour even for the original skip. It's just that looking for a fixed string and succeeding only if it matches entirely is such a common task that megaparsec's string is implemented like try . string: if the failure occurs within that fixed string, it will always backtrack.
However, compound parsers still don't backtrack by default, like they do in attoparsec. The main reason is that if anything can backtrack to any point, you can't really get a clear point of failure to show in the error message.
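Putting it together, here is a self-contained sketch, assuming a recent Megaparsec where anyChar has become anySingle and symbol comes from Text.Megaparsec.Char.Lexer (on older versions the names differ slightly):

import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- A symbol parser that eats trailing white space after each token.
symbol :: String -> Parser String
symbol = L.symbol space

skip' :: Parser String
skip' = s >> manyTill anySingle e
  where
    s = symbol "#" >> symbol "START" >> symbol "SKIP"
    e = try $ symbol "#" >> symbol "END" >> symbol "SKIP"

main :: IO ()
main = parseTest skip'
    "# START SKIP\ndef foo(a,b):\n    c = 2*a # Foo\n    return a + b\n# END SKIP"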

Pushing back tokens in Happy and Alex

I'm parsing a language which has both < and <<. My Alex definition contains something like
tokens :-

    "<"  { token Lt }
    "<<" { token (BinOp Shl) }
so whenever I encounter <<, it gets tokenized as a left shift and not as two less-thans. This is generally a good thing, since I end up throwing out whitespace after tokenization and want to differentiate between 1 < < 2 and 1 << 2. However, there are other times I wish << had been read as two <. For example, I have things like
<<A>::B>
which I want read like
< < A > :: B >
Obviously I can try to adjust my Happy parser rules to accommodate the extra cases, but that scales badly. In other imperative parser generators, I might try to push back "part" of the token (something like push_back("<") when I encountered << but only needed <).
Has anyone else had such a problem and, if so, how did you deal with it? Are there ways of "pushing back" tokens in Happy? Or should I instead try to keep a whitespace token around? (I'm actually leaning towards that last alternative - although a huge headache, it would let me deal with << by just making sure there is no whitespace between the two <.)
I don’t know how to express this in Happy, but you don’t need a separate “whitespace” token. You can parse < or > as a distinct “angle bracket” token when immediately followed by an operator symbol in the input, with no intervening whitespace.
Then, when you want to parse an operator, you join a sequence of angles and operators into a single token. When you want to treat them as brackets, you just deal with them separately as usual.
So a << b would be tokenised as:
identifier "a"
left angle -- joined with following operator
operator "<"
identifier "b"
When parsing an operator, you concatenate angle tokens with the following operator token, producing a single operator "<<" token.
<<A>::B> would be tokenised as:
left angle
operator "<" -- accepted as bracket
identifier "A"
right angle
operator "::"
identifier "B"
operator ">" -- accepted as bracket
When parsing angle-bracketed terms, you accept both angle tokens and </> operators.
This relies on your grammar not being ambiguous wrt. whether you should parse an operator name or a bracketed thing.
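A toy model of that classification step (plain Haskell, not actual Alex code; the Token type and the operator-character set are made up for the illustration):

import Data.Char (isAlphaNum, isSpace)

data Token = TIdent String | TLeftAngle | TRightAngle | TOp String
    deriving (Eq, Show)

-- Hypothetical set of characters that may appear in operators.
isOpChar :: Char -> Bool
isOpChar c = c `elem` "<>=:"

-- '<' and '>' become angle tokens only when immediately followed by
-- another operator character, with no intervening white space.
lexAngles :: String -> [Token]
lexAngles [] = []
lexAngles input@(c:cs)
    | isSpace c                       = lexAngles cs
    | c == '<', d:_ <- cs, isOpChar d = TLeftAngle  : lexAngles cs
    | c == '>', d:_ <- cs, isOpChar d = TRightAngle : lexAngles cs
    | isOpChar c = let (op, rest) = span isOpChar input
                   in TOp op : lexAngles rest
    | isAlphaNum c = let (w, rest) = span isAlphaNum input
                     in TIdent w : lexAngles rest
    | otherwise = lexAngles cs

-- lexAngles "<<A>::B>"
--   == [TLeftAngle, TOp "<", TIdent "A", TRightAngle,
--       TOp "::", TIdent "B", TOp ">"]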
While I initially went with @Jon's answer, I ended up running into a variety of precedence-related issues (think precedence around expr < expr vs expr << expr) which caused me a lot of headaches. I recently (successfully) switched back to lexing << as one token. The solution was twofold:
I bit the bullet and added extra rules for << (where previously I only had rules for <). For the example in the question (<<A>::B>) my rule went from something like
ty_qual_path
  : '<' ty_sum '>' '::' ident
to
ty_qual_path
  : '<' ty_sum '>' '::' ident
  | '<<' ty_sum '>' '::' ident '>' '::' ident
(The actual rule was a bit more involved, but that is beyond the scope of this answer.)
I found a clever way to deal with tokens that start with > (these would cause problems around things like vector<i32,vector<i32>>, where the last >> is one token): use a threaded lexer (section 2.5.2), exploit the {%% ... } RHS of rules, which lets you reconsider the lookahead token, and add a pushToken facility to my parser monad (this turned out to be quite simple - here is exactly what I did). I then added a dummy rule - something like
gt :: { () }
  : {- empty -}   {%% \tok ->
      case tok of
        Tok ">>"  -> pushToken (Tok ">")  *> pushToken (Tok ">")
        Tok ">="  -> pushToken (Tok "=")  *> pushToken (Tok ">")
        Tok ">>=" -> pushToken (Tok ">=") *> pushToken (Tok ">")
        _         -> pushToken tok
    }
And every time some other rule expected a > but there could also be any other token starting with >, I would precede the > token with gt. This has the effect of looking ahead at the next token, which could start with > without being >, and trying to split that token into one > token and another token for the "rest" of the initial token.
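To illustrate the usage with a hypothetical rule (generic_args and Ty are made up, not from the actual grammar): wherever a lone > is expected but the lexer might have produced >>, >=, or >>=, the > in the rule is preceded by gt, which splits the lookahead token first:

-- Hypothetical example: generic arguments closed by '>' that may
-- arrive glued together as '>>' (as in vector<i32,vector<i32>>).
generic_args :: { [Ty] }
  : '<' ty_list gt '>'    { $2 }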
