Haskell: Matching String Prefixes against List

I've been learning some Haskell lately, and I thought a lexer might be a fun project. I'm using this ANSI C Yacc grammar as a guide.
The general program structure is:
lex :: [Char] -> Maybe [Token]
lex s =
  case tokenize ([], s) of
    Just (tokens, []) -> Just tokens
    _ -> Nothing
tokenize :: ([Token], [Char]) -> Maybe ([Token], [Char])
Where tokenize builds a list of tokens. I'm having trouble thinking of a suitable structure for tokenize. For example, to match keywords like int, I could write:
tokenize (toks, 'i':'n':'t':' ':rest) = tokenize (toks++[TokenKeyword IntK], rest)
But this seems like a terrible way to do things. Is there a way to pattern match against elements in a list? Could I create a list of all keywords, and attempt to match them as prefixes of the input string?

If you want to match based on a string prefix, you could use the ViewPatterns extension. This extension can be enabled by passing -XViewPatterns to the compiler, by running :set -XViewPatterns in ghci, or by putting {-# LANGUAGE ViewPatterns #-} at the top of the file.
Then you can write a function matchPrefix (not 100% optimal, as it iterates over the prefix twice):
import Data.List (isPrefixOf)
-- Data.List.stripPrefix provides the same behaviour in a single pass.
matchPrefix :: String -> String -> Maybe String
matchPrefix prefix result
  | prefix `isPrefixOf` result = Just (drop (length prefix) result)
  | otherwise = Nothing
And then use it in a pattern like the following:
startsWithInt :: String -> Bool
startsWithInt (matchPrefix "int " -> Just rest) = True
startsWithInt _ = False
If you want to match against a list of keywords and get back both the keyword that matched and the rest of the string, you could generalize matchPrefix to try each keyword in turn.
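A minimal sketch of that generalization, using Data.List.stripPrefix in place of a hand-written matchPrefix. The Keyword constructors, the keywords table and the name matchKeyword are illustrative assumptions, not part of the original question:

```haskell
import Data.List (stripPrefix)
import Data.Maybe (listToMaybe)

-- Hypothetical token types, loosely following the question.
data Keyword = IntK | IfK | ReturnK
  deriving (Show, Eq)

newtype Token = TokenKeyword Keyword
  deriving (Show, Eq)

-- Keyword spellings paired with their tokens (an illustrative subset).
keywords :: [(String, Keyword)]
keywords = [("int", IntK), ("if", IfK), ("return", ReturnK)]

-- Try every keyword as a prefix of the input; return the first match
-- together with the remaining input, or Nothing if no keyword matches.
matchKeyword :: String -> Maybe (Token, String)
matchKeyword input = listToMaybe
  [ (TokenKeyword kw, rest)
  | (spelling, kw) <- keywords
  , Just rest <- [stripPrefix spelling input]
  ]
```

Note that this sketch would also accept "integer" as the keyword int followed by "eger"; a real lexer would additionally check that the character after the keyword cannot continue an identifier.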

Related

How to pattern match an abstract data type when the data constructor isn't in scope?

I'm writing a parser library using Parsec combinators, and I want to unit test some of my parsers. So I have a simple parser:
dash :: GenParser Char st Char
dash = char '-'
I'd like to write some tests for it. The positive test is pretty easy:
spec :: Spec
spec = do
  describe "dash" $ do
    it "parses a dash" $
      parse dash "N/A" "-" `shouldBe` (Right '-')
I'd like to write a negative test as well. When the parser doesn't match, it returns Left of a ParseError. I'd like to write a test that validates the exact message that the ParseError contains. So what I'd really like to do is something like
spec :: Spec
spec = do
  describe "dash" $ do
    it "doesn't parse an underscore" $
      parse dash "N/A" "_" `shouldSatisfy` (hasErrorMessage "not a dash")
hasErrorMessage (Left (ParseError _ msgs)) expected = msg == expected
hasErrorMessage _ expected = False
But I'm having trouble writing this sort of code, since the ParseError data constructor isn't exported from Text.Parsec.Error.
Is there any way to use pattern matching on types where no data constructor for the type is in scope?
I know I could write hasErrorMessage something like
hasErrorMessage :: String -> (Either ParseError a) -> Bool
hasErrorMessage expected (Left pe) = elem expected $ fmap messageString (errorMessages pe)
hasErrorMessage _ (Right _) = False
but I'd like to understand this nuance, too.
Although the data constructor isn't exported, functions to access its parameters are. You can use these in combination with view patterns to sort of get what you want. In your case, the pattern (errorMessages -> msgs) can stand in almost perfectly for (ParseError _ msgs), with two caveats:
You need {-# LANGUAGE ViewPatterns #-} to use this feature.
errorMessages sorts the messages, which a pattern match on the data constructor wouldn't do.
You can even use this technique with pattern synonyms to make a fake data constructor, so you can use the exact same syntax you would otherwise:
{-# LANGUAGE PatternSynonyms, ViewPatterns #-}
import Text.Parsec.Error (addErrorMessage, errorMessages, errorPos, newErrorUnknown)
pattern ParseError pos msgs <- ((,) <$> errorPos <*> errorMessages -> (pos, msgs)) where
  ParseError pos msgs = foldr addErrorMessage (newErrorUnknown pos) msgs
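The same trick works for any abstract type whose accessors are exported. As a small sketch that runs without Parsec, here it is applied to Data.Map (from the containers package that ships with GHC), whose constructors are likewise not exported from Data.Map. The function describeMap is a made-up example:

```haskell
{-# LANGUAGE ViewPatterns #-}
import qualified Data.Map as Map

-- Data.Map does not export Map's data constructors, but its accessor
-- functions can be applied inside view patterns.
describeMap :: Map.Map String Int -> String
describeMap (Map.toList -> [])       = "empty"
describeMap (Map.toList -> [(k, v)]) = "single entry: " ++ k ++ " = " ++ show v
describeMap (Map.size -> n)          = show n ++ " entries"
```

Each clause applies the function on the left of the arrow to the argument and then matches the result against the pattern on the right, so the match is against the function's result rather than the hidden representation.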

Converting Paulson's parser combinators to Haskell

I am trying to convert the code from Paulson's ML for the working programmer book chapter 9, Writing Interpreters for the λ-Calculus.
I was wondering if anyone can help me translate this to Haskell.
I'm struggling to understand the syntax.
fun list ph = ph -- repeat ("," $-- ph) >> (op::);
fun pack ph = "(" $-- list ph --$")" >> #1
| empty;
In porting this code to Haskell, I see two challenges: One is rewriting the combinators so they use the type Either SyntaxError rather than exceptions for flow control, and the other is preserving the modularity of ML's functors. That is, writing a parser combinator library that is modular with regards to what keywords / symbols / tokenizer it should use.
While the ML code has the two
functor Lexical (Keyword: KEYWORD) : LEXICAL
functor Parsing (Lex: LEXICAL) : PARSE
you could start by having
data Keyword = Keyword
  { alphas :: [String]
  , symbols :: [String]
  }
data Token
  = Key String
  | Id String
  deriving (Show, Eq)
lex :: Keyword -> String -> [Token]
lex kw s = ...
  where
    alphaTok :: String -> Token
    alphaTok a | a `elem` alphas kw = Key a
               | otherwise = Id a
    ...
The ML code uses the types string and substring while Haskell's String is actually a [Char]. The lexer functions would look a little different because ML's String.getc could simply be the pattern match c : ss1 in Haskell, etc.
Paulson's parsers have type [Token] → (τ, [Token]) but allow for exceptions. The Haskell parsers could have type [Token] → Either SyntaxError (τ, [Token]):
newtype SyntaxError = SyntaxError String
  deriving Show
newtype Parser a = Parser { runParser :: [Token] -> Either SyntaxError (a, [Token]) }
err :: String -> Either SyntaxError b
err msg = Left (SyntaxError msg)
The operators id, $, ||, !!, -- and >> need new names, since they collide with a bunch of built-in operators and single-line comments. Ideas for names could be: ident, kw, |||, +++ and >>>. I would skip implementing the !! operator initially.
Here are two combinators implemented a little differently,
ident :: Parser String
ident = Parser f
  where
    f :: [Token] -> Either SyntaxError (String, [Token])
    f (Id x : toks) = Right (x, toks)
    f (Key x : _) = err $ "Identifier expected, got keyword '" ++ x ++ "'"
    f [] = err "Identifier expected, got EOF"
(+++) :: Parser a -> Parser b -> Parser (a, b)
(+++) pa pb = Parser $ \toks1 -> do
  (x, toks2) <- runParser pa toks1
  (y, toks3) <- runParser pb toks2
  return ((x, y), toks3)
...
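To show how the pieces could fit together, here is a self-contained sketch that goes on to translate Paulson's list combinator. The definitions of kw, >>>, ||| and many' (standing in for Paulson's repeat) are my own guesses at a reasonable port, with simplified error messages, and are not taken from the book:

```haskell
data Token = Key String | Id String
  deriving (Show, Eq)

newtype SyntaxError = SyntaxError String
  deriving (Show, Eq)

newtype Parser a = Parser { runParser :: [Token] -> Either SyntaxError (a, [Token]) }

-- Match one specific keyword or symbol (Paulson's $).
kw :: String -> Parser String
kw k = Parser f
  where
    f (Key x : toks) | x == k = Right (x, toks)
    f _ = Left (SyntaxError ("Expected '" ++ k ++ "'"))

-- Match any identifier (Paulson's id).
ident :: Parser String
ident = Parser f
  where
    f (Id x : toks) = Right (x, toks)
    f _ = Left (SyntaxError "Identifier expected")

-- Sequencing (Paulson's --): run both parsers and pair the results.
(+++) :: Parser a -> Parser b -> Parser (a, b)
pa +++ pb = Parser $ \toks1 -> do
  (x, toks2) <- runParser pa toks1
  (y, toks3) <- runParser pb toks2
  return ((x, y), toks3)

-- Apply a function to the parse result (Paulson's >>).
(>>>) :: Parser a -> (a -> b) -> Parser b
pa >>> g = Parser $ \toks -> do
  (x, rest) <- runParser pa toks
  return (g x, rest)

-- Alternative (Paulson's ||): try the second parser if the first fails.
(|||) :: Parser a -> Parser a -> Parser a
pa ||| pb = Parser $ \toks ->
  either (const (runParser pb toks)) Right (runParser pa toks)

-- Zero or more repetitions (Paulson's repeat); never fails.
many' :: Parser a -> Parser [a]
many' p = Parser go
  where
    go toks = case runParser p toks of
      Left _ -> Right ([], toks)
      Right (x, rest) -> do
        (xs, rest') <- go rest
        return (x : xs, rest')

-- Paulson's: fun list ph = ph -- repeat ("," $-- ph) >> (op::);
list :: Parser a -> Parser [a]
list ph = (ph +++ many' ((kw "," +++ ph) >>> snd)) >>> uncurry (:)
```

For example, runParser (list ident) applied to the tokens [Id "a", Key ",", Id "b"] yields Right (["a","b"], []). Be aware that many' loops forever if given a parser that succeeds without consuming input.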
Some final remarks:
Read the paper Monadic Parsing in Haskell (Hutton, Meijer).
You may be interested in SimpleParse by Ken Friis Larsen, an educational parser combinator library that is a simplification of ReadP by Koen Claessen, since its source code is very easy to read. They are both non-deterministic.
If you're interested in using parser combinators in Haskell, rather than porting some old-fashioned library for the learning experience, I encourage you to look at Megaparsec (tutorial), a modern fork of Parsec. The implementation is a little complex.
None of these three libraries (SimpleParse, ReadP, Megaparsec) split lexing and parsing into two separate steps. Rather, they simply build small tokenizing parsers that implicitly eat meaningless whitespace. (See the token combinator in SimpleParse, for example.) However, Megaparsec does allow an arbitrary token type, whether that is Char or some token you have lexed.

converting a list of string into a list of tuples in Haskell

I have a list of strings:
[" ix = index"," ctr = counter"," tbl = table"]
and I want to create a tuple from it like:
[("ix","index"),("ctr","counter"),("tbl","table")]
I even tried:
genTuple [] = []
genTuples (a:as)= do
  i<-splitOn '=' a
  genTuples as
  return i
Any help would be appreciated.
Thank you.
Haskell's type system is really expressive, so I suggest thinking about the problem in terms of types. The advantage of this is that you can solve the problem 'top-down' and the whole program can be typechecked as you go, so you can catch all kinds of errors early on. The general approach is to incrementally divide the problem into smaller functions, each of which remains undefined initially but has some plausible type.
What you want is a function (let's call it convert) which takes a list of strings and generates a list of tuples, i.e.
convert :: [String] -> [(String, String)]
convert = undefined
It's clear that each string in the input list will need to be parsed into a 2-tuple of strings. However, it's possible that the parsing can fail - the sheer type String makes no guarantees that your input string is well formed. So your parse function maybe returns a tuple. We get:
parse :: String -> Maybe (String, String)
parse = undefined
We can immediately plug this into our convert function using mapMaybe:
import Data.Maybe (mapMaybe)
convert :: [String] -> [(String, String)]
convert list = mapMaybe parse list
So far, so good - but parse is literally still undefined. Let's say that it should first verify that the input string is 'valid', and if it is - it splits it. So we'll need
valid :: String -> Bool
valid = undefined
split :: String -> (String, String)
split = undefined
Now we can define parse:
parse :: String -> Maybe (String, String)
parse s | valid s = Just (split s)
        | otherwise = Nothing
What makes a string valid? Let's say it has to contain a = sign:
valid :: String -> Bool
valid s = '=' `elem` s
For splitting, we'll take all the characters up to the first = for the first tuple element, and the rest for the second. However, you probably want to trim leading/trailing whitespace as well, so we'll need another function. For now, let's make it a no-op
trim :: String -> String
trim = id
Using this, we can finally define
split :: String -> (String, String)
split s = (trim a, trim (tail b))
  where
    (a, b) = span (/= '=') s
Note that we can safely call tail here because we know that b is never empty because there's always a separator (that's what valid verified). Type-wise, it would've been nice to express this guarantee using a "non-empty string" but that may be a bit overengineered. :-)
Now, there are a lot of solutions to the problem; this is just one example (and there are ways to shorten the code using eta reduction or existing libraries). The main point I'm trying to get across is that Haskell's type system allows you to approach the problem in a way which is directed by types, which means the compiler helps you flesh out a solution from the very beginning.
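For reference, here is one way the finished program might look once trim is filled in. This is a sketch, not the only possible end state: it folds valid and split into a single pattern match (which avoids the call to tail), and the name parsePair is my own:

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)
import Data.Maybe (mapMaybe)

-- Strip leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- Split at the first '='; fail if the string contains none.
parsePair :: String -> Maybe (String, String)
parsePair s = case span (/= '=') s of
  (a, _ : b) -> Just (trim a, trim b)
  _          -> Nothing

-- Keep every line that parses, silently dropping malformed ones.
convert :: [String] -> [(String, String)]
convert = mapMaybe parsePair
```

Applied to the list from the question, convert returns [("ix","index"),("ctr","counter"),("tbl","table")].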
You can do it like this:
import Control.Monad
import Data.List
import Data.List.Split
map ((\[a,b] -> (a,b)) . splitOn "=" . filter (/=' ')) [" ix = index"," ctr = counter"," tbl = table"]

How to return a polymorphic type in Haskell based on the results of string parsing?

TL;DR:
How can I write a function which is polymorphic in its return type? I'm working on an exercise where the task is to write a function which is capable of analyzing a String and, depending on its contents, generate either a Vector [Int], Vector [Char] or Vector [String].
Longer version:
Here are a few examples of how the intended function would behave:
The string "1 2\n3 4" would generate a Vector [Int] that's made up of two lists: [1,2] and [3,4].
The string "'t' 'i' 'c'\n't' 'a' 'c'\n't' 'o' 'e'" would generate a Vector [Char] (i.e., made up of the lists "tic", "tac" and "toe").
The string "\"hello\" \"world\"\n\"monad\" \"party\"" would generate a Vector [String] (i.e., ["hello","world"] and ["monad","party"]).
Error-checking/exception handling is not a concern for this particular exercise. At this stage, all testing is done purely, i.e., this isn't in the realm of the IO monad.
What I have so far:
I have a function (and new datatype) which is capable of classifying a string. I also have functions (one for each Int, Char and String) which can convert the string into the necessary Vector.
My question: how can I combine these three conversion functions into a single function?
What I've tried:
It obviously doesn't typecheck if I stuff the three conversion functions into a single function (i.e., using a case..of structure to pattern match on the VectorType of the string).
I tried making a Vectorable class and defining a separate instance for each type; I quickly realized that this approach only works if the functions' arguments vary by type. In our case, the type of the argument doesn't vary (i.e., it's always a String).
My code:
A few comments
Parsing: the mySplitter object and the mySplit function handle the parsing. It's admittedly a crude parser based on the Splitter type and the split function from Data.List.Split.Internals.
Classifying: The classify function is capable of determining the final VectorType based on the string.
Converting: The toVectorNumber, toVectorChar and toVectorString functions are able to convert a string to type Vector [Int], Vector [Char] and Vector [String], respectively.
As a side note, I'm trying out CorePrelude based on a recommendation from a mentor. That's why you'll see me use the generalized versions of the normal Prelude functions.
Code:
import qualified Prelude
import CorePrelude
import Data.Foldable (concat, elem, any)
import Control.Monad (mfilter)
import Text.Read (read)
import Data.Char (isAlpha, isSpace)
import Data.List.Split (split)
import Data.List.Split.Internals (Splitter(..), DelimPolicy(..), CondensePolicy(..), EndPolicy(..), Delimiter(..))
import Data.Vector ()
import qualified Data.Vector as V
data VectorType = Number | Character | TextString deriving (Show)
mySplitter :: [Char] -> Splitter Char
mySplitter elts = Splitter { delimiter = Delimiter [(`elem` elts)]
                           , delimPolicy = Drop
                           , condensePolicy = Condense
                           , initBlankPolicy = DropBlank
                           , finalBlankPolicy = DropBlank }
mySplit :: [Char]-> [Char]-> [[Char]]
mySplit delims = split (mySplitter delims)
classify :: String -> VectorType
classify xs
  | '\"' `elem` cs = TextString
  | hasAlpha cs = Character
  | otherwise = Number
  where
    cs = concat $ split (mySplitter "\n") xs
    hasAlpha = any isAlpha . mfilter (/=' ')
toRows :: [Char] -> [[Char]]
toRows = mySplit "\n"
toVectorChar :: [Char] -> Vector [Char]
toVectorChar = let toChar = concat . mySplit " \'"
               in V.fromList . fmap (toChar) . toRows
toVectorNumber :: [Char] -> Vector [Int]
toVectorNumber = let toNumber = fmap (\x -> read x :: Int) . mySplit " "
                 in V.fromList . fmap toNumber . toRows
toVectorString :: [Char] -> Vector [[Char]]
toVectorString = let toString = mfilter (/= " ") . mySplit "\""
                 in V.fromList . fmap toString . toRows
You can't.
Covariant polymorphism is not supported in Haskell, and wouldn't be useful if it were.
That's basically all there is to answer. Now as to why this is so.
It's no good "returning a polymorphic value" like OO languages so like to do, because the only reason to return any value at all is to use it in other functions. Now, in OO languages you don't have functions but methods that come with the object, so it's quite easy to "return different types": each will have its suitable methods built-in, and they can per instance vary. (Whether that's a good idea is another question.)
But in Haskell, the functions come from elsewhere. They don't know about implementation changes for a particular instance, so the only way such functions can safely be defined is to know every possible implementation. But if your return type is really polymorphic, that's not possible, because polymorphism is an "open" concept (it allows new implementation varieties to be added any time later).
Instead, Haskell has a very convenient and totally safe mechanism of describing a closed set of "instances" – you've actually used it yourself already! ADTs.
data PolyVector = NumbersVector (Vector [Int])
                | CharsVector (Vector [Char])
                | StringsVector (Vector [String])
That's the return type you want. The function won't be polymorphic as such, it'll simply return a more versatile type.
If you insist it should be polymorphic
Now... actually, Haskell does have a way to sort-of deal with "polymorphic returns". As in OO when you declare that you return a subclass of a specified class. Well, you can't "return a class" at all in Haskell, you can only return types. But those can be made to express "any instance of...". It's called existential quantification.
{-# LANGUAGE GADTs #-}
data PolyVector' where
  PolyVector :: YourVElemClass e => Vector [e] -> PolyVector'
class YourVElemClass e where
  ...?
instance YourVElemClass Int
instance YourVElemClass Char
instance YourVElemClass String
I don't know if that looks intriguing to you. Truth is, it's much more complicated and rather harder to use; you can't just use any of the possible results directly but can only make use of the elements through methods of YourVElemClass. GADTs can in some applications be extremely useful, but these usually involve classes with very deep mathematical motivation. YourVElemClass doesn't seem to have such a motivation, so you'll be much better off with a simple ADT alternative than with existential quantification.
There's a famous rant against existentials by Luke Palmer (note he uses another syntax, existential-specific, which I consider obsolete, as GADTs are strictly more general).
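To make the limitation concrete, here is a small sketch of the existential approach with Show standing in for YourVElemClass and plain lists standing in for Vector (both substitutions are assumptions made so the example runs without extra packages). The names AnyRows, renderRows and examples are made up:

```haskell
{-# LANGUAGE GADTs #-}

-- Hypothetical existential wrapper: the element type e is hidden,
-- and all that is known about it is that it has a Show instance.
data AnyRows where
  AnyRows :: Show e => [[e]] -> AnyRows

-- The only thing a consumer can do with the elements is what Show offers.
renderRows :: AnyRows -> [String]
renderRows (AnyRows rows) = map show rows

-- Values of different element types can share one list.
examples :: [AnyRows]
examples = [AnyRows [[1, 2], [3, 4 :: Int]], AnyRows ["tic", "tac"]]
```

Once a value is wrapped in AnyRows there is no way to get the [[Int]] or [String] back out, which is why the plain ADT is usually the more practical choice here.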
Easy, use a sum type!
data ParsedVector = NumberVector (Vector [Int]) | CharacterVector (Vector [Char]) | TextStringVector (Vector [String]) deriving (Show)
parse :: [Char] -> ParsedVector
parse cs = case classify cs of
  Number -> NumberVector $ toVectorNumber cs
  Character -> CharacterVector $ toVectorChar cs
  TextString -> TextStringVector $ toVectorString cs
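A self-contained variant of the same idea, for experimenting without the vector and split packages. It is only a sketch under simplifying assumptions: lists replace Vector, read replaces the custom splitter, the input is well formed, and string literals contain no spaces. The names Parsed, parseRows and describe are made up:

```haskell
-- Lists stand in for Data.Vector so the sketch needs no extra packages.
data Parsed
  = Numbers [[Int]]
  | Characters [String]
  | Strings [[String]]
  deriving (Show, Eq)

-- Classify by the quoting style, then read every token of every row.
-- Assumes well-formed input and string literals without embedded spaces.
parseRows :: String -> Parsed
parseRows s
  | '"' `elem` s  = Strings (rowsOf s)
  | '\'' `elem` s = Characters (rowsOf s)
  | otherwise     = Numbers (rowsOf s)
  where
    rowsOf :: Read a => String -> [[a]]
    rowsOf = map (map read . words) . lines

-- Callers recover the concrete element type by pattern matching.
describe :: Parsed -> String
describe (Numbers rows)    = "numbers, total " ++ show (sum (concat rows))
describe (Characters rows) = "characters: " ++ unwords rows
describe (Strings rows)    = "strings: " ++ unwords (concat rows)
```

The function's return type is a single ordinary type; the "polymorphism" is pushed into the constructors, and every consumer handles the three cases explicitly.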

Can constraints be enforced on public data types?

I have the following code :
-- A CharBox is a rectangular matrix of characters
data CharBox = CharBox [String]
deriving Show
-- Build a CharBox, ensuring the contents are rectangular
mkCharBox :: [String] -> CharBox
mkCharBox [] = CharBox []
mkCharBox xxs@(x:xs) = if (all (\s -> (length s) == length x) xs)
                       then CharBox xxs
                       else error "CharBox must be a rectangle."
The [[Char]] must be rectangular (i.e. all sub-lists must have the same length) for many functions in the module to work properly. Inside the module I'm always using the mkCharBox "constructor" so I don't have to enforce this constraint all the time.
Initially I wanted my module declaration to look like this :
module CharBox (
    CharBox, -- No (CharBox) because it doesn't enforce rectangularity
    mkCharBox
  ) where
But like that, users of my module cannot pattern match on CharBox. In another module I do
findWiresRight :: CharBox -> [Int]
findWiresRight (CharBox xs) = elemIndices '-' (map last xs)
And ghci complains: Not in scope: data constructor 'CharBox'
Is it possible to enforce my constraint that CharBoxes contain only rectangular arrays, while still allowing pattern matching ? (Also if this is not possible, I'd be interested in knowing the technical reason why. I find there's usually a lot to learn in Haskell when exploring such restrictions)
It's not possible in vanilla Haskell to both hide the constructors and support pattern matching.
The usual approaches to address this are:
view patterns, essentially, export the pattern matching functions.
or:
move the invariant into the type system via size types.
The simplest solution would be to add an extract function to the module:
extract :: CharBox -> [String]
extract (CharBox xs) = xs
and then use it instead of pattern matching:
findWiresRight :: CharBox -> [Int]
findWiresRight c = elemIndices '-' $ map last $ extract c
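If GHC extensions are acceptable, a unidirectional pattern synonym is another option: it can be exported for matching while the real constructor stays hidden. The sketch below is an assumption-laden variant of the question's module: the synonym name Rows is made up, mkCharBox returns Maybe instead of calling error, and the intended export list appears only as a comment so the file runs as a single module:

```haskell
{-# LANGUAGE PatternSynonyms #-}
-- Intended export list (an assumption):
--   module CharBox (CharBox, pattern Rows, mkCharBox) where
import Data.List (elemIndices)

data CharBox = CharBox [String]
  deriving Show

-- Unidirectional synonym: usable in patterns, but not as a constructor,
-- so clients cannot build a non-rectangular CharBox with it.
pattern Rows :: [String] -> CharBox
pattern Rows xs <- CharBox xs

-- Smart constructor that checks the rectangularity invariant.
mkCharBox :: [String] -> Maybe CharBox
mkCharBox [] = Just (CharBox [])
mkCharBox xxs@(x : xs)
  | all ((== length x) . length) xs = Just (CharBox xxs)
  | otherwise = Nothing

-- Client code pattern matches through the synonym.
findWiresRight :: CharBox -> [Int]
findWiresRight (Rows xs) = elemIndices '-' (map last xs)
```

Because Rows has no builder direction, writing Rows ["ab", "c"] as an expression is rejected by the compiler, while matching on Rows xs works as shown.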
