How to combine Megaparsec with Text.Read (derived Read instance) - haskell

I want to use derived Read instances in a megaparsec parser.
How can I use Text.Read.read or Text.Read.readEither inside a Parser a?
It doesn't need to be fast, but it should be easy to maintain and extend.
The megaparsec parser is for testing my application via a CLI, so many different datatypes must be parsed.
It should work in the following way:
import Text.Megaparsec

readableDatatype :: Read a => Parser a
readableDatatype =
  -- This is wrong, but describes how it shall work
  -- liftA read chunkToTokens

expr' :: Parser UserControlExpr
expr' = timeExpr
    <|> timeEventExpr
    <|> digiInExpr
    <|> quitExpr

digiInExpr :: Parser UserControlExpr
digiInExpr = do
  cmdword "digiIn"
  inElement <- (readableDatatype :: Parser TI_I)
  return $ UserDigiIn inElement
What do I have to write so that the three functions typecheck, especially readableDatatype?

You can use getInput :: MonadParsec e s m => m s and setInput :: MonadParsec e s m => s -> m () together with reads :: Read a => String -> [(a, String)] for that. getInput and setInput just get and set the input stream the parser is working on and reads takes a string and returns a list of possible parses together with the remaining unconsumed portions of the input. We also need to tell the parser the new offset in the input, otherwise error locations are wrong. We can do that using getOffset and setOffset.
-- For the equality constraint (~)
{-# LANGUAGE TypeFamilies #-}

import Text.Megaparsec
import Text.Read (reads)

readableDatatype :: (Read a, MonadParsec e s m, s ~ String) => m a
readableDatatype = do
  input <- getInput
  offset <- getOffset
  choice $
    (\(a, input') -> a <$ setInput input'
                       <* setOffset (offset + length input - length input'))
      <$> reads input
If your input is something other than String you will have to convert between that and String after getInput and before setInput.
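For example, with a Text stream one unoptimised way to do that is to unpack to String for reads and pack the leftover back; a minimal sketch (readableDatatypeT is a name introduced here just for illustration):

{-# LANGUAGE TypeFamilies #-}

import Data.Text (Text)
import qualified Data.Text as T
import Text.Megaparsec

readableDatatypeT :: (Read a, MonadParsec e s m, s ~ Text) => m a
readableDatatypeT = do
  input  <- getInput
  offset <- getOffset
  choice $
    (\(a, rest) -> a <$ setInput (T.pack rest)
                     <* setOffset (offset + T.length input - length rest))
      <$> reads (T.unpack input)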
The following is about performance concerns, so not really relevant to your problem, but maybe it is educational and it may be useful to others who may need a solution with good performance.
Converting the whole input between String and some other type all the time during parsing is a rather big performance bottleneck for larger input. Furthermore using length to calculate the new offset here is not very performant either.
To solve both of these problems we need some way of knowing how much of the input was actually consumed by the Read-parser, so that we can just drop that part from the original input instead of converting the whole unconsumed part back to the original input type. But the Read class does not offer that. One could try to parse incrementally longer prefixes of the input, which may be faster in cases where the parses done using Read are short compared to the length of the entire input. You could also use unsafePerformIO to write to an IORef how much of the input was actually forced by the Read-parser, which would be the fastest, but not the prettiest, solution.
I implemented the latter here. Feel free to use it, but be aware that it is not very well tested. It does however solve all the problems with the above approach.

That did it. Thank you! In the meantime I made a "conservative" solution to the problem by defining the constructors as strings and parsing them, without using read. That has the advantage that you get megaparsec's impressive error messages, which tell you what symbols are missing.
Example with read:
1:8:
|
1 | digiIn TI_I_Signal1 DirA Dectivated
| ^
unknown parse error
(only an 'a' was missing in "Deactivated")
Example with a hand-written parser for the datatype:
1:19:
|
1 | digiIn TI_I_Signal1 Dectivated
| ^^^^^^^^
unexpected "Dectivat"
expecting "active", "inactive", '0', or '1'
I think I will use your code block for future datatypes.
Thank you very much!

Related

How to make a custom Attoparsec parser combinator that returns a Vector instead of a list?

{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text
import Control.Applicative(many)
import Data.Word
parseManyNumbers :: Parser [Int] -- I'd like many to return a Vector instead
parseManyNumbers = many (decimal <* skipSpace)
main :: IO ()
main = print $ parseOnly parseManyNumbers "131 45 68 214"
The above is just an example, but I need to parse a large amount of primitive values in Haskell and need to use arrays instead of lists. This is something that is possible in F#'s FParsec, so I've gone as far as looking at Attoparsec's source, but I can't figure out a way to do it. In fact, I can't figure out where many from Control.Applicative is defined in the base library. I thought it would be there, as that is where the documentation on Hackage points to, but no such luck.
Also, I am having trouble deciding what data structure to use here as I can't find something as convenient as a resizable array in Haskell, but I would rather not use inefficient tree based structures.
An option to me would be to skip Attoparsec and implement an entire parser inside the ST monad, but I would rather avoid it except as a very last resort.
There is a growable vector implementation in Haskell, which is based on the great AMT algorithm: "persistent-vector". Unfortunately, the library isn't that well known in the community so far. However, to give you a clue about the performance of the algorithm, I'll say that it is the algorithm that drives the standard vector implementations in Scala and Clojure.
I suggest you implement your parser around that data structure, using the list-specialized implementations as a guide; here are those functions, btw (a Vector-producing sketch follows after them):
-- | One or more.
some :: f a -> f [a]
some v = some_v
  where
    many_v = some_v <|> pure []
    some_v = (fmap (:) v) <*> many_v

-- | Zero or more.
many :: f a -> f [a]
many v = many_v
  where
    many_v = some_v <|> pure []
    some_v = (fmap (:) v) <*> many_v
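If you want a Vector-producing combinator instead of the list-producing many, one option is to accumulate into the growable structure as you go. Below is a minimal sketch against persistent-vector, assuming its Data.Vector.Persistent module provides empty and snoc (check the package's actual API before relying on this):

import Control.Applicative (Alternative, (<|>))
import qualified Data.Vector.Persistent as PV

-- Zero or more occurrences of p, accumulated left-to-right with snoc,
-- which is cheap on a persistent (AMT-based) vector.
manyVec :: (Monad m, Alternative m) => m a -> m (PV.Vector a)
manyVec p = go PV.empty
  where
    go acc = (p >>= \x -> go (PV.snoc acc x)) <|> pure acc

With Attoparsec, <|> backtracks on failure, so when p finally fails the combinator simply returns the vector accumulated so far.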
Some ideas:
Data Structures
I think the most practical data structure to use for the list of Ints is something like [Vector Int]. If each component Vector is sufficiently long (i.e. has length 1k) you'll get good space economy. You'll have
to write your own "list operations" to traverse it, but you'll avoid re-copying data that you would have to perform to return the data in a single Vector Int.
Also consider using a Dequeue instead of a list.
Stateful Parsing
Unlike Parsec, Attoparsec does not provide for user state. However, you
might be able to make use of the runScanner function (link):
runScanner :: s -> (s -> Word8 -> Maybe s) -> Parser (ByteString, s)
(It also returns the parsed ByteString which in your case may be problematic since it will be very large. Perhaps you can write an alternate version which doesn't do this.)
Using unsafeFreeze and unsafeThaw you can incrementally fill in a Vector. Your s data structure might look
something like:
data MyState = MyState
  { inNumber :: Bool          -- True if seen a digit
  , val      :: Int           -- value of int being parsed
  , vecs     :: [Vector Int]  -- past parsed vectors
  , v        :: Vector Int    -- current vector we are filling
  , vsize    :: Int           -- number of items filled in current vector
  }
Maybe instead of a [Vector Int] you use a Dequeue (Vector Int).
I imagine, however, that this approach will be slow since your parsing function will get called for every single character.
Represent the list as a single token
Parsec can be used to parse a stream of tokens, so how about writing
your own tokenizer and letting Parsec create the AST.
The key idea is to represent these large sequences of Ints as a single token. This gives you a lot more latitude in how you parse them.
Defer Conversion
Instead of converting the numbers to Ints at parse time, just have parseManyNumbers return a ByteString and defer the conversion until
you actually need the values. This might enable you to avoid reifying
the values as an actual list.
Vectors are arrays, under the hood. The tricky thing about arrays is that they are fixed-length. You pre-allocate an array of a certain length, and the only way of extending it is to copy the elements into a larger array.
This makes linked lists simply better at representing variable-length sequences. (It's also why list implementations in imperative languages amortise the cost of copying by allocating arrays with extra space and copying only when the space runs out.) If you don't know in advance how many elements there are going to be, your best bet is to use a list (and perhaps copy the list into a Vector afterwards using fromList, if you need to). That's why many returns a list: it runs the parser as many times as it can with no prior knowledge of how many that'll be.
On the other hand, if you happen to know how many numbers you're parsing, then a Vector could be more efficient. Perhaps you know a priori that there are always n numbers, or perhaps the protocol specifies before the start of the sequence how many numbers there'll be. Then you can use replicateM to allocate and populate the vector efficiently.
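For that known-count case, a minimal sketch (the leading count field in the input is an assumption made for illustration):

import Data.Attoparsec.Text
import qualified Data.Vector as V

-- Parse a count, then exactly that many numbers, directly into a Vector.
parseCountedNumbers :: Parser (V.Vector Int)
parseCountedNumbers = do
  n <- decimal <* skipSpace
  V.replicateM n (decimal <* skipSpace)

-- e.g. parseOnly parseCountedNumbers "4 131 45 68 214" (with OverloadedStrings)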

Extracting Information from Haskell Object

I'm new to Haskell and I'm confused on how to get values out of function results. In my particular case, I am trying to parse Haskell files and see which AST nodes appear on which lines. This is the code I have so far:
import Language.Haskell.Parser
import Language.Haskell.Syntax
getTree :: String -> IO (ParseResult HsModule)
getTree path = do
  file <- readFile path
  let tree = parseModuleWithMode (ParseMode path) file
  return tree

main :: IO ()
main = do
  tree <- getTree "ex.hs"
  -- <do something with the tree other than print it>
  print tree
So on the line where I have the comment, I have a syntax tree as tree. It appears to have type ParseResult HsModule. What I want is just HsModule. I guess what I'm looking for is a function as follows:
extract :: ParseResult a -> a
Or better yet, a general Haskell function
extract :: AnyType a -> a
Maybe I'm missing a major concept about Haskell here?
p.s. I understand that thinking of these things as "Objects" and trying to access "Fields" from them is wrong, but I'd like an explanation of how to deal with this type of thing in general.
Looking for a general function of type
extract :: AnyType a -> a
does indeed show a big misunderstanding about Haskell. Consider the many things AnyType might be, and how you might extract exactly one object from it. What about Maybe Int? You can easily enough convert Just 5 to 5, but what number should you return for Nothing?
Or what if AnyType is [], so that you have [String]? What should be the result of
extract ["help", "i'm", "trapped"]
or of
extract []
?
ParseResult has a similar "problem", in that it uses ParseOk to contain results indicating that everything was fine, and ParseFailed to indicate an error. Your incomplete pattern match successfully gets the result if the parse succeeded, but will crash your program if in fact the parse failed. By using ParseResult, Haskell is encouraging you to consider what you should do if the code you are analyzing did not parse correctly, rather than to just blithely assume it will come out fine.
The definition of ParseResult is:
data ParseResult a = ParseOk a | ParseFailed SrcLoc String
(obtained from source code)
So there are two possibilities: either the parsing succeeded, and it will return a ParseOk instance, or something went wrong during the parsing in which case you get the location of the error, and an error message with a ParseFailed constructor.
So you can define a function:
getData :: ParseResult a -> a
getData (ParseOk x) = x
getData (ParseFailed _ s) = error s
It is better to throw an error with the message in that case, since it is always possible that your compiler/interpreter/analyzer/... parses a Haskell program containing syntax errors.
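A small sketch of handling both constructors explicitly, reusing getTree and the imports from the question, instead of crashing on a failed parse:

main :: IO ()
main = do
  result <- getTree "ex.hs"
  case result of
    ParseOk hsModule    -> print hsModule
    ParseFailed loc msg -> putStrLn ("Parse failed at " ++ show loc ++ ": " ++ msg)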
I just figured out how to do this. It seems that when I was trying to define
extract :: ParseResult a -> a
extract (ParseResult a) = a
I actually needed to use
extract :: ParseResult a -> a
extract (ParseOk a) = a
instead. I'm not 100% sure why this is.

How can I easily express that I don't care about a value of a particular data field?

I was writing tests for my parser, using a method which might not be the best, but has been working for me so far. The tests assumed perfectly defined AST representation for every code block, like so:
(parse "x = 5") `shouldBe` (Block [Assignment [LVar "x"] [Number 5.0]])
However, when I moved to more complex cases, a need for more "fuzzy" verification arose:
(parse "t.x = 5") `shouldBe` (Block [Assignment [LFieldRef (Var "t") (StringLiteral undefined "x")] [Number 5.0]])
I put undefined in this example to mark the field that I don't want compared to the result of parse (it's the source position of a string literal). Right now the only fix I see is rewriting the code to use shouldSatisfy instead of shouldBe, which I'll have to do if I don't find any other solution.
You can write a normalizePosition function which replaces all the position data in your AST with some fixed dummyPosition value, and then use shouldBe against a pattern built from the same dummy value.
If the AST is very involved, consider writing this normalization using Scrap-Your-Boilerplate.
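A minimal Scrap-Your-Boilerplate sketch, under the assumptions that your AST types derive Data and that the embedded positions are Parsec SourcePos values (adjust both to your actual types):

import Data.Generics (Data, everywhere, mkT)
import Text.Parsec.Pos (SourcePos, initialPos)

dummyPosition :: SourcePos
dummyPosition = initialPos ""

-- Replace every SourcePos anywhere in the tree with the dummy value.
normalizePosition :: Data a => a -> a
normalizePosition = everywhere (mkT (const dummyPosition :: SourcePos -> SourcePos))

In the spec you would then compare normalizePosition (parse "t.x = 5") against an expected tree built with dummyPosition in the position fields.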
One way to solve this is to parametrize your AST over source locations:
{-# LANGUAGE DeriveFunctor #-}

data AST a = ...
  deriving (Eq, Show, Functor)
Your parse function would then return an AST with SourceLocations:
parse :: String -> AST SourceLocation
As we derived a Functor instance above, we can easily replace source locations with something else, e.g. ():
import Data.Functor ((<$))
parseTest :: String -> AST ()
parseTest input = () <$ parse input
Now, just use parseTest instead of parse in your specs.

Parsec returns [Char] instead of Text

I am trying to create a parser for a custom file format. In the format I am working with, some fields have a closing tag like so:
<SOL>
<DATE>0517
<YEAR>86
</SOL>
I am trying to grab the value between the </ and > and use it as part of the bigger parser.
I have come up with the code below. The trouble is, the parser returns [Char] instead of Text. I can pack each Char by doing fmap pack $ return r to get a text value out, but I was hoping type inference would save me from having to do this. Could someone give hints as to why I am getting back [Char] instead of Text, and how I can get back Text without having to manually pack the value?
{-# LANGUAGE NoMonomorphismRestriction #-}
{-# LANGUAGE OverloadedStrings #-}
import Data.Text
import Text.Parsec
import Text.Parsec.Text
-- | A closing tag is on its own line and is a "</" followed by some uppercase characters
-- followed by a '>'.
closingTag = do
  _ <- char '\n'
  r <- between (string "</") (char '>') (many upper)
  return r
string has the type
string :: Stream s m Char => String -> ParsecT s u m String
(See here for documentation)
So getting a String back is exactly what's supposed to happen.
Type inference doesn't change types, it only infers them. String is a concrete type, so there's no way to infer Text for it.
What you could do, if you need this in a couple of places, is to write a function
text :: Stream s m Char => String -> ParsecT s u m Text
text = fmap pack . string
or even
string' :: (IsString a, Stream s m Char) => String -> ParsecT s u m a
string' = fmap fromString . string
Also, it doesn't matter in this example, but you'd probably want to import Data.Text qualified, since names like pack are used in a number of different modules.
As Ørjan Johansen correctly pointed out, string isn't actually the problem here, many upper is. The same principle applies though.
The reason you get [Char] here is that upper parses a Char and many turns that into a [Char]. I would write my own combinator along the lines of:
manyPacked = fmap pack . many
You could probably use type-level programming with type classes etc. to automatically choose between many and manyPacked depending on the expected return type, but I don't think that's worth it. (It would probably look a bit like Scala's CanBuildFrom.)
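For example, reusing the imports from the question, a helper like that lets closingTag return Text directly; a minimal sketch:

closingTag :: Parser Text
closingTag = do
  _ <- char '\n'
  between (string "</") (char '>') (manyPacked upper)
  where
    manyPacked :: Parser Char -> Parser Text
    manyPacked = fmap pack . many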

Text or Bytestring

Good day.
The one thing I now hate about Haskell is the quantity of packages for working with strings.
First I used native Haskell [Char] strings, but when I tried to start using Hackage libraries I got completely lost in endless conversions. Every package seems to use a different string implementation, and some adopt their own handmade thing.
Next I rewrote my code with Data.Text strings and the OverloadedStrings extension. I chose Text because it has a wider set of functions, but it seems many projects prefer ByteString.
Could someone give a short reasoning for why to use one or the other?
PS: btw, how do I convert from Text to ByteString?
Couldn't match expected type
Data.ByteString.Lazy.Internal.ByteString
against inferred type Text
Expected type: IO Data.ByteString.Lazy.Internal.ByteString
Inferred type: IO Text
I tried encodeUtf8 from Data.Text.Encoding, but no luck:
Couldn't match expected type
Data.ByteString.Lazy.Internal.ByteString
against inferred type Data.ByteString.Internal.ByteString
UPD:
Thanks for the responses. That *Chunks goodness looks like the way to go, but I'm somewhat shocked by the result; my original function looked like this:
htmlToItems :: Text -> [Item]
htmlToItems =
  getItems . parseTags . convertFuzzy Discard "CP1251" "UTF8"
And it now became:
htmlToItems :: Text -> [Item]
htmlToItems =
  getItems . parseTags . fromLazyBS . convertFuzzy Discard "CP1251" "UTF8" . toLazyBS
  where
    toLazyBS t = fromChunks [encodeUtf8 t]
    fromLazyBS t = decodeUtf8 $ intercalate "" $ toChunks t
And yes, this function is not right as it stands: if we supply Text to it, we can be confident the text is properly encoded and ready to use, so converting it is a silly thing to do; but such a verbose conversion still has to take place somewhere outside htmlToItems.
ByteStrings are mainly useful for binary data, but they are also an efficient way to process text if all you need is the ASCII character set. If you need to handle unicode strings, you need to use Text. However, I must emphasize that neither is a replacement for the other, and they are generally used for different things: while Text represents pure unicode, you still need to encode to and from a binary ByteString representation whenever you e.g. transport text via a socket or a file.
Here is a good article about the basics of unicode, which does a decent job of explaining the relation of unicode code-points (Text) and the encoded binary bytes (ByteString): The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
You can use the Data.Text.Encoding module to convert between the two datatypes, or Data.Text.Lazy.Encoding if you are using the lazy variants (as you seem to be doing based on your error messages).
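For instance, with the lazy variants (matching the lazy ByteString in your error messages), a minimal sketch of both directions:

import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

textToBytes :: TL.Text -> BL.ByteString
textToBytes = TLE.encodeUtf8

bytesToText :: BL.ByteString -> TL.Text
bytesToText = TLE.decodeUtf8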
You definitely want to be using Data.Text for textual data.
encodeUtf8 is the way to go. This error:
Couldn't match expected type Data.ByteString.Lazy.Internal.ByteString
against inferred type Data.ByteString.Internal.ByteString
means that you're supplying a strict bytestring to code which expects a lazy bytestring. Conversion is easy with the fromChunks function:
Data.ByteString.Lazy.fromChunks :: [Data.ByteString.Internal.ByteString] -> ByteString
so all you need to do is apply fromChunks [myStrictByteString] wherever the lazy bytestring is expected.
Conversion the other way can be accomplished with the dual function toChunks, which takes a lazy bytestring and gives a list of strict chunks.
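Put together, a small sketch of both directions between strict and lazy bytestrings:

import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL

strictToLazy :: BS.ByteString -> BL.ByteString
strictToLazy bs = BL.fromChunks [bs]

lazyToStrict :: BL.ByteString -> BS.ByteString
lazyToStrict = BS.concat . BL.toChunks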
You may want to ask the maintainers of some packages if they'd be able to provide a text interface instead of, or in addition to, a bytestring interface.
Use the single function cs from Data.String.Conversions.
It will allow you to convert between String, ByteString, and Text (as well as ByteString.Lazy and Text.Lazy), depending on the input and the expected types.
You still have to call it, but you no longer have to worry about the respective types.
See this answer for usage example.
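A minimal usage sketch (assuming the string-conversions package; the expected result type drives which conversion cs performs):

import Data.String.Conversions (cs)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T

textToLazyBytes :: T.Text -> BL.ByteString
textToLazyBytes = cs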
For what it's worth, I found these two helper functions to be quite useful:
import qualified Data.ByteString.Char8 as BS
import qualified Data.Text as T
-- | Text to ByteString
tbs :: T.Text -> BS.ByteString
tbs = BS.pack . T.unpack
-- | ByteString to Text
bst :: BS.ByteString -> T.Text
bst = T.pack . BS.unpack
Example:
foo :: [BS.ByteString]
foo = ["hello", "world"]
bar :: [T.Text]
bar = bst <$> foo
baz :: [BS.ByteString]
baz = tbs <$> bar
