What is the best way to convert a ByteString to an Int?

I always run into the following error when trying to read a ByteString:
Prelude.read: no parse
Here's a sample of code that will cause this error to occur upon rendering in a browser:
factSplice :: SnapletSplice App App
factSplice = do
    mbstr <- getParam "input" -- returns user input as a ByteString
    let str = maybe (error "splice") show mbstr
    let n = read str :: Int
    return [X.TextNode $ T.pack $ show $ product [1..n]]
Or perhaps more simply:
simple bs = read (show bs) :: Int
For some reason, the string produced by show bs includes quotes.
So to get around the error, I have to strip the quotes before reading it.
I use the following function, copied from the internet, to do so:
sq :: String -> String
sq s@[c] = s
sq ('"':s)  | last s == '"'  = init s
            | otherwise      = s
sq ('\'':s) | last s == '\'' = init s
            | otherwise      = s
sq s = s
Then simple bs = read (sq (show bs)) :: Int works as expected.
Why is this the case?
What is the best way to convert a ByteString to an Int?

What the best way to convert a ByteString to an X is depends on X. If you have a good conversion from String, going via Data.ByteString.Char8.unpack can be good, provided it's an ASCII ByteString. For UTF-8 encoded ByteStrings, the utf8-string package contains the conversion function toString. For some specific types, like the Int mentioned in the title, special faster conversions exist, for example Data.ByteString.Char8.readInt and readInteger.
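For instance, here is a minimal sketch of using readInt safely; the helper name bsToInt is my own:

import qualified Data.ByteString.Char8 as C

-- readInt parses a leading Int and returns the unconsumed rest,
-- so we can also insist that the whole input was a number.
bsToInt :: C.ByteString -> Maybe Int
bsToInt bs = case C.readInt bs of
    Just (n, rest) | C.null rest -> Just n
    _                            -> Nothing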

Show is used to create a String representation of something that is useful for debugging and plain-text serialization. The Show typeclass is not just a fancy way of converting anything into a String. That's why ByteString adds quotes to the string: it's arguably easier to read that way when debugging or deserializing a data stream.
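You can see this in GHCi (after importing Data.ByteString.Char8): the quotes become part of the String, and read then chokes on them:
> show (Data.ByteString.Char8.pack "123")
"\"123\""
> read (show (Data.ByteString.Char8.pack "123")) :: Int
*** Exception: Prelude.read: no parse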
You can use the Data.ByteString.Char8.unpack function to convert a ByteString to a String, but note that it unpacks the ByteString byte by byte, which mangles Unicode characters and anything else stored as more than one byte. If you want to do something other than calling read on the result, I'd recommend converting the ByteString to Text instead, which offers more flexibility in this situation. Assuming your encoding is UTF-8 (as it should be by default in Snap), you can use the Data.Text.Encoding.decodeUtf8 function for this. To then convert a Text value to a String with correct Unicode symbols, use Data.Text.unpack.
Once you have a String, you are free to read it as much as you want; alternatively, you can choose to read a Text value directly using the functions in the Data.Text.Read module.
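As a sketch of the Text route (the helper name decodeAndRead is mine; note that decodeUtf8 throws on invalid UTF-8, while decodeUtf8' returns an Either instead):

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Read as TR

-- Decode UTF-8 first, then read a decimal Int from the Text.
decodeAndRead :: B.ByteString -> Either String Int
decodeAndRead bs = case TR.decimal (TE.decodeUtf8 bs) of
    Right (n, rest) | T.null rest -> Right n
    Right _                       -> Left "unexpected trailing characters"
    Left err                      -> Left err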

Related

How to read three consecutive integers from stdin in Haskell?

I want to read an input like 12 34 56 into three integers using Haskell.
For a single integer, one might use myInteger <- readLn. But for this case, I have not found any solution, except first reading a line, then replacing all spaces with ',' (using something like:
spaceToCommas str =
    let repl ' ' = ','
        repl c = c
    in map repl str
) and then calling read ("[" ++ str ++ "]"), which feels very hackish. Also, it does not allow me to state that I want to read three integers; it will attempt to read any number of integers from stdin.
There has to be a better way.
Note that I would like a solution that does not rely on external packages. Using e.g. Parsec is of course great, but this simple example should not require the use of a full-fledged Parser Combinator framework, right?
What about converting the string like:
convert :: Read a => String -> [a]
convert = map read . words
words splits the given string into a list of strings (the "words") and then we perform a read on every element using map.
and for instance use it like:
main = do
    line <- getLine
    let [a,b,c] = convert line :: [Int] in putStrLn (show (c,a,b))
or if you for instance want to read the first three elements and don't care about the rest (yes this apparently requires super-creativity skills):
main = do
    line <- getLine
    let (a:b:c:_) = convert line :: [Int] in putStrLn (show (c,a,b))
Here I return a tuple rotated one place to the right, to show that the parsing really happened.
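As an aside, if you want to avoid the partiality of read, a total variant along these lines is possible (convertMaybe is my own name for it):

import Text.Read (readMaybe)

-- Nothing if any word fails to parse, Just the list otherwise.
convertMaybe :: Read a => String -> Maybe [a]
convertMaybe = traverse readMaybe . words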

Convert unescaped unicode to utf8 integer

Firstly, I apologize if the terms "unescaped unicode" and "utf8 integer" are not correct; I don't really know what I'm talking about when I'm talking about encoding.
As a concrete example, I would like to convert the string "\\u00b5ABC" to the string "\181ABC" (\u00b5 and \181 both denote µ). By "string" I mean String or Text.
I know how to achieve this by using a tortuous (and perhaps laughable) way:
import Data.Aeson (decode)
import Data.ByteString.Lazy (packChars)
import Data.Text (Text)
decode (packChars "\"\\u00b5ABC\"") :: Maybe Text
I am ready to bet there exists a more direct way...
Edit
Following @Alec's comment, I provide more context. In the background, a JavaScript program receives a character string and replaces each character whose Unicode code point is between \u007F and \uFFFF with its escaped representation \\uxxxx.
On the Haskell side, I receive this new string, and I want to replace the \\uxxxx with their corresponding utf8 integer representations.
Here's a nice simple parser written using regex-applicative. First some imports and other nonsense that isn't worth reading:
import Data.Char
import Data.Maybe
import Numeric
import Text.Regex.Applicative
-- no idea why this isn't in Control.Applicative
replicateA :: Applicative f => Int -> f a -> f [a]
replicateA n act = sequenceA (replicate n act)
Now, we want to parse an escaped character. We'll use a regex that matches characters and returns a character, so it's an RE Char Char. Ideally I'd write it this way:
escaped :: RE Char Char
escaped = do
    string "\\u"
    digits <- replicateM 4 (psym isHexDigit)
    return . chr . fst . head . readHex $ digits
The head is safe because we've ensured that readHex will only be passed hex digits, and therefore it will succeed. We can almost write it like that, except that RE Char is not a Monad. With newish GHCs you can probably turn on ApplicativeDo and be done with it, but it's not so bad to write it in applicative style ourselves and support all GHCs, so let's do that:
escaped :: RE Char Char
escaped
    = chr . fst . head . readHex
    <$> (string "\\u"
         *> replicateA 4 (psym isHexDigit)
        )
Anyway, once we have a regex for decoding a single escaped character, it's easy to produce a regex for decoding all the escaped characters and passing unescaped characters through unchanged: many (escaped <|> anySym). Since this regex will always succeed, we can ignore the Maybe-ness of (=~) hedging its bets about whether an expression will match, and write
decodeHex :: String -> String
decodeHex = fromJust . (=~ many (escaped <|> anySym))
Let's try it in ghci:
> decodeHex "\\u00b5ABC"
"\181ABC"
> decodeHex "\\u00bABC"
"\186BC"
> decodeHex "\\udefg"
"\\udefg"
The advantage of writing our own parser like this, instead of relying on something like decode, is that we gain control and confidence over exactly which transformations are being done. For example, since we know \u will always be followed by four hex digits, we transform it only when that actually happens, so if the original, pre-JavaScript text contained \\udefg, it appears unchanged in the final output rather than becoming \3567g. We also don't have to worry about other escapes being processed that we don't want, and we don't have to "extra-escape" our string before handing it off, as you do when adding the extra quotes around it. The disadvantage, of course, is that we had to engineer it ourselves, and we probably have less confidence in its correctness, since it hasn't been battle-hardened by a thousand users!

parsec running out of memory

I wrote a parser for a large CSV file which works on a smaller subset but runs out of memory for ~1.5m lines (the actual file).
After initially parsing all elements into a list (using manyTill), I instead used the parser state to store them in a single binary search tree; this worked for the large file.
I have since split the "element type" into three separate types and want to store each in its own tree, resulting in three trees of different types.
This version, though, only works for the small test file while running out of memory for the larger one.
import qualified Data.Tree.AVL as AVL
import qualified Text.ParserCombinators.Parsec as Parsec

data ENW = ENW (AVL.AVL Extent) (AVL.AVL Node) (AVL.AVL Way)
-- used to be Element = Extent | Node | Way in a (Tree Element) - this worked

csvParser :: Parsec String ENW ENW
csvParser = Parsec.manyTill parseL Parsec.eof >> Parsec.getState
    where parseL = parseLine >> ((Parsec.newline >> return ()) <|> Parsec.eof)

parseLine :: Parsec String ENW ()
parseLine = parseNode <|> parseWay <|> parseExtents

parseNode :: Parsec String ENW ()
parseNode = Parsec.string "node" *> (flip addNode <$> (Node <$> identifier <*> float <*> float)) >>= Parsec.updateState
    where identifier = Parsec.tab *> (read <$> Parsec.many1 Parsec.digit)
          float      = Parsec.tab *> (read <$> parseFloat)

addNode :: ENW -> Node -> ENW
addNode (ENW e n w) node = ENW e (AVL.push (sndCC node) node n) w
parseWay and parseExtent follow the same pattern and the whole thing is started with
Parsec.runParser csvParser (ENW AVL.empty AVL.empty AVL.empty) "" input
I don't understand how using three smaller trees instead of a single large one can cause memory issues.
Do you have a good reason not to use Cassava? It can be used to stream CSV data and is probably more robust than an ad hoc CSV parser. My own experience with it has shown it has excellent performance and can be easily extended to parse your own types.
Edit: It also looks like you're working with tab-separated data, not comma-separated data, but Cassava lets you specify what delimiter to split columns by. It also appears that the data you have is potentially different on each line, so you may need to use Cassava's 'raw' format, which returns a Vector ByteString for each line that you can then parse based on the first element.
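Something along these lines should work; this is only a sketch of the streaming, raw-record route, and the names tabOptions and rows are my own:

import qualified Data.ByteString.Lazy as BL
import qualified Data.Csv.Streaming as S
import qualified Data.Vector as V
import Data.ByteString (ByteString)
import Data.Char (ord)
import Data.Csv (DecodeOptions(..), HasHeader(NoHeader), defaultDecodeOptions)

-- Split columns on tabs instead of commas.
tabOptions :: DecodeOptions
tabOptions = defaultDecodeOptions { decDelimiter = fromIntegral (ord '\t') }

-- Stream 'raw' rows (Vector ByteString); each row can then be
-- dispatched on its first field ("node", "way", ...).
rows :: BL.ByteString -> S.Records (V.Vector ByteString)
rows = S.decodeWith tabOptions NoHeader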
I've never seen anyone use the AVL tree package before; is there a good reason you aren't using more standard structures? That package is quite old (last updated in 2008), and more recent packages are likely to perform better.
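For example, a strict Map from containers would be the conventional choice; a sketch, with a hypothetical stand-in for your Node type:

import qualified Data.Map.Strict as M

-- Hypothetical stand-in for the question's Node type.
data Node = Node { nodeId :: Integer, lat :: Double, lon :: Double }

-- Strict insertion into a standard Map keyed by id.
addNode :: M.Map Integer Node -> Node -> M.Map Integer Node
addNode m node = M.insert (nodeId node) node m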

Parsec: grabbing raw source after parsing

I have a strange whim. Suppose I have something like this:
data Statement = StatementType Stuff Source
Now I want to parse such a statement, parse all the stuff, and after that put all the characters I've processed (for this particular statement) into the resulting data structure. For some reason.
Is it possible, and if yes, how to accomplish that?
In general this is not possible. Parsec does not expect a lot from its stream type; in particular, there is no way to efficiently split a stream.
But for a concrete stream type (e.g. String, or [a], or ByteString) a hack like this would work:
parseWithSource :: Parsec [c] u a -> Parsec [c] u ([c], a)
parseWithSource p = do
    input <- getInput
    a <- p
    input' <- getInput
    return (take (length input - length input') input, a)
This solution relies on the getInput function, which returns the current input. We get the input twice, before and after running the parser; the difference in lengths gives us the exact number of consumed elements, and knowing that, we can take those elements from the original input.
Here you can see it in action:
*Main Text.Parsec> parseTest (between (char 'x') (char 'x') (parseWithSource ((read :: String -> Int) `fmap` many1 digit))) "x1234x"
("1234",1234)
But you should also look into attoparsec, as it properly supports this functionality with the match function.
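For example, a minimal sketch with attoparsec (intWithSource is my own name):

import Data.Attoparsec.ByteString.Char8 (Parser, decimal, match, parseOnly)
import Data.ByteString (ByteString)

-- match runs a parser and also returns exactly the bytes it consumed:
-- parseOnly intWithSource (Data.ByteString.Char8.pack "1234x")
--   == Right ("1234", 1234)
intWithSource :: Parser (ByteString, Int)
intWithSource = match decimal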

Load property file into Map structure

I have Java-like property text files with key/value pairs. What are some good approaches for loading that data into Haskell and then accessing it?
The file look likes:
XXXX=vvvvv
YYYY=uuuuu
I want to be able to access the "XXXX" key.
You could use a parser library like the excellent Parsec (part of the Haskell Platform). Writing a parser for a format that simple would only take a few minutes.
However, if it's really that simple, you could use split: break the string into lines using the standard lines function (or use Data.List.Split if you need to handle blank lines, etc.), and then use the Data.List.Split functions to split each line on '=', as sketched below.
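A sketch of that route (parseProps is my own name; any extra '=' signs are rejoined into the value):

import Data.List (intercalate)
import Data.List.Split (splitOn)

parseProps :: String -> [(String, String)]
parseProps s =
    [ (key, intercalate "=" rest)
    | line <- lines s
    , key:rest <- [splitOn "=" line]
    ]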
The simplest solution would be rolling your own with break:
import Control.Arrow

parse :: String -> [(String, String)]
parse = map parseField . lines
    where parseField = second (drop 1) . break (== '=')
However, this doesn't handle whitespace, blank lines, or anything like that.
As for looking up by key, once you have a structure like [(String, String)], it's easy to put it into a Map (with fromList) and operate on that.
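For instance, a quick sketch (lookupKey is my own name), taking the association list that parse produces:

import qualified Data.Map as Map

-- Build the Map once; each lookup is then O(log n).
lookupKey :: String -> [(String, String)] -> Maybe String
lookupKey key = Map.lookup key . Map.fromList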
Exploring a few details ehird didn't mention, and a slightly different approach:
import qualified Data.Map as Map

type Key = String
type Val = String

main = do
    -- get contents of file (contents :: String)
    contents <- readFile "config.txt"
    -- split into lines (optionList :: [String])
    let optionList = lines contents
    -- parse into map (optionMap :: Map Key Val)
    let optionMap = optionMapFromList optionList
    doStuffWith optionMap

optionMapFromList :: [String] -> Map.Map Key Val
optionMapFromList = foldr step Map.empty
    where step line map = case parseOpt line of
              Just (key, val) -> Map.insert key val map
              Nothing         -> map

parseOpt :: String -> Maybe (Key, Val)
parseOpt = undefined
I've expressed my solution to your problem as a fold: taking the list of lines in the file, and turning it into the desired map. Each step of the fold involves inspecting a single line, attempting to parse it into a key/value pair, and when successful, inserting it into the map.
I've left parseOpt undefined; you could use an approach like ehird's parseField, or whatever you like. Perhaps you would prefer to only parse specific options:
interestingOpts = ["XXXX", "YYYY"]

parseOpt line = case find (`isPrefixOf` line) interestingOpts of
    Just key -> Just (key, drop 1 $ dropWhile (/= '=') line)
    Nothing  -> Nothing
Using the prefix-testing approach isn't always the best idea, though, if you have (for example) both an option "XX" and an option "XXXX". Play around and see what approach suits your data best. If you need high performance, look into using Data.Text instead of String.
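As a sketch of what the Text version might look like (parseT is my own name):

import qualified Data.Map as Map
import qualified Data.Text as T

-- breakOn splits at the first '='; drop 1 removes the '=' itself.
parseT :: T.Text -> Map.Map T.Text T.Text
parseT = Map.fromList . map field . T.lines
    where field line = let (k, v) = T.breakOn (T.pack "=") line
                       in (k, T.drop 1 v)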
