Text or Bytestring - string

Good day.
The one thing I now hate about Haskell is the number of packages for working with strings.
At first I used native Haskell [Char] strings, but when I started using Hackage libraries I got completely lost in endless conversions. Every package seems to use a different string implementation, and some adopt their own handmade thing.
Next I rewrote my code with Data.Text strings and the OverloadedStrings extension. I chose Text because it has a wider set of functions, but it seems many projects prefer ByteString.
Could someone give a short explanation of why to use one or the other?
PS: By the way, how do I convert from Text to ByteString?
Couldn't match expected type
Data.ByteString.Lazy.Internal.ByteString
against inferred type Text
Expected type: IO Data.ByteString.Lazy.Internal.ByteString
Inferred type: IO Text
I tried encodeUtf8 from Data.Text.Encoding, but no luck:
Couldn't match expected type
Data.ByteString.Lazy.Internal.ByteString
against inferred type Data.ByteString.Internal.ByteString
UPD:
Thanks for the responses. That *Chunks goodness looks like the way to go, but I am somewhat shocked by the result. My original function looked like this:
htmlToItems :: Text -> [Item]
htmlToItems =
  getItems . parseTags . convertFuzzy Discard "CP1251" "UTF8"
And now it has become:
htmlToItems :: Text -> [Item]
htmlToItems =
  getItems . parseTags . fromLazyBS . convertFuzzy Discard "CP1251" "UTF8" . toLazyBS
  where
    toLazyBS t = fromChunks [encodeUtf8 t]
    fromLazyBS t = decodeUtf8 $ intercalate "" $ toChunks t
And yes, this function does not work because it is wrong: if we supply Text to it, then we are confident that this text is properly encoded and ready to use, so converting it is a stupid thing to do. But such a verbose conversion still has to take place somewhere outside htmlToItems.

ByteStrings are mainly useful for binary data, but they are also an efficient way to process text if all you need is the ASCII character set. If you need to handle Unicode strings, you need to use Text. However, I must emphasize that neither is a replacement for the other, and they are generally used for different things: while Text represents pure Unicode, you still need to encode to and decode from a binary ByteString representation whenever you, for example, transport text via a socket or a file.
Here is a good article about the basics of Unicode, which does a decent job of explaining the relation between Unicode code points (Text) and the encoded binary bytes (ByteString): The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
You can use the Data.Text.Encoding module to convert between the two datatypes, or Data.Text.Lazy.Encoding if you are using the lazy variants (as you seem to be doing based on your error messages).
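As a minimal sketch of that advice (the helper names are made up for illustration; only the library functions are real), the conversions for the strict and lazy variants might look like this. It assumes the bytes are UTF-8, and it uses decodeUtf8' so that invalid input yields Nothing instead of an exception.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Hypothetical helper names; only the library functions are real.

-- Strict Text -> strict ByteString (UTF-8 encoded).
textToBytes :: T.Text -> BS.ByteString
textToBytes = TE.encodeUtf8

-- Strict ByteString -> strict Text; Nothing if the bytes are not valid UTF-8.
bytesToText :: BS.ByteString -> Maybe T.Text
bytesToText = either (const Nothing) Just . TE.decodeUtf8'

-- The lazy variants live in Data.Text.Lazy.Encoding.
lazyTextToBytes :: TL.Text -> BL.ByteString
lazyTextToBytes = TLE.encodeUtf8

lazyBytesToText :: BL.ByteString -> Maybe TL.Text
lazyBytesToText = either (const Nothing) Just . TLE.decodeUtf8'
```

If the data is in some other encoding (such as CP1251 in the question), it has to be transcoded at the ByteString level before decoding to Text.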

You definitely want to be using Data.Text for textual data.
encodeUtf8 is the way to go. This error:
Couldn't match expected type Data.ByteString.Lazy.Internal.ByteString
against inferred type Data.ByteString.Internal.ByteString
means that you're supplying a strict bytestring to code which expects a lazy bytestring. Conversion is easy with the fromChunks function:
Data.ByteString.Lazy.fromChunks :: [Data.ByteString.Internal.ByteString] -> ByteString
so all you need to do is add the function fromChunks [myStrictByteString] wherever the lazy bytestring is expected.
Conversion the other way can be accomplished with the dual function toChunks, which takes a lazy bytestring and gives a list of strict chunks.
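A small sketch of both directions, with made-up helper names. Newer versions of the bytestring package also export fromStrict and toStrict from Data.ByteString.Lazy, which do the same job in one call.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL

-- Hypothetical helper names, for illustration only.

-- Wrap a strict bytestring as a single-chunk lazy bytestring.
toLazy :: BS.ByteString -> BL.ByteString
toLazy strict = BL.fromChunks [strict]

-- Concatenate the strict chunks of a lazy bytestring into one strict value.
-- This forces the whole lazy bytestring into memory.
fromLazy :: BL.ByteString -> BS.ByteString
fromLazy = BS.concat . BL.toChunks
```

Using BS.concat on the chunk list avoids the intercalate "" detour from the question's update.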
You may want to ask the maintainers of some packages if they'd be able to provide a text interface instead of, or in addition to, a bytestring interface.

Use the single function cs from the Data.String.Conversions module (string-conversions package).
It lets you convert between String, ByteString and Text (as well as ByteString.Lazy and Text.Lazy), depending on the input and the expected types.
You still have to call it, but you no longer have to worry about the respective types.
See this answer for usage example.

For what it's worth, I found these two helper functions to be quite useful:
{-# LANGUAGE OverloadedStrings #-} -- needed for the string literals in the example below
import qualified Data.ByteString.Char8 as BS
import qualified Data.Text as T
-- | Text to ByteString
tbs :: T.Text -> BS.ByteString
tbs = BS.pack . T.unpack
-- | ByteString to Text
bst :: BS.ByteString -> T.Text
bst = T.pack . BS.unpack
Example:
foo :: [BS.ByteString]
foo = ["hello", "world"]
bar :: [T.Text]
bar = bst <$> foo
baz :: [BS.ByteString]
baz = tbs <$> bar
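One caveat worth illustrating: Data.ByteString.Char8.pack keeps only the low 8 bits of each Char, so these helpers only round-trip characters up to U+00FF, and the bytes they produce are Latin-1 rather than UTF-8. The sketch below copies the two helpers and adds hypothetical UTF-8 based variants (tbsUtf8, bstUtf8) for comparison, assuming the bytes are meant to be UTF-8.

```haskell
import qualified Data.ByteString.Char8 as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The two helpers from the answer, unchanged.
tbs :: T.Text -> BS.ByteString
tbs = BS.pack . T.unpack

bst :: BS.ByteString -> T.Text
bst = T.pack . BS.unpack

-- Hypothetical UTF-8 based variants, assuming the bytes are UTF-8.
tbsUtf8 :: T.Text -> BS.ByteString
tbsUtf8 = TE.encodeUtf8

-- decodeUtf8 throws on invalid input; decodeUtf8' returns an Either instead.
bstUtf8 :: BS.ByteString -> T.Text
bstUtf8 = TE.decodeUtf8

-- U+03BB (Greek small lambda) does not fit into 8 bits.
lambda :: T.Text
lambda = T.pack "\955"
```

Here tbs lambda is a single truncated byte, so bst (tbs lambda) is no longer lambda, whereas the UTF-8 variants round-trip it as two bytes.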

Related

How can I save a variable as a bytestring?

I know this is a dumb question, but if I have this:
a :: B.ByteString
a = "a"
I get an error that says "Couldn't match type B.ByteString with type [Char]". I know what the problem is, but I don't know how to fix it. Could you help? Thanks.
Character string literals in Haskell, by default, are always treated as String, which is equivalent to [Char]. Most string-like data structures define a function called pack to convert from String, and the bytestring package is no exception (note that this is pack from Data.ByteString.Char8; the one in Data.ByteString converts from [Word8]).
import qualified Data.ByteString as B
import Data.ByteString.Char8 (pack)
a :: B.ByteString
a = pack "a"
However, GHC also supports an extension called OverloadedStrings. If you're willing to enable this, ByteString implements a typeclass called IsString. With this extension enabled, the type of a string literal like "a" is no longer [Char] and is instead forall a. IsString a => a (similar to how the type of numerical literals like 3 is forall a. Num a => a). This will happily specialize to ByteString if the type is in scope.
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
a :: B.ByteString
a = "a"
If you go this route, make sure you understand the proviso listed in the docs for this instance. For ASCII characters, it won't pose a problem, but if your string has Unicode characters outside the ASCII range, you need to be aware of it.
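To make the literal typing concrete, here is a small sketch (the name label is made up) in which one literal is used at three different string types. It sticks to ASCII, so the proviso about non-ASCII characters does not come into play.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import Data.String (IsString)
import qualified Data.Text as T

-- With OverloadedStrings, a literal has type `IsString a => a`.
-- `label` is a hypothetical name used only for this illustration.
label :: IsString a => a
label = "hello"

-- The same literal, specialized by the type annotation at each use site.
asString :: String
asString = label

asBytes :: B.ByteString
asBytes = label

asText :: T.Text
asText = label
```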

Why do Data.Binary instances of bytestring add the length of the bytestring as prefix

Looking at the put instances of the various ByteString types we find that the length of the bytestring is always prefixed in the binary file before writing it. For example here - https://hackage.haskell.org/package/binary-0.8.8.0/docs/src/Data.Binary.Class.html#put
Taking an example
instance Binary B.ByteString where
    put bs = put (B.length bs) -- Why this??
             <> putByteString bs
    get = get >>= getByteString
Is there any particular reason for doing this? And is the only way to write Bytestring without prefixing the length - creating our own newtype wrapper and having an instance for Binary?
Is there any particular reason for doing this?
The idea of get and put is that you can combine several objects. For example you can write:
write_func :: ByteString -> Char -> Put
write_func some_bytestring some_char = do
    put some_bytestring
    put some_char
You then want to define a function that can read the data back, and evidently you want the two functions to act together as an identity function: if the writer writes a certain ByteString and a certain Char, the read function should read back the same ByteString and Char.
The reader function should look similar to:
read_fun :: Get (ByteString, Char)
read_fun = do
    bs <- get
    c <- get
    return (bs, c)
but the problem is: where does a ByteString end? A following character such as 'A' could also be part of the ByteString. You thus need to somehow indicate where the ByteString ends. This can be done by saving the length, or by writing some marker at the end. In the case of a marker, you will need to "escape" the bytestring, such that it cannot contain the marker itself.
Either way, you need some mechanism to specify where the ByteString ends.
And is the only way to write Bytestring without prefixing the length - creating our own newtype wrapper and having an instance for Binary?
No, in fact it is already in the instance definition. If you want to write a ByteString without length, then you can use putByteString :: ByteString -> Put:
write_func :: ByteString -> Char -> Put
write_func some_bytestring some_char = do
    putByteString some_bytestring
    put some_char
but when reading the ByteString back, you will need to figure out how many bytes you have to read.
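To see the difference in the serialized output, here is a sketch using runPut and runGet from the binary package (the names withPrefix, withoutPrefix and readBack are invented for this example). binary's Int instance serializes the length as 8 bytes, so the prefixed encoding is 8 bytes longer; reading the unprefixed form back requires knowing the length from elsewhere.

```haskell
import Data.Binary (put)
import Data.Binary.Get (getByteString, runGet)
import Data.Binary.Put (putByteString, runPut)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- Hypothetical names, for illustration only.

-- Uses the Binary instance: length prefix followed by the raw bytes.
withPrefix :: B.ByteString -> BL.ByteString
withPrefix = runPut . put

-- Writes only the raw bytes, no length prefix.
withoutPrefix :: B.ByteString -> BL.ByteString
withoutPrefix = runPut . putByteString

-- Without a prefix, the reader must be told how many bytes to take.
readBack :: Int -> BL.ByteString -> B.ByteString
readBack n = runGet (getByteString n)
```

For a 3-byte input, withPrefix yields 11 bytes and withoutPrefix yields 3; readBack 3 recovers the original from the latter.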

How to combine Megaparsec with Text.Read (derived Read instance)

I want to use the derived instances of Read in the megaparsec module.
How can I use 'Text.Read.read' or 'Text.Read.readEither' in a 'Parser a' ?
It needs not to be fast, but easy to maintain and to extend.
The megaparsec module is for testing my application via CLI, so many different datatypes must be parsed.
It shall work in the following way:
import Text.Megaparsec
readableDatatype :: Read a => Parser a
readableDatatype =
    -- This is wrong, but describes how it shall work
    -- liftA read chunkToTokens
expr' :: Parser UserControlExpr
expr' = timeExpr
    <|> timeEventExpr
    <|> digiInExpr
    <|> quitExpr
digiInExpr :: Parser UserControlExpr
digiInExpr = do
    cmdword "digiIn"
    inElement <- (readableDatatype :: Parser TI_I)
    return $ UserDigiIn inElement
What do I have to write so that the three functions typecheck, especially readableDatatype?
You can use getInput :: MonadParsec e s m => m s and setInput :: MonadParsec e s m => s -> m () together with reads :: Read a => String -> [(a, String)] for that. getInput and setInput just get and set the input stream the parser is working on and reads takes a string and returns a list of possible parses together with the remaining unconsumed portions of the input. We also need to tell the parser the new offset in the input, otherwise error locations are wrong. We can do that using getOffset and setOffset.
-- For equality constraint (~)
{-# LANGUAGE TypeFamilies #-}
import Text.Megaparsec
import Text.Read (reads)
readableDatatype :: (Read a, MonadParsec e s m, s ~ String) => m a
readableDatatype = do
  input <- getInput
  offset <- getOffset
  choice $
    (\(a, input') -> a <$ setInput input'
                       <* setOffset (offset + length input - length input'))
    <$> reads input
If your input is something other than String you will have to convert between that and String after getInput and before setInput.
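Independently of megaparsec, the behaviour of reads that this parser relies on can be tried in isolation. The sketch below uses a made-up data type Dir and a made-up helper readPrefix; it returns the parsed value, the number of characters consumed (the same length arithmetic as in the offset update), and the leftover input.

```haskell
-- Hypothetical type with a derived Read instance.
data Dir = DirA | DirB
  deriving (Eq, Read, Show)

-- Parse a value from the front of the input using `reads`.
-- Returns the value, how many characters were consumed, and the rest.
-- Only the first candidate parse is used; Nothing means no parse.
readPrefix :: Read a => String -> Maybe (a, Int, String)
readPrefix input =
  case reads input of
    (value, rest) : _ -> Just (value, length input - length rest, rest)
    []                -> Nothing
```

For example, readPrefix "DirA Deactivated" gives Just (DirA, 4, " Deactivated"), while a misspelled constructor gives Nothing with no hint about what went wrong.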
The following is about performance concerns, so not really relevant to your problem, but maybe it is educational and it may be useful to others who may need a solution with good performance.
Converting the whole input between String and some other type all the time during parsing is a rather big performance bottleneck for larger input. Furthermore using length to calculate the new offset here is not very performant either.
To solve both of these problems we need some way to know how much of the input was actually consumed by the Read-parser, so that we can just drop that part from the original input instead of having to convert the whole unconsumed part back to the original input type. But the Read class does not have that. One could try to parse incrementally longer prefixes of the input, which may be faster in cases where the parses done using Read are short compared to the length of the entire input. You could also use unsafePerformIO to write to an IORef how much of the input was actually forced by the Read-parser, which would be the fastest but not so pretty solution.
I implemented the latter here. Feel free to use it, but be aware that it is not very well tested. It does however solve all the problems with the above approach.
That did it. Thank you! In the meantime I made a "conservative" solution to the problem by defining the constructors as strings and parsing them, without using read. That has the advantage that you get the impressive error messages of megaparsec, which tell you what symbols are missing.
Example with read:
1:8:
|
1 | digiIn TI_I_Signal1 DirA Dectivated
| ^
unknown parse error
(only an 'a' was missing in "Deactivated")
Example with a hand-written parser for the datatype:
1:19:
|
1 | digiIn TI_I_Signal1 Dectivated
| ^^^^^^^^
unexpected "Dectivat"
expecting "active", "inactive", '0', or '1'
I think I will use your code block in future datatypes.
Thank you very much!

Parsec returns [Char] instead of Text

I am trying to create a parser for a custom file format. In the format I am working with, some fields have a closing tag like so:
<SOL>
<DATE>0517
<YEAR>86
</SOL>
I am trying to grab the value between the </ and > and use it as part of the bigger parser.
I have come up with the code below. The trouble is, the parser returns [Char] instead of Text. I can pack each Char by doing fmap pack $ return r to get a text value out, but I was hoping type inference would save me from having to do this. Could someone give hints as to why I am getting back [Char] instead of Text, and how I can get back Text without having to manually pack the value?
{-# LANGUAGE NoMonomorphismRestriction #-}
{-# LANGUAGE OverloadedStrings #-}
import Data.Text
import Text.Parsec
import Text.Parsec.Text
-- |A closing tag is on its own line and is a "</" followed by some uppercase characters
-- followed by some '>'
closingTag = do
  _ <- char '\n'
  r <- between (string "</") (char '>') (many upper)
  return r
string has the type
string :: Stream s m Char => String -> ParsecT s u m String
(See here for documentation)
So getting a String back is exactly what's supposed to happen.
Type inference doesn't change types, it only infers them. String is a concrete type, so there's no way to infer Text for it.
What you could do, if you need this in a couple of places, is to write a function
text :: Stream s m Char => String -> ParsecT s u m Text
text = fmap pack . string
or even
string' :: (IsString a, Stream s m Char) => String -> ParsecT s u m a
string' = fmap fromString . string
Also, it doesn't matter in this example, but you'd probably want to import Data.Text qualified, since names like pack are used in a number of different modules.
As Ørjan Johansen correctly pointed out, string isn't actually the problem here, many upper is. The same principle applies though.
The reason you get [Char] here is that upper parses a Char and many turns that into a [Char]. I would write my own combinator along the lines of:
manyPacked = fmap pack . many
You could probably use type-level programming with type classes etc. to automatically choose between many and manyPacked depending on the expected return type, but I don't think that's worth it. (It would probably look a bit like Scala's CanBuildFrom.)
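Put together, a sketch of the closing-tag parser with such a combinator could look like the following. For simplicity it fixes the stream type to Text via the Parser alias from Text.Parsec.Text; manyPacked is the combinator suggested above, the example value is made up, and the rest mirrors the question's code.

```haskell
import Data.Text (Text)
import qualified Data.Text as T
import Text.Parsec (ParseError, between, char, many, parse, string, upper)
import Text.Parsec.Text (Parser)

-- Run a Char parser zero or more times and pack the result into Text.
manyPacked :: Parser Char -> Parser Text
manyPacked = fmap T.pack . many

-- A closing tag is on its own line: "</", uppercase letters, then '>'.
closingTag :: Parser Text
closingTag = char '\n' *> between (string "</") (char '>') (manyPacked upper)

-- Hypothetical sample input; the parser now yields Text directly.
example :: Either ParseError Text
example = parse closingTag "" (T.pack "\n</SOL>")
```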

How do I work with individual elements of a ByteString in Haskell

I need to write a function with the following type
replaceSubtrie :: SSTrie -> Data.Word.Word8 -> SSTrie -> SSTrie
replaceSubtrie trie base subtrie = ???
where depending on the value of base, the subtrie will be inserted into the trie in differing ways. SSTrie is my own data type and I know how to work with it, but I have no idea how to deal with the Word8 value.
base is a single "character" (for certain values of "character") taken from a ByteString. Specifically, it is the result of calling index on ByteString -- that's the only reason why I've declared it Word8.
I can't do pattern matching, as there's no Word8 constructor available. And I can't get guards to work because I don't know how to construct a Word8 constant to compare it against.
[edited] Jerome's suggestion worked. But more generally, are there any good articles out there showing how to work with ByteStrings (and other more low-level data)? Like, how could I have known that fact about Word8?
[edited - Question for Don Stewart]
Right now I've got it working with code like this
replaceSubtrie trie 0x41 subtrie = trie{ a=subtrie }
When I change it to this:
replaceSubtrie trie 'A' subtrie = trie{ a=subtrie }
I get an error:
Trie.hs:40:21:
Couldn't match expected type `Word8' with actual type `Char'
In the pattern: 'A'
In an equation for `replaceSubtrie':
replaceSubtrie trie 'A' subtrie = trie {a = subtrie}
I do have import qualified Data.ByteString.Char8 as C at the top of my file. What am I doing wrong?
I feel a bit silly looking up the ASCII value for 'A', but what the hell
You can simply import Data.ByteString.Char8 or Data.ByteString.Lazy.Char8, to get all the same functions, but permitting the use of character literals in patterns.
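To spell out both options in a self-contained sketch (describeWord8 and describeChar are made-up stand-ins for replaceSubtrie): Word8 is a numeric type, so numeric literals work in patterns and guards, while character literals only type-check once the value itself is a Char, for example when it comes from index in Data.ByteString.Char8 rather than Data.ByteString.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import Data.Char (ord)
import Data.Word (Word8)

-- Word8 has a Num instance, so numeric literals can be used in patterns.
describeWord8 :: Word8 -> String
describeWord8 0x41 = "capital A"
describeWord8 w
  | w == fromIntegral (ord 'B') = "capital B" -- build a Word8 from a Char
  | otherwise = "something else"

-- The Char8 module's index returns a Char, so character patterns type-check.
describeChar :: Char -> String
describeChar 'A' = "capital A"
describeChar _ = "something else"

-- Both modules share the same ByteString type; only the element view differs.
sample :: B.ByteString
sample = C.pack "AB"
```

So in the question, importing the Char8 module alone is not enough: the parameter type has to change from Word8 to Char, and the value has to be obtained with C.index.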

Resources