Best way to convert between [Char] and [Word8]? - string

I'm new to Haskell and I'm trying to use a pure SHA1 implementation in my app (Data.Digest.Pure.SHA) with a JSON library (AttoJSON).
AttoJSON uses Data.ByteString.Char8 bytestrings, SHA uses Data.ByteString.Lazy bytestrings, and some of my string literals in my app are [Char].
Haskell Prime's wiki page on Char types seems to indicate this is something still being worked out in the Haskell language/Prelude.
And this blogpost on unicode support lists a few libraries but its a couple years old.
What is the current best way to convert between these types, and what are some of the tradeoffs?
Thanks!

Here's what I have, without using ByteString's internal functions.
import Data.ByteString as S (ByteString, unpack)
import Data.ByteString.Char8 as C8 (pack)
import Data.Char (chr)
strToBS :: String -> S.ByteString
strToBS = C8.pack
bsToStr :: S.ByteString -> String
bsToStr = map (chr . fromEnum) . S.unpack
S.unpack on a ByteString gives us [Word8], we apply (chr . fromEnum) which converts any Enum type to a character. By composing all of them together we'll the function we want!

For conversion between Char8 and Word8 you should be able to use toEnum/fromEnum conversions, as they represent the same data.
For Char and Strings you might be able to get away with Data.ByteString.Char8.pack/unpack or some sort of combination of map, toEnum and fromEnum, but that throws out data if you're using anything other than ASCII.
For strings which could contain more than just ASCII a popular choice is UTF8 encoding. I like the utf8-string package for this:
http://hackage.haskell.org/packages/archive/utf8-string/0.3.6/doc/html/Codec-Binary-UTF8-String.html

Char8 and normal bytestrings are the same thing, just with different interfaces depending on which module you import. Mainly you want to convert between strict and lazy bytestrings, for which you use toChunks and fromChunks.
To put chars into bytestrings, use pack.
Also note that if your chars include codepoints which multibyte representations in UTF-8, then there will be problems.

Note : This answers the question in a very specific case (calling functions on hard-coded strings).
This may seem a minor problem because conversion functions exist as detailed in previous answers.
But I wanted a method to reduce administrative code, i.e. the code that you have to write just to get functions working together.
The solution to reducing type-handling code for strings is to use the OverloadedStrings pragma and import the relevant module(s)
{-# LANGUAGE OverloadedStrings #-}
module Dummy where
import Data.ByteString.Lazy.Char8 (ByteString, append)
bslHandling :: ByteString -> ByteString
bslHandling = (append myWord8List)
myWord8List = "I look like a String, but I'm actually a ByteString"
Note : myWordList type is inferred by the compiler.
If you do not use it in bslHandling, then the above declaration will yeld a classical [Char] type.
It does not solve the problem of passing from one specific type to another
Hope it helps

Maybe you want to do this:
import Data.ByteString.Internal (unpackBytes)
import Data.ByteString.Char8 (pack)
import GHC.Word (Word8)
strToWord8s :: String -> [Word8]
strToWord8s = unpackBytes . pack

Assuming that Char and Word8 are the same,
import Data.Word ( Word8 )
import Unsafe.Coerce ( unsafeCoerce )
toWord8 :: Char -> Word8
toWord8 = unsafeCoerce
strToWord8 :: String -> Word8
strToWord8 = map toWord8

Related

Inverse of `Data.Text.Encoding.decodeLatin1`?

Is there a function f :: Text -> Maybe ByteString such that forall x:
f (decodeLatin1 x) == Just x
Note, decodeLatin1 has the signature:
decodeLatin1 :: ByteString -> Text
I'm concerned that encodeUtf8 is not what I want, as I'm guessing what it does is just dump the UTF-8 string out as a ByteString, not reverse the changes that decodeLatin1 made on the way in to characters in the upper half of the character set.
I understand that f has to return a Maybe, because in general there's Unicode characters that aren't in the Latin character set, but I just want this to round trip at least, in that if we start with a ByteString we should get back to it.
DISCLAIMER: consider this a long comment rather than a solution, because I haven't tested.
I think you can do it with witch library. It is a general purpose type converter library with a fair amount of type safety. There is a type class called TryFrom to perform conversion between types that might fail to cast.
Luckily witch provides conversions from/to encondings too, having an instance TryFrom Text (ISO_8859_1 ByteString), meaning that you can convert between Text and latin1 encoded ByteString. So I think (not tested!!) this should work
{-# LANGUAGE TypeApplications #-}
import Witch (tryInto, ISO_8859_1)
import Data.Tagged (Tagged(unTagged))
f :: Text -> Maybe ByteString
f s = case tryInto #(ISO_8859_1 ByteString) s of
Left err -> Nothing
Right bs -> Just (unTagged bs)
Notice that tryInto returns a Either TryFromException s, so if you want to handle errors you can do it with Either. Up to you.
Also, witch docs points out that this conversion is done via String type, so probably there is an out-of-the-box solution without the need of depending on witch package. I don't know such a solution, and looking to the source code hasn't helped
Edit:
Having read witch source code aparently this should work
import qualified Data.Text as T
import Data.Char (isLatin1)
import qualified Data.ByteString.Char8 as C
f :: Text -> Maybe ByteString
f t = if allCharsAreLatin then Just (C.pack str) else Nothing
where str = T.unpack t
allCharsAreLatin = all isLatin1 str
The latin1 encoding is pretty damn simple -- codepoint X maps to byte X, whenever that's in range of a byte. So just unpack and repack immediately.
import Control.Monad
import qualified Data.Text as T
import qualified Data.ByteString.Char8 as BS
latin1EncodeText :: T.Text -> Maybe BS.ByteString
latin1EncodeText t = BS.pack (T.unpack t) <$ guard (T.all (<'\256') t)
It's possible to avoid the intermediate String, but you should probably make sure this is your bottleneck before trying for that.

Haskell type error occurs when compiling two code snippets together that separately raise no error

I am having a type issue with Haskell, the program below throws the compile time error:
Couldn't match expected type ‘bytestring-0.10.8.2:Data.ByteString.Lazy.Internal.ByteString’ with actual type ‘Text’
Program is:
{-# LANGUAGE OverloadedStrings #-}
module Main where
...
import Control.Concurrent (MVar, newMVar, modifyMVar_, modifyMVar, readMVar)
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Network.WebSockets as WS
import Data.Map (Map)
import Data.Aeson (decode)
...
application :: MVar ServerState -> WS.ServerApp
application state pending = do
conn <- WS.acceptRequest pending
msg <- WS.receiveData conn
-- EITHER this line can be included
T.putStrLn msg
-- OR these two lines, but not both
decodedObject <- return (decode msg :: Maybe (Map String Int))
print decodedObject
...
It seems to me that the basic issue is that putStrLn expects Text whereas decode expects Bytetring.
What I don't get is why I can run this section of the code:
T.putStrLn msg
Or I can run this section of the code:
decodedObject <- return (decode msg :: Maybe (Map String Int))
print decodedObject
But not both together.
What is the proper way to resolve this issue in the program?
I guess this is something like Type Coercion, or Type Inference, or what would be Casting in other languages. The problem is I don't know how to phrase the problem clearly enough to look it up.
It's as if msg can be one of a number of Types, but as soon as it is forced to be one Type, it can't then be another...
I'm also not sure if this overlaps with Overloaded strings. I have the pragma and am compiling with -XOverloadedStrings
I'm quite a newbie, so hope this is a reasonable question.
Any advice gratefully received! Thanks
This is because WS.receiveData is polymorphic on its return type:
receiveData :: WebSocketsData a => Connection -> IO a
it only needs to be WebSocketsData a instance, which both Text and ByteString are. So the compiler just infers the type.
I suggest you just assume it's a ByteString, and convert in Text upon the putStrLn usage.
Thanks to everyone for their advice. My final understanding is that any value in Haskell can be polymorphic until you force it to settle on a type, at which point it can't be any other type (stupid, but I hadn't seen a clear example of that before).
In my example, WS.receiveData returns polymorphic IO a where a is an instance of class WebsocketData which itself is parameterised by a type which can be either Text or Bytestring.
Aeson decode expects a (lazy) Bytestring. Assuming that we settle on (lazy) Bytestring for our a, this means the first line that I mentioned before needs to become:
T.putStrLn $ toStrict $ decodeUtf8 msg
to convert the lazy ByteString to a strict Text. I can do this so long as I know the incoming websocket message is UTF8 encoded.
I may have got some wording wrong there, but think that's basically it.

Most elegant way to do string conversion in Haskell

See this related SO question: Automatic conversion between String and Data.Text in haskell
Given a string of type Text, I want to produce a lazy bytestring.
This works, but I wondered whether it's optimal, given the fact that both Text and the lazy bytestring have the property of being "string-like" and I still use the not-generic unpack:
import qualified Data.ByteString.Lazy (ByteString)
import Data.Text (Text, unpack)
import Data.String (fromString)
import Data.Text (unpack)
convert :: IsString str => Text -> str
convert = fromString . unpack
I found the package string-conversions that offers the polymorphic function
convertString :: a -> b
as part of the ConvertibleStrings typeclass.
While it works fine, I am suspicious: Why would I need an extra package for that? Couldn't there be already a typeclass like IsString that offers a toString method and in combination a universal convert function fromString . toString?
[Ok, while I was editing my question, a possible answer dawned to me]
On the hackage-page of string-conversions it says:
Assumes UTF-8 encoding for both types of ByteStrings.
So there are assumptions that go along with conversions and a universal conversion of string-like types might not be desirable.
Also performance probably depends on the input and output types and a universal conversion would pretend that it's all the same.
So my take on best practice is now this, being explicit rather than polymorphic:
import Data.ByteString.Lazy (ByteString)
import qualified Data.ByteString.Lazy as ByteString
import qualified Data.Text.Encoding as Text
convert :: Text -> ByteString
convert = ByteString.fromStrict . Text.encodeUtf8

differences between Lazy.ByteString and Lazy.Char8.ByteString

I am bit confused over the codes in real world haskell
import qualified Data.ByteString.Lazy.Char8 as L8
import qualified Data.ByteString.Lazy as L
matchHeader :: L.ByteString -> L.ByteString -> Maybe L.ByteString
matchHeader prefix str
| prefix `L8.isPrefixOf` str
= Just (L8.dropWhile isSpace (L.drop (L.length prefix) str))
| otherwise
= Nothing
It seems L and L8 can be used interchangeably somewhere in this function, compiles fine if I replace L with L8 especially for the type L.ByteString and L8.ByteString, I saw in hackage, they're linked to the same source, does that mean Data.ByteString.Lazy.Char8.ByteString is the same as Data.ByteString.Lazy.ByteString ? Why L8.isPrefixOf is used here but not L.isPrefixOf?
That's funny, I've used all the ByteStrings but never noticed (until you mentioned) it that the Char8 and Word8 versions are internally the same data type.
Once mentioned though, I had to go and look at the code.... The following line in Data/ByteString/Lazy/Char8.hs shows that not only are the data types the same, but many of the functions are reexported identically....
-- Functions transparently exported
import Data.ByteString.Lazy
(fromChunks, toChunks, fromStrict, toStrict
,empty,null,length,tail,init,append,reverse,transpose,cycle
,concat,take,drop,splitAt,intercalate,isPrefixOf,group,inits,tails,copy
,hGetContents, hGet, hPut, getContents
,hGetNonBlocking, hPutNonBlocking
,putStr, hPutStr, interact)
So it would seem that most of Data.ByteString.(Lazy.)?Char8 are just a convenience wrapper around Data.ByteString(.Lazy)?. This also explains to me why show has always created stringy output for Word8 ByteStrings.
Of course some stuff does differ, as you can see when you try to create a ByteString-
B.pack "abcd" -- This fails
B.pack [65, 66, 67, 68] -- output is "ABCD"
B8.pack "abcd" -- This works
According to the documentation, both Lazy.ByteString and Lazy.Char.ByteString are a space-efficient representation of a Word8 vector, supporting many efficient operations. So, internally they seem to be same and you can use them interchangeably.
But Lazy.Char.ByteString has additionally these characteristics:
All Chars will be truncated to 8 bits (So be careful!)
The Char8 interface to bytestrings provides an instance of IsString for the ByteString type, enabling you to use string literals, and have them implicitly packed to ByteStrings. (you should enable OverloadedStrings extension for this)

Many types of String (ByteString)

I wish to compress my application's network traffic.
According to the (latest?) "Haskell Popularity Rankings", zlib seems to be a pretty popular solution. zlib's interface uses ByteStrings:
compress :: ByteString -> ByteString
decompress :: ByteString -> ByteString
I am using regular Strings, which are also the data types used by read, show, and Network.Socket:
sendTo :: Socket -> String -> SockAddr -> IO Int
recvFrom :: Socket -> Int -> IO (String, Int, SockAddr)
So to compress my strings, I need some way to convert a String to a ByteString and vice-versa.
With hoogle's help, I found:
Data.ByteString.Char8 pack :: String -> ByteString
Trying to use it:
Prelude Codec.Compression.Zlib Data.ByteString.Char8> compress (pack "boo")
<interactive>:1:10:
Couldn't match expected type `Data.ByteString.Lazy.Internal.ByteString'
against inferred type `ByteString'
In the first argument of `compress', namely `(pack "boo")'
In the expression: compress (pack "boo")
In the definition of `it': it = compress (pack "boo")
Fails, because (?) there are different types of ByteString ?
So basically:
Are there several types of ByteString? What types, and why?
What's "the" way to convert Strings to ByteStrings?
Btw, I found that it does work with Data.ByteString.Lazy.Char8's ByteString, but I'm still intrigued.
There are two kinds of bytestrings: strict (defined in Data.Bytestring.Internal) and lazy (defined in Data.Bytestring.Lazy.Internal). zlib uses lazy bytestrings, as you've discovered.
The function you're looking for is:
import Data.ByteString as BS
import Data.ByteString.Lazy as LBS
lazyToStrictBS :: LBS.ByteString -> BS.ByteString
lazyToStrictBS x = BS.concat $ LBS.toChunks x
I expect it can be written more concisely without the x. (i.e. point-free, but I'm new to Haskell.)
A more efficient mechanism might be to switch to a full bytestring-based layer:
network.bytestring for bytestring sockets
lazy bytestrings for compressoin
binary of bytestring-show to replace Show/Read

Resources