differences between Lazy.ByteString and Lazy.Char8.ByteString - haskell

I am a bit confused by this code from Real World Haskell:
import qualified Data.ByteString.Lazy.Char8 as L8
import qualified Data.ByteString.Lazy as L
import Data.Char (isSpace)

matchHeader :: L.ByteString -> L.ByteString -> Maybe L.ByteString
matchHeader prefix str
    | prefix `L8.isPrefixOf` str
        = Just (L8.dropWhile isSpace (L.drop (L.length prefix) str))
    | otherwise
        = Nothing
It seems L and L8 can be used interchangeably in parts of this function; it compiles fine if I replace L with L8, especially for the types L.ByteString and L8.ByteString. I saw on Hackage that they link to the same source. Does that mean Data.ByteString.Lazy.Char8.ByteString is the same as Data.ByteString.Lazy.ByteString? And why is L8.isPrefixOf used here rather than L.isPrefixOf?

That's funny, I've used all the ByteStrings but never noticed (until you mentioned it) that the Char8 and Word8 versions are internally the same data type.
Once it was mentioned, though, I had to go and look at the code. The following lines in Data/ByteString/Lazy/Char8.hs show that not only are the data types the same, but many of the functions are re-exported identically:
-- Functions transparently exported
import Data.ByteString.Lazy
(fromChunks, toChunks, fromStrict, toStrict
,empty,null,length,tail,init,append,reverse,transpose,cycle
,concat,take,drop,splitAt,intercalate,isPrefixOf,group,inits,tails,copy
,hGetContents, hGet, hPut, getContents
,hGetNonBlocking, hPutNonBlocking
,putStr, hPutStr, interact)
So it would seem that most of Data.ByteString.(Lazy.)?Char8 is just a convenience wrapper around Data.ByteString(.Lazy)?. This also explains to me why show has always created stringy output for Word8 ByteStrings.
Of course some things do differ, as you can see when you try to create a ByteString (here B is Data.ByteString and B8 is Data.ByteString.Char8):
B.pack "abcd" -- This fails
B.pack [65, 66, 67, 68] -- output is "ABCD"
B8.pack "abcd" -- This works

According to the documentation, both Lazy.ByteString and Lazy.Char8.ByteString are a space-efficient representation of a Word8 vector, supporting many efficient operations. So internally they are the same type, and you can use them interchangeably.
But the Char8 module additionally has these characteristics:
All Chars will be truncated to 8 bits (so be careful!)
The Char8 interface to bytestrings provides an instance of IsString for the ByteString type, enabling you to use string literals and have them implicitly packed to ByteStrings (you need to enable the OverloadedStrings extension for this).
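Both points can be seen in a few lines (a minimal sketch; the names greeting and truncated are mine):

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy.Char8 as L8

-- IsString at work: the literal is packed implicitly.
greeting :: L8.ByteString
greeting = "hello"

-- Truncation at work: 'λ' is U+03BB, and pack keeps only the low 8 bits,
-- so the result is the single byte 0xBB, not a UTF-8 encoding of 'λ'.
truncated :: L8.ByteString
truncated = L8.pack "λ"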

Related

Inverse of `Data.Text.Encoding.decodeLatin1`?

Is there a function f :: Text -> Maybe ByteString such that forall x:
f (decodeLatin1 x) == Just x
Note, decodeLatin1 has the signature:
decodeLatin1 :: ByteString -> Text
I'm concerned that encodeUtf8 is not what I want, as I'm guessing what it does is just dump the UTF-8 string out as a ByteString, not reverse the changes that decodeLatin1 made on the way in to characters in the upper half of the character set.
I understand that f has to return a Maybe, because in general there's Unicode characters that aren't in the Latin character set, but I just want this to round trip at least, in that if we start with a ByteString we should get back to it.
DISCLAIMER: consider this a long comment rather than a solution, because I haven't tested.
I think you can do it with the witch library. It is a general-purpose type-converter library with a fair amount of type safety. There is a type class called TryFrom for conversions between types that might fail.
Luckily, witch provides conversions from/to encodings too, including an instance TryFrom Text (ISO_8859_1 ByteString), meaning that you can convert between Text and a latin1-encoded ByteString. So I think (not tested!!) this should work:
{-# LANGUAGE TypeApplications #-}
import Witch (tryInto, ISO_8859_1)
import Data.Tagged (Tagged (unTagged))
import Data.Text (Text)
import Data.ByteString (ByteString)

f :: Text -> Maybe ByteString
f s = case tryInto @(ISO_8859_1 ByteString) s of
    Left _   -> Nothing
    Right bs -> Just (unTagged bs)
Notice that tryInto returns an Either (TryFromException ...) value, so if you want to handle the error you can do it with Either. Up to you.
Also, the witch docs point out that this conversion goes via the String type, so there is probably an out-of-the-box solution without depending on the witch package. I don't know of one, and looking at the source code hasn't helped.
Edit:
Having read the witch source code, apparently this should work:
import qualified Data.Text as T
import Data.Text (Text)
import Data.Char (isLatin1)
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as C

f :: Text -> Maybe ByteString
f t = if allCharsAreLatin then Just (C.pack str) else Nothing
  where
    str = T.unpack t
    allCharsAreLatin = all isLatin1 str
The latin1 encoding is pretty damn simple -- codepoint X maps to byte X, whenever that's in range of a byte. So just unpack and repack immediately.
import Control.Monad
import qualified Data.Text as T
import qualified Data.ByteString.Char8 as BS

latin1EncodeText :: T.Text -> Maybe BS.ByteString
latin1EncodeText t = BS.pack (T.unpack t) <$ guard (T.all (< '\256') t)
It's possible to avoid the intermediate String, but you should probably make sure this is your bottleneck before trying for that.
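If it ever does matter, here is one possible shape for it (an untested sketch; latin1EncodeText' is my own name), building the bytes directly with a Builder:

import qualified Data.Text as T
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.ByteString.Builder (toLazyByteString, word8)
import Data.Char (ord)

latin1EncodeText' :: T.Text -> Maybe BS.ByteString
latin1EncodeText' t
    | T.all (< '\256') t =
        -- Emit each codepoint as a single byte; safe after the range check.
        Just . BL.toStrict . toLazyByteString $
            T.foldr (\c b -> word8 (fromIntegral (ord c)) <> b) mempty t
    | otherwise = Nothing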

Haskell type error occurs when compiling two code snippets together that separately raise no error

I am having a type issue with Haskell; the program below throws the compile-time error:
Couldn't match expected type ‘bytestring-0.10.8.2:Data.ByteString.Lazy.Internal.ByteString’ with actual type ‘Text’
Program is:
{-# LANGUAGE OverloadedStrings #-}
module Main where
...
import Control.Concurrent (MVar, newMVar, modifyMVar_, modifyMVar, readMVar)
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Network.WebSockets as WS
import Data.Map (Map)
import Data.Aeson (decode)
...
application :: MVar ServerState -> WS.ServerApp
application state pending = do
    conn <- WS.acceptRequest pending
    msg <- WS.receiveData conn
    -- EITHER this line can be included
    T.putStrLn msg
    -- OR these two lines, but not both
    decodedObject <- return (decode msg :: Maybe (Map String Int))
    print decodedObject
...
It seems to me that the basic issue is that putStrLn expects Text whereas decode expects ByteString.
What I don't get is why I can run this section of the code:
T.putStrLn msg
Or I can run this section of the code:
decodedObject <- return (decode msg :: Maybe (Map String Int))
print decodedObject
But not both together.
What is the proper way to resolve this issue in the program?
I guess this is something like type coercion, or type inference, or what would be casting in other languages. The problem is that I don't know how to phrase the problem clearly enough to look it up.
It's as if msg can be one of a number of Types, but as soon as it is forced to be one Type, it can't then be another...
I'm also not sure if this overlaps with Overloaded strings. I have the pragma and am compiling with -XOverloadedStrings
I'm quite a newbie, so hope this is a reasonable question.
Any advice gratefully received! Thanks
This is because WS.receiveData is polymorphic in its return type:
receiveData :: WebSocketsData a => Connection -> IO a
It only needs the result to be an instance of WebSocketsData, which both Text and ByteString are, so the compiler just infers the type from the surrounding code.
I suggest you just assume it's a ByteString, and convert to Text at the putStrLn usage.
Thanks to everyone for their advice. My final understanding is that any value in Haskell can be polymorphic until you force it to settle on a type, at which point it can't be any other type (stupid, but I hadn't seen a clear example of that before).
In my example, WS.receiveData returns a polymorphic IO a, where a is an instance of the class WebSocketsData and can be either Text or ByteString.
Aeson's decode expects a lazy ByteString. Assuming that we settle on a lazy ByteString for our a, this means the first line that I mentioned before needs to become:
T.putStrLn $ toStrict $ decodeUtf8 msg
to convert the lazy ByteString to a strict Text. I can do this so long as I know the incoming websocket message is UTF8 encoded.
I may have got some wording wrong there, but think that's basically it.
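Putting it together, a minimal sketch of the relevant part (handleMessage is my own name; the rest follows the snippets above):

import qualified Data.Text.IO as T
import Data.Text.Lazy (toStrict)
import Data.Text.Lazy.Encoding (decodeUtf8)
import Data.Aeson (decode)
import Data.Map (Map)
import qualified Network.WebSockets as WS

handleMessage :: WS.Connection -> IO ()
handleMessage conn = do
    -- decode fixes msg to a lazy ByteString, so the conversion to Text
    -- must now be explicit (and assumes the message is valid UTF-8).
    msg <- WS.receiveData conn
    T.putStrLn (toStrict (decodeUtf8 msg))
    let decodedObject = decode msg :: Maybe (Map String Int)
    print decodedObject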

Haskell requires more memory than Python when I read map from file. Why?

I have this simple code in Python:
input = open("baseforms.txt", "r", encoding='utf8')
S = {}
for i in input:
    words = i.split()
    S.update({j: words[0] for j in words})
print(S.get("sometext", "not found"))
print(len(S))
It requires 300 MB to run; "baseforms.txt" is 123 MB in size.
I've written the same code in Haskell:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map as M
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Text.Lazy.Encoding(decodeUtf8)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as I
import Control.Monad(liftM)
main = do
    text <- B.readFile "baseforms.txt"
    let m = (M.fromList . concatMap (parseLine . decodeUtf8)) (B.lines text)
    print (M.lookup "sometext" m)
    print (M.size m)
  where
    parseLine line = let base:forms = T.words line in [(f, base) | f <- forms]
It requires 544 MB and it's slower than the Python version. Why? Is it possible to optimise the Haskell version?
There is a lot happening in the Haskell version that's not happening in the Python version.
readFile uses lazy IO, which is a bit weird in general. I would generally avoid lazy IO.
The file, as a bytestring, is broken into lines which are then decoded as UTF-8. This seems a little unnecessary, given the existence of Text IO functions (see the sketch after this list).
The Haskell version is using a tree (Data.Map) whereas the Python version is using a hash table.
The strings are all lazy, which is probably not necessary if they're relatively short. Lazy strings have a couple words of overhead per string, which can add up. You could fuse the lazy strings, or you could read the file all at once, or you could use something like conduit.
GHC uses a copying collector, whereas the default Python implementation uses malloc() with reference counting and the occasional GC. This fact alone can account for large differences in memory usage, depending on your program.
Who knows how many thunks are getting created in the Haskell version.
It's unknown whether you've enabled optimizations.
It's unknown how much slower the Haskell version is.
We don't have your data file so we can't really test it ourselves.
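For instance, a minimal sketch of the second point above, reading with the lazy Text IO functions directly instead of decoding a ByteString (untested; parseLine here is made total on empty lines):

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map as M
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO

main :: IO ()
main = do
    text <- TIO.readFile "baseforms.txt"
    let m = M.fromList (concatMap parseLine (T.lines text))
    print (M.lookup "sometext" m)
    print (M.size m)
  where
    -- Pair every form on a line with its base form; skip empty lines.
    parseLine line = case T.words line of
        base : forms -> [(f, base) | f <- forms]
        []           -> []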
It's a bit late, but I studied this a little and think Dietrich Epp's account is right, but can be simplified a little. Notice that there doesn't seem to be any real python programming going on in the python file: it is orchestrating a very simple sequence of calls to C string operations and then to a C hash table implementation. (This is often a problem with really simple python v. Haskell benchmarks.) The Haskell, by contrast, is building an immense persistent Map, which is a fancy tree. So the main points of opposition here are C vs Haskell, and hashtable-with-destructive-update vs persistent map. Since there is little overlap in the input file, the tree you are constructing includes all the information in the input string, some of it repeated, and then rearranged with a pile of Haskell constructors. This is I think the source of the alarm you are experiencing, but it can be explained.
Compare these two files, one using ByteString:
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
main = do
    m <- fmap proc (B.readFile "baseforms.txt")
    print (M.lookup (B.pack "sometext") m)
    print (M.size m)

proc = M.fromList . concatMap (\(a:bs) -> map (flip (,) a) bs)
                  . map B.words . B.lines
and the other a Text-ified equivalent:
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
import Data.Text.Encoding(decodeUtf8)
import qualified Data.Text as T
main = do
    m <- fmap proc (B.readFile "baseforms.txt")
    print (M.lookup (T.pack "sometext") m)
    print (M.size m)

proc = M.fromList . concatMap (\(a:bs) -> map (flip (,) a) bs)
                  . map T.words . T.lines . decodeUtf8
On my machine, the python/C takes just under 6 seconds, the bytestring file takes 8 seconds, and the text file just over 10.
The bytestring implementation seems to use a bit more memory than the python, the text implementation distinctly more. The text implementation takes more time because, of course, it adds a conversion to text and then uses text operations to break the string and text comparisons to build the map.
Here is a go at analyzing the memory phenomena in the text case. First we have the bytestring in memory (130m). Once the text is constructed (~250m, to judge unscientifically from what's going on in top), the bytestring is garbage collected while we construct the tree. In the end the text tree (~380m it looks like) uses more memory than the bytestring tree (~260m) because the text fragments in the tree are bigger. The program as a whole uses more because the text held in memory during the tree construction is itself bigger. To put it crudely: each bit of white-space is being turned into a tree constructor and two text constructors, together with the text version of whatever the first 'word' of the line was and the text representation of whatever the next word is. The weight of the constructors seems in either case to be about 130m, so at the last moment of the construction of the tree we are using something like 130m + 130m + 130m = 390m in the bytestring case, and 250m + 130m + 250m = 630m in the text case.

Best way to convert between [Char] and [Word8]?

I'm new to Haskell and I'm trying to use a pure SHA1 implementation in my app (Data.Digest.Pure.SHA) with a JSON library (AttoJSON).
AttoJSON uses Data.ByteString.Char8 bytestrings, SHA uses Data.ByteString.Lazy bytestrings, and some of my string literals in my app are [Char].
Haskell Prime's wiki page on Char types seems to indicate this is something still being worked out in the Haskell language/Prelude.
And this blog post on Unicode support lists a few libraries, but it's a couple of years old.
What is the current best way to convert between these types, and what are some of the tradeoffs?
Thanks!
Here's what I have, without using ByteString's internal functions.
import Data.ByteString as S (ByteString, unpack)
import Data.ByteString.Char8 as C8 (pack)
import Data.Char (chr)

strToBS :: String -> S.ByteString
strToBS = C8.pack

bsToStr :: S.ByteString -> String
bsToStr = map (chr . fromEnum) . S.unpack
S.unpack on a ByteString gives us [Word8]; we then apply (chr . fromEnum), which converts any Enum value to a character. Composing them all together gives us the function we want!
For conversion between Char8 and Word8 you should be able to use toEnum/fromEnum conversions, as they represent the same data.
For Char and Strings you might be able to get away with Data.ByteString.Char8.pack/unpack or some sort of combination of map, toEnum and fromEnum, but that throws out data if you're using anything other than ASCII.
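For single values, that route looks like this (a sketch; note that toEnum errors out, rather than truncates, on codepoints above 255):

import Data.Word (Word8)

-- Total: every Word8 is a valid Char.
word8ToChar :: Word8 -> Char
word8ToChar = toEnum . fromEnum

-- Partial: throws for codepoints above 255.
charToWord8 :: Char -> Word8
charToWord8 = toEnum . fromEnum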
For strings which could contain more than just ASCII a popular choice is UTF8 encoding. I like the utf8-string package for this:
http://hackage.haskell.org/packages/archive/utf8-string/0.3.6/doc/html/Codec-Binary-UTF8-String.html
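With it, encoding and decoding are one-liners (a sketch using Codec.Binary.UTF8.String from that package):

import Codec.Binary.UTF8.String (encode, decode)
import Data.Word (Word8)

-- Multibyte codepoints become several Word8s.
utf8Bytes :: [Word8]
utf8Bytes = encode "héllo"

-- And back again, losslessly.
roundTrip :: String
roundTrip = decode utf8Bytes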
Char8 and normal bytestrings are the same thing, just with different interfaces depending on which module you import. Mainly you want to convert between strict and lazy bytestrings, for which you use toChunks and fromChunks.
To put chars into bytestrings, use pack.
Also note that if your chars include codepoints which have multibyte representations in UTF-8, then there will be problems.
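The strict/lazy conversion mentioned above looks like this (a minimal sketch; newer bytestring versions also offer fromStrict and toStrict):

import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L

-- A lazy ByteString is a sequence of strict chunks.
strictToLazy :: S.ByteString -> L.ByteString
strictToLazy s = L.fromChunks [s]

lazyToStrict :: L.ByteString -> S.ByteString
lazyToStrict = S.concat . L.toChunks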
Note: this answers the question in a very specific case (calling functions on hard-coded strings).
This may seem a minor problem, because conversion functions exist, as detailed in previous answers.
But I wanted a method to reduce administrative code, i.e. the code that you have to write just to get functions working together.
The solution to reducing type-handling code for strings is to use the OverloadedStrings pragma and import the relevant module(s):
{-# LANGUAGE OverloadedStrings #-}
module Dummy where

import Data.ByteString.Lazy.Char8 (ByteString, append)

bslHandling :: ByteString -> ByteString
bslHandling = append myWord8List

myWord8List = "I look like a String, but I'm actually a ByteString"
Note: myWord8List's type is inferred by the compiler.
If you do not use it in bslHandling, then the above declaration will yield a classical [Char] type.
It does not solve the problem of passing from one specific type to another, though.
Hope it helps.
Maybe you want to do this:
import Data.ByteString.Internal (unpackBytes)
import Data.ByteString.Char8 (pack)
import GHC.Word (Word8)

strToWord8s :: String -> [Word8]
strToWord8s = unpackBytes . pack
Assuming that Char and Word8 are the same,
import Data.Word (Word8)
import Unsafe.Coerce (unsafeCoerce)

-- Note: unsafeCoerce bypasses all checks; this is only "safe" for
-- codepoints that fit in one byte.
toWord8 :: Char -> Word8
toWord8 = unsafeCoerce

strToWord8 :: String -> [Word8]
strToWord8 = map toWord8

Mysterious word ("LPS") appears in a list of Haskell output

I am new to Haskell and trying to fiddle with some test cases I usually run into in the real world. Say I have the text file "foo.txt" which contains the following:
45.4 34.3 377.8
33.2 98.4 456.7
99.1 44.2 395.3
I am trying to produce the output
[[45.4,34.3,377.8],[33.2,98.4,456.7],[99.1,44.2,395.3]]
My code is below, but I'm getting some bogus "LPS" in the output... not sure what it represents.
import qualified Data.ByteString.Lazy.Char8 as BStr
import qualified Data.Map as Map
readDatafile = map BStr.words . BStr.lines

testFunc path = do
    contents <- BStr.readFile path
    print (readDatafile contents)
When invoked with testFunc "foo.txt", the output is
[[LPS ["45.4"],LPS ["34.3"],LPS ["377.8"]],[LPS ["33.2"],LPS ["98.4"],LPS ["456.7"]],[LPS ["99.1"],LPS ["44.2"],LPS ["395.3"]]]
Any help is appreciated! Thanks. PS: Using ByteString as this will be used on massive files in the future.
EDIT:
I am also puzzled as to why the output list is grouped as above (with each number wrapped in []), when in ghci the line below gives a different arrangement.
*Main> (map words . lines) "45.4 34.3 377.8\n33.2 98.4 456.7\n99.1 44.2 395.3"
[["45.4","34.3","377.8"],["33.2","98.4","456.7"],["99.1","44.2","395.3"]]
What you're seeing is indeed a constructor. When you read the file, the result is of course a list of lists of ByteStrings, but what you want is a list of lists of Floats.
What you could do:
readDatafile :: BStr.ByteString -> [[Float]]
readDatafile = map (map (read . BStr.unpack) . BStr.words) . BStr.lines
This unpacks each ByteString (i.e. converts it to a String); read then converts that String to a Float.
Not sure if using bytestrings here even helps your performance, though.
This is an indication of the internal lazy bytestring representation type pre-1.4.4.3 (search the page for "LPS"). LPS is a constructor.
readDatafile is returning a [[ByteString]], and what you are seeing is the 'packed' representation of all those characters you read.
readDatafile = map (map BStr.unpack . BStr.words) . BStr.lines
Here's an example ghci run demonstrating the problem. My output is different than yours because I'm using GHC 6.10.4:
*Data.ByteString.Lazy.Char8> let myString = "45.4"
*Data.ByteString.Lazy.Char8> let myByteString = pack "45.4"
*Data.ByteString.Lazy.Char8> :t myString
myString :: [Char]
*Data.ByteString.Lazy.Char8> :t myByteString
myByteString :: ByteString
*Data.ByteString.Lazy.Char8> myString
"45.4"
*Data.ByteString.Lazy.Char8> myByteString
Chunk "45.4" Empty
*Data.ByteString.Lazy.Char8> unpack myByteString
"45.4"
This is just the lazy bytestring constructor. You're not parsing those strings into numbers yet, so you'll see the underlying string. Note that lazy bytestrings are not the same as String, so they have a different printed representation when Show'n.
LPS was the old constructor for the old Lazy ByteString newtype. It has since been replaced with an explicit data type, so the current behavior is slightly different.
When you call Show on a Lazy ByteString it prints out the code that would generate approximately the same lazy bytestring you gave it. However, the usual import for working with ByteStrings doesn't export the LPS -- or in later revisions, the Chunk/Empty constructors. So it shows it with the LPS constructor wrapped around a list of strict bytestring chunks, which print themselves as strings.
On the other hand, I wonder if the lazy ByteString Show instance should do the same thing that most other show instances for complicated data structures do and say something like:
fromChunks ["foo","bar","baz"]
or even:
fromChunks [pack "foo",pack "bar", pack "baz"]
since the former seems to rely on {-# LANGUAGE OverloadedStrings #-} for the resulting code fragment to be really parseable as Haskell code. On the other-other hand, printing bytestrings as if they were strings is really convenient. Alas, both options are more verbose than the old LPS syntax, but they are more terse than the current Chunk "Foo" Empty. In the end, Show just needs to be left invertible by Read, so it's probably best not to muck around changing things lest it randomly break a ton of serialized data. ;)
As for your problem, you are getting a [[ByteString]] instead of [[Float]] by mapping words over your lines. You need to unpack that ByteString and then call read on the resulting string to generate your floating point numbers.
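For completeness, the original program with that fix applied (a sketch; type annotations added for clarity):

import qualified Data.ByteString.Lazy.Char8 as BStr

-- Unpack each word to a String, then read it as a Float.
readDatafile :: BStr.ByteString -> [[Float]]
readDatafile = map (map (read . BStr.unpack) . BStr.words) . BStr.lines

testFunc :: FilePath -> IO ()
testFunc path = do
    contents <- BStr.readFile path
    print (readDatafile contents)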
