What makes a Bytestring "lazy"? - haskell

I am learning Haskell but having some difficulty understanding how exactly lazy ByteStrings work. Hackage says that "Lazy ByteStrings use a lazy list of strict chunks which makes it suitable for I/O streaming tasks". In contrast, a strict list is stored as one large array.
What are these "chunks" in lazy byteStrings? How does your compiler know just how large a chunk should be? Further, I understand that the idea behind a lazy list is that you don't have to store the entire thing, which thus allows for infinite lists and all of that. But how is this storage implemented? Does each chunk have a pointer to a next chunk?
Many thanks in advance for the help :)

You can find the definition of the lazy ByteString here:
data ByteString = Empty | Chunk {-# UNPACK #-} !S.ByteString ByteString
deriving (Typeable)
so Chunk is one data-constructor - the first part is a strict (!) strict (S.) ByteString and then some more Chunks or Empty via the second recursive (lazy) ByteString part.
Note that the second part does not have the (!) there - so this can be a GHC thunk (the lazy stuff in Haskell) that will only be forced when you need it (for example pattern-match on it).
That means a lazy ByteString is either Empty or you get a strict (you can think of this as already loaded if you want) part or chunk of the complete string with a lazy remaining/rest/tail ByteString.
As about the size that depends on the code that is generating this lazy bytestring - the compiler does not come into this.
You can see this for hGetContents:
hGetContents = hGetContentsN defaultChunkSize
where defaultChunkSize is defined to be 32 * 1024 - 2 * sizeOf (undefined :: Int) - so a bit less than 32kB
And yes the rest (snd. argument to Chunk) can be seen as a pointer to the next Chunk or Empty (just like with a normal list).

Related

Efficient way to combine a lazy ByteString and a lazy Text

I'm writing some code that is rendering an HTML page (via servant, if that's relevant), and for various complicated reasons, I have to construct the HTML by "combining" two segments.
One segment is fetched from an internal HTTP API which returns a Data.ByteString.Lazy
The other segment is rendered using the ede library, which generates a Data.Text.Lazy
What options do I have if I have to combine these two segments efficiently? The two segments can be reasonably large (few 100 kbs each). This servant server is going to see quite some traffic, so any inefficiency (like copying 100s of kbs of memory for every req/res, will quickly add up).
Assuming your endpoint returns a lazy ByteString, use the function encodeUtf8 from Data.Text.Lazy.Encoding to convert your lazy Text into a lazy ByteString, and then return the append of the two lazy ByteStrings.
Internally, lazy ByteStrings are basically lists of strict ByteString chunks. Concatenating them is list concatenation, and doesn't incur in new allocations for the bytes themselves.
A time and space-efficient implementation of lazy byte vectors using
lists of packed Word8 arrays
Some operations, such as concat, append, reverse and cons, have better
complexity than their Data.ByteString equivalents, due to
optimisations resulting from the list spine structure.
If you had a large number of lazy ByteStrings instead of two, you should take the extra step of using lazyByteString to convert them to Builders, concatenate the Builders, and then get the result lazy ByteString using toLazyByteString. This will avoid the inefficiency of left-associated list concatenation.
Builders denote sequences of bytes. They are Monoids where mempty is
the zero-length sequence and mappend is concatenation, which runs in
O(1).

Most efficient way of converting a Data.ByteString.Lazy to a CStringLen

I need to encode some data to JSON and then push is to the syslog using hsyslog. The types of the two relevant functions are:
Aeson.encode :: a -> Data.ByteString.Lazy.ByteString
System.Posix.Syslog.syslog :: Maybe Facility
-> Priority
-> CStringLen
-> IO ()
What's the most efficient way (speed & memory) to convert a Lazy.ByteString -> CStringLen? I found Data.ByteString.Unsafe, but it works only with ByteString, not Lazy.ByteString?
Shall I just stick a unsafeUseAsCStringLen . Data.String.Conv.toS and call it a day? Will it to the right thing wrt efficiency?
I guess I would use Data.ByteString.Lazy.toStrict in place of toS, to avoid the additional package dependency.
Anyway, you won't find anything more efficient than:
unsafeUseAsCStringLen (toStrict lbs) $ \cstrlen -> ...
In general, toStrict is an "expensive" operation, because a lazy ByteString will generally be made up of a bunch of "chunks" each consisting of a strict ByteString and not necessarily yet loaded into memory. The toStrict function must force all the strict ByteString chunks into memory and ensure that they are copied into a single, contiguous block as required for a strict ByteString before the no-copy unsafeUseAsCStringLen is applied.
However, toStrict handles a lazy ByteString that consists of a single chunk optimally without any copying.
In practice, aeson uses an efficient Data.ByteString.Builder to create the JSON, and if the JSON is reasonably small (less than 4k, I think), it will build a single-chunk lazy ByteString. In this case, toStrict is zero-copy, and unsafeUseAsCStringLen is zero copy, and the entire operation is basically free.
But note that, in your application, where you are passing the string to the syslogger, fretting about the efficiency of this operation is crazy. My guess would be that you'd need thousands of copy operations to even make a dent in the performance of the overall action.

How to approach writing of custom decoding function from `ByteString` to `Text`

Suppose I wish to write something like this:
-- | Decode a 'ByteString' containing Code Page 437 encoded text.
decodeCP437 :: ByteString -> Text
decodeCP437 = undefined
(I know about encoding package, but its dependency list is ridiculous price to pay for this single, and I believe quite trivial function.)
My question is how to construct Text from ByteString with reasonable efficiency, in particular without using lists. It seems to me that Data.Text.Encoding should be a good source for inspiration, but at first sight it uses withForeignPtr and I guess it's too low level for my use case.
How the problem should be approached? In a nutshell, I guess I need to continuously take bytes (Word8) from ByteString, translate every byte to corresponding Char, and somehow efficiently build Text from them. Complexity of basic building functions in Data.Text for Text construction not surprisingly indicates that appending characters one by one is not the best idea, but I don't see better tools for this available.
Update: I want to create strict Text. It seems that the only option is to create builder then get lazy Text from it (O(n)) and then convert to strict Text (O(n)).
You can use the Builder API, which offers O(1) singleton :: Char -> Builder and O(1) (<>) :: Builder -> Builder -> Builder for efficient construction operations.

Memory efficient strings in Haskell

The commonly recommended Haskell string types seem to be ByteString or Text. I often work with a very large number of short (English word sized) strings, and typically need to store them in a lookup table such as Data.Map. In many cases I find that in this scenario, a table of Strings can take up less memory then a table of ByteStrings. Unboxed Data.Vectors of Word8 are also (much) more compact than ByteStrings.
What is the best practice when one needs to store and compare large numbers of small strings in Haskell?
Below I have tried to condense a particular problematic case into a small example:
import qualified Data.ByteString.Lazy.Char8 as S
import qualified Data.ByteString as Strict
import qualified Data.Map as Map
import qualified Data.Vector.Unboxed as U
import qualified Data.Serialize as Serialize
import Control.Monad.State
main = putStr
. unlines . map show . flip evalState (0,Map.empty)
. mapM toInt
. S.words
=<<
S.getContents
toInt x = do
let x' =
U.fromList . Strict.unpack . -- Comment this line to increase memory usage
Serialize.encode $ x
(i,t) <- get
case Map.lookup x' t of
Just j -> return j
Nothing -> do
let i' = i + (1::Int)
put (i', Map.insert x' i t)
return i
When I run this on a file containing around 400.000 words of English text, the version with strict bytestring keys uses around 50MB memory, the one with Word8 vectors uses 6MB.
In the absence of other answers, I'm going to go out on a limb here.
What is the best practice when one needs to store and compare large numbers of small strings in Haskell?
If the small strings are meant to be human readable (e.g. an English word) then use Text. If they are meant to be read only by the computer, use ByteString. The decision to use strict or lazy variants of these depends on how you build and use these small strings.
You shouldn't need to use your own unboxed Vectors of Word8. If you experiencing a specific situation where regular String is faster than Text or ByteString, then throw the details up on StackOverflow and we'll try to figure out why. If you perform detailed analysis and can prove that an unboxed Vector of Word8 consistently works significantly better than Text or ByteString, then start conversations on mailing lists, irc, reddit, etc; the standard libraries are not set in stone, and improvements are always welcome.
But I think it highly likely that you are just doing something weird, as hammar and shang suggest.
P.S. for your particular use case, instead of storing a lot of small strings, you should consider a more appropriate data structure catered to your needs, e.g. a Trie as danr suggests.
A (strict) ByteSting is a constructor over an unboxed ForiegnPtr to a Word8 and two unboxed Ints.
A ForeignPtr is another constructor over an Addr# (a GHC prim) and a ForeignPtrContents:
data ForeignPtrContents
= PlainForeignPtr !(IORef (Finalizers, [IO ()]))
| MallocPtr (MutableByteArray# RealWorld) !(IORef (Finalizers, [IO ()]))
| PlainPtr (MutableByteArray# RealWorld)
...
For short strings, ByteStrings simply pack too much administration to benefit their contiguous representation of the actual "string" data.
For the original question - I'd check an average word length of your corpus, but I can't see ByteString being more efficient than String aka [Char] which uses 12 bytes per Char (source the original ByteString paper).
A general plea to Haskellers (not aimed the poster of the original question) - please stop bashing String aka [Char] - having both String and Text (and ByteString when you really need bytes) makes sense. Or use Clean where the contiguous String representation is better suited to short strings.
Caveat - I may have been looking at an old version of the ByteString internals with regards to what data types it uses internally.
I know this is a 6-year old post, but I was wondering the same recently, and found this useful blog post: https://markkarpov.com/post/short-bs-and-text.html. It seems that yes, this is a recognized issue, and Short(Text/ByteString) are the solution.

What is the point of the strictness declaration?

I am starting Haskell and was looking at some libraries where data types are defined with "!". Example from the bytestring library:
data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8) -- payload
{-# UNPACK #-} !Int -- offset
{-# UNPACK #-} !Int -- length
Now I saw this question as an explanation of what this means and I guess it is fairly easy to understand. But my question is now: what is the point of using this? Since the expression will be evaluated whenever it is need, why would you force the early evaluation?
In the second answer to this question C.V. Hansen says: "[...] sometimes the overhead of lazyness can be too much or wasteful". Is that supposed to mean that it is used to save memory (saving the value is cheaper than saving the expression)?
An explanation and an example would be great!
Thanks!
[EDIT] I think I should have chosen an example without {-# UNPACK #-}. So let me make one myself. Would this ever make sense? Is yes, why and in what situation?
data MyType = Const1 !Int
| Const2 !Double
| Const3 !SomeOtherDataTypeMaybeMoreComplex
The goal here is not strictness so much as packing these elements into the data structure. Without strictness, any of those three constructor arguments could point either to a heap-allocated value structure or a heap-allocated delayed evaluation thunk. With strictness, it could only point to a heap-allocated value structure. With strictness and packed structures, it's possible to make those values inline.
Since each of those three values is a pointer-sized entity and is accessed strictly anyway, forcing a strict and packed structure saves pointer indirections when using this structure.
In the more general case, a strictness annotation can help reduce space leaks. Consider a case like this:
data Foo = Foo Int
makeFoo :: ReallyBigDataStructure -> Foo
makeFoo x = Foo (computeSomething x)
Without the strictness annotation, if you just call makeFoo, it will build a Foo pointing to a thunk pointing to the ReallyBigDataStructure, keeping it around in memory until something forces the thunk to evaluate. If we instead have
data Foo = Foo !Int
This forces the computeSomething evaluation to proceed immediately (well, as soon as something forces makeFoo itself), which avoids leaving a reference to the ReallyBigDataStructure.
Note that this is a different use case than the bytestring code; the bytestring code forces its parameters quite frequently so it's unlikely to lead to a space leak. It's probably best to interpret the bytestring code as a pure optimization to avoid pointer dereferences.

Resources