Why do Data.Binary instances of bytestring add the length of the bytestring as prefix - haskell

Looking at the put instances of the various ByteString types we find that the length of the bytestring is always prefixed in the binary file before writing it. For example here - https://hackage.haskell.org/package/binary-0.8.8.0/docs/src/Data.Binary.Class.html#put
Taking an example
instance Binary B.ByteString where
    put bs = put (B.length bs) -- Why this??
          <> putByteString bs
    get    = get >>= getByteString
Is there any particular reason for doing this? And is the only way to write Bytestring without prefixing the length - creating our own newtype wrapper and having an instance for Binary?

Is there any particular reason for doing this?
The idea of get and put is that you can combine several objects. For example you can write:
write_func :: ByteString -> Char -> Put
write_func some_bytestring some_char = do
    put some_bytestring
    put some_char
You then want a function that reads the data back, and the two functions should round-trip: if the writer writes a certain ByteString and a certain Char, the reader should return that same ByteString and Char.
The reader function should look similar to:
read_fun :: Get (ByteString, Char)
read_fun = do
    bs <- get
    c <- get
    return (bs, c)
but the problem is: where does the ByteString end? The byte that encodes the Char (say 'A') could just as well be part of the ByteString. You thus need some mechanism to indicate where the ByteString ends. This can be done by saving the length up front, or by writing a marker at the end. With a marker, you need to "escape" the ByteString so that it cannot contain the marker itself.
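To make the length prefix visible, here is a small sketch that dumps the bytes produced by the instance quoted in the question. The helper name encodedBytes is made up for illustration. It assumes the behaviour of the binary-0.8.8.0 source linked above, where the Int length is written as 8 big-endian bytes.

```haskell
import Data.Binary (encode)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Hypothetical helper: the raw bytes that the Binary instance
-- for a strict ByteString writes for the given payload.
encodedBytes :: [Word8] -> [Word8]
encodedBytes = BL.unpack . encode . B.pack
```

For the payload [1,2,3] this yields eight length bytes (ending in 3) followed by the three payload bytes, which is what lets get know how much to read.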
And is the only way to write Bytestring without prefixing the length - creating our own newtype wrapper and having an instance for Binary?
No, in fact it is already in the instance definition. If you want to write a ByteString without length, then you can use putByteString :: ByteString -> Put:
write_func :: ByteString -> Char -> Put
write_func some_bytestring some_char = do
    putByteString some_bytestring
    put some_char
but when reading the ByteString back, you will need to figure out how many bytes you have to read.
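As a rough sketch of what the reading side could look like, the following pairs putByteString with getByteString and passes the length in from outside. The names writeFixed, readFixed and roundTrip are made up for illustration. In a real format the length would come from a fixed record size or a header field.

```haskell
import Data.Binary (get, put)
import Data.Binary.Get (Get, getByteString, runGet)
import Data.Binary.Put (Put, putByteString, runPut)
import qualified Data.ByteString as B

-- Writer: raw bytes with no length prefix, followed by a Char.
writeFixed :: B.ByteString -> Char -> Put
writeFixed bs c = do
  putByteString bs
  put c

-- Reader: the caller has to know the length of the ByteString.
readFixed :: Int -> Get (B.ByteString, Char)
readFixed n = do
  bs <- getByteString n
  c  <- get
  return (bs, c)

-- Round trip; here the length is simply taken from the input.
roundTrip :: B.ByteString -> Char -> (B.ByteString, Char)
roundTrip bs c = runGet (readFixed (B.length bs)) (runPut (writeFixed bs c))
```

The round trip works even when the payload contains the byte 65 (the encoding of 'A'), because the reader never has to guess where the ByteString stops.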

Related

How to get nth byte from ByteString?

How can I get nth byte of ByteString in Haskell?
I tried to find function like !! for ByteStrings, but found nothing.
ByteString.index is the function you're looking for.
Most of the "containerish" types emulate the extended list interface. Be careful, though: index will crash the program if you feed it a string that's too short (as will !! on ordinary lists). A safer implementation might be
import qualified Data.ByteString as B
import Data.Word (Word8)
nthByte :: Int -> B.ByteString -> Maybe Word8
nthByte n bs
    | n < 0     = Nothing
    | otherwise = fst <$> B.uncons (B.drop n bs)
which, reading inside out, drops the first n bytes (maybe producing an empty byte string), then attempts to split the first byte from the remainder and, if successful, ignores the rest of the string. The guard is needed because drop with a negative count returns the string unchanged.

ByteString to Vector conversion

I have a ByteString that contains the representation of Floats. Each Float is represented by 3 bytes in the ByteString.
I need to do some processing on the Float values, so I would like to perform that processing on a Vector of Float values. What would be the best way to do this?
I have a function toFloat :: [Word8] -> Float that converts 3 bytes of the ByteString to a Float. So I was thinking of iterating over the ByteString in steps of 3 Bytes and converting every step to a Float for a vector.
I've looked at the library functions for Vector but I can't find anything that suits this purpose. Data.Vector.Storable.ByteString.byteStringToVector looked promising but it converts every byte (instead of every 3 bytes) and doesn't give me any control over how the conversion of ByteString to Float should happen.
Just use Data.Vector.generate:
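V.generate (BS.length bs `div` 3) $ \i ->
    myToFloat (BS.index bs (3*i)) (BS.index bs (3*i+1)) (BS.index bs (3*i+2))
It'll allocate the vector all at once, and populate it. Data.ByteString.index is O(1), so this is quite efficient. Note the parentheses around the index expressions: without them, * and + would be applied to the looked-up byte rather than to the index.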
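A self-contained version of the generate approach might look like the sketch below. The decoder myToFloat is a stand-in: it treats the three bytes as a big-endian unsigned integer, which is purely an assumption for illustration, since the question does not say how the 3 bytes encode a Float. The name bytesToFloats is also made up.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Vector as V
import Data.Word (Word8)

-- Placeholder decoder (assumption): big-endian 24-bit unsigned value.
myToFloat :: Word8 -> Word8 -> Word8 -> Float
myToFloat a b c =
  fromIntegral (65536 * fromIntegral a + 256 * fromIntegral b + fromIntegral c :: Int)

-- One Float per complete 3-byte group; trailing bytes are ignored.
bytesToFloats :: BS.ByteString -> V.Vector Float
bytesToFloats bs =
  V.generate (BS.length bs `div` 3) $ \i ->
    myToFloat (BS.index bs (3 * i)) (BS.index bs (3 * i + 1)) (BS.index bs (3 * i + 2))
```

If the values never need to be boxed, Data.Vector.Unboxed offers a generate function with the same shape.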
Try using
splitAt :: Int -> ByteString -> (ByteString, ByteString)
to split the ByteString into two: one of exactly 3 bytes, and another containing the rest of the input. You can use this to implement a recursive function that will give you all the groups of length 3 (similar to Data.List.Split.chunksOf), and then you can use unpack on each to get the [Word8] you need. Pass that through your toFloat function, and convert to a vector with Vector.fromList.
There are a number of steps there that seem like perhaps they could be expensive, but I think probably the compiler is smart enough to fuse some of them, like the unpack/fromList pair. And splitting a ByteString is O(1), so that part's not as expensive as it looks either. Seems like this ought to be as suitable an approach as any.
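The splitAt approach could be sketched roughly as follows. The names chunksOf3 and decodeTriples are made up here, and the decoder is passed in as an argument so that the question's toFloat can be plugged in. The result is a list, which Vector.fromList can then turn into a vector.

```haskell
import qualified Data.ByteString as BS
import Data.Word (Word8)

-- Split into consecutive 3-byte groups; a trailing partial group is dropped.
chunksOf3 :: BS.ByteString -> [BS.ByteString]
chunksOf3 bs
  | BS.length bs < 3 = []
  | otherwise        = h : chunksOf3 t
  where
    (h, t) = BS.splitAt 3 bs

-- Apply a caller-supplied decoder (such as toFloat) to each group.
decodeTriples :: ([Word8] -> a) -> BS.ByteString -> [a]
decodeTriples decode = map (decode . BS.unpack) . chunksOf3
```

Dropping the incomplete tail is a design choice made for this sketch. Depending on the data, reporting it as an error may be more appropriate.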

Serialization of a TChan String

I have declared the following
type KEY = (IPv4, Integer)
type TPSQ = TVar (PSQ.PSQ KEY POSIXTime)
type TMap = TVar (Map.Map KEY [String])
data Qcfg = Qcfg { qthresh :: Int, tdelay :: Rational, cwpsq :: TPSQ, cwmap :: TMap, cwchan :: TChan String } deriving (Show)
and would like this to be serializable in a sense that Qcfg can either be written to disk or be sent over the network. When I compile this I get the error
No instances for (Show TMap, Show TPSQ, Show (TChan String))
arising from the 'deriving' clause of a data type declaration
Possible fix:
add instance declarations for
(Show TMap, Show TPSQ, Show (TChan String))
or use a standalone 'deriving instance' declaration,
so you can specify the instance context yourself
When deriving the instance for (Show Qcfg)
I am now not quite sure whether there is any chance at all of serializing my TChan, although all the individual items in it are members of the Show class.
For TMap and TPSQ, I wonder whether there is a way to show the values in the TVar directly (they do not get changed, so there should be no need to lock them) without having to declare an instance that does a readTVar?
I understood your comments to mean that you want to serialize the contents of the TVar and not the TVar itself.
The way to extract the value from a TVar is readTVar:
readTVar :: TVar a -> STM a
... which you can run in the IO monad using atomically:
atomically . readTVar :: TVar a -> IO a
(For a single TVar, readTVarIO :: TVar a -> IO a does the same thing.)
TChan is more tricky, though, since you can't inspect the contents without flushing out the entire TChan. This is doable, even if wastefully, by inspecting the entire contents as a single STM action and then reinserting them all. If this is what you choose to do, it would also eventually require being run in the IO monad.
This means you won't be able to derive a Show instance for it, since Show expects a pure computation that converts it to a String, and not one residing in the IO monad.
However, there's no reason you have to use the Show class. You can just define a custom function to serialize your data type in the IO monad. Also, it's generally not advisable to use Show for serialization purposes since:
Some of your data types (like PSQ) have no Read instance
It's a pain in the butt to define Read instances in general
String representations are very space-inefficient
So I would recommend you use a proper serialization library like binary or cereal to do serialization and deserialization. These convert data types to a binary representation, and they make it very easy to define encoders and decoders.
However, even those libraries only accept instances for pure conversions and not operations in the IO monad, so what you must do is factor your serialization into a two-step process:
1. Extract the contents of your TVars in the IO monad.
2. Serialize the contents (along with the rest of your data type) using cereal/binary.
There is still one last caveat, which is that not all of your data types have Binary instances (assuming we use the binary package). Fortunately lists do have a Binary instance, so a convenient workaround is to convert your data type to a list (using toList) and serialize the list. Then, when you deserialize the list, you use fromList to recover your original type.
So the following function will do all of that (using binary):
serializeQcfg file (Qcfg qthresh tdelay cwpsq cwmap cwchan) = do
    -- Step 1: Extract contents of concurrency variables
    psq <- atomically $ readTVar cwpsq
    myMap <- atomically $ readTVar cwmap
    myChan <- atomically $ entireTChan cwchan
    -- Step 2: Encode the extracted data
    encodeFile file (qthresh, tdelay, toList psq, myMap, myChan)
Edit: Actually, it's probably better to combine all the atomic transactions into a single transaction as Daniel pointed out, so you would actually do:
serializeQcfg file (Qcfg qthresh tdelay cwpsq cwmap cwchan) = do
    -- Step 1: Extract contents of concurrency variables in one transaction
    (psq, myMap, myChan) <- atomically $ (,,) <$> readTVar cwpsq
                                              <*> readTVar cwmap
                                              <*> entireTChan cwchan
    -- Step 2: Encode the extracted data
    encodeFile file (qthresh, tdelay, toList psq, myMap, myChan)
I left out the implementation of entireTChan, which would basically flush the TChan to inspect the entire contents and then reload it again, but its type signature would be something like:
entireTChan :: TChan a -> STM [a]
I also left out the deserialization implementation, but I think if you understand the above example and take the time to learn how to use the binary or cereal packages you should be able to figure it out easily enough.
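For reference, the entireTChan helper that was left out above could be sketched roughly like this. It is only one possible implementation: it drains the channel with tryReadTChan and writes every item back in the original order, all inside a single STM transaction so no other thread sees the channel half-empty.

```haskell
import Control.Concurrent.STM

-- Read every item currently in the channel, then put them all back
-- in the same order. Runs as one STM action.
entireTChan :: TChan a -> STM [a]
entireTChan chan = do
  items <- drain
  mapM_ (writeTChan chan) items
  return items
  where
    -- Keep reading until the channel reports that it is empty.
    drain = do
      next <- tryReadTChan chan
      case next of
        Nothing -> return []
        Just x  -> (x :) <$> drain
```

Because the items are written back, calling it twice in a row returns the same list, which is the property the serialization code relies on.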

Text or Bytestring

Good day.
The one thing I now hate about Haskell is the number of packages for working with strings.
First I used native Haskell [Char] strings, but when I started using Hackage libraries I got completely lost in endless conversions. Every package seems to use a different string implementation, and some adopt their own handmade thing.
Next I rewrote my code with Data.Text strings and the OverloadedStrings extension. I chose Text because it has a wider set of functions, but it seems many projects prefer ByteString.
Could someone give a short explanation of why to use one or the other?
PS: by the way, how do I convert from Text to ByteString?
Couldn't match expected type
Data.ByteString.Lazy.Internal.ByteString
against inferred type Text
Expected type: IO Data.ByteString.Lazy.Internal.ByteString
Inferred type: IO Text
I tried encodeUtf8 from Data.Text.Encoding, but no luck:
Couldn't match expected type
Data.ByteString.Lazy.Internal.ByteString
against inferred type Data.ByteString.Internal.ByteString
UPD:
Thanks for the responses. That *Chunks goodness looks like the way to go, but I am somewhat shocked by the result. My original function looked like this:
htmlToItems :: Text -> [Item]
htmlToItems =
    getItems . parseTags . convertFuzzy Discard "CP1251" "UTF8"
And now it has become:
htmlToItems :: Text -> [Item]
htmlToItems =
    getItems . parseTags . fromLazyBS . convertFuzzy Discard "CP1251" "UTF8" . toLazyBS
  where
    toLazyBS t = fromChunks [encodeUtf8 t]
    fromLazyBS t = decodeUtf8 $ intercalate "" $ toChunks t
And yes, this function does not work because it is wrong: if we supply Text to it, we can be confident that the text is already properly decoded and ready to use, so converting it again is a stupid thing to do. But such a verbose conversion still has to take place somewhere outside htmlToItems.
ByteStrings are mainly useful for binary data, but they are also an efficient way to process text if all you need is the ASCII character set. If you need to handle Unicode strings, you need to use Text. However, I must emphasize that neither is a replacement for the other, and they are generally used for different things: while Text represents a sequence of Unicode characters, you still need to encode to and decode from a binary ByteString representation whenever you, for example, transport text via a socket or a file.
Here is a good article about the basics of unicode, which does a decent job of explaining the relation of unicode code-points (Text) and the encoded binary bytes (ByteString): The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
You can use the Data.Text.Encoding module to convert between the two datatypes, or Data.Text.Lazy.Encoding if you are using the lazy variants (as you seem to be doing based on your error messages).
You definitely want to be using Data.Text for textual data.
encodeUtf8 is the way to go. This error:
Couldn't match expected type Data.ByteString.Lazy.Internal.ByteString
against inferred type Data.ByteString.Internal.ByteString
means that you're supplying a strict bytestring to code which expects a lazy bytestring. Conversion is easy with the fromChunks function:
Data.ByteString.Lazy.fromChunks :: [Data.ByteString.Internal.ByteString] -> ByteString
so all you need to do is add the function fromChunks [myStrictByteString] wherever the lazy bytestring is expected.
Conversion the other way can be accomplished with the dual function toChunks, which takes a lazy bytestring and gives a list of strict chunks.
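Putting the two steps together (encode to a strict ByteString, then wrap it as a lazy one), a minimal sketch could look like this. The function names textToLazyBS and lazyBSToText are made up for illustration. Note that decodeUtf8 throws an exception on input that is not valid UTF-8, so this sketch assumes well-formed data.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Strict Text -> lazy ByteString, UTF-8 encoded.
textToLazyBS :: T.Text -> BL.ByteString
textToLazyBS t = BL.fromChunks [TE.encodeUtf8 t]

-- Lazy ByteString -> strict Text; assumes the bytes are valid UTF-8.
lazyBSToText :: BL.ByteString -> T.Text
lazyBSToText = TE.decodeUtf8 . BL.toStrict
```

If the Text is lazy to begin with, Data.Text.Lazy.Encoding provides encodeUtf8 and decodeUtf8 that work on the lazy types directly, so no chunk handling is needed.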
You may want to ask the maintainers of some packages if they'd be able to provide a text interface instead of, or in addition to, a bytestring interface.
Use the single function cs from Data.String.Conversions (in the string-conversions package).
It allows you to convert between String, ByteString and Text (as well as ByteString.Lazy and Text.Lazy), depending on the input and the expected types.
You still have to call it, but you no longer have to worry about the respective types.
See this answer for usage example.
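For what it's worth, I found these two helper functions to be quite useful (note that they go through Char8, which truncates every character to 8 bits, so they are only safe for ASCII or Latin-1 text):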
import qualified Data.ByteString.Char8 as BS
import qualified Data.Text as T
-- | Text to ByteString
tbs :: T.Text -> BS.ByteString
tbs = BS.pack . T.unpack
-- | ByteString to Text
bst :: BS.ByteString -> T.Text
bst = T.pack . BS.unpack
Example (the string literals need the OverloadedStrings extension):
foo :: [BS.ByteString]
foo = ["hello", "world"]
bar :: [T.Text]
bar = bst <$> foo
baz :: [BS.ByteString]
baz = tbs <$> bar
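To illustrate how a Char8-based conversion differs from a real encoding, the sketch below converts the single character U+03BB (Greek lambda) both ways. The names viaChar8, viaUtf8 and lambda are made up for this comparison.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Char8-based conversion: each Char is truncated to its low 8 bits.
viaChar8 :: T.Text -> B.ByteString
viaChar8 = BC.pack . T.unpack

-- UTF-8 conversion: every code point survives a round trip.
viaUtf8 :: T.Text -> B.ByteString
viaUtf8 = TE.encodeUtf8

-- A one-character Text outside the 8-bit range (U+03BB).
lambda :: T.Text
lambda = T.pack "\955"
```

viaChar8 keeps a single byte and loses the character, while viaUtf8 produces two bytes that decodeUtf8 turns back into the original Text.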

Split ByteString on a ByteString (instead of a Word8 or Char)

I know I already have the Haskell Data.ByteString.Lazy function to split a CSV on a single character, such as:
split :: Word8 -> ByteString -> [ByteString]
But I want to split on a multi-character ByteString (like splitting on a String instead of a Char):
split :: ByteString -> ByteString -> [ByteString]
I have multi-character separators in a csv-like text file that I need to parse, and the individual characters themselves appear in some of the fields, so choosing just one separator character and discarding the others would contaminate the data import.
I've had some ideas on how to do this, but they seem kind of hacky (e.g. take three Word8s, test if they're the separator combination, start a new field if they are, recurse further), and I imagine I would be reinventing a wheel anyway. Is there a way to do this without rebuilding the function from scratch?
The documentation of ByteString's breakSubstring contains a function that does what you are asking for:
tokenise x y = h : if null t then [] else tokenise x (drop (length x) t)
    where (h,t) = breakSubstring x y
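With qualified imports and a type signature, the same function could be written as the sketch below. It works on strict ByteStrings, since that is where breakSubstring lives. It also assumes a non-empty separator: with an empty one, breakSubstring never consumes any input and the recursion would not terminate.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Split str on every occurrence of the (non-empty) separator sep.
tokenise :: B.ByteString -> B.ByteString -> [B.ByteString]
tokenise sep str = h : if B.null t then [] else tokenise sep (B.drop (B.length sep) t)
  where
    -- h is everything before the first separator; t starts at the separator.
    (h, t) = B.breakSubstring sep str

-- Convenience wrapper over String arguments, for experimenting only.
tokeniseStr :: String -> String -> [String]
tokeniseStr sep = map BC.unpack . tokenise (BC.pack sep) . BC.pack
```

A separator at the very end of the input produces a trailing empty field, and a lone ':' inside a field is left alone when splitting on "::".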
There are a few functions in bytestring for splitting on subsequences:
breakSubstring :: ByteString -> ByteString -> (ByteString,ByteString)
There's also a bytestring-csv package, http://hackage.haskell.org/package/bytestring-csv, and a split package, http://hackage.haskell.org/package/split, though the latter is for Strings.

Resources