Most efficient way of converting a Data.ByteString.Lazy to a CStringLen

I need to encode some data to JSON and then push it to the syslog using hsyslog. The types of the two relevant functions are:
Aeson.encode :: ToJSON a => a -> Data.ByteString.Lazy.ByteString
System.Posix.Syslog.syslog :: Maybe Facility
                           -> Priority
                           -> CStringLen
                           -> IO ()
What's the most efficient way (in speed and memory) to convert a Lazy.ByteString to a CStringLen? I found Data.ByteString.Unsafe, but it works only with a strict ByteString, not with a Lazy.ByteString.
Shall I just use unsafeUseAsCStringLen . Data.String.Conv.toS and call it a day? Will it do the right thing with respect to efficiency?

I guess I would use Data.ByteString.Lazy.toStrict in place of toS, to avoid the additional package dependency.
Anyway, you won't find anything more efficient than:
unsafeUseAsCStringLen (toStrict lbs) $ \cstrlen -> ...
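The one-liner above can be wrapped in a small helper. This is only a sketch: withLazyCStringLen and roundTrip are hypothetical names, not part of bytestring, and the pointer handed to the callback must not be modified or kept after the callback returns.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import Foreign.C.String (CStringLen)

-- Hypothetical helper: run an action on the bytes of a lazy ByteString.
-- toStrict concatenates the chunks (no copy when there is a single chunk);
-- unsafeUseAsCStringLen then exposes the buffer without copying it.
withLazyCStringLen :: BL.ByteString -> (CStringLen -> IO a) -> IO a
withLazyCStringLen lbs = unsafeUseAsCStringLen (BL.toStrict lbs)

-- Hypothetical demo callback: copy the bytes back out of the C buffer,
-- which shows that the callback sees the full contents.
roundTrip :: BL.ByteString -> IO BS.ByteString
roundTrip lbs = withLazyCStringLen lbs BS.packCStringLen
```

With aeson and hsyslog in scope, the call from the question would presumably look like withLazyCStringLen (Aeson.encode value) (syslog Nothing Info), where value is whatever you are logging.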
In general, toStrict is an "expensive" operation, because a lazy ByteString will generally be made up of a bunch of "chunks" each consisting of a strict ByteString and not necessarily yet loaded into memory. The toStrict function must force all the strict ByteString chunks into memory and ensure that they are copied into a single, contiguous block as required for a strict ByteString before the no-copy unsafeUseAsCStringLen is applied.
However, toStrict handles a lazy ByteString that consists of a single chunk optimally without any copying.
In practice, aeson uses an efficient Data.ByteString.Builder to create the JSON, and if the JSON is reasonably small (less than 4k, I think), it will build a single-chunk lazy ByteString. In this case, toStrict is zero-copy, and unsafeUseAsCStringLen is zero copy, and the entire operation is basically free.
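If you want to check the single-chunk claim for your own payloads, counting the chunks is enough. A minimal sketch, where chunkCount and smallDoc are made-up names and smallDoc is just a stand-in for a short JSON document produced by a Builder:

```haskell
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy as BL

-- Number of strict chunks backing a lazy ByteString.
-- A result of 1 means toStrict can return that chunk without copying.
chunkCount :: BL.ByteString -> Int
chunkCount = length . BL.toChunks

-- Stand-in for a small JSON document built with a Builder (illustrative only).
smallDoc :: BL.ByteString
smallDoc = B.toLazyByteString (B.string7 "{\"level\":\"info\"}")
```

Evaluating chunkCount smallDoc (or chunkCount (Aeson.encode value) in your program) in GHCi tells you whether the zero-copy path applies to a given payload.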
But note that, in your application, where you are passing the string to the syslogger, fretting about the efficiency of this operation is crazy. My guess would be that you'd need thousands of copy operations to even make a dent in the performance of the overall action.

Related

What makes a Bytestring "lazy"?

I am learning Haskell but having some difficulty understanding how exactly lazy ByteStrings work. Hackage says that "Lazy ByteStrings use a lazy list of strict chunks which makes it suitable for I/O streaming tasks". In contrast, a strict ByteString is stored as one large array.
What are these "chunks" in lazy ByteStrings? How does the compiler know how large a chunk should be? Further, I understand that the idea behind a lazy list is that you don't have to store the entire thing, which allows for infinite lists and all of that. But how is this storage implemented? Does each chunk have a pointer to the next chunk?
Many thanks in advance for the help :)

You can find the definition of the lazy ByteString here:
data ByteString = Empty | Chunk {-# UNPACK #-} !S.ByteString ByteString
    deriving (Typeable)
So Chunk is a data constructor with two fields: the first is a strict field (the !) holding a strict (S.) ByteString, and the second is the recursive, lazy ByteString, which is either more Chunks or Empty.
Note that the second field has no (!), so it can be a GHC thunk (the lazy stuff in Haskell) that is only forced when you need it (for example, when you pattern-match on it).
That means a lazy ByteString is either Empty, or a strict chunk of the complete string (you can think of it as already loaded) followed by a lazy rest/tail ByteString.
As for the chunk size, that depends on the code that generates the lazy ByteString; the compiler does not come into this.
You can see this for hGetContents:
hGetContents = hGetContentsN defaultChunkSize
where defaultChunkSize is defined to be 32 * 1024 - 2 * sizeOf (undefined :: Int) - so a bit less than 32kB
And yes, the rest (the second argument to Chunk) can be seen as a pointer to the next Chunk or Empty (just like with a normal list).
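To see the chunk structure from user code, you can go through toChunks and fromChunks instead of the internal constructors. A small sketch (chunkSizes and endless are illustrative names) that also shows why an infinite lazy ByteString is fine as long as you only look at a finite prefix:

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL

-- Sizes of the strict chunks, in order. toChunks is itself lazy,
-- so this also works on a prefix of an infinite lazy ByteString.
chunkSizes :: BL.ByteString -> [Int]
chunkSizes = map BS.length . BL.toChunks

-- An infinite lazy ByteString: two chunks repeated forever.
-- Only the chunks that are actually demanded get evaluated.
endless :: BL.ByteString
endless = BL.cycle (BL.fromChunks [BS.pack [1, 2, 3], BS.pack [4, 5]])
```

For example, take 4 (chunkSizes endless) evaluates to [3,2,3,2] without ever forcing the rest of the list.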

How to approach writing of custom decoding function from `ByteString` to `Text`

Suppose I wish to write something like this:
-- | Decode a 'ByteString' containing Code Page 437 encoded text.
decodeCP437 :: ByteString -> Text
decodeCP437 = undefined
(I know about the encoding package, but its dependency list is a ridiculous price to pay for this single and, I believe, quite trivial function.)
My question is how to construct Text from ByteString with reasonable efficiency, in particular without using lists. It seems to me that Data.Text.Encoding should be a good source of inspiration, but at first sight it uses withForeignPtr, and I guess that is too low level for my use case.
How should the problem be approached? In a nutshell, I guess I need to repeatedly take bytes (Word8) from the ByteString, translate every byte to the corresponding Char, and somehow efficiently build a Text from them. The complexity of the basic construction functions in Data.Text, not surprisingly, indicates that appending characters one by one is not the best idea, but I don't see better tools for this.
Update: I want to create a strict Text. It seems that the only option is to create a Builder, then get a lazy Text from it (O(n)), and then convert that to a strict Text (O(n)).

You can use the Builder API (Data.Text.Lazy.Builder), which offers O(1) singleton :: Char -> Builder and O(1) (<>) :: Builder -> Builder -> Builder for efficient construction.
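A minimal sketch of that approach, assuming the byte-to-character table is passed in as a function. decodeWith and placeholderTable are hypothetical names; placeholderTable is deliberately not a CP437 table (it maps every byte >= 0x80 to U+FFFD), so a real decodeCP437 would have to supply the actual code page mapping.

```haskell
import qualified Data.ByteString as BS
import Data.Char (chr)
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as TB
import Data.Word (Word8)

-- Decode a single-byte encoding: fold over the bytes, emit one Char per
-- byte into a Builder, then run the Builder and convert to strict Text.
decodeWith :: (Word8 -> Char) -> BS.ByteString -> T.Text
decodeWith toChar =
  TL.toStrict . TB.toLazyText . BS.foldr step mempty
  where
    step w rest = TB.singleton (toChar w) <> rest

-- Placeholder table, NOT real CP437: ASCII passes through unchanged and
-- every high byte becomes the Unicode replacement character.
placeholderTable :: Word8 -> Char
placeholderTable w
  | w < 0x80  = chr (fromIntegral w)
  | otherwise = '\xFFFD'
```

With a proper table, decodeCP437 would then just be decodeWith cp437Table, where cp437Table is the mapping you provide.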

Idiomatic way to take a substring of a ByteString

I need to make extensive use of:
slice :: Int -> Int -> ByteString -> ByteString
slice start len = take len . drop start
Two part question:
Does this already have a name? I can't find anything searching for that type on Hoogle, but it seems like it should be a really common need. I also tried searching for (Int, Int) -> ByteString -> ByteString and some flip'd versions of same. I also tried looking for [a] versions to see if there was a name in common use.
Is there a better way to write it?
I'm suspicious that I'm doing something wrong because I strongly expected to find lots of people having gone down the same road, but my google-fu isn't finding anything.

The idiomatic way is via take and drop, which has O(1) complexity on strict bytestrings.
slice is not provided, to discourage the reliance on unsafe indexing operations.

According to the documentation there is no such function. Currently strict ByteStrings are represented as a pointer to the beginning of pinned memory, an offset and a length. So, indeed, your implementation is a good way to slice. However, you should be careful with slices, because a slice shares the original buffer and therefore keeps the whole original ByteString alive. To avoid this you might want to copy the slice, but this is not always necessary.
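Both variants can be sketched as follows. slice is the function from the question; sliceCopy is a hypothetical name for the copying variant, built on Data.ByteString.copy:

```haskell
import qualified Data.ByteString as BS

-- O(1): the result shares the original buffer (only the view changes).
-- Out-of-range arguments simply yield a shorter or empty result.
slice :: Int -> Int -> BS.ByteString -> BS.ByteString
slice start len = BS.take len . BS.drop start

-- O(len): copies the slice into a fresh buffer, so the original
-- ByteString can be garbage collected once nothing else refers to it.
sliceCopy :: Int -> Int -> BS.ByteString -> BS.ByteString
sliceCopy start len = BS.copy . slice start len
```

The copying variant is mostly worth it when you keep a small slice of a large buffer around for a long time.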

Converting Data.Time.UTCTime to / from ByteString

Let's say I need to write/read a Data.Time.UTCTime in "%Y-%m-%d %H:%M:%S" format to/from a file many, many times.
It seems to me that using Data.Time.formatTime or Data.Time.parseTime to convert UTCTime to/from String, and then packing/unpacking the String to/from ByteString, would be too slow, since it involves an intermediate String. But writing a ByteString builder/parser for UTCTime by hand seems like repeating a lot of work already done in formatTime and parseTime.
I guess my question is: Is there a systematic way to get functions of type t -> String or String -> t converted to t -> ByteString or ByteString -> t with increased efficiency without repeating a lot of work?
I am totally a Haskell newbie, so please forgive me if the question is just stupid.

No, there isn't a general way to convert a function of type t -> String to one of type t -> ByteString. It might help you reconcile yourself to that reality if you recall that a ByteString isn't just a faster String; it's lower-level than that. A ByteString is a sequence of bytes; it doesn't mean anything more unless you have an encoding in mind.
So your options are:
Use function composition, as in phg's answer:
import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as B
timeToByteStr :: UTCTime -> ByteString
timeToByteStr = B.pack . formatTime'
parseBStrTime :: ByteString -> Maybe UTCTime
parseBStrTime = parseTime' . B.unpack
where I've corrected the function names. I've also used, e.g., parseTime' instead of parseTime, to elide over the format string and TimeLocale you need to pass in.
(It's also worth noting that Data.Text is often a better choice than Data.ByteString.Char8. The latter only works properly if your Chars fall in Unicode code points 0-255.)
If performance is an issue and this conversion profiles out as a bottleneck, write a parser/builder.
Just use Strings.
The last option is an underrated choice for Haskell newbies, IMHO. String is suboptimal, performance-wise, but it's not spawn-of-the-Devil. If you're just getting started with Haskell, why not keep your life as simple as possible, until and unless your code becomes too slow?
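For reference, the function-composition option might look like this once the format string and locale are filled in. This is a sketch, not a tuned implementation: timeFormat is a made-up name, and parseTimeM is used because it is the newer counterpart of parseTime in recent versions of the time package.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Time (UTCTime, defaultTimeLocale, formatTime, parseTimeM)

-- The format from the question.
timeFormat :: String
timeFormat = "%Y-%m-%d %H:%M:%S"

-- Format via String, then pack. Safe with Char8 here because the
-- formatted output is plain ASCII.
timeToByteStr :: UTCTime -> B.ByteString
timeToByteStr = B.pack . formatTime defaultTimeLocale timeFormat

-- Unpack to String, then parse. The True argument makes the parser
-- accept leading and trailing whitespace.
parseBStrTime :: B.ByteString -> Maybe UTCTime
parseBStrTime = parseTimeM True defaultTimeLocale timeFormat . B.unpack
```

If profiling later shows this to be a bottleneck, these two functions are the only place that needs to change.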

If you don't really need heavily optimized code, the usual and easiest way is to use function composition:
timeToByte = toBytes . formatTime
byteToTime = parseTime . fromBytes
or something like that, as I'm not familiar with those libraries.
If, after profiling, you find that this approach is still too slow, I guess you will have to write something by hand.

Referential transparency and mmap in Haskell

I was hoping to use System.INotify and System.IO.MMap together in order to watch for file modifications and then quickly perform diffs for sending patches over a network. However, in the documentation for System.IO.MMap there are a couple of warnings about referential transparency.
The documentation states
It is only safe to mmap a file if you know you are the sole user. Otherwise referential transparency may be or may be not compromised. Sadly semantics differ much between operating systems.
The values that MMap returns are of type IO ByteString, so surely when I use such a value with putStr I should expect a different result each time? I assume the author means that the value could change during an IO operation such as putStr and cause a crash?
START-OF-EDIT: Come to think of it, I guess answer to this part of the question is somewhat obvious...
If the value changes any time after it is unboxed it would be problematic.
do
  v <- mappedValue :: IO ByteString
  putStr v
  putStr v -- Expects the same value of v everywhere
END-OF-EDIT
Shouldn't it be possible to acquire some kind of lock on the mapped region or on the file?
Alternatively, would it be possible to write a function copy :: IO ByteString -> IO ByteString that takes a snapshot of the file in its current state in a safe way?

I think the author means that the value can change even inside a lifted function that can view it as a plain ByteString (no IO).
The memory-mapped file is a region of memory. It doesn't make much sense to copy its content back and forth, for performance reasons (otherwise one could just do plain old stream-based I/O). So the ByteString you are getting is live.
If you want a snapshot, just use stream-based I/O. That's what reading a file does: it creates a snapshot of the file in memory! I guess an alternative would be to use the ForeignPtr interface, which does not carry the referential transparency warning. I'm not familiar with ForeignPtrs so I cannot guarantee it will work, but it looks promising and I would investigate it.
You can also try calling map id on your ByteString but it is not guaranteed you will get a copy distinct from the original.
Mandatory file locking, especially on Linux, is a mess that is better avoided. Advisory file locking is OK, except nobody is using it, so it effectively does not exist.
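The snapshot idea from above can be sketched with plain bytestring functions. snapshotFile and snapshotBytes are hypothetical names. Neither makes the read atomic: if another process writes while the bytes are being copied, the snapshot can still contain a mix of old and new data.

```haskell
import Control.Exception (evaluate)
import qualified Data.ByteString as BS

-- Stream-based snapshot: strict readFile copies the file contents into a
-- fresh buffer that later changes to the file do not affect.
snapshotFile :: FilePath -> IO BS.ByteString
snapshotFile = BS.readFile

-- Copy an existing (for example memory-mapped) ByteString into a new
-- buffer. evaluate forces the copy to happen at this point in the IO
-- sequence instead of whenever the result is first demanded.
snapshotBytes :: BS.ByteString -> IO BS.ByteString
snapshotBytes = evaluate . BS.copy
```

Unlike map id, Data.ByteString.copy is documented to produce a new buffer, which is why it is used here.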
