Haskell: find out how many bytes a Get expression would consume

I am writing a tool which includes a deserialization mechanism for my bachelor thesis, for which I use the Get Monad (Data.Binary.Get). I ran into the following problem:
During deserialization, there is a part where I have a getter of type Get a and I need to read a ByteString of length n, where n is the number of bytes that would be consumed if I ran my getter at this position. In other words, I need to know how many bytes my getter would consume without consuming them.
There is a way to do this:
readBytes :: Get a -> Get ByteString
readBytes getter = do
    safe <- lookAhead getRemainingLazyByteString
    -- n_cB = number of consumed bytes
    case runGetOrFail getter safe of
        Right (_, n_cB, _) -> getLazyByteString n_cB
        Left (_, _, msg)   -> fail msg
But this is hideous beyond description. Every time this method is called, it copies the entire remainder of the file.
Even though this doesn't seem like a hard problem in theory, and so far the Get Monad has been capable of doing everything I needed, I cannot find a better solution.

I need to know how many bytes my getter would consume without
consuming them.
Perhaps you could perform two calls to the bytesRead :: Get Int64 function, the second call inside a lookAhead, after having parsed the a value. Something like
bytesRead1 <- bytesRead
bytesRead2 <- lookAhead (getter *> bytesRead)
return (bytesRead2 - bytesRead1)
I'm not sure about how bytesRead behaves inside lookAhead, however.
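Putting that suggestion together, a sketch of the two-bytesRead version of readBytes (assuming, as the answer does, that bytesRead inside lookAhead reflects the bytes the getter consumed before the position is rewound, which appears to hold for binary's Get):

```haskell
import Data.Binary.Get
import qualified Data.ByteString.Lazy as BL

-- Measure how far the getter would advance, then consume exactly
-- that many bytes for real. lookAhead rewinds the position, so the
-- getter's work is not kept.
readBytes :: Get a -> Get BL.ByteString
readBytes getter = do
    before <- bytesRead
    after  <- lookAhead (getter *> bytesRead)
    getLazyByteString (after - before)
```

This avoids copying the remainder of the input: only the bytes the getter actually covers are materialized.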

Related

STM-friendly list as a change log

I need advice on the data structure to use as an atomic change log.
I'm trying to implement the following algorithm. There is a flow of incoming
changes updating an in-memory map. In Haskell-like pseudocode it is
update :: DataSet -> SomeListOf Change -> Change -> STM (DataSet, SomeListOf Change)
update dataSet existingChanges newChange = do
    ...
    return (dataSet, existingChanges ++ [newChange])
where DataSet is a map (currently the Map from the stm-containers package, https://hackage.haskell.org/package/stm-containers-0.2.10/docs/STMContainers-Map.html). The whole "update" is called from an arbitrary number of threads. Some of the Changes can be rejected due to domain semantics; I use throwSTM for that, to throw away the effect of the transaction. In case of a successful commit, the "newChange" is added to the list.
There is a separate thread which calls the following function:
flush :: STM (DataSet, SomeListOf Change) -> IO ()
this function is supposed to take the current snapshot of DataSet together with the list of changes (it has to be a consistent pair) and flush it to the filesystem, i.e.
flush data_ = do
    (dataSet, changes) <- atomically $ readTVar data_
    -- write them both to FS
    -- ...
    atomically $ writeTVar data_ (dataSet, [])
I need advice about the data structure to use for "SomeListOf Change". I don't want to use [Change] because it is "too ordered" and I'm afraid there will be too many conflicts, forcing the whole transaction to retry. Please correct me if I'm wrong here.
I cannot use the Set (https://hackage.haskell.org/package/stm-containers-0.2.10/docs/STMContainers-Set.html) because I still need to preserve some order, e.g. the order of transaction commits. I could use TChan for it and it looks like a good match (exactly the order of transaction commits), but I don't know how to implement the "flush" function so that it would give the consistent view of the whole change log together with the DataSet.
The current implementation of that is here https://github.com/lolepezy/rpki-pub-server/blob/add-storage/src/RRDP/Repo.hs, in the functions applyActionsToState and rrdpSyncThread, respectively. It uses TChan and seems to do it in a wrong way.
Thank you in advance.
Update: A reasonable answer seems to be something like this:
type SomeListOf c = TChan [c]
update :: DataSet -> TChan [Change] -> Change -> STM DataSet
update dataSet existingChanges newChange = do
    ...
    writeTChan changeChan $ reverse (newChange : existingChanges)
    return dataSet
flush data_ = do
    (dataSet, changes) <- atomically $ (,) <$> readTVar data_ <*> readTChan changeChan
    -- write them both to FS
    -- ...
But I'm still not sure whether it's a neat solution to pass the whole list as an element of the channel.
I'd probably just go with the list and see how far it takes you performance-wise. Given that, you should consider that both appending to the end of a list and reversing it are O(n) operations, so you should try to avoid this. Maybe you can just prepend the incoming changes like this:
update dataSet existingChanges newChange = do
    -- ...
    return (dataSet, newChange : existingChanges)
Also, your example for flush has the problem that reading and updating the state is not atomic at all. You must accomplish this using a single atomically call like so:
flush data_ = do
    (dataSet, changes) <- atomically $ do
        contents@(dataSet', _) <- readTVar data_
        writeTVar data_ (dataSet', [])  -- reset the log in the same transaction
        return contents
    -- write them both to FS
    -- ...
You could then just write them out in reverse order (because now changes contains the elements from newest to oldest) or reverse here once if it's important to write them out oldest to newest. If that's important I'd probably go with some data structure which allows O(1) element access like a good old vector.
When using a fixed-size vector you would obviously have to deal with the problem that it can become "full", which would mean your writers would have to wait for flush to do its job before adding fresh changes. That's why I'd personally go with the simple list first and see if it's sufficient, or where it needs to be improved.
PS: A deque might be a good fit for your problem as well, but going fixed-size forces you to deal with the problem that your writers can potentially produce more changes than your reader can flush out. The deque can grow indefinitely, but your RAM probably can't. And the vector has pretty low overhead.
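Pulling the answer's suggestions together, here is a minimal runnable sketch of the prepend-then-swap pattern, with hypothetical concrete types for DataSet and Change (the question leaves them abstract):

```haskell
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

-- Illustrative stand-ins for the question's abstract types.
type DataSet = M.Map Int String
type Change  = (Int, String)

-- Writers prepend in O(1); ordering is recovered once, at flush time.
update :: TVar (DataSet, [Change]) -> Change -> STM ()
update var ch@(k, v) = do
    (ds, changes) <- readTVar var
    writeTVar var (M.insert k v ds, ch : changes)

-- A single transaction snapshots the data set and swaps the log out,
-- so the pair handed to the flushing thread is consistent.
flush :: TVar (DataSet, [Change]) -> IO (DataSet, [Change])
flush var = atomically $ do
    (ds, changes) <- readTVar var
    writeTVar var (ds, [])
    return (ds, reverse changes)  -- oldest change first
```

The reverse runs once per flush rather than once per update, which is the point of prepending.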
I made some (very simplistic) investigation
https://github.com/lolepezy/rpki-pub-server/tree/add-storage/test/changeLog
imitating exactly the type of load I'm supposedly going to have. I used the same STMContainers.Map for the data set and a usual list for the change log. To track transactions, I used Debug.Trace.trace: the total number of lines printed by trace gives the number of transaction attempts (including retries), and the number of unique lines gives the number of committed transactions.
The result is here (https://github.com/lolepezy/rpki-pub-server/blob/add-storage/test/changeLog/numbers.txt). The first column is the number of threads, the second is the number of change sets generated in total. The third column is the number of trace calls for the case without change log and the last one is the number of trace calls with the change log.
Apparently most of the time change log adds some extra retries, but it's pretty much insignificant. So, I guess, it's fair to say that any data structure would be good enough, because most of the work is related to updating the map and most of the retries are happening because of it.

How to impurely modify a state associated with an object?

In Haskell, I have a container like:
data Container a = Container { length :: Int, buffer :: Unboxed.Vector (Int,a) }
This container is a flattened tree. Its accessor (!) performs a binary (log(N)) search through the vector in order to find the right bucket where index is stored.
(!) :: Container a -> Int -> a
container ! index = ... binary search ...
Since consecutive accesses are likely to be in the same bucket, this could be optimized in the following way:
if `index` is in the last accessed bucket, skip the search
The tricky point is the last accessed bucket part. In JavaScript, I'd just impurely modify a hidden variable on the container object.
function read(index, object) {
    var lastBucket = object.__lastBucket;
    // if the last bucket contains index, no need to search
    if (contains(object, lastBucket, index))
        var bucket = lastBucket;
    // if it doesn't
    else {
        // then we search for the bucket
        var bucket = searchBucket(index, object);
        // and impurely annotate it on the object, so the
        // next time we access it we can skip the search
        object.__lastBucket = bucket;
    }
    return object.buffer[bucket].value;
}
Since this is just an optimization and the result is the same regardless of the branch taken, I believe it doesn't break referential transparency. How is it possible, in Haskell, to impurely modify a state associated with a runtime value?
I have thought of 2 possible solutions.
A global, mutable hashmap linking pointers to the lastBucket value, using unsafePerformIO to write to it. But I'd need a way to get the runtime pointer of an object, or at least a unique id of some sort (how?).
Add an extra field to Container, lastBucket :: Int, somehow impurely modify it within (!), and consider that field internal (because it obviously breaks referential transparency).
Using solution (1), I managed to get the following design. First, I added a __lastAccessedBucket :: IORef Int field to my datatype, as suggested by @Xicò:
data Container a = Container
    { length :: Int
    , buffer :: V.Vector (Int, a)
    , __lastAccessedBucket :: IORef Int }
Then, I had to update the functions that create a new Container in order to create a new IORef using unsafePerformIO:
fromList :: [a] -> Container a
fromList list = unsafePerformIO $ do
    ref <- newIORef 0
    return $ Container (L.length list) buffer ref
  where
    buffer = V.fromList (prepare list)
Finally, I created two new functions: findBucketWithHint, a pure function which searches for the bucket of an index given a hint (i.e., the bucket where you think it might be), and unsafeFindBucket, which replaces the pure findBucket when performance is needed by always using the last accessed bucket as the hint:
unsafeFindBucket :: Int -> Container a -> Int
unsafeFindBucket findIdx container = unsafePerformIO $ do
    let lastBucketRef = __lastAccessedBucket container
    lastBucket <- readIORef lastBucketRef
    let newBucket = findBucketWithHint lastBucket findIdx container
    writeIORef lastBucketRef newBucket
    return newBucket
With this, unsafeFindBucket is technically a pure function with the same API as the original findBucket function, but it is an order of magnitude faster in some benchmarks. I have no idea how safe this is or where it could cause bugs. Threads are certainly a concern.
(This is more an extended comment than an answer.)
First, I'd suggest checking whether this isn't a case of premature optimization. After all, O(log n) isn't that bad.
If this part is indeed performance-critical, your intention is definitely valid. The usual warning for unsafePerformIO is "use it only if you know what you're doing", which you obviously do, and it can help to make things pure and fast at the same time.
Be sure that you follow all the precautions in the docs, in particular setting the proper compiler flags (you might want to use the OPTIONS_GHC pragma).
Also make sure that the IO operation is thread safe. The easiest way to ensure that is to use IORef together with atomicModifyIORef.
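A self-contained toy version of this pattern, with the atomic write the answer recommends (the linear findIndex stands in for the real binary search, and all names here are illustrative, not the question's actual code):

```haskell
import Data.IORef
import Data.List (findIndex)
import Data.Maybe (fromMaybe)
import System.IO.Unsafe (unsafePerformIO)

-- A container whose lookup caches the last bucket it found.
data Container a = Container
    { buffer        :: [(Int, a)]
    , lastBucketRef :: IORef Int
    }

fromList :: [(Int, a)] -> Container a
fromList xs = unsafePerformIO (Container xs <$> newIORef 0)
{-# NOINLINE fromList #-}

-- Check the hinted bucket first; fall back to a full search.
findBucketWithHint :: Int -> Int -> Container a -> Int
findBucketWithHint hint idx c
    | inRange && fst (buffer c !! hint) == idx = hint  -- cache hit
    | otherwise = fromMaybe 0 (findIndex ((== idx) . fst) (buffer c))
  where
    inRange = hint >= 0 && hint < length (buffer c)

-- Pure-looking API; atomicWriteIORef keeps the hidden cache update
-- safe against torn writes from concurrent callers.
unsafeFindBucket :: Int -> Container a -> Int
unsafeFindBucket idx c = unsafePerformIO $ do
    hint <- readIORef (lastBucketRef c)
    let b = findBucketWithHint hint idx c
    atomicWriteIORef (lastBucketRef c) b
    return b
```

Concurrent threads may still overwrite each other's hints, but since the hint only affects which search path is taken, the result stays correct.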
The disadvantage of an internal mutable state is that the performance of the cache will deteriorate if it's accessed from multiple threads, if they lookup different elements.
One remedy would be to explicitly thread the updated state instead of using the internal mutable state. This is obviously what you want to avoid, but if your program is using monads, you could just add another monadic layer that'd internally keep the state for you and expose the lookup operation as a monadic action.
Finally, you could consider using splay trees instead of the array. You'd still have (amortized) O(log n) complexity, but their big advantage is that by design they move frequently accessed elements near the top. So if you'll be accessing a subset of elements of size k, they'll be soon moved to the top, so the lookup operations will be just O(log k) (constant for a single, repeatedly accessed element). Again, they update the structure on lookups, but you could use the same approach with unsafePerformIO and atomic updates of IORef to keep the outer interface pure.

How does the GHC garbage collector / runtime know that it can create an array `inplace'

For example
main = do
let ls = [0..10000000]
print ls
This will create the array 'inplace', using O(1) memory.
The following edit causes the program to run out of memory while executing.
main = do
let ls = [0..10000000]
print ls
print ls
ls in this case must be kept in memory to be printed again. It would actually be heaps more memory efficient to recalculate the array again 'inplace' than to try to keep this in place. That's an aside though. My real question is "how and when does GHC communicate to the runtime system that ls can be destroyed while it's created in O(1) time?" I understand that liveness analysis can find this information, I'm just wondering where the information is used. Is it the garbage collector that is passed this info? Is it somehow compiled away? (If I look at the compiled core from GHC then both examples use eftInt, so if it's a compiler artifact then it must happen at a deeper level).
edit: My question was more about finding where this optimization took place. I thought maybe it was in the GC, which was fed some information from some liveness check in the compilation step. Due to the answers so far I'm probably wrong. This is most likely then happening at some lower level before core, so cmm perhaps?
edit2: Most of the answers here assume that the GC knows that ls is no longer referenced in the first example, and that it is referenced again in the second example. I know the basics of GC and I know that arrays are linked lists, etc. My question is exactly HOW the GC knows this. The answer could probably be only (a) it is getting extra information from the compiler, or (b) it doesn't need to know this, that this information is handled 100% by the compiler
ls here is a lazy list, not an array. In practice, it's closer to a stream or generator in another language.
The reason the first code works fine is that it never actually has the whole list in memory. ls is defined lazily and then consumed element-by-element by print. As print goes along, there are no other references to the beginning of ls, so the list items can be garbage collected immediately.
In theory, GHC could realize that it's more efficient to not store the list in memory between the two prints but instead recompute it. However, this is not always desirable—a lot of code is actually faster if things are only evaluated once—and, more importantly, would make the execution model even more confusing for programmers.
This explanation is probably a lie, especially because I'm making it up as I go, but that shouldn't be a problem.
The essential mistake you're making is assuming that a value is live if a variable bound to it is in scope in a live expression. This is simply wrong. A value bound to a variable is only live as a result if it is actually mentioned in a live expression.
The job of the runtime is very simple
Execute the expression bound to main.
There is no 2.
We can think of this execution as involving a couple different steps that repeat over and over:
Figure out what to do now.
Figure out what to do next.
So we start with some main expression. From the start, the "root set" for GC consists of those names that are used in that main expression, not the things that are in scope in that expression. If I write
foo = "Hi!"
main = print "Bye!"
then since foo is not mentioned in main, it is not in the root set at the beginning, and since it is not even mentioned indirectly by anything mentioned by main, it is dead right from the start.
Now suppose we take a more interesting example:
foo = "Hi!"
bar = "Bye!"
main = print foo >> print bar
Now foo is mentioned in main, so it starts out live. We evaluate main to weak head normal form to find out what to do, and we get, approximately,
(primitive operation that prints out "Hi!") >> print bar
Note that foo is no longer mentioned, so it is dead!
Now we execute that primitive operation, printing "Hi!", and our "to do list" is reduced to
print bar
Now we evaluate that to WHNF, and get, roughly,
(primitive operation to print "Bye!")
Now bar is dead. We print "Bye!" and exit.
Consider, now, the first program you described:
main = do
let ls = [0..10000000]
print ls
This desugars to
main =
    let ls = [0..10000000]
    in print ls
This is where we start. The "root set" at the beginning is everything mentioned in the in clause of the expression. So we conceptually have ls and print to start out. Now we can imagine that print, specialized to [Integer], looks something vaguely like the following (this is greatly simplified, and will print out the list differently, but that really doesn't matter).
print xs = case xs of
    []       -> return ()
    (y : ys) -> printInteger y >> print ys
So when we start executing main (What do we do now? What will we do afterwards?), we are trying to calculate print ls. To do this, we pattern match on the first constructor of ls, which forces ls to be evaluated to WHNF. We find that the second pattern, y:ys, matches, so we replace print ls with printInteger y >> print ys, where y points to the first element of ls and ys points to the thunk representing the second list constructor of ls. But note that ls itself is now dead! Nothing is pointing to it! So as print forces bits of the list, the bits it has already passed become dead.
In contrast, when you have
main =
    let ls = ...
    in print ls >> print ls
and you start executing, you start by calculating the thing to do first (print ls). You get
(printInteger y >> print ys) >> print ls
Everything is the same, except that the second part of the expression now points to ls. So even though the first part will be dropping pieces of the list as it goes, the second part will keep holding on to the beginning, keeping it all live.
Edit
Let me try explaining with something a little simpler than IO. Pretend that your program is an expression of type [Int], and the job of the runtime system is to print each element on its own line. So we can write
countup m n = if m == n then [] else m : countup (m + 1) n
main = countup 0 1000
The runtime system holds a value representing everything that it should print. Let's call the "current value" whatPrint. The RTS needs to follow a process:
Set whatPrint to main.
Is whatPrint empty? If so, I'm done, and can exit the program. If not, it is a cons, printNow : whatPrint'.
Calculate printNow and print it.
Set whatPrint to whatPrint'
Go to step 2.
In this model, the "root set" for garbage collection is just whatPrint.
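The loop above can be transcribed directly; at every step, only the unconsumed tail of the list is reachable from the "root":

```haskell
-- Direct transcription of the model: whatPrint is the runtime's
-- only root, and it shrinks as elements are printed.
countup :: Int -> Int -> [Int]
countup m n = if m == n then [] else m : countup (m + 1) n

runRTS :: [Int] -> IO ()
runRTS whatPrint = case whatPrint of
    []                    -> return ()   -- empty: exit the program
    printNow : whatPrint' -> do
        print printNow                   -- calculate and print
        runRTS whatPrint'                -- loop with the tail as root

main :: IO ()
main = runRTS (countup 0 3)
```

Once runRTS recurses on whatPrint', nothing points at the cons cell it just consumed, so it is garbage immediately.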
In a real program, we don't produce a list; we produce an IO action. But such an action is also a lazy data structure (conceptually). You can think of >>=, return, and each primitive IO operation as a constructor for IO. Think of it as
data IO :: * -> * where
Return :: a -> IO a
Bind :: IO a -> (a -> IO b) -> IO b
PrintInt :: Int -> IO ()
ReadInt :: IO Int
...
The initial value of whatShouldIDo is main, but its value evolves over time. Only what it points to directly is in the root set. There is no magical analysis necessary.

Is it safe to reuse a conduit?

Is it safe to perform multiple actions using the same conduit value? Something like
do
    let sink = sinkSocket sock
    something $$ sink
    somethingElse $$ sink
I recall that in the early versions of conduit there were some dirty hacks that made this unsafe. What's the current status?
(Note that sinkSocket doesn't close the socket.)
That usage is completely safe. The issue in older versions had to do with blurring the line between resumable and non-resumable components. With modern versions (I think since 0.4), the line is very clear between the two.
It might be safe to reuse sinks in the sense that the semantics for the "used" sink doesn't change. But you should be aware of another threat: space leaks.
The situation is analogous to lazy lists: you can consume a huge list lazily in a constant space, but if you process the list twice it will be kept in memory. The same thing might happen with a recursive monadic expression: if you use it once it's constant size, but if you reuse it the structure of the computation is kept in memory, resulting in space leak.
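The lazy-list analogue fits in a few lines. Nothing here is specific to conduit; it just shows the retention effect the paragraph describes:

```haskell
-- Naming the list and consuming it twice forces the whole spine to
-- be retained between the two passes; a single pass (or two
-- separately defined lists) runs in constant space instead.
main :: IO ()
main = do
    let xs = [1 .. 10000000] :: [Int]
    print (sum xs)  -- streams, but xs stays reachable...
    print (sum xs)  -- ...because of this second use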
Here's an example:
import Data.Conduit
import Data.Conduit.List
import Control.Monad.Trans.Class (lift)
consumeN 0 _ = return ()
consumeN n m = do
    await >>= (lift . m)
    consumeN (n - 1) m

main = do
    let sink = consumeN 1000000 (\i -> putStrLn ("Got one: " ++ show i))
    sourceList [1 .. 9000000 :: Int] $$ sink
    sourceList [1 .. 22000000 :: Int] $$ sink
This program uses about 150 MB of RAM on my machine, but if you remove the last line or repeat the definition of sink in both places, you get nice constant space usage.
I agree that this is a contrived example (it was the first that came to mind), and it is not very likely to happen with most Sinks. For example, it will not happen with your sinkSocket. (Why is this contrived? Because the control structure of the sink doesn't depend on the values it gets. And that is also why it can leak.) But for Sources, for example, this would be much more common. Many of the common Sources exhibit this behavior: sourceList is an obvious example, because it would actually keep the source list in memory. But enumFromTo is no different, although there is no data to keep in memory, just the structure of the monadic computation.
So, all in all, I think it's important to be aware of this.

Measuring TChan length

I need to store a buffer of some values in STM. Writer threads need to monitor the buffer's size. I started to implement this using TChan, but then I found out that the API does not provide a way to measure the length of the channel. Being one stubborn fella, I then implemented the thing myself:
readTChanLength ch = do
    empty <- isEmptyTChan ch
    if empty
        then return 0
        else do
            value <- readTChan ch
            len <- readTChanLength ch
            unGetTChan ch value
            return $ 1 + len
Now everything works fine, but I am wondering why such a trivial thing is not implemented in the standard library, and what the preferred approach to this sort of problem is. I realize that this algorithm has at least O(n) complexity, but that can't be the reason, right?
The preferred approach is to keep a counter with the channel, and atomically increment the counter while writing the channel, and decrementing the counter when reading the channel.
Your solution traverses all elements of the channel, which will probably not work well for actual highly concurrent workloads.
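A sketch of the counter approach (the wrapper type and names are made up for illustration): because the TVar is touched in the same transaction as the channel operation, the count can never be observed out of sync with the channel's contents.

```haskell
import Control.Concurrent.STM

-- A TChan paired with a TVar counter, updated atomically together.
data CountedChan a = CountedChan (TChan a) (TVar Int)

newCountedChan :: STM (CountedChan a)
newCountedChan = CountedChan <$> newTChan <*> newTVar 0

writeCounted :: CountedChan a -> a -> STM ()
writeCounted (CountedChan ch n) x = do
    writeTChan ch x
    modifyTVar' n (+ 1)

readCounted :: CountedChan a -> STM a
readCounted (CountedChan ch n) = do
    modifyTVar' n (subtract 1)
    readTChan ch

lengthCounted :: CountedChan a -> STM Int
lengthCounted (CountedChan _ n) = readTVar n
```

The length query is O(1), and unlike the traversal it doesn't touch every element of the channel, so it creates far fewer opportunities for transaction conflicts.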
