Suppose I have a record, e.g. Person, and I want to be able to look this person up through multiple data structures. Maybe there's an index by name, another index by the person's zip code, and another index by the person's current latitude and longitude. And maybe many more data structures. All of which exist because I need to efficiently look up a person or persons with different criteria.
If I just need to read a person's attributes, this is no problem. But now suppose I need to look up a person using one of these data structures and then update the person's data.
In an OOP language, each data structure would point to the same person in memory. So when you update one, you're implicitly updating the referents of the other data structures as well. This is pretty much the definition of side effects and impurity. I know it's totally counter to the Haskell paradigm, and I'm not expecting Haskell to work this way.
So, what is the Haskell-ish way to do it? To be clear, the problem is this: I look up a person by one data structure, and I pass that person (and maybe some other arbitrary data) into a function of type ArbitraryData -> Person -> Person. How do I propagate this change across all the various lookup structures?
As a relative newcomer to Haskell, my first instinct is to reconstruct every lookup structure with the newly updated person, every time I update a person. But that seems like a lot of ceremony, with plenty of opportunity for me to screw up in a way GHC can't detect, and not at all elegant. Haskell is known for its elegance, and I can't imagine it lacks an elegant solution to such a common and basic problem. So I think I'm missing something.
For reference, this question expands on some of the issues I was discussing in the following questions:
Multiple lookup structures for same data: Memory duplication?
Identity of simulation objects in Haskell
Edit
One solution that just crossed my mind: Don't maintain a copy of each lookup structure in your main state. Just keep one single list of all persons in existence, and that's the only thing we need to update when we update a person. Every time you need to lookup by, say, zip code, pass the list of all persons into a function that generates the efficient by-zip-code data structure. Then perform the lookup on the result.
I don't know if this would be efficient. If it results in the CPU actually recomputing the lookup structure on each use, it's unacceptable. But I know Haskell sometimes can avoid reevaluating identical expressions. Unfortunately, I still haven't figured out when this is the case. So I don't know if this approach is viable.
So in other words: Can I write my functions as if they're computing the lookup each time, when in fact GHC will optimize it away for cases where the underlying data hasn't changed? Because that would be a very elegant solution to the problem I've identified above.
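To make the idea concrete, here is a rough sketch (the types are made up) of keeping one list of persons and deriving a lookup structure from it. Note that whether the derived structure gets rebuilt is governed by laziness and sharing of the thunk, not by GHC recognizing that two separate expressions are identical:
import qualified Data.Map as Map
import Data.Map (Map)

data Person = Person { name :: String, zipCode :: String } deriving Show

-- The list is the single source of truth; byZip is derived and, being a lazy
-- field, is built at most once per AppState value and only if it is used.
data AppState = AppState
    { people :: [Person]
    , byZip  :: Map String [Person]
    }

mkState :: [Person] -> AppState
mkState ps = AppState
    { people = ps
    , byZip  = Map.fromListWith (++) [ (zipCode p, [p]) | p <- ps ]
    }

-- Every update builds a fresh AppState, so the derived index is recomputed
-- lazily for the new version only.
updateWhere :: (Person -> Bool) -> (Person -> Person) -> AppState -> AppState
updateWhere match f st = mkState [ if match p then f p else p | p <- people st ]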
Since I answered this, a few people in #haskell on Freenode recommended alternative, premade solutions:
mm_freak_ recommended Data.IxSet.
donri recommended Data.Store, which was said to provide lenses.
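For a flavour of the ixset approach, here is a hedged sketch from memory (the exact API may differ between Data.IxSet versions, and Name, Zip, and Person are made up for illustration):
{-# LANGUAGE DeriveDataTypeable #-}
import Data.Data (Data, Typeable)
import Data.IxSet (Indexable(..), IxSet, ixSet, ixFun, insert, updateIx, getOne, (@=))

newtype Name = Name String deriving (Eq, Ord, Show, Data, Typeable)
newtype Zip  = Zip  String deriving (Eq, Ord, Show, Data, Typeable)

data Person = Person { name :: Name, zipCode :: Zip }
    deriving (Eq, Ord, Show, Data, Typeable)

-- Declare the indexes once; IxSet keeps all of them in sync on every insert/update.
instance Indexable Person where
    empty = ixSet [ ixFun (\p -> [name p])
                  , ixFun (\p -> [zipCode p])
                  ]

people0 :: IxSet Person
people0 = insert (Person (Name "Alice") (Zip "02134"))
        $ insert (Person (Name "Bob")   (Zip "90210"))
          empty

-- Look someone up by one index and replace them; every index sees the change.
moveAlice :: IxSet Person -> IxSet Person
moveAlice ps = case getOne (ps @= Name "Alice") of
    Nothing -> ps
    Just p  -> updateIx (Name "Alice") p { zipCode = Zip "10001" } ps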
You can make a data structure that contains your lookup tables, as well as a Vector of actual Persons. The lookup tables will give you an Int or a list of Ints (rather than a Person or a list of Persons), which index into the Vector Person. For example:
data PersonStuff = PersonStuff {
        persons :: Vector Person,
        firstNameLookupTable :: LookupTable Name,
        ...
    }

data LookupTable a = LookupTable {
        table :: Map a Int,
        update :: Person -> Person -> Map a Int -> Map a Int
    }
The update function is given the old Person, the updated Person, and will update the table only if the relevant details have changed. When a Person is modified through the convenient PersonStuff functions you'll write, those functions will update all the lookup tables for you, returning a new PersonStuff with all associated data. This makes for a pure data structure with quick lookups.
You can make functions like updatePeopleWithFirstName :: Name -> (Person -> Person) -> PersonStuff -> PersonStuff that will get all the people with a first name, apply a Person -> Person to each of them, modify their entries in the Vector, and use the update functions to update all of the lookupTables.
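Here is a hedged sketch of how those pieces might fit together. I've made table hold [Int] rather than Int so a key can match several people; this is my guess at the plumbing, not the answer's exact design:
import qualified Data.Map as Map
import Data.Map (Map)
import qualified Data.Vector as V
import Data.Vector (Vector)

type Name = String

data Person = Person { firstName :: Name } deriving Show

data LookupTable a = LookupTable
    { table  :: Map a [Int]   -- keys to positions in the Vector
    , update :: Person -> Person -> Map a [Int] -> Map a [Int]
    }

data PersonStuff = PersonStuff
    { persons              :: Vector Person
    , firstNameLookupTable :: LookupTable Name
    }

-- Apply f to every person with the given first name and repair the indexes.
updatePeopleWithFirstName :: Name -> (Person -> Person) -> PersonStuff -> PersonStuff
updatePeopleWithFirstName n f ps = ps
    { persons              = persons ps V.// [ (i, new) | (i, _, new) <- changes ]
    , firstNameLookupTable = lt { table = foldr step (table lt) changes }
    }
  where
    lt      = firstNameLookupTable ps
    indices = Map.findWithDefault [] n (table lt)
    changes = [ (i, old, f old) | i <- indices, let old = persons ps V.! i ]
    step (_, old, new) = update lt old new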
I would probably just update every lookup structure with the new value. Perhaps grouping the structures in a record and providing a global update function.
Or perhaps you could designate one of the search criteria as "primary", and have the values in the other lookup maps point to the "primary key" of the object, instead of to the object value itself. That would cause one additional lookup for each access by non-primary key, though.
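A hedged sketch of that primary-key arrangement (the types and field names are made up):
import qualified Data.Map as Map
import Data.Map (Map)

type PersonId = Int
data Person = Person { name :: String, zipCode :: String } deriving Show

data People = People
    { byId   :: Map PersonId Person   -- the primary store
    , byName :: Map String PersonId   -- secondary indexes hold only the key
    , byZip  :: Map String PersonId
    }

-- Updating a person touches only the primary map (as long as the indexed
-- fields themselves don't change).
updateById :: PersonId -> (Person -> Person) -> People -> People
updateById i f ps = ps { byId = Map.adjust f i (byId ps) }

-- Each secondary lookup costs one extra hop through the primary map.
lookupByZip :: String -> People -> Maybe Person
lookupByZip z ps = Map.lookup z (byZip ps) >>= \i -> Map.lookup i (byId ps)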
We have two challenges. The first is "How do [we] propagate [a] change across ... various lookup structures". The second is to minimize the work done when we perform lookups.
Let's make some working code so that we have something concrete to discuss.
To begin with, let's look at what an "update" or "change" is. An update or change starts in one state and ends up in another state; it's a function from the previous state to the next state, essentially type Update = State -> State. In Haskell, we can make the state disappear by hiding it in some Monad; this is a very common practice, so despite the fact that it looks "impure" it is very "Haskell-ish". You can read more about this idea by reading about the state monad.
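As a tiny illustration (with a made-up World type), here is the same update written both ways, first as a plain function from state to state and then hidden inside the state monad:
import Control.Monad.State (State, modify, execState)

data World = World { population :: Int } deriving Show

type Update = World -> World

grow :: Update
grow w = w { population = population w + 1 }

-- the same update, with the state threading hidden by the State monad
growM :: State World ()
growM = modify grow

demo :: World
demo = execState (growM >> growM) (World 0)   -- World {population = 2}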
Here's a class similar to MonadState that lets us talk about values we can allocate (new), update (set), and inspect (get).
-- Class for a typed dictionary in a monadic context
class (Monad m) => MonadReference m where
    type Reference :: * -> *
    new :: (Typeable a) => a -> m (Reference a)
    set :: (Typeable a) => (Reference a) -> a -> m ()
    get :: (Typeable a) => (Reference a) -> m a
We'll use this to write some very simple example code.
data Person = Person {
        name :: String
    } deriving (Show, Typeable)

data Company = Company {
        legalName :: String
    } deriving (Show, Typeable)
-- the only thing we need MonadIO for in this example is printing output
example1 :: (MonadIO m, MonadReference m) => m ()
example1 = do
-- alice :: Reference Person
alice <- new $ Person { name = "Alice" }
bob <- new $ Person { name = "Bob" }
-- company :: Reference Company
company <- new $ Company { legalName = "Eve's Surveillance" }
(liftIO . print) =<< get alice
(liftIO . print) =<< get bob
(liftIO . print) =<< get company
(liftIO . putStrLn) ""
set alice Person { name = "Mike" }
set company Company { legalName = "Mike's Meddling" }
(liftIO . print) =<< get alice
(liftIO . print) =<< get bob
(liftIO . print) =<< get company
We've used new, get, and set to create some References, inspect them, and modify them.
To get this to work, we need a bit of boring boilerplate. We'll borrow IORef for our implementation of a Reference to run this code without writing too much code ourselves.
{-# LANGUAGE TypeFamilies, DeriveDataTypeable #-}
module Main (
main
) where
import Data.Typeable
import Data.Traversable
import Control.Applicative
import Data.IORef
--transformers package:
import Control.Monad.IO.Class
main = example1
-- Instead of implementing a dictionary, for an example we'll just use IORefs when we have IO.
instance MonadReference IO where
type Reference = IORef
new = newIORef
set = writeIORef
get = readIORef
Now, on top of updating people, we'd also like to update the people in multiple data structures. We'll look at two data structures: a list, [Person], and a tuple, (Person,Company). Now, we could make a list of References to people, say (people :: [Reference Person]) = [alice, bob], but this isn't very useful. For example, we don't really know how to show it. It would be more useful if the References weren't intermingled with the list structure. Naively, Reference [Person] would be more useful. But it would mean nothing to set this Reference, so clearly we have the wrong type. Reference [Person] would just let us call get to turn it into an m [Person], so we could skip that and just use m [Person]. Here's an example that does that:
-- the only thing we need MonadIO for in this example is printing output
example2 :: (MonadIO m, MonadReference m) => m ()
example2 = do
-- alice :: Reference Person
alice <- new $ Person { name = "Alice" }
bob <- new $ Person { name = "Bob" }
-- company :: Reference Company
company <- new $ Company { legalName = "Eve's Surveillance" }
(liftIO . print) =<< get alice
(liftIO . print) =<< get bob
(liftIO . print) =<< get company
let people = do
a <- get alice
b <- get bob
return [a, b]
let structure2 = do
a <- get alice
c <- get company
return (a, c)
(liftIO . print) =<< people
(liftIO . print) =<< structure2
(liftIO . putStrLn) ""
set alice Person { name = "Mike" }
set company Company { legalName = "Mike's Meddling" }
(liftIO . print) =<< get alice
(liftIO . print) =<< get bob
(liftIO . print) =<< get company
(liftIO . print) =<< people
(liftIO . print) =<< structure2
Now we know quite a bit about what a library or libraries for doing this should look like. Here are some of the requirements we might have already imagined:
We need something that keeps the state of all of the entities
We need a way to go from one state to a new state that has a new entity
We need a way to update an entity stored in the state
We need a way to retrieve an entity from the state
Here are some requirements that emerge from experimenting with some code:
We need a way to make a state dependent value that depends on the current state of an entity. We saw this in get alice, get bob, and get company.
We need a way to make a state dependent value out of something constant. We saw this in the use of the (:), [], and (,) constructors.
We need a way to combine together multiple state dependent values into new state dependent values.
There are also a few problems with our example. If we embrace MonadReference m => m a as the type of a state dependent value of type a, there's nothing to stop something that we think is getting the value from the state from also modifying it.
A state dependent value shouldn't be able to modify the state.
We also have performance problems. All of our state dependent values are being completely recalculated every time we use them. A good performance requirement might be:
A state dependent value shouldn't be calculated unless state that it depends on has been modified.
Armed with these new requirements, we can make new interfaces. After we make new interfaces, we can equip them with a naive implementation. After we have a naive implementation, we can address our requirements for performance, and make a performant implementation.
Some exercises that could prepare us for the next steps include reading about or playing with Control.Applicative, the publisher-subscriber design pattern, the operational monad and transformer (Program and ProgramT) or the free monad and transformer (Free, FreeF, and FreeT), Data.Traversable, Control.Lens, and the knockout.js JavaScript library.
Update: The new interfaces
Based on our new requirements for what state dependent values are, we can write a new interface:
-- Class for a monad with state dependent values
class (MonadReference m, Applicative Computed, Monad Computed) => MonadComputed m where
    type Computed :: * -> *
    track :: (Typeable a) => Reference a -> m (Computed a)
    runComputed :: (Typeable a) => (Computed a) -> m a
These address our new requirements as follows:
track makes a state dependent value that depends on a Reference, which satisfies our first new requirement.
Applicative's pure and Monad's return both provide a method by which to create new Computed values that contain a constant.
Applicative's <*> and Monad's >>= provide methods by which to combine computed values into new computed values.
The Computed type provides a means for an implementation to exclude unwanted types.
Now we can write new example code in terms of this interface. We'll construct computed values three different ways: Using Data.Traversable's sequenceA on lists with the Applicative instance for Computed, using the Monad instance for Computed, and finally using the Applicative instance for Computed.
-- the only thing we need MonadIO for in this example is printing output
example :: (MonadIO m, MonadComputed m) => m ()
example = do
-- aliceRef :: Reference Person
aliceRef <- new $ Person { name = "Alice" }
-- alice :: Computed Person
alice <- track aliceRef
bobRef <- new $ Person { name = "Bob" }
bob <- track bobRef
-- companyRef :: Reference Company
companyRef <- new $ Company { legalName = "Eve's Surveillance" }
-- company :: Computed Company
company <- track companyRef
(liftIO . print) =<< runComputed alice
(liftIO . print) =<< runComputed bob
(liftIO . print) =<< runComputed company
let people = Traversable.sequenceA [alice, bob]
let structure2 = do
a <- alice
c <- company
return (a, c)
let structure3 = (pure (,)) <*> structure2 <*> bob
(liftIO . print) =<< runComputed people
(liftIO . print) =<< runComputed structure2
(liftIO . print) =<< runComputed structure3
(liftIO . putStrLn) ""
set aliceRef Person { name = "Mike" }
set companyRef Company { legalName = "Mike's Meddling" }
(liftIO . print) =<< runComputed alice
(liftIO . print) =<< runComputed bob
(liftIO . print) =<< runComputed company
(liftIO . print) =<< runComputed people
(liftIO . print) =<< runComputed structure2
(liftIO . print) =<< runComputed structure3
Note that if we didn't want or need to track aliceRef and track bobRef independently, we could create a list of Computed values by mapM track [aliceRef, bobRef].
Now we can make another simple implementation for IO, so that we can run our example and see that we are on the right track. We'll use operational's Program type to make this simple and get us both an Applicative and a Monad instance.
-- Evaluate computations built in IO
instance MonadComputed IO where
-- Store the syntax tree in a Program from operational
type Computed = Program IORef
track = return . singleton
runComputed c = case view c of
Return x -> return x
ref :>>= k -> do
value <- readIORef ref
runComputed (k value)
At this point the entire running example is:
{-# LANGUAGE TypeFamilies, DeriveDataTypeable, FlexibleContexts #-}
module Main (
main
) where
import Data.Typeable
import qualified Data.Traversable as Traversable
import Control.Applicative
import Data.IORef
--transformers package:
import Control.Monad.IO.Class
--operational package:
import Control.Monad.Operational
main = example
data Person = Person {
name :: String
} deriving (Show, Typeable)
data Company = Company {
legalName :: String
} deriving (Show, Typeable)
-- the only thing we need MonadIO for in this example is printing output
example :: (MonadIO m, MonadComputed m) => m ()
example = do
-- aliceRef :: Reference Person
aliceRef <- new $ Person { name = "Alice" }
-- alice :: Computed Person
alice <- track aliceRef
bobRef <- new $ Person { name = "Bob" }
bob <- track bobRef
-- companyRef :: Reference Company
companyRef <- new $ Company { legalName = "Eve's Surveillance" }
-- company :: Computed Company
company <- track companyRef
(liftIO . print) =<< runComputed alice
(liftIO . print) =<< runComputed bob
(liftIO . print) =<< runComputed company
let people = Traversable.sequenceA [alice, bob]
let structure2 = do
a <- alice
c <- company
return (a, c)
let structure3 = (pure (,)) <*> structure2 <*> bob
(liftIO . print) =<< runComputed people
(liftIO . print) =<< runComputed structure2
(liftIO . print) =<< runComputed structure3
(liftIO . putStrLn) ""
set aliceRef Person { name = "Mike" }
set companyRef Company { legalName = "Mike's Meddling" }
(liftIO . print) =<< runComputed alice
(liftIO . print) =<< runComputed bob
(liftIO . print) =<< runComputed company
(liftIO . print) =<< runComputed people
(liftIO . print) =<< runComputed structure2
(liftIO . print) =<< runComputed structure3
-- Class for a typed dictionary in a monadic context
class (Monad m) => MonadReference m where
type Reference :: * -> *
new :: (Typeable a) => a -> m (Reference a)
set :: (Typeable a) => Reference a -> a -> m ()
get :: (Typeable a) => Reference a -> m a
-- Class for a monad with state dependent values
class (MonadReference m, Applicative Computed, Monad Computed) => MonadComputed m where
type Computed :: * -> *
track :: (Typeable a) => Reference a -> m (Computed a)
runComputed :: (Typeable a) => (Computed a) -> m a
-- Instead of implementing a dictionary, for an example we'll just use IORefs when we have IO.
instance MonadReference IO where
type Reference = IORef
new = newIORef
set = writeIORef
get = readIORef
-- Evaluate computations built in IO
instance MonadComputed IO where
-- Store the syntax tree in a Program from operational
type Computed = Program IORef
track = return . singleton
runComputed c = case view c of
Return x -> return x
ref :>>= k -> do
value <- readIORef ref
runComputed (k value)
We still need to address our performance requirement to minimize the work done when we perform lookups. Our goal requirement was:
A state dependent value shouldn't be calculated unless state that it depends on has been modified.
We can now clarify this to be in terms of our interface:
The value produced by runComputed shouldn't be recalculated unless a Computed value that it depends on has been modified since the last time runComputed was executed.
We can now see that our desired solution is going to be something like cache invalidation or bottom-up query evaluation. I'd guess that in a language with lazy evaluation, they both work out to be about the same thing.
Final Update: Performance
Equipped with a new interface, we can now explore and address our performance goal. In doing so, I discovered that there's an additional, subtle requirement that we missed. We would like runComputed to reuse previously computed values if nothing has changed. What we didn't notice is that Haskell's type system should, and does, prevent us from doing so. A value of type Computed a always means the same thing; it's never actually modified. So the computations that were building our structures will mean the same thing, "a computation constructed from these parts", even after we have executed runComputed. We need somewhere to slip in the side effect from the first runComputed. We can do this with the type m (Computed a) instead. The new method in MonadComputed m that does this is:
share :: (Typeable a) => (Computed a) -> m (Computed a)
The new Computed a we get back means something slightly different: "a possibly cached computation constructed from these parts". We were already doing something similar, but telling Haskell about it instead of telling our code. We wrote, for example:
let people = Traversable.sequenceA [alice, bob]
This let told the Haskell compiler that each time it encountered people it should use the same thunk. If we instead wrote Traversable.sequenceA [alice, bob] each time it was used, the Haskell compiler probably wouldn't have created and maintained a pointer to a single thunk. This can be a nice thing to know when juggling memory: if you want to keep something in memory and avoid recomputation, use let; if you want to recompute it to avoid holding onto the memory, don't use let. Here we explicitly want to hold on to our computed structures, so we are going to use our new equivalent, share:
people <- share $ Traversable.sequenceA [alice, bob]
The remainder of the changes to the example code at the end are to demonstrate more possible updates.
Now that we have the interface finalized, we can work on an implementation. This implementation will still take advantage of IO and IORefs. Our implementation is going to be based on subscribing to be notified of changes, and invalidating cached values and their dependants when a change happens. This data structure stores a value and the subscribers that want to be notified:
-- A published value for IO, using Weak references to the subscribers
data Published a = Published {
        valueRef :: IORef a,
        subscribers :: IORef [Weak (IO ())]
    }
Something that needs to be notified when something happens in IO could be as simple as IO (), but then the cycle between a dependant computation and a value would hold all the dependant computations in memory until the original value is forgotten. Instead, a Weak pointer (from System.Mem.Weak) to the dependant's update action allows the garbage collector to collect these.
First we'll implement MonadReference IO. Our code to handle References to entities is modified to peek through Published to get the value, and to execute all of the subscribers when the value is set.
-- A new implementation that keeps an update list
instance MonadReference IO where
type Reference = Published
new = newIORefPublished
set = setIORefPublished
get = readIORefPublished
-- Separate implementations for these, since we'd like to drop the Typeable constraint
newIORefPublished value =
do
ref <- newIORef value
subscribersRef <- newIORef []
return Published { valueRef = ref, subscribers = subscribersRef }
setIORefPublished published value =
do
writeIORef (valueRef published) value
notify $ subscribers published
--readIORefPublished = readIORef . valueRef
readIORefPublished x = do
putStrLn "getting"
readIORef $ valueRef x
Notifying the subscribers is a bit tricky. We need to forget about any subscriber that has been removed by garbage collection. I anticipated that a subscriber might be subscribing to things during its update action (for the tricky case of binding), so when a subscriber is garbage collected we don't assume that the new set of subscribers is the old set minus the garbage collected ones; instead we filter them in a separate cleanupWeakRefs step.
notify :: IORef [Weak (IO ())] -> IO ()
notify = go
where
go subscribersRef = do
subscribers <- readIORef subscribersRef
needsCleanup <- (liftM (any id)) (mapM notifySubscriber subscribers)
when needsCleanup $ cleanupWeakRefs subscribersRef
notifySubscriber weakSubscriber = do
maybeSubscriber <- deRefWeak weakSubscriber
case maybeSubscriber of
Nothing -> return True
Just subscriber -> subscriber >> return False
cleanupWeakRefs :: IORef [Weak a] -> IO ()
cleanupWeakRefs ref = do
weaks <- readIORef ref
newWeaks <- (liftM catMaybes) $ mapM testWeak weaks
writeIORef ref newWeaks
where
testWeak weakRef = liftM (>> Just weakRef) $ deRefWeak weakRef
We're done with our handling of entities, time to get on to the interesting and tricky part, the computations. Here's the complete data type for a computation or state dependent value:
-- Data type for building computations
data IORefComputed a where
    Pure :: a -> IORefComputed a
    Apply :: IORefComputed (b -> a) -> IORefComputed b -> IORefComputed a
    Bound :: IORefComputed b -> (b -> IORefComputed a) -> IORefComputed a
    Tracked :: Published a -> IORefComputed a
    Shared :: Published (Either (IORefComputed a) a) -> IORefComputed a
Pure represents values that don't depend on anything. Apply represents values built by applications of <*>. Bound represents values built using the Monad instance's >>=. Tracked are ordinary state dependent values made using track. Shared are the points at which we remember computations and are notified of changes to tracked values, made using share. We reuse the Published type to store a value and its subscribers, but the value we store is Either the computation that needs to be performed when the shared cache is dirty, (IORefComputed a), or the cached value when the cache is clean, a. Here are the instances that let the user use these:
instance Monad IORefComputed where
    return = Pure
    (>>=) = Bound
    (>>) _ = id

instance Applicative IORefComputed where
    pure = return
    (<*>) = Apply

instance Functor IORefComputed where
    fmap = (<*>) . pure

-- Evaluate computations built in IO
instance MonadComputed IO where
    type Computed = IORefComputed
    track = trackIORefComputed
    runComputed = evalIORefComputed
    share = shareIORefComputed

-- Separate implementations, again to drop the Typeable constraint
trackIORefComputed = return . Tracked
Note: the optimization of >> almost certainly violates the Monad laws in the presence of _|_.
Now we need to make the non-trivial implementations of runComputed and share. First we'll look at share, which does most of the new work:
shareIORefComputed :: IORefComputed a -> IO (IORefComputed a)
shareIORefComputed c =
case c of
Apply cf cx -> do
sharedf <- shareIORefComputed cf
sharedx <- shareIORefComputed cx
case (sharedf, sharedx) of
-- Optimize away constants
(Pure f, Pure x) -> return . Pure $ f x
_ -> do
let sharedc = sharedf <*> sharedx
published <- newIORefPublished $ Left sharedc
-- What we are going to do when either argument changes
markDirty <- makeMarkDirty published published sharedc
subscribeTo sharedf markDirty
subscribeTo sharedx markDirty
return $ Shared published
Bound cx k -> do
sharedx <- shareIORefComputed cx
case cx of
-- Optimize away constants
(Pure x) -> shareIORefComputed $ k x
_ -> do
let dirtyc = sharedx >>= k
published <- newIORefPublished $ Left dirtyc
-- What we are going to do when the argument to k changes
markDirty <- makeMarkDirty published published dirtyc
subscribeTo sharedx markDirty
return $ Shared published
_ -> return c
When we are asked to share an application of <*>, Apply, we first share both of its arguments. We optimize away the value if we can determine it to be constant. If we can't optimize it away, we make a new, initially dirty cache, and ask to be updated whenever either argument changes.
Dealing with >>= is much more difficult. We share the argument to >>=, but we don't know what Computed value the function will return until we evaluate it with each argument. We say that it can be computed by evaluating the entire bind, and ask to have the cache invalidated whenever the argument changes. We will definitely want to improve this later.
In all other cases there's nothing to be done to cache the value; it is either a constant or an already cached Tracked or Shared.
If you are doubting the need for share, replace this definition with
shareIORefComputed c = return c
and run the examples. You'll see that every involved value is read every time we run runComputed. There's nothing you can do in runComputed to modify an existing Computed to know about a place it has been cached, because we can't change existing values in Haskell.
Now we'll implement runComputed. The basic idea is that we evaluate things as before, but when we encounter a dirty shared cache we calculate its new value and update the cache. These updates do not trigger notification of the subscribers.
evalIORefComputed :: IORefComputed a -> IO a
evalIORefComputed c =
case c of
Pure x -> return x
Apply cf cx -> do
f <- evalIORefComputed cf
x <- evalIORefComputed cx
return (f x)
Bound cx k -> do
value <- evalIORefComputed cx
evalIORefComputed (k value)
Tracked published -> readIORefPublished published
Shared publishedThunk -> do
thunk <- readIORefPublished publishedThunk
case thunk of
Left computation@(Bound cx k) -> do
x <- evalIORefComputed cx
-- Make a shared version of the computed computation
currentExpression <- shareIORefComputed (k x)
let gcKeyedCurrentExpression = Left currentExpression
writeIORef (valueRef publishedThunk) gcKeyedCurrentExpression
markDirty <- makeMarkDirty publishedThunk gcKeyedCurrentExpression computation
subscribeTo currentExpression markDirty
evalIORefComputed c
Left computation -> do
value <- evalIORefComputed computation
writeIORef (valueRef publishedThunk) (Right value)
return value
Right x ->
return x
This is straightforward except for what we do for a dirty shared >>=. We evaluate the argument, then we share the resulting computation. The trick is that we ask that the entire shared thunk be marked dirty when this new value is updated. We ask the garbage collector to forget about this when the dirty marking for this currentExpression is garbage collected. This provides a window during which the thunk might be marked dirty even if it no longer depends on currentExpression. Now a shared bind will be marked dirty by changes to its argument, by changes to the computed value that depended on its argument, and by changes to computed values that recently depended on its argument and haven't been garbage collected yet.
The remainder of the implementation is building the weak references to notifications and subscribing to a published value by prepending the new subscriber.
makeMarkDirty :: Published (Either (IORefComputed a) a) -> k -> IORefComputed a -> IO (Weak (IO ()))
makeMarkDirty published key definition =
do
let markDirty = do
existing <- readIORef (valueRef published)
case existing of
Right _ -> setIORefPublished published $ Left definition
_ -> return ()
mkWeak key markDirty Nothing
subscribeTo :: IORefComputed a -> Weak (IO ()) -> IO ()
subscribeTo (Tracked published) trigger = modifyIORef' (subscribers published) (trigger :)
subscribeTo (Shared published) trigger = modifyIORef' (subscribers published) (trigger :)
subscribeTo _ _ = return ()
The complete compiling example code is on github. It requires the transformers package.
If you run the example code, you'll notice that:
After the company name is changed, runComputed people performs only a single get to get the cached value
After no changes are made, all three structures each perform only a single get to get the entire cached structure
After bob is changed, runComputed structure2 performs only a single get to get the cached value, and the work done to compute structure3 is less, even though bob is in structure3.
structure2, the one built with the Monad instance, requires the most work to compute, due to the extra intermediary shared values.
If you need it done efficiently, you'll have to downgrade to mutable data structures and basically the IO monad.
Updatable references between objects, as in OO, are available in Haskell as well: they are IORefs. There are also thread-safe versions of them, MVar and TVar; the choice between them depends on your concurrency model.
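To make the parallel with OO pointers concrete, here is a minimal sketch in which two indexes hold the same IORef, so an update made through one index is visible through the other (in concurrent code you would reach for MVar or TVar instead):
import Data.IORef
import qualified Data.Map as Map
import Data.Map (Map)

data Person = Person { name :: String, zipCode :: String } deriving Show

main :: IO ()
main = do
    alice <- newIORef (Person "Alice" "02134")
    let byName = Map.fromList [("Alice", alice)] :: Map String (IORef Person)
        byZip  = Map.fromList [("02134", alice)] :: Map String (IORef Person)
    -- update through the name index...
    case Map.lookup "Alice" byName of
        Just ref -> modifyIORef ref (\p -> p { zipCode = "10001" })
        Nothing  -> return ()
    -- ...and observe the change through the zip index
    mapM_ (\ref -> readIORef ref >>= print) (Map.elems byZip)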
A data structure with different kinds of references between objects is called a graph, and it so happens that I'm currently working on a Haskell graph database project. The project is getting close to its first release. The in-memory data structure is already implemented, the persistence layer too; all that's left is the client and server. So just keep an eye on it; I'll reddit about it on release. The source repository is here: https://github.com/nikita-volkov/graph-db/, though I haven't been pushing updates for some time so it's a bit outdated.
Haskell tries to encourage you to think about values, not entities. By this, I mean that pure code will in most cases structure things by transforming values from one kind to another, not modifying or updating data shared by many others. Equality/identity of objects is defined entirely by their content, not their location. But let me be more concrete.
The general solution to "pure mutation" is to create an endomorphism. In your case, if you had a Directory of people you could read a person's data with a function with the signature
type Name = String
get :: Name -> Directory -> Person
and modify it with a function
mod :: Name -> (Person -> Person) -> (Directory -> Directory)
If you have a lot of modification functions f, g, h, i then you can string them together
mod i . mod h . mod g . mod f
But what's important to realize is that every Directory created in that chain can potentially exist on its own and be updated/read/modified. That's the nature of immutability---data is persistent and we have to manually push our data "through time" as we modify it.
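Concretely, with a Map-backed Directory these functions could be written as below (a sketch; the Person fields are placeholders):
import Prelude hiding (mod)
import qualified Data.Map as Map
import Data.Map (Map)

type Name = String
data Person = Person { age :: Int } deriving Show
type Directory = Map Name Person

get :: Name -> Directory -> Person
get n dir = dir Map.! n

mod :: Name -> (Person -> Person) -> (Directory -> Directory)
mod n f = Map.adjust f n

-- chaining modifications; every intermediate Directory is still a usable value
example :: Directory -> Directory
example = mod "Bob" (\p -> p { age = 30 }) . mod "Ann" (\p -> p { age = age p + 1 })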
So how do you propagate changes to other structures? In short... you can't. If you're trying to, you're modeling things in ways that are very hard to do purely.
Haskell asks you What do you mean by "propagate"? These objects are based on data in the past and we cannot mutate that fact.
There are definitely limitations to pure, immutable data. Some algorithms cannot translate and are often implemented by recreating "pointer arithmetic" atop a unique name generator and a finite Map. If this is your case, it's better to start introducing impure effects via the ST or IO monads where you can get true memory mutation out of the STRef and IORef container types.
The "update all the index structures" approach doesn't have to be needless ceremony, if you model your concept of a "collection of people with efficient lookup operations" as a unitary thing in itself, rather than a bunch of independent collections that you're "manually" trying to keep in sync with each other.
Say you've got a Person type. Then you have a collection of Person objects that you want to be indexed by the types Name and Zip. You could use things like Map Name Person and Map Zip Person, but that doesn't really express your meaning. You don't have two groups of people, one keyed by Name and the other keyed by Zip. You have one group of people, which can be looked up by either Name or Zip, so the code you write and the data structures you use should reflect that.
Let's call the collection type People. For your index lookups you'll end up with something like findByName :: People -> Name -> Person and findByZip :: People -> Zip -> Person.
You've also got functions of type Person -> Person that can "update" Person records. So you can use findByName to pull out a Person from a People, then apply an update function to get a new Person. Now what? You'll have to construct a new People with the original Person replaced with a new Person. The "update" functions can't handle this, since they're only concerned with Person values, and don't know anything about your People store (there could even be many People stores). So you'll need a function like updatePeople :: Person -> Person -> People -> People, and you'll end up writing a lot of code like this:
let p  = findByName name people
    p' = update p
in  updatePeople p p' people
That's a bit boilerplatey. Looks like a job for updateByName :: Name -> (Person -> Person) -> People -> People.
With that, where in an OO language you might write something like people.findByName(name).changeSomething(args) you can now write updateByName name (changeSomething args) people. Not so different!
Note that I haven't talked at all about how any of these data structures or operations are actually implemented. I'm thinking purely about the concepts you have and the operations that make sense on them. That means a scheme like this will work regardless of how you implement them; you can even (probably should?) hide the implementation details behind a module barrier. You may well implement People as a record of multiple collections mapping different things to your Person records, but from the "outside" you can just think of it as a single collection that supports multiple different types of lookup/update operations, and you don't have to worry about keeping multiple indexes in sync. It's only within the implementation of the People type and its operations that you have to worry about that, which gives you a place to solve it once and well, rather than having to do it correctly on every operation.
You can take this sort of thing further. With some extra assumptions (such as the knowledge that your Name, Zip, and any other indexes are all implemented with the same pattern just on different fields of Person/People) you can probably use type classes and/or template Haskell to avoid having to implement findByName, findByZip, findByFavouriteSpoon etc separately (although having separate implementations gives you more opportunity to use different indexing strategies depending on the types involved, and may help with optimizing the updates so that e.g. you only have to update the indexes that could possibly be invalidated). You can use type classes and type families to implement a findBy that uses the type of whatever index key it is invoked on to determine which index to use, whether you have separate implementations or a single generic one (although this means that you can't have multiple indexes with the same type).
Here's an example I knocked up when I should've been working, providing type-class-based findBy and updateBy operations:
{-# LANGUAGE FlexibleContexts, MultiParamTypeClasses, TypeFamilies #-}
import Data.Map (Map, (!), adjust, delete, insert)
-- sample data declarations
newtype Name = Name String
deriving (Eq, Ord, Show)
newtype Zip = Zip Int
deriving (Eq, Ord, Show)
data Person = Person
{ name :: Name
, zipCode :: Zip
}
-- you probably wouldn't export the constructor here
data People = People
{ byName :: Map Name Person
, byZip :: Map Zip Person
}
-- class for stores that can be indexed by key
class FindBy key store where
type Result key store
findBy :: key -> store -> Result key store
updateBy :: key -> (Result key store -> Result key store) -> store -> store
-- helper functions
-- this stuff would be hidden
updateIndex
:: Ord a
=> (Person -> a) -> Person -> Person -> Map a Person -> Map a Person
updateIndex f p p' = insert (f p') p' . delete (f p)
-- this function has some per-index stuff;
-- note that if you add a new index to People you get a compile error here
-- telling you to account for it
-- also note that we put the *same* person in every map; sharing should mean
-- that we're not duplicating the objects, so no wasted memory
replacePerson :: Person -> Person -> People -> People
replacePerson p p' ps = ps { byName = byName', byZip = byZip' }
where
byName' = updateIndex name p p' $ byName ps
byZip' = updateIndex zipCode p p' $ byZip ps
-- a "default" definition for updateBy in terms of findBy when the store happens
-- to be People and the result happens to be Person
updatePeopleBy
:: (FindBy key People, Result key People ~ Person)
=> key -> (Person -> Person) -> People -> People
updatePeopleBy k f ps =
let p = findBy k ps
in replacePerson p (f p) ps
-- this is basically the "declaration" of all the indexes that can be used
-- externally
instance FindBy Name People where
type Result Name People = Person
findBy n ps = byName ps ! n
updateBy = updatePeopleBy
instance FindBy Zip People where
type Result Zip People = Person
findBy z ps = byZip ps ! z
updateBy = updatePeopleBy
Jarret, I strongly suggest you investigate Zippers, both in the simple form documented on the Haskell wiki and the more advanced, generic version developed by Oleg Kiselyov. To quote Oleg,
Zipper is an updateable and yet pure functional cursor into a data structure. It lets us replace an item deep in a data structure, e.g., a tree or a term, without any mutation. The result will share as much of its components with the old structure as possible. The old data structure is still available, which is useful if we wish to 'undo' the operation later on.
The wiki page gives a simple example of how one node of tree can be updated without any need to rebuild the rest of the tree.
If you wrap your different views in zippers and use a shared key, you should see significant efficiency gains. If you wrap your different views in an appropriate monad (e.g. the State monad), you could update the location with one operation and see all the different views move to point to the "same" object.
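For readers who haven't met zippers yet, a list zipper is the smallest example of the pattern (the tree zipper on the wiki generalizes the same idea; the names here are mine):
-- A cursor into a list: elements before the focus (in reverse order), the focus, and the rest.
data Zipper a = Zipper [a] a [a] deriving Show

fromList :: [a] -> Maybe (Zipper a)
fromList []     = Nothing
fromList (x:xs) = Just (Zipper [] x xs)

-- Move the cursor one step to the right, if possible.
right :: Zipper a -> Maybe (Zipper a)
right (Zipper before x (y:ys)) = Just (Zipper (x:before) y ys)
right _                        = Nothing

-- Edit the focused element; everything else is shared with the old zipper.
modifyFocus :: (a -> a) -> Zipper a -> Zipper a
modifyFocus f (Zipper before x after) = Zipper before (f x) after

toList :: Zipper a -> [a]
toList (Zipper before x after) = reverse before ++ x : after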
Related
I'm getting started with Haskell (after many years of C and C++) and have decided to attempt a small database project. I'm using a predefined binding library to a C database library (Database.KyotoCabinet). I'm struggling to get my head around how to do anything with the iterator interfaces due to the separation of effects when using a predefined method.
The toy demo to iterate over the database and print it out (which works fine) is
test7 = do
    db <- openTree "testdatabase/mydb.kct" defaultLoggingOptions (Writer [] [])
    let visitor = \k v -> putStr (show k) >> putStr ":" >> putStrLn (show v) >>
                          return (Left NoOperation)
    iterate db visitor False
    close db
Where iterate and visitor are provided by the library bindings and the relevant types are
iterate :: forall db. WithDB db => db -> VisitorFull -> Writable -> IO ()
visitor :: ByteString -> ByteString -> IO (Either VisitorAction b)
But I can't see how to extract information from inside the iterator rather than processing each entry individually - for example, collecting all the keys beginning with 'a' in a list, or even just counting the number of entries.
Am I limited because iterate just has the type IO (), so I can't build in side effects and would have to rebuild this, replacing the library versions? The state monad on paper seems to address this, but the visitor type doesn't seem to allow me to maintain state over subsequent visitor calls.
What would be the Haskell way to solve this ?
Matthew
Edit - many thanks for the clear answer below, which both said that it's not the Haskell way and provided a solution - this answer led me to Mutable objects, where I found a clear explanation of the options.
The kyotocabinet library unfortunately does not seem to support your operation directly. Ideally, beyond iterate, it would expose some similar operation returning something more complex than IO (), say IO a or IO [a], while accepting a more general visitor function.
Still, since we work inside IO, there is a workaround: we can exploit IORefs and collect results. I want to stress, though, that this is not idiomatic code one would write in Haskell, but something one is forced to use because of the limitations of this library.
Anyway, the code would look something like this (untested):
test7 = do
    db <- openTree "testdatabase/mydb.kct" defaultLoggingOptions (Writer [] [])
    w <- newIORef []                      -- create mutable var, initialize to []
    let visitor = \k v -> do
            putStrLn (show k ++ ":" ++ show v)
            modifyIORef w ((k,v):)        -- prepend (k,v) to the list w
            return (Left NoOperation)
    iterate db visitor False
    result <- readIORef w                 -- get the whole list
    print result
    close db
Since you come from C++, you might want to compare the code above to the following pseudo-C++:
std::vector<std::pair<int,int>> w;
db.iterate([&](int k, int v) {
std::cout << k << ", " << v << "\n";
w.push_back({k,v});
});
// here we can read w, even if db.iterate returns void
Again, this is not something I would consider idiomatic Haskell.
I have this simple project where I tried to experiment with Haskell and Cairo library.
I ran into a problem with redrawing the state of the application. At the moment I have a record that holds global state. I had to store it in an IORef (from Data.IORef) in order to be able to pass the data from one GUI event to another.
I was inspired by Elm to try it like that, but I wonder if there is a better way to do it?
I'm particularly concerned about this fragment in main function.
-- update globalModel in place
readIORef globalModel >>= (\m -> writeIORef globalModel (Model
(fromIntegral kv)
(if ((keyToHeading (fromIntegral kv)) == None)
then (heading m)
else (keyToHeading (fromIntegral kv)))))
Please advise on a better approach.
This might be a simpler refactoring. Note that modifyIORef' combines readIORef and writeIORef into a single strict operation:
ifNoneThen :: Heading -> Heading -> Heading
None `ifNoneThen` x = x
h    `ifNoneThen` _ = h

updateModel :: Int -> Model -> Model
updateModel kv oldModel = Model newKv newHeading
    where newKv      = fromIntegral kv
          newHeading = keyToHeading newKv `ifNoneThen` heading oldModel

foo :: Int -> IORef Model -> IO ()
foo kv globalModel = modifyIORef' globalModel (updateModel kv)
A small utility function for Headings, ifNoneThen (which you might be able to use elsewhere as well), cleans up the if expression in the body of the code.
Some minor refactoring to remove duplicated calls (why call fromIntegral kv 3 times if you can call it once and give the result a meaningful name?).
Factoring the pure code (construction of the new model from an old model) out of the impure code (updating the IORef) makes it easier to reason about the code - there are fewer places for unexpected side effects to crop up.
Starting out with Haskell and Yesod, probably getting a bit too far with Yesod relative to Haskell :)
I build entities using Persistent via
share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
Game
    title String
    company String
    UniqueTitle title
    deriving Show
Tag
    label String
    description String Maybe
    UniqueLabel label
    deriving Show
GameTag
    gameId GameId
    tagId TagId
    UniqueGameTag gameId tagId
|]
-- Yesod related code ...
In main I have
main :: IO ()
main = do
    let taggings = fromFile :: [(Game, Tag)]  -- fromFile code not included
    runStderrLoggingT $ withSqlitePool ":inmemory:" 10 $ λpool → liftIO $ do
        runResourceT $ flip runSqlPool pool $ do
            runMigration migrateAll
            let (g, t) = head taggings
            gid ← insert g
            tid ← insert t
            insert (GameTag gid tid)
        warp 3000 $ App pool
Doing this I get the first relation into the database, and by selecting elements from the list I can add more 'by hand', but I can't figure out how to get all the relations into the database by somehow iterating over taggings. How do I define a function that I can map over taggings :: [(Game, Tag)] and that inserts the game tags of type GameTag constructed by Persistent?
The main trick here isn't in pulling out the function, that is easy:
f (g, t) = do
    gid <- insert g
    tid <- insert t
    insert (GameTag gid tid)
The trick is knowing how to use this.... Standard map won't work alone, because the function is defined in a monad (you can use it, it will just give you a list of actions back without running them).
map f taggings -- returns just a list, type [ResourceT IO a], doesn't run anything
Here are two ways to actually run the actions from within main.
sequence (map f taggings) --sequentially runs the actions in the list
or, the more readable
forM taggings f
or, in the slightly more verbose
forM taggings $ \tagging -> do
    f tagging
You also might want to look at mapM. You should also learn about forM_ and sequence_ to suppress the (often useless) return values; see the sketch below.
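For instance (a sketch, reusing f from above and meant to sit inside the flip runSqlPool pool $ do block from the question), inserting every tagging and discarding the generated keys looks like this:
-- needs: import Control.Monad (forM_)
mapM_ f taggings

-- or inline, without naming f:
forM_ taggings $ \(g, t) -> do
    gid <- insert g
    tid <- insert t
    insert (GameTag gid tid)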
I built a small game engine to manage a board of squares (currently used for playing Conway's game of life). All the data is accessed through lenses from fclabels and State. The engine couples user input and graphic rendering (the usual game loop).
The computations between frames can sometimes be slow and long to execute. So I would like to use concurrency to manage the squares, using STM's TVar.
My data is currently represented like this:
data World = World {
… -- window configuration, not important
, _squares :: TVar [Square]
}
mkLabels [''World] -- creates labels, similar to mkLenses
My functions run in the Game monad, which is defined as follows:
type Game a = StateT World IO a
Using the monadic versions of labels, I use getters and setters inside my monad.
I would like to know if there is a way to somehow write new labels that behave like these:
gets :: MonadState f m => Lens (->) f o -> m o
…
puts :: MonadState f m => Lens (->) f o -> o -> m ()
But that takes care of STM (gets would involve readTVar, puts would involve writeTVar, etc.).
If I understand you correctly, you want to define a lens tlens s.t.:
gets tlens
is the same as:
do tvar <- gets squares
sqs <- liftIO $ atomically $ readTVar tvar
return sqs
and where puts tlens sqs is the same as:
do tvar <- gets squares
liftIO $ atomically $ writeTVar tvar sqs
I think this can be answered by looking at the type of gets:
gets :: MonadState f m => Lens (->) f o -> m o
The lens parameter is pure and not monadic. To get at the contents of the TVar you'll need to run code in the IO-monad.
Moreover, the definition of gets in Data.Label.Monadic (link) is:
gets lens = State.gets (Total.get lens)
where State is Control.Monad.State and Total is Data.Label.Total.
But State.gets takes a pure function, so again you're not going to be able to create a lens which will work with gets.
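What you can do instead is keep the pure label pointing at the TVar and wrap the STM read/write yourself. A hedged sketch: getsT and putsT are made-up names, the Lens type and gets are the ones from Data.Label.Monadic quoted above, and it assumes your Game monad has MonadIO (which StateT World IO does):
import Control.Concurrent.STM (TVar, atomically, readTVar, writeTVar)
import Control.Monad.IO.Class (MonadIO, liftIO)
-- plus the fclabels and mtl imports you already have (Lens, gets, MonadState)

-- Read the contents of a TVar pointed at by a label
getsT :: (MonadState f m, MonadIO m) => Lens (->) f (TVar o) -> m o
getsT l = do
    tvar <- gets l
    liftIO . atomically $ readTVar tvar

-- Write the contents of a TVar pointed at by a label
putsT :: (MonadState f m, MonadIO m) => Lens (->) f (TVar o) -> o -> m ()
putsT l x = do
    tvar <- gets l
    liftIO . atomically $ writeTVar tvar x

-- usage inside the Game monad, where `step` is whatever pure update you have:
--   sqs <- getsT squares
--   putsT squares (step sqs)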
I want to compose operations that may fail, but where there is a way to roll back.
For example, an external call to book a hotel room and an external call to charge a credit card. Both of those calls may fail, e.g. no rooms left or an invalid credit card. Both have ways to roll back: cancel the hotel room, cancel the credit charge.
Is there a name for this type of (not truly) atomic composition? Whenever I search for Haskell transactions, I get STM.
Is there an abstraction, a way to compose them, or a library in Haskell or any other language?
I feel you could write a monad Atomic T which will track these operations and roll them back if there is an exception.
Edit:
These operations may be IO operations. If the operations were only memory operations, as the two answers suggest, STM would suffice.
For example, booking hotels would happen via HTTP requests, and database operations such as inserting records via socket communication.
In the real world, for irreversible operations there is a grace period before the operation is finalized - e.g. credit card payments and hotel bookings may be settled at the end of the day, so it is fine to cancel before then.
This is exactly the purpose of STM. Actions are composed so that they succeed or fail together, automatically.
Very similar to your hotel room problem is the bank transaction example in Simon Peyton-Jones's chapter in "Beautiful Code": http://research.microsoft.com/en-us/um/people/simonpj/papers/stm/beautiful.pdf
If you need to resort to making your own monad, it will look something like this:
import Control.Exception (onException, throwIO)
newtype Rollbackable a = Rollbackable (IO (IO (), a))
runRollbackable :: Rollbackable a -> IO a
runRollbackable (Rollbackable m) = fmap snd m
-- you might want this to catch exceptions and return IO (Either SomeException a) instead
instance Monad Rollbackable where
    return x = Rollbackable $ return (return (), x)
    Rollbackable m >>= f = Rollbackable $ do
        (rollback, x) <- m
        let Rollbackable m' = f x
        (rollback', y) <- m' `onException` rollback
        return (rollback' >> rollback, y)
(You will probably want Functor and Applicative instances also, but they're trivial.)
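For completeness, those trivial instances might look like this (a sketch that reuses the Monad instance's sequencing; with GHC 7.10 and later you need them anyway):
-- needs: import Control.Monad (ap, liftM)
instance Functor Rollbackable where
    fmap = liftM

instance Applicative Rollbackable where
    pure x = Rollbackable $ return (return (), x)
    (<*>)  = ap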
You would define your rollbackable primitive actions in this way:
rollbackableChargeCreditCard :: CardNumber -> CurrencyAmount -> Rollbackable CCTransactionRef
rollbackableChargeCreditCard ccno amount = Rollbackable $ do
    ref <- ioChargeCreditCard ccno amount
    return (ioUnchargeCreditCard ref, ref)
ioChargeCreditCard :: CardNumber -> CurrencyAmount -> IO CCTransactionRef
-- use throwIO on failure
ioUnchargeCreditCard :: CCTransactionRef -> IO ()
-- these both just do ordinary i/o
Then run them like so:
runRollbackable
    $ do price <- rollbackableReserveRoom roomRequirements when
         paymentRef <- rollbackableChargeCreditCard ccno price
         -- etc
If your computations could be done only with TVar-like things then STM is perfect.
If you need a side effect (like "charge Bob $100") and, if there is an error later, to issue a retraction (like "refund Bob $100"), then you need, drumroll please: Control.Exception.bracketOnError
bracketOnError
    :: IO a          -- ^ computation to run first ("acquire resource")
    -> (a -> IO b)   -- ^ computation to run last ("release resource")
    -> (a -> IO c)   -- ^ computation to run in-between
    -> IO c          -- returns the value from the in-between computation
Like Control.Exception.bracket, but only performs the final action if there was an
exception raised by the in-between computation.
Thus I could imagine using this like:
let safe'charge'Bob = bracketOnError (charge'Bob) (\a -> refund'Bob)
safe'charge'Bob $ \a -> do
    rest'of'transaction
    which'may'throw'error
Make sure you understand where to use the Control.Exception.mask operation if you are in a multi-threaded program and try things like this.
And I should emphasize that you can and should read the Source Code to Control.Exception and Control.Exception.Base to see how this is done in GHC.
You really can do this with the clever application of STM. The key is to separate out the IO parts. I assume the trouble is that a transaction might appear to succeed initially, and fail only later on. (If you can recognize failure right away, or soon after, things are simpler):
main = do
    r <- reserveHotel
    c <- chargeCreditCard
    room <- atomically $ newTVar r
    card <- atomically $ newTVar c
    transFailure <- atomically newEmptyTMVar
    rollback <- forkIO $ do
        a <- atomically $ takeTMVar transFailure  -- blocks until we put something here
        case a of
            Left  "No Room"       -> allFullRollback
            Right "Card declined" -> badCardRollback
    failure <- listenForFailure  -- A hypothetical IO action that blocks, waiting for
                                 -- a failure message or an "all clear"
    case failure of
        "No Room"       -> atomically $ putTMVar transFailure (Left "No Room")
        "Card Declined" -> atomically $ putTMVar transFailure (Right "Card declined")
        _               -> return ()
Now, there's nothing here that MVars couldn't handle: all we're doing is forking a thread to wait and see if we need to fix things. But you'll presumably be doing some other stuff with your card charges and hotel reservations...