Is there an in-place map function for mutable vectors? - haskell

If I have a mutable vector (with type IOVector a for example), is there a map-like function that can modify the elements in place?
The vector package provides the modify function but this is only one element at a time. Should I use this or is there a preferred method?
And to clarify, the type of the vector will be the same before and after.

Yep, use modify if you want to modify elements in place. If you find yourself often modifying everything in place, you can define mapModify as follows.
import Data.Foldable (for_)
import qualified Data.Vector.Mutable as MV
mapModify :: (a -> a) -> MV.IOVector a -> IO ()
mapModify f v = for_ [0 .. MV.length v - 1] (MV.modify v f)
That said, if you are constantly modifying all elements of a vector, you may be better off using immutable vectors and mapping over them with the regular fmap. If that code ends up fusing properly, the intermediate vectors will never even be materialized.
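For instance, here is a minimal sketch of that immutable style (the function name is mine); with fusion, GHC compiles the two maps into a single loop:
import qualified Data.Vector as V
-- Two maps over an immutable vector: stream fusion rewrites the
-- pipeline into one loop, so the intermediate vector is never built.
doubleThenInc :: V.Vector Int -> V.Vector Int
doubleThenInc = V.map (+ 1) . V.map (* 2)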

Related

How to define an unordered collection in Haskell

Wondering how you would define an unordered group/collection in Haskell, where by "collection" I mean it can have many copies of the same element, and the items are unordered. I know of the List data type in Haskell, but this is inherently ordered. I would like to see what the definition would look like for an unordered collection/group/list.
I would define it this way
import qualified Data.Map.Lazy as Map
type MultiSet' a = Map.Map a Int
Just a mapping from a type a to an Int, counting occurrences. In mathematics it would be something like f : S -> N. The elements you put into it must have an Ord instance, because the underlying structure of the Map is a binary search tree. This shouldn't be a problem, as you can forget about it when using the data structure. See the very extensive documentation of Data.Map for functions to deal with our MultiSet'.
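For example, insertion and lookup come out as one-liners (insertMS and occurrences are hypothetical names of mine, not library functions; this assumes the import above):
-- Add one occurrence of x to the multiset.
insertMS :: Ord a => a -> MultiSet' a -> MultiSet' a
insertMS x = Map.insertWith (+) x 1
-- How many copies of x the multiset contains (0 if absent).
occurrences :: Ord a => a -> MultiSet' a -> Int
occurrences = Map.findWithDefault 0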
Now, there is already a definition together with an implementation for this, and it is called MultiSet. You can browse its source code as well; there you will see it is defined in an almost identical way (using the strict version of the map).
Alternatively you can use a hashmap, it will look like this:
import qualified Data.HashMap.Lazy as Map
type MultiSet'' a = Map.HashMap a Int
The elements you put into it do not need an Ord instance, but they must be hashable (an instance of Hashable).
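The same hypothetical insert carries over; only the constraint changes from Ord to Hashable:
import Data.Hashable (Hashable)
insertMS'' :: (Eq a, Hashable a) => a -> MultiSet'' a -> MultiSet'' a
insertMS'' x = Map.insertWith (+) x 1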
If you just want a structure that has no reasonable order then why not compose a Map with a hash?
type MyBag a = Map (Int,a) Int
insert x mp = Data.Map.insertWith (+) (hash x, x) 1 mp
The above is a balanced binary tree with an order that depends on the hash of the value you have inserted. The map itself is boring, along the lines of data Map k a = Bin (Map k a) k a (Map k a) | Nil.
This said, I think you underspecified what you are looking for and what you hope to learn. Your searches have probably yielded hashtables and unordered-containers - why aren't those sufficiently informative?

Sharing observer function code between mutable and frozen versions of a type

I was working on creating my own custom mutable/frozen data type that internally contains an MVector/Vector. It needs to be mutable for performance reasons so switching to an immutable data structure is not something I am considering.
It seems like implementing an observer function for one of the two versions should allow me to just steal that implementation for the other type. Here are the two options I am considering:
render :: Show a => MCustom s a -> ST s String
render mc = ...non trivial implementation...
show :: Show a => Custom a -> String
show c = runST $ render =<< unsafeThaw c
Where unsafeThaw calls Vector.unsafeThaw under the covers, which should be safe as the thawed vector is never mutated, only read. This approach feels the cleanest; the only downside is that render is strict, which forces show to be strict, whereas a duplicate implementation could stream the String lazily without forcing it all at once.
The other option, which feels much dirtier but which I think is safe, is to do this:
show :: Show a => Custom a -> String
show c = ...non trivial implementation that allows lazy streaming...
render :: Show a => MCustom s a -> ST s String
render mc = do
  s <- show <$> unsafeFreeze mc
  s `deepseq` pure s
Are either of these my best option? If not what else should I do?
To me it seemed most intuitive to build one version off of the other. But it seems like if I make the mutable version the base version, then I will end up with a lot more strictness than I want, even if the implementations seem fairly clean and logical, just because ST necessitates strictness unless I throw in some unsafeInterleaveST calls; those would only be safe when the mutable observer was called via an immutable object.
On the other hand, if I make the immutable version the base version, then I will end up with more dirty deepseq code, and sometimes I would just have to reimplement things. For example, all in-place editing functions can be done on a frozen object pretty easily by copying the frozen object, calling unsafeThaw on the copy, modifying it in place, and then calling unsafeFreeze and returning it. But doing the opposite isn't really doable, as a copy-based modification used for the immutable version cannot be converted into an in-place modification.
Should I perhaps write all modification functions alongside the mutable implementation and all observer functions alongside the immutable implementation, and then have a module that depends on both and unifies everything via unsafeThaw and unsafeFreeze?
How about having a pure function
show :: (StringLike s, Show a) => Custom a -> s
You can get both lazy and strict output with different instantiations of s, in which cons is either lazy or strict; e.g. String or Text:
class StringLike s where
  cons :: Char -> s -> s
  nil :: s
  uncons :: s -> Maybe (Char, s)
instance StringLike String where ...
instance StringLike Text where ...
You could use other methods, e.g. phantom types, or simply having two functions (showString and showText), to distinguish between lazy and strict output if you like. But if you look at types as specifications of a function's semantics, then the place to indicate laziness or strictness is in the return type of that operation. This removes the need for some sort of strict show for Custom inside of ST.
For the MCustom version, you probably do not export the String version, e.g.:
render :: MCustom s a -> ST s Text
render a = show <$> unsafeFreeze a
You can throw in a seq to force the result when the function runs, but with strict Text the entire string is forced as soon as any character is used anyway.
But the simplest solution seems to be to abstract the pattern of using a mutable structure in an immutable fashion, e.g.:
atomically :: (NFData a) => (Custom x -> a) -> MCustom s x -> ST s a
atomically f v = do
  r <- f <$> unsafeFreeze v
  r `deepseq` pure r
This saves you from using unsafeFreeze/deepseq all over your code, just as you have modify to do immutable operations on mutable vectors.
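For instance, the strict render then falls out as a one-liner (render' is my name; this assumes the pure show above, instantiated at String, which has an NFData instance):
-- Sketch: any pure observer can now run against the mutable type.
render' :: Show x => MCustom s x -> ST s String
render' = atomically show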

Writing fusible O(1) update for vector

This is a continuation of this question. Since the vector library doesn't seem to have a fusible O(1) update function, I am wondering if it is possible to write one that doesn't involve unsafeFreeze and unsafeThaw. It would use the vector stream representation, I guess - I am not familiar with how to write one using stream and unstream - hence this question. The reason is that this would give us the ability to write a cache-friendly update function on a vector where only a narrow region of the vector is being modified, and so we don't want to walk through the entire vector just to process that narrow region (and this operation can happen billions of times in each function call - so, the motivation is to keep the overhead really low). Transformation functions like map process the entire vector - so they will be too slow.
I have a toy example of what I want to do, but the upd function below uses unsafeThaw and unsafeFreeze - it doesn't seem to be optimized away in the core, and it also breaks the promise of not using the buffer further:
module Main where

import Data.Vector.Unboxed as U
import Data.Vector.Unboxed.Mutable as MU
import Control.Monad.ST

upd :: Vector Int -> Int -> Int -> Vector Int
upd v i x = runST $ do
  v' <- U.unsafeThaw v
  MU.write v' i x
  U.unsafeFreeze v'

sum :: Vector Int -> Int
sum = U.sum . (\x -> upd x 0 73) . (\x -> upd x 1 61)

main = print $ Main.sum $ U.fromList [1..3]
I know how to implement imperative algorithms using STVector. In case you are wondering why this alternative approach: I want to check how GHC's transformation of a particular algorithm differs when it is written using fusible pure vector streams (with monadic operations under the hood, of course).
When the algorithm is written using STVector, it doesn't seem to be as nicely iterative as I would like (I guess it is harder for the GHC optimizer to spot loops when there is a lot of mutability strewn around). So, I am investigating this alternative approach to see if I can get a nicer loop.
The upd function you have written does not look correct, let alone fusible. Fusion is a library-level optimization and requires you to write your code in terms of certain primitives. In this case what you want is not just fusion but recycling, which can easily be achieved via the bulk update operations such as // and update. These operations will fuse, and even happen in place much of the time.
If you really want to write your own destructive-update-based code, DO NOT use unsafeThaw; use modify.
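As a sketch, the question's toy upd could be written either way (function names are mine; imports as in the question's example, but qualified):
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as MU
-- Bulk update: participates in fusion and recycling, no unsafe calls.
updBulk :: U.Vector Int -> Int -> Int -> U.Vector Int
updBulk v i x = v U.// [(i, x)]
-- Safe destructive update: U.modify runs the ST action on a copy
-- (or, with recycling, in place when the original is no longer used).
updModify :: U.Vector Int -> Int -> Int -> U.Vector Int
updModify v i x = U.modify (\mv -> MU.write mv i x) v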
Any function is a fusible update function; you seem to be trying to escape from the programming model the vector package is trying to get you to use:
module Main where

import Data.Vector.Unboxed as U

change :: Int -> Int -> Int
change 0 n = 73
change 1 n = 61
change m n = n

myfun2 = U.sum . U.imap change . U.enumFromStepN 1 1

main = print $ myfun2 30000000
-- this doesn't create any vectors much less 'update' them, as you will see if you study the core.

Set operators not provided by Data.Vector of Haskell, what is the reason

My application involves heavy array operations (e.g. O(1) indexing), so Data.Vector and Data.Vector.Unboxed are preferred to Data.List.
It also involves many set operations (e.g. intersectBy), which, however, are not provided by Data.Vector.
Each of these functions can be implemented in 3-4 lines, as in Data.List.
Is there any reason they are not implemented in Data.Vector? I can only speculate. Maybe set operations on Data.Vector are discouraged for performance reasons, i.e. intersectBy would first produce the intersection through a list comprehension and then convert the list into a Data.Vector?
I assume it's missing because intersection of unsorted, immutable arrays must have a worst-case run time of Ω(n*m) without using additional space, and Data.Vector is optimized for performance. If you want, you can write that function yourself, though:
import qualified Data.Vector as V
intersect :: Eq a => V.Vector a -> V.Vector a -> V.Vector a
intersect x = V.filter (`V.elem` x)
Or by using a temporary set data structure to achieve an expected O(n + m) complexity:
import qualified Data.Vector as V
import qualified Data.HashSet as HS
import Data.Hashable (Hashable)

intersect :: (Hashable a, Eq a) => V.Vector a -> V.Vector a -> V.Vector a
intersect x = V.filter (`HS.member` set)
  where set = HS.fromList $ V.toList x
If you can afford the extra memory usage, maybe you can use some kind of aggregate type for your data, for example an array for fast random access and a hash trie like Data.HashSet for fast membership checks, and always keep both containers up to date. That way you can reduce the asymptotic complexity of intersection to something like O(min(n, m)).
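A sketch of such an aggregate (all the names here are mine, not from a library):
import qualified Data.Vector as V
import qualified Data.HashSet as HS
import Data.Hashable (Hashable)
-- A vector for O(1) indexing plus a set for O(1) expected membership,
-- kept in sync on insertion.
data Indexed a = Indexed
  { ixVector :: V.Vector a
  , ixSet    :: HS.HashSet a
  }
insert :: (Eq a, Hashable a) => a -> Indexed a -> Indexed a
insert x (Indexed v s) = Indexed (V.snoc v x) (HS.insert x s)
-- Filter the smaller vector against the larger side's set, giving
-- the expected O(min(n, m)) mentioned above.
intersect :: (Eq a, Hashable a) => Indexed a -> Indexed a -> V.Vector a
intersect a b
  | V.length (ixVector a) <= V.length (ixVector b) =
      V.filter (`HS.member` ixSet b) (ixVector a)
  | otherwise =
      V.filter (`HS.member` ixSet a) (ixVector b)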

Short-lived memoization in Haskell?

In an object-oriented language when I need to cache/memoize the results of a function for a known life-time I'll generally follow this pattern:
Create a new class
Add to the class a data member and a method for each function result I want to cache
Implement the method to first check to see if the result has been stored in the data member. If so, return that value; else call the function (with the appropriate arguments) and store the returned result in the data member.
Objects of this class will be initialized with values that are needed for the various function calls.
This object-based approach is very similar to the function-based memoization pattern described here: http://www.bardiak.com/2012/01/javascript-memoization-pattern.html
The main benefit of this approach is that the results are kept around only for the life time of the cache object. A common use case is in the processing of a list of work items. For each work item one creates the cache object for that item, processes the work item with that cache object then discards the work item and cache object before proceeding to the next work item.
What are good ways to implement short-lived memoization in Haskell? And does the answer depend on if the functions to be cached are pure or involve IO?
Just to reiterate - it would be nice to see solutions for functions which involve IO.
Let's use Luke Palmer's memoization library: Data.MemoCombinators
import qualified Data.MemoCombinators as Memo
import Data.Function (fix) -- we'll need this too
I'm going to define things slightly differently from how his library does, but it's basically the same (and, furthermore, compatible). A "memoizable" thing takes itself as input, and produces the "real" thing.
type Memoizable a = a -> a
A "memoizer" takes a function and produces the memoized version of it.
type Memoizer a b = (a -> b) -> a -> b
Let's write a little function to put these two things together. Given a Memoizable function and a Memoizer, we want the resultant memoized function.
runMemo :: Memoizer a b -> Memoizable (a -> b) -> a -> b
runMemo memo f = fix (f . memo)
This is a little magic using the fixpoint combinator (fix): fix (f . memo) unfolds to f (memo (fix (f . memo))), so every recursive call goes back through the memoizer. Never mind the details; you can google fix if you are interested.
So let's write a Memoizable version of the classic fib example:
fib :: Memoizable (Integer -> Integer)
fib self = go
  where
    go 0 = 1
    go 1 = 1
    go n = self (n - 1) + self (n - 2)
Using a self convention makes the code straightforward. Remember, self is what we expect to be the memoized version of this very function, so recursive calls should be on self. Now fire up ghci.
ghci> let fib' = runMemo Memo.integral fib
ghci> fib' 10000
WALL OF NUMBERS CRANKED OUT RIDICULOUSLY FAST
Now, the cool thing about runMemo is you can create more than one freshly memoized version of the same function, and they will not share memory banks. That means that I can write a function that locally creates and uses fib', but then as soon as fib' falls out of scope (or earlier, depending on the intelligence of the compiler), it can be garbage collected. It doesn't have to be memoized at the top level. This may or may not play nicely with memoization techniques that rely on unsafePerformIO. Data.MemoCombinators uses a pure, lazy trie, which fits perfectly with runMemo.

Rather than creating an object which essentially becomes a memoization manager, you can simply create memoized functions on demand. The catch is that if your function is recursive, it must be written as Memoizable. The good news is you can plug in any Memoizer that you wish. You could even use:
noMemo :: Memoizer a b
noMemo f = f
ghci> let fib' = runMemo noMemo fib
ghci> fib' 30 -- wait a while; it's computing stupidly
1346269
Lazy Haskell programming is, in a way, the memoization paradigm taken to an extreme. Also, whatever you can do in an imperative language is possible in Haskell, using either the IO monad, the ST monad, monad transformers, arrows, or what have you.
The only problem is that these abstraction devices are much more complicated than the imperative equivalents you mentioned, and they require a pretty deep mind-rewiring.
I believe the above answers are both more complex than necessary, although they might be more portable than what I'm about to describe.
As I understand it, there is a rule in GHC that each value is computed at most once after its enclosing lambda expression is entered. You may thus create exactly your short-lived memoization object as follows.
import qualified Data.Vector as V

indexerVector :: (t -> Int) -> V.Vector t -> Int -> [t]
indexerVector idx vec = \e -> tbl V.! e
  where
    m   = maximum $ map idx $ V.toList vec
    tbl = V.accumulate (flip (:)) (V.replicate (m + 1) [])
                       (V.map (\v -> (idx v, v)) vec)
What does this do? It groups all the elements of the Data.Vector t passed as its second argument vec according to the integer computed by its first argument idx, retaining the grouping as a Data.Vector [t]. It returns a function of type Int -> [t] which looks up this grouping by the pre-computed index value.
Our compiler GHC has promised that tbl shall be computed at most once per invocation of indexerVector. We may therefore assign the lambda expression \e -> tbl V.! e returned by indexerVector to another value, which we may use repeatedly without fear that tbl ever gets recomputed. You may verify this by inserting a trace on tbl.
In short, your caching object is exactly this lambda expression.
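For instance (byLastDigit is a made-up example):
-- The table is built once; byLastDigit closes over the shared tbl,
-- every call reuses it, and the whole cache becomes collectable once
-- byLastDigit goes out of scope.
byLastDigit :: Int -> [Int]
byLastDigit = indexerVector (`mod` 10) (V.fromList [1 .. 100])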
I've found that almost anything you can accomplish with a short-term object can be accomplished better by returning a lambda expression like this.
You can use the very same pattern in Haskell too. Lazy evaluation will take care of checking whether a value has been evaluated already. It has been mentioned multiple times already, but a code example could be useful. In the example below, memoedValue is calculated only once, when it is demanded.
data Memoed = Memoed
  { value       :: Int
  , memoedValue :: Int
  }

memo :: Int -> Memoed
memo i = Memoed
  { value       = i
  , memoedValue = expensiveComputation i
  }
Even better, you can memoize values which depend on other memoized values. You should avoid dependency loops, though, as they can lead to nontermination:
data Memoed = Memoed
  { value        :: Int
  , memoedValue1 :: Int
  , memoedValue2 :: Int
  }

memo :: Int -> Memoed
memo i = r
  where
    r = Memoed
      { value        = i
      , memoedValue1 = expensiveComputation i
      , memoedValue2 = anotherComputation (memoedValue1 r)
      }
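A quick usage sketch (expensiveComputation and anotherComputation are the placeholders from the code above):
-- Each field of a Memoed record is forced at most once; repeated
-- reads reuse the already-evaluated thunk.
demo :: IO ()
demo = do
  let m = memo 10
  print (memoedValue1 m)  -- runs expensiveComputation 10
  print (memoedValue2 m)  -- forces anotherComputation, reusing memoedValue1 m
  print (memoedValue2 m)  -- already evaluated; printed without recomputation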
