Sum over Haskell Map - haskell

Is there a standard function to sum all values in a Haskell map? My Map reads something like [(a,2),(b,4),(c,6)].
Essentially what I am trying to compute is a normalized frequency distribution. So the values of the keys in the above map are counts for a,b,c. I need to normalize them as [(a,1/6),(b,1/3),(c,1/2)]

You can simply do Map.foldl' (+) 0 (or M.foldl', if you imported Data.Map as M).
This is just like foldl' (+) 0 . Map.elems, but slightly more efficient. (Don't forget the apostrophe — using foldl or foldr to do sums with the standard numeric types (Int, Integer, Float, Double, etc.) will build up huge thunks, which will use up lots of memory and possibly cause your program to overflow the stack.)
However, only sufficiently recent versions of containers (>= 0.4.2.0) contain Data.Map.foldl', and you shouldn't upgrade it with cabal install, since it comes with GHC. So unless you're on GHC 7.2 or above, foldl' (+) 0 . Map.elems is the best way to accomplish this.
You could also use Data.Foldable.sum, which works on any instance of the Foldable typeclass, but will still build up large thunks on the common numeric types.
Here's a complete example:
normalize :: (Fractional a) => Map k a -> Map k a
normalize m = Map.map (/ total) m
  where total = foldl' (+) 0 $ Map.elems m
You'll need to import Data.List to use foldl'.
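For reference, the same idea as a self-contained module, applied to the counts from the question. This is only a sketch: the name counts and the choice of Char keys and Double values are assumptions made for illustration.

```haskell
import qualified Data.Map as Map
import Data.List (foldl')

-- Divide every count by the strict sum of all counts.
normalize :: Fractional a => Map.Map k a -> Map.Map k a
normalize m = Map.map (/ total) m
  where
    total = foldl' (+) 0 (Map.elems m)

-- Hypothetical sample data mirroring [(a,2),(b,4),(c,6)] from the question.
counts :: Map.Map Char Double
counts = Map.fromList [('a', 2), ('b', 4), ('c', 6)]
```

Evaluating Map.toList (normalize counts) in GHCi should give the fractions 1/6, 1/3 and 1/2 as Doubles.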

let
  total = foldr (\(_, n) r -> r + n) 0 l
in map (\(x, y) -> (x, y/total)) l
Where l is your map as a list of pairs (for example, the result of Map.toList).

Simple:
import qualified Data.Map as M
sumMap = M.foldl' (+) 0
normalizeMap m =
  let s = sumMap m in
  M.map (/ s) m
main = do
  let m = M.fromList [("foo", 1), ("bar", 2), ("baz", 6)]
  (print . sumMap) m
  (print . normalizeMap) m
prints:
9.0
fromList [("bar",0.2222222222222222),("baz",0.6666666666666666),("foo",0.1111111111111111)]

Related

Haskell nested function order

I'm trying to write a function in Haskell to generate multidimensional lists.
(Technically I'm using Curry, but my understanding is that it's mostly a superset of Haskell, and the thing I'm trying to do is common to Haskell as well.)
After a fair bit of head scratching, I realized my initial desired function (m_array generating_function list_of_dimensions, giving a list nested to a depth equal to length list_of_dimensions) was probably at odds with the type system itself, since (AFAICT) the nesting depth of a list is part of its type, and my function wanted to return values whose nesting depths differed based on the value of a parameter, meaning it wanted to return values whose types varied based on the value of a parameter, which (AFAICT) isn't supported in Haskell. (If I'm wrong, and this CAN be done, please tell me.) At this point I moved on to the next paragraph, but if there's a workaround I've missed that takes very similar parameters and still outputs a nested list, let me know. Like, maybe if you can encode the indices as some data type that implicitly includes the nesting level in its type, and is instantiated with e.g. dimensions 5 2 6 ..., maybe that'd work? Not sure.
In any case, I thought that perhaps I could encode the nesting-depth by nesting the function itself, while still keeping the parameters manageable. This did work, and I ended up with the following:
ma f (l:ls) idx = [f ls (idx++[i]) | i <- [0..(l-1)]]
However, so far it's still a little clunky to use: you need to nest the calls, like
ma (ma (ma (\_ i -> 0))) [2,2,2] []
(which, btw, gives [[[0,0],[0,0]],[[0,0],[0,0]]]. If you use (\_ i -> i), it fills the array with the indices of the corresponding element, which is a result I'd like to keep available, but could be a confusing example.)
I'd prefer to minimize the boilerplate necessary. If I can't just call
ma (\_ i -> i) [2,2,2]
I'd LIKE to be able to call, at worst,
ma ma ma (\_ i -> i) [2,2,2] []
But if I try that, I get errors. Presumably the list of parameters is being divvied up in a way that doesn't make sense for the function. I've spent about half an hour googling and experimenting, trying to figure out Haskell's mechanism for parsing strings of functions like that, but I haven't found a clear explanation, and understanding eludes me. So, the formal questions:
How does Haskell parse e.g. f1 f2 f3 x y z? How are the arguments assigned? Is it dependent on the signatures of the functions, or does it e.g. just try to call f1 with 5 arguments?
Is there a way of restructuring ma to permit calling it without parentheses? (Adding at most two helper functions would be permissible, e.g. maStart ma ma maStop (\_ i -> i) [1,2,3,4] [], if necessary.)
The function you want in your head-scratching paragraph is possible directly -- though a bit noisily. With GADTs and DataKinds, values can be parameterized by numbers. You won't be able to use lists directly, because they don't mention their length in their type, but a straightforward variant that does works great. Here's how it looks.
{-# Language DataKinds #-}
{-# Language GADTs #-}
{-# Language ScopedTypeVariables #-}
{-# Language StandaloneDeriving #-}
{-# Language TypeOperators #-}
import GHC.TypeLits
infixr 5 :+
data Vec n a where
    O :: Vec 0 a -- O is supposed to look a bit like a mix of 0 and []
    (:+) :: a -> Vec n a -> Vec (n+1) a
data FullTree n a where
    Leaf :: a -> FullTree 0 a
    Branch :: [FullTree n a] -> FullTree (n+1) a
deriving instance Show a => Show (Vec n a)
deriving instance Show a => Show (FullTree n a)
ma :: forall n a. ([Int] -> a) -> Vec n Int -> FullTree n a
ma f = go [] where
    go :: [Int] -> Vec n' Int -> FullTree n' a
    go is O = Leaf (f is)
    go is (l :+ ls) = Branch [go (i:is) ls | i <- [0..l-1]]
Try it out in ghci:
> ma (\_ -> 0) (2 :+ 2 :+ 2 :+ O)
Branch [Branch [Branch [Leaf 0,Leaf 0],Branch [Leaf 0,Leaf 0]],Branch [Branch [Leaf 0,Leaf 0],Branch [Leaf 0,Leaf 0]]]
> ma (\i -> i) (2 :+ 2 :+ 2 :+ O)
Branch [Branch [Branch [Leaf [0,0,0],Leaf [1,0,0]],Branch [Leaf [0,1,0],Leaf [1,1,0]]],Branch [Branch [Leaf [0,0,1],Leaf [1,0,1]],Branch [Leaf [0,1,1],Leaf [1,1,1]]]]
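On the parsing question: function application in Haskell is left-associative and does not depend on the signatures, so f1 f2 f3 x y z is read as ((((f1 f2) f3) x) y) z. That means ma ma ma g dims [] passes the second ma as the first argument of the first one, rather than nesting the calls. A small sketch with a hypothetical helper twice, chosen only to make the grouping visible:

```haskell
-- Hypothetical helper, used only to show how application associates.
twice :: (a -> a) -> a -> a
twice f = f . f

-- Written without parentheses ...
leftAssoc :: Int
leftAssoc = twice twice (+ 1) 0

-- ... this is the grouping the compiler actually uses.
explicit :: Int
explicit = ((twice twice) (+ 1)) 0

-- A different grouping has to be requested with parentheses (or $).
nested :: Int
nested = twice (twice (+ 1)) 0
```

Here leftAssoc and explicit are the same expression; nested happens to produce the same number for this particular helper, but it is a different parse.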
A low-tech solution:
In Haskell, you can model multi-level lists by using the so-called free monad.
The base definition is:
data Free ft a = Pure a | Free (ft (Free ft a))
where ft can be any functor, but here we are interested in ft being [], that is the list functor.
So we define our multidimensional list like this:
import Control.Monad
import Control.Monad.Free
type Mll = Free [] -- Multi-Level List
The Mll type constructor happens to be an instance of the Functor, Foldable and Traversable classes, which can come in handy.
To make an array of arbitrary dimension, we start with:
the list of dimensions, for example [5,2,6]
the filler function, which returns a value for a given set of indices
We can start by making a “grid” object, whose item at indices say [x,y,z] is precisely the [x,y,z] list. As we have a functor instance, we can complete the process by just applying fmap filler to our grid object.
This gives the following code:
makeNdArray :: ([Int] -> a) -> [Int] -> Mll a
makeNdArray filler dims =
    let
        addPrefix x (Pure xs) = Pure (x:xs)
        addPrefix x (Free xss) = Free $ map (fmap (x:)) xss
        makeGrid [] = Pure []
        makeGrid (d:ds) = let base = 0
                              fn k = addPrefix k (makeGrid ds)
                          in Free $ map fn [base .. (d-1+base)]
        grid = makeGrid dims
    in
        fmap filler grid -- because we are an instance of the Functor class
To visualize the resulting structure, it is handy to be able to remove the constructor names:
displayMll :: Show a => Mll a -> String
displayMll = filter (\ch -> not (elem ch "Pure Free")) . show
The resulting structure can easily be flattened if need be:
toListFromMll :: Mll a -> [a]
toListFromMll xs = foldr (:) [] xs
For numeric base types, we can get a multidimensional sum function “for free”, so to speak:
mllSum :: Num a => (Mll a) -> a
mllSum = sum -- because we are an instance of the Foldable class
-- or manually: foldr (+) 0
Some practice:
We use [5,2,6] as the dimension set. To visualize the structure, we associate a decimal digit to every index. We can pretend to have 1-base indexing by adding 111, because that way all the resulting numbers are 3 digits long, which makes the result easier to check. Extra newlines added manually.
$ ghci
GHCi, version 8.8.4: https://www.haskell.org/ghc/ :? for help
λ>
λ> dims = [5,2,6]
λ> filler = \[x,y,z] -> (100*x + 10*y + z + 111)
λ>
λ> mxs = makeNdArray filler dims
λ>
λ> displayMll mxs
"[[[111,112,113,114,115,116],[121,122,123,124,125,126]],
[[211,212,213,214,215,216],[221,222,223,224,225,226]],
[[311,312,313,314,315,316],[321,322,323,324,325,326]],
[[411,412,413,414,415,416],[421,422,423,424,425,426]],
[[511,512,513,514,515,516],[521,522,523,524,525,526]]]"
λ>
As mentioned above, we can flatten the structure:
λ>
λ> xs = toListFromMll mxs
λ> xs
[111,112,113,114,115,116,121,122,123,124,125,126,211,212,213,214,215,216,221,222,223,224,225,226,311,312,313,314,315,316,321,322,323,324,325,326,411,412,413,414,415,416,421,422,423,424,425,426,511,512,513,514,515,516,521,522,523,524,525,526]
λ>
or take its overall sum:
λ>
λ> sum mxs
19110
λ>
λ> sum xs
19110
λ>
λ>
λ> length mxs
60
λ>
λ> length xs
60
λ>
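For readers who do not have the free package installed, the same shape can be sketched with a hand-rolled tree type and derived instances. This is a stand-in for the code above and not the free package's API; the names Nested, makeNested and digits are made up for illustration.

```haskell
{-# LANGUAGE DeriveFunctor, DeriveFoldable #-}

-- A hand-rolled multi-level list: either a value or a list of sub-levels.
data Nested a = Leaf a | Node [Nested a]
  deriving (Show, Functor, Foldable)

-- Build one level per dimension, handing each leaf its full index list.
makeNested :: ([Int] -> a) -> [Int] -> Nested a
makeNested filler = go []
  where
    go idx []       = Leaf (filler (reverse idx))
    go idx (d : ds) = Node [go (i : idx) ds | i <- [0 .. d - 1]]

-- The filler used in the session above, made total for other index lengths.
digits :: [Int] -> Int
digits [x, y, z] = 100 * x + 10 * y + z + 111
digits _         = 0
```

With the derived Foldable instance, sum (makeNested digits [5,2,6]) and length (makeNested digits [5,2,6]) should agree with the 19110 and 60 shown in the session.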

Haskell `randoms` function not behaving well with my library

I'm trying to write a Haskell library for cryptographically secure random numbers. The code follows:
module URandom (URandom, initialize) where
import qualified Data.ByteString.Lazy as B
import System.Random
import Data.Word
newtype URandom = URandom [Word8]
instance RandomGen URandom where
    next (URandom (x : xs)) = (fromIntegral x, URandom xs)
    split (URandom l) = (URandom (evens l), URandom (odds l))
        where evens (x : _ : xs) = x : evens xs
              odds (_ : x : xs) = x : odds xs
    genRange _ = (fromIntegral (minBound :: Word8), fromIntegral (maxBound :: Word8))
initialize :: IO URandom
initialize = URandom . B.unpack <$> B.readFile "/dev/urandom"
Unfortunately, it's not behaving like I want. In particular, performing
take 10 . randoms <$> initialize
yields (something similar to)
[-4611651379516519433,-4611644973572935887,-31514321567846,9223361179177989878,-4611732094835278236,9223327886739677537,4611709625714976418,37194416358963,4611669560113361421,-4611645373004878170,-9223329383535098640,4611675323959360258,-27021785867556,9223330964083681227,4611705212636167666]
which to my, albeit untrained, eye, does not appear very random. A lot of 46... and 92... in there.
What could be going wrong? Why doesn't this produce well-distributed numbers? It's worth noting that even if I concatenate Word8s together to form Ints, the distribution does not improve; I didn't think it was worth including that code here.
Edit: here's some evidence that it's not distributed correctly. I've written a function called histogram:
histogram :: ∀ t . (Integral t, Bounded t)
          => [t] -> Int -> S.Seq Int
histogram [] buckets = S.replicate buckets 0
histogram (x : xs) buckets = S.adjust (+ 1) (whichBucket x) (histogram xs buckets)
    where whichBucket x = fromIntegral $ ((fromIntegral x * fromIntegral buckets) :: Integer) `div` fromIntegral (maxBound :: t)
and when I run
g <- initialize
histogram (take 1000000 $ randoms g :: [Word64]) 16
I get back
fromList [128510,0,0,121294,129020,0,0,122090,127873,0,0,120919,128637,0,0,121657]
Some of the buckets are completely empty!
The issue is a bug in random-1.0.1.1 that was fixed in random-1.1. The changelog points to this ticket. In particular, referring to the older version:
It also assumes that all RandomGen implementations produce the same range of random values as StdGen.
Here randomness is produced 8 bits at a time, and that caused the observed behavior.
random-1.1 fixed this:
This implementation also works with any RandomGen, even ones that produce as little as a single bit of entropy per next call or have a minimum bound other than zero.
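If upgrading is not an option, one possible workaround (not taken from the answer, so treat it as an assumption) is to bypass randoms and build wider values directly from the byte stream. A minimal sketch of that packing step, using only base; the names packWord64 and word64s are hypothetical:

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Word (Word8, Word64)

-- Pack up to eight bytes, most significant first, into one Word64.
packWord64 :: [Word8] -> Word64
packWord64 = foldl (\acc b -> (acc `shiftL` 8) .|. fromIntegral b) 0 . take 8

-- Split a byte stream into successive Word64 values, dropping a short tail.
word64s :: [Word8] -> [Word64]
word64s bs = case splitAt 8 bs of
  (chunk, rest)
    | length chunk == 8 -> packWord64 chunk : word64s rest
    | otherwise         -> []
```

Because word64s is lazy, it can be applied to the unpacked contents of /dev/urandom and consumed with take, just like the list inside URandom.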

Chinese Remainder Theorem Haskell

I need to write a function or functions in Haskell that can solve the Chinese Remainder Theorem. It needs to have the following signature:
crt :: [(Integer, Integer)] -> (Integer, Integer)
so that the answer looks like
>crt [(2,7), (0,3), (1,5)]
(51, 105)
I think I have the overall idea, I just don't have the knowledge to write it. I know that the crt function must be recursive. I have created a helper function to split the list of tuples into a tuple of two lists:
crtSplit xs = (map fst xs, product(map snd xs))
Which, in this example, gives me:
([2,0,1],105)
I think what I need to do now is create a list for each of the elements in the first list. How would I begin to do this?
The Chinese remainder theorem has an algebraic solution, based on the fact that the two congruences x ≡ r1 (mod m1) and x ≡ r2 (mod m2) can be reduced to a single congruence modulo m1*m2 if m1 and m2 are coprime.
To do so, you need to know what a modular inverse is and how it can be calculated efficiently using the extended Euclidean algorithm.
If you put these pieces together, you can solve Chinese remainder theorem with a right fold:
crt :: (Integral a, Foldable t) => t (a, a) -> (a, a)
crt = foldr go (0, 1)
    where
    go (r1, m1) (r2, m2) = (r `mod` m, m)
        where
        r = r2 + m2 * (r1 - r2) * (m2 `inv` m1)
        m = m2 * m1
    -- Modular Inverse
    a `inv` m = let (_, i, _) = gcd a m in i `mod` m
    -- Extended Euclidean Algorithm
    gcd 0 b = (b, 0, 1)
    gcd a b = (g, t - (b `div` a) * s, s)
        where (g, s, t) = gcd (b `mod` a) a
then:
> crt [(2,7), (0,3), (1,5)]
(51,105)
> crt [(2,3), (3,4), (1,5)] -- wiki example
(11,60)
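The combining step can be read as follows: any x with x ≡ r2 (mod m2) has the form r2 + m2*k, and requiring r2 + m2*k ≡ r1 (mod m1) gives k ≡ (r1 - r2) * inv(m2) (mod m1), which is the expression used for r. A result can also be sanity-checked independently of how it was computed. A minimal sketch, where satisfiesAll is a hypothetical helper name:

```haskell
-- True when x leaves remainder r modulo m for every (r, m) pair.
satisfiesAll :: Integer -> [(Integer, Integer)] -> Bool
satisfiesAll x = all (\(r, m) -> x `mod` m == r `mod` m)

-- The two examples from above, as checks.
exampleOk :: Bool
exampleOk = satisfiesAll 51 [(2, 7), (0, 3), (1, 5)]

wikiOk :: Bool
wikiOk = satisfiesAll 11 [(2, 3), (3, 4), (1, 5)]
```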
Without going into algebra, you can also solve this with brute force. Perhaps that's what you've been asked to do.
For your example, create a list for each modulus independently of the other two. Assuming the moduli are coprime as a precondition, the upper bound is their least common multiple, which is then just their product, i.e. 105. These three lists should have exactly one common element, which satisfies all the constraints.
m3 = [3,6,9,12,15,...,105]
m5 = [1,6,11,16,21,...,101]
m7 = [2,9,16,23,30,...,100]
You can use Data.Set to find the intersection of these lists. Now extend this logic to any number of terms using recursion or a fold.
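The intersection idea can be sketched as follows, assuming the moduli are pairwise coprime. The names candidates and crtBrute are hypothetical, and the search range [1..product] is one possible choice, which means a solution of 0 is reported as the product itself.

```haskell
import qualified Data.Set as Set

-- All x in [1..bound] that leave remainder r modulo m.
candidates :: Integer -> (Integer, Integer) -> Set.Set Integer
candidates bound (r, m) = Set.fromList [x | x <- [1 .. bound], x `mod` m == r]

-- Intersect the candidate sets of every pair and take the smallest survivor.
crtBrute :: [(Integer, Integer)] -> Maybe (Integer, Integer)
crtBrute pairs =
  case Set.toList common of
    (x : _) -> Just (x, bound)
    []      -> Nothing
  where
    bound  = product (map snd pairs)
    common = foldr (Set.intersection . candidates bound)
                   (Set.fromList [1 .. bound])
                   pairs
```

This does work proportional to the product of the moduli, so it is only reasonable for small inputs such as the example.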
Update
Perhaps an easier approach is to define a filter that keeps only the numbers with a fixed remainder for a given modulus, and to apply it repeatedly for the given pairs
Prelude> let rm (r,m) = filter (\x -> x `mod` m == r)
verify that it works,
Prelude> take 10 $ rm (1,5) [1..]
[1,6,11,16,21,26,31,36,41,46]
now, for the given example use it repeatedly,
Prelude> take 3 $ rm (1,5) $ rm (0,3) $ rm (2,7) [1..]
[51,156,261]
of course we just need the first element, change to head instead
Prelude> head $ rm (1,5) $ rm (0,3) $ rm (2,7) [1..]
51
which we can generalize with fold
Prelude> head $ foldr rm [1..] [(1,5),(0,3),(2,7)]
51

memoizing a function that takes a set as parameter

I am using Data.MemoCombinators (https://hackage.haskell.org/package/data-memocombinators-0.3/docs/Data-MemoCombinators.html) to memoize a function that takes a set as its parameter and returns a set (this is a contrived example that does nothing but takes a long time to finish):
test s = case Set.toList s of
    [] -> Set.singleton 0
    [x] -> Set.singleton 1
    (x:xs) -> test (Set.singleton x) `Set.union` test (Set.fromList xs)
Since Data.MemoCombinators does not implement a table for sets, I wanted to write my own:
{-# LANGUAGE RankNTypes #-}
import Data.MemoCombinators (Memo)
import qualified Data.MemoCombinators as Memo
import Data.Set (Set)
import qualified Data.Set as Set
set :: Ord a => Memo a -> ((Set a) -> r) -> (Set a) -> r
set m f = Memo.list m (f . Set.fromList) . Set.toList
and here is my test that was supposed to be memoized:
test s = set Memo.integral test' s
    where
    test' s = case Set.toList s of
        [] -> Set.singleton 0
        [x] -> Set.singleton 1
        (x:xs) -> test (Set.singleton x) `Set.union` test (Set.fromList xs)
There is no documentation for Data.MemoCombinators that is clear to me, so basically I do not know exactly what I am doing.
My questions are:
what is the second parameter to the Memo.list function? Is it a memoizer for the elements of the list?
how to implement a table for a set directly, without using Memo.list? Here I would like to figure out how to implement memoization manually without using someone's library, for example using a Map. I have seen examples that memoize integers using an infinite list, but in the case of a map I cannot figure out how to initialize the map and how to insert into it.
Thanks for any help.
what is the second parameter to the Memo.list function? Is it a memoizer for the elements of the list?
The first parameter m is the memoizer for the elements of the list. The second parameter f is the function that you want to apply to the list (and that will be memoized too).
how to implement a table for a set directly, without using Memo.list? Here I would like to figure out how to implement memoization manually without using someone's library, for example using a Map. I have seen examples that memoize integers using an infinite list, but in the case of a map I cannot figure out how to initialize the map and how to insert into it.
Using the same strategy as Data.MemoCombinators, you can do something similar to what they do for lists. This approach does not use an explicit data structure; instead it exploits lazy evaluation and the way Haskell keeps already evaluated values in memory.
set :: Ord a => Memo a -> Memo (Set a)
set m f = table (f Set.empty) (m (\x -> set m (f . (x `Set.insert`))))
    where
    table nil cons set | Set.null set = nil
                       | otherwise = uncurry cons (Set.deleteFindMin set)
You can also do memoization in Haskell using an explicit data structure (like a Map). I will use the Fibonacci example to demonstrate that, because it is easier to benchmark, but it would be similar for other functions.
Let's start with the naive implementation:
fib0 :: Integer -> Integer
fib0 0 = 0
fib0 1 = 1
fib0 x = fib0 (x-1) + fib0 (x-2)
Then Data.MemoCombinators proposes this implementation:
import qualified Data.MemoCombinators as Memo
fib1 :: Integer -> Integer
fib1 = Memo.integral fib'
    where
    fib' 0 = 0
    fib' 1 = 1
    fib' x = fib1 (x-1) + fib1 (x-2)
And finally, my version using Map:
import Data.Map (Map)
import qualified Data.Map as Map
fib2 :: Integer -> Integer
fib2 = fst . fib' (Map.fromList [(0, 0),(1, 1)])
    where
    fib' m0 x | x `Map.member` m0 = (Map.findWithDefault 0 x m0, m0)
              | otherwise = let (v1, m1) = fib' m0 (x-1)
                                (v2, m2) = fib' m1 (x-2)
                                y = v1 + v2
                            in (y, Map.insert x y m2)
Now, let's see how they perform:
fib0 40: 13.529371s
fib1 40: 0.000121s
fib2 40: 0.000048s
The fib0 was already too slow. Let's do a proper test with the other two:
fib1 400000: 6.234243s
fib2 400000: 4.022798s
fib1 500000: 8.683649s
fib2 500000: 5.781104s
The Map solution actually seems to outperform the Memo solution in all the tests I performed. But I think the greatest advantage of Data.MemoCombinators is getting this performance without having to write much more code than the naive solution.
Updated: I changed the conclusions, because I was not doing the benchmark properly. I was making several calls in the same execution, and in the case of 500000, whichever call came second (either fib1 or fib2) took too long.
What you have for test is fine, although normally you would define test as a function on sets using Set operations. Here is an example of what I'm talking about:
-- memoize a function on Set Int
foo = set M.integral foo'
    where foo' s | Set.null s = 0
          foo' s = let a = Set.findMin s
                       b = Set.findMax s
                       m = (a+b) `div` 2
                       (lo,found,hi) = Set.splitMember m s
                   in if a >= b
                         then 1
                         else (if found then 1 else 0) + foo lo + foo hi
This is a very inefficient way of counting the number of elements in a set, but note how foo' is defined in terms of Set operations.
Re your other questions:
what is the second parameter to the Memo.list function? Is it a memoizer for the elements of the list?
Memo.list has signature Memo a -> Memo [a], so in the expression Memo.list m f we have:
m :: Memo a
f :: [a] -> r -- some type r
Memo.list m f :: [a] -> r
So f is the function on [a] that you are memoizing, and m is a memoizer for functions taking a parameter of type a.
how to implement a table for a set directly?
It depends on what you mean by "directly". Memoizing in this fashion is going to involve creating a (possibly infinite) lazy data structure. The string, integral and list memoizers all use some form of lazy trie. This is very different from memoization in imperative languages, where you explicitly check a hash map to see if you've already computed something and update that hash map with the function's value, etc. (Btw, you can do that sort of memoization in the ST or IO monads and it might work even better than the Data.MemoCombinators approach; something to consider.)
Your idea of memoizing a Set a -> r function by passing to a list is a fine idea, but I would use toAscList/fromAscList:
set m f = Memo.list m (f . Set.fromAscList) . Set.toAscList
That way the set Set.fromList [3,4,5] will re-use the same part of the trie that was created to memoize the value for Set.fromList [3,4].
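The explicit-table style mentioned in passing above (check a map, compute on a miss, store the result) can be sketched in IO with an IORef holding a Data.Map keyed by sets. This is an illustration only: memoSize and its toy recursion are hypothetical, chosen to show how the map is initialized and how results are inserted.

```haskell
import Data.IORef (IORef, newIORef, readIORef, modifyIORef')
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Toy function: counts elements by repeatedly removing the minimum,
-- consulting and updating the cache at every step.
memoSize :: IORef (Map.Map (Set.Set Int) Int) -> Set.Set Int -> IO Int
memoSize cache s = do
  table <- readIORef cache
  case Map.lookup s table of
    Just v  -> pure v                       -- already computed
    Nothing -> do
      v <- case Set.minView s of
             Nothing        -> pure 0
             Just (_, rest) -> (+ 1) <$> memoSize cache rest
      modifyIORef' cache (Map.insert s v)   -- remember the result
      pure v
```

The cache starts out as newIORef Map.empty and is shared by every call that receives the same reference; Set works as a Map key because it has an Ord instance.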

Interleaving list functions

Let's say I'm given two functions:
f :: [a] -> b
g :: [a] -> c
I want to write a function that is the equivalent of this:
h x = (f x, g x)
But when I do that, for large lists inevitably I run out of memory.
A simple example is the following:
x = [1..100000000::Int]
main = print $ (sum x, product x)
I understand this is the case because the list x is being kept in memory without being garbage collected. It would be better if f and g instead worked on x in, well, "parallel".
Assuming I can't change f and g, nor want to make a separate copy of x (assume x is expensive to produce) how can I write h without running into out of memory issues?
A short answer is you can't. Since you have no control over f and g, you have no guarantee that the functions process their input sequentially. Such a function can as well keep the whole list stored in memory before producing the final result.
However, if your functions are expressed as folds, the situation is different. This means that we know how to incrementally apply each step, so we can parallelize those steps in one run.
There are many resources about this area. For example:
Haskell: Can I perform several folds over the same lazy list without keeping list in memory?
Classic Beautiful folding
More beautiful fold zipping
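The core idea behind those posts can be sketched in a few lines: package a fold as a step function, an initial state and an extraction function, then pair two such folds so that both advance on every element. This is a simplified illustration written for this note, not code from the linked resources; the names Fold, Pair, both and runFold are assumptions.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.List (foldl')

-- A fold packaged as step function, initial state, and final extraction.
data Fold a b = forall s. Fold (s -> a -> s) s (s -> b)

-- Strict pair so that neither accumulator builds up thunks.
data Pair x y = Pair !x !y

-- Combine two folds so that they consume each element together.
both :: Fold a b -> Fold a c -> Fold a (b, c)
both (Fold stepL initL doneL) (Fold stepR initR doneR) =
  Fold step (Pair initL initR) done
  where
    step (Pair l r) x = Pair (stepL l x) (stepR r x)
    done (Pair l r)   = (doneL l, doneR r)

-- Run a packaged fold over a list in a single strict left-to-right pass.
runFold :: Fold a b -> [a] -> b
runFold (Fold step start done) = done . foldl' step start

sumF, productF :: Num a => Fold a a
sumF     = Fold (+) 0 id
productF = Fold (*) 1 id
```

With these definitions, runFold (both sumF productF) xs traverses xs once, so the list can be garbage collected as it is consumed.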
The problem of consuming a sequence of values within well-defined space bounds is solved more generally by pipe-like libraries such as conduit, iteratees or pipes. For example, in conduit, you could express the combination of computing sums and products as
import Control.Monad.Identity
import Data.Conduit
import Data.Conduit.List (fold, sourceList)
import Data.Conduit.Internal (zipSinks)
product', sum' :: (Monad m, Num a) => Sink a m a
sum' = fold (+) 0
product' = fold (*) 1
main = print . runIdentity $ sourceList (replicate (10^6) 1) $$
    zipSinks sum' product'
If you can turn your functions into folds, you can then just use them with a scan:
x = [1..100000000::Int]
main = mapM_ print . tail . scanl foo (a0,b0) . takeWhile (not.null)
         . unfoldr (Just . splitAt 1000) -- adjust the chunk length as needed
         $ x
foo (a,b) x = let a2 = f' a $ f x ; b2 = g' b $ g x
              in a2 `seq` b2 `seq` (a2, b2)
f :: [t] -> a -- e.g. sum
g :: [t] -> b -- (`rem` 10007) . product
f' :: a -> a -> a -- e.g. (+)
g' :: b -> b -> b -- ((`rem` 10007) .) . (*)
We consume the input in chunks for better performance. Compiled with -O2, this should run in constant space. The interim results are printed as an indication of progress.
If you can't turn your function into a fold, this means it has to consume the whole list to produce any output and this trick doesn't apply.
You can use multiple threads to evaluate f x and g x in parallel.
E.g.
import Control.Parallel (par, pseq)
x :: [Int]
x = [1..10^8]
main = print $ let a = sum x
                   b = product x
               in a `par` b `pseq` (a,b)
It's a nice way to exploit GHC's parallel runtime to prevent a space leak by doing two things at once.
Alternatively, you need to fuse f and g into a single pass.
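For the sum/product example, fusing by hand might look like the sketch below: one strict left fold carries both accumulators. The name sumAndProduct is hypothetical, and this only applies when f and g can be rewritten this way, which the question rules out in general.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- One pass over the list, keeping both running results strict.
sumAndProduct :: [Int] -> (Int, Int)
sumAndProduct = foldl' step (0, 1)
  where
    step (!s, !p) x = (s + x, p * x)
```

The bang patterns force both components at every step, so the list can be consumed lazily without retaining it or piling up thunks.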
