Improve complexity of association list generation - haskell

I currently have a list of the form:
[(foo, bar), (foo, baz), (qux, quux)]
I would like to convert this into a list of the form:
[(foo, [bar, baz]), (qux, [quux])]
In my actual use case, the list contains around 1 million of these tuples.
Currently, I'm solving this in the following way, which, while entirely pure and free of side-effects, also is (as I understand it) O(n^2):
import qualified Data.HashMap.Strict as M
foo xs = M.fromListWith (++) $ xs
Is there a better way to do this?

The fromListWith algorithm has O(n log n) time complexity. This is the best you can get with no other constraints. The idea is that you need to traverse the list (O(n)) and, for each element, insert the key into the map while checking for duplicates (O(log n)).
With other constraints, and at the cost of more space, you might be able to achieve linear complexity. For example, if the keys are integers and their range is "compact", then you can use a vector/array and maybe pay more in terms of space, but get O(1) lookup and insertion.
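As a rough sketch of the array idea, assuming the keys are Ints known to lie in a small range, accumArray from Data.Array can build the grouping in one pass. The name groupByIndex is made up for this illustration, and a key outside the given bounds raises an error.

```haskell
import Data.Array (Array, accumArray, elems)

-- Group values by integer keys that are assumed to lie within the given
-- (low, high) bounds. Each value is consed onto the list stored at its key,
-- so every list ends up in reverse input order.
groupByIndex :: (Int, Int) -> [(Int, v)] -> Array Int [v]
groupByIndex = accumArray (flip (:)) []
```

For example, elems (groupByIndex (0, 2) [(0,'a'), (2,'b'), (0,'c')]) gives ["ca","","b"]. Note that keys that never occur still take up a slot, which is the extra space cost mentioned above.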

No, you're fine, except for a small error in your implementation[1]. As Jeffrey pointed out, fromListWith has O(n log n) complexity, which is quite good.
The potential issue you might face is appending, which could possibly be O(n^2) if all the keys were the same and you appended to the end of each list. However, a little experiment shows
data Tree a = Branch (Tree a) (Tree a) | Leaf a
  deriving (Show)
ghci> M.fromListWith Branch [(1, Leaf 1), (1, Leaf 2), (1, Leaf 3)]
fromList [(1,Branch (Leaf 3) (Branch (Leaf 2) (Leaf 1)))]
that fromListWith gives the new element as the first argument to the combining function, so you will be prepending (which is O(1)) rather than appending (which is O(n)), so you're okay there.
[1]: You have forgotten to make singleton lists out of the values before passing to M.fromListWith.
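A minimal sketch of the corrected version, with each value wrapped in a singleton list first. Data.Map.Strict from containers stands in for Data.HashMap.Strict here so that the sketch needs no extra package (its fromListWith passes the new value first as well), and the name groupPairs is made up.

```haskell
import qualified Data.Map.Strict as M

-- Wrap every value in a singleton list, then let fromListWith merge the
-- lists of equal keys. The new list is the first argument to (++), so each
-- merge prepends one element and stays cheap.
groupPairs :: Ord k => [(k, v)] -> M.Map k [v]
groupPairs xs = M.fromListWith (++) [ (k, [v]) | (k, v) <- xs ]
```

Because of the prepending, the values for a key come out in reverse input order: M.toList (groupPairs [("foo","bar"), ("foo","baz"), ("qux","quux")]) is [("foo",["baz","bar"]),("qux",["quux"])].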

Related

Manipulating Tuples in Haskell

I'm new to Haskell, I have a question regarding tuples. Is there not a way to traverse a tuple? I understand that traversal is very easy with lists but if the input is given as a tuple is there not a way to check the entire tuple as you do with a list? If that's not the case would it possible to just extract the values from the tuple into a list and perform traversal that way?
In Haskell, it’s not considered idiomatic (nor is it really possible) to use the tuple as a general-purpose traversable container. Any tuple you deal with is going to have a fixed number of elements, with the types of these elements also being fixed. (This is quite different from how tuples are idiomatically used in, for example, Python.) You ask about a situation where “the input is given as a tuple” but if the input is going to have a flexible number of elements then it definitely won’t be given as a tuple—a list is a much more likely choice.
This makes tuples seem less flexible than in some other languages. The upside is that you can examine them using pattern matching. For example, if you want to evaluate some predicate for each element of a tuple and return True if the predicate passes for all of them, you would write something like
all2 :: (a -> Bool) -> (a, a) -> Bool
all2 predicate (x, y) = predicate x && predicate y
Or, for three-element tuples,
all3 :: (a -> Bool) -> (a, a, a) -> Bool
all3 predicate (x, y, z) = predicate x && predicate y && predicate z
You might be thinking, “Wait, you need a separate function for each tuple size?!” Yes, you do, and you can start to see why there’s not a lot of overlap between the use cases for tuples and the use cases for lists. The advantages of tuples are exactly that they are kind of inflexible: you always know how many values they contain, and what type those values have. The former is not really true for lists.
Is there not a way to traverse a tuple?
As far as I know, there’s no built-in way to do this. It would be easy enough to write down instructions for traversing a 2-tuple, traversing a 3-tuple, and so on, but this would have the big limitation that you’d only be able to deal with tuples whose elements all have the same type.
Think about the map function as a simple example. You can apply map to a list of type [a] as long as you have a function a -> b. In this case map looks at each a value in turn, passes it to the function, and assembles the list of resulting b values. But with a tuple, you might have three elements whose values are all different types. Your function for converting as to bs isn’t sufficient if the tuple consists of two a values and a c! If you try to start writing down the Foldable instance or the Traversable instance even just for two-element tuples, you quickly realize that those typeclasses aren’t designed to handle containers whose values might have different types.
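To make that limitation concrete: base does ship Functor and Foldable instances for pairs, but they only ever look at the second component, precisely because the first one may have a different type. A small sketch (the binding names are just for illustration):

```haskell
-- The Foldable instance for pairs treats (a, b) as a container holding
-- exactly one b, so the first component is ignored.
pairLength :: Int
pairLength = length ('x', 'y')

pairSum :: Int
pairSum = sum (10 :: Int, 32 :: Int)

-- Likewise, fmap only touches the second component.
mappedPair :: (String, Int)
mappedPair = fmap (+ 1) ("count", 41)
```

Here pairLength is 1, pairSum is 32, and mappedPair is ("count", 42), which is rarely what someone asking to "traverse a tuple" has in mind.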
Would it be possible to just extract the values from the tuple into a list?
Yes, but you would need a separate function for each possible size of the input tuple. For example,
tupleToList2 :: (a, a) -> [a]
tupleToList2 (x, y) = [x, y]
tupleToList3 :: (a, a, a) -> [a]
tupleToList3 (x, y, z) = [x, y, z]
The good news, of course, is that you’re never going to get a situation where you have to deal with tuples of arbitrary size, because that isn’t a thing that can happen in Haskell. Think about the type signature of a function that accepted a tuple of any size: how could you write that?
In any situation where you’re accepting a tuple as input, it’s probably not necessary to convert the tuple to a list first, because the pattern-matching syntax means that you can just address each element of the tuple individually—and you always know exactly how many such elements there are going to be.
If your tuple is homogeneous, and you don't mind using a third-party package, then lens provides some functions to traverse each element of a tuple.
ghci> :m +Control.Lens
ghci> over each (*10) (1, 2, 3, 4, 5) --traverse each element
(10,20,30,40,50)
Control.Lens.Tuple provides lenses to get and set the nth element, up to the 19th.
You can explore the lens package for more information. If you want to learn the lens package, Optics By Example by Chris Penner is a good book.

The simplest way to generically traverse a tree in haskell

Suppose I used language-javascript library to build AST in Haskell. The AST has nodes of different types, and each node can have fields of those different types.
And each type can have numerous constructors. (All the types instantiate Data, Eq and Show).
I would like to count each type's constructor occurrences in the tree. I could use toConstr to get the constructor, and ideally I'd make a Tree -> [Constr] function first (then counting is easy).
There are different ways to do that. Obviously pattern matching is too verbose (imagine around 3 types with 9-28 constructors).
So I'd like to use a generic traversal, and I tried to find the solution in SYB library.
There is an everywhere function, which doesn't suit my needs since I don't need a Tree -> Tree transformation.
There is gmapQ, which seems suitable in terms of its type, but as it turns out it's not recursive.
The most viable option so far is everywhereM. It still does the useless transformation, but I can use a Writer to collect toConstr results. Still, this way doesn't really feel right.
Is there an alternative that will not perform a useless (for this task) transformation and still deliver the list of constructors? (The order of their appearance in the tree doesn't matter for now)
Not sure if it's the simplest, but:
> data T = L | B T T deriving Data
> everything (++) (const [] `extQ` (\x -> [toConstr (x::T)])) (B L (B (B L L) L))
[B,L,B,B,L,L,L]
Here ++ says how to combine the results from subterms.
const [] is the base case for subterms that are not of type T. For those of type T, instead, we apply \x -> [toConstr (x::T)].
If you have multiple tree types, you'll need to extend the query using
const [] `extQ` (handleType1) `extQ` (handleType2) `extQ` ...
This is needed to identify the types for which we want to take the constructors. If there are a lot of types, probably this can be made shorter in some way.
Note that the code above is not very efficient on large trees, since using ++ in this way can lead to quadratic complexity. It would be better, performance-wise, to return a Data.Map.Map Constr Int (even though we would need to define some Ord instance for Constr for that).
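One possible sketch of the Map-based count. To stay within base and containers it recurses by hand with gmapQ from Data.Data instead of using SYB's everything, and it sidesteps the missing Ord instance by keying on the constructor name from showConstr. The name constrCounts is made up, the function counts constructors of every type it meets (not only T), and constructors from different types that share a name would be merged.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
import Data.Data (Data, gmapQ, showConstr, toConstr)
import qualified Data.Map.Strict as M

data T = L | B T T deriving (Data, Show)

-- Count constructor occurrences in a term: one for the node itself, plus
-- the counts of all immediate subterms, which gmapQ visits for us.
constrCounts :: Data a => a -> M.Map String Int
constrCounts x =
  M.unionsWith (+) (M.singleton (showConstr (toConstr x)) 1 : gmapQ constrCounts x)
```

On the tree from above, M.toList (constrCounts (B L (B (B L L) L))) is [("B",3),("L",4)].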
universe from the Data.Generics.Uniplate.Data module can give you a list of all the sub-trees of the same type. So using Ilya's example:
data T = L | B T T deriving (Data, Show)
tree :: T
tree = B L (B (B L L) L)
λ> import Data.Generics.Uniplate.Data
λ> universe tree
[B L (B (B L L) L),L,B (B L L) L,B L L,L,L,L]
λ> fmap toConstr $ universe tree
[B,L,B,B,L,L,L]

Haskell: Should I get "Stack space overflow" when constructing an IntMap from a list with a million values?

My problem is that when using any of the Map implementations in Haskell, I always get a "Stack space overflow" when working with a million values.
What I'm trying to do is process a list of pairs. Each pair contains two Ints (not Integers, I failed miserably with them so I tried Ints instead). I want to go through each pair in the list and use the first Int as a key. For each unique key I want to build up a list of second elements where each of the second elements are in a pair that have the same first element. So what I want at the end is a "Map" from an Int to a list of Ints. Here's an example.
Given a list of pairs like this:
[(1,10),(2,11),(1,79),(3,99),(1,42),(3,18)]
I would like to end up with a "Map" like this:
{1 : [42,79,10], 2 : [11], 3 : [18,99]}
(I'm using a Python-like notation above to illustrate a "Map". I know it ain't Haskell. It's just there for illustrative purposes.)
So the first thing I tried was my own hand built version where I sorted the list of pairs of Ints and then went through the list building up a new list of pairs but this time the second element was a list. The first element is the key i.e. the unique Int values of the first element of each pair and the second element is a list of the second values of each original pair which have the key as the first element.
So given a list of pairs like this:
[(1,10),(2,11),(1,79),(3,99),(1,42),(3,18)]
I end up with a list of pairs like this:
[(1, [42,79,10]), (2, [11]), (3, [18,99])]
This is easy to do. But there is one problem. The performance of the "sort" function on the original list (of 10 million pairs) is shockingly bad. I can generate the original list of pairs in less than a second. I can process the sorted list into my hand built map in less than a second. However, sorting the original list of pairs takes 40 seconds.
So I thought about using one of the built-in "Map" data structures in Haskell to do the job. The idea being I build my original list of pairs and then using standard Map functions to build a standard Map.
And that's where it all went pear-shaped. It works well on a list of 100,000 values but when I move to 1 million values, I get a "Stack space overflow" error.
So here's some example code that suffers from the problem. Please, please note that is not the actual code that I want to implement. It is just a very simplified version of code for which the same problem exists. I don't really want to separate a million consecutive numbers into odd and even partitions!!
import Data.IntMap.Strict(IntMap, empty, findWithDefault, insert, size)
power = 6
ns :: [Int]
ns = [1..10^power-1]
mod2 n = if odd n then 1 else 0
mod2Pairs = zip (map mod2 ns) ns
-- Takes a list of pairs and returns a Map where the key is the unique Int values
-- of the first element of each pair and the value is a list of the second values
-- of each pair which have the key as the first element.
-- e.g. makeMap [(1,10),(2,11),(1,79),(3,99),(1,42),(3,18)] =
-- 1 -> [42,79,10], 2 -> [11], 3 -> [18,99]
makeMap :: [(Int,a)] -> IntMap [a]
makeMap pairs = makeMap' empty pairs
  where
    makeMap' m [] = m
    makeMap' m ((a, b):cs) = makeMap' m' cs
      where
        bs = findWithDefault [] a m
        m' = insert a (b:bs) m
mod2Map = makeMap mod2Pairs
main = do
  print $ "Yowzah"
  print $ "length mod2Pairs="++ (show $ length mod2Pairs)
  print $ "size mod2Map=" ++ (show $ size mod2Map)
When I run this, I get:
"Yowzah"
"length mod2Pairs=999999"
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
From the above output, it should be clear that the stack space overflow happens when I try to do "makeMap mod2Pairs".
But to my naive eye all this seems to do is go through a list of pairs and for each pair lookup a key (the first element of each pair) and A) if it doesn't find a match return an empty list or B) if it does find a match, return the list that has previously been inserted. In either case it "cons"'s the second element of the pair to the "found" list and inserts that back into the Map with the same key.
(PS instead of findWithDefault, I've also tried lookup and handled the Just and Nothing using case but to no avail.)
I've had a look through the Haskell documentation on the various Map implementations and from the point of view of performance in terms of CPU and memory (especially stack memory), it seems that A) a strict implementation and B) one where the keys are Ints would be the best. I have also tried Data.Map and Data.Map.Strict and they also suffer from the same problem.
I am convinced the problem is with the "Map" implementation. Am I right? Why would I get a stack overflow error i.e. what is the Map implementation doing in the background that is causing a stack overflow? Is it making lots and lots of recursive calls behind the scenes?
Can anyone help explain what is going on and how to get around the problem?
I don't have an old enough GHC to check (this works just fine for me, and I don't have 7.6.3 as you do), but my guess would be that your makeMap' is too lazy. Probably this will fix it:
makeMap' m ((a, b):cs) = m `seq` makeMap' m' cs
Without it, you are building up a million-deep nested thunk, and deeply nested thunks are the traditional way to cause stack overflows in Haskell.
Alternately, I would try just replacing the whole makeMap implementation with fromListWith:
makeMap pairs = fromListWith (++) [(k, [v]) | (k, v) <- pairs]
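Another way to avoid the thunk build-up, sketched here as an alternative rather than as what the original code does, is a strict left fold: foldl' forces the map after every step, and insertWith from Data.IntMap.Strict conses the new value onto whatever list is already stored. The name makeMapStrict is made up.

```haskell
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM

-- Fold over the pairs, forcing the accumulated map at each step so that no
-- chain of suspended inserts can pile up.
makeMapStrict :: [(Int, a)] -> IM.IntMap [a]
makeMapStrict = foldl' step IM.empty
  where
    -- insertWith applies (++) to the new value first, so this prepends.
    step m (k, v) = IM.insertWith (++) k [v] m
```

With the example from the question, IM.toList (makeMapStrict [(1,10),(2,11),(1,79),(3,99),(1,42),(3,18)]) is [(1,[42,79,10]),(2,[11]),(3,[18,99])].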

What datatype to choose for a dungeon map

As part of a coding challenge I have to implement a dungeon map.
I have already designed it using Data.Map as a design choice because printing the map was not required and sometimes I had to update a map tile, e.g. when an obstacle was destroyed.
type Dungeon = Map Pos Tile
type Pos = (Int,Int) -- cartesian coordinates
data Tile = Wall | Destroyable | ...
But what if I had to print it too - then I would have to use something like
elaboratePrint . sort $ toList dungeon, where elaboratePrint takes care of the linebreaks and makes nice unicode symbols from the tileset.
Another choice I considered would be a nested list
type Dungeon = [[Tile]]
This would have the disadvantage, that it is hard to update a single element in such a data structure. But printing then would be a simple one liner unlines . map show.
Another structure I considered was Array, but I am not used to arrays. From a short glance at the Hackage docs, I only found a map function that operates on indexes and one that works on elements, so unless one is willing to work with mutable arrays, updating one element does not look easy at first glance. It is also not clear to me how to print an array quickly and easily.
So now my question: is there a better data structure for representing a dungeon map that has the property of easy printing and easy updating of single elements?
How about an Array? Haskell has real, 2-d arrays.
import Data.Array.IArray -- Immutable Arrays
Now an Array is indexed by any Ix a => a. And luckily, there is an instance (Ix a, Ix b) => Ix (a, b). So we can have
type Dungeon = Array (Integer, Integer) Tile
Now you construct one of these with any of several functions, the simplest to use being
array :: Ix i => (i, i) -> [(i, a)] -> Array i a
So for you,
startDungeon = array ( (0, 0), (100, 100) )
                     [ ( (x, y), Empty ) | x <- [0..100], y <- [0..100]]
And just substitute 100 and Empty for the appropriate values.
If speed becomes a concern, then it's a simple fix to use MArray and ST. I'd suggest not switching unless speed is actually a real concern here.
To address the pretty printing
import Data.List
import Data.Function
pretty :: Array (Integer, Integer) Tile -> String
-- assocs lists cells in index order, so cells sharing the first coordinate are adjacent
pretty = unlines . map (show . map snd) . groupBy ((==) `on` fst . fst) . assocs
And show can be turned into however you want to format a [Tile] into a row. If you decide that you really want these to be printed in an awesome and efficient manner (console game maybe), you should look at a proper pretty printing library, like this one.
First — tree-likes such as Data.Map and lists remain the natural data structures for functional languages. Map is a bit of an overkill structure-wise if you only need rectangular maps, but [[Tile]] may actually be pretty fine. It has O(√n) for both random-access and updates, that's not too bad.
In particular, it's better than pure-functional updates of a 2D array (O(n))! So if you need really good performance, there's no way around using mutable arrays. Which isn't necessarily bad though, after all a game is intrinsically concerned with IO and state. What is good about Data.Array, as noted by jozefg, is the ability to use tuples as Ix indexes, so I would go with MArray.
Printing is easy with arrays. You probably only need rectangular parts of the whole map, so I'd just extract such slices with a simple list comprehension
[ [ arrayMap ! (x,y) | x<-[21..38] ] | y<-[37..47] ]
You already know how to print lists.
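Putting the array suggestions together, here is a small sketch with immutable arrays. The tile set, the (row, column) indexing, and the names fromRows, setTile and render are all assumptions for illustration. Note that // copies the whole array, which is the O(n) update cost mentioned above.

```haskell
import Data.Array (Array, bounds, listArray, (!), (//))

-- Hypothetical tile set for this sketch.
data Tile = Wall | Floor | Destroyable deriving (Eq, Show)

type Pos = (Int, Int)            -- (row, column), an assumption of this sketch
type Dungeon = Array Pos Tile

tileChar :: Tile -> Char
tileChar Wall        = '#'
tileChar Floor       = '.'
tileChar Destroyable = '+'

-- Build a dungeon from rows of equal length; listArray fills row by row.
fromRows :: [[Tile]] -> Dungeon
fromRows rows = listArray ((0, 0), (h - 1, w - 1)) (concat rows)
  where
    h = length rows
    w = case rows of
          []      -> 0
          (r : _) -> length r

-- Pure update of a single tile; the old dungeon is left unchanged.
setTile :: Pos -> Tile -> Dungeon -> Dungeon
setTile pos tile d = d // [(pos, tile)]

-- One output line per row, one character per tile.
render :: Dungeon -> String
render d = unlines [ [ tileChar (d ! (r, c)) | c <- [c0 .. c1] ] | r <- [r0 .. r1] ]
  where
    ((r0, c0), (r1, c1)) = bounds d
```

For instance, render (setTile (1, 1) Floor (fromRows [[Wall, Wall, Wall], [Wall, Destroyable, Wall]])) yields "###\n#.#\n".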

Sort a list of pairs using sort Haskell

I have a function (frequency) that counts how many times each distinct value in a list occurs in that list. For example,
frequency "ababca"
should return:
[(3, 'a'), (2, 'b'), (1, 'c')].
This works fine, but now I need to sort the resulting list by the first element of each pair, using this function:
results :: [Party] -> [(Int, Party)]
results xs = ??? frequency (sort xs) ???
example of desired output:
[(1, "Green"), (2, "Red"), (3, "Blue")]
The above does not work, and I have no idea what I can do. I would like to do this using the regular sort.
Thank you in advance.
import Data.Function (on)
import Data.List (sort, sortBy)
results xs = sortBy (compare `on` fst) (frequency xs)
-- or, if you prefer
results xs = sort (frequency xs)
Links to documentation for on, sortBy, compare, fst.
The difference is that sort sorts in ascending order of the first element of each pair, breaking ties with the second elements of the pairs, while sortBy (compare `on` fst) explicitly only looks at the first element of each pair.
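The difference only shows up when two pairs share the same first element. A tiny sketch with made-up sample data:

```haskell
import Data.Function (on)
import Data.List (sort, sortBy)

-- Hypothetical sample data: two pairs share the count 2.
sample :: [(Int, Char)]
sample = [(2, 'b'), (1, 'z'), (2, 'a')]

-- sort compares whole pairs, so the tie on 2 is broken by the Char.
byWholePair :: [(Int, Char)]
byWholePair = sort sample

-- sortBy on fst ignores the Char; the sort is stable, so tied pairs keep
-- their original relative order.
byCountOnly :: [(Int, Char)]
byCountOnly = sortBy (compare `on` fst) sample
```

Here byWholePair is [(1,'z'),(2,'a'),(2,'b')], while byCountOnly is [(1,'z'),(2,'b'),(2,'a')].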
If you can only use sort and not sortBy (for some reason!) then you need to make sure that the items are of a type that is an instance of Ord. As it happens, all tuples (up to size 15) have Ord instances, provided that all the positions in the tuple also have Ord instances.
The example you give of [(1, "Green"), (2, "Red"), (3, "Blue")] should sort fine (though reversed), since both Int and String have Ord instances.
However, in the code snippet, you also mention a Party type without actually saying what it is. If it's not just an alias for something like String, you may have to define an Ord instance for it, to satisfy the built-in Ord instances for tuples.
You can have Haskell create instances for you, using deriving when you declare the type
data Party = P1 | P2 | P3 | P4 -- e.g.
  deriving (Eq,Ord)
or declare it yourself:
instance Ord Party where
  -- you don't care about the ordering of the party values
  compare a b = EQ
But, as dave4420 says, it's much better to just use sortBy, so I would do that unless you have a specific reason not to (i.e. it's a class assignment with restrictions).
