Assuming I have a list of values like this:
["abc","abd","aab","123"]
I want to group those values into a MultiMap (conceptually, not limited to a specific data structure) in Haskell by using a function that maps any element to a key.
For this example, we shall use take 2 as a mapper.
The result I intend to get is (conceptually, as JSON):
{"ab":["abc","abd"], "aa":["aab"], "12":["123"]}
In this example I will use [(String, [String])] as a Multimap data structure.
My basic idea (conceptually):
let datalist = ["abc","abd","aab","123"]
let mapfn = take 2
let keys = nub $ map mapfn datalist
let valuesForKey key = filter ((==key).mapfn) datalist
let resultMultimap = zip keys $ map valuesForKey keys
My questions:
1. Is there any better way (in base or external packages) to do this? I want to avoid custom code.
2. If (1) is not applicable, is there any guarantee that GHC will optimize this so that one pass over the data list is sufficient to generate the full multimap (as opposed to one filter run per key)?
Conceptually, this question is similar to the SQL GROUP BY statement.
Using fromListWith from Data.Map:
> let xs = ["abc","abd","aab","123"]
> let f = take 2
> Data.Map.fromListWith (++) [(f x, [x]) | x <- xs]
fromList [("12",["123"]),("aa",["aab"]),("ab",["abd","abc"])]
Edit 2014-03-28: My functions have now been published on Hackage, see group-with
Pull requests are welcome!
Based on hammar's excellent answer I put together two reusable functions to solve this problem.
groupWith solves exactly what I asked for. groupWithMulti generalizes the concept by allowing the identifier-generating function (e.g. take 2 in my example) to return multiple identifiers for a single value, or none at all. (In my example, a value is one of ["abc","abd","aab","123"].)
The value will be added to the Map value for any identifier generated by f.
import Data.Map (Map)
import qualified Data.Map as Map
-- | Group values in a list by the identifier returned
-- by a given function. The resulting map contains,
-- for each generated identifier, the values (from the
-- original list) that yielded that identifier.
groupWith :: (Ord b) => (a -> b) -> [a] -> (Map b [a])
groupWith f xs = Map.fromListWith (++) [(f x, [x]) | x <- xs]
-- | Like groupWith, but the identifier-generating function
-- may generate multiple identifiers (or even none).
-- The corresponding value from the original list is placed
-- in the map entry of every identifier it generates.
groupWithMulti :: (Ord b) => (a -> [b]) -> [a] -> (Map b [a])
groupWithMulti f xs =
let identifiers x = [(val, [x]) | val <- f x]
in Map.fromListWith (++) $ concat [identifiers x | x <- xs]
Simply use Map.toList to convert the results of these functions back to a tuple list.
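For example, on the original data in GHCi (the output order follows Data.Map's key ordering):

> Map.toList $ groupWith (take 2) ["abc","abd","aab","123"]
[("12",["123"]),("aa",["aab"]),("ab",["abd","abc"])]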
When I have some spare time, I will attempt to turn this approach to in-memory data grouping into a generalized library on Hackage.
Related
I have a program with two data structures I wish to combine. The use of Data.Map here is incidental because I'm using it elsewhere for a related purpose. If a solution never uses Data.Map, that's fine (probably better). I've simplified the problem to the below script that has all the essential elements.
My actual program is in a different domain, but in the analogy various "interviewers" are assigned to interview all the people in given households (named by index position of the "house"). I would like to determine which interviewers will need to conduct multiple interviews.
If an interviewer is assigned multiple households, she automatically must interview multiple people (in the scenario, all households are occupied). However, if she is assigned only one household, she might also need to interview the several people there.
The initial wrong approach I found (misled by my wrong assumption about the domain) produces the result below. However, I'm having trouble formulating the correct solution. For my purpose, the order in which the interviews occur in the result is not important.
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
-- Create Map from list of pairs, making dup vals into a list
fromListWithDuplicates :: Ord k => [(k, v)] -> Map k [v]
fromListWithDuplicates pairs =
Map.fromListWith (++) [(k, [v]) | (k, v) <- pairs]
data Person = Person {
name :: String
} deriving (Show, Eq)
households = [[Person "Alice", Person "Bob"],
[Person "Carlos"],
[Person "Dabir", Person "Eashan"],
[Person "Fatima"] ]
interviewers = [("Agent1", [0]), ("Agent2", [1,2]), ("Agent3", [3])]
multiInterviewsWRONG households interviewers =
let assignments = [(agent, name person) |
(agent, houseIndices) <- interviewers,
index <- houseIndices,
person <- (households !! index),
length houseIndices > 1 ]
in Map.assocs $ fromListWithDuplicates assignments
main :: IO ()
main = do
-- Prints: [("Agent2", ["Eashan","Dabir","Carlos"])]
putStrLn $ show (multiInterviewsWRONG households interviewers)
-- Correct: [("Agent2", ["Eashan","Dabir","Carlos"]),
--           ("Agent1", ["Alice","Bob"])]
Followup: this solution is just Willem Van Onsem's below, but putting it in one place (lengthExceeds is imported from GHC's internal Util module; the atLeastTwo helper in the next answer is a dependency-free alternative):
import Util (lengthExceeds)
multiInterviews households interviewers =
let assignments = [(agent, name person) |
(agent, houseIndices) <- interviewers,
index <- houseIndices,
person <- (households !! index) ]
in filter (flip lengthExceeds 1 . snd)
(Map.assocs $ fromListWithDuplicates assignments)
Obviously Willem's answer is great, but I think it can't hurt to also offer one without a list comprehension:
atLeastTwo :: [a] -> Bool
atLeastTwo (_:_:_) = True
atLeastTwo _ = False
transformSnd :: (b -> c) -> (a, b) -> (a, c)
transformSnd fun (f, s) = (f, fun s)
-- or transformSnd = second (from Control.Arrow; h/t Willem)
-- or transformSnd = fmap (from (,)'s Functor instance; h/t Will Ness)
-- or transformSnd = second (from Data.Bifunctor)
mult :: [(String, [String])]
mult = filter (atLeastTwo . snd) . map (transformSnd toInterviewees) $ interviewers
where toInterviewees = map name . concatMap (households !!)
-- mult == [("Agent1",["Alice","Bob"]),("Agent2",["Carlos","Dabir","Eashan"])]
I'm reasonably sure the two versions run equally fast; which one is more readable depends on who's doing the reading.
There are a couple of functional differences. First, with Willem's answer, you get a map, while with this one you get a list (but the difference is mainly cosmetic, and you said you didn't care much).
Second, the two versions behave differently if there are two pairs in the interviewers list that have the same first element. Doing it Willem's way will do what you probably want, i. e. treat them as one pair with a longer second element; doing it this way will give you two pairs in the result list which have the same first element.
Also, you probably know this, but: if you find yourself combining lists a lot, you might want sets instead.
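For instance, a hypothetical Set-based variant of the fromListWithDuplicates helper from the question might look like this (a sketch, not part of the original program):

import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- Collects the values for each key into a Set instead of a list,
-- so repeated (key, value) pairs are merged automatically.
fromListWithSets :: (Ord k, Ord v) => [(k, v)] -> Map.Map k (Set.Set v)
fromListWithSets pairs =
  Map.fromListWith Set.union [(k, Set.singleton v) | (k, v) <- pairs]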
You should remove the length houseIndices > 1 constraint, since it means that only agents assigned two or more households are retained. You should thus use this list comprehension:
multiInterviews households interviewers =
let assignments = [
(agent, name person) |
(agent, houseIndices) <- interviewers,
index <- houseIndices,
person <- households !! index
]
-- …
The given list comprehension will produce a list that looks like:
Prelude> :{
Prelude| [
Prelude| (agent, name person) |
Prelude| (agent, houseIndices) <- interviewers,
Prelude| index <- houseIndices,
Prelude| person <- households !! index
Prelude| ]
Prelude| :}
[("Agent1","Alice"),("Agent1","Bob"),("Agent2","Carlos"),("Agent2","Dabir"),("Agent2","Eashan"),("Agent3","Fatima")]
We do however still need to filter: we only want the assocs whose lists contain at least two items. We can implement an efficient function to determine whether a list has at least two items:
atLeastTwo :: [a] -> Bool
atLeastTwo (_:_:_) = True
atLeastTwo _ = False
and apply this filter to the assocs of the Map:
multiInterviews households interviewers =
let assignments = …
in filter (atLeastTwo . snd) (Map.assocs (fromListWithDuplicates assignments))
Haskell's expressiveness enables us to rather easily define a powerset function:
import Control.Monad (filterM)
powerset :: [a] -> [[a]]
powerset = filterM (const [True, False])
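A quick check in GHCi:

>>> powerset [1,2]
[[1,2],[1],[2],[]]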
To be able to perform my task it is crucial for said powerset to be sorted by a specific function, so my implementation kind of looks like this:
import Data.List (sortBy)
import Data.Ord (comparing)
powersetBy :: Ord b => ([a] -> b) -> [a] -> [[a]]
powersetBy f = sortBy (comparing f) . powerset
Now my question is whether there is a way to only generate a subset of the powerset given a specific start and end point, where f(start) < f(end) and |start| < |end|. For example, my parameter is a list of integers ([1,2,3,4,5]) and they are sorted by their sum. Now I want to extract only the subsets in a given range, let's say 3 to 7. One way to achieve this would be to filter the powerset to only include my range, but this seems (and is) inefficient when dealing with larger sets:
badFunction :: Ord b => b -> b -> ([a] -> b) -> [a] -> [[a]]
badFunction start end f = filter (\x -> f x >= start && f x <= end) . powersetBy f
badFunction 3 7 sum [1,2,3,4,5] produces [[1,2],[3],[1,3],[4],[1,4],[2,3],[5],[1,2,3],[1,5],[2,4],[1,2,4],[2,5],[3,4]].
Now my question is whether there is a way to generate this list directly, without having to generate all 2^n subsets first, since it will improve performance drastically by not having to check all elements but rather generating them "on the fly".
If you want to allow for completely general ordering-functions, then there can't be a way around checking all elements of the powerset. (After all, how would you know there isn't a special clause built in that gives, say, the particular set [6,8,34,42] a completely different ranking from its neighbours?)
However, you could already make the algorithm drastically faster by:
1. Only sorting after filtering: sorting is O(n · log n), so you want to keep n low here; for the O(n) filtering step it matters less. (And anyway, the number of elements doesn't change through sorting.)
2. Applying the ordering-function only once to each subset.
So
import Control.Arrow ((&&&))
lessBadFunction :: Ord b => (b,b) -> ([a]->b) -> [a] -> [[a]]
lessBadFunction (start,end) f
= map snd . sortBy (comparing fst)
. filter (\(k,_) -> k>=start && k<=end)
. map (f &&& id)
. powerset
Basically, let's face it, powersets of anything but a very small basis are infeasible. The particular application “sum in a certain range” is pretty much a packaging problem; there are quite efficient ways to do that kind of thing, but you'll have to give up the idea of perfect generality and of quantification over general subsets.
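To illustrate that: a minimal sketch of a direct generator for the sum-in-range case (my addition; it assumes non-negative elements, which makes the partial sum monotone and lets us prune whole branches instead of filtering the full powerset):

subsetsInRange :: Int -> Int -> [Int] -> [[Int]]
subsetsInRange lo hi = go 0
  where
    -- With non-negative elements the sum can only grow, so once the
    -- partial sum exceeds hi, no extension of this branch can succeed.
    go acc _ | acc > hi = []
    go acc []           = [[] | acc >= lo]
    go acc (x:xs)       = map (x:) (go (acc + x) xs) ++ go acc xs

subsetsInRange 3 7 [1,2,3,4,5] yields the same subsets as badFunction 3 7 sum [1,2,3,4,5], just in a different order; sorting the (much smaller) result afterwards is cheap.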
Since your problem is essentially a constraint satisfaction problem, using an external SMT solver might be the better alternative here; assuming you can afford the extra IO in the type and the need for such a solver to be installed. The SBV library allows construction of such problems. Here's one encoding:
import Data.SBV
-- c is the cost type
-- e is the element type
pick :: (Num e, SymWord e, SymWord c) => c -> c -> ([SBV e] -> SBV c) -> [e] -> IO [[e]]
pick begin end cost xs = do
solutions <- allSat constraints
return $ map extract $ extractModels solutions
where extract ts = [x | (t, x) <- zip ts xs, t]
constraints = do tags <- mapM (const free_) xs
let tagged = zip tags xs
finalCost = cost [ite t (literal x) 0 | (t, x) <- tagged]
solve [finalCost .>= literal begin, finalCost .<= literal end]
test :: IO [[Integer]]
test = pick 3 7 sum [1,2,3,4,5]
We get:
Main> test
[[1,2],[1,3],[1,2,3],[1,4],[1,2,4],[1,5],[2,5],[2,3],[2,4],[3,4],[3],[4],[5]]
For large lists, this technique will beat out generating all subsets and filtering, assuming the cost function generates reasonable constraints. (Addition will typically be OK; if you have multiplications, the backend solver will have a harder time.)
(As a side note, you should never use filterM (const [True, False]) to generate power-sets to start with! While that expression is cute and fun, it is extremely inefficient!)
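For reference, Data.List provides subsequences, which is the usual way to enumerate all subsets:

>>> import Data.List (subsequences)
>>> subsequences [1,2]
[[],[1],[2],[1,2]]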
I want to write a program that takes an array of Ints and a length, and returns an array that contains, at each position i, all the elements equal to i. For example:
[0,0,0,1,3,5,3,2,2,4,4,4] 6 -> [[0,0,0],[1],[2,2],[3,3],[4,4,4],[5]]
[0,0,4] 7 -> [[0,0],[],[],[],[4],[],[]]
[] 3 -> [[],[],[]]
[2,2] 3 -> [[],[],[2,2]]
So, here's my solution:
import Data.List
import Data.Function
f :: [Int] -> Int -> [[Int]]
f ls len = g 0 ls' [] where
ls' = group . sort $ ls
g :: Int -> [[Int]] -> [[Int]] -> [[Int]]
g val [] accum
| len == val = accum
| otherwise = g (val+1) [] (accum ++ [[]])
g val (x:xs) accum
| len == val = accum
| val == head x = g (val+1) xs (accum ++ [x])
| otherwise = g (val+1) (x:xs) (accum ++ [[]])
But the query f [] 1000000 takes a really long time. Why?
I see we're accumulating over some data structure. I think foldMap. I ask "Which Monoid"? It's some kind of lists of accumulations. Like this
newtype Bunch x = Bunch {bunch :: [x]}

instance Semigroup x => Semigroup (Bunch x) where
  Bunch xss <> Bunch yss = Bunch (glom xss yss) where
    glom [] yss = yss
    glom xss [] = xss
    glom (xs : xss) (ys : yss) = (xs <> ys) : glom xss yss

-- (With modern GHC, Monoid requires a Semigroup instance,
-- so mappend comes from the <> defined above.)
instance Semigroup x => Monoid (Bunch x) where
  mempty = Bunch []
Our underlying elements have some associative operator <>, and we can thus apply that operator pointwise to a pair of lists, just like zipWith does, except that when we run out of one of the lists, we don't truncate, rather we just take the other. Note that Bunch is a name I'm introducing for purposes of this answer, but it's not that unusual a thing to want. I'm sure I've used it before and will again.
If we can translate
0 -> Bunch [[0]] -- single 0 in place 0
1 -> Bunch [[],[1]] -- single 1 in place 1
2 -> Bunch [[],[],[2]] -- single 2 in place 2
3 -> Bunch [[],[],[],[3]] -- single 3 in place 3
...
and foldMap across the input, then we'll get the right number of each in each place. There should be no need for an upper bound on the numbers in the input to get a sensible output, as long as you are willing to interpret [] as "the rest is silence". Otherwise, like Procrustes, you can pad or chop to the length you need.
Note, by the way, that when mappend's first argument comes from our translation, we do a bunch of ([]++) operations, a.k.a. ids, then a single ([i]++), a.k.a. (i:), so if foldMap is right-nested (which it is for lists), then we will always be doing cheap operations at the left end of our lists.
Now, as the question works with lists, we might want to introduce the Bunch structure only when it's useful. That's what Control.Newtype is for. We just need to tell it about Bunch.
instance Newtype (Bunch x) [x] where
pack = Bunch
unpack = bunch
And then it's
groupInts :: [Int] -> [[Int]]
groupInts = ala' Bunch foldMap (basis !!) where
basis = ala' Bunch foldMap id [iterate ([]:) [], [[[i]] | i <- [0..]]]
What? Well, without going to town on what ala' is in general, its impact here is as follows:
ala' Bunch foldMap f = bunch . foldMap (Bunch . f)
meaning that, although f is a function to lists, we accumulate as if f were a function to Bunches: the role of ala' is to insert the correct pack and unpack operations to make that just happen.
We need (basis !!) :: Int -> [[Int]] to be our translation. Hence basis :: [[[Int]]] is the list of images of our translation, computed on demand at most once each (i.e., the translation, memoized).
For this basis, observe that we need these two infinite lists
[ []             [ [[0]]
, [[]]           , [[1]]
, [[],[]]        , [[2]]
, [[],[],[]]     , [[3]]
...              ...
combined Bunchwise. As both lists have the same length (infinity), I could also have written
basis = zipWith (++) (iterate ([]:) []) [[[i]] | i <- [0..]]
but I thought it was worth observing that this also is an example of Bunch structure.
Of course, it's very nice when something like accumArray hands you exactly the sort of accumulation you need, neatly packaging a bunch of grungy behind-the-scenes mutation. But the general recipe for an accumulation is to think "What's the Monoid?" and "What do I do with each element?". That's what foldMap asks you.
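If you do want the fixed-length behaviour from the question, the pad-or-chop step can be a small helper on top (my addition, not part of the answer above):

-- Pad with empty lists, or chop, to exactly len slots.
padTo :: Int -> [[Int]] -> [[Int]]
padTo len xss = take len (xss ++ repeat [])

so that padTo 7 (groupInts [0,0,4]) gives [[0,0],[],[],[],[4],[],[]], matching the second example from the question.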
The (++) operator copies the left-hand list. For this reason, adding to the beginning of a list is quite fast, but adding to the end of a list is very slow.
In summary, avoid adding things to the end of a list. Try to always add to the beginning instead. One simple way to do that is to build the list backwards, and then reverse it at the end. A more devious trick is to use "difference lists" (Google it). Another possibility is to use Data.Sequence rather than a list.
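As a hypothetical illustration of the build-backwards trick (the helper name is mine, not standard):

-- Conses each result in O(1), then pays a single O(n) reverse at the
-- end, instead of O(n^2) total for repeatedly appending with (++ [x]).
collectBackwards :: (a -> b) -> [a] -> [b]
collectBackwards f = reverse . foldl (\acc x -> f x : acc) []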
The first thing to note is that the most obvious way to implement this is to use a data structure that allows random access; an array is an obvious choice. Note that you need to add elements to the array multiple times and somehow "join" them.
accumArray is perfect for this.
So we get:
import Data.Array (accumArray, elems)

f l i = elems $ accumArray (\acc e -> e : acc) [] (0, i - 1) (map (\e -> (e, e)) l)
And we're good to go.
This approach does involve converting the final array back into a list, but that step is very likely faster than say sorting the list, which often involves scanning the list at least a few times for a list of decent size.
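Checking it against the first example from the question:

>>> f [0,0,0,1,3,5,3,2,2,4,4,4] 6
[[0,0,0],[1],[2,2],[3,3],[4,4,4],[5]]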
Whenever you use ++ you have to recreate the entire left-hand list, since lists are immutable.
A simple solution would be to use :, but that builds a reversed list. However that can be fixed using reverse, which results in only building two lists (instead of 1 million in your case).
Your concept of glomming things onto an accumulator is a very useful one, and both MathematicalOrchid and Guvante show how you can use that concept reasonably efficiently. But in this case, there is a simpler approach that is likely also faster. You started with
group . sort $ ls
and this was a very good place to start! You get a list that's almost the one you want, except that you need to fill in some blanks. How can we figure those out? The simplest way, though probably not quite the most efficient, is to work with a list of all the numbers you want to count up to: [0 .. len-1].
So we start with
f ls len = g [0 .. len-1] (group . sort $ ls)
where
?
How do we define g? By pattern matching!
f ls len = g [0 .. len-1] (group . sort $ ls)
where
-- We may or may not have some lists left,
-- but we counted as high as we decided we
-- would
g [] _ = []
-- We have no lists left, so the rest of the
-- numbers are not represented
g ns [] = map (const []) ns
-- This shouldn't be possible, because group
-- doesn't make empty lists.
g _ ([]:_) = error "group isn't working!"
-- Finally, we have some work to do!
g (n:ns) xls@(xl@(x:_):xls')
| n == x = xl : g ns xls'
| otherwise = [] : g ns xls
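A quick check against two of the examples from the question:

>>> f [0,0,4] 7
[[0,0],[],[],[],[4],[],[]]
>>> f [2,2] 3
[[],[],[2,2]]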
That was nice, but making the list of numbers isn't free, so you might be wondering how you can optimize it. One method I invite you to try is using your original technique of keeping a separate counter, but following this same sort of structure.
I want to print those elements that appear more than once in a list. Can you please tell me how I can do that? I am new to Haskell.
For example, if I have [1,2,3,3,2,4,5,6,5], I want to get only [2,3,5], because these are the repeated elements in the list.
Another solution: First sort the list, then group equal elements and take only the ones that appear multiple times:
>>> :m + Data.Maybe Data.List
>>> let xs = [1..100000] ++ [8,18..100] ++ [10,132,235]
>>> let safeSnd = listToMaybe . drop 1
>>> mapMaybe safeSnd $ group $ sort xs
[8,10,18,28,38,48,58,68,78,88,98,132,235]
group $ sort xs is a list of lists where each list contains all equal elements.
mapMaybe safeSnd keeps, for each group, its 2nd element if there is one (i.e. the original element occurred more than once in the original list).
This method should be faster than the one using nub, especially for large lists.
Data.Map.Lazy and Data.Map.Strict are host to a bunch of interesting functions for constructing maps (association maps, dictionaries, whatever you want to call them). One of them is fromListWith
fromListWith :: Ord k => (a -> a -> a) -> [(k, a)] -> Map k a
What you want to build is a map that tells you, for each value in your input list, how often it occurs. The values would be the keys of the map (type k), their counts would be the values associated with the keys (type a). You could use the following expression for that:
fromListWith (+) . map (\x -> (x, 1))
First, all values in the list are put into a tuple, together with a count of one. Then, fromListWith builds a map from the list; if a key already exists, it computes a new count using (+).
Once you've done this, you're only interested in the elements that occur more than once. For this, you can use filter (> 1) from Data.Map.
Finally, you just want to know all keys that remain in the map. Use the function keys for this.
In the end, you get the following module:
import qualified Data.Map.Strict as M
findDuplicates :: (Ord a) => [a] -> [a]
findDuplicates
= M.keys
. M.filter (> 1)
. M.fromListWith (+)
. map (\x -> (x, 1 :: Integer))
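For the list from the question:

>>> findDuplicates [1,2,3,3,2,4,5,6,5]
[2,3,5]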
It's common practice to import certain packages like Data.Map qualified, to avoid name conflicts between modules (e.g. filter from Data.Map and the one from Prelude are very different). In this situation, it's best to choose Data.Map.Strict; see the explanation at the top of Data.Map.
The complexity of this method should be O(n log n).
I thought it could be optimized by using a boolean flag to indicate that the value is a duplicate. However, this turned out to be about 20% slower.
You're basically looking for the list of elements that are not unique, or in other words, the difference between the original list and the list of unique elements. In code:
xs \\ (nub xs)
If you don't want to have duplicates in the result list, you'll want to call nub again:
nub $ xs \\ (nub xs)
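Both operators live in Data.List. On the list from the question (the result order follows the first repeated occurrence of each element):

>>> import Data.List
>>> let xs = [1,2,3,3,2,4,5,6,5]
>>> nub $ xs \\ nub xs
[3,2,5]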
Suppose one wants to map over a collection, but only collect results of the mapped function if the mapped-upon value meets certain criteria. I am currently doing this as such:
func = foldl (\acc x -> acc ++ maybeGrab x) []

maybeGrab a
  | a > 5     = [someFunc a]
  | otherwise = []
While this works, I am sure there is a more idiomatic 'right/common/more recognisable' way to do this.
mapMaybe :: (a -> Maybe b) -> [a] -> [b]
mapMaybe from Data.Maybe looks like it does the job. The documentation says:
The mapMaybe function is a version of map which can throw out elements. In particular, the functional argument returns something of type Maybe b. If this is Nothing, no element is added on to the result list. If it is Just b, then b is included in the result list.
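A sketch of the question's code rewritten this way (someFunc is the unspecified function from the question):

import Data.Maybe (mapMaybe)

-- Return Just for values we want to keep (transformed), Nothing otherwise.
func = mapMaybe maybeGrab
  where
    maybeGrab a
      | a > 5     = Just (someFunc a)
      | otherwise = Nothing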
Personally, I would do this in two stages: first, eliminate the values you don't care about, then map.
func = map someFunc . filter (>5)
This can also be expressed nicely as a list comprehension.
func xs = [someFunc x | x <- xs, x > 5]
Hmm. This definitely seems like a place where a fold is just fine. What about:
func = foldl (\acc x -> let a = g x in if a > 5 then a:acc else acc) []
Here g is the function you are trying to map over the list.
I can't think of any function that natively combines map and filter without folding.
[EDIT]
Oh, apparently there is a mapMaybe. Never used that before. I stand corrected. Ha, learn something all the time.