Recursively merge list of lists based on shared elements

I don't know what the official technical name is for what I'm trying to do so I'll try to explain it as best I can.
Given a list of lists:
[[2,3,4,5], [1,5,6], [7,8,9]]
I want to union only the lists that have at least one common element. So basically something like this:
simUnion :: [[Int]] -> [[Int]]
simUnion list = --...
--Result
-- [[1,2,3,4,5,6], [7,8,9]]
The problem I'm running into is running a match process between each element. Basically this is like the old math class problem where each person in a room must shake the hand of each other person. Ordinarily I'd accomplish this with a nested for loop, but how can I do this using Haskell's recursion?
Any help at all would be great!

If there is a finite number of distinct elements, you can turn the task inside out: build an Ord elem => Map elem [[elem]] (mapping each element to the groups that contain it) out of your [[elem]], and then iteratively merge the groups using the following algorithm:
while map isn't empty, take away a key, put it in the queue
get all the groups containing key popped from the queue
concat them and put into the queue (and in some accumulator, too)
if the queue got empty, the group is finished; take another key from the map
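The steps above can be sketched roughly as follows. This is only one possible reading of the algorithm, not the answerer's own code: the names buildIndex, simUnionMap and grow are made up for illustration, a plain list stands in for the queue, and a Set serves as the accumulator.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- | Index every element by the groups that contain it.
-- (Empty groups contain no elements, so they vanish from the result.)
buildIndex :: Ord a => [[a]] -> Map.Map a [[a]]
buildIndex groups = Map.fromListWith (++) [ (x, [g]) | g <- groups, x <- g ]

-- | Merge groups that are connected through shared elements.
simUnionMap :: Ord a => [[a]] -> [[a]]
simUnionMap = go . buildIndex
  where
    -- While the map isn't empty, take a key and grow its group.
    go index = case Map.lookupMin index of
      Nothing     -> []
      Just (k, _) ->
        let (component, rest) = grow [k] Set.empty index
        in Set.toAscList component : go rest

    -- grow queue accumulator index: an empty queue means the group is finished.
    grow [] acc index = (acc, index)
    grow (k:queue) acc index
      | k `Set.member` acc = grow queue acc index
      | otherwise =
          let neighbours = concat (Map.findWithDefault [] k index)
          in grow (neighbours ++ queue) (Set.insert k acc) (Map.delete k index)
```

With the question's input, simUnionMap [[2,3,4,5], [1,5,6], [7,8,9]] gives [[1,2,3,4,5,6],[7,8,9]]; each merged group comes out sorted because it is read back from a Set.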

Note: The following post is written in literate Haskell. Save it as *.lhs and load it in GHCi. Also note that the discussed algorithm has runtime O(n²) and isn't optimal. A better approach would use union find or similar.
First, let us think about the tools we need if we want to group a single list x with the rest of the lists xs. We need to separate between the lists from xs that have an element in common with x, and we need to build the union of such lists. Therefore, we should import some functions from Data.List:
> import Data.List (partition, union)
Next, we need to check whether two lists are suitable to get merged:
> intersects :: Eq a => [a] -> [a] -> Bool
> intersects xs ys = any (`elem` ys) xs
Now we have all the tools at hand to define simUnion. The empty case is clear: if we don't have any lists, the result doesn't have any list either:
> simUnion :: Eq a => [[a]] -> [[a]]
> simUnion [] = []
Suppose we have at least one list. We take the first one and check whether it has any element in common with any other list. We can do so by using partition:
> simUnion (x:xs) =
>   let (common, noncommon) = partition (intersects x) xs
Now, common :: [[a]] will only contain those lists that have at least one element in common with x. There can be two cases now: either common is empty, and our list x has no element in common with any list from xs:
>   in if null common
>        then x : simUnion xs
We ignore noncommon here, since xs == noncommon in this case. In the other case, we need to build the union of all lists in common and x. This can be done with foldr union. However, this new list must be used in simUnion again, since it may have new intersections. For example, in
simUnion [[1,2], [2,3], [3,4]]
you want to end up with [[1,2,3,4]], not [[1,2,3],[3,4]]:
>        else simUnion (foldr union x common : noncommon)
Note that the result will be unsorted, but you can map sort over it as a last step.
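Assembled into an ordinary (non-literate) module, the fragments above would look roughly like this, with the final sorting step added. simUnionSorted is a name made up here for illustration.

```haskell
import Data.List (partition, sort, union)

-- | Do the two lists share at least one element?
intersects :: Eq a => [a] -> [a] -> Bool
intersects xs ys = any (`elem` ys) xs

-- | Merge lists that share elements, re-checking after every merge.
simUnion :: Eq a => [[a]] -> [[a]]
simUnion [] = []
simUnion (x:xs) =
  let (common, noncommon) = partition (intersects x) xs
  in if null common
       then x : simUnion xs
       else simUnion (foldr union x common : noncommon)

-- | Same as simUnion, but with each resulting group sorted.
simUnionSorted :: Ord a => [[a]] -> [[a]]
simUnionSorted = map sort . simUnion
```

For the question's input, simUnion returns [[1,5,6,2,3,4],[7,8,9]], and simUnionSorted turns that into [[1,2,3,4,5,6],[7,8,9]].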

I have two main recommendations:
Don't think of it in terms of recursion! Instead, make liberal use of library utility functions.
Use appropriate data structures! Since you're talking about membership tests and unions, sets (from the Data.Set module) sound like they would be a better choice.
Applying those ideas, here's a fairly simple (though perhaps very naïve and suboptimal) solution:
import Data.Set (Set)
import qualified Data.Set as Set
simUnion :: Set (Set Int) -> Set (Set Int)
simUnion sets = Set.map outer sets
  where outer :: Set Int -> Set Int
        outer set = unionMap middle set
          where middle :: Int -> Set Int
                middle i = unionMap inner sets
                  where inner :: Set Int -> Set Int
                        inner set
                          | i `Set.member` set = set
                          | otherwise = Set.empty
-- | Utility function analogous to the 'concatMap' list function, but
-- for sets.
unionMap :: (Ord a, Ord b) => (a -> Set b) -> Set a -> Set b
unionMap f as = Set.unions (map f (Set.toList as))
Now using your example:
-- | This evaluates to:
--
-- >>> simUnion sampleData
-- fromList [fromList [1,2,3,4,5,6],fromList [7,8,9]]
sampleData :: Set (Set Int)
sampleData = Set.fromList (map Set.fromList sampleData')
  where sampleData' :: [[Int]]
        sampleData' = [[2,3,4,5], [1,5,6], [7,8,9]]
Ordinarily I'd accomplish this with a nested for loop, but how can I do this using Haskell's recursion?
You don't use recursion directly. You use higher-order functions like Set.map and unionMap. Note that these functions are analogous to loops, and that we're using them in a nested manner. Rule of thumb: imperative for loops very often translate to functional map, filter, reduce or similar operations. Nested imperative loops correspondingly often translate to nested use of such functions.
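One caveat about the naïve solution: as far as I can tell, a single pass of Set.map only merges sets that overlap directly, so a chain such as [[1,2], [2,3], [3,4]] is not fully collapsed. A possible way around this is to repeat the pass until nothing changes. The sketch below is an assumption-laden variation rather than the original answer's code; mergeOnce and simUnionFix are hypothetical names.

```haskell
import Data.Set (Set)
import qualified Data.Set as Set

-- | One merging pass: every set absorbs all sets it shares an element with.
mergeOnce :: Ord a => Set (Set a) -> Set (Set a)
mergeOnce sets = Set.map absorb sets
  where
    absorb s =
      Set.unions [ t | t <- Set.toList sets
                     , not (Set.null (Set.intersection s t)) ]

-- | Repeat the pass until a fixed point is reached.
-- Sets only ever grow, so this terminates for finite input.
simUnionFix :: Ord a => Set (Set a) -> Set (Set a)
simUnionFix sets
  | next == sets = sets
  | otherwise    = simUnionFix next
  where
    next = mergeOnce sets
```

Applied to the chain example, the first pass yields three overlapping sets and the second pass collapses them into the single set {1,2,3,4}.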

Related

case-of / case expression with or in pattern matching possible?

I am learning Haskell on my own and was working on implementing a custom List data type using basic lists and case-of.
So the data structure is something similar to this:
data List = List [String] | EmptyList deriving Show
Now if I write case expressions for the base case, I have to do two matches. A simple example would be the size function:
size :: List -> Int
size lst = case lst of
  (List []) -> 0
  EmptyList -> 0
  (List (x:xs)) -> 1 + size (List xs)
Can't I do something like combining the two base cases of list being empty (List []) and EmptyList somehow to reduce redundancy?
size :: List -> Int
size lst = case lst of
  (List []) | EmptyList -> 0
  (List (x:xs)) -> 1 + size (List xs)
I have tried searching all over the net for this, but unfortunately wasn't able to find anything concrete over matching multiple patterns in one case.

First of all you should consider why you have separate constructors for List and EmptyList in the first place. The empty list clearly is already a special case of a list anyway, so this is an awkward redundancy. If anything, you should make it
import Data.List.NonEmpty (NonEmpty)
data List' a = NEList (NonEmpty a) | EmptyList
Another option that would work for this specific example is to make the empty case into a “catch-all pattern”:
size :: List -> Int
size lst = case lst of
  (List (x:xs)) -> 1 + size (List xs)
  _ -> 0
BTW there's no reason to use case here, you can also just write two function clauses:
size :: List -> Int
size (List (x:xs)) = 1 + size (List xs)
size _ = 0
Anyway, catch-all clauses are generally discouraged, because they are an easy place for hard-to-detect bugs to creep in if you extend your data type in the future.
Also possible, but even worse style is to use a boolean guard match – this can easily use lookups in a list of options, like
size lst | lst`elem`[EmptyList, List []] = 0
size (List (x:xs)) = 1 + size (List xs)
Equality checks should be avoided if possible; they introduce an Eq constraint which, quite needlessly, will require the elements to be equality-comparable. And often equality check is also more computationally expensive than a pattern match.
Another option, if you can't change the data structure itself but would like to work with it as if List [] and EmptyList were the same thing, would be to write custom pattern synonyms. This is a relatively recent feature of Haskell; it kind of pretends the data structure is actually different – like List' – from how it's really laid out.
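A minimal sketch of what such pattern synonyms could look like, assuming GHC with the PatternSynonyms and ViewPatterns extensions. The names Nil, Cons and unconsList are invented for this illustration and are not part of any library.

```haskell
{-# LANGUAGE PatternSynonyms #-}
{-# LANGUAGE ViewPatterns #-}

data List = List [String] | EmptyList deriving Show

-- | Hypothetical helper: split off the first element, treating
-- 'EmptyList' and 'List []' identically.
unconsList :: List -> Maybe (String, List)
unconsList (List (x:xs)) = Just (x, List xs)
unconsList (List [])     = Nothing
unconsList EmptyList     = Nothing

-- | Matches both representations of the empty list.
pattern Nil :: List
pattern Nil <- (unconsList -> Nothing)

-- | Matches a non-empty list, exposing its head and the remaining list.
pattern Cons :: String -> List -> List
pattern Cons x rest <- (unconsList -> Just (x, rest))

-- Tell the exhaustiveness checker that these two patterns cover everything.
{-# COMPLETE Nil, Cons #-}

size :: List -> Int
size Nil           = 0
size (Cons _ rest) = 1 + size rest
```

Both size EmptyList and size (List []) go through the single Nil clause here, so the redundancy lives in one helper instead of in every consumer.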

In the comments, you say
there are no such functions [which should return different results for EmptyList and List []]
therefore I recommend merging these two constructors in the type itself:
data List = List [String] deriving Show
Now you no longer need to distinguish between EmptyList and List [] in functions that consume a List.
...in point of fact, I would go even further and elide the definition entirely, simply using [String] everywhere instead. There is one exception to this: if you need to define an instance for a class that differs in behavior from [String]'s existing instance. In that exceptional case, defining a new type is sensible; but I would use newtype instead of data, for the usual efficiency and semantics reasons.
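To illustrate the exceptional case, here is a small sketch of a newtype whose Show instance differs from the one [String] already has. The type name Names and the comma-separated format are assumptions chosen purely for the example.

```haskell
import Data.List (intercalate)

-- | A thin wrapper around [String]; a newtype adds no runtime overhead.
newtype Names = Names [String]

-- | Custom instance: print the names comma-separated instead of
-- using the list syntax that [String]'s own Show instance produces.
instance Show Names where
  show (Names xs) = intercalate ", " xs

-- | Functions that consume the wrapper need only one case.
size :: Names -> Int
size (Names xs) = length xs
```

Here show (Names ["x", "y"]) produces x, y, whereas show ["x", "y"] on the bare list would produce ["x","y"].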

Functional Programming-Style Map Function that adds elements?

I know and love my filter, map and reduce, which happen to be part of more and more languages that are not really purely functional.
I found myself needing a similar function though: something like map, but instead of one to one it would be one to many.
I.e. one element of the original list might be mapped to multiple elements in the target list.
Is there already something like this out there or do I have to roll my own?

This is exactly what >>= specialized to lists does.
> [1..6] >>= \x -> take (x `mod` 3) [1..]
[1,1,2,1,1,2]
It's concatenating together the results of
> map (\x -> take (x `mod` 3) [1..]) [1..6]
[[1],[1,2],[],[1],[1,2],[]]

You do not have to roll your own. There are many relevant functions here, but I'll highlight three.
First of all, there is the concat function, which already comes in the Prelude (the standard library that's loaded by default). What this function does, when applied to a list of lists, is return the list that contains concatenated contents of the sublists.
EXERCISE: Write your own version of concat :: [[a]] -> [a].
So using concat together with map, you could write this function:
concatMap :: (a -> [b]) -> [a] -> [b]
concatMap f = concat . map f
...except that you don't actually need to write it, because it's such a common pattern that the Prelude already has it (at a more general type than what I show here—the library version takes any Foldable, not just lists).
Finally, there is also the Monad instance for list, which can be defined this way:
instance Monad [] where
return a = [a]
as >>= f = concatMap f as
So the >>= operator (the centerpiece of the Monad class), when working with lists, is exactly the same thing as concatMap.
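As a quick check of that equivalence, the following sketch expresses the same one-to-many mapping in four ways. The helper name expand and the four via... names are made up here; the function itself is the one used in the GHCi example in the other answer.

```haskell
-- | Maps one input element to zero, one or two output elements.
expand :: Int -> [Int]
expand x = take (x `mod` 3) [1..]

viaBind, viaConcatMap, viaConcat, viaComprehension :: [Int]
viaBind          = [1..6] >>= expand                    -- the Monad instance for lists
viaConcatMap     = concatMap expand [1..6]              -- the Prelude function
viaConcat        = concat (map expand [1..6])           -- map, then flatten
viaComprehension = [ y | x <- [1..6], y <- expand x ]   -- list comprehension
```

All four evaluate to [1,1,2,1,1,2].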
EXERCISE: Skim through the documentation of the Data.List module. Figure out how to import the module into your code and play around with some of the functions.

Uniqueness and other restrictions for Arbitrary in QuickCheck

I'm trying to write a modified Arbitrary instance for my data type, where (in my case) a subcomponent has a type [String]. I would ideally like to bring uniqueness in the instance itself, that way I don't need ==> headers / prerequisites for every test I write.
Here's my data type:
data Foo = Vars [String]
and the trivial Arbitrary instance:
instance Arbitrary Foo where
  arbitrary = Vars <$> (:[]) <$> choose ('A','z')
This instance is strange, I know. In the past, I've had difficulty when quickcheck combinatorically explodes, so I'd like to keep these values small. Another request - how can I make an instance where the generated strings are under 4 characters, for instance?
All of this, fundamentally requires (boolean) predicates to augment Arbitrary instances. Is this possible?

You definitely want the instance to produce only values that match the intention of the data type. If you want all the variables to be distinct, the Arbitrary instance must reflect this. (Another question is whether in this case it wouldn't make more sense to store the variables as a set, like newtype Foo = Vars (Set String).)
I'd suggest to check for duplicates using Set or Hashtable, as nub has O(n^2) complexity, which might slow down your test considerably for larger inputs. For example:
import Control.Applicative
import Data.List (nub)
import qualified Data.Set as Set
import Test.QuickCheck
newtype Foo = Vars [String]
-- | Checks if a given list has no duplicates in _O(n log n)_.
hasNoDups :: (Ord a) => [a] -> Bool
hasNoDups = loop Set.empty
  where
    loop _ [] = True
    loop s (x:xs) | s' <- Set.insert x s, Set.size s' > Set.size s
                    = loop s' xs
                  | otherwise
                    = False
-- | Always worth to test if we wrote `hasNoDups` properly.
prop_hasNoDups :: [Int] -> Property
prop_hasNoDups xs = hasNoDups xs === (nub xs == xs)
Your instance then needs to create a list of lists, and each list should be randomized. So instead of (: []), which creates just a singleton list (and just one level), you need to call listOf twice:
instance Arbitrary Foo where
  arbitrary = Vars <$> (listOf . listOf $ choose ('A','z'))
                         `suchThat` hasNoDups
Also notice that choose ('A', 'z') can produce every character between A and z, which includes the punctuation characters [, \, ], ^, _ and ` that sit between Z and a in ASCII. My guess is that you rather want something like
oneof [choose ('A','Z'), choose ('a','z')]
If you really want, you could also make hasNoDups O(n) using hash tables in the ST monad.
Concerning limiting the size: you could always have your own parametrized functions that produce different Gen Foo, but I'd say in most cases it's not necessary. Gen has its own internal size parameter, which is increased throughout the tests (see this answer), so different sizes (as generated using listOf) of lists are covered.
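If you do want a hard limit on the string length, as the question asks, one possible sketch is to pick the length explicitly with choose and vectorOf. The generator name shortName, the letters-only alphabet and the 1 to 3 character range are assumptions made for this example; deduplicating with nub (instead of filtering with suchThat) is a design choice that avoids retrying rejected lists.

```haskell
import Data.List (nub)
import Test.QuickCheck

newtype Foo = Vars [String] deriving Show

-- | Hypothetical generator: a letters-only name of 1 to 3 characters.
shortName :: Gen String
shortName = do
  n <- choose (1, 3)
  vectorOf n (elements (['A'..'Z'] ++ ['a'..'z']))

instance Arbitrary Foo where
  -- Generate a list of short names and drop duplicates, so every
  -- generated value satisfies the uniqueness invariant by construction.
  arbitrary = Vars . nub <$> listOf shortName
```

Because the names are short, nub's quadratic cost stays small here; for long lists a Set-based deduplication would scale better.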
But I'd suggest implementing shrink, as this will give you much nicer counter-examples. For example, if we define a (wrong) test that tries to verify that no Foo value contains 'a' in any of its variables:
prop_Foo_hasNoDups :: Foo -> Property
prop_Foo_hasNoDups (Vars xs) = all (notElem 'a') xs === True
we'll get ugly counter-examples such as
Vars ["RhdaJytDWKm","FHHhrqbI","JVPKGTqNCN","awa","DABsOGNRYz","Wshubp","Iab","pl"]
But adding
shrink (Vars xs) = map Vars $ shrink xs
to Arbitrary Foo reduces the counter-example to just
Vars ["a"]

suchThat :: Gen a -> (a -> Bool) -> Gen a is a way to embed Boolean predicates in a Gen. See the haddocks for more info.
Here's how you would make the instance unique:
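import Data.List (nub)
instance Arbitrary Foo where
  -- The parentheses matter: suchThat must apply to the Gen [String], not to the Gen Char.
  arbitrary = Vars <$> (((:[]) <$> (:[]) <$> choose ('A','z')) `suchThat` isUnique)
    where
      isUnique x = nub x == x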

Does there exist something like (xs:x)

I'm new to Haskell. I know I can create a reverse function by doing this:
reverse :: [a] -> [a]
reverse [] = []
reverse (x:xs) = (Main.reverse xs) ++ [x]
Is there such a thing as (xs:x) (a list concatenated with an element, i.e. x is the last element in the list) so that I put the last list element at the front of the list?
rotate :: [a] -> [a]
rotate [] = []
rotate (xs:x) = [x] ++ xs
I get these errors when I try to compile a program containing this function:
Occurs check: cannot construct the infinite type: a = [a]
When generalising the type(s) for `rotate'

I'm also new to Haskell, so my answer is not authoritative. Anyway, I would do it using last and init:
Prelude> last [1..10] : init [1..10]
[10,1,2,3,4,5,6,7,8,9]
or
Prelude> [ last [1..10] ] ++ init [1..10]
[10,1,2,3,4,5,6,7,8,9]

The short answer is: this is not possible with pattern matching, you have to use a function.
The long answer is: it's not in standard Haskell, but it is if you are willing to use an extension called View Patterns, and also if you have no problem with your pattern matching eventually taking longer than constant time.
The reason is that pattern matching is based on how the structure is constructed in the first place. A list is an algebraic data type, which has the following structure:
data List a = Empty | Cons a (List a)
  deriving (Show) -- this is just so you can print the List
When you declare a type like that you generate three objects: a type constructor List, and two data constructors: Empty and Cons. The type constructor takes types and turns them into other types, i.e., List takes a type a and creates another type List a. The data constructor works like a function that returns something of type List a. In this case you have:
Empty :: List a
representing an empty list and
Cons :: a -> List a -> List a
which takes a value of type a and a list and appends the value to the head of the list, returning another list. So you can build your lists like this:
empty = Empty -- similar to []
list1 = Cons 1 Empty -- similar to 1:[] = [1]
list2 = Cons 2 list1 -- similar to 2:(1:[]) = 2:[1] = [2,1]
This is more or less how lists work, but in the place of Empty you have [] and in the place of Cons you have (:). When you type something like [1,2,3] this is just syntactic sugar for 1:2:3:[] or Cons 1 (Cons 2 (Cons 3 Empty)).
When you do pattern matching, you are "de-constructing" the type. Having knowledge of how the type is structured allows you to uniquely disassemble it. Consider the function:
head :: List a -> a
head (Empty) = error "the empty list has no head"
head (Cons x xs) = x
What happens in pattern matching is that the data constructor is matched against the structure you give. If it matches Empty, then you have an empty list. If it matches Cons x xs, then x must have type a and must be the head of the list, and xs must have type List a and be the tail of the list, because that's the type of the data constructor:
Cons :: a -> List a -> List a
If Cons x xs is of type List a then x must be a and xs must be List a. The same is true for (x:xs). If you look at the type of (:) in GHCi:
> :t (:)
(:) :: a -> [a] -> [a]
So, if (x:xs) is of type [a], x must be a and xs must be [a]. The error message you get when you try to do (xs:x) and then treat xs like a list is exactly because of this. By your use of (:) the compiler infers that xs has type a, and by your use of ++ it infers that xs must be [a]. Then it gives up because there's no finite type a for which a = [a]; this is what it's trying to tell you with that error message.
If you need to disassemble the structure in other ways that don't match the way the data constructor builds the structure, then you have to write your own function. There are two functions in the standard library that do what you want: last returns the last element of a list, and init returns all-but-the-last elements of the list.
But note that pattern matching happens in constant time. To find out the head and the tail of a list, it doesn't matter how long the list is, you just have to look to the outermost data constructor. Finding the last element is O(N): you have to dig until you find the innermost Cons or the innermost (:), and this requires you to "peel" the structure N times, where N is the size of the list.
If you frequently have to look for the last element in long lists, you might consider if using a list is a good idea after all. You can go after Data.Sequence (constant time access to first and last elements), Data.Map (log(N) time access to any element if you know its key), Data.Array (constant time access to an element if you know its index), Data.Vector or other data structures that match your needs better than lists.
Ok. That was the short answer (:P). The long one you'll have to lookup a bit by yourself, but here's an intro.
You can have this working with a syntax very close to pattern matching by using view patterns. View Patterns are an extension that you can use by having this as the first line of your code:
{-# Language ViewPatterns #-}
The instructions of how to use it are here: http://hackage.haskell.org/trac/ghc/wiki/ViewPatterns
With view patterns you could do something like:
view :: [a] -> (a, [a])
view xs = (last xs, init xs)
someFunction :: [a] -> ...
someFunction (view -> (x,xs)) = ...
then x and xs will be bound to the last and the init of the list you provide to someFunction. Syntactically it feels like pattern matching, but it is really just applying last and init to the given list.
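For a version that can be loaded as is, here is a small sketch of rotate written with a view pattern. It assumes GHC with ViewPatterns enabled; the helper name unsnocView is made up, and it returns a Maybe so that the empty list is handled instead of crashing in last.

```haskell
{-# LANGUAGE ViewPatterns #-}

-- | Hypothetical helper: split a list into its init and its last element.
-- Like last and init themselves, this takes O(N) time.
unsnocView :: [a] -> Maybe ([a], a)
unsnocView [] = Nothing
unsnocView xs = Just (init xs, last xs)

-- | Move the last element to the front.
rotate :: [a] -> [a]
rotate (unsnocView -> Just (xs, x)) = x : xs
rotate _                            = []
```

With this definition, rotate [1,2,3,4] evaluates to [4,1,2,3].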

If you're willing to use something different from plain lists, you could have a look at the Seq type in the containers package, as documented here. This has O(1) cons (element at the front) and snoc (element at the back), and allows pattern matching the element from the front and the back, through use of Views.
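A minimal sketch of that approach, using viewr and the ViewR constructors from Data.Sequence in the containers package. The function name rotateSeq and the example value are assumptions for illustration.

```haskell
import Data.Sequence (Seq, ViewR (..), viewr, (<|))
import qualified Data.Sequence as Seq

-- | Move the last element of a sequence to the front.
-- viewr exposes the right end of the sequence for pattern matching.
rotateSeq :: Seq a -> Seq a
rotateSeq s = case viewr s of
  EmptyR  -> s          -- nothing to rotate
  xs :> x -> x <| xs    -- x is the last element, xs everything before it

-- | Example input built from a list.
example :: Seq Int
example = Seq.fromList [1, 2, 3, 4]
```

Here rotateSeq example equals Seq.fromList [4,1,2,3].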

"Is there such a thing as (xs:x) (a list concatenated with an element, i.e. x is the last element in the list) so that I put the last list element at the front of the list?"
No, not in the sense that you mean. These "patterns" on the left-hand side of a function definition are a reflection of how a data structure is defined by the programmer and stored in memory. Haskell's built-in list implementation is a singly-linked list, ordered from the beginning - so the pattern available for function definitions reflects exactly that, exposing the very first element plus the rest of the list (or alternatively, the empty list).
For a list constructed in this way, the last element is not immediately available as one of the stored components of the list's top-most node. So instead of that value being present in the pattern on the left-hand side of the function definition, it's calculated by the function body on the right-hand side.
Of course, you can define new data structures, so if you want a new list that makes the last element available through pattern-matching, you could build that. But there'd be some cost: maybe you'd just be storing the list backwards, so that it's now the first element which is not available by pattern matching and requires computation. Maybe you're storing both the first and last value in the structure, which would require additional storage space and bookkeeping.
It's perfectly reasonable to think about multiple implementations of a single data structure concept - to look forward a little bit, this is one use of Haskell's class/instance definitions.

Reversing as you suggested might be much less efficient. last is not an O(1) operation but an O(N) one, which means that rotating as you suggested becomes an O(N^2) algorithm.
Source:
http://www.haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/src/GHC-List.html#last
Your first version (reverse) does not have O(N) complexity either, because ++ is also an O(N) operation.
A linear-time reverse uses an accumulator, like this:
reverse' l = rev l []
  where
    rev [] a = a
    rev (x:xs) a = rev xs (x:a)
Source: http://www.haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/src/GHC-List.html#reverse

In your latter example, x is in fact a list. [x] becomes a list of lists, e.g. [[1,2], [3,4]].
(++) wants a list of the same type on both sides. When you are using it, you're doing [[a]] ++ [a] which is why the compiler is complaining. According to your code a would be the same type as [a], which is impossible.
In (x:xs), x is the first item of the list (the head) and xs is everything but the head, i.e., the tail. The names are irrelevant here, you might as well call them (head:tail).
If you really want to take the last item of the input list and put that in the front of the result list, you could do something like:
rotate :: [a] -> [a]
rotate [] = []
rotate lst = last lst : init lst
N.B. I haven't tested this code at all as I don't have a Haskell environment available at the moment.

What to call a function that splits lists?

I want to write a function that splits lists into sublists according to what items satisfy a given property p. My question is what to call the function. I'll give examples in Haskell, but the same problem would come up in F# or ML.
split :: (a -> Bool) -> [a] -> [[a]] --- split lists into list of sublists
The sublists, concatenated, are the original list:
concat (split p xs) == xs
Every sublist satisfies the initial_p_only p property, which is to say (A) the sublist begins with an element satisfying p—and is therefore not empty, and (B) no other elements satisfy p:
initial_p_only :: (a -> Bool) -> [a] -> Bool
initial_p_only p [] = False
initial_p_only p (x:xs) = p x && all (not . p) xs
So to be precise about it,
all (initial_p_only p) (split p xs)
If the very first element in the original list does not satisfy p, split fails.
This function needs to be called something other than split. What should I call it??

I believe the function you're describing is breakBefore from the list-grouping package.
Data.List.Grouping: http://hackage.haskell.org/packages/archive/list-grouping/0.1.1/doc/html/Data-List-Grouping.html
ghci> breakBefore even [3,1,4,1,5,9,2,6,5,3,5,8,9,7,9,3,2,3,8,4,6,2,6]
[[3,1],[4,1,5,9],[2],[6,5,3,5],[8,9,7,9,3],[2,3],[8],[4],[6],[2],[6]]
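For reference, a function with this behaviour can be sketched in a few lines using break from the Prelude. This is not the list-grouping package's source, just an illustration under that assumption; it is named breakBefore' here to avoid any confusion with the real function.

```haskell
-- | Start a new sublist in front of every element that satisfies p.
-- If the first element does not satisfy p, it simply opens the first
-- sublist (as in the GHCi session above) rather than failing.
breakBefore' :: (a -> Bool) -> [a] -> [[a]]
breakBefore' _ []     = []
breakBefore' p (x:xs) = (x : ys) : breakBefore' p zs
  where
    -- ys: elements up to (not including) the next one satisfying p
    (ys, zs) = break p xs
```

Concatenating the result always gives back the original list, and every sublist after the first begins with an element satisfying p.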

I quite like some name based on the term "break" as adamse suggests. There are quite a few possible variants of the function. Here is what I'd expect (based on the naming used in F# libraries).
A function named just breakBefore would take an element before which it should break:
breakBefore :: Eq a => a -> [a] -> [[a]]
A function with the With suffix would take some kind of function that directly specifies when to break. In the case of breaking, this is the function a -> Bool that you wanted:
breakBeforeWith :: (a -> Bool) -> [a] -> [[a]]
You could also imagine that a function with the By suffix would take a key selector and break when the key changes (which is a bit like group by, but you can have multiple groups with the same key):
breakBeforeBy :: Eq k => (a -> k) -> [a] -> [[a]]
I admit that the names are getting a bit long, and maybe the only function that is really useful is the one you wanted. However, F# libraries seem to use this pattern quite consistently (e.g. there is sort, sortBy taking a key selector and sortWith taking a comparer function).
Perhaps it is possible to have these three variants for more of the list processing functions (and it's quite a good idea to have a consistent naming pattern for these three kinds).
