Haskell text encoder - haskell

I am new to Haskell and would like some direction on solving my problem. I want a text-encoding function that returns the list of unique words together with a list in which each word of the text is represented by its index. For example, for the input:
["The more I like, the more I love.","The more I love, the more I hate."]
the output might be
(["The", "more", "I", "like", "the", "love.", "love,", "hate."],
[1, 2, 3, 4, 5, 2, 3, 6, 1, 2, 3, 7, 1, 2, 3, 8])
I have done the remove duplication part
removeDuplicates :: Eq a => [a] -> [a]
removeDuplicates = rdHelper []
  where
    rdHelper seen [] = seen
    rdHelper seen (x:xs)
      | x `elem` seen = rdHelper seen xs
      | otherwise = rdHelper (seen ++ [x]) xs

You can just iterate over the list of words and accumulate the unique words and their indexes. If the word is already in the accumulated list of words, append its index to the accumulated list of indexes. If it isn't, append the word to the list of words and append the new index (length of the list of words + 1) to the list of indexes.
To be honest, the Haskell code is easier to understand than my description:
import Data.List (findIndex)
build :: ([String], [Int]) -> String -> ([String], [Int])
build (words, indexes) word =
  let
    maybeIndex = findIndex (== word) words
  in
    case maybeIndex of
      Just index ->
        (words, indexes ++ [index + 1])
      Nothing ->
        (words ++ [word], indexes ++ [(+1) . length $ words])
buildIndexes =
  let
    listOfWords = words "The more I like, the more I love. The more I love, the more I hate."
  in
    foldl build ([], []) listOfWords
Here I have a concatenated string as an input
"The more I like, the more I love. The more I love, the more I hate."
Feel free to tailor the code for your needs.
By the way, it might be more performant to insert the elements at the beginning of the lists and then reverse the resulting lists.
import Data.List (findIndex)
build :: ([String], [Int]) -> String -> ([String], [Int])
build (words, indexes) word =
  let
    maybeIndex = findIndex (== word) words
  in
    case maybeIndex of
      Just index ->
        -- words is stored newest-first here, so convert the position back to a 1-based index
        (words, (length words - index) : indexes)
      Nothing ->
        (word : words, ((+1) . length $ words) : indexes)
buildIndexes =
  let
    listOfWords = words "The more I like, the more I love. The more I love, the more I hate."
    (listOfUniqueWords, listOfIndexes) = foldl build ([], []) listOfWords
  in
    (reverse listOfUniqueWords, reverse listOfIndexes)
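Both versions above search the accumulated word list linearly for every word. As a rough sketch of the same accumulate-and-look-up idea, here is a variant that keeps a Data.Map (from the containers package) as the lookup table. The name encodeWords and the use of mapAccumL are my own choices for illustration, not part of the code above.

```haskell
import Data.List (mapAccumL)
import qualified Data.Map.Strict as Map

-- Hypothetical helper: returns the unique words in first-occurrence order
-- together with the 1-based index of every word of the input.
encodeWords :: [String] -> ([String], [Int])
encodeWords ws = (reverse uniques, indexes)
  where
    -- The state is (word -> index map, unique words stored newest-first).
    ((_, uniques), indexes) = mapAccumL step (Map.empty, []) ws
    step (seen, acc) w =
      case Map.lookup w seen of
        Just i -> ((seen, acc), i)
        Nothing ->
          let i = Map.size seen + 1
          in ((Map.insert w i seen, w : acc), i)
```

With the sentence from the answer, encodeWords (words "The more I like, the more I love. The more I love, the more I hate.") should give (["The","more","I","like,","the","love.","love,","hate."],[1,2,3,4,5,2,3,6,1,2,3,7,5,2,3,8]). Punctuation stays attached to the words because words only splits on whitespace.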

I guess Data.Map and Data.Set (from the containers package) are ideal tools to solve this job efficiently. My implementation would be as follows:
import qualified Data.Map.Lazy as Map
import qualified Data.Set as Set
encode :: [String] -> ([String],[[Int]])
encode wss =
  let dict = Map.fromList . zip (Set.toList . Set.unions . map (Set.fromList . words) $ wss) $ [1..]
  in (map fst $ Map.toList dict, map (map (flip (Map.findWithDefault 0) dict) . words) wss)
*Main> encode ["Are you allright", "Hey there how are you", "Hello there", "Do you like coffee"]
(["Are","Do","Hello","Hey","allright","are","coffee","how","like","there","you"],[[1,11,5],[4,10,8,6,11],[3,10],[2,11,9,7]])

Related

haskell words from alphabet of a given length

I have this function that generates a list of all words with a minimum length of 0 and a maximum length of n, over an alphabet given as input to the function:
import Data.List
words :: Int -> String -> [String]
words 0 alph = [[]]
words n alph = words (n-1) alph ++ [ ch:w | w <-words (n-1) alph, ch <- alph]
When I run this, the output is following:
> words 3 "AB"
["","A","B","A","B","AA","BA","AB","BB","A","B","AA","BA","AB","BB","AA","BA","AB","BB","AAA","BAA","ABA","BBA","AAB","BAB","ABB","BBB"]
The problem here is that some words are repeated, in this example especially the words of length 2 ("AA" appears 3 times). Can you see what I am doing wrong in my function, or do you have any idea how to solve it?
This is because the words (n-1) alph in the list comprehension does not only yield words of length n-1 but also n-2, n-3, etc., since that is how you defined the words function.
It might be better to make a helper function that only generates words of length n and then use that in an extra function that constructs strings with lengths up to n:
words :: Int -> String -> [String]
words 0 alph = [[]]
words n alph = [ ch:w | w <-words (n-1) alph, ch <- alph]
wordsUpTo :: Int -> String -> [String]
wordsUpTo n alph = concatMap (flip words alph) [0 .. n]
However, a function like this words already exists (and the name clashes with words from the Prelude): it is just a special case of replicateM :: Applicative m => Int -> m a -> m [a], so we can write this as:
import Control.Monad(replicateM)
wordsUpTo :: Int -> String -> [String]
wordsUpTo n alph = [0 .. n] >>= (`replicateM` alph)
which will produce:
Prelude Control.Monad> wordsUpTo 3 "AB"
["","A","B","AA","AB","BA","BB","AAA","AAB","ABA","ABB","BAA","BAB","BBA","BBB"]
The Applicative instance for lists effectively computes a cross-product,
> (,) <$> ["A", "B"] <*> ["C", "D"]
[("A","C"),("A","D"),("B","C"),("B","D")]
elements of which can be joined with (++) instead of (,):
> (++) <$> ["A", "B"] <*> ["C", "D"]
["AC","AD","BC","BD"]
If you repeatedly apply this operation, you'll get the strings you want:
> (++) <$> ["A", "B"] <*> [""] -- base case
["A","B"]
> (++) <$> ["A", "B"] <*> ["A","B"]
["AA","AB","BA","BB"]
> (++) <$> ["A", "B"] <*> ["AA","AB","BA","BB"]
["AAA","AAB","ABA","ABB","BAA","BAB","BBA","BBB"]
The function you want to repeat is the section ((++) <$> ["A", "B"] <*>), which from now on we'll refer to as f:
> f = ((++) <$> ["A", "B"] <*>)
This repeated application is captured by the iterate function, which repeatedly feeds the output of one function application as the input of the next.
> take 3 $ iterate f [""]
[[""],["A","B"],["AA","AB","BA","BB"]]
We'll want to concatenate the results into a single list:
> take 7 $ concat $ iterate f [""]
["","A","B","AA","AB","BA","BB"]
So all combinations is just
allWords alph = concat $ iterate f [""]
  where f = ((++) <$> alph <*>)
To get the elements with some maximum length, we can either
Use takeWhile (\x -> length x <= n), or
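Use take (2^(n+1) - 1) for a two-letter alphabet such as "AB" (given the order in which items are generated, all the strings of a given length occur before longer strings, and we can compute the total number of strings with a given maximum length; for an alphabet of k > 1 letters that number is (k^(n+1) - 1) / (k - 1))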
So we can define either
words n = takeWhile p . allWords
  where p x = length x <= n
or
words n = take n' . allWords
  where n' = 2^(n+1) - 1 -- number of strings of length at most n over a two-letter alphabet
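To make the two cut-off strategies concrete, here is a minimal sketch of both for an alphabet given as a String, as in the question. The names wordsUpToLen and wordsUpToCount are made up for this example, and this allWords uses (:) on single characters instead of (++) on one-letter strings. It assumes a non-empty alphabet: with an empty one, the takeWhile variant would never terminate.

```haskell
-- All words over the alphabet, shortest first (an infinite list for a non-empty alphabet).
allWords :: String -> [String]
allWords alph = concat (iterate step [""])
  where
    -- Prepend every letter to every word of the previous length.
    step ws = (:) <$> alph <*> ws

-- Variant 1: stop at the first word that is longer than n.
wordsUpToLen :: Int -> String -> [String]
wordsUpToLen n = takeWhile ((<= n) . length) . allWords

-- Variant 2: count the words of length 0..n in advance (k^0 + k^1 + ... + k^n).
wordsUpToCount :: Int -> String -> [String]
wordsUpToCount n alph = take total (allWords alph)
  where
    k = length alph
    total = sum [k ^ i | i <- [0 .. n]]
```

For example, both wordsUpToLen 2 "AB" and wordsUpToCount 2 "AB" should give ["","A","B","AA","AB","BA","BB"].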

Haskell - Filtering a list of tuples

Consider this list of tuples:
[(57,48),(58,49),(59,50),(65,56),(65,47),(65,57),(65,49), (41, 11)]
I want to remove a tuple (a, b) if its second element b is equal to the first element of another tuple and all the tuples with the same a that come after it. For example:
The second element of (65,57) is 57, and the first tuple in the list, (57,48), has 57 as its first element, so (65,57) should be removed, along with all tuples that come after it and start with 65, namely (65,49). The tuples that come before it, (65,56) and (65,47), should stay in the list.
Does anyone have an idea how to do this?
For efficiency (single pass), you should create two sets: one for the elements you've seen as the first elements of tuples, the other for the first elements whose tuples must be deleted from now on (i.e. delete a tuple if its first element matches).
Something like,
{-# LANGUAGE PackageImports #-}
import "lens" Control.Lens (contains, (.~), (^.), (&))
import "yjtools" Data.Function.Tools (applyUnless, applyWhen)
import qualified "containers" Data.IntSet as Set
filterTuples :: Foldable t => t (Int, Int) -> [(Int, Int)]
filterTuples = flip (foldr go $ const []) (Set.empty, Set.empty)
  where
    go p@(x,y) go' (fsts, deletes) =
      let seenFst = fsts ^. contains y
          shouldDelete = seenFst || deletes ^. contains x
          fsts' = fsts & contains x .~ True
          -- remember x so that later tuples starting with x are dropped too
          deletes' = deletes & applyWhen seenFst (contains x .~ True)
      in applyUnless shouldDelete (p:) $ go' (fsts', deletes')
EDITs: for correctness, clarity, spine-laziness
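For readers who don't have lens or yjtools installed, the same two-set, single-pass idea can be sketched with only Data.IntSet from containers. This is my own simplified variant rather than the code above: the name filterTuplesPlain is made up, it works on plain lists, and it assumes that only tuples seen earlier in the list count as "another tuple".

```haskell
import qualified Data.IntSet as IntSet

-- Single pass with two sets:
--   fsts: first elements seen so far
--   dels: first elements whose tuples must be dropped from now on
filterTuplesPlain :: [(Int, Int)] -> [(Int, Int)]
filterTuplesPlain = go IntSet.empty IntSet.empty
  where
    go _ _ [] = []
    go fsts dels (p@(x, y) : rest)
      -- y was the first element of an earlier tuple: drop p and blacklist x
      | y `IntSet.member` fsts = go fsts' (IntSet.insert x dels) rest
      -- x was blacklisted by an earlier tuple: drop p
      | x `IntSet.member` dels = go fsts' dels rest
      | otherwise = p : go fsts' dels rest
      where
        fsts' = IntSet.insert x fsts
```

On the list from the question this should return [(57,48),(58,49),(59,50),(65,56),(65,47),(41,11)].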
You could start by creating a distinct set of all the first elements, e.g.:
Prelude Data.List> firsts = nub $ fst <$>
[(57,48),(58,49),(59,50),(65,56),(65,47),
(65,57),(65,49), (41, 11)]
Prelude Data.List> firsts
[57,58,59,65,41]
You could use break or span as Robin Zigmond suggests. You'll need a predicate for that. You could use elem, like this:
Prelude Data.List> elem 48 firsts
False
Prelude Data.List> elem 49 firsts
False
...
Prelude Data.List> elem 57 firsts
True
If you're concerned that elem is too inefficient, you could experiment with creating a Set and use the member function instead.
Perhaps try using mapAccumL starting with the initial list as the accumulator. Then maintain a Predicate as a parameter too which acts as a decider for what has been seen, and this will determine if you can output or not at each step in the traversal.
I'm an absolute beginner in Haskell, so there is probably a much more elegant/efficient solution for this. But anyway, I wanted to share the solution I came up with:
filterTuples :: [(Int, Int)] -> [(Int,Int)]
filterTuples [] = []
filterTuples (x:xs) = x : filterTuples (concat (fst temp : [filter (\z -> fst z /= del) (snd temp)]))
  where
    del = fst (head (snd temp))
    temp = break (\y -> snd y == fst x) xs
(Glad for feedback on how to improve this)
f consumes a list of pairs xs and produces a new list of pairs ys. ys contains every pair (a, b) in xs, except those whose second element b occurs as the first element of a pair in xs. Once such a pair (a, b) is encountered, subsequent pairs that have a as their first element are excluded from ys as well.
import Data.List (nub)
f xs = go xs [] []
  where
    go [] ys zs = ys
    go (x@(a,b):xs) ys zs
      | b `elem` as = go xs ys (a:zs)
      | a `elem` zs = go xs ys zs
      | otherwise = [x] ++ go xs ys zs
    as = (nub . fst . unzip) xs

Haskell powerset sublists with fixed length

It's well known that the powerset of a list:
{1,2,3,4} is {{},{1},{2},{1,2},{3},{1,3},{2,3},{1,2,3},{4},{1,4},{2,4},{1,2,4},{3,4},{1,3,4},{2,3,4},{1,2,3,4}}
the haskell code I got for that problem is:
potencia [] = [[]]
potencia (a:bs) = potencia bs ++ map (a:) (potencia bs)
Now, how would I get a list of sublists of the same length? For example, the list above would generate the following list of sublists of length 3: {{1,2,3},{1,2,4},{1,3,4},{2,3,4}}
I'm a student sorry for my english, thanks in advance... XD
How about
sublists _ 0 = [[]]
sublists [] _ = []
sublists (x:xs) n = sublists xs n ++ map (x:) (sublists xs $ n - 1)
Which is very similar to the code you had but just has two decreasing parameters, the length and the list.
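Also, for more advanced Haskellers (with foldM from Control.Monad and cont and runCont from Control.Monad.Cont in scope),
powerset :: [a] -> [[a]]
powerset = flip runCont id . foldM step [[]]
  where step xs x = cont $ \c -> c xs ++ c (map (x:) xs)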
is a powerset implementation without recursion using continuations. Doing the same with the sublists function is an interesting challenge.
I'm thinking just
import Data.List (subsequences)
subsequencesOf :: Int -> [a] -> [[a]]
subsequencesOf n = filter ((== n) . length) . subsequences
Which will give you
> subsequencesOf 3 [1, 2, 3, 4]
[[1,2,3],[1,2,4],[1,3,4],[2,3,4]]
Although I find it weird that this isn't an operation in Data.Set, and that Set isn't a monad (and therefore doesn't have its own version of replicateM). I guess there might be obstacles in the way there.
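On the Data.Set side, a filter-based sketch is possible as well. It assumes a containers version that provides Set.powerSet (0.5.11 or later, as far as I know), and the name subsetsOfSize is made up for this example. Like the subsequences version, it builds the whole powerset first and then filters, so it is only reasonable for small sets.

```haskell
import qualified Data.Set as Set

-- All subsets with exactly n elements, as a set of sets.
subsetsOfSize :: Ord a => Int -> Set.Set a -> Set.Set (Set.Set a)
subsetsOfSize n = Set.filter ((== n) . Set.size) . Set.powerSet
```

For instance, map Set.toList (Set.toList (subsetsOfSize 3 (Set.fromList [1,2,3,4]))) should give [[1,2,3],[1,2,4],[1,3,4],[2,3,4]].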

Delete list elements by looking on another list

I have two lists. One list contains some random data and other list contains the index of first list which needs to be deleted.
For example, let us consider two lists:
let a = [3,4,5,6,6,7,8]
let b = [1,3]
Then, the resultant output should be [3,5,6,7,8]. The number 4 and 6 are deleted since they are on index positions 1 and 3 respectively.
I'm new to Haskell, so finding it difficult to find the solution.
Update: Following code makes it work
import Data.List
dele :: Eq a => [a] -> [Int] -> [a]
dele [] _ = []
dele x [] = x
dele x (y:ys) = dele (delete (x !! y) x) ys
I was just wondering, is there a way to solve it through map/fold way ?
deleteByIndex :: (Enum a, Eq a, Num a) => [a] -> [b] -> [b]
deleteByIndex r = map snd . filter (\(i, _) -> notElem i r) . zip [0..]
[0..] produces an infinite list [0, 1, 2, 3, ...]
zip constructs a list of pairs with the values of this list and your input list in the form [(0,x), (1, y), ...]
filter takes a function a -> Bool. The lambda checks that the index (first element of the pair) is not in your input list r.
map snd returns the second element of each pair of the zip list.
zip, filter, map and notElem are all documented in the Prelude.
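Since the question also asks about a fold, the same pairing-with-indices idea can be written with foldr. This is only a sketch: the name deleteAt is made up, and putting the indices into a Data.IntSet (from containers) is my own addition so that each membership test doesn't scan the whole index list.

```haskell
import qualified Data.IntSet as IntSet

-- Drop the elements whose 0-based position occurs in idxs.
deleteAt :: [Int] -> [a] -> [a]
deleteAt idxs xs = foldr keep [] (zip [0 ..] xs)
  where
    drops = IntSet.fromList idxs
    -- Skip the element if its index is listed, otherwise keep it.
    keep (i, x) acc
      | i `IntSet.member` drops = acc
      | otherwise = x : acc
```

With the lists from the question, deleteAt [1,3] [3,4,5,6,6,7,8] should give [3,5,6,7,8].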
Off the top of my head:
removeByIndex :: [Integer] -> [a] -> [a]
removeByIndex indices = map snd . filter notInIndices . zip [0..]
where notInIndices (i,_) = i `notElem` indices
An alternative answer using the lens library which has received considerable attention recently
import Control.Lens
>let a = [3,4,5,6,6,7,8]
>let b = [1,3]
>a^..elements (`notElem`b)
[3,5,6,7,8]
(^..) is just the infix form of toListOf, which can be used to traverse a structure and make a list out of its parts. The elements function just lets you choose which ones to include.
Other options are traverse to traverse a Traversable and both to traverse both components of a pair (,); they compose with (.), so traverse.both would traverse [(1,2), (3,4)], for example.
[(1,2), (3,4)]^..traverse.both
[1,2,3,4]

How to find all substrings of a String with start and end indices

I've recently written some Scala code which processes a String, finding all its sub-strings and retaining a list of those which are found in a dictionary. The start and end of the sub-strings within the overall string also have to be retained for later use, so the easiest way to do this seemed to be just to use nested for loops, something like this:
for (i <- 0 until word.length)
  for (j <- i until word.length) {
    val sub = word.substring(i, j + 1)
    // lookup sub in dictionary here and add new match if found
  }
As an exercise, I decided to have a go at doing the same thing in Haskell. It seems straightforward enough without the need for the sub-string indices - I can use something like this approach to get the sub-strings, then call a recursive function to accumulate the matches. But if I want the indices too it seems trickier.
How would I write a function which returns a list containing each continuous sub-string along with its start and end index within the "parent" string?
For example tokens "blah" would give [("b",0,0), ("bl",0,1), ("bla",0,2), ...]
Update
A great selection of answers and plenty of new things to explore. After messing about a bit, I've gone for the first answer, with Daniel's suggestion to allow the use of [0..].
data Token = Token String Int Int
continuousSubSeqs = filter (not . null) . concatMap tails . inits
tokenize xs = map (\(s, l) -> Token s (head l) (last l)) $ zip s ind
  where
    s = continuousSubSeqs xs
    ind = continuousSubSeqs [0..]
This seemed relatively easy to understand, given my limited Haskell knowledge.
import Data.List
continuousSubSeqs = filter (not . null) . concatMap inits . tails
tokens xs = map (\(s, l) -> (s, head l, last l)) $ zip s ind
  where
    s = continuousSubSeqs xs
    ind = continuousSubSeqs [0..length(xs)-1]
Works like this:
tokens "blah"
[("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]
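The nested loops from the question, including the dictionary lookup, also map fairly directly onto a list comprehension. The sketch below assumes the dictionary is a Data.Set of strings (the question doesn't say how it is represented), and the name matches is made up. It uses take and drop for brevity, so each substring is rebuilt from the start of the word.

```haskell
import qualified Data.Set as Set

-- Every contiguous substring of word that occurs in dict,
-- with its inclusive start and end index.
matches :: Set.Set String -> String -> [(String, Int, Int)]
matches dict word =
  [ (sub, i, j)
  | i <- [0 .. length word - 1]             -- outer loop: start index
  , j <- [i .. length word - 1]             -- inner loop: end index
  , let sub = take (j - i + 1) (drop i word)
  , sub `Set.member` dict                   -- dictionary lookup
  ]
```

For example, matches (Set.fromList ["la", "blah", "x"]) "blah" should give [("blah",0,3),("la",1,2)].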
The two nested loops you wrote are an excellent starting point. That is, we can write a function tokens that delegates its work to two recursive functions outer and inner that correspond to your loops:
type Token a = ([a], Int, Int)
tokens :: [a] -> [Token a]
tokens = outer 0
  where
    outer _ [] = []
    outer i l@(_ : xs) = inner i [] l ++ outer (i + 1) xs
      where
        inner _ _ [] = []
        inner j acc (x : xs) =
          (acc ++ [x], i, j) : inner (j + 1) (acc ++ [x]) xs
Here, outer iterates over the string and, for each start position within that string, calls inner to collect all the segments that start at that position together with their end positions.
Although this function meets your requirements,
> tokens "blah"
[("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]
it is quite inefficient due to the repeated list concatenation. A more efficient version would accumulate its results in so-called difference lists:
type Token a = ([a], Int, Int)
tokens :: [a] -> [Token a]
tokens l = outer 0 l []
  where
    outer _ [] = id
    outer i l@(_ : xs) = inner i id l . outer (i + 1) xs
      where
        inner _ _ [] = id
        inner j acc (x : xs) =
          ((acc [x], i, j) :) . inner (j + 1) (acc . (x :)) xs
How to construct the dictionary of course depends on how you choose to represent it. Here's an approach that uses simple ordered association lists,
type Dict a = [([a], [(Int, Int)])]
empty :: Dict a
empty = []
update :: Ord a => Token a -> Dict a -> Dict a
update (xs, i, j) [] = [(xs, [(i, j)])]
update (xs, i, j) ((ys, ns) : dict) = case compare xs ys of
  LT -> (xs, [(i, j)]) : (ys, ns) : dict
  EQ -> (ys, (i, j) : ns) : dict
  GT -> (ys, ns) : update (xs, i, j) dict
toDict :: Ord a => [a] -> Dict a
toDict = foldr update empty . tokens
but as your keys are strings, tries (a.k.a. prefix trees) are probably a better choice.
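To give an idea of what the trie alternative could look like, here is a minimal sketch built on Data.Map from containers. All names in it (Trie, emptyTrie, insertToken, lookupPositions) are made up for illustration; it only mirrors the association-list version by storing the (start, end) positions at the node where a key ends.

```haskell
import qualified Data.Map.Strict as Map

-- A node stores the positions of the key that ends here, plus its children.
data Trie a = Trie [(Int, Int)] (Map.Map a (Trie a))

emptyTrie :: Trie a
emptyTrie = Trie [] Map.empty

-- Insert one token (substring, start, end), walking down one element at a time.
insertToken :: Ord a => ([a], Int, Int) -> Trie a -> Trie a
insertToken ([], i, j) (Trie ps kids) = Trie ((i, j) : ps) kids
insertToken (x : xs, i, j) (Trie ps kids) =
  Trie ps (Map.insert x (insertToken (xs, i, j) child) kids)
  where
    child = Map.findWithDefault emptyTrie x kids

-- All positions recorded for a key, or [] if the key was never inserted.
lookupPositions :: Ord a => [a] -> Trie a -> [(Int, Int)]
lookupPositions [] (Trie ps _) = ps
lookupPositions (x : xs) (Trie _ kids) =
  maybe [] (lookupPositions xs) (Map.lookup x kids)
```

A trie for a whole string could then be built with something like foldr insertToken emptyTrie . tokens, analogous to toDict above.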
If it's efficient substring queries that you're after, I would recommend looking into suffix trees, although their implementation is somewhat involved. You may want to check out
Robert Giegerich and Stefan Kurtz. A comparison of imperative and purely functional suffix tree constructions. Science of Computer Programming 25(2–3):187–218, 1995
and Bryan O'Sullivan's suffixtree package on Hackage.
Another version easier to read left to right, similar to unix pipes
import Data.List
import Control.Category
tokens =
  tailsWithIndex
  >>> concatMap (\(i,str) -> zip (repeat i) (initsWithIndex str))
  >>> map adjust
  where
    tailsWithIndex = tails >>> init >>> zip [0..]
    initsWithIndex = inits >>> tail >>> zip [0..]
    adjust (i, (j, str)) = (str, i, i+j)
Sample run
>tokens "blah"
[("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]
If concatMap is lazy, then the whole computation is lazy and will be efficient, except for the use of Data.List functions instead of raw list access.
My version:
import Data.List
tokens =
  map join . filter (not . null) . concatMap inits . tails . zip [0..]
  where
    join s@((i, _):t) =
      (map snd s, i, foldl' (\_ i -> i) i (map fst t))
main =
  putStrLn $ show $ tokens "blah"
-- [("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]
UPDATE:
import Control.Arrow
...
tokens =
  map join . filter (not . null) . concatMap inits . tails . zip [0..]
  where
    join s = (s', i, j)
      where
        ((i, j), s') = first (head &&& last) $ unzip s
...

Resources