Optimising Haskell data reading from file - haskell

I am trying to implement Kosaraju's graph algorithm, on a 3.5m line file where each row is two (space separated) Ints representing a graph edge. To start I need to create a summary data structure that has the node and lists of its incoming and outgoing edges. The code below achieves that, but takes over a minute, whereas I can see from posts on the MOOC forum that people using other languages are completing in <<10s. (getLines is taking 10s compared to under 1s in benchmarks I read about.)
I'm new to Haskell and have implemented an accumulation method using foldl' (the ' was a breakthrough in making it terminate at all), but it feels rather imperative in style, and I'm hoping that that's the reason why it is running slow. Moreover, I'm currently planning to use a similar pattern to conduct the depth-first-search, and I fear it will all just become too slow.
I have found this presentation and blog that talk about these sorts of issues, but at too expert a level.
import System.IO
import Control.Monad
import Data.Map.Strict as Map
import Data.List as L
type NodeName = Int
type Edges = [NodeName]
type Explored = Bool
data Node = Node Explored (Edges, Edges) deriving (Show)
type Graph1 = Map NodeName Node
getLines :: FilePath -> IO [[Int]]
getLines = liftM (fmap (fmap read . words) . lines) . readFile
getLines' :: FilePath -> IO [(Int,Int)]
getLines' = liftM (fmap (tuplify2 . fmap read . words) . lines) . readFile
tuplify2 :: [a] -> (a,a)
tuplify2 [x,y] = (x,y)
main = do
  list <- getLines "testdata.txt" -- [String]
  --list <- getLines "SCC.txt" -- [String]
  let
    list' = createGraph list
  return list'
createGraph :: [[Int]] -> Graph1
createGraph xs = L.foldl' build Map.empty xs
  where
    build :: Graph1-> [Int] -> Graph1
    build = \acc (x:y:_) ->
      let tmpAcc = case Map.lookup x acc of
            Nothing -> Map.insert x (Node False ([y],[])) acc
            Just a -> Map.adjust (\(Node _ (fwd, bck)) -> (Node False ((y:fwd), bck))) x acc
      in case Map.lookup y tmpAcc of
            Nothing -> Map.insert y (Node False ([],[x])) tmpAcc
            Just a -> Map.adjust (\(Node _ (fwd, bck)) -> (Node False (fwd, (x:bck)))) y tmpAcc

Using maps:
Use IntMap or HashMap when possible. Both are significantly faster for Int keys than Map. HashMap is usually faster than IntMap but uses more RAM and has a less rich library.
Don't do unnecessary lookups. The containers package has a large number of specialized functions. With alter the number of lookups can be halved compared to the createGraph implementation in the question.
Example for createGraph:
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM
type NodeName = Int
type Edges = [NodeName]
type Explored = Bool
data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = IM.IntMap Node
createGraph :: [(Int, Int)] -> Graph1
createGraph xs = foldl' build IM.empty xs
  where
    addFwd y (Just (Node _ f b)) = Just (Node False (y:f) b)
    addFwd y _ = Just (Node False [y] [])
    addBwd x (Just (Node _ f b)) = Just (Node False f (x:b))
    addBwd x _ = Just (Node False [] [x])
    build :: Graph1 -> (Int, Int) -> Graph1
    build acc (x, y) = IM.alter (addBwd x) y $ IM.alter (addFwd y) x acc
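If the Explored flag can live somewhere else, the two adjacency maps can also be built without writing the fold by hand. This is only a sketch of that alternative, not a drop-in replacement for createGraph: Adjacency, forwardEdges and backwardEdges are names made up for the illustration, and nodes without outgoing (or incoming) edges are simply absent from the respective map.

```haskell
import qualified Data.IntMap.Strict as IM

-- Hypothetical type for this sketch: node -> list of neighbours.
type Adjacency = IM.IntMap [Int]

-- fromListWith combines values that share a key. The combining function
-- receives (new, old), so (++) prepends the single new neighbour cheaply.
forwardEdges :: [(Int, Int)] -> Adjacency
forwardEdges edges = IM.fromListWith (++) [ (x, [y]) | (x, y) <- edges ]

-- Same idea with every edge reversed.
backwardEdges :: [(Int, Int)] -> Adjacency
backwardEdges edges = IM.fromListWith (++) [ (y, [x]) | (x, y) <- edges ]
```

Whether this beats the alter version would have to be measured; it mainly trades the Node record for two simpler maps.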
Using vectors:
Consider the efficient construction functions (the accumulators, unfolds, generate, iterate, constructN, etc.). These may use mutation behind the scenes but are considerably more convenient to use than actual mutable vectors.
In the more general case, use the laziness of boxed vectors to enable self-reference when constructing a vector.
Use unboxed vectors when possible.
Use unsafe functions when you're absolutely sure about the bounds.
Only use mutable vectors when there aren't pure alternatives. In that case, prefer the ST monad to IO. Also, avoid creating many mutable heap objects (i.e. prefer mutable vectors to immutable vectors of mutable references).
Example for createGraph:
import qualified Data.Vector as V
type NodeName = Int
type Edges = [NodeName]
type Explored = Bool
data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = V.Vector Node
createGraph :: Int -> [(Int, Int)] -> Graph1
createGraph maxIndex edges = graph''
  where
    -- maxIndex + 1 slots, so that index maxIndex itself is in bounds
    graph   = V.replicate (maxIndex + 1) (Node False [] [])
    graph'  = V.accum (\(Node e f b) x -> Node e (x:f) b) graph edges
    graph'' = V.accum (\(Node e f b) x -> Node e f (x:b)) graph' (map (\(a, b) -> (b, a)) edges)
Note that if there are gaps in the range of the node indices, then it'd be wise to either contiguously relabel the indices before doing anything else, or introduce an empty constructor to Node to signify a missing index.
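The relabelling option could look roughly like this. It is a sketch using only containers; relabel is a name invented here, and it assumes the whole edge list fits in memory. It returns the number of distinct nodes together with the renamed edges, so the new labels are exactly 0 .. n-1.

```haskell
import qualified Data.IntMap.Strict as IM
import qualified Data.IntSet as IS

-- Map every node id that occurs in the edge list to its rank among the
-- distinct ids, and rewrite the edges with the new labels.
relabel :: [(Int, Int)] -> (Int, [(Int, Int)])
relabel edges = (IS.size ids, [ (rename IM.! a, rename IM.! b) | (a, b) <- edges ])
  where
    ids    = IS.fromList (concat [ [a, b] | (a, b) <- edges ])
    rename = IM.fromDistinctAscList (zip (IS.toAscList ids) [0 ..])
```

For example, relabel [(10,30),(30,20)] gives (3, [(0,2),(2,1)]). To report results in terms of the original ids, the inverse mapping (IS.toAscList ids, indexed by new label) would have to be kept as well.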
Faster I/O:
Use the IO functions from Data.Text or Data.ByteString. In both cases there are also efficient functions for breaking input into lines or words.
Example:
import qualified Data.ByteString.Char8 as BS
import System.IO
getLines :: FilePath -> IO [(Int, Int)]
getLines path = do
  lines <- (map BS.words . BS.lines) `fmap` BS.readFile path
  let pairs = (map . map) (maybe (error "can't read Int") fst . BS.readInt) lines
  return [(a, b) | [a, b] <- pairs]
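A variation on the same idea, as a sketch only: BS.readInt already returns the unparsed remainder of the input, so the second number can be read straight from it without going through BS.words. parseEdge and parseEdges are names invented here, and unlike getLines above, malformed lines are skipped instead of raising an error.

```haskell
import qualified Data.ByteString.Char8 as BS
import Data.Maybe (mapMaybe)

-- Parse one "a b" line. readInt does not skip leading whitespace, so the
-- separator is dropped by hand before reading the second number.
parseEdge :: BS.ByteString -> Maybe (Int, Int)
parseEdge line = do
  (a, rest) <- BS.readInt line
  (b, _)    <- BS.readInt (BS.dropWhile (== ' ') rest)
  return (a, b)

-- Lines that do not start with two integers are silently skipped.
parseEdges :: BS.ByteString -> [(Int, Int)]
parseEdges = mapMaybe parseEdge . BS.lines
```

It would be used as parseEdges `fmap` BS.readFile path. Because the parser is a pure function on a ByteString, it can be tried in GHCi on a small literal before pointing it at the 3.5m line file.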
Benchmarking:
Always do it, unlike me in this answer. Use criterion.

Based pretty much on András' suggestions, I've reduced a 113 second task down to 24 seconds (measured by stopwatch, as I can't quite get Criterion to do anything yet), and then down to 10 by compiling with -O2! I've attended some courses this last year that talked about the challenge of optimising for large datasets, but this was the first time I faced a question that actually involved one, and it was as non-trivial as my instructors suggested. This is what I have now:
import System.IO
import Control.Monad
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM
import qualified Data.ByteString.Char8 as BS
import Criterion.Main (defaultMain, bgroup, bench, whnfIO)
type NodeName = Int
type Edges = [NodeName]
type Explored = Bool
data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = IM.IntMap Node
-- DFS uses a stack to store next points to explore, a list can do this
type Stack = [(NodeName, NodeName)]
getBytes :: FilePath -> IO [(Int, Int)]
getBytes path = do
  lines <- (map BS.words . BS.lines) `fmap` BS.readFile path
  let
    pairs = (map . map) (maybe (error "Can't read integers") fst . BS.readInt) lines
  return [(a,b) | [a,b] <- pairs]
main = do
  --list <- getLines' "testdata.txt" -- [String]
  list <- getBytes "SCC.txt" -- [(Int, Int)]
  let list' = createGraph' list
  putStrLn $ show $ list' IM.! 66
  -- return list'
-- Criterion entry point; call it from main to run the benchmark
bmark = defaultMain [
    bgroup "1" [
        -- whnfIO runs the action; whnf would only build the IO value without executing it
        bench "Sim test" $ whnfIO (bmark' "SCC.txt")
      ]
  ]
bmark' :: FilePath -> IO ()
bmark' path = do
  list <- getBytes path
  let
    list' = createGraph' list
  putStrLn $ show $ list' IM.! 2
createGraph' :: [(Int, Int)] -> Graph1
createGraph' xs = foldl' build IM.empty xs
  where
    addFwd y (Just (Node _ f b)) = Just (Node False (y:f) b)
    addFwd y _ = Just (Node False [y] [])
    addBwd x (Just (Node _ f b)) = Just (Node False f (x:b))
    addBwd x _ = Just (Node False [] [x])
    build :: Graph1 -> (Int, Int) -> Graph1
    build acc (x, y) = IM.alter (addBwd x) y $ IM.alter (addFwd y) x acc
And now on with the rest of the exercise....

This is not really an answer; I would rather comment on András Kovács' post, if I had those 50 points...
I have implemented the loading of the graph in both IntMap and MVector, in an attempt to benchmark mutability vs. immutability.
Both programs use Attoparsec for the parsing. There is surely a more economical way to do it, but Attoparsec is relatively fast given its high level of abstraction (the parser fits on one line). The guideline is to avoid String and read. read is partial and slow, [Char] is slow and not memory efficient, unless properly fused.
As András Kovács noted, IntMap is better than Map for Int keys. My code provides another example of alter usage. If the node identifier mapping is dense, you may also want to use Vector or Array. They allow O(1) indexing by the identifier.
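For the dense case, here is a minimal sketch with Data.Array from the array package. It is not the code from the Gist: outEdges and successors are names invented here, and it assumes every identifier lies in 0 .. maxNode.

```haskell
import Data.Array (Array, accumArray, (!))

-- Outgoing edges for every node in 0 .. maxNode, built in one pass.
-- Every source node must lie inside the bounds, otherwise accumArray fails.
outEdges :: Int -> [(Int, Int)] -> Array Int [Int]
outEdges maxNode edges = accumArray (flip (:)) [] (0, maxNode) edges

-- O(1) lookup of a node's successors by identifier.
successors :: Array Int [Int] -> Int -> [Int]
successors graph node = graph ! node
```

The incoming edges can be built the same way from the swapped pairs. The array is boxed and lazy in its elements, so for very large inputs an unboxed or mutable representation may still be preferable.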
The mutable version grows the MVector exponentially on demand. This avoids having to specify an upper bound on node identifiers, but introduces more complexity (the reference to the vector may change).
I benchmarked with a file of 5M edges with identifiers in the range [0..2^16]. The MVector version is ~2x faster than the IntMap code (12s vs 25s on my computer).
The code is here [Gist].
I will edit when more profiling is done on my side.

Related

Get all string splits

Say I have a string:
"abc7de7f77ghij7"
I want to split it by a substring, 7 in this case, and get all the left-right splits:
[ ("abc", "de7f77ghij7")
, ("abc7de", "f77ghij7")
, ("abc7de7f", "7ghij7")
, ("abc7de7f7", "ghij7")
, ("abc7de7f77ghij", "")
]
Sample implementation:
{-# LANGUAGE OverloadedStrings #-}
module StrSplits where
import qualified Data.Text as T
splits :: T.Text -> T.Text -> [(T.Text, T.Text)]
splits d s =
  let run a l r =
        case T.breakOn d r of
          (x, "") -> reverse a
          (x, y) ->
            let
              rn = T.drop (T.length d) y
              an = (T.append l x, rn) : a
              ln = l `T.append` x `T.append` d
            in run an ln rn
  in run [] "" s
main = do
  print $ splits "7" "abc7de7f77ghij7"
  print $ splits "8" "abc7de7f77ghij7"
with expected result:
[("abc","de7f77ghij7"),("abc7de","f77ghij7"),("abc7de7f","7ghij7"),("abc7de7f7","ghij7"),("abc7de7f77ghij","")]
[]
I'm not too happy about the manual recursion and let/case/let nesting. If my feeling that it doesn't look too good is right, is there a better way to write it?
Is there a generalized approach to solving these kinds of problems in Haskell similar to how recursion can be replaced with fmap and folds?
How about this?
import Data.Bifunctor (bimap)
import qualified Data.Text as T
splits' :: T.Text -> T.Text -> [(T.Text, T.Text)]
splits' delimiter string = mkSplit <$> [1..numSplits]
  where
    sections = T.splitOn delimiter string
    numSplits = length sections - 1
    mkSplit n = bimap (T.intercalate delimiter) (T.intercalate delimiter) $ splitAt n sections
I like to believe there's a way that doesn't involve indices, but you get the general idea. First split the string by the delimiter. Then split that list of strings in two at every possible position, rejoining each side with the delimiter.
Not the most efficient, though. You can probably do something similar with indices from Data.Text.Internal.Search if you want it to be fast. In this case, you wouldn't need to do the additional rejoining. I didn't experiment with it since I didn't understand what the function was returning.
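A possibly simpler route than Data.Text.Internal.Search: Data.Text itself exports breakOnAll, which returns every (prefix, rest) pair in which rest begins with the delimiter. Dropping the delimiter from rest gives the requested pairs with no manual recursion or rejoining. A minimal sketch (splitsAll is a name invented here, and I have not benchmarked it):

```haskell
import qualified Data.Text as T

-- Each pair from breakOnAll has the delimiter at the start of its second
-- component; T.drop removes it. Like breakOnAll itself, this raises an
-- error when the delimiter is empty.
splitsAll :: T.Text -> T.Text -> [(T.Text, T.Text)]
splitsAll d s = [ (l, T.drop (T.length d) r) | (l, r) <- T.breakOnAll d s ]
```

On the question's input with delimiter "7" this yields the five expected pairs, and with delimiter "8" it yields the empty list.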
Here's an indexless one.
import Data.List (isPrefixOf, unfoldr)
type ListZipper a = ([a],[a])
moveRight :: ListZipper a -> Maybe (ListZipper a)
moveRight (_, []) = Nothing
moveRight (ls, r:rs) = Just (r:ls, rs)
-- As Data.List.iterate, but generates a finite list ended by Nothing.
unfoldr' :: (a -> Maybe a) -> a -> [a]
unfoldr' f = unfoldr (\x -> (,) x <$> f x)
-- Get all ways to split a list with nonempty suffix
-- Prefix is reversed for efficiency
-- [1,2,3] -> [([],[1,2,3]), ([1],[2,3]), ([2,1],[3])]
splits :: [a] -> [([a],[a])]
splits xs = unfoldr' moveRight ([], xs)
-- This is the function you want.
splitsOn :: (Eq a) => [a] -> [a] -> [([a],[a])]
splitsOn sub xs = [(reverse l, drop (length sub) r) | (l, r) <- splits xs, sub `isPrefixOf` r]
Basically, traverse a list zipper to come up with a list of candidates for the split. Keep only those that are indeed splits on the desired item, then (un)reverse the prefix portion of each passing candidate.

How to randomly shuffle a list

I have a random number generator
rand :: Int -> Int -> IO Int
rand low high = getStdRandom (randomR (low,high))
and a helper function to remove an element from a list
removeItem _ [] = []
removeItem x (y:ys) | x == y = removeItem x ys
                    | otherwise = y : removeItem x ys
I want to shuffle a given list by randomly picking an item from the list, removing it and adding it to the front of the list. I tried
shuffleList :: [a] -> IO [a]
shuffleList [] = []
shuffleList l = do
  y <- rand 0 (length l)
  return( y:(shuffleList (removeItem y l) ) )
But can't get it to work. I get
hw05.hs:25:33: error:
* Couldn't match expected type `[Int]' with actual type `IO [Int]'
* In the second argument of `(:)', namely
....
Any ideas?
Thanks!
Since shuffleList :: [a] -> IO [a], we have shuffleList (xs :: [a]) :: IO [a].
Obviously, we can't cons (:) :: a -> [a] -> [a] an a element onto an IO [a] value, but instead we want to cons it onto the list [a], the computation of which that IO [a] value describes:
do
  y <- rand 0 (length l)
  -- return ( y : (shuffleList (removeItem y l) ) )
  shuffled <- shuffleList (removeItem y l)
  return (y : shuffled)
In do notation, values to the right of <- have types M a, M b, etc., for some monad M (here, IO), and values to the left of <- have the corresponding types a, b, etc.
The x :: a in x <- mx gets bound to the pure value of type a produced / computed by the M-type computation which the value mx :: M a denotes, when that computation is actually performed, as a part of the combined computation represented by the whole do block, when that combined computation is performed as a whole.
And if e.g. the next line in that do block is y <- foo x, it means that a pure function foo :: a -> M b is applied to x and the result is calculated which is a value of type M b, denoting an M-type computation which then runs and produces / computes a pure value of type b to which the name y is then bound.
The essence of Monad is thus this slicing of the pure inside / between the (potentially) impure, it is these two timelines going on of the pure calculations and the potentially impure computations, with the pure world safely separated and isolated from the impurities of the real world. Or seen from the other side, the pure code being run by the real impure code interacting with the real world (in case M is IO). Which is what computer programs must do, after all.
Your removeItem is wrong. You should pick and remove items positionally, i.e. by index, not by value; and in any case not remove more than one item after having picked one item from the list.
The y in y <- rand 0 (length l) is indeed an index. Treat it as such. Rename it to i, too, as a simple mnemonic.
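Picking and removing by index could be sketched like this. removeAt is a name invented here, not part of the question's code. Note also that randomR is inclusive at both ends, so the index should be drawn from 0 to length l - 1.

```haskell
-- Remove the element at a zero-based index, returning it together with the
-- remaining list; Nothing when the index is out of range.
removeAt :: Int -> [a] -> Maybe (a, [a])
removeAt i xs
  | i < 0     = Nothing
  | otherwise = case splitAt i xs of
      (before, x : after) -> Just (x, before ++ after)
      _                   -> Nothing
```

Exactly one element is removed, and no Eq constraint is needed on the element type.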
Generally, with Haskell it works better to maximize the amount of functional code at the expense of non-functional (IO or randomness-related) code.
In your situation, your “maximum” functional component is not removeItem but rather a version of shuffleList that takes the input list and (as mentioned by Will Ness) a deterministic integer position. List function splitAt :: Int -> [a] -> ([a], [a]) can come in handy here. Like this:
funcShuffleList :: Int -> [a] -> [a]
funcShuffleList _ [] = []
funcShuffleList pos ls =
  if (pos <=0) || (length(take (pos+1) ls) < (pos+1))
    then ls -- pos is zero or out of bounds, so leave list unchanged
    else let (left,right) = splitAt pos ls
         in (head right) : (left ++ (tail right))
Testing:
λ>
λ> funcShuffleList 4 [0,1,2,3,4,5,6,7,8,9]
[4,0,1,2,3,5,6,7,8,9]
λ>
λ> funcShuffleList 5 "#ABCDEFGH"
"E#ABCDFGH"
λ>
Once you've got this, you can introduce randomness concerns in a simpler fashion. And you do not need to involve IO explicitly, as any randomness-friendly monad will do:
import Control.Monad.Random (MonadRandom, getRandomR)
shuffleList :: MonadRandom mr => [a] -> mr [a]
shuffleList [] = return []
shuffleList ls =
  do
    let maxPos = (length ls) - 1
    pos <- getRandomR (0, maxPos)
    return (funcShuffleList pos ls)
... IO being just one instance of MonadRandom.
You can run the code using the default IO-hosted random number generator:
main = do
  let inpList = [0,1,2,3,4,5,6,7,8]::[Integer]
  putStrLn $ "inpList = " ++ (show inpList)
  -- mr automatically instantiated to IO:
  outList1 <- shuffleList inpList
  putStrLn $ "outList1 = " ++ (show outList1)
  outList2 <- shuffleList outList1
  putStrLn $ "outList2 = " ++ (show outList2)
Program output:
$ pickShuffle
inpList = [0,1,2,3,4,5,6,7,8]
outList1 = [6,0,1,2,3,4,5,7,8]
outList2 = [8,6,0,1,2,3,4,5,7]
$
$ pickShuffle
inpList = [0,1,2,3,4,5,6,7,8]
outList1 = [4,0,1,2,3,5,6,7,8]
outList2 = [2,4,0,1,3,5,6,7,8]
$
The output is not reproducible here, because the default generator is seeded by its launch time in nanoseconds.
If what you need is a full random permutation, you could have a look at the Knuth shuffle, a.k.a. the Fisher-Yates algorithm.
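In the same spirit of keeping the functional part pure, a full permutation can be written as a function of a list of picks, with the randomness supplied from outside. This is a list-based sketch of the Fisher-Yates selection step, not a tuned implementation: shuffleWith is a name invented here, it is quadratic in the list length, and it is only uniform if the k-th pick is drawn uniformly from the positions still remaining.

```haskell
-- Build a permutation from a list of picks. Each pick is reduced modulo the
-- number of remaining items, which only guards against out-of-range values.
shuffleWith :: [Int] -> [a] -> [a]
shuffleWith _ [] = []
shuffleWith [] xs = xs  -- ran out of picks: keep the rest in order
shuffleWith (p : ps) xs =
  case splitAt (p `mod` length xs) xs of
    (before, x : after) -> x : shuffleWith ps (before ++ after)
    (_, [])             -> xs  -- unreachable: the index is always in range
```

For instance, shuffleWith [2,0,0] "abc" picks 'c', then 'a', then 'b'. In a MonadRandom setting the picks would come from repeated getRandomR calls with a shrinking upper bound.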

How to memoize the repeated subtrees of a game tree (a potentially infinite rose tree)?

I am attempting to implement the Negamax algorithm in Haskell.
For this, I am representing the future possibilities a game might take in a rose tree (Data.Tree.Forest (depth, move, position)). However, often there are positions that can be reached with two different sequences of moves. It is a waste (and quickly becomes very slow) to re-evaluate (the subtrees of) repeated positions.
Here is what I tried so far:
Implement a variant of Tying the Knot to share common sub-results. However, I have only been able to find explanations of tying the knot for (potentially infinite) lists, and nothing about re-using subtrees.
Another approach I have considered was to build a tree inside the State monad, where the state to keep would be a Map (depth, position) (Forest (depth, move, position)) to perform explicit memoization but I have so far not been able to set this up properly either.
I think that both approaches might have the problem that a game tree can only be built in a corecursive way: We do not build the tree up to the root from the leaves, but build a (potentially infinite) tree lazily from the root down.
EDIT: To give you an example of the code I am currently using (that is too slow):
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE FlexibleContexts #-}
module ZeroSumGame where
import qualified Control.Arrow
import Data.Tree
import Numeric.Natural (Natural)
(|>) :: a -> (a -> b) -> b
x |> f = f x
infixl 0 |>
{-# INLINE (|>) #-}
class Ord s => Game s where
  data Move s
  initial :: s -- ^ Beginning of the game
  applyMove :: Natural -> s -> Move s -> s -- ^ Moving from one game state to the next
  possibleMoves :: Natural -> s -> [Move s] -- ^ Lists moves the current player is able to do.
  isGameOver :: s -> Bool -- ^ True if the game has ended. TODO: Maybe write default implementation using `possibleMoves state == []`?
  scorePosition :: Natural -> Move s -> s -> Int -- ^ Turns a position in an integer, for the Negamax algorithm to decide which position is the best.
type Trimove state = (Natural, Move state, state) -- ^ Depth since start of game, move to next position, new position
gameforest :: Game s => Natural -> s -> Forest (Trimove s)
gameforest start_depth start_state = unfoldForest buildNode (nextpositions start_depth start_state)
  where
    buildNode (depth, move, current_state) =
      if
        isGameOver current_state
      then
        ((depth, move, current_state), [])
      else
        ((depth, move, current_state), nextpositions depth current_state)
    nextpositions depth current_state =
      current_state
        |> possibleMoves depth
        |> fmap (\move -> (succ depth, move, applyMove depth current_state move))
scoreTree :: Game s => Ord (Move s) => Natural -> Tree (Trimove s) -> (Move s, Int)
scoreTree depth node =
  case (depth, subForest node) of
    (0, _) ->
      node |> rootLabel |> uncurry3dropFirst scorePosition
    (_, []) ->
      node |> rootLabel |> uncurry3dropFirst scorePosition
    (_, children) ->
      children
        |> scoreForest (pred depth)
        |> map (Control.Arrow.second negate)
        |> maximum
uncurry3dropFirst :: (a -> b -> c -> d) -> (a, b, c) -> (b, d)
uncurry3dropFirst fun (a, b, c) = (b, fun a b c)
scoreForest :: Game s => Ord (Move s) => Natural -> Forest (Trimove s) -> [(Move s, Int)]
scoreForest depth forest =
  forest
    |> fmap (scoreTree depth)
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE FlexibleContexts #-}
module TicTacToe where
import qualified Control.Monad.State
import Control.Monad.State (State)
import qualified Data.Map
import Data.Map (Map)
import qualified Control.Arrow
import Data.Tree
import Data.Array (Array)
import qualified Data.Array
import qualified Data.Maybe
import qualified Data.Foldable
import Numeric.Natural (Natural)
import ZeroSumGame
data CurrentPlayer = First | Second
  deriving (Eq, Ord, Show)
instance Enum CurrentPlayer where
  fromEnum First = 1
  fromEnum Second = -1
  toEnum 1 = First
  toEnum (-1) = Second
  toEnum _ = error "Improper player"
newtype TicTacToe = TicTacToe (Array (Int, Int) (Maybe CurrentPlayer))
  deriving (Eq, Ord)
instance Game TicTacToe where
  data Move TicTacToe = TicTacToeMove (Int, Int)
    deriving (Eq, Ord, Show, Bounded)
  initial = TicTacToe initialTicTacToeBoard
  possibleMoves _depth = possibleTicTacToeMoves
  applyMove depth (TicTacToe board) (TicTacToeMove (x, y)) =
    TicTacToe newboard
    where
      newboard = board Data.Array.// [((x, y), Just player)]
      player = case depth `mod` 2 of
        0 -> First
        _ -> Second
  isGameOver state = Data.Maybe.isJust (findFilledLines state)
  scorePosition _ _ state =
    state
      |> findFilledLines
      |> fmap fromEnum
      |> Data.Maybe.fromMaybe 0
      |> (* (-10000))
findFilledLines :: TicTacToe -> Maybe CurrentPlayer
findFilledLines (TicTacToe board) =
  (rows ++ columns ++ diagonals)
    |> map winner
    |> Data.Foldable.asum
  where
    rows = vals rows_indexes
    columns = vals columns_indexes
    diagonals = vals diagonals_indexes
    rows_indexes = [[(i, j) | i <- [0..2]]| j <- [0..2]]
    columns_indexes = [[(i, j) | j <- [0..2]]| i <- [0..2]]
    -- two diagonals of three squares each (six one-element lists would never match winner)
    diagonals_indexes = [[(i, i) | i <- [0..2]], [(i, 2 - i) | i <- [0..2]]]
    vals = map (map (\index -> board Data.Array.! index))
winner :: Eq a => [Maybe a] -> Maybe a
winner [x,y,z] =
  if x == y && x == z then x else Nothing
winner _ = Nothing
initialTicTacToeBoard :: (Array (Int, Int) (Maybe CurrentPlayer))
initialTicTacToeBoard =
  Data.Array.array ((0, 0), (2, 2)) [((i, j), Nothing) | i <- [0..2], j <- [0..2]]
possibleTicTacToeMoves :: TicTacToe -> [Move TicTacToe]
possibleTicTacToeMoves (TicTacToe board) = foldr checkSquareForMove [] (Data.Array.assocs board)
  where
    checkSquareForMove (index, val) acc = case val of
      Nothing -> TicTacToeMove index : acc
      Just _ -> acc
printBoard :: TicTacToe -> String
printBoard (TicTacToe board) =
  unlines [unwords [showTile (board Data.Array.! (y, x)) | x <- [0..2]] | y <- [0..2]]
  where
    showTile loc =
      case loc of
        Nothing -> " "
        Just Second -> "X"
        Just First -> "O"
(TypeFamilies is used to allow each Game implementation to have its own notion of a Move, and FlexibleContexts is then required to enforce that Move s implements Ord.)
Problem reformulation
If I understand the question correctly, you have a function that returns the possible next moves in a game, and one to take that move:
start :: Position
moves :: Position -> [Move]
act :: Position -> Move -> Position
and now you want to build the infinite tree of states (please allow me to ignore the Depth field, for simplicity. If you consider the depth counter as part of the Position type, you see that no generality is lost here):
states :: Forest (Position, Move)
states = forest start
forest :: Position -> Forest (Position, Move)
forest p = [ Node (p', m) (forest p') | m <- moves p, let p' = act p m ]
but you want to achieve that in a way that identical subtrees of forest are shared.
Towards Memoization
The general technique here is that we want to memoize forest: this way, for identical Positions, we get shared subtrees. So the recipe is:
forest :: Position -> Forest (Position, Move)
forest = memo forest'
forest' :: Position -> Forest (Position, Move)
forest' p = [ Node (p', m) (forest p') | m <- moves p, let p' = act p m ]
And we need a suitable memo-function:
memo :: (Position -> a) -> (Position -> a)
At this point, we need to know more about Position in order to know how to implement that using an equivalent of the “lazy list” trick… But you see that you do not need to memoize functions that involve Rose trees.
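To make the “lazy list” trick concrete, here is a minimal sketch under a strong assumption that is not in the question: that positions can be encoded as non-negative Ints. memoInt, paths and paths' are names invented for the illustration, and paths stands in for forest with a toy recursion.

```haskell
-- Memoise a function on non-negative Ints by storing all results in a lazy
-- list: each entry is computed at most once, on first demand.
memoInt :: (Int -> a) -> (Int -> a)
memoInt f = (table !!)
  where table = map f [0 ..]

-- Toy stand-in for forest/forest': the recursive calls go through the
-- memoised 'paths', so repeated arguments share one result.
paths :: Int -> Integer
paths = memoInt paths'

paths' :: Int -> Integer
paths' 0 = 1
paths' 1 = 1
paths' n = paths (n - 1) + paths (n - 2)
```

Lookup in the list is linear in the key, so for a real Position type a trie or a lazily populated map keyed on the position would be the more realistic memo structure; the shape forest = memo forest' stays the same.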
I would try to do this by normalizing board positions based on some "canonical" sequence of moves to reach that position. Then each child is assigned the value of traversing its individual normalized sequence through the tree. (no code because I'm on my phone and this is a big task.)
How well this works depends on the ease of calculating normalized move sequences in the game you're playing. But it's a way to introduce sharing by tying the knot, making use of a shared reference to the root of the game tree. Maybe it will serve as inspiration for other ideas that fit your specific case.

Intersperse values into separate Vectors using generate

I am trying to generate a tuple of Vectors by using a function that creates a custom data type (or a tuple) of values from an index. Here is an approach that achieves the desired result:
import Prelude hiding (map, unzip)
import Data.Vector hiding (map)
import Data.Array.Repa
import Data.Functor.Identity
data Foo = Foo {fooX :: Int, fooY :: Int}
unfoo :: Foo -> (Int, Int)
unfoo (Foo x y) = (x, y)
make :: Int -> (Int -> Foo) -> (Vector Int, Vector Int)
make n f = unzip $ generate n getElt where
  getElt i = unfoo $ f i
Except that I would like to do it in a single iteration per Vector, almost like it is shown below, but avoiding multiple evaluation of function f:
make' :: Int -> (Int -> Foo) -> (Vector Int, Vector Int)
make' n f = (generate n getElt1, generate n getElt2) where
  getElt1 i = fooX $ f i
  getElt2 i = fooY $ f i
Just as a note, I understand that the Vector library supports fusion, and the first example is already pretty efficient. I need a solution for the generate concept in general: other libraries have very similar constructors (Repa has fromFunction, for example), and I am using Vectors here simply to demonstrate the problem.
Maybe some sort of memoizing of f function call would work, but I cannot think of anything.
Edit:
Another demonstration of the problem using Repa:
makeR :: Int -> (Int -> Foo) -> (Array U DIM1 Int, Array U DIM1 Int)
makeR n f = runIdentity $ do
  let arr = fromFunction (Z :. n) (\ (Z :. i) -> unfoo $ f i)
  arr1 <- computeP $ map fst arr
  arr2 <- computeP $ map snd arr
  return (arr1, arr2)
Same as with vectors, fusion saves the day on performance, but an intermediate array arr of tuples is still required, which I am trying to avoid.
Edit 2: (3 years later)
In the Repa example above no intermediate array is created, since fromFunction creates a delayed array. Instead it is even worse: f is evaluated twice for each index, once for the first array and a second time for the second array. The delayed array must be computed in order to avoid such duplication of work.
Looking back at my own question from a few years ago, I can now easily show what I was trying to do back then and how to get it done.
In short, it can't be done purely, therefore we need to resort to ST monad and manual mutation of two vectors, but in the end we do get this nice and pure function that creates only two vectors and does not rely on fusion.
import Control.Monad.ST
import Data.Vector.Primitive
import Data.Vector.Primitive.Mutable
data Foo = Foo {fooX :: Int, fooY :: Int}
make :: Int -> (Int -> Foo) -> (Vector Int, Vector Int)
make n f = runST $ do
  let n' = max 0 n
  mv1 <- new n'
  mv2 <- new n'
  let fillVectors i
        | i < n' = let Foo x y = f i
                   in write mv1 i x >> write mv2 i y >> fillVectors (i + 1)
        | otherwise = return ()
  fillVectors 0
  v1 <- unsafeFreeze mv1
  v2 <- unsafeFreeze mv2
  return (v1, v2)
And then we use it in a similar fashion to generate:
λ> make 10 (\ i -> Foo (i + i) (i * i))
([0,2,4,6,8,10,12,14,16,18],[0,1,4,9,16,25,36,49,64,81])
The essential thing you're trying to write is
splat f = unzip . fmap f
which shares the results of evaluating f between the two result vectors, but you want to avoid the intermediate vector. Unfortunately, I'm pretty sure you can't have it both ways in any meaningful sense. Consider a vector of length 1 for simplicity. In order for the result vectors to share the result of f (v ! 0), each will need a reference to a thunk representing that result. Well, that thunk has to be somewhere, and it really might as well be in a vector.

Why does my Haskell program end with an out of memory error?

I'm trying to write a Haskell program to parse a huge text file (about 14 GB), but I can't understand how to make it free unused data from memory, or how to avoid a stack overflow during foldr. Here is the program source:
import qualified Data.ByteString.Lazy.Char8 as LBS
import qualified Data.ByteString.Lex.Lazy.Double as BD
import System.Environment
data Vertex =
  Vertex{
    vertexX :: Double,
    vertexY :: Double,
    vertexZ :: Double}
  deriving (Eq, Show, Read)
data Extent =
  Extent{
    extentMax :: Vertex,
    extentMin :: Vertex}
  deriving (Eq, Show, Read)
addToExtent :: Extent -> Vertex -> Extent
addToExtent ext vert = Extent vertMax vertMin where
  (vertMin, vertMax) = (makeCmpVert max (extentMax ext) vert, makeCmpVert min (extentMin ext) vert) where
    makeCmpVert f v1 v2 = Vertex(f (vertexX v1) (vertexX v2))
                                (f (vertexY v1) (vertexY v2))
                                (f (vertexZ v1) (vertexZ v2))
readCoord :: LBS.ByteString -> Double
readCoord l = case BD.readDouble l of
  Nothing -> 0
  Just (value, _) -> value
readCoords :: LBS.ByteString -> [Double]
readCoords l | LBS.length l == 0 = []
             | otherwise = let coordWords = LBS.split ' ' l
                           in map readCoord coordWords
parseLine :: LBS.ByteString -> Vertex
parseLine line = Vertex (head coords) (coords!!1) (coords!!2) where
  coords = readCoords line
processLines :: [LBS.ByteString] -> Extent -> Extent
processLines strs ext = foldr (\x y -> addToExtent y (parseLine x)) ext strs
processFile :: String -> IO()
processFile name = do
  putStrLn name
  content <- LBS.readFile name
  let (countLine:recordsLines) = LBS.lines content
  case LBS.readInt countLine of
    Nothing -> putStrLn "Can't read records count"
    Just (recordsCount, _) -> do
      print recordsCount
      let vert = parseLine (head recordsLines)
      let ext = Extent vert vert
      print $ processLines recordsLines ext
main :: IO()
main = do
  args <- getArgs
  case args of
    [] -> do
      putStrLn "Missing file path"
    xs -> do
      processFile (head xs)
      return()
Text file contains lines with three floating point numbers delimited with space character. This program always tries to occupy all free memory on a computer and crashes with out of memory error.
You are being too lazy. Vertex and Extent have non-strict fields, and all your functions returning a Vertex return
Vertex thunk1 thunk2
without forcing the components to be evaluated. Also addToExtent directly returns an
Extent thunk1 thunk2
without evaluating the components.
Thus none of the ByteStrings actually is released early to be garbage-collected, since the Doubles are not parsed from them yet.
When that is fixed by making the fields of Vertex and Extent strict, or by making the functions that return a Vertex or an Extent force all parts of their input, you have the problem that
processLines strs ext = foldr (\x y -> addToExtent y (parseLine x)) ext strs
can't start assembling the result before the end of the list of lines is reached because then
(\x y -> addToExtent y (parseLine x))
is strict in its second argument.
However, barring NaNs and undefined values, if I didn't miss something, the result would be the same if you use a (strict!) left fold, so
processLines strs ext = foldl' (\x y -> addToExtent x (parseLine y)) ext strs
should produce the desired result without holding on to the data if Vertex and Extent get strict fields.
Ah, I did miss something:
addToExtent ext vert = Extent vertMax vertMin
  where
    (vertMin, vertMax) = (makeCmpVert max (extentMax ext) vert, makeCmpVert min (extentMin ext) vert)
If that isn't a typo (which I expect it is), fixing that would be somewhat difficult.
I think it should be
    (vertMax, vertMin) = ...
addToExtent is too lazy. A possible alternative definition is
addToExtent :: Extent -> Vertex -> Extent
addToExtent ext vert = vertMax `seq` vertMin `seq` Extent vertMax vertMin where
  -- max goes with vertMax and min with vertMin (the question's code had the two names swapped)
  (vertMax, vertMin) = (makeCmpVert max (extentMax ext) vert, makeCmpVert min (extentMin ext) vert) where
    makeCmpVert f v1 v2 = Vertex(f (vertexX v1) (vertexX v2))
                                (f (vertexY v1) (vertexY v2))
                                (f (vertexZ v1) (vertexZ v2))
data Vertex =
  Vertex{
    vertexX :: {-# UNPACK #-} !Double,
    vertexY :: {-# UNPACK #-} !Double,
    vertexZ :: {-# UNPACK #-} !Double}
  deriving (Eq, Show, Read)
The problem is that vertMin and vertMax are never evaluated until the entire file is processed, resulting in two huge thunks in Extent.
I also recommend changing the definition of Extent to
data Extent =
  Extent{
    extentMax :: !Vertex,
    extentMin :: !Vertex}
  deriving (Eq, Show, Read)
(though with these changes, the seq calls in addToExtent become redundant).
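Putting the strict fields and the strict left fold together, a cut-down sketch without the file handling might look as follows. The positional Vertex constructor, combine and extentOf are simplifications invented for this illustration; the parsing with lazy ByteStrings is left out.

```haskell
import Data.List (foldl')

-- Strict fields: building a Vertex or an Extent forces its components,
-- so no chain of max/min thunks can pile up.
data Vertex = Vertex !Double !Double !Double deriving (Eq, Show)
data Extent = Extent { extentMax :: !Vertex, extentMin :: !Vertex } deriving (Eq, Show)

-- Combine two vertices coordinate by coordinate.
combine :: (Double -> Double -> Double) -> Vertex -> Vertex -> Vertex
combine f (Vertex x1 y1 z1) (Vertex x2 y2 z2) = Vertex (f x1 x2) (f y1 y2) (f z1 z2)

addToExtent :: Extent -> Vertex -> Extent
addToExtent (Extent hi lo) v = Extent (combine max hi v) (combine min lo v)

-- foldl' forces the accumulator at every step, so the input list can be
-- consumed and garbage-collected incrementally.
extentOf :: Vertex -> [Vertex] -> Extent
extentOf v0 = foldl' addToExtent (Extent v0 v0)
```

With map parseLine recordsLines as the input list, this should run in roughly constant memory, provided nothing else holds on to the file contents.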

Resources