So I have a file of about 8 MB in which each line contains 6 ints separated by spaces.
My current method for parsing this is:
tuplify6 :: [a] -> (a, a, a, a, a, a)
tuplify6 [l, m, n, o, p, q] = (l, m, n, o, p, q)
toInts :: String -> (Int, Int, Int, Int, Int, Int)
toInts line =
    tuplify6 $ map read stringNumbers
    where stringNumbers = split " " line
and mapping toInts over
liftM lines . readFile
which gives me a list of tuples. However, when I run this, it takes nearly 25 seconds to load and parse the file. Is there any way I can speed this up? The file is just plain text.
You can speed it up by using ByteStrings, e.g.
module Main (main) where
import System.Environment (getArgs)
import qualified Data.ByteString.Lazy.Char8 as C
import Data.Char
main :: IO ()
main = do
    args <- getArgs
    mapM_ doFile args
doFile :: FilePath -> IO ()
doFile file = do
    bs <- C.readFile file
    let tups = buildTups 0 [] $ C.dropWhile (not . isDigit) bs
    print (length tups)
buildTups :: Int -> [Int] -> C.ByteString -> [(Int,Int,Int,Int,Int,Int)]
-- acc holds the ints in reverse reading order (built with i:acc), so restore the order here
buildTups 6 acc bs = tuplify6 (reverse acc) : buildTups 0 [] bs
buildTups k acc bs
    | C.null bs = if k == 0 then [] else error ("Bad file format " ++ show k)
    | otherwise = case C.readInt bs of
        Just (i,rm) -> buildTups (k+1) (i:acc) $ C.dropWhile (not . isDigit) rm
        Nothing -> error ("No Int found: " ++ show (C.take 100 bs))
tuplify6:: [a] -> (a, a, a, a, a, a)
tuplify6 [l, m, n, o, p, q] = (l, m, n, o, p, q)
runs pretty fast:
$ time ./fileParse IntList
200000
real 0m0.119s
user 0m0.115s
sys 0m0.003s
for an 8.1 MiB file.
On the other hand, using Strings and your conversion (with a couple of seqs to force evaluation) also took only 0.66s, so the bulk of the time seems to be spent not parsing, but working with the result.
Oops, I missed a seq, so the reads were not actually evaluated in the String version. With that fixed, String + read takes about four seconds, and a bit over one second with the custom Int parser from @Rotsor's comment
foldl' (\a c -> 10*a + fromEnum c - fromEnum '0') 0
so parsing apparently did take a significant amount of the time.
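For reference, the same idea can be sketched line by line with strict ByteStrings. This is only a sketch under assumptions: parseLine and parseAll are illustrative names (they are not part of the code above), and malformed lines are silently skipped rather than reported.

```haskell
import qualified Data.ByteString.Char8 as C
import Data.Maybe (mapMaybe)

-- Parse one line of six whitespace-separated Ints.
-- Returns Nothing unless the line has exactly six readable fields.
-- Caveat: C.readInt stops at the first non-digit, so a field like "12x"
-- is read as 12; inspect the remainder if you need strict validation.
parseLine :: C.ByteString -> Maybe (Int, Int, Int, Int, Int, Int)
parseLine line =
  case mapM (fmap fst . C.readInt) (C.words line) of
    Just [a, b, c, d, e, f] -> Just (a, b, c, d, e, f)
    _                       -> Nothing

-- Parse every line, dropping the ones that do not fit the format.
parseAll :: C.ByteString -> [(Int, Int, Int, Int, Int, Int)]
parseAll = mapMaybe parseLine . C.lines
```

It would be used as fmap parseAll (C.readFile path). Unlike buildTups, this keeps the one-tuple-per-line structure of the input, at the cost of reading the file strictly.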
I have the following code. The M prefix designates functions from Data.Map.Strict, and Table is a type alias for Data.Map.Strict.Map Mapping Bool, where Mapping is an arbitrary opaque structure.
computeCoverage :: Table -> Expr -> Maybe Coverage
computeCoverage t e = go t True M.empty
    where go src flag targ
            | null src = if flag
                         then Nothing
                         else Just (M.size t, targ)
            | otherwise = let ((m, b), rest) = M.deleteFindMin src
                              result = interpret e m
                              flag' = result && flag in
                          go rest flag' (if b == result then targ else M.insert m b targ)
I would like to be able to use Control.Parallel to perform this with as much parallelism as possible. However, I'm not sure how to do this. Based on reading Data.Map.Strict, it seems what you're supposed to do is call splitRoot, then do whatever parallel stuff you want on the resulting list, then recombine (I guess?). Have I basically got the right idea? If not, what should I do instead to parallelize the code above?
Here's a contrived example. You just use parMap over M.splitRoot m:
import qualified Data.Map.Strict as M
import Control.Parallel.Strategies
import System.Environment
fib 0 = 0
fib 1 = 1
fib n = fib (n-2) + fib (n-1)
theMap :: Int -> M.Map Int Int
theMap n = M.fromList [ (x, 33 + mod x 3) | x <- [1..n] ]
isInteresting n = mod (fib n) 2 == 0
countInteresting :: M.Map Int Int -> Int
countInteresting m = length $ filter isInteresting (M.elems m)
doit :: Int -> [Int]
doit n = parMap rseq countInteresting (M.splitRoot $ theMap n)
main :: IO ()
main = do
    ( arg1 : _) <- getArgs
    let n = read arg1
    print $ doit n
Note, however, these caveats:
The splits may not be of equal size.
Use splitRoot only if working with a Map is helpful for your computation; this particular example doesn't benefit from the Map structure at all, and could just as well have parMapped over the elements.
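To illustrate the second caveat, here is a minimal sketch that skips splitRoot and evaluates a predicate over the elements in parallel chunks. The name countWith and the chunk size of 64 are my own arbitrary choices, not something from the question.

```haskell
import qualified Data.Map.Strict as M
import Control.Parallel.Strategies (parListChunk, rseq, withStrategy)

-- Count the values satisfying p, evaluating the predicate in parallel
-- over chunks of 64 elements (the chunk size is an arbitrary choice).
countWith :: (v -> Bool) -> M.Map k v -> Int
countWith p m = length (filter id flags)
  where
    -- rseq evaluates each Bool to WHNF, which for Bool is fully evaluated.
    flags = withStrategy (parListChunk 64 rseq) (map p (M.elems m))
```

With the definitions above, countWith isInteresting (theMap n) should give the same total as summing the result of doit n. To see any actual parallelism the program has to be compiled with -threaded and run with +RTS -N.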
I am trying to write a very large amount of data to a file in constant memory.
import qualified Data.ByteString.Lazy as B
{- Creates and writes num grids of dimensions aa x aa -}
writeGrids :: Int -> Int -> IO ()
writeGrids num aa = do
    rng <- newPureMT
    let (grids,shuffleds) = createGrids rng aa
    createDirectoryIfMissing True "data/grids/"
    B.writeFile (gridFileName num aa)
                (encode (take num grids))
    B.writeFile (shuffledFileName num aa)
                (encode (take num shuffleds))
However, this consumes memory proportional to num. I know createGrids is a sufficiently lazy function because I have tested it by appending error "not lazy enough" (as suggested by the Haskell wiki here) to the end of the lists it returns, and no errors are raised. take is a lazy function that is defined in Data.List. encode is also a lazy function, defined in Data.Binary. B.writeFile is defined in Data.ByteString.Lazy.
Here is the complete code so you can execute it:
import Control.Arrow (first)
import Data.Binary
import GHC.Float (double2Float)
import System.Random (next)
import System.Random.Mersenne.Pure64 (PureMT, newPureMT, randomDouble)
import System.Random.Shuffle (shuffle')
import qualified Data.ByteString.Lazy as B
main :: IO ()
main = writeGrids 1000 64
{- Creates and writes num grids of dimensions aa x aa -}
writeGrids :: Int -> Int -> IO ()
writeGrids num aa = do
    rng <- newPureMT
    let (grids,shuffleds) = createGrids rng aa
    B.writeFile "grids.bin" (encode (take num grids))
    B.writeFile "shuffleds.bin" (encode (take num shuffleds))
{- a random number generator, dimension of grids to make
   returns a pair of lists, the first is a list of grids of dimensions
   aa x aa, the second is a list of the shuffled grids corresponding to the first list -}
createGrids :: PureMT -> Int -> ([[(Float,Float)]],[[(Float,Float)]])
createGrids rng aa = (grids,shuffleds) where
    rs = randomFloats rng
    grids = map (getGridR aa) (chunksOf (2 * aa * aa) rs)
    shuffleds = shuffler (aa * aa) rng grids
{- length of each grid, a random number generator, a list of grids
   returns the list with each grid shuffled -}
shuffler :: Int -> PureMT -> [[(Float,Float)]] -> [[(Float,Float)]]
shuffler n rng (xs:xss) = shuffle' xs n rng : shuffler n (snd (next rng)) xss
shuffler _ _ [] = []
{- divides list into chunks of size n -}
chunksOf :: Int -> [a] -> [[a]]
chunksOf n = go
    where go xs = case splitAt n xs of
                    (ys,zs) | null ys -> []
                            | otherwise -> ys : go zs
{- dimension of grid, list of random floats [0,1]
   returns a list of (x,y) points of length n^2 such that all
   points are in the range [0,1] and the points are a randomly
   perturbed regular grid -}
getGridR :: Int -> [Float] -> [(Float,Float)]
getGridR n rs = pts where
    nn = n * n
    (irs,jrs) = splitAt nn rs
    n' = fromIntegral n
    grid = [ (p,q) | p <- [0..n'-1], q <- [0..n'-1] ]
    pts = zipWith (\(p,q) (ir,jr) -> ((p+ir)/n',(q+jr)/n')) grid (zip irs jrs)
{- an infinite list of random floats in range [0,1] -}
randomFloats :: PureMT -> [Float]
randomFloats rng = let (d,rng') = first double2Float (randomDouble rng)
                   in d : randomFloats rng'
The required packages are:
, bytestring
, binary
, random
, mersenne-random-pure64
, random-shuffle
Two reasons for the memory usage:
First, Data.Binary.encode doesn't seem to run in constant space. The following program uses 910 MB memory:
import Data.Binary
import qualified Data.ByteString.Lazy as B
len = 10000000 :: Int
main = B.writeFile "grids.bin" $ encode [0..len]
If we leave a 0 out from len we get 97 MB memory usage.
In contrast, the following program uses 1 MB:
import qualified Data.ByteString.Lazy.Char8 as B
main = B.writeFile "grids.bin" $ B.pack $ show [0..(1000000::Int)]
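A likely explanation for the encode numbers, as far as I can tell: the Binary instance for lists writes the length before the elements, so the whole list has to be traversed (and stays alive) before the first byte can be emitted. The following is a minimal sketch of one way around that, encoding each element separately; encodeStream is a name I made up, and note that the output format differs from encode because there is no length prefix.

```haskell
import Data.Binary (Binary, encode)
import qualified Data.ByteString.Lazy as B

-- Encode each element on its own and concatenate the results lazily.
-- No length prefix is written, so a reader must know the element count
-- in advance or keep decoding until the input is exhausted.
encodeStream :: Binary a => [a] -> B.ByteString
encodeStream = B.concat . map encode
```

It would be used as B.writeFile "grids.bin" (encodeStream [0..len]). I have not measured this variant, so treat the constant-space behaviour as an expectation rather than a result.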
Second, in your program shuffleds contains references to contents of grids, which prevents garbage collection of grids. So when we print grids, we also evaluate it and then it has to sit in memory until we finish printing shuffleds. The following version of your program still consumes lots of memory, but it uses constant space if we comment out one of the two lines with B.writeFile.
import qualified Data.ByteString.Lazy.Char8 as B
writeGrids :: Int -> Int -> IO ()
writeGrids num aa = do
    rng <- newPureMT
    let (grids,shuffleds) = createGrids rng aa
    B.writeFile "grids.bin" (B.pack $ show (take num grids))
    B.writeFile "shuffleds.bin" (B.pack $ show (take num shuffleds))
For what it's worth, here is a full solution combining the ideas of everyone here. Memory consumption is constant at ~6MB (compiled with -O2).
import Control.Arrow (first)
import Control.Monad.State (state, evalState)
import Data.Binary
import GHC.Float (double2Float)
import System.Random (next)
import System.Random.Mersenne.Pure64 (PureMT, newPureMT, randomDouble)
import System.Random.Shuffle (shuffle')
import qualified Data.ByteString as B (hPut)
import qualified Pipes.Binary as P (encode)
import qualified Pipes.Prelude as P (zip, mapM, drain)
import Pipes (runEffect, (>->))
import System.IO (withFile, IOMode(AppendMode))
main :: IO ()
main = writeGrids 1000 64
{- Creates and writes num grids of dimensions aa x aa -}
writeGrids :: Int -> Int -> IO ()
writeGrids num aa = do
    rng <- newPureMT
    let (grids, shuffleds) = createGrids rng aa
        gridFile = "grids.bin"
        shuffledFile = "shuffleds.bin"
        encoder = P.encode . SerList . take num
    writeFile gridFile ""
    writeFile shuffledFile ""
    withFile gridFile AppendMode $ \hGr ->
        withFile shuffledFile AppendMode $ \hSh ->
            runEffect
                $ P.zip (encoder grids) (encoder shuffleds)
                >-> P.mapM (\(ch1, ch2) -> B.hPut hGr ch1 >> B.hPut hSh ch2)
                >-> P.drain -- discards the stream of () results.
{- a random number generator, dimension of grids to make
   returns a pair of lists, the first is a list of grids of dimensions
   aa x aa, the second is a list of the shuffled grids corresponding to the first list -}
createGrids :: PureMT -> Int -> ( [[(Float,Float)]], [[(Float,Float)]] )
createGrids rng aa = unzip gridsAndShuffleds where
    rs = randomFloats rng
    grids = map (getGridR aa) (chunksOf (2 * aa * aa) rs)
    gridsAndShuffleds = shuffler (aa * aa) rng grids
{- length of each grid, a random number generator, a list of grids
   returns the list with each grid paired with its shuffled version -}
shuffler :: Int -> PureMT -> [[(Float,Float)]] -> [( [(Float,Float)], [(Float,Float)] )]
shuffler n rng xss = evalState (traverse oneShuffle xss) rng
    where
        oneShuffle xs = state $ \r -> ((xs, shuffle' xs n r), snd (next r))
newtype SerList a = SerList { runSerList :: [a] }
    deriving (Show)
instance Binary a => Binary (SerList a) where
    put (SerList (x:xs)) = put False >> put x >> put (SerList xs)
    put _ = put True
    get = do
        stop <- get :: Get Bool
        if stop
            then return (SerList [])
            else do
                x <- get
                SerList xs <- get
                return (SerList (x : xs))
{- divides list into chunks of size n -}
chunksOf :: Int -> [a] -> [[a]]
chunksOf n = go
    where go xs = case splitAt n xs of
                    (ys,zs) | null ys -> []
                            | otherwise -> ys : go zs
{- dimension of grid, list of random floats [0,1]
   returns a list of (x,y) points of length n^2 such that all
   points are in the range [0,1] and the points are a randomly
   perturbed regular grid -}
getGridR :: Int -> [Float] -> [(Float,Float)]
getGridR n rs = pts where
    nn = n * n
    (irs,jrs) = splitAt nn rs
    n' = fromIntegral n
    grid = [ (p,q) | p <- [0..n'-1], q <- [0..n'-1] ]
    pts = zipWith (\(p,q) (ir,jr) -> ((p+ir)/n',(q+jr)/n')) grid (zip irs jrs)
{- an infinite list of random floats in range [0,1] -}
randomFloats :: PureMT -> [Float]
randomFloats rng = let (d,rng') = first double2Float (randomDouble rng)
                   in d : randomFloats rng'
Comments on the changes:
shuffler is now a traversal with the State functor. It produces, in a single pass through the input list, a list of pairs, in which each grid is paired with its shuffled version. createGrids then (lazily) unzips this list.
The files are written to using pipes machinery, in a way loosely inspired by this answer (I originally wrote this using P.foldM). Note that the hPut I used is the strict bytestring one, for it acts on strict chunks supplied by the producer made with P.zip (which, in spirit, is a pair of lazy bytestrings that supplies chunks in pairs).
SerList is there to hold the custom Binary instance Thomas M. DuBuisson alludes to. Note that I haven't thought too much about laziness and strictness in the get method of the instance. If that causes you trouble, this question looks useful.
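In case the State traversal in shuffler looks opaque, here is the same pattern in isolation. It is only a sketch: step is a hypothetical stand-in for advancing a generator (a toy linear congruential step on an Int seed, not PureMT), and pairWithSeed is a made-up name.

```haskell
import Control.Monad.State (evalState, state)

-- Hypothetical stand-in for "advance the generator": a toy LCG step.
step :: Int -> Int
step s = (s * 1103515245 + 12345) `mod` 2147483648

-- Pair every element with the seed that was current when it was visited,
-- threading the seed through the list in a single pass.
pairWithSeed :: [a] -> [(a, Int)]
pairWithSeed xs = evalState (traverse one xs) 42
  where
    one x = state $ \s -> ((x, s), step s)
```

Each call to one plays the role of oneShuffle: it uses the current state to produce a result and hands the next state on to the following element.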
Here I am trying to generate a list of Ints, adding random numbers to the list for as long as the condition md <= 1 is not satisfied.
I tried several times, but with no luck.
fx :: Int -> Int -> IO [Int]
fx md s = do
    x <- randomRIO (1,min md s)
    if md<=1
        then return [md]
        else return md:(fx (md-x) s)
The error is at the last line:
else return md:(fx (md-x) s)
The result of fx is in the IO monad, and with that result you want to perform a pure operation: prepending md to it. This is exactly what liftM does, so your last line should be
else liftM (md :) (fx (md-x) s)
Other small improvements:
You don't really need IO, you only need to generate random numbers within the function. For this you could use the MonadRandom type class (residing in the package with the same name). This will make the function more specific, type-safe and flexible.
It's redundant to generate a random number and only then check whether you need it; the comparison md <= 1 doesn't need x at all.
So an improved version might look like:
import Control.Monad
import Control.Monad.Random
fx :: (MonadRandom m) => Int -> Int -> m [Int]
fx md s | md <= 1 = return [md]
        | otherwise = do
            x <- getRandomR (1, min md s)
            liftM (md :) (fx (md-x) s)
An alternative without liftM, which may be easier to understand:
fx :: (MonadRandom m) => Int -> Int -> m [Int]
fx md s | md <= 1 = return [md]
        | otherwise = do
            x <- getRandomR (1, min md s)
            xs <- fx (md-x) s
            return (md : xs)
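To actually run the MonadRandom version you have to pick a concrete monad. A minimal sketch, assuming the MonadRandom package: evalRand with a fixed seed gives a reproducible pure result. The name sample and the arguments 10 and 3 are mine, chosen only for illustration.

```haskell
import Control.Monad.Random (MonadRandom, evalRand, getRandomR, mkStdGen)

fx :: MonadRandom m => Int -> Int -> m [Int]
fx md s
  | md <= 1   = return [md]
  | otherwise = do
      x  <- getRandomR (1, min md s)
      xs <- fx (md - x) s
      return (md : xs)

-- Run fx purely with a fixed seed, so the result is reproducible.
sample :: Int -> [Int]
sample seed = evalRand (fx 10 3) (mkStdGen seed)
```

In IO you would write evalRandIO (fx 10 3) instead, which draws from the global generator.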
Is there a way to read an integer from the console in Haskell? I'm asking for something pretty much like C++'s cin or Java's Scanner.nextInt().
And by that I mean that given this input:
1 2 3
2 3
4 25 12 7
1
I should be able to read them all, not at the same time (maybe reading 4 of them, doing some calculations and then read the rest) ignoring the fact that they are in separate lines.
The easiest solution is probably
getAll :: Read a => IO [a]
getAll = fmap (fmap read . words) getContents
getInts :: IO [Int]
getInts = getAll
which will read all input into a single list.
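Since getContents is lazy, the resulting list can still be consumed piece by piece. Here is a small sketch of the "read 4, compute, then read the rest" scenario, written as a pure function so it is easy to test; parseInts and process are illustrative names.

```haskell
-- All whitespace-separated Ints in the input, line breaks ignored.
parseInts :: String -> [Int]
parseInts = map read . words

-- Take the first four numbers, do something with them (here: sum them),
-- then handle the remainder (here: just count it).
process :: String -> String
process input = unlines [show (sum firstFour), show (length rest)]
  where
    (firstFour, rest) = splitAt 4 (parseInts input)
```

Hooking it up to the console is then main = interact process. Note that read raises an exception on a malformed token.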
When in doubt, use Parsec! (not always, and not really, but who cares)
import Text.ParserCombinators.Parsec
import Text.Parsec.Numbers
value = do
    spaces
    num <- parseFloat
    return num
line = many value
then "rinse and repeat", with getLine until you EOF.
Note: you can do it without Parsec using read and friends, but this way is more extendable and preferred for more complicated grammars.
Using Parsec:
import Text.ParserCombinators.Parsec
import Text.Parsec.Numbers
import Control.Applicative ((*>), (<*))
line = spaces *> many1 (parseFloat <* spaces)
main = putStrLn "Enter numbers:" >> fmap (parse line "") getLine >>= print
Running it:
$ ghc parsenums.hs
$ ./parsenums
Enter numbers:
345 23 654 234
[345.0,23.0,654.0,234.0]
A more "manual" way to do it would be something like:
import Data.Char (isDigit, isSpace)
getInts :: String -> [Int]
getInts s = case span isDigit (dropWhile isSpace s) of
    ("", "") -> []
    ("", s) -> error $ "Invalid input: " ++ s
    (digits, rest) -> (read digits :: Int) : getInts rest
This might make it much clearer how it works. In fact, here's one that's completely from the ground up:
getInts :: String -> [Int]
getInts s = case span isDigit (dropWhile isSpace s) of
    ("", "") -> []
    ("", s) -> error $ "Invalid input: " ++ s
    (digits, rest) -> strToInt digits : getInts rest
isDigit :: Char -> Bool
isDigit c = '0' <= c && c <= '9'
isSpace :: Char -> Bool
isSpace c = c `elem` " \t\n\r"
charToInt :: Char -> Int
charToInt c = fromEnum c - 48
strToInt :: String -> Int
strToInt s = go 0 s where
    go n [] = n
    go n (c:rest) = go (n * 10 + charToInt c) rest
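If calling error on bad input is not acceptable, a variant that reports failure in the type is short to write with readMaybe from base. This is a sketch, and it swaps the hand-written digit loop for the standard Read instance; safeInts is a made-up name.

```haskell
import Text.Read (readMaybe)

-- Parse all whitespace-separated tokens as Ints.
-- Returns Nothing if any single token fails to parse.
safeInts :: String -> Maybe [Int]
safeInts = mapM readMaybe . words
```

Unlike the manual versions above, this also accepts negative numbers such as "-3", because it relies on the Read instance for Int.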
Here is my first Haskell program. What parts would you write in a better way?
-- Multiplication table
-- Returns n*n multiplication table in base b
import Text.Printf
import Data.List
import Data.Char
-- Returns n*n multiplication table in base b
mulTable :: Int -> Int -> String
mulTable n b = foldl (++) (verticalHeader n b w) (map (line n b w) [0..n])
    where
        lo = 2* (logBase (fromIntegral b) (fromIntegral n))
        w = 1+fromInteger (floor lo)
verticalHeader :: Int -> Int -> Int -> String
verticalHeader n b w = (foldl (++) tableHeader columnHeaders)
                       ++ "\n"
                       ++ minusSignLine
                       ++ "\n"
    where
        tableHeader = replicate (w+2) ' '
        columnHeaders = map (horizontalHeader b w) [0..n]
        minusSignLine = concat ( replicate ((w+1)* (n+2)) "-" )
horizontalHeader :: Int -> Int -> Int -> String
horizontalHeader b w i = format i b w
line :: Int -> Int -> Int -> Int -> String
line n b w y = (foldl (++) ((format y b w) ++ "|" )
                           (map (element b w y) [0..n])) ++ "\n"
element :: Int -> Int -> Int -> Int -> String
element b w y x = format (y * x) b w
toBase :: Int -> Int -> [Int]
toBase b v = toBase' [] v where
    toBase' a 0 = a
    toBase' a v = toBase' (r:a) q where (q,r) = v `divMod` b
toAlphaDigits :: [Int] -> String
toAlphaDigits = map convert where
    convert n | n < 10 = chr (n + ord '0')
              | otherwise = chr (n + ord 'a' - 10)
format :: Int -> Int -> Int -> String
format v b w = concat spaces ++ digits ++ " "
    where
        digits = if v == 0
                 then "0"
                 else toAlphaDigits ( toBase b v )
        l = length digits
        spaceCount = if (l > w) then 0 else (w-l)
        spaces = replicate spaceCount " "
Here are some suggestions:
To make the tabularity of the computation more obvious, I would pass the list [0..n] to the line function rather than passing n.
I would further split out the computation of the horizontal and vertical axes so that they are passed as arguments to mulTable rather than computed there.
Haskell is higher-order, and almost none of the computation has to do with multiplication. So I would change the name of mulTable to binopTable and pass the actual multiplication in as a parameter.
Finally, the formatting of individual numbers is repetitious. Why not pass \x -> format x b w as a parameter, eliminating the need for b and w?
The net effect of the changes I am suggesting is that you build a general higher-order function for creating tables for binary operators. Its type becomes something like
binopTable :: (i -> String) -> (i -> i -> i) -> [i] -> [i] -> String
and you wind up with a much more reusable function—for example, Boolean truth tables should be a piece of cake.
Higher-order and reusable is the Haskell Way.
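To make the suggested signature concrete, here is a minimal sketch of what binopTable could look like. The layout choices (a "*" in the corner, cells joined by single spaces, no rule under the header) are my own simplifications; padding is left entirely to the formatting function passed in.

```haskell
-- Build a table for a binary operator.
-- fmt renders one value, op is the operator, xs labels the columns
-- and ys labels the rows.
binopTable :: (i -> String) -> (i -> i -> i) -> [i] -> [i] -> String
binopTable fmt op xs ys = unlines (headerRow : map row ys)
  where
    headerRow = unwords ("*" : map fmt xs)
    row y     = unwords (fmt y : map (fmt . op y) xs)
```

The multiplication table is then binopTable show (*) [1..n] [1..n], and a truth table for conjunction is binopTable show (&&) [False, True] [False, True].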
You don't use anything from import Text.Printf.
Stylistically, you use more parentheses than necessary. Haskellers tend to find code more readable when it's cleaned of extraneous stuff like that. Instead of something like h x = f (g x), write h = f . g.
Nothing here really requires Int; (Integral a) => a ought to do.
foldl (++) x xs == concat $ x : xs: I trust the built-in concat to work better than your implementation.
Also, you should prefer foldr when the function is lazy in its second argument, as (++) is – because Haskell is lazy, this reduces stack space (and also works on infinite lists).
Also, unwords and unlines are shortcuts for intercalate " " and concat . map (++ "\n") respectively, i.e. "join with spaces" and "join with newlines (plus trailing newline)"; you can replace a couple things by those.
Unless you use big numbers, w = length $ takeWhile (<= n) $ iterate (* b) 1 is probably faster. Or, in the case of a lazy programmer, let w = length $ toBase b n.
concat (replicate ((w+1) * (n+2)) "-") == replicate ((w+1) * (n+2)) '-' – not sure how you missed this one; you got it right just a couple of lines up.
You do the same thing with concat spaces, too. However, wouldn't it be easier to actually use the Text.Printf import and write printf "%*s " w digits?
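Two of the points above (computing the width without logBase, and padding without concat) can be sketched as small standalone helpers. The names digitCount and padLeft are mine, and the sketch assumes n >= 1 and b >= 2.

```haskell
-- Number of base-b digits needed to write n (assumes n >= 1 and b >= 2).
-- Counts the powers of b that do not exceed n.
digitCount :: Int -> Int -> Int
digitCount b n = length (takeWhile (<= n) (iterate (* b) 1))

-- Right-align a string in a field of width w; longer strings are kept as-is,
-- since replicate with a negative count yields the empty string.
padLeft :: Int -> String -> String
padLeft w s = replicate (w - length s) ' ' ++ s
```

For the table, the width would be digitCount b (n * n), since the largest entry is n * n.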
Norman Ramsey gave excellent high-level (design) suggestions; below are some low-level ones:
First, consult with HLint. HLint is a friendly program that gives you rudimentary advice on how to improve your Haskell code!
In your case HLint gives 7 suggestions (mostly about redundant brackets).
Modify your code according to HLint's suggestions until it likes what you feed it.
More HLint-like stuff:
concat (replicate i "-"). Why not replicate i '-'?
Consult with Hoogle whenever there is reason to believe that a function you need is already available in Haskell's libraries. Haskell comes with tons of useful functions so Hoogle should come in handy quite often.
Need to concatenate strings? Search for [String] -> String, and voila you found concat. Now go replace all those folds.
The previous search also suggested unlines. Actually, this even better suits your needs. It's magic!
Optional: pause and thank Neil M in your heart for making Hoogle and HLint, and thank others for making other good stuff like Haskell, bridges, tennis balls, and sanitation.
Now, for every function that takes several arguments of the same type, make it clear which means what, by giving them descriptive names. This is better than comments, but you can still use both.
So
-- Returns n*n multiplication table in base b
mulTable :: Int -> Int -> String
mulTable n b =
becomes
mulTable :: Int -> Int -> String
mulTable size base =
To soften the blow of the extra characters from the previous suggestion: when a function is only used once, and is not very useful by itself, put it inside its caller's where clause, where it can use the caller's variables, saving you the need to pass everything to it.
So
line :: Int -> Int -> Int -> Int -> String
line n b w y =
    concat
    $ format y b w
    : "|"
    : map (element b w y) [0 .. n]
element :: Int -> Int -> Int -> Int -> String
element b w y x = format (y * x) b w
becomes
line :: Int -> Int -> Int -> Int -> String
line n b w y =
    concat
    $ format y b w
    : "|"
    : map element [0 .. n]
    where
        element x = format (y * x) b w
You can even move line into mulTable's where clause; imho, you should.
If you find a where clause nested inside another where clause troubling, then I suggest changing your indentation habits. My recommendation is to use a consistent indentation of always 2 or always 4 spaces. Then you can easily see, everywhere, where the where in the other where is at.
Below's what it looks like (with a few other changes in style):
import Data.List
import Data.Char
mulTable :: Int -> Int -> String
mulTable size base =
    unlines $
    [ vertHeaders
    , minusSignsLine
    ] ++ map line [0 .. size]
    where
        vertHeaders =
            concat
            $ replicate (cellWidth + 2) ' '
            : map horizontalHeader [0 .. size]
        horizontalHeader i = format i base cellWidth
        minusSignsLine = replicate ((cellWidth + 1) * (size + 2)) '-'
        cellWidth = length $ toBase base (size * size)
        line y =
            concat
            $ format y base cellWidth
            : "|"
            : map element [0 .. size]
            where
                element x = format (y * x) base cellWidth
toBase :: Integral i => i -> i -> [i]
toBase base
    = reverse
    . map (`mod` base)
    . takeWhile (> 0)
    . iterate (`div` base)
toAlphaDigit :: Int -> Char
toAlphaDigit n
    | n < 10 = chr (n + ord '0')
    | otherwise = chr (n + ord 'a' - 10)
format :: Int -> Int -> Int -> String
format v b w =
    spaces ++ digits ++ " "
    where
        digits
            | v == 0 = "0"
            | otherwise = map toAlphaDigit (toBase b v)
        spaces = replicate (w - length digits) ' '
0) Add a main function :-), at least a rudimentary one:
import System.Environment (getArgs)
import Control.Monad (liftM)
main :: IO ()
main = do
    args <- liftM (map read) $ getArgs
    case args of
        (n:b:_) -> putStrLn $ mulTable n b
        _ -> putStrLn "usage: nntable n base"
1) Run ghc or runhaskell with -Wall; run the code through hlint.
While hlint doesn't suggest anything special here (only some redundant brackets), ghc will tell you that you don't actually need Text.Printf here...
2) Try running it with base = 1 or base = 0 or base = -1.
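The point of item 2 is that toBase misbehaves for such bases: repeated division by 1 never reaches 0, and division by 0 throws. One possible way to guard against that, as a sketch only (toBaseSafe is a made-up name, and returning Maybe is just one of several reasonable designs):

```haskell
-- Digits of v in the given base, most significant first.
-- Returns Nothing for bases below 2 and for negative values,
-- where the plain digit loop would not terminate or would throw.
toBaseSafe :: Int -> Int -> Maybe [Int]
toBaseSafe base v
  | base < 2 || v < 0 = Nothing
  | v == 0            = Just [0]
  | otherwise         = Just (reverse (go v))
  where
    -- Collects digits least significant first; reversed by the caller.
    go 0 = []
    go n = n `mod` base : go (n `div` base)
```

The main function could then print a usage message when Nothing comes back, instead of hanging or crashing.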
If you want multiline comments use:
{- A multiline
comment -}
Also, prefer foldl' over foldl when you are folding large lists. It is more memory efficient, because it forces the accumulator at each step instead of building up a chain of thunks.
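A tiny sketch of the difference, using a sum as the example (sumStrict is an illustrative name):

```haskell
import Data.List (foldl')

-- Strict left fold: the running total is evaluated at every step,
-- so no chain of unevaluated (+) thunks builds up.
sumStrict :: [Int] -> Int
sumStrict = foldl' (+) 0
```

This only matters for strict accumulators like the Int here. For building a String with (++), as in the multiplication table, the laziness of foldr or concat is the better fit.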
Brief comments saying what each function does, and what its arguments and return value are, are always good. I had to read the code pretty carefully to fully make sense of it.
Some would say if you do that, explicit type signatures may not be required. That's an aesthetic question, I don't have a strong opinion on it.
One minor caveat: if you do remove the type signatures, you'll automatically get the polymorphic Integral support ephemient mentioned, but you will still need one around toAlphaDigits because of the infamous "monomorphism restriction."