Read a list of integers lazily as a bytestring - haskell

I'm trying to find the sum of the integers in a file. The code using a normal String is:
main = do
    contents <- getContents
    print (sumFile contents)
  where sumFile = sum . map read . words
I tried to change it to use the Data.ByteString.Lazy module like this:
import Data.ByteString.Lazy as L
main = do
    contents <- L.getContents
    L.putStrLn (sumFile contents)
  where sumFile = sum . L.map read . words
But the compiler rejected this because words returns a list of Strings. I then tried Data.ByteString.Char8, but that uses a strict ByteString.
How can I make this function completely lazy?

I found a slightly lengthy workaround: read the file as a ByteString and then convert it to a list of integers. Thanks to @melpomene.
import qualified Data.ByteString.Lazy.Char8 as L
main = do
    contents <- L.getContents
    print (sumFile contents)
  where
    sumFile x = sum $ map (tups . L.readInt) (L.words x)
-- Take the parsed number, or 0 when the token is not an integer.
tups :: Num a => Maybe (a, b) -> a
tups (Just (a, _)) = a
tups Nothing       = 0
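A variation on the workaround above, offered as a sketch rather than a definitive answer: it skips tokens that do not start with an integer instead of counting them as 0, and it sums with a strict left fold so that a lazily read input does not build up a chain of pending additions. The name sumInts is made up for this example.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Data.List (foldl')
import Data.Maybe (mapMaybe)

-- Sum every whitespace-separated token that starts with an integer.
-- Tokens for which readInt returns Nothing are skipped.
sumInts :: L.ByteString -> Int
sumInts = foldl' (+) 0 . map fst . mapMaybe L.readInt . L.words
```

It can be wired to standard input with L.getContents >>= print . sumInts.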

Related

How can I read a sentence, separate the words and apply my function to each word? Haskell

I have a function that reads a word, sets aside its first and last letters, shuffles the remaining letters, and then writes the word back with the first and last letters in place and the middle shuffled.
Example:
Hello -> H lle o
But I want it to read a whole sentence and do the same to each word of the sentence. What can I do?
import Data.List
import System.IO
import System.Random
import System.IO.Unsafe
import Data.Array.IO
import Control.Monad
import Data.Array
oracion = do
    frase <- getLine
    let pL = head frase
    let contentR = devContent frase
    charDisorder <- aleatorio contentR
    let uL = last frase
    putStrLn $ [pL] ++ charDisorder ++ [uL]
aleatorio :: [d] -> IO [d]
aleatorio xs = do
    ar <- newArray n xs
    forM [1..n] $ \i -> do
        t <- randomRIO (i,n)
        vi <- readArray ar i
        vt <- readArray ar t
        writeArray ar t vi
        return vt
  where
    n = length xs
    newArray :: Int -> [d] -> IO (IOArray Int d)
    newArray n xs = newListArray (1,n) xs
devContent :: [Char] -> [Char]
devContent x = init (drop 1 x)
It should go like this:
doStuffOnSentence sentence = mapM aleatorio (words sentence)
Whenever you are dealing with monads (especially IO), mapM is a real lifesaver.
What's more, if you want to join the results back into a single string, you can add:
concatIoStrings = liftM unwords
doStuffAndConcat = concatIoStrings . doStuffOnSentence
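Note that mapping aleatorio directly over the words shuffles every letter of each word. To keep the first and last letters fixed, as the question asks, the per-word logic can be split out. The following is only a sketch: scrambleWord and scrambleSentence are names invented here, and the shuffle is passed in as a parameter so that the asker's aleatorio (or any other shuffling action) can be plugged in.

```haskell
-- Apply a shuffling action to the interior of one word, keeping its
-- first and last letters fixed. Words shorter than four letters have
-- nothing to rearrange and are returned unchanged.
scrambleWord :: Monad m => (String -> m String) -> String -> m String
scrambleWord shuffle w =
  case w of
    (first : rest) | length rest >= 3 -> do
      middle <- shuffle (init rest)
      return (first : middle ++ [last rest])
    _ -> return w

-- Split a sentence into words, scramble each one, and join them again.
scrambleSentence :: Monad m => (String -> m String) -> String -> m String
scrambleSentence shuffle = fmap unwords . mapM (scrambleWord shuffle) . words
```

With the question's definitions in scope, getLine >>= scrambleSentence aleatorio >>= putStrLn would read a sentence and print the scrambled version.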

What is the fastest way to parse line with lots of Ints?

I've been learning Haskell for two years now and I'm still unsure what the best (fastest) way is to read a large quantity of numbers from a single input line.
To learn, I registered at hackerearth.com and try to solve every challenge in Haskell. But now I'm stuck on a challenge because I run into timeouts: my program is just too slow to be accepted by the site.
Using the profiler I found that more than 80% of the time is spent parsing a line with lots of integers. The percentage gets even higher as the number of values in the line increases.
This is how I'm reading the numbers from an input line:
import qualified Data.ByteString.Char8 as C8
import Data.Maybe (fromJust)
main = do
    scores <- fmap (map (fst . fromJust . C8.readInt) . C8.words) C8.getLine :: IO [Int]
Is there any way to get the data into the variable faster?
BTW: The biggest test case consists of a single line with 200,000 nine-digit values. Parsing takes incredibly long (> 60 s).
It's always difficult to declare a particular approach "the fastest", since there's almost always some way to squeeze out more performance. However, an approach using Data.ByteString.Char8 and the general method you suggest should be among the fastest methods for reading numbers. If you encounter a case where performance is poor, the problem likely lies elsewhere.
To give some concrete results, I generated a 191 MB file of 20 million 9-digit numbers, space-separated on a single line. I then tried several general methods of reading a line of numbers and printing their sum (which, for the record, was 10999281565534666). The obvious approach using String:
reader :: IO [Int]
reader = map read . words <$> getLine
sum' xs = sum xs -- work around GHC ticket 10992
main = print =<< sum' <$> reader
took 52secs; a similar approach using Text:
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Read as T
readText = map parse . T.words <$> T.getLine
  where parse s = let Right (n, _) = T.decimal s in n
ran in 2.4secs (but note that it would need to be modified to handle negative numbers!); and the same approach using Char8:
import qualified Data.ByteString.Char8 as C
readChar8 :: IO [Int]
readChar8 = map parse . C.words <$> C.getLine
  where parse s = let Just (n, _) = C.readInt s in n
ran in 1.4secs. All examples were compiled with -O2 on GHC 8.0.2.
As a comparison benchmark, a scanf-based C implementation:
/* GCC 5.4.0 w/ -O3 */
#include <stdio.h>
int main()
{
long x, acc = 0;
while (scanf(" %ld", &x) == 1) {
acc += x;
}
printf("%ld\n", acc);
return 0;
}
ran in about 2.5secs, on par with the Text implementation.
You can squeeze a bit more performance out of the Char8 implementation. Using a hand-rolled parser:
-- unfoldr comes from Data.List; C.isSpace assumes Data.Char is also imported qualified as C
readChar8' :: IO [Int]
readChar8' = parse <$> C.getLine
  where parse = unfoldr go
        go s = do (n, s1) <- C.readInt s
                  let s2 = C.dropWhile C.isSpace s1
                  return (n, s2)
runs in about 0.9secs -- I haven't tried to determine why there's a difference, but the compiler must be missing an opportunity to perform some optimization of the words-to-readInt pipeline.
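If only the sum is needed, the hand-rolled parser can also accumulate as it goes instead of producing a list first. This is a sketch under that assumption, not something measured in the benchmarks here; sumLine is a name made up for the example, and isSpace is imported unqualified from Data.Char to keep the block self-contained.

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Char8 as C
import Data.Char (isSpace)

-- Sum all whitespace-separated integers in a strict ByteString.
-- Stops at the first token that does not start with an integer.
sumLine :: C.ByteString -> Int
sumLine = go 0
  where
    -- The bang pattern keeps the accumulator evaluated at each step.
    go !acc s =
      case C.readInt (C.dropWhile isSpace s) of
        Just (n, rest) -> go (acc + n) rest
        Nothing        -> acc
```

It would be used as sumLine <$> C.getLine >>= print in place of reading a list and summing it.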
Haskell Code for Reference
Make some numbers with Numbers.hs:
-- |Generate 20M 9-digit numbers:
-- ./Numbers 20000000 100000000 999999999 > data1.txt
import qualified Data.ByteString.Char8 as C
import Control.Monad
import System.Environment
import System.Random
main :: IO ()
main = do [n, a, b] <- map read <$> getArgs
          nums <- replicateM n (randomRIO (a,b))
          let _ = nums :: [Int]
          C.putStrLn (C.unwords (map (C.pack . show) nums))
Find their sum with Sum.hs:
import Data.List
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Read as T
import qualified Data.Char as C
import qualified Data.ByteString.Char8 as C
import System.Environment
-- work around https://ghc.haskell.org/trac/ghc/ticket/10992
sum' xs = sum xs
readString :: IO [Int]
readString = map read . words <$> getLine
readText :: IO [Int]
readText = map parse . T.words <$> T.getLine
  where parse s = let Right (n, _) = T.decimal s in n
readChar8 :: IO [Int]
readChar8 = map parse . C.words <$> C.getLine
  where parse s = let Just (n, _) = C.readInt s in n
readHand :: IO [Int]
readHand = parse <$> C.getLine
  where parse = unfoldr go
        go s = do (n, s1) <- C.readInt s
                  let s2 = C.dropWhile C.isSpace s1
                  return (n, s2)
main = do [method] <- getArgs
          let reader = case method of
                "string" -> readString
                "text" -> readText
                "char8" -> readChar8
                "hand" -> readHand
          print =<< sum' <$> reader
where:
./Sum string <data1.txt # 54.3 secs
./Sum text <data1.txt # 2.29 secs
./Sum char8 <data1.txt # 1.34 secs
./Sum hand <data1.txt # 0.91 secs

Expanding the abbreviated words from a file in Haskell

I am new to working with files in Haskell. I wrote some code to check for occurrences of words in a .c file. The words are listed in a .txt file.
For example:
abbreviations.txt
ix=index
ctr=counter
tbl=table
Another file is:
main.c
main ()
{
ix = 1
for (ctr =1; ctr < 10; ctr++)
{
tbl[ctr] = ix
}
}
On encountering ix it should be expanded to index, and likewise for ctr and tbl.
This is the code I wrote to check for occurrences (it does not yet replace the words it finds):
import System.Environment
import System.IO
import Data.Char
import Control.Monad
import Data.Set
main = do
    s <- getLine
    f <- readFile "abbreviations.txt"
    g <- readFile s
    let dict = fromList (lines f)
    mapM_ (spell dict) (words g)
spell d w = when (w `member` d) (putStrLn w)
When I run the code, it produces no output.
Instead of the code above, I also tried reading the file with hGetLine and converting each line into a list of words using words:
getLines' h = do
    isEOF <- hIsEOF h
    if isEOF then
        return ()
    else
        do
            line <- hGetLine h
            list <- remove (return (words line))
            getLines' h
            -- print list
main = do
    inH <- openFile "abbreviations.txt" ReadMode
    getLines' inH
    hClose inH
remove [] = []
remove (x:xs) | x == "=" = remove xs
              | otherwise = x : remove xs
But it gives me errors relating to IO (). Is there any other way I could do this?
Where am I going wrong?
Thank you for any help.
First, there is a problem with your spell function. It should also have an else clause with it:
spell :: (Show a, Ord a) => Set a -> a -> IO ()
spell d w = if w `member` d
                then print w
                else return ()
Also, note that I have changed your putStrLn to print and added a type signature to your code.
On executing the code it is giving no output.
That's because it always takes the else branch in your spell function. If you trace the execution of your program, you will see that your dict variable actually contains this Set: ["ctr=counter","ix=index","tbl=table"], which does not contain any of the words in main.c. I hope this is sufficient to get you started.
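One possible next step, sketched here under the assumption that every line of abbreviations.txt has the form short=long: split each line at the first '=' and store the pairs in a Map, so that lookups use the short form as the key. The names parseAbbrevs, expandWord and expandLine are invented for this example.

```haskell
import qualified Data.Map as M

-- Turn lines such as "ix=index" into a map from abbreviation to expansion.
parseAbbrevs :: String -> M.Map String String
parseAbbrevs = M.fromList . map split . lines
  where
    split l = let (short, rest) = break (== '=') l
              in (short, drop 1 rest)

-- Replace a word by its expansion, or leave it alone if it is not listed.
expandWord :: M.Map String String -> String -> String
expandWord dict w = M.findWithDefault w w dict

-- Expand every whitespace-separated word on one line.
expandLine :: M.Map String String -> String -> String
expandLine dict = unwords . map (expandWord dict) . words
```

Because it splits on whitespace only, a token such as tbl[ctr] is treated as a single word and left unchanged, and runs of spaces are collapsed; handling C syntax properly would need a real tokenizer.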

What is the Haskell idiom for walking a file and filling a structure when only some of the data is interesting?

Often I find I need to parse a little bit of text. Usually the text is not lines of uniform data like CSV; rather, it is more unstructured. So the goal is not to turn each line into a Haskell data type but to gather up data into a structure.
In an imperative language I would write something like this.
values = {} # could just as easily be a class or C struct
for line in input_lines:
    if line matches A:
        parse out interesting piece
        values[A] = parsed chunk
    elif line matches B:
        parse out interesting piece
        values[B] = parsed chunk
    ...
    elif line matches Z:
        parse out interesting piece
        values[Z] = parsed chunk
        break # we know there is nothing else after this
do something with values
I wrote a bit of Haskell this morning to do the same thing using foldr.
This parses the output of rsync --stats. A sample file looks like this.
Number of files: 1
Number of files transferred: 0
Total file size: 4953701 bytes
Total transferred file size: 0 bytes
Literal data: 10 bytes
Matched data: 230 bytes
File list size: 43
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 11
Total bytes received: 57
sent 11 bytes received 57 bytes 12.36 bytes/sec
total size is 4953701 speedup is 72848.54
Small and simple to demonstrate my problem. This particular file format is representative of this recurring style of problem, where I want to quickly read 3 or 5 bits from a file and do something else with the results. In an imperative language I'd just toss them into a few variables, a dictionary, something. The Haskell below is my attempt at a similar approach.
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map as M
import qualified Data.Text as T
import Data.Text (Text)
import qualified Data.Text.IO as TIO
import Data.Text.Read (decimal)
import System.Environment (getArgs)
stats_map :: M.Map Text Int
stats_map = foldr (uncurry M.insert) M.empty [("Total file size", 1),
                                              ("Literal data", 2),
                                              ("Matched data", 3)]
getStatsMap :: Text -> M.Map Text Integer -> M.Map Text Integer
getStatsMap t rm = doMatch chunks rm
  where
    chunks = [ T.strip chunk | chunk <- T.splitOn ":" t ]
    doMatch :: [Text] -> M.Map Text Integer -> M.Map Text Integer
    doMatch (f1:f2:_) rm' =
        case M.lookup f1 stats_map of
            (Just _) -> case decimal . head . T.words $ f2 of
                            Left _ -> rm'
                            Right (x,_) -> M.insert f1 x rm'
            Nothing -> rm'
    doMatch _ rm' = rm'
parseStats :: [Text] -> M.Map Text Integer
parseStats ts = foldr getStatsMap M.empty ts
readStats :: FilePath -> IO [Text]
readStats filename = TIO.readFile filename >>= return . T.lines
main :: IO ()
main = do
    [filename] <- getArgs
    lines <- readStats filename
    putStrLn . show . parseStats $ lines
Unlike in the imperative version, I cannot break out of the foldr, though.
Laziness cannot rescue me here. Parsec, attoparsec and friends are overkill and not exactly what I am looking for in this kind of task.
How can I approach this common imperative task in a more Haskell way?
I've gone for simple data structures to try to emphasise that the behaviour's there in the standard ones if you want it:
First version - using catMaybes and take to ignore irrelevant data and shortcut:
import Data.Maybe (catMaybes)
import Data.Char (isDigit)
import Control.Monad (msum)
-- maybe get an int if the key matches before :
get :: String -> String -> Maybe Int
get key input = let (l,r) = break (==':') input in
                if l == key then Just . read . filter isDigit $ r
                            else Nothing
-- get any that match
getAny :: [String] -> String -> Maybe Int
getAny keys input = msum $ map (flip get input) keys
-- get all that match at least one
getThese :: [String] -> String -> [Int]
getThese keys = take (length keys) . catMaybes . map (getAny keys) . lines
This gives you the output you were after:
fmap (getThese ["Total file size","Literal data","Matched data"]) (readFile "example.txt") >>= print
[4953701,10,230]
and we can check that it's shortcutting by feeding it a bomb to eat:
> getThese ["a"] (unlines ["no","a: 5",undefined])
[5]
Sometimes recursion is simpler
Pick out one element for each predicate in order:
oneEach :: [(a->Bool)] -> [a] -> [a]
oneEach [] _ = []
oneEach _ [] = error "oneEach: run out of input while still looking"
oneEach qs@(p:ps) (i:is) | p i       = i : oneEach ps is
                         | otherwise = oneEach qs is
Compose some functions to split the string and pull out the ones we wanted, then read the data. This assumes you want all the digits to the right of the : as your Int
getInOrder :: [String] -> String -> [Int]
getInOrder keys = map (read.filter isDigit.snd)
                . oneEach (map ((.fst).(==)) keys)
                . map (break (==':'))
                . lines
which works:
main = fmap (getInOrder ["Total file size","Literal data","Matched data"]) (readFile "example.txt") >>= print
[4953701,10,230]
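If a keyed result is wanted, closer to the dictionary in the imperative pseudocode, the same early exit can be sketched with explicit recursion over the lines and a Map as the accumulator. This is an illustration only: collect is a made-up name, the keys are assumed to be distinct, and the value is taken to be the first run of digits after the colon.

```haskell
import qualified Data.Map as M
import Data.Char (isDigit)

-- Walk the lines, recording the number that follows each wanted key,
-- and stop as soon as every key has been seen (the "break").
collect :: [String] -> [String] -> M.Map String Int
collect keys = go M.empty
  where
    go acc _ | M.size acc == length keys = acc   -- all found: ignore the rest
    go acc [] = acc                              -- input exhausted
    go acc (l:ls) =
      let (k, rest) = break (== ':') l
          digits    = takeWhile isDigit (dropWhile (not . isDigit) rest)
      in if k `elem` keys && not (null digits)
           then go (M.insert k (read digits) acc) ls
           else go acc ls
```

Feeding it a list whose tail is undefined shows that the remaining lines are never inspected once all keys are filled in.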
This version is primitive in some ways (hard codes some things, doesn't handle ordering), but may be more readable:
import System.Environment (getArgs)
import Data.List.Utils
import Data.Char
main = do
    [filename] <- getArgs
    txt <- readFile filename
    let ls = lines txt
    let ils = filter interestingLine ls
    putStrLn $ show $ map fmt (filter (/="") ils)
interestingLine l = startswith "Literal data" l
                 || startswith "Matched data" l
                 || startswith "Total file size" l
fmt :: String -> (String,Int)
fmt l | startswith "Literal data" l = (take 14 l,(read $ filter isNumber l))
      | startswith "Matched data" l = (take 14 l,(read $ filter isNumber l))
      | startswith "Total file size" l = (take 17 l,(read $ filter isNumber l))
      | otherwise = error "fmt: unmatched line, look also at interestingLine"

How do I parse a matrix of integers in Haskell?

So I've read the theory, now trying to parse a file in Haskell - but am not getting anywhere. This is just so weird...
Here is how my input file looks:
m n
k1, k2...
a11, ...., a1n
a21,.... a2n
...
am1... amn
Where m,n are just integers, K = [k1, k2...] is a list of integers, and a11..amn is a "matrix" (a list of lists): A=[[a11,...a1n], ... [am1... amn]]
Here is my quick python version:
def parse(filename):
    """
    Input of the form:
    m n
    k1, k2...
    a11, ...., a1n
    a21,.... a2n
    ...
    am1... amn
    """
    f = open(filename)
    (m,n) = f.readline().split()
    m = int(m)
    n = int(n)
    K = [int(k) for k in f.readline().split()]
    # Matrix - list of lists
    A = []
    for i in range(m):
        row = [float(el) for el in f.readline().split()]
        A.append(row)
    return (m, n, K, A)
And here is how (not very) far I got in Haskell:
import System.Environment
import Data.List
main = do
    (fname:_) <- getArgs
    putStrLn fname --since putStrLn goes to IO ()monad we can't just apply it
    parsed <- parse fname
    putStrLn parsed
parse fname = do
    contents <- readFile fname
    -- ,,,missing stuff... ??? how can I get first "element" and match on it?
    return contents
I am getting confused by monads (and the context that they trap me in!) and by the do statement. I really want to write something like this, but I know it's wrong:
firstLine <- contents.head
(m,n) <- map read (words firstLine)
because contents is not a list - but a monad.
Any help on the next step would be great.
So I've just discovered that you can do:
liftM lines . readFile
to get a list of lines from a file. However, the example still only transforms the ENTIRE file, and doesn't use just the first or the second line...
The very simple version could be:
import Control.Monad (liftM)
-- this operates purely on list of strings
-- and also will fail horribly when passed something that doesn't
-- match the pattern
parse_lines :: [String] -> (Int, Int, [Int], [[Int]])
parse_lines (mn_line : ks_line : matrix_lines) = (m, n, ks, matrix)
    where [m, n] = read_ints mn_line
          ks = read_ints ks_line
          matrix = parse_matrix matrix_lines
-- this here is to loop through remaining lines to form a matrix
parse_matrix :: [String] -> [[Int]]
parse_matrix lines = parse_matrix' lines []
    where parse_matrix' [] acc = reverse acc
          parse_matrix' (l : ls) acc = parse_matrix' ls $ (read_ints l) : acc
-- this here is to give proper signature for read
read_ints :: String -> [Int]
read_ints = map read . words
-- this reads the file contents and lifts the result into IO
parse_file :: FilePath -> IO (Int, Int, [Int], [[Int]])
parse_file filename = do
    file_lines <- (liftM lines . readFile) filename
    return $ parse_lines file_lines
You might want to look into Parsec for fancier parsing, with better error handling.
*Main Control.Monad> parse_file "test.txt"
(3,3,[1,2,3],[[1,2,3],[4,5,6],[7,8,9]])
An easy to write solution
import Control.Monad (replicateM)
-- Read space separated words on a line from stdin
readMany :: Read a => IO [a]
readMany = fmap (map read . words) getLine
parse :: IO (Int, Int, [Int], [[Int]])
parse = do
    [m, n] <- readMany
    ks <- readMany
    xss <- replicateM m readMany
    return (m, n, ks, xss)
Let's try it:
*Main> parse
2 2
123 321
1 2
3 4
(2,2,[123,321],[[1,2],[3,4]])
While the code I presented is quite expressive (that is, you get work done quickly with little code), it has some bad properties. Still, if you are learning Haskell and haven't started with parser libraries yet, I think this is the way to go.
Two bad properties of my solution:
All code is in IO, nothing is testable in isolation
The error handling is very bad, as you see the pattern matching is very aggressive in [m, n]. What happens if we have 3 elements on the first line of the input file?
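Both points can be addressed without a parser library by keeping the parsing pure and returning Either. What follows is a rough sketch, not a replacement for the code above: parseInput, readInts and the error messages are invented for illustration, and the input is assumed to be whitespace-separated as in the Python version.

```haskell
import Text.Read (readMaybe)

type Parsed = (Int, Int, [Int], [[Int]])

-- Parse one line of whitespace-separated integers, or report the bad line.
readInts :: String -> Either String [Int]
readInts l = maybe (Left ("not integers: " ++ show l)) Right
                   (mapM readMaybe (words l))

-- Pure parser for the whole input; it does no IO, so it can be tested directly.
parseInput :: String -> Either String Parsed
parseInput input = case lines input of
  (l1 : l2 : rest) -> do
    dims <- readInts l1
    (m, n) <- case dims of
      [m, n] -> Right (m, n)
      _      -> Left "first line must hold exactly two integers"
    ks   <- readInts l2
    rows <- mapM readInts (take m rest)
    if length rows == m && all ((== n) . length) rows
      then Right (m, n, ks, rows)
      else Left "matrix does not have m rows of n columns"
  _ -> Left "expected at least two lines"
```

The IO part then shrinks to fmap parseInput (readFile fname), and a first line with three elements yields a Left value instead of a pattern-match failure.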
liftM is not magic! You would think it does some arcane thing to lift a function f into a monad but it is actually just defined as:
liftM f x = do
    y <- x
    return (f y)
We could actually use liftM to do what you wanted to, that is:
[m,n] <- liftM (map read . words . head . lines) (readFile fname)
but what you are looking for are let statements:
parseLine = map read . words
parse fname = do
    (x:y:xs) <- liftM lines (readFile fname)
    let [m,n] = parseLine x
    let ks = parseLine y
    let matrix = map parseLine xs
    return (m,n,ks,matrix)
As you can see, we can use let to mean variable assignment rather than monadic computation. In fact, let statements are just let expressions when we desugar the do notation:
parse fname =
    liftM lines (readFile fname) >>= (\(x:y:xs) ->
        let [m,n] = parseLine x
            ks = parseLine y
            matrix = map parseLine xs
        in return (m,n,ks,matrix) )
A Solution Using a Parsing Library
Since you'll probably have a number of people responding with code that parses strings of Ints into [[Int]] (map (map read . words) . lines $ contents), I'll skip that and introduce one of the parsing libraries. If you were to do this task for real work you'd probably use such a library that parses ByteString (instead of String, which means your IO reads everything into a linked list of individual characters).
import System.Environment
import Control.Monad
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString as B
First, I imported the Attoparsec and bytestring libraries. You can see these libraries and their documentation on hackage and install them using the cabal tool.
main = do
    (fname:_) <- getArgs
    putStrLn fname
    parsed <- parseX fname
    print parsed
main is basically unchanged.
parseX :: FilePath -> IO (Int, Int, [Int], [[Int]])
parseX fname = do
    bs <- B.readFile fname
    let res = parseOnly parseDrozzy bs
    -- We spew the error messages right here
    either (error . show) return res
parseX (renamed from parse to avoid a name collision) uses the bytestring library's readFile, which reads the file into packed, contiguous bytes instead of into cells of a linked list. After parsing, I use a little shorthand to return the result if the parser returned Right result, or raise an error if the parser returned a value of Left someErrorMessage.
-- Helper functions, more basic than you might think, but let's ignore it
sint = skipSpace >> int
int = liftM floor number
parseDrozzy :: Parser (Int, Int, [Int], [[Int]])
parseDrozzy = do
    m <- sint
    n <- sint
    skipSpace
    ks <- manyTill sint endOfLine
    arr <- count m (count n sint)
    return (m,n,ks,arr)
The real work then happens in parseDrozzy. We get our m and n Int values using the above helper. In most Haskell parsing libraries we must explicitly handle whitespace - so I skip the newline after n to get to our ks. ks is just all the int values before the next newline. Now we can actually use the previously specified number of rows and columns to get our array.
Technically speaking, that final bit arr <- count m (count n sint) doesn't follow your format. It will grab n ints even if it means going to the next line. We could copy Python's behavior (not verifying the number of values in a row) using count m (manyTill sint endOfLine) or we could check for each end of line more explicitly and return an error if we are short on elements.
From Lists to a Matrix
Lists of lists are not 2 dimensional arrays - the space and performance characteristics are completely different. Let's pack our list into a real matrix using Data.Array.Repa (import Data.Array.Repa). This will allow us to access the elements of the array efficiently as well as perform operations on the entire matrix, optionally spreading the work among all the available CPUs.
Repa defines the dimensions of your array using a slightly odd syntax. If your row and column lengths are in variables m and n, then Z :. m :. n is much like the C declaration int arr[m][n]. For the one-dimensional example, ks, we have:
fromList (Z :. (length ks)) ks
Which changes our type from [Int] to Array DIM1 Int.
For the two dimensional array we have:
let matrix = fromList (Z :. m :. n) (concat arr)
And change our type from [[Int]] to Array DIM2 Int.
So there you have it. A parsing of your file format into an efficient Haskell data structure using production-oriented libraries.
What about something simple like this?
parse :: String -> (Int, Int, [Int], [[Int]])
parse stuff = (m, n, ks, xss)
    where (line1:line2:rest) = lines stuff
          readMany = map read . words
          (m:n:_) = readMany line1
          ks = readMany line2
          xss = take m $ map (take n . readMany) rest
main :: IO ()
main = do
    stuff <- getContents
    let (m, n, ks, xss) = parse stuff
    print m
    print n
    print ks
    print xss
