Haskell triangle microbenchmark

Haskell triangle microbenchmark - haskell

Consider this simple "benchmark":
n :: Int
n = 1000
main = do
print $ length [(a,b,c) | a<-[1..n],b<-[1..n],c<-[1..n],a^2+b^2==c^2]
and appropriate C version:
#include <stdio.h>
int main(void)
{
int a,b,c, N=1000;
int cnt = 0;
for (a=1;a<=N;a++)
for (b=1;b<=N;b++)
for (c=1;c<=N;c++)
if (a*a+b*b==c*c) cnt++;
printf("%d\n", cnt);
}
Compilation:
Haskell version is compiled as: ghc -O2 triangle.hs (ghc 7.4.1)
C version is compiled as: gcc -O2 -o triangle-c triangle.c (gcc 4.6.3)
Run times:
Haskell: 4.308s real
C: 1.145s real
Is it OK behavior even for such a simple and maybe well optimizable program that Haskell is almost 4 times slower? Where does Haskell waste time?

The Haskell version is wasting time allocating boxed integers and tuples.
You can verify this by for example running the haskell program with the flags +RTS -s. For me the outputted statistics include:
80,371,600 bytes allocated in the heap
A straightforward encoding of the C version is faster since the compiler can use unboxed integers and skip allocating tuples:
n :: Int
n = 1000
main = do
print $ f n
f :: Int -> Int
f max = go 0 1 1 1
where go cnt a b c
| a > max = cnt
| b > max = go cnt (a+1) 1 1
| c > max = go cnt a (b+1) 1
| a^2+b^2==c^2 = go (cnt+1) a b (c+1)
| otherwise = go cnt a b (c+1)
See:
51,728 bytes allocated in the heap
The running time of this version is 1.920s vs. 1.212s for the C version.

I don't know how much your "bench" is relevant.
I agree that the list-comprehension syntax is "nice" to use, but if you want to compare the performances of the two languages, you should maybe compare them on a fairer test.
I mean, creating a list of possibly a lot of elements and then calculating it's length is nothing like incrementing a counter in a (triple loop).
So maybe haskell has some nice optimizations which detects what you are doing and never creates the list, but I wouldn't code relying on that, and you probably shouldn't either.
I don't think you would code your program like that if you needed to count rapidly, so why do it for this bench?

Haskell can be optimized quite well — but you need the proper techniques, and you need to know what you're doing.
This list comprehension syntax is elegant, yet wasteful. You should read the appropriate chapter of RealWorldHaskell in order to find out more about your profiling opportunities. In this exact case, you create a lot of list spines and boxed Ints for no good reason at all. See here:
You should definitely do something about that. EDIT: #opqdonut just posted a good answer on how to make this faster.
Just remember next time to profile your application before comparing any benchmarks. Haskell makes it easy to write idiomatic code, but it also hides a lot of complexity.

Related

Permutation to disjoint cycles in Haskell

I was trying to implement permutation to cycles in Haskell without using Monad. The problem is as follow: given a permutation of numbers [1..n], output the correspondence disjoint cycles. The function is defined like
permToCycles :: [Int] -> [[Int]]
For the input:
permToCycles [3,5,4,1,2]
The output should be
[[3,4,1],[5,2]]
By the definition of cyclic permutation, the algorithm itself is straightforward. Since [3,5,4,1,2] is a permutation of [1,2,3,4,5], we start from the first element 3 and follow the orbit until we get back to 3. In this example, we have two cycles 3 -> 4 -> 1 -> 3. Continue to do so until we traverse all elements. Thus the output is [[3,4,1],[5,2]].
Using this idea, it is fairly easy to implement in any imperative language, but I have trouble with doing it in Haskell. I find something similar in the module Math.Combinat.Permutations, but the implementation of function permutationToDisjointCycles uses Monad, which is not easy to understand as I'm a beginner.
I was wondering if I could implement it without Monad. Any help is appreciated.
UPDATE: Here is the function implemented in Python.
def permToCycles(perm):
pi_dict = {i+1: perm[i]
for i in range(len(perm))} # permutation as a dictionary
cycles = []
while pi_dict:
first_index = next(iter(pi_dict)) # take the first key
this_elem = pi_dict[first_index] # the first element in perm
next_elem = pi_dict[this_elem] # next element according to the orbit
cycle = []
while True:
cycle.append(this_elem)
# delete the item in the dict when adding to cycle
del pi_dict[this_elem]
this_elem = next_elem
if next_elem in pi_dict:
# continue the cycle
next_elem = pi_dict[next_elem]
else:
# end the cycle
break
cycles.append(cycle)
return cycles
print(permToCycles([3, 5, 4, 1, 2]))
The output is
[[3,4,1],[5,2]]
I think the main obstacle when implementing it in Haskell is how to trace the marked (or unmarked) elements. In Python, it can easily be done using a dictionary as I showed above. Also in functional programming, we tend to use recursion to replace loops, but here I have trouble with thinking the recursive structure of this problem.

Let's start with the basics. You hopefully started with something like this:
permutationToDisjointCycles :: [Int] -> [[Int]]
permutationToDisjointCycles perm = ...
We don't actually want to recur on the input list so much as we want to use an index counter. In this case, we'll want a recursive helper function, and the next step is to just go ahead and call it, providing whatever arguments you think you'll need. How about something like this:
permutationToDisjointCycles perm = cycles [] 0
where
cycles :: [Int] -> Int -> [[Int]]
cycles seen ix = ...
Instead of declaring a pi_dict variable like in Python, we'll start with a seen list as an argument (I flipped it around to keeping track of what's been seen because that ends up being a little easier). We do the same with the counting index, which I here called ix. Let's consider the cases:
cycles seen ix
| ix >= length perm = -- we've reached the end of the list
| ix `elem` seen = -- we've already seen this index
| otherwise = -- we need to generate a cycle.
That last case is the interesting one and corresponds to the inner while loop of the Python code. Another while loop means, you guessed it, more recursion! Let's make up another function that we think will be useful, passing along as arguments what would have been variables in Python:
| otherwise = let c = makeCycle ix ix in c : cycles (c ++ seen) (ix+1)
makeCycle :: Int -> Int -> [Int]
makeCycle startIx currentIx = ...
Because it's recursive, we'll need a base case and recursive case (which corresponds to the if statement in the Python code which either breaks the loop or continues it). Rather than use the seen list, it's a little simpler to just check if the next element equals the starting index:
makeCycle startIx currentIx =
if next == start
then -- base case
else -- recursive call, where we attach an index onto the cycle and recur
where next = perm !! i
I left a couple holes that need to be filled in as an exercise, and this version works on 0-indexed lists rather than 1-indexed ones like your example, but the general shape of the algorithm is there.
As a side note, the above algorithm is not super efficient. It uses lists for both the input list and the "seen" list, and lookups in lists are always O(n) time. One very simple performance improvement is to immediately convert the input list perm into an array/vector, which has constant time lookups, and then use that instead of perm !! i at the end.
The next improvement is to change the "seen" list into something more efficient. To match the idea of your Python code, you could change it to a Set (or even a HashSet), which has logarithmic time lookups (or constant with a hashset).
The code you found Math.Combinat.Permutations actually uses an array of Booleans for the "seen" list, and then uses the ST monad to do imperative-like mutation on that array. This is probably even faster than using Set or HashSet, but as you yourself could tell, readability of the code suffers a bit.

Essence of Laziness. Haskell [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I am learning a haskell for a few days and the laziness is something like buzzword. Because of the fact I am not familiar with laziness ( I have been working mainly with non-functional languages ) it is not easy concept for me.
So, I am asking for any excerise / example which show me what laziness is in the fact.
Thanks in advance ;)

In Haskell you can create an infinite list. For instance, all natural numbers:
[1,2..]
If Haskell loaded all the items in memory at once that wouldn't be possible. To do so you would need infinite memory.
Laziness allows you to get the numbers as you need them.

Here's something interesting: dynamic programming, the bane of every intro. algorithms student, becomes simple and natural when written in a lazy and functional language. Take the example of string edit distance. This is the problem of measuring how similar two DNA strands are or how many bytes changed between two releases of a binary executable or just how 'different' two strings are. The dynamic programming algorithm, expressed mathematically, is simple:
let:
• d_{i,j} be the edit distance of
the first string at index i, which has length m
and the second string at index j, which has length m
• let a_i be the i^th character of the first string
• let b_j be the j^th character of the second string
define:
d_{i,0} = i (0 <= i <= m)
d_{0,j} = j (0 <= j <= n)
d_{i,j} = d_{i - 1, j - 1} if a_i == b_j
d_{i,j} = min { if a_i != b_j
d_{i - 1, j} + 1 (delete)
d_{i, j - 1} + 1 (insert)
d_{i - 1, j - 1} + 1 (modify)
}
return d_{m, n}
And the algorithm, expressed in Haskell, follows the same shape of the algorithm:
distance a b = d m n
where (m, n) = (length a, length b)
a' = Array.listArray (1, m) a
b' = Array.listArray (1, n) b
d i 0 = i
d 0 j = j
d i j
| a' ! i == b' ! j = ds ! (i - 1, j - 1)
| otherwise = minimum [ ds ! (i - 1, j) + 1
, ds ! (i, j - 1) + 1
, ds ! (i - 1, j - 1) + 1
]
ds = Array.listArray bounds
[d i j | (i, j) <- Array.range bounds]
bounds = ((0, 0), (m, n))
In a strict language we wouldn't be able to define it so straightforwardly because the cells of the array would be strictly evaluated. In Haskell we're able to have the definition of each cell reference the definitions of other cells because Haskell is lazy – the definitions are only evaluated at the very end when d m n asks the array for the value of the last cell. A lazy language lets us set up a graph of standing dominoes; it's only when we ask for a value that we need to compute the value, which topples the first domino, which topples all the other dominoes. (In a strict language, we would have to set up an array of closures, doing the work that the Haskell compiler does for us automatically. It's easy to transform implementations between strict and lazy languages; it's all a matter of which language expresses which idea better.)
The blog post does a much better job of explaining all this.

So, I am asking for any excerise / example which show me what laziness is in the fact.
Click on Lazy on haskell.org to get the canonical example. There are many other examples just like it to illustrate the concept of delayed evaluation that benefits from not executing some parts of the program logic. Lazy is certainly not slow, but the opposite of eager evaluation common to most imperative programming languages.

Laziness is a consequence of non-strict function evaluation. Consider the "infinite" list of 1s:
ones = 1:ones
At the time of definition, the (:) function isn't evaluated; ones is just a promise to do so when it is necessary. Such a time would be when you pattern match:
myHead :: [a] -> a
myHead (x:rest) = x
When myHead ones is called, x and rest are needed, but the pattern match against 1:ones simply binds x to 1 and rest to ones; we don't need evaluate ones any further at this time, so we don't.
The syntax for infinite lists, using the .. "operator" for arithmetic sequences, is sugar for calls to enumFrom and enumFromThen. That is
-- An infintite list of ones
ones = [1,1..] -- enumFromThen 1 1
-- The natural numbers
nats = [1..] -- enumFrom 1
so again, laziness just comes from the non-strict evaluation of enumFrom.

Unlike with other languages, Haskell decouples the creation and definition of an object.... You can easily watch this in action using Debug.Trace.
You can define a variable like this
aValue = 100
(the value on the right hand side could include a complicated evaluation, but let's keep it simple)
To see if this code ever gets called, you can wrap the expression in Debug.Trace.trace like this
import Debug.Trace
aValue = trace "evaluating aValue" 100
Note that this doesn't change the definition of aValue, it just forces the program to output "evaluating aValue" whenever this expression is actually created at runtime.
(Also note that trace is considered unsafe for production code, and should only be used to debug).
Now, try two experiments.... Write two different mains
main = putStrLn $ "The value of aValue is " ++ show aValue
and
main = putStrLn "'sup"
When run, you will see that the first program actually creates aValue (you will see the "creating aValue" message, while the second does not.
This is the idea of laziness.... You can put as many definitions in a program as you want, but only those that are used will be actually created at runtime.
The real use of this can be seen with objects of infinite size. Many lists, trees, etc. have an infinite number of elements. Your program will use only some finite number of values, but you don't want to muddy the definition of the object with this messy fact. Take for instance the infinite lists given in other answers here....
[1..] -- = [1,2,3,4,....]
You can again see laziness in action here using trace, although you will have to write out a variant of [1..] in an expanded form to do this.
f::Int->[Int]
f x = trace ("creating " ++ show x) (x:f (x+1)) --remember, the trace part doesn't change the expression, it is just used for debugging
Now you will see that only the elements you use are created.
main = putStrLn $ "the list is " ++ show (take 4 $ f 1)
yields
creating 1
creating 2
creating 3
creating 4
the list is [1,2,3,4]
and
main = putStrLn "yo"
will not show any item being created.

haskell implementation of a sequence

I just started Haskell and I'm struggling!!!
So I need to create a list om Haskell that has the formula
F(n) = (F(n-1)+F(n-2)) * F(n-3)/F(n-4)
and I have F(0) =1, F(1)=1,F(2)=1,F(3)=1
So I thought of initializing the first 4 elements of the list and then have a create a recursive function that runs for n>4 and appends the values to the list.
My code looks like this
let F=[1,1,1,1]
fib' n F
| n<4="less than 4"
|otherwise = (F(n-1)+F(n-2))*F(n-3)/F(n-4) : fib (n-1) F
My code looks conceptually right to me(not sure though), but I get an incorrect indentation error when i compile it. And am I allowed to initialize the elements of the list in the way that I have?

First off, variables in Haskell have to be lower case. Secondly, Haskell doesn't let you mix integers and fractions so freely as you may be used to from untyped or barely-typed languages. If you want to convert from an Int or an Integer to, say, a Double, you'll need to use fromIntegral. Thirdly, you can't stick a string in a context where you need a number. Fourthly, you may or may not have an indentation problem—be sure not to use tabs in your Haskell files, and to use the GHC option -fwarn-tabs to be sure.
Now we get to the heart of the matter: you're going about this all somewhat wrong. I'm going to give you a hint instead of a full answer:
thesequence = 1 : 1 : 1 : 1 : -- Something goes here that *uses* thesequence

Reading Unformatted Binary file: Unexpected output - Fortran90

Preface: I needed to figure out the structure of a binary grid_data_file. From the Fortran routines I figured that the first record consists of 57 bytes and has information in the following order.
No. of the file :: integer*4
File name :: char*16
file status :: char*3 (i.e. new, old, tmp)
.... so forth (rest is clear from write statement in the program)
Now for the testing I wrote a simple program as follows: (I haven't included all the parameters)
Program testIO
implicit none
integer :: x, nclat, nclon
character :: y, z
real :: lat_gap, lon_gap, north_lat, west_lat
integer :: gridtype
open(11, file='filename', access='direct', form='unformatted', recl='200')
read(11, rec=1) x,y,z,lat_gap,lon_gap, north_lat,west_lat, nclat, nclon, gridtyp
write(*,*) x,y,z,lat_gap,lon_gap, north_lat,west_lat, nclat, nclon, gridtyp
close(11)
END
To my surprise, when I change the declaration part to
integer*4 :: x, nclat, nclon
character*16 :: y
character*3 :: z
real*4 :: lat_gap, lon_gap, north_lat, west_lat
integer*2 :: gridtype
It gives me SOME correct information, albeit not all! I can't understand this. It would help me to improve my Fortran knowledge if someone explains this phenomenon.
Moreover, I can't use ACCESS=stream due to machine being old and not supported, so I conclude that above is the only possibility to figure out the file structure.

From your replies and what others have commented, I think your problem might be a misunderstanding of what a Fortran "record" is:
You say that you have a binary file where each entry (you said record, but more on that later) is 57 bytes.
The problem is that a "record" in Fortran I/O is not what you would expect it is coming from a C (or anywhere else, really) background. See the following document from Intel, which gives a good explanation of the different access modes:
https://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/fortran/lin/bldaps_for/common/bldaps_rectypes.htm
In short, it has extra data (a header) describing the data in each entry.
Moreover, I can't use ACCESS=stream due to machine being old and not supported, so I conclude that above is the only possibility to figure out the file structure. Any guidance would be a big help!
If you can't use stream, AFAIK there is really no simple and painless way to read binary files with no record information.
A possible solution which requires a C compiler is to do IO in a C function that you call from Fortran, "minimal" example:
main.f90:
program main
integer, parameter :: dp = selected_real_kind(15)
character(len=*), parameter :: filename = 'test.bin'
real(dp) :: val
call read_bin(filename, val)
print*, 'Read: ', val
end program
read.c:
#include <string.h>
#include <stdio.h>
void read_bin_(const char *fname, double *ret, unsigned int len)
{
char buf[256];
printf("len = %d\n", len);
strncpy(buf, fname, len);
buf[len] = '\0'; // fortran strings are not 0-terminated
FILE* fh = fopen(buf, "rb");
fread(ret, sizeof(double), 1, fh);
fclose(fh);
}
Note that there is an extra parameter needed in the end and some string manipulation because of the way Fortran handles strings, which differs from C.
write.c:
#include <stdio.h>
int main() {
double d = 1.234;
FILE* fh = fopen("test.bin", "wb");
fwrite(&d, sizeof(double), 1, fh);
fclose(fh);
}
Compilation instructions:
gcc -o write write.c
gcc -c -g read.c
gfortran -g -o readbin main.f90 read.o
Create binary file with ./write, then see how the Fortran code can read it back with ./readbin.
This can be extended for different data types to basically emulate access=stream. In the end, if you can recompile the original Fortran code to output the data file differently, this will be the easiest solution, as this one is pretty much a crude hack.
Lastly, a tip for getting into unknown data formats: The tool od is your friend, check its manpage. It can directly convert binary represantations into a variety of different native datatypes. Try with the above example (the z adds the character representation in the right-hand column, not very useful here, in general yes):
od -t fDz test.bin

How to get good performance when writing a list of integers from 1 to 10 million to a file?

question
I want a program that will write a sequence like,
1
...
10000000
to a file. What's the simplest code one can write, and get decent performance? My intuition is that there is some lack-of-buffering problem. My C code runs at 100 MB/s, whereas by reference the Linux command line utility dd runs at 9 GB/s 3 GB/s (sorry for the imprecision, see comments -- I'm more interested in the big picture orders-of-magnitude though).
One would think this would be a solved problem by now ... i.e. any modern compiler would make it immediate to write such programs that perform reasonably well ...
C code
#include <stdio.h>
int main(int argc, char **argv) {
int len = 10000000;
for (int a = 1; a <= len; a++) {
printf ("%d\n", a);
}
return 0;
}
I'm compiling with clang -O3. A performance skeleton which calls putchar('\n') 8 times gets comparable performance.
Haskell code
A naiive Haskell implementation runs at 13 MiB/sec, compiling with ghc -O2 -optc-O3 -optc-ffast-math -fllvm -fforce-recomp -funbox-strict-fields. (I haven't recompiled my libraries with -fllvm, perhaps I need to do that.) Code:
import Control.Monad
main = forM [1..10000000 :: Int] $ \j -> putStrLn (show j)
My best stab with Haskell runs even slower, at 17 MiB/sec. The problem is I can't find a good way to convert Vector's into ByteString's (perhaps there's a solution using iteratees?).
import qualified Data.Vector.Unboxed as V
import Data.Vector.Unboxed (Vector, Unbox, (!))
writeVector :: (Unbox a, Show a) => Vector a -> IO ()
writeVector v = V.mapM_ (System.IO.putStrLn . show) v
main = writeVector (V.generate 10000000 id)
It seems that writing ByteString's is fast, as demonstrated by this code, writing an equivalent number of characters,
import Data.ByteString.Char8 as B
main = B.putStrLn (B.replicate 76000000 '\n')
This gets 1.3 GB/s, which isn't as fast as dd, but obviously much better.

Some completely unscientific benchmarking first:
All programmes have been compiled with the default optimisation level (-O3 for gcc, -O2 for GHC) and run with
time ./prog > outfile
As a baseline, the C programme took 1.07s to produce a ~76MB (78888897 bytes) file, roughly 70MB/s throughput.
The "naive" Haskell programme (forM [1 .. 10000000] $ \j -> putStrLn (show j)) took 8.64s, about 8.8MB/s.
The same with forM_ instead of forM took 5.64s, about 13.5MB/s.
The ByteString version from dflemstr's answer took 9.13s, about 8.3MB/s.
The Text version from dflemstr's answer took 5.64s, about 13.5MB/s.
The Vector version from the question took 5.54s, about 13.7MB/s.
main = mapM_ (C.putStrLn . C.pack . show) $ [1 :: Int .. 10000000], where C is Data.ByteString.Char8, took 4.25s, about 17.9MB/s.
putStr . unlines . map show $ [1 :: Int .. 10000000] took 3.06s, about 24.8MB/s.
A manual loop,
main = putStr $ go 1
where
go :: Int -> String
go i
| i > 10000000 = ""
| otherwise = shows i . showChar '\n' $ go (i+1)
took 2.32s, about 32.75MB/s.
main = putStrLn $ replicate 78888896 'a' took 1.15s, about 66MB/s.
main = C.putStrLn $ C.replicate 78888896 'a' where C is Data.ByteString.Char8, took 0.143s, about 530MB/s, roughly the same figures for lazy ByteStrings.
What can we learn from that?
First, don't use forM or mapM unless you really want to collect the results. Performancewise, that sucks.
Then, ByteString output can be very fast (10.), but if the construction of the ByteString to output is slow (3.), you end up with slower code than the naive String output.
What's so terrible about 3.? Well, all the involved Strings are very short. So you get a list of
Chunk "1234567" Empty
and between any two such, a Chunk "\n" Empty is put, then the resulting list is concatenated, which means all these Emptys are tossed away when a ... (Chunk "1234567" (Chunk "\n" (Chunk "1234568" (...)))) is built. That's a lot of wasteful construct-deconstruct-reconstruct going on. Speed comparable to that of the Text and the fixed "naive" String version can be achieved by packing to strict ByteStrings and using fromChunks (and Data.List.intersperse for the newlines). Better performance, slightly better than 6., can be obtained by eliminating the costly singletons. If you glue the newlines to the Strings, using \k -> shows k "\n" instead of show, the concatenation has to deal with half as many slightly longer ByteStrings, which pays off.
I'm not familiar enough with the internals of either text or vector to offer more than a semi-educated guess concerning the reasons for the observed performance, so I'll leave them out. Suffice it to say that the performance gain is marginal at best compared to the fixed naive String version.
Now, 6. shows that ByteString output is faster than String output, enough that in this case the additional work of packing is more than compensated. However, don't be fooled by that to believe that is always so. If the Strings to pack are long, the packing can take more time than the String output.
But ten million invocations of putStrLn, be it the String or the ByteString version, take a lot of time. It's faster to grab the stdout Handle just once and construct the output String in non-IO code. unlines already does well, but we still suffer from the construction of the list map show [1 .. 10^7]. Unfortunately, the compiler didn't manage to eliminate that (but it eliminated [1 .. 10^7], that's already pretty good). So let's do it ourselves, leading to 8. That's not too terrible, but still takes more than twice as long as the C programme.
One can make a faster Haskell programme by going low-level and directly filling ByteStrings without going through String via show, but I don't know if the C speed is reachable. Anyway, that low-level code isn't very pretty, so I'll spare you what I have, but sometimes one has to get one's hands dirty if speed matters.

Using lazy byte strings gives you some buffering, because the string will be written instantly and more numbers will only be produced as they are needed. This code shows the basic idea (there might be some optimizations that could be made):
import qualified Data.ByteString.Lazy.Char8 as ByteString
main =
ByteString.putStrLn .
ByteString.intercalate (ByteString.singleton '\n') .
map (ByteString.pack . show) $
([1..10000000] :: [Int])
I still use Strings for the numbers here, which leads to horrible slowdowns. If we switch to the text library instead of the bytestring library, we get access to "native" show functions for ints, and can do this:
import Data.Monoid
import Data.List
import Data.Text.Lazy.IO as Text
import Data.Text.Lazy.Builder as Text
import Data.Text.Lazy.Builder.Int as Text
main :: IO ()
main =
Text.putStrLn .
Text.toLazyText .
mconcat .
intersperse (Text.singleton '\n') .
map Text.decimal $
([1..10000000] :: [Int])
I don't know how you are measuring the "speed" of these programs (with the pv tool?) but I imagine that one of these procedures will be the fastest trivial program you can get.

If you are going for maximum performance, then it helps to take a holistic view; i.e., you want to write a function that maps from [Int] to series of system calls that write chunks of memory to a file.
Lazy bytestrings are good representation for a sequence of chunks of memory. Mapping a lazy bytestring to a series of systems calls that write chunks of memory is what L.hPut is doing (assuming an import qualified Data.ByteString.Lazy as L). Hence, we just need a means to efficiently construct the corresponding lazy bytestring. This is what lazy bytestring builders are good at. With the new bytestring builder (here is the API documentation), the following code does the job.
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Lazy.Builder (toLazyByteString, charUtf8)
import Data.ByteString.Lazy.Builder.ASCII (intDec)
import Data.Foldable (foldMap)
import Data.Monoid (mappend)
import System.IO (openFile, IOMode(..))
main :: IO ()
main = do
h <- openFile "/dev/null" WriteMode
L.hPut h $ toLazyByteString $
foldMap ((charUtf8 '\n' `mappend`) . intDec) [1..10000000]
Note that I output to /dev/null to avoid interference by the disk driver. The effort of moving the data to the OS remains the same. On my machine, the above code runs in 0.45 seconds, which is 12 times faster than the 5.4 seconds of your original code. This implies a throughput of 168 MB/s. We can squeeze out an additional 30% speed (220 MB/s) using bounded encodings].
import qualified Data.ByteString.Lazy.Builder.BasicEncoding as E
L.hPut h $ toLazyByteString $
E.encodeListWithB
((\x -> (x, '\n')) E.>$< E.intDec `E.pairB` E.charUtf8)
[1..10000000]
Their syntax looks a bit quirky because a BoundedEncoding a specifies the conversion of a Haskell value of type a to a bounded-length sequence of bytes such that the bound can be computed at compile-time. This allows functions such as E.encodeListWithB to perform some additional optimizations for implementing the actual filling of the buffer. See the the documentation of Data.ByteString.Lazy.Builder.BasicEncoding in the above link to the API documentation (phew, stupid hyperlink limit for new users) for more information.
Here is the source of all my benchmarks.
The conclusion is that we can get very good performance from a declarative solution provided that we understand the cost model of our implementation and use the right datastructures. Whenever constructing a packed sequence of values (e.g., a sequence of bytes represented as a bytestring), then the right datastructure to use is a bytestring Builder.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string