I have the following snippet:
import Network.MessagePackRpc.Server
ping :: String -> IO String
ping s = return s
main :: IO ()
main = do
  serve 8081 [ ("ping", fun ping) ]
Now, what I observe is that when I send it e.g. 100000 identical 1024-byte strings, the small snippet runs in approximately 2s. If I replace return s with e.g. return "the-1024-byte-string", it runs approximately 25% faster. I have tried this repeatedly, in both directions. I am really surprised that the impact is so huge. Does anyone have an explanation?
Returning a known (at compile time) constant could enable more inlining. One would have to check the generated code to be sure, however.
Recently, I started learning Haskell because I wanted to broaden my knowledge on functional programming and I must say I am really loving it so far. The resource I am currently using is the course 'Haskell Fundamentals Part 1' on Pluralsight. Unfortunately I have some difficulty understanding one particular quote of the lecturer about the following code and was hoping you guys could shed some light on the topic.
Accompanying Code
helloWorld :: IO ()
helloWorld = putStrLn "Hello World"
main :: IO ()
main = do
  helloWorld
  helloWorld
  helloWorld
The Quote
If you have the same IO action multiple times in a do-block, it will be run multiple times. So this program prints out the string 'Hello World' three times. This example helps illustrate that putStrLn is not a function with side effects. We call the putStrLn function once to define the helloWorld variable. If putStrLn had a side effect of printing the string, it would only print once and the helloWorld variable repeated in the main do-block wouldn't have any effect.
In most other programming languages, a program like this would print 'Hello World' only once, since the printing would happen when the putStrLn function was called. This subtle distinction would often trip up beginners, so think about this a bit, and make sure you understand why this program prints 'Hello World' three times and why it would print it only once if the putStrLn function did the printing as a side effect.
What I don't understand
For me it seems almost natural that the string 'Hello World' is printed three times. I perceive the helloWorld variable (or function?) as a sort of callback which is invoked later. What I don't understand is, how if putStrLn had a side effect, it would result in the string being printed only once. Or why it would only be printed once in other programming languages.
Let's say in C# code, I would presume it would look like this:
C# (Fiddle)
using System;
public class Program
{
public static void HelloWorld()
{
Console.WriteLine("Hello World");
}
public static void Main()
{
HelloWorld();
HelloWorld();
HelloWorld();
}
}
I am sure I am overlooking something quite simple or misinterpret his terminology. Any help would be greatly appreciated.
EDIT:
Thank you all for your answers! Your answers helped me get a better understanding of these concepts. I don't think it fully clicked yet, but I will revisit the topic in the future, thank you!
It’d probably be easier to understand what the author means if we define helloWorld as a local variable:
main :: IO ()
main = do
  let helloWorld = putStrLn "Hello World!"
  helloWorld
  helloWorld
  helloWorld
which you could compare to this C#-like pseudocode:
void Main() {
var helloWorld = {
WriteLine("Hello World!")
}
helloWorld;
helloWorld;
helloWorld;
}
I.e. in C# WriteLine is a procedure that prints its argument and returns nothing. In Haskell, putStrLn is a function that takes a string and gives you an action that would print that string were it to be executed. It means that there is absolutely no difference between writing
do
  let hello = putStrLn "Hello World"
  hello
  hello
and
do
  putStrLn "Hello World"
  putStrLn "Hello World"
That being said, in this example the difference isn’t particularly profound, so it’s fine if you don’t quite get what the author is trying to get at in this section and just move on for now.
It works a bit better if you compare it to Python:
hello_world = print('hello world')
hello_world
hello_world
hello_world
The point here being that IO actions in Haskell are “real” values that don’t need to be wrapped in further “callbacks” or anything of the sort to prevent them from executing. Rather, the only way to get them to execute is to put them in a particular place (i.e. somewhere inside main or a thread spawned off main).
This isn’t just a parlour trick either. It does end up having some interesting effects on how you write code (for example, it’s part of the reason why Haskell doesn’t really need any of the common control structures you’d be familiar with from imperative languages and can get away with doing everything in terms of functions instead), but again I wouldn’t worry too much about this (analogies like these don’t always immediately click).
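To make the "actions are values" point concrete, here is a minimal sketch. The names greetings and repeatN are made up for illustration and do not come from the course or the answers above. Building the list prints nothing; output only appears once the actions are sequenced from main. repeatN shows how a loop-like control structure can be an ordinary function over actions.

```haskell
import Control.Monad (when)

-- An IO action is an ordinary value: building this list prints nothing.
greetings :: [IO ()]
greetings = replicate 3 (putStrLn "Hello World")

-- A hypothetical "repeat n times" control structure, written as a plain function.
repeatN :: Int -> IO () -> IO ()
repeatN n action = when (n > 0) (action >> repeatN (n - 1) action)

main :: IO ()
main = do
  sequence_ greetings               -- runs the three stored actions in order
  repeatN 2 (putStrLn "and again")  -- runs the same action value twice
```

The standard library already provides this kind of helper (for example replicateM_ in Control.Monad); repeatN is only spelled out here to show that nothing special is needed to define one.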
It might be easier to see the difference as described if you use a function that actually does something, rather than helloWorld. Think of the following:
add :: Int -> Int -> IO Int
add x y = do
putStrLn ("I am adding " ++ show x ++ " and " ++ show y)
return (x + y)
plus23 :: IO Int
plus23 = add 2 3
main :: IO ()
main = do
  _ <- plus23
  _ <- plus23
  _ <- plus23
  return ()
This will print out "I am adding 2 and 3" 3 times.
In C#, you might write the following:
using System;
public class Program
{
public static int add(int x, int y)
{
Console.WriteLine("I am adding {0} and {1}", x, y);
return x + y;
}
public static void Main()
{
int x;
int plus23 = add(2, 3);
x = plus23;
x = plus23;
x = plus23;
return;
}
}
Which would print only once.
If the evaluation of putStrLn "Hello World" had side effects, then the message would only be printed once.
We can approximate that scenario with the following code:
import System.IO.Unsafe (unsafePerformIO)
import Control.Exception (evaluate)
helloWorld :: ()
helloWorld = unsafePerformIO $ putStrLn "Hello World"
main :: IO ()
main = do
  evaluate helloWorld
  evaluate helloWorld
  evaluate helloWorld
unsafePerformIO takes an IO action and "forgets" it's an IO action, unmooring it from the usual sequencing imposed by the composition of IO actions and letting the effect take place (or not) according to the vagaries of lazy evaluation.
evaluate takes a pure value and ensures that the value is evaluated whenever the resulting IO action is executed, which for us it will be, because it lies in the path of main. We are using it here to connect the evaluation of some values to the execution of the program.
This code only prints "Hello World" one time. We treat helloWorld as a pure value. But that means it will be shared between all evaluate helloWorld calls. And why not? It's a pure value after all, why recalculate it needlessly? The first evaluate action "pops" the "hidden" effect and the later actions just evaluate the resulting (), which doesn't cause any further effects.
There is one detail to notice: you call the putStrLn function only once, while defining helloWorld. In the main function you just use the return value of that putStrLn "Hello World" call three times.
The lecturer says that the putStrLn call has no side effects, and that's true. But look at the type of helloWorld: it is an IO action. putStrLn just creates it for you. You then chain three of them with the do block to create another IO action, main. Later, when you execute your program, that action is run, and that is where the side effects happen.
The mechanism underlying this is monads. This powerful concept allows you to use side effects like printing in a language that doesn't support side effects directly. You just chain some actions, and that chain is run when your program starts. You will need to understand that concept deeply if you want to use Haskell seriously.
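As a rough illustration of "chaining", the do-block from the question can be written with the (>>) operator directly. This is only a sketch; thrice is a name invented here. The helper works for any monad, which hints that the sequencing comes from the monad interface rather than from putStrLn itself.

```haskell
-- Run the same action three times by chaining it with (>>),
-- which is what a do-block of three plain statements desugars to.
thrice :: Monad m => m a -> m ()
thrice act = act >> act >> act >> return ()

helloWorld :: IO ()
helloWorld = putStrLn "Hello World"

main :: IO ()
main = thrice helloWorld  -- behaves like listing helloWorld three times in a do-block
```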
I read this down and up: http://nim-lang.org/docs/times.html
But still cannot figure the simple thing: how to get time in ms twice, once before my code, and again after my code runs, and then print the difference?
I tried their example:
var t0 = cpuTime()
sleep(300)
echo "CPU time [s] ", cpuTime() - t0
But this prints something meaningless:
CPU time [s] 4.200000000000005e-05
If you plan to perform a lot of measurements, the best approach is to create a re-usable helper template, abstracting away the timing code:
import times, os, strutils
template benchmark(benchmarkName: string, code: untyped) =
  block:
    let t0 = epochTime()
    code
    let elapsed = epochTime() - t0
    let elapsedStr = elapsed.formatFloat(format = ffDecimal, precision = 3)
    echo "CPU Time [", benchmarkName, "] ", elapsedStr, "s"
benchmark "my benchmark":
  sleep 300
This will print out
CPU Time [my benchmark] 0.305s
If you need more comprehensive data about the performance of all of the code included in your project, Nim offers a special build mode, which instruments the compiled code with profiling probes. You can find more about it here:
http://nim-lang.org/docs/estp.html
Lastly, since Nim generates C code with C function names that directly correspond to their Nim counterparts, you can use any C profiler with Nim programs.
cpuTime only calculates the time the CPU actually spends on the process, at least on Linux. So the entire sleeping time doesn't count. You can use epochTime instead, which is the actual UNIX timestamp with subsecond accuracy.
Let's suppose we have a function
type Func = Bool -> SophisticatedData
fun1 :: Func
And we'd like to change this function for some input:
change :: SophisticatedData -> Func -> Func
change sd func = \input -> if input == False then sd else func input
Am I right that after several calls of change (endFunc = change data1 $ change data2 $ startFunc) the resulting function would call all the intermediate ones each time? Am I right that the GC wouldn't be able to delete unused data? What is the Haskell way to cope with this task?
Thanks.
Well let's start by cleaning up change to be a bit more legible
change sd f input = if input then f input else sd
So when we compose these
change d1 $ change d2 $ change d3
GHC starts by storing a thunk for each of them. Remember that $ is a function too, so the whole change d* thing is going to be a thunk at first. Thunks are relatively cheap and if you're not creating 10k or so of them at once you'll be just fine :) so no worries there.
Now the question is what happens when you start evaluating. The answer is that it still won't evaluate the complex data, so it's still quite memory efficient; it only needs to force input to determine which branch it's taking. Because of this, you should never actually fully evaluate a SophisticatedData until after change has run and returned one to you; it will then be evaluated as needed if you use it.
Furthermore, at each step, GHC can garbage collect the unneeded thunks since they can't be referenced anymore.
In conclusion, you should be just fine. Trust in the laziness
You are correct: if foo is a chain of O(n) calls to change, there will be O(n) overhead on every call to foo. The way to deal with this is to memoize foo:
memoize :: Func -> Func
memoize f = \x -> if x then fTrue else fFalse where
  fTrue = f True
  fFalse = f False
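Putting the pieces from the question and both answers together, a small self-contained sketch might look like this. SophisticatedData is left abstract in the question, so String stands in for it here purely as an assumption, and start is a made-up starting function.

```haskell
-- Assumption for illustration only: the question leaves SophisticatedData abstract.
type SophisticatedData = String
type Func = Bool -> SophisticatedData

-- Override the result for False; delegate True to the wrapped function.
change :: SophisticatedData -> Func -> Func
change sd f = \input -> if input then f input else sd

-- Compute both possible results lazily, at most once each, and share them between calls.
memoize :: Func -> Func
memoize f = \x -> if x then fTrue else fFalse
  where
    fTrue  = f True
    fFalse = f False

main :: IO ()
main = do
  let start   = const "start" :: Func
      endFunc = memoize (change "d1" (change "d2" start))
  putStrLn (endFunc True)   -- prints "start"
  putStrLn (endFunc False)  -- prints "d1"
```

Because fTrue and fFalse are bound outside the lambda, repeated calls to endFunc reuse the two results instead of walking the chain of change wrappers again.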
I think I'm fundamentally misunderstanding how to attack this type of problem with Netwire:
I have the following test-case:
I'd like to take a string, split it into lines, print each line, then exit.
The pieces I'm missing are:
How to inhibit after a value early in a pipeline, but then if that value is split and the results produced later, not inhibit until all those results are consumed.
What the main loop function should look like
If I should use 'once'
Here is the code I have so far:
import Control.Wire
main :: IO ()
main = recur mainWire
recur :: Wire () IO () () -> IO ()
recur a = do
  (e,w) <- stepWire a 0 ()
  case e of
    Left () -> return ()
    Right () -> recur w
mainWire :: Wire () IO () ()
mainWire = pure "asdf\nqwer\nzxcv"
  >>> once
  >>> arr lines
  >>> fifo
  >>> arr print
  >>> perform
This outputs the following:
"asdf"
and then quits. If I remove the once, then the program performs as expected, repeatedly outputting the full list of lines forever.
I'd like the following output:
"asdf"
"qwer"
"zxcv"
I'm sure that I'm just missing some intuition here about the correct way to approach this class of problem with Netwire.
Note: This is for an older version of netwire (before events worked like they do now), so some translating of the code would be required to make this work properly with the current version.
If I understood you right, you want a wire that produces the lines of a string, and then inhibits when it's done with that? It's a bit hard to tell.
once as the name implies, produces exactly once and then inhibits forever. Again it's a bit unclear what your wires are doing (because you didn't tell us) but it's not something you normally put into your "main" wire (so far I've only ever used once with andThen).
If that is correct, I'd probably do it something along the lines of:
produceLines s = produceLines' $ lines s where
  produceLines' [] = inhibit mempty
  produceLines' (l:ls) = pure l . once --> produceLines' ls
(You could write that as a fold or something, I just thought this was a bit clearer).
--> is just prettier syntax for andThen, in case you didn't know. Basically this splits the passed string into lines, and turns them into a wire that produces the first line once, and then behaves like a similar wire except with the first element removed. It inhibits indefinitely once all values have been produced.
Is that what you wanted?
Update
I see what you were trying to do now.
The wire you were trying to write could be done as
perform . arr print . fifo . ((arr lines . pure "asdf\nqwer\nzxcv" . once) --> pure [])
The part in the parentheses produces ["asdf","qwer","zxcv"] for one instant, and then produces [] forever. fifo takes values from the previous wire, adding the result from the previous wire in every instant (because of that we have to keep producing []). The rest is as you know it (I'm using the function-like notation rather than the arrow notation because I prefer it, but that shouldn't be a problem for you).
question
I want a program that will write a sequence like,
1
...
10000000
to a file. What's the simplest code one can write, and get decent performance? My intuition is that there is some lack-of-buffering problem. My C code runs at 100 MB/s, whereas by reference the Linux command line utility dd runs at 3 GB/s (sorry for the imprecision, see comments -- I'm more interested in the big picture orders-of-magnitude though).
One would think this would be a solved problem by now ... i.e. any modern compiler would make it immediate to write such programs that perform reasonably well ...
C code
#include <stdio.h>
int main(int argc, char **argv) {
int len = 10000000;
for (int a = 1; a <= len; a++) {
printf ("%d\n", a);
}
return 0;
}
I'm compiling with clang -O3. A performance skeleton which calls putchar('\n') 8 times gets comparable performance.
Haskell code
A naive Haskell implementation runs at 13 MiB/sec, compiling with ghc -O2 -optc-O3 -optc-ffast-math -fllvm -fforce-recomp -funbox-strict-fields. (I haven't recompiled my libraries with -fllvm, perhaps I need to do that.) Code:
import Control.Monad
main = forM [1..10000000 :: Int] $ \j -> putStrLn (show j)
My best stab with Haskell runs even slower, at 17 MiB/sec. The problem is I can't find a good way to convert Vector's into ByteString's (perhaps there's a solution using iteratees?).
import qualified Data.Vector.Unboxed as V
import Data.Vector.Unboxed (Vector, Unbox, (!))
writeVector :: (Unbox a, Show a) => Vector a -> IO ()
writeVector v = V.mapM_ (putStrLn . show) v
main = writeVector (V.generate 10000000 id)
It seems that writing ByteString's is fast, as demonstrated by this code, writing an equivalent number of characters,
import Data.ByteString.Char8 as B
main = B.putStrLn (B.replicate 76000000 '\n')
This gets 1.3 GB/s, which isn't as fast as dd, but obviously much better.
Some completely unscientific benchmarking first:
All programmes have been compiled with the default optimisation level (-O3 for gcc, -O2 for GHC) and run with
time ./prog > outfile
As a baseline, the C programme took 1.07s to produce a ~76MB (78888897 bytes) file, roughly 70MB/s throughput.
1. The "naive" Haskell programme (forM [1 .. 10000000] $ \j -> putStrLn (show j)) took 8.64s, about 8.8MB/s.
2. The same with forM_ instead of forM took 5.64s, about 13.5MB/s.
3. The ByteString version from dflemstr's answer took 9.13s, about 8.3MB/s.
4. The Text version from dflemstr's answer took 5.64s, about 13.5MB/s.
5. The Vector version from the question took 5.54s, about 13.7MB/s.
6. main = mapM_ (C.putStrLn . C.pack . show) $ [1 :: Int .. 10000000], where C is Data.ByteString.Char8, took 4.25s, about 17.9MB/s.
7. putStr . unlines . map show $ [1 :: Int .. 10000000] took 3.06s, about 24.8MB/s.
8. A manual loop,
   main = putStr $ go 1
     where
       go :: Int -> String
       go i
         | i > 10000000 = ""
         | otherwise = shows i . showChar '\n' $ go (i+1)
   took 2.32s, about 32.75MB/s.
9. main = putStrLn $ replicate 78888896 'a' took 1.15s, about 66MB/s.
10. main = C.putStrLn $ C.replicate 78888896 'a', where C is Data.ByteString.Char8, took 0.143s, about 530MB/s, with roughly the same figures for lazy ByteStrings.
What can we learn from that?
First, don't use forM or mapM unless you really want to collect the results. Performancewise, that sucks.
Then, ByteString output can be very fast (10.), but if the construction of the ByteString to output is slow (3.), you end up with slower code than the naive String output.
What's so terrible about 3.? Well, all the involved Strings are very short. So you get a list of
Chunk "1234567" Empty
and between any two such, a Chunk "\n" Empty is put, then the resulting list is concatenated, which means all these Emptys are tossed away when a ... (Chunk "1234567" (Chunk "\n" (Chunk "1234568" (...)))) is built. That's a lot of wasteful construct-deconstruct-reconstruct going on. Speed comparable to that of the Text and the fixed "naive" String version can be achieved by packing to strict ByteStrings and using fromChunks (and Data.List.intersperse for the newlines). Better performance, slightly better than 6., can be obtained by eliminating the costly singletons. If you glue the newlines to the Strings, using \k -> shows k "\n" instead of show, the concatenation has to deal with half as many slightly longer ByteStrings, which pays off.
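A minimal sketch of that last suggestion, assuming the bytestring package that ships with GHC: each number is packed together with its trailing newline into one strict chunk, and the chunks are assembled with fromChunks. The name renderLines is invented here, and this is not the exact code that was benchmarked above.

```haskell
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Lazy.Char8 as L

-- Pack each number together with its trailing newline into one strict chunk,
-- then assemble the chunks into a lazy ByteString without separate "\n" pieces.
renderLines :: [Int] -> L.ByteString
renderLines = L.fromChunks . map (\k -> C.pack (shows k "\n"))

main :: IO ()
main = L.putStr (renderLines [1 .. 10000000])
```

This still goes through String via shows, so it only removes the singleton newline chunks; it does not avoid the cost of show itself.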
I'm not familiar enough with the internals of either text or vector to offer more than a semi-educated guess concerning the reasons for the observed performance, so I'll leave them out. Suffice it to say that the performance gain is marginal at best compared to the fixed naive String version.
Now, 6. shows that ByteString output is faster than String output, enough that in this case the additional work of packing is more than compensated. However, don't be fooled by that to believe that is always so. If the Strings to pack are long, the packing can take more time than the String output.
But ten million invocations of putStrLn, be it the String or the ByteString version, take a lot of time. It's faster to grab the stdout Handle just once and construct the output String in non-IO code. unlines already does well, but we still suffer from the construction of the list map show [1 .. 10^7]. Unfortunately, the compiler didn't manage to eliminate that (but it eliminated [1 .. 10^7], that's already pretty good). So let's do it ourselves, leading to 8. That's not too terrible, but still takes more than twice as long as the C programme.
One can make a faster Haskell programme by going low-level and directly filling ByteStrings without going through String via show, but I don't know if the C speed is reachable. Anyway, that low-level code isn't very pretty, so I'll spare you what I have, but sometimes one has to get one's hands dirty if speed matters.
Using lazy byte strings gives you some buffering, because the string will be written instantly and more numbers will only be produced as they are needed. This code shows the basic idea (there might be some optimizations that could be made):
import qualified Data.ByteString.Lazy.Char8 as ByteString
main =
  ByteString.putStrLn .
  ByteString.intercalate (ByteString.singleton '\n') .
  map (ByteString.pack . show) $
  ([1..10000000] :: [Int])
I still use Strings for the numbers here, which leads to horrible slowdowns. If we switch to the text library instead of the bytestring library, we get access to "native" show functions for ints, and can do this:
import Data.Monoid
import Data.List
import Data.Text.Lazy.IO as Text
import Data.Text.Lazy.Builder as Text
import Data.Text.Lazy.Builder.Int as Text
main :: IO ()
main =
  Text.putStrLn .
  Text.toLazyText .
  mconcat .
  intersperse (Text.singleton '\n') .
  map Text.decimal $
  ([1..10000000] :: [Int])
I don't know how you are measuring the "speed" of these programs (with the pv tool?) but I imagine that one of these procedures will be the fastest trivial program you can get.
If you are going for maximum performance, then it helps to take a holistic view; i.e., you want to write a function that maps from [Int] to series of system calls that write chunks of memory to a file.
Lazy bytestrings are a good representation for a sequence of chunks of memory. Mapping a lazy bytestring to a series of system calls that write chunks of memory is what L.hPut is doing (assuming an import qualified Data.ByteString.Lazy as L). Hence, we just need a means to efficiently construct the corresponding lazy bytestring. This is what lazy bytestring builders are good at. With the new bytestring builder (here is the API documentation), the following code does the job.
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Lazy.Builder (toLazyByteString, charUtf8)
import Data.ByteString.Lazy.Builder.ASCII (intDec)
import Data.Foldable (foldMap)
import Data.Monoid (mappend)
import System.IO (openFile, IOMode(..))
main :: IO ()
main = do
  h <- openFile "/dev/null" WriteMode
  L.hPut h $ toLazyByteString $
    foldMap ((charUtf8 '\n' `mappend`) . intDec) [1..10000000]
Note that I output to /dev/null to avoid interference by the disk driver. The effort of moving the data to the OS remains the same. On my machine, the above code runs in 0.45 seconds, which is 12 times faster than the 5.4 seconds of your original code. This implies a throughput of 168 MB/s. We can squeeze out an additional 30% speed (220 MB/s) using bounded encodings.
import qualified Data.ByteString.Lazy.Builder.BasicEncoding as E
  L.hPut h $ toLazyByteString $
    E.encodeListWithB
      ((\x -> (x, '\n')) E.>$< E.intDec `E.pairB` E.charUtf8)
      [1..10000000]
Their syntax looks a bit quirky because a BoundedEncoding a specifies the conversion of a Haskell value of type a to a bounded-length sequence of bytes such that the bound can be computed at compile-time. This allows functions such as E.encodeListWithB to perform some additional optimizations for implementing the actual filling of the buffer. See the documentation of Data.ByteString.Lazy.Builder.BasicEncoding in the above link to the API documentation (phew, stupid hyperlink limit for new users) for more information.
Here is the source of all my benchmarks.
The conclusion is that we can get very good performance from a declarative solution provided that we understand the cost model of our implementation and use the right datastructures. Whenever constructing a packed sequence of values (e.g., a sequence of bytes represented as a bytestring), then the right datastructure to use is a bytestring Builder.
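For readers on a current toolchain, here is a sketch of the same idea. It assumes a recent bytestring release, where the builder lives in Data.ByteString.Builder rather than the Data.ByteString.Lazy.Builder modules used above. The name numberLines is invented here, each number is followed by (rather than preceded by) a newline, and no timings are claimed for this variant.

```haskell
import Data.ByteString.Builder (Builder, char7, hPutBuilder, intDec, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as L
import System.IO (stdout)

-- One Builder for the whole output: each number followed by a newline.
numberLines :: Int -> Builder
numberLines n = foldMap (\i -> intDec i <> char7 '\n') [1 .. n]

main :: IO ()
main = hPutBuilder stdout (numberLines 10000000)
```

For inspection in GHCi, L.unpack (toLazyByteString (numberLines 3)) gives "1\n2\n3\n"; hPutBuilder writes the builder straight into the handle's buffer without materialising the lazy bytestring first.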