How to benchmark a few lines of code in Nim?

I read this down and up: http://nim-lang.org/docs/times.html
But I still cannot figure out a simple thing: how to get the time in milliseconds twice, once before my code runs and again after, and then print the difference?
I tried their example:
var t0 = cpuTime()
sleep(300)
echo "CPU time [s] ", cpuTime() - t0
But this prints something meaningless:
CPU time [s] 4.200000000000005e-05

If you plan to perform a lot of measurements, the best approach is to create a re-usable helper template, abstracting away the timing code:
import times, os, strutils

template benchmark(benchmarkName: string, code: untyped) =
  block:
    let t0 = epochTime()
    code
    let elapsed = epochTime() - t0
    let elapsedStr = elapsed.formatFloat(format = ffDecimal, precision = 3)
    echo "CPU Time [", benchmarkName, "] ", elapsedStr, "s"

benchmark "my benchmark":
  sleep 300
This will print out
CPU Time [my benchmark] 0.305s
If you need more comprehensive data about the performance of all of the code included in your project, Nim offers a special build mode, which instruments the compiled code with profiling probes. You can find more about it here:
http://nim-lang.org/docs/estp.html
Lastly, since Nim generates C code with C function names that directly correspond to their Nim counterparts, you can use any C profiler with Nim programs.

cpuTime only measures the time the CPU actually spends on the process, at least on Linux, so time spent sleeping doesn't count. You can use epochTime instead, which returns the actual UNIX timestamp with sub-second accuracy.
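For illustration, here is a minimal sketch of my own (not from the original thread) contrasting the two clocks around the same sleep call:

import times, os

let cpuStart = cpuTime()
let wallStart = epochTime()
sleep(300)
echo "CPU time [s]  ", cpuTime() - cpuStart    # close to 0: the process mostly sleeps
echo "Wall time [s] ", epochTime() - wallStart # roughly 0.3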

Related

How can I parallelize a for loop in Python using the multiprocessing package?

Note: I don't need any communication between the processes/threads; I'm only interested in a completion signal (that's the reason I posted this as a new question, since all the other examples I've found communicate with each other).
How can I use the multiprocessing package in Python 3 to parallelize the following piece of code (the end goal is to make it run faster):
a = 123
b = 456
for id in ids:  # len(ids) = 10'000
    # executes a binary with CLI flags
    run_binary_with_id(id, a, b)
    # i.e. runs "./hello_world_exec --id id --a a --b b", which takes about 30 seconds on average
I tried the following:
import multiprocessing as mp

def run_binary_with_id(id, a, b):
    run_command('./hello_world_exec --id {} --a {} --b {}'.format(id, a, b))

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    q = ctx.Queue()
    a = 123
    b = 456
    ids = range(10000)
    for id in ids:
        p = ctx.Process(target=run_binary_with_id, args=(id, a, b))
        p.start()
        p.join()
    # The binary was executed len(ids) number of times, do other stuff assuming everything's completed at this point
or
for id in ids:
    map.apply_async(run_binary_with_id, (id, a, b))
In a similar question the answer is the following:
def consume(iterator):
    deque(iterator, maxlen=0)

x = pool.imap_unordered(f, ((i, j) for i in range(10000) for j in range(10000)))
consume(x)
which I don't really understand at all (why do I need this consume()).
Trying to spawn 10000 processes to run in parallel is almost certainly going to overload your system and make it run slower than running the processes sequentially due to the overhead involved in the OS having to constantly perform context switching between processes when the number of processes far exceeds the number of CPUs/cores your system has.
You can instead use multiprocessing.Pool to limit the number of worker processes spawned for the task. The Pool constructor limits the number of processes to the number of cores your system has by default, but you can fine-tune it with the processes parameter. You can then use its map method to apply a given function to a sequence of arguments in parallel. map can only pass one argument to the function, however, so you would have to use functools.partial to supply default values for the other arguments, which in your case do not change between calls:
from functools import partial

if __name__ == '__main__':
    _run_binary_with_id = partial(run_binary_with_id, a=123, b=456)
    with mp.Pool() as pool:
        pool.map(_run_binary_with_id, range(10000))
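If you would rather avoid functools.partial, a roughly equivalent sketch of my own (not from the original answer) passes all three arguments per call via Pool.starmap; here subprocess.run stands in for the question's run_command helper, and ./hello_world_exec is the question's binary:

import multiprocessing as mp
import subprocess

def run_binary_with_id(id, a, b):
    # subprocess.run is a stand-in for the question's run_command helper
    subprocess.run(['./hello_world_exec', '--id', str(id), '--a', str(a), '--b', str(b)])

if __name__ == '__main__':
    with mp.Pool() as pool:
        # Each (id, a, b) tuple is unpacked into one call of run_binary_with_id.
        pool.starmap(run_binary_with_id, ((id, 123, 456) for id in range(10000)))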

Complete vs. simple I/O in Lua

I am trying to write a program to analyze data from a simulation. Since the simulation software I am using is what is running the Lua program, I am not sure if this is the right place to ask this question, but I am probably making a programming error.
I am struggling with the difference between using the simple and complete I/O models. I have a block of code, which works, and looks like this:
io.output([[filename_and_location]])

function segment.other_actions()
  if ion_splat ~= 0 then io.write(ion_px_mm, "\n") end
  io.close()
end
Note: ion_splat and ion_px_mm are pre-determined variables that take on number values. This code is run over and over again throughout the simulation.
Then I decided to try achieving the same thing using the complete I/O model like this:
f = io.open([[file_name_and_location]], "w")

function segment.other_actions()
  if ion_splat ~= 0 then f:write(ion_py_mm, "\n") end
  f:close()
end
This runs, but takes a lot longer than the other way. Why is that?
Example 1:
for i = 1, 1000 do
  io.output("test.txt")
  io.write("some data to be written\n")
  io.close()
end
Example 2:
for i = 1, 1000 do
  local f = io.open("test.txt", "w")
  f:write("some data to be written\n")
  f:close()
end
There is no measurable difference in the execution time.
The latter approach is usually preferable because the file being used is identified explicitly.
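For the original use case, the usual pattern with the complete model is to open the file once, write as many times as needed, and close it once at the end; a minimal sketch (the file name is just a placeholder):

-- Open once, write many times, close once at the end.
local f = assert(io.open("test.txt", "w"))
for i = 1, 1000 do
  f:write("some data to be written\n")
end
f:close()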

My timer always returns 0

I have created a timer in Haskell. The problem is that it always returns 0. I think this is because of laziness, but I do not know how to fix it.
import System.CPUTime
timeit :: IO () -> IO Float
timeit io = do
  start <- getCPUTime
  action <- seq start io
  end <- seq action getCPUTime
  return $! (fromIntegral $ end - start) / (10 ** 12)
As you can see, I have thrown in seq and $! galore, but to no avail. What do I do?
Here is an example run:
*Main> timeit test
What is your name?
Haskell
Your name is Haskell.
0.0
Here is some code I got to work:
import Data.Time.Clock

timeit :: IO () -> IO NominalDiffTime
timeit doit = do
  start <- getCurrentTime
  doit
  end <- getCurrentTime
  return (diffUTCTime end start)
Now for some discussion:
From the comments it seemed that you wanted real time, not CPU time, so that is what I wrote.
The System.Time library is deprecated, so I switched you to the Data.Time library.
NominalDiffTime holds the time in seconds.
NominalDiffTime ignores leap seconds! In the unlikely event that a leap second is added while the program is running, a 10 second delay will show up as 11 seconds. I googled around on how to fix this, and although Data.Time does have a DiffTime type to account for this, there doesn't seem to be a simple way to generate a DiffTime from a UTCTime. I think you may be able to use the POSIX time libraries to get seconds since Jan 1, 1970 and then take the difference, but this seems like too much hassle for a pretty rare bug. If you are writing software to safely land airplanes, however, please dig a bit deeper and fix this problem. :)
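For completeness, a quick usage sketch of the timeit above (the file name and payload are placeholders I made up, not from the answer):

main :: IO ()
main = do
  elapsed <- timeit (writeFile "scratch.txt" (replicate 1000000 'x'))
  putStrLn ("Wall-clock time: " ++ show elapsed)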
I've used this before and it works well for approximations. Don't rely on it for very precise times, but it should be accurate to within a millisecond. I've never tested its accuracy, so use at your own risk.
import System.CPUTime
timeit :: IO a -> IO (a, Integer)
timeit action = do
  start <- getCPUTime
  value <- action
  end <- getCPUTime
  return (value, end - start)
No explicit strictness required.
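As an illustrative sketch of my own (not part of the answer), you can force a pure computation with evaluate so the work actually happens between the two getCPUTime calls; note that getCPUTime reports picoseconds:

import Control.Exception (evaluate)

main :: IO ()
main = do
  (total, picos) <- timeit (evaluate (sum [1 .. 1000000 :: Integer]))
  print total
  putStrLn ("CPU time: " ++ show (fromIntegral picos / 1e12 :: Double) ++ " s")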

How to efficiently compute the string lengths of a cell array of strings

I have a cell array in Matlab:
strings = {'one', 'two', 'three'};
How can I efficiently calculate the length of all three strings? Right now I use a for loop:
lengths = zeros(3,1);
for i = 1:3
    lengths(i) = length(strings{i});
end
This is, however, unusably slow when you have a large number of strings (I've got 480,863 of them). Any suggestions?
You can also use:
cellfun(@length, strings)
It will not be faster, but makes the code clearer.
Regarding the slowness, you should first run the profiler to check where the bottleneck is. Only then should you optimize.
Edit: I just recalled that 'length' used to be a built-in function in cellfun in older Matlab versions. So it might actually be faster! Try
cellfun('length',strings)
Edit (2): I have to admit that my first answer was a wild guess. Following @Rodin's comment, I decided to check out the speedup.
Here is the code of the benchmark:
First, the code that generates a lot of strings and saves to disk:
function GenerateCellStrings()
    strs = cell(1, 10000);
    for i = 1:10000
        strs{i} = GenerateRandomString();
    end
    save strs;
end

function st = GenerateRandomString()
    MAX_STR_LENGTH = 1000;
    n = randi(MAX_STR_LENGTH);
    st = char(randi([97 122], 1, n));
end
Then, the benchmark itself:
function CheckRunTime()
    load strs;

    tic;
    disp('Loop:');
    for i = 1:numel(strs)
        n = length(strs{i});
    end
    toc;

    disp('cellfun (String):');
    tic;
    cellfun('length', strs);
    toc;

    disp('cellfun (function handle):');
    tic;
    cellfun(@length, strs);
    toc;
end
And the results are:
Loop:
Elapsed time is 0.010663 seconds.
cellfun (String):
Elapsed time is 0.000313 seconds.
cellfun (function handle):
Elapsed time is 0.006280 seconds.
Wow!! The 'length' syntax is about 30 times faster than the loop! I can only guess why it is so much faster. Maybe it's the fact that cellfun recognizes 'length' specifically, or maybe JIT optimization.
Edit (3): I found out the reason for the speedup. It is indeed the special recognition of 'length'. Thanks to @reve_etrange for the info.
Keep an array of the lengths of said strings, and update that array when you update the strings. This will allow you O(1) time access to string lengths. Since you are updating it at the same time you generate or load strings, it shouldn't slow things down much, since integer array operations are (generally) faster than string operations.
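As a hypothetical sketch (generate_string is just a stand-in for however the strings are actually produced), the bookkeeping could look like this:

n = 480863;
strings = cell(1, n);
lengths = zeros(1, n);
for i = 1:n
    strings{i} = generate_string(i);  % placeholder for the real string source
    lengths(i) = length(strings{i});  % record the length while the string is at hand
end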

Why is the constant returned faster?

I have the following snippet:
import Network.MessagePackRpc.Server

ping :: String -> IO String
ping s = return s

main :: IO ()
main = do
  serve 8081 [ ("add", fun add), ("ping", fun ping) ]
Now, what I observe is that when I send it e.g. 100000 identical 1024-byte strings, the small snippet runs in approximately 2 s. If I replace return s with e.g. return "the-1024-byte-string", it runs approximately 25% faster. I have exercised this up and down. I am really surprised that the impact is so large. Does anyone have an explanation?
Returning a known (at compile time) constant could enable more inlining. One would have to check the generated code to be sure, however.
