/dev/random Extremely Slow? - linux

Some background info: I was looking to run a script on a Red Hat server to read some data from /dev/random and use the Perl unpack() command to convert it to a hex string for use later on (benchmarking database operations). I ran a few "head -1" calls on /dev/random and it seemed to be working out fine, but after calling it a few times, it would just kind of hang. After a few minutes, it would finally output a small block of text, then finish.
I switched to /dev/urandom (I really didn't want to, it's slower and I don't need that quality of randomness) and it worked fine for the first two or three calls, then it too began to hang.
I was wondering if it was the "head" command that was bombing it, so I tried doing some simple I/O using Perl, and it too was hanging.
As a last-ditch effort, I used the "dd" command to dump some info out of it directly to a file instead of to the terminal. All I asked of it was 1 MB of data, but it took 3 minutes to get ~400 bytes before I killed it.
I checked the process lists, CPU and memory were basically untouched. What exactly could cause /dev/random to crap out like this and what can I do to prevent/fix it in the future?
Edit: Thanks for the help guys! It seems that I had random and urandom mixed up. I've got the script up and running now. Looks like I learned something new today. :)

On most Linux systems, /dev/random is powered from actual entropy gathered from the environment. If your system isn't delivering a large amount of data from /dev/random, it likely means that you're not generating enough environmental randomness to power it.
I'm not sure why you think /dev/urandom is "slower" or higher quality. It reuses an internal entropy pool to generate pseudorandomness - making it slightly lower quality - but it doesn't block. Generally, applications that don't require high-level or long-term cryptography can use /dev/urandom reliably.
Try waiting a little while and then reading from /dev/urandom again. It's possible that you've exhausted the internal entropy pool by reading so much from /dev/random, starving both generators - allowing your system to create more entropy should replenish them.
See Wikipedia for more info about /dev/random and /dev/urandom.
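If you want to watch the pool level while you test, the kernel exposes its current estimate in /proc/sys/kernel/random/entropy_avail. A minimal C sketch that just prints it (cat-ing the file from a shell works equally well):
#include <stdio.h>

int main(void)
{
    /* The kernel reports its current entropy estimate (in bits) here. */
    FILE *f = fopen("/proc/sys/kernel/random/entropy_avail", "r");
    int bits;
    if (!f || fscanf(f, "%d", &bits) != 1) {
        perror("entropy_avail");
        return 1;
    }
    fclose(f);
    printf("entropy available: %d bits\n", bits);
    return 0;
}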

This question is pretty old, but it's still relevant, so I'm going to give my answer. Many CPUs today come with a built-in hardware random number generator (RNG), and many systems also come with a trusted platform module (TPM) that provides an RNG. There are also other options that can be purchased, but chances are your computer already has something.
You can use rngd from the rng-utils package on most Linux distros to feed more random data into the pool. For example, on Fedora 18 all I had to do to enable seeding from the TPM and the CPU RNG (the RDRAND instruction) was:
# systemctl enable rngd
# systemctl start rngd
You can compare the speed with and without rngd. It's a good idea to run rngd -v -f from the command line; that will show you the detected entropy sources. Make sure all the modules needed to support your sources are loaded. To use the TPM, it needs to be activated through tpm-tools. Update: here is a nice howto.
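For a quick before/after comparison, a rough C sketch like this times a single small read from /dev/random (the 16-byte request size is an arbitrary choice for illustration):
/* Rough sketch: time how long a small read from /dev/random takes,
 * so you can compare before and after starting rngd. */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    unsigned char buf[16];
    struct timespec t0, t1;
    int fd = open("/dev/random", O_RDONLY);
    if (fd < 0) { perror("open /dev/random"); return 1; }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = read(fd, buf, sizeof buf);   /* may block while entropy is low */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("read %zd bytes in %.3f s\n", n, secs);
    return 0;
}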
BTW, I've read on the Internet some concerns about TPM RNGs being broken in various ways, but I didn't read anything concrete against the RNGs found in Intel, AMD and VIA chips. Using more than one source would be best if you really care about randomness quality.
urandom is good for most use cases (except sometimes during early boot). Most programs nowadays use urandom instead of random. Even openssl does that. See myths about urandom and comparison of random interfaces.
In recent Fedora and RHEL/CentOS, rng-tools also supports jitter entropy. You can use it if you lack hardware options, or if you just trust it more than your hardware.
UPDATE: another option for more entropy is HAVEGED (though its quality has been questioned). On virtual machines there is the KVM/QEMU virtio-rng device (recommended).
UPDATE 2: Since Linux 5.6, the kernel does its own jitter entropy collection.

Use /dev/urandom; it's cryptographically secure.
good read: http://www.2uo.de/myths-about-urandom/
"If you are unsure about whether you should use /dev/random or /dev/urandom, then probably you want to use the latter."
When in doubt during early boot, whether enough entropy has been gathered yet, use the getrandom() system call instead. [1] (available in Linux kernel >= 3.17)
It's the best of both worlds: it blocks until enough entropy has been gathered (only once!), and after that it will never block again.
[1] git kernel commit
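A minimal sketch of using getrandom() from C, assuming Linux >= 3.17 and a libc that provides the wrapper (glibc >= 2.25):
/* Fills a buffer with random bytes, blocking only until the kernel's
 * pool has been seeded once. */
#include <stdio.h>
#include <sys/random.h>

int main(void)
{
    unsigned char buf[16];
    ssize_t n = getrandom(buf, sizeof buf, 0);  /* 0 = default urandom-style source */
    if (n < 0) {
        perror("getrandom");
        return 1;
    }
    for (ssize_t i = 0; i < n; i++)
        printf("%02x", buf[i]);
    putchar('\n');
    return 0;
}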

If you want more entropy for /dev/random then you'll either need to purchase a hardware RNG or use one of the *_entropyd daemons in order to generate it.

If you are using randomness for testing (not cryptography), then repeatable randomness is better; you can get this with a pseudo-random generator started from a known seed. There is usually a good library function for this in most languages.
It is repeatable for when you find a problem and are trying to debug, and it does not eat up entropy. Maybe seed the pseudo-random generator from /dev/urandom and record the seed in the test log. Perl has a pseudo-random number generator you can use.
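As a rough sketch of that pattern in C (the question mentions Perl, where srand/rand follow the same idea): seed once from /dev/urandom, record the seed, then generate repeatable data.
/* Sketch: seed a deterministic PRNG from /dev/urandom, log the seed,
 * then generate repeatable test data.  Re-running with the logged seed
 * reproduces the exact same sequence for debugging. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    unsigned int seed;
    FILE *f = fopen("/dev/urandom", "rb");
    if (!f || fread(&seed, sizeof seed, 1, f) != 1) {
        perror("reading /dev/urandom");
        return 1;
    }
    fclose(f);

    fprintf(stderr, "test seed: %u\n", seed);  /* record this in the test log */
    srand(seed);

    for (int i = 0; i < 4; i++)
        printf("%08x\n", (unsigned)rand());
    return 0;
}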

This fixed it for me.
Use new SecureRandom() instead of SecureRandom.getInstanceStrong()
Some more info can be found here:
https://tersesystems.com/blog/2015/12/17/the-right-way-to-use-securerandom/

/dev/random should be pretty fast these days. However, I did notice that on OS X reading small numbers of bytes from /dev/urandom was really slow. The workaround seemed to be to use arc4random instead: https://github.com/crystal-lang/crystal/pull/11974
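For reference, a minimal sketch of that workaround in C, assuming a libc that provides arc4random_buf() (OS X, the BSDs, or glibc >= 2.36):
/* Get random bytes via arc4random_buf() instead of reading /dev/urandom
 * directly: it never blocks and needs no file descriptor. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    unsigned char buf[16];
    arc4random_buf(buf, sizeof buf);
    for (size_t i = 0; i < sizeof buf; i++)
        printf("%02x", buf[i]);
    putchar('\n');
    return 0;
}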

Related

What is "random: crng init done"

Can someone explain to me what the above message means? I am developing a Linux block driver and I am attempting to format with ext4. After a few minutes I get this message. I have tried searching other threads but can't find an explanation of what it is. Thanks
tl;dr: The kernel's random-number generator is ready to generate random numbers that are unpredictable enough for serious cryptographic use.
On some systems, something at boot time (e.g. starting sshd) waits for this; this happens frequently when switching an embedded system to OpenSSL 1.1. You can fix that with tools like egd or rng-tools, with hardware randomness support, or by tweaking things so the rest of bootup doesn't wait on that something to complete.
Backstory:
Pseudo-random number generators are deterministic algorithms, so with enough output (and/or some knowledge of the internal state, or good guesses about it) an attacker can predict the future output. This is Really Bad if some of that output is going to be e.g. a secret cryptographic key.
For a long time, the Linux kernel has had code to extract some true randomness ("entropy") from unpredictable events (arrival time of network packets, user input, etc.), using math we're not expected to understand, and the resultant randomness is made available with /dev/random. If you read from /dev/random it will give you unpredictable random numbers up until this randomness is exhausted, then you have to wait for the kernel to extract more. /dev/urandom will give you the same random numbers, but if the true randomness runs out, it will start using a (potentially predictable) algorithm from there. So it will always give you something. (Some systems also have hardware support for true randomness e.g. thermal noise).
But it turns out, for cryptographic purposes, you don't need an unending supply of true randomness. If you start with enough true randomness to get a strong cryptographic key, you can then encrypt (say) an unending string of zeroes. An attacker cannot predict that output without knowing the key (if they can, the encryption you're using is broken, and you've already lost, regardless of how good your randomness is).
So the kernel will collect some randomness from the rest of the system at bootup, until it has enough to generate a good crypto key, then it can generate unpredictable random numbers forever.
Now there's a getrandom() system call; OpenSSL 1.1 uses it to seed its random number generators by default, and that system call will not return until the system has collected enough randomness.

Is there a standard constant *nix benchmark, and if not, how to make a `bogobench`?

Ordinary single-threaded *nix programs can be benchmarked with utils like time, e.g.:
# how long does `seq` take to count to 100,000,000
/usr/bin/time seq 100000000 > /dev/null
Outputs:
1.16user 0.06system 0:01.23elapsed 100%CPU (0avgtext+0avgdata 1944maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps
...but the numbers returned are always system-dependent, which means the benchmark in a sense also measures the user's hardware.
Is there some non-relative benchmarking method or command-line util which would return approximately the same virtual timing numbers on any system (or at least a reasonably large subset of systems)? Just like grep -m1 bogo /proc/cpuinfo returns a roughly approximate but stable unit, such a benchmark should also return a somewhat similar unit of duration.
Suppose for benchmarking ordinary commands we have a magic util bogobench (where "bogo" is an adjective signifying "a somewhat bogus status", but not necessarily having algorithms in common with BogoMIPs):
bogobench foo bar.data
And we run this on two physically separate systems:
a 1996 Pentium II
a 2015 Xeon
Desired output would be something like:
21 bogo-seconds
So bogobench should return about the same number in both cases, even though it probably would finish in much less time on the 2nd system.
A hardware emulator like qemu might be one approach, but not necessarily the only approach:
Insert the code to benchmark into a wrapper script bogo.sh
Copy bogo.sh to a bootable Linux disk image bootimage.iso, within a directory where bogo.sh would autorun and then promptly shut down the emulator, outputting some form of timing data along the way to parse into bogo-seconds.
Run bootimage.iso using one of qemu's more minimal -machine options:
qemu-system-i386 -machine type=isapc bootimage.iso
But I'm not sure how to make qemu use a virtual clock, rather than the host CPU's clock, and qemu itself seems like a heavy tool for a seemingly simple task. (Really MAME or MESS would be more versatile emulators than qemu for such a task -- but I'm not adept with MAME, although MAME currently has some capacity for 80486 PC emulation.)
Online we sometimes compare and contrast timing-based benchmarks made on machine X with those made on machine Y. Whereas I'd like both users X and Y to be able to do their benchmark on a virtual machine Z, with bonus points for emulating X or Y (like MAME) if need be, except with no consideration of X or Y's real run-time (unlike MAME, where emulations are often playable). In this way users could report how programs perform in interesting cases without the programmer having to worry that the results were biased by idiosyncrasies of a user's hardware, such as CPU quirks, background processes hogging resources, etc.
Indeed, even on the user's own hardware, a time-based benchmark can be unreliable, as often the user can't be sure that some background process (or bug, or hardware error like a bad sector, or virus) isn't degrading some aspect of performance. Whereas a more virtual benchmark ought to be less susceptible to such influences.
The only sane way I see to implement this is with a cycle-accurate simulator for some kind of hardware design.
AFAIK, no publicly-available cycle-accurate simulators for modern x86 hardware exist, because it's extremely complex and despite a lot of stuff being known about x86 microarchitecture internals (Agner Fog's stuff, Intel's and AMD's own optimization guides, and other stuff in the x86 tag wiki), enough of the behaviour is still a black box full of CPU-design trade-secrets that it's at best possible to simulate something similar. (E.g. branch prediction is definitely one of the most secret but highly important parts).
While it should be possible to come close to simulating Intel Sandybridge or Haswell's actual pipeline and out-of-order core / ROB / RS (at far slower than realtime), nobody has done it that I know of.
But cycle-accurate simulators for other hardware designs do exist: Donald Knuth's MMIX architecture is a clean RISC design that could actually be built in silicon, but currently only exists on paper.
From that link:
Of particular interest is the MMMIX meta-simulator, which is able to do dynamic scheduling of a complex pipeline, allowing superscalar execution with any number of functional units and with many varieties of caching and branch prediction, etc., including a detailed implementation of both hard and soft interrupts.
So you could use this as a reference machine for everyone to run their benchmarks on, and everyone could get comparable results that will tell you how fast something runs on MMIX (after compiling for MMIX with gcc). But not how fast it runs on x86 (presumably also compiling with gcc), which may differ by a significant factor even for two programs that do the same job a different way.
For [fastest-code] challenges over on the Programming Puzzles and Code Golf site, @orlp created the GOLF architecture with a simulator that prints timing results, designed for exactly this purpose. It's a toy architecture with stuff like print to stdout by storing to 0xffffffffffffffff, so it's not necessarily going to tell you anything about how fast something will run on any real hardware.
There isn't a full C implementation for GOLF, AFAIK, so you can only really use it with hand-written asm. This is a big difference from MMIX, which optimizing compilers do target.
One practical approach that could (maybe?) be extended to be more accurate over time is to use existing tools to measure some hardware invariant performance metric(s) for the code under test, and then apply a formula to come up with your bogoseconds score.
Unfortunately most easily measurable hardware metrics are not invariant - rather, they depend on the hardware. An obvious one that should be invariant, however, would be "instructions retired". If the code is taking the same code paths every time it is run, the instructions retired count should be the same on all hardware [1].
Then you apply some kind of nominal clock speed (let's say 1 GHz) and nominal CPI (let's say 1.0) to get your bogoseconds - if you measure 15e9 instructions, you output a result of 15 bogoseconds.
The primary flaw here is that the nominal CPI may be way off from the actual CPI! While most programs hover around 1 CPI, it's easy to find examples where they can approach 0.25 or whatever the inverse of the width is, or alternately be 10 or more if there are many lengthy stalls. Of course such extreme programs may be what you'd want to benchmark - and even if not you have the issue that if you are using your benchmark to evaluate code changes, it will ignore any improvements or regressions in CPI and look only at instruction count.
Still, it satisfies your requirement in as much as it effectively emulates a machine that executes exactly 1 instruction every cycle, and maybe it's a reasonable broad-picture approach. It is pretty easy to implement with tools like perf stat -e instructions (like one-liner easy).
To patch the holes then you could try to make the formula better - let's say you could add in a factor for cache misses to account for that large source of stalls. Unfortunately, how are you going to measure cache-misses in a hardware invariant way? Performance counters won't help - they rely on the behavior and sizes of your local caches. Well, you could use cachegrind to emulate the caches in a machine-independent way. As it turns out, cachegrind even covers branch prediction. So maybe you could plug your instruction count, cache miss and branch miss numbers into a better formula (e.g., use typical L2, L3, RAM latencies, and a typical cost for branch misses).
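As a rough illustration of what such a formula might look like, here is a sketch in C; the nominal clock, CPI, and penalty figures are arbitrary placeholders rather than calibrated values:
/* Rough sketch of a "bogoseconds" formula.  All constants are placeholder
 * assumptions (1 GHz nominal clock, 1.0 base CPI, made-up miss penalties),
 * not calibrated values.  Inputs would come from instruction counting plus
 * cachegrind-style cache/branch simulation. */
#include <stdio.h>

static double bogoseconds(double instructions, double llc_misses,
                          double branch_misses)
{
    const double nominal_hz   = 1e9;   /* assumed 1 GHz reference clock */
    const double base_cpi     = 1.0;   /* assumed 1 cycle per instruction */
    const double miss_penalty = 200.0; /* assumed cycles per last-level miss */
    const double br_penalty   = 15.0;  /* assumed cycles per branch miss */

    double cycles = instructions * base_cpi
                  + llc_misses * miss_penalty
                  + branch_misses * br_penalty;
    return cycles / nominal_hz;
}

int main(void)
{
    /* Example from the text: 15e9 instructions and no modelled misses
     * comes out to 15 bogo-seconds. */
    printf("%.2f bogo-seconds\n", bogoseconds(15e9, 0, 0));
    return 0;
}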
That's about as far as this simple approach will take you, I think. After that, you might as well just rip apart any of the existing x86 [2] emulators and add your simple machine model right in there. You don't need to be cycle-accurate, just pick a nominal width and model it. Probably whatever underlying emulation cachegrind is doing might be a good match, and you get the cache and branch prediction modeling already for free.
[1] Of course, this doesn't rule out bugs or inaccuracies in the instruction counting mechanism.
[2] You didn't tag your question x86 - but I'm going to assume that's your target since you mentioned only Intel chips.

How to Configure and Sample Intel Performance Counters In-Process

In a nutshell, I'm trying to achieve the following inside a userland benchmark process (pseudo-code, assuming x86_64 and a UNIX system):
results[] = ...
for (iteration = 0; iteration < num_iterations; iteration++) {
    pctr_start = sample_pctr();
    the_benchmark();
    pctr_stop = sample_pctr();
    results[iteration] = pctr_stop - pctr_start;
}
FWIW, the performance counter I am thinking of using is CPU_CLK_UNHALTED.THREAD_ALL, to read the number of core cycles independent of clock frequency changes (In an earlier question I had been planning to use the TSC register for this, but alas, that is not what this register measures at all).
My initial intention was to use inline assembler to first configure a counter using WRMSR, then to read the counter using RDPMC inside sample_pctr().
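For reference, the read side would look roughly like this sketch, assuming the counter has already been programmed via its MSR and that user-space RDPMC is permitted (CR4.PCE set; on Linux the /sys/devices/cpu/rdpmc knob controls this):
#include <stdint.h>

/* Read performance counter <counter> with RDPMC.  ECX selects the counter;
 * EDX:EAX receives the value.  This faults if user-space RDPMC is not
 * enabled, and it assumes the counter was configured beforehand. */
static inline uint64_t sample_pctr(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}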
I stumbled at the first hurdle, as writing MSRs requires kernel privileges. It seems like you can in fact read the counters from user space (if configured correctly), but the act of configuring the counter (with an MSR) needs to be undertaken by the kernel.
Does anyone know a lightweight way to ask the kernel to configure a performance counter from user space so that I can then use RDPMC from within my benchmark harness?
Stuff I've looked into/thought about:
Perf tools for Linux. Seems to be geared up for sampling over the whole lifetime of a process, not within a process at specific points (before and after each iteration).
Use perf syscalls directly (i.e. perf_event_open). Looks like the counter value will only update periodically (using a sample rate) or after the counter exceeds a threshold. I need the counter value precisely at the moment I ask. This is why RDPMC seemed so attractive. I imagine that sampling frequently will itself skew the performance counter readings.
PAPI builds on perf, so probably inherits the above problem.
Write a kernel module -- too much effort, too error prone.
Ideally I would like a solution which works on OpenBSD and Linux, but somehow I think that is a tall order. Perhaps just for Linux for now.
Any help is most appreciated. Thanks.
EDIT: I just found the Linux msr device node, which would probably suffice. I'll leave the question up in case a better answer shows up.
It seems the best way -- for Linux at least -- is to use the msr device node.
You simply open a device node, seek to the address of the MSR required, and read or write 8 bytes.
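A minimal sketch, assuming the msr module is loaded (modprobe msr) and root privileges; MSR 0x38f (IA32_PERF_GLOBAL_CTRL) is used purely as an example address:
/* Read an MSR via the Linux msr driver.  The device is /dev/cpu/<N>/msr
 * and the file offset is the MSR address; writing works the same way
 * with O_RDWR and pwrite(). */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    uint64_t value;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    if (pread(fd, &value, sizeof value, 0x38f) != sizeof value) {
        perror("pread");
        return 1;
    }
    printf("MSR 0x38f = 0x%" PRIx64 "\n", value);
    close(fd);
    return 0;
}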
OpenBSD is harder, since (at the time of writing) there is no user-space proxy to the MSRs. So you would need to write a kernel module or implement a sysctl by hand.

How do I ensure accurate results of stress tests?

I have written a script for a project that stress tests the CPU, the VM and the I/O while running vmstat, iostat and sar. The scripts all run for 30 seconds. My tutor has asked me, however, to ensure that the results are accurate. How can I ever be sure? Surely I just take the machine's word for it after running a few tests? The tests have been run for 60 seconds each, and so have the commands, to try and ensure a fair test, but how can I be sure that they are accurate according to my tutor's concerns? Any ideas?
The systems are server versions of Ubuntu 12.04, Debian 7 and Suse 11
There is no way to know which are your tutor's concerns, so you should ask him!
"accuracy" usually means that your test results should not be offset by a factor you're not taking into account, like some CPU features being disabled or not used, differences in software configuration, etc.
What is it that you evaluate, anyway? Evaluating CPU performance is not the same as evaluating a particular hardware system, which is yet different if you consider the software as well. Basically, you need to eliminate all differences which are not part of your evaluation, and make sure the rest of the configuration is representative (e.g. installing a modern OS which supports all the features the CPU provides).
And remember that in the end you will always take the machine's word for it, there's just no other way. All you can say is that you have considered all factors you're aware of, and hope that the factors remaining unknown don't have a big influence.

What are the available interactive languages that run in tiny memory? [closed]

I am looking for general purpose programming languages that
have an interactive (live coding) prompt
work in 32 KB of RAM by itself or 8 KB when the compiler is hosted on a separate machine
run on a microcontroller with as little as 8-32 KB RAM total (without an MMU).
Below is my list so far, what am I missing?
Python: The PyMite VM needs 64K flash, 8K RAM. Targets LPC, SAM7 and ATmegas with 8K or more. Hosted.
Lua: The eLua FAQ recommends 256K flash, 64K RAM.
FORTH: amforth needs 8K flash, 150 bytes RAM, 30 bytes EEPROM on an ATmega.
Scheme: armpit Scheme The smallest target is the LPC2103 with 32K Flash, 4K SRAM.
C: Interactive C runs on 68HC11 with no flash and 32K SRAM. Hosted.
C: picoc, an open source, cross-compiling, interactive C system. When compiled for AVR, it takes 63K flash, 8K RAM. The RAM could be reduced with effort to keep tables in flash.
C++: AngelScript, an open source, byte-code based, C/C++-like scripting language with easy native calls.
Tcl: TinyTCL runs on DOS, 60K binary. Looks easy to port.
BASIC: TinyBasic: Initializes with a 64K heap, might be adjustable.
Lisp
PostScript: (I haven't found a FOSS implementation for low memory yet)
Shell: bitlash: An interactive command shell for Arduino (ATmega). See also AVRSH.
A homebrew Forth runtime can be implemented in very little memory indeed. I know someone who made one on a COSMAC in the 1970s. The core runtime was just 30 bytes.
I hear that CHIP-8, XPL0, PicoC, and Objective Caml have been ported to graphing calculators.
The Wikipedia "Lego Mindstorms" article lists a bunch of programming languages that allegedly run on the Lego RCX or Lego NXT platform.
Do any of them meet your "live coding" criteria?
You might want to check out the other microcontroller Forths at the Forth wiki. It lists at least 4 Forths for the Atmel AVR: amforth (which you already mention), PFAVR, avrforth, and ByteForth.
(Links to those interpreters, as well as this StackOverflow question, are included in the "Embedded Systems" wikibook).
I would recommend Lua (or eLua, http://www.eluaproject.net/ ). I "ported" Lua to a Cortex-M3 a while back. Off the top of my head it had a flash size of 60~100KB and needed about 20KB of RAM to run. I did strip it down to the bare essentials, but depending on your application, that might be enough. There's still room for optimization, especially regarding RAM requirements, but I doubt you can run it comfortably in 8KB.
Some AVR interpreters/VMs:
http://www.cqham.ru/tbcgroup/index_eng.htm
http://www.jcwolfram.de/projekte/avr/chipbasic2/main.php
http://www.jcwolfram.de/projekte/avr/chipbasic8/main.php
http://www.jcwolfram.de/projekte/avr/main.php
http://code.google.com/p/python-on-a-chip/
http://www.avrfreaks.net/index.php?module=Freaks%20Academy&func=viewItem&item_id=688&item_type=project
http://www.avrfreaks.net/index.php?module=Freaks%20Academy&func=viewItem&item_id=626&item_type=project
http://www.avrfreaks.net/index.php?module=Freaks%20Academy&func=viewItem&item_id=460&item_type=project
http://www.harbaum.org/till/nanovm/index.shtml
Wren fits your criteria -- by default it's configured to use just 4k of RAM. AFAIK it hasn't seen any actual use, since the guy I wrote it for decided he didn't need an interpreter running wholly on the target system after all.
The language is influenced most obviously by ML and Forth.
Have you considered a port in C of Tiny Basic? Or perhaps porting the UCSD Pascal p-machine from the Z-80 to your architecture?
Seriously, though, JavaScript would make a good embedded scripting language, but I've no clue what the minimum memory requirements are for the VM + GC, nor how difficult it would be to remove OS dependencies. I played with NJS a while back, which could possibly fit your needs. This one is interesting in that the compiler is written in JavaScript (self-hosting).
You can take a look at the very powerful AvrCo Multitasking Pascal for AVR. You can try it at http://www.e-lab.de. The MEGA8/88 version is free. There are tons of drivers and a simulator with a JTAG debugger and nice live or simulated visualizations of all standard devices (LCDCHAR, LCDGRAPH, 7SEG, 14SEG, LEDDOT, KEYBOARD, RC5, SERVO, STEPPER...).
You're missing EmbedVM, homepage here, svn repo here. Remember to check out both [1,2] videos on the front page ;)
From the homepage:
EmbedVM is a small embeddable virtual machine for microcontrollers
with a C-like language frontend. It has been tested with GCC and AVR
microcontrollers. But as the Virtual machine is rather simple it
should be easy to port it to other architectures.
The VM simulates a 16bit CPU that can access up to 64kB of memory. It
can only operate on 16bit values and arrays of 16bit and 8bit values.
There is no support for complex data structures (struct, objects,
etc.). A function can have a maximum of 32 local variables and 32
arguments.
Besides the memory for the VM, a small structure holding the VM state
and the reasonable amount of memory the EmbedVM functions need on the
stack there are no additional memory requirements for the VM.
Especially the VM does not depend on any dynamic memory management.
EmbedVM is optimized for size and simplicity, not execution speed. The
VM itself takes up about 3kB of program memory on an AVR
microcontroller. On an AVR ATmega168 running at 16MHz the VM can
execute about 75 VM instructions per millisecond.
All memory accesses done by the VM are performed using user callback
functions. So it is possible to have some or all of the VM memory on
external memory devices, flash memory, etc. or "memory-map" hardware
functions to the VM.
The compiler is a UNIX/Linux commandline tool that reads in a *.evm
file and generates bytecode in various formats (binary file, Intel hex,
C array initializers and a special debug output format). It also
generates a symbol file that can be used to access data in the VM
memory from the host application.
The C-like language looks like this: http://svn.clifford.at/embedvm/trunk/examples/numberquizz/vmcode.evm
I would recommend MY-BASIC; it runs in a minimum of 8 KB of RAM and is easy to port.
There's also JavaScript, via Espruino.
This is built specifically for microcontrollers, and there are builds for various chips (mainly STM32s) that fit a full system into as little as 8 kB of RAM.
Have you considered simply using the /bin/sh supplied by BusyBox? Or one of the smaller scripting languages they recommend?
Prolog - http://www.gprolog.org/
According to a Google search for "prolog small", the size of the executable can be made quite small by avoiding linking the built-in predicates.
None of the languages in the list in the question or in the answers proved satisfactory for the requirement of super easy compilation and integration into an existing microcontroller project (disclosure: I didn't actually try every single one of the suggestions).
I found instead tinyscript, which is a single .c + .h file that compiled with the rest of the source files in my project; the only additional configuration required was to provide a void outchar(int c), which can be empty if you don't require output from the scripts.
For me speed of execution is far less important than ease of build and integration and interop with C, as my use case is mainly just calling some C functions in order.
In my previous work I used BusyBox on a Blackfin.
We compiled Perl + PHP for it; after changing s/fork/vfork/g it worked pretty well... more or less. Not having an MMU is not a good idea: memory fragmentation will kill the server pretty easily. All I did was:
for i in `seq 1 100`; do wget http://black-fin-ip/test.php; done
It died while I was walking to my boss and telling him that the server is going to die in production :)
I would suggest using Python. But the only problem is the memory overhead, right? So I have a great idea for people who may be stuck on this problem later on.
First things first: write a bf (Brainfuck) interpreter (or just get the source code from somewhere). The interpreter will be really small, and bf is a Turing-complete language. Now you write your code in Python and then transpile it to bf using bfpy (https://github.com/felko/bfpy/blob/master/README.md). I've given you the solution with the least overhead, and I am pretty sure a bf interpreter will easily stay under 10 KB of RAM usage.
Erlang - http://erlang.org/
It can fit in 2 MB:
http://www.experts123.com/q/is-erlang-small-enough-for-embedded-systems.html
