Can someone explain to me what the above message means? I am developing a Linux block driver and I am attempting to format with ext4. After a few minutes I get this message. I have tried searching other threads but can't find an explanation of what it is. Thanks
tl;dr: The kernel's random-number generator is ready to generate random numbers that are unpredictable enough for serious cryptographic use.
On some systems, something at boot time (e.g. starting sshd) waits for this; that comes up frequently when switching an embedded system to OpenSSL 1.1. You can fix it with tools like egd or rng-tools, with hardware randomness support, or by tweaking things so the rest of bootup doesn't wait for that something to complete.
Backstory:
Pseudo-random number generators are deterministic algorithms, so with enough output (and/or some knowledge of the internal state, or good guesses about it) an attacker can predict the future output. This is Really Bad if some of that output is going to be e.g. a secret cryptographic key.
For a long time, the Linux kernel has had code to extract some true randomness ("entropy") from unpredictable events (arrival times of network packets, user input, etc.), using math we're not expected to understand, and the resulting randomness is made available via /dev/random. If you read from /dev/random it will give you unpredictable random numbers until this randomness is exhausted; then you have to wait for the kernel to extract more. /dev/urandom will give you the same random numbers, but if the true randomness runs out, it will keep going with a (potentially predictable) algorithm, so it will always give you something. (Some systems also have hardware support for true randomness, e.g. from thermal noise.)
But it turns out, for cryptographic purposes, you don't need an unending supply of true randomness. If you start with enough true randomness to get a strong cryptographic key, you can then encrypt (say) an unending string of zeroes. An attacker cannot predict that output without knowing the key (if they can, the encryption you're using is broken, and you've already lost, regardless of how good your randomness is).
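To make the "encrypt zeroes" idea concrete, here's a minimal C sketch using OpenSSL's EVP interface (my own illustration, not the kernel's actual construction; the all-zero key and IV stand in for bytes you would take from true randomness). Build with cc -lcrypto.

    #include <openssl/evp.h>
    #include <stdio.h>

    int main(void) {
        /* Pretend these 48 bytes came from true randomness gathered at boot. */
        unsigned char key[32] = {0}, iv[16] = {0};
        unsigned char zeroes[32] = {0}, out[32];
        int outlen;

        /* AES-256-CTR over a stream of zero bytes = an unpredictable keystream. */
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv);
        EVP_EncryptUpdate(ctx, out, &outlen, zeroes, sizeof zeroes);
        EVP_CIPHER_CTX_free(ctx);

        for (int i = 0; i < outlen; i++)
            printf("%02x", out[i]);
        printf("\n");
        return 0;
    }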
So the kernel will collect some randomness from the rest of the system at bootup, until it has enough to generate a good crypto key, then it can generate unpredictable random numbers forever.
Now there's a system call, getrandom(), which OpenSSL 1.1 uses by default to seed its random number generators, and that system call will not return until the system has collected enough randomness.
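For completeness, this is roughly what the blocking behaviour looks like from C (a minimal sketch; glibc 2.25+ exposes the syscall via <sys/random.h>):

    #include <sys/random.h>
    #include <stdio.h>

    int main(void) {
        unsigned char buf[16];
        /* Blocks until the kernel prints "crng init done",
         * then returns immediately on every later call. */
        ssize_t n = getrandom(buf, sizeof buf, 0);
        if (n < 0) { perror("getrandom"); return 1; }
        for (ssize_t i = 0; i < n; i++) printf("%02x", (unsigned)buf[i]);
        printf("\n");
        return 0;
    }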
Related
I'm very frustrated in the battle with cheaters in my game. I found that a lot of hackers tamper with my game data to get around the anti-cheat system. I have tried some methods to verify whether the game data has been tampered with, such as encrypting my asset package or checking the hash of the package header.
However, I'm stuck on the fact that my asset package is huge, roughly 1~3 GB. I know digital signatures work very well for verifying data, but I need this to happen in near real time.
It seems I have to trade off verifying the whole file against performance. Is there any way to verify a huge file in a short time?
AES-NI-based hashing such as Meow Hash can easily reach 16 bytes per cycle on a single thread; that is, for data already in cache, it processes tens of gigabytes of input per second. In reality, memory and disk I/O speed become the limiting factor, but those limits apply to any method, so you can treat them as the upper bound. Since it's not designed for security, it's also possible for cheaters to quickly find a viable collision.
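As a rough illustration of streaming a multi-gigabyte package through a hash in fixed-size chunks (a sketch only: FNV-1a stands in for Meow Hash so the example stays self-contained, and the real thing is far faster):

    #include <stdint.h>
    #include <stdio.h>

    /* FNV-1a: a simple non-cryptographic hash used here only as a placeholder. */
    static uint64_t fnv1a_update(uint64_t h, const unsigned char *p, size_t n) {
        for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ULL; }
        return h;
    }

    int main(int argc, char **argv) {
        if (argc != 2) { fprintf(stderr, "usage: %s <asset-package>\n", argv[0]); return 2; }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        static unsigned char buf[1 << 20];      /* 1 MiB chunks keep memory use flat */
        uint64_t h = 1469598103934665603ULL;    /* FNV offset basis */
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, f)) > 0)
            h = fnv1a_update(h, buf, n);
        fclose(f);

        printf("%016llx\n", (unsigned long long)h);  /* compare against the expected digest */
        return 0;
    }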
But even if you find a sweet spot between speed and security, you're still relying on cheaters not intercepting your file/memory I/O. It's also still possible for cheaters to simply NOP out any asset-verification call. Since you care about cheaters, I'd assume this is an online game; the more common practice is to rearchitect the game so cheating is prevented even with a tampered asset. Valorant moves the line-of-sight calculation to the server side; League of Legends added a kernel driver.
I am writing my first NES emulator in C. The goal is to make it easily understandable and cycle accurate (it does not necessarily have to be code-efficient), in order to play games at normal 'hardware' speed. When digging into the technical references of the 6502, it seems that instructions consume more than one CPU cycle, and also take a different number of cycles depending on conditions (such as branching). My plan is to create read and write functions, and also group opcodes by addressing mode using a switch.
The question is: When I have a multiple-cycle instruction, such as a BRK, do I need to emulate exactly what happens in each cycle:
Method 1:
cycle - action
1 - read BRK opcode
2 - read padding byte (ignored)
3 - store high byte of PC
4 - store low byte of PC
5 - store status flags with B flag set
6 - low byte of target address
7 - high byte of target address
...or can I just execute all the required operations in one 'cycle' (one switch case) and do nothing in the remaining cycles?
Method 2:
1 - read BRK opcode,
read padding byte (ignored),
store high byte of PC,
store low byte of PC,
store status flags with B flag set,
low byte of target address,
high byte of target address
2 - do nothing
3 - do nothing
4 - do nothing
5 - do nothing
6 - do nothing
7 - do nothing
Since both methods consume the desired 7 cycles, is there any difference between the two, accuracy-wise?
Personally, I think method 1 is the way to go; however, I cannot think of a clean, easy way to implement it. (Please help!)
Do you 'need' to? It depends on the software. Imagine the simplest example:
STA ($56), Y
... which happens to hit a hardware register. If you don't do at least the write on the correct cycle then you've introduced a timing deficiency. The register you're writing to will be written to at the wrong time. What if it's something like a palette register, and the programmer is running a raster effect? Then you've just moved where the colour changes. You've changed the graphical output.
In practice, clever programmers do much smarter things than that — e.g. one might use a read-modify-write operation to read a hardware value at an exact cycle, modify it, then write it back at some other exact cycle.
So my answer is:
most software isn't written so that the difference between (1) and (2) will have any effect; but
some definitely is, because the author was very clever; and
some definitely is, just because the author experimented until they found a cool effect, regardless of whether they were cognisant of the cause; and
in any case, when you find something that doesn't work properly on your emulator, how much time do you want to spend considering all the permutations and combinations of potential causes? Every one you can factor out is one less to consider.
Most emulators used to use your method (2). What normally happened is that they worked with 90% of software. Then there were a few cases that didn't work, for which the emulator author put in a special case here, a special case there. Those usually ended up interacting poorly, and the emulator spent the rest of its life oscillating between supporting different 95% subsets of the available software until somebody wrote a better one.
So just go with method (1). It will cause some software that would otherwise be broken not to be so. Also it'll teach you more, and it'll definitely eliminate any potential motivation for special cases so it'll keep your code cleaner. It'll be marginally slower but I think your computer can probably handle it.
Other tips: the 6502 has only a few addressing modes, and the addressing mode entirely dictates the timing. This document is everything you need to know for perfect timing. If you want perfect cleanliness, your switch table can just pick an addressing mode and a central operation, then exit and you can branch on addressing mode to do the main action.
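To show one possible shape for method (1), here's a C sketch of a cycle-stepped core handling BRK; struct cpu, cpu_read and cpu_write are names of my own invention rather than anything NES-specific, and the flat 64 KiB memory stands in for your real bus:

    #include <stdint.h>

    static uint8_t memory[0x10000];
    static uint8_t cpu_read(uint16_t addr)             { return memory[addr]; }
    static void    cpu_write(uint16_t addr, uint8_t v) { memory[addr] = v; }

    typedef struct cpu {
        uint16_t pc;
        uint8_t  s, p;        /* stack pointer, status register */
        uint8_t  opcode;
        int      cycle;       /* cycle within the current instruction; 0 = fetch */
    } cpu;

    void cpu_tick(cpu *c) {               /* call once per CPU clock */
        if (c->cycle == 0) {              /* cycle 1: fetch the opcode */
            c->opcode = cpu_read(c->pc++);
            c->cycle = 1;
            return;
        }
        switch (c->opcode) {
        case 0x00:                        /* BRK: 7 cycles in total */
            switch (c->cycle++) {
            case 1: (void)cpu_read(c->pc++);                          break; /* padding byte */
            case 2: cpu_write(0x0100 + c->s--, c->pc >> 8);           break; /* push PCH */
            case 3: cpu_write(0x0100 + c->s--, c->pc & 0xff);         break; /* push PCL */
            case 4: cpu_write(0x0100 + c->s--, c->p | 0x30);          break; /* push P, B set */
            case 5: c->pc = (c->pc & 0xff00) | cpu_read(0xfffe);      break; /* vector low */
            case 6: c->pc = (uint16_t)(cpu_read(0xffff) << 8) | (c->pc & 0x00ff);
                    c->p |= 0x04;         /* set I */
                    c->cycle = 0;                                     break; /* next: fetch */
            }
            break;
        /* ... other opcodes, grouped by addressing mode ... */
        default:
            c->cycle = 0;                 /* unimplemented: treated as a 1-cycle no-op here */
            break;
        }
    }

Because each tick performs exactly one bus access, writes such as the palette example above land on the correct cycle for free.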
If you're going to use vanilla read and write methods (which is smart on a 6502, since every single cycle is either a read or a write, so that's almost all you need to express), just be careful about the method signatures. For example, the 6502 has a SYNC pin which allows an observer to distinguish an ordinary read from an opcode fetch. Check whether the NES exposes that to cartridges: systems that do expose it often use it for implicit paging, and the main identifying characteristic of the NES is that there are hundreds of paging schemes.
EDIT: minor updates:
it's not actually completely true to say that a 6502 always reads or writes; it also has an RDY input. If RDY is asserted while the 6502 intends to read, it will instead halt while maintaining the intended read address. This is rarely used in practice because it's insufficient for common tasks like letting something else take possession of memory (the 6502 will write regardless of RDY; it's really meant to help with single-stepping), and since it seemingly isn't included on the NES cartridge pinout, you needn't implement it for that machine.
per the same pinout, the sync signal also doesn't seem to be exposed to cartridges on that system.
Ordinary single-threaded *nix programs can be benchmarked with utilities like time, e.g.:
# how long does `seq` take to count to 100,000,000
/usr/bin/time seq 100000000 > /dev/null
Outputs:
1.16user 0.06system 0:01.23elapsed 100%CPU (0avgtext+0avgdata 1944maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps
...but the numbers returned are always system-dependent, which in a sense means they also measure the user's hardware.
Is there some non-relative benchmarking method or command-line utility which would return approximately the same virtual timing numbers on any system (or at least a reasonably large subset of systems)? Just as grep -m1 bogo /proc/cpuinfo returns a rough but stable unit, such a benchmark should return a similarly stable unit of duration.
Suppose for benchmarking ordinary commands we have a magic util bogobench (where "bogo" is an adjective signifying "a somewhat bogus status", but not necessarily having algorithms in common with BogoMIPs):
bogobench foo bar.data
And we run this on two physically separate systems:
a 1996 Pentium II
a 2015 Xeon
Desired output would be something like:
21 bogo-seconds
So bogobench should return about the same number in both cases, even though it probably would finish in much less time on the 2nd system.
A hardware emulator like qemu might be one approach, but not necessarily the only approach:
Insert the code to benchmark into a wrapper script bogo.sh
Copy bogo.sh to a bootable Linux disk image bootimage.iso, within a directory where bogo.sh would autorun and then promptly shut down the emulator, outputting along the way some form of timing data to parse into bogo-seconds.
Run bootimage.iso using one of qemu's more minimal -machine options:
qemu-system-i386 -machine type=isapc bootimage.iso
But I'm not sure how to make qemu use a virtual clock, rather than the host CPU's clock, and qemu itself seems like a heavy tool for a seemingly simple task. (Really MAME or MESS would be more versatile emulators than qemu for such a task -- but I'm not adept with MAME, although MAME currently has some capacity for 80486 PC emulation.)
Online we sometimes compare and contrast timing-based benchmarks made on machine X with ones made on machine Y. Whereas I'd like both users X and Y to be able to run their benchmark on a virtual machine Z, with bonus points for emulating X or Y (like MAME) if need be, except with no consideration of X's or Y's real run-time (unlike MAME, where emulations are often playable). In this way users could report how programs perform in interesting cases without the programmer having to worry that the results were biased by idiosyncrasies of a user's hardware, such as CPU quirks, background processes hogging resources, etc.
Indeed, even on the user's own hardware, a time-based benchmark can be unreliable, as the user often can't be sure that some background process (or bug, or hardware error like a bad sector, or virus) isn't degrading some aspect of performance. A more virtual benchmark ought to be less susceptible to such influences.
The only sane way I see to implement this is with a cycle-accurate simulator for some kind of hardware design.
AFAIK, no publicly-available cycle-accurate simulators for modern x86 hardware exist, because it's extremely complex and despite a lot of stuff being known about x86 microarchitecture internals (Agner Fog's stuff, Intel's and AMD's own optimization guides, and other stuff in the x86 tag wiki), enough of the behaviour is still a black box full of CPU-design trade-secrets that it's at best possible to simulate something similar. (E.g. branch prediction is definitely one of the most secret but highly important parts).
While it should be possible to come close to simulating Intel Sandybridge or Haswell's actual pipeline and out-of-order core / ROB / RS (at far slower than realtime), nobody has done it that I know of.
But cycle-accurate simulators for other hardware designs do exist: Donald Knuth's MMIX architecture is a clean RISC design that could actually be built in silicon, but currently only exists on paper.
From that link:
Of particular interest is the MMMIX meta-simulator, which is able to do dynamic scheduling of a complex pipeline, allowing superscalar execution with any number of functional units and with many varieties of caching and branch prediction, etc., including a detailed implementation of both hard and soft interrupts.
So you could use this as a reference machine for everyone to run their benchmarks on, and everyone could get comparable results that will tell you how fast something runs on MMIX (after compiling for MMIX with gcc). But not how fast it runs on x86 (presumably also compiling with gcc), which may differ by a significant factor even for two programs that do the same job a different way.
For [fastest-code] challenges over on the Programming Puzzles and Code Golf site, orlp created the GOLF architecture with a simulator that prints timing results, designed for exactly this purpose. It's a toy architecture with features like printing to stdout by storing to 0xffffffffffffffff, so it's not necessarily going to tell you anything about how fast something will run on any real hardware.
There isn't a full C implementation for GOLF, AFAIK, so you can only really use it with hand-written asm. This is a big difference from MMIX, which optimizing compilers do target.
One practical approach that could (maybe?) be extended to be more accurate over time is to use existing tools to measure some hardware invariant performance metric(s) for the code under test, and then apply a formula to come up with your bogoseconds score.
Unfortunately most easily measurable hardware metrics are not invariant - rather, they depend on the hardware. An obvious one that should be invariant, however, would be "instructions retired". If the code takes the same code paths every time it is run, the instructions-retired count should be the same on all hardware [1].
Then you apply some kind of nominal clock speed (let's say 1 GHz) and nominal CPI (let's say 1.0) to get your bogoseconds - if you measure 15e9 instructions, you output a result of 15 bogoseconds.
The primary flaw here is that the nominal CPI may be way off from the actual CPI! While most programs hover around 1 CPI, it's easy to find examples where they can approach 0.25 or whatever the inverse of the width is, or alternately be 10 or more if there are many lengthy stalls. Of course such extreme programs may be what you'd want to benchmark - and even if not you have the issue that if you are using your benchmark to evaluate code changes, it will ignore any improvements or regressions in CPI and look only at instruction count.
Still, it satisfies your requirement in as much as it effectively emulates a machine that executes exactly 1 instruction every cycle, and maybe it's a reasonable broad-picture approach. It is pretty easy to implement with tools like perf stat -e instructions (like one-liner easy).
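As a sketch of that one-liner done from C (so you can wrap exactly the region you care about), perf_event_open reads the same counter perf stat uses; run_workload() below is just a placeholder for the code under test, and the 1 GHz / 1.0 CPI figures are the nominal ones discussed above:

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags) {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    static void run_workload(void) {            /* placeholder workload */
        volatile uint64_t x = 0;
        for (uint64_t i = 0; i < 100000000ULL; i++) x += i;
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof attr;
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.disabled = 1;
        attr.exclude_kernel = 1;                /* count user-space instructions only */

        int fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        run_workload();
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t instructions = 0;
        read(fd, &instructions, sizeof instructions);
        close(fd);

        /* nominal 1 GHz clock and CPI of 1.0: bogoseconds = instructions / 1e9 */
        printf("%llu instructions ~= %.2f bogoseconds\n",
               (unsigned long long)instructions, instructions / 1e9);
        return 0;
    }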
To patch the holes then you could try to make the formula better - let's say you could add in a factor for cache misses to account for that large source of stalls. Unfortunately, how are you going to measure cache-misses in a hardware invariant way? Performance counters won't help - they rely on the behavior and sizes of your local caches. Well, you could use cachegrind to emulate the caches in a machine-independent way. As it turns out, cachegrind even covers branch prediction. So maybe you could plug your instruction count, cache miss and branch miss numbers into a better formula (e.g., use typical L2, L3, RAM latencies, and a typical cost for branch misses).
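A hypothetical scoring formula along those lines might look like the following; the penalty constants are illustrative guesses at "typical" latencies, not measurements:

    #include <stdio.h>

    /* Plug cachegrind-style counts into nominal per-event penalties. */
    static double bogoseconds(double instructions, double l1_misses,
                              double ll_misses, double branch_misses) {
        const double GHZ        = 1e9;   /* nominal 1 GHz clock                 */
        const double BASE_CPI   = 1.0;   /* nominal 1 cycle per instruction     */
        const double L1_PENALTY = 12.0;  /* ~cycles per L1 miss served by L2/L3 */
        const double LL_PENALTY = 100.0; /* ~cycles per last-level miss to RAM  */
        const double BR_PENALTY = 15.0;  /* ~cycles per branch mispredict       */

        double cycles = instructions  * BASE_CPI
                      + l1_misses     * L1_PENALTY
                      + ll_misses     * LL_PENALTY
                      + branch_misses * BR_PENALTY;
        return cycles / GHZ;
    }

    int main(void) {
        /* Example counts in the shape cachegrind reports them. */
        printf("%.2f bogoseconds\n", bogoseconds(15e9, 2e8, 1e7, 5e7));
        return 0;
    }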
That's about as far as this simple approach will take you, I think. After that, you might as well just rip apart one of the existing x86 emulators [2] and add your simple machine model right in there. You don't need to be cycle-accurate; just pick a nominal width and model it. Probably whatever underlying emulation cachegrind is doing would be a good match, and you'd get the cache and branch-prediction modeling for free.
[1] Of course, this doesn't rule out bugs or inaccuracies in the instruction-counting mechanism.
[2] You didn't tag your question x86 - but I'm going to assume that's your target since you mentioned only Intel chips.
I have received a Verilog project with a key component in it being encrypted.
The performance on an FPGA varies depending on build environment and configuration [Note 1], and I suspect this is caused by insufficient timing constraints [Note 2].
In the TimeQuest timing analyzer I can see the names of the pins, nets, registers, ports within the encrypted core, but without looking at the actual code I don't know exactly what they mean.
So how should I start writing an SDC timing constraint in this situation?
Note 1:
The component is a MIPI-CSI2 TX. On the test RX side I constantly get SoT errors (SoT error and sync not achieved), and sometimes ECC errors.
For a while, to make it "work" on the FPGA, the code had to be built on a Windows machine. Then, after a few minor and unrelated changes were made to the code, it would work if built on a Linux machine. Very recently, a Linux build machine seems to produce better results than a Windows machine.
Also, changing optimization parameters seems to break the design very badly. Currently only "Balanced" mode works; the "Performance" and "Aggressive Performance" modes, which should improve performance, cause a lot of errors in the received signals. The memory contents of an upstream signal-processing block also affect the MIPI-CSI2 TX in this way.
All of the above makes me think there are some uncertainties not fully covered by the SDC timing constraints.
Note 2:
I'm not able to fully verify this theory because I don't have the equipment to properly test the signals, nor can I do gate-level simulation, because the encrypted code does not allow me to generate an EDA netlist.
In general, there are two types of constraints I am usually concerned with up front. One is the I/O constraints: when signals go off-chip, you want to control the relationship between the data and the clock used to transmit it, and of course between the receive clock and the receive data. The second type is the clock constraint: you want to validate that the clocks themselves are specified correctly. Depending on the tool, once you have done that, there is generally a clock-derivation step to pick up any PLL-derived clocks; that part may be automatic, or, if it is like Altera, you just need to call derive_clocks and you are on your way.
There are other kinds of constraints of course, but most of those are for specifying exceptions: "make this a multi-cycle path, false-path that...", etc. Those are probably not what you want here. Constrain to the clock, and that should be pretty tight.
If it is an input/output constraint issue, you can vary the relationship between the RX and TX clocks and associated data.
Beyond that, if this encrypted IP is provided by a vendor, they will usually supply any additional constraints that are required beyond the basic input clock constraint that you need.
One strategy, if you really think it is a clock-variation issue, is to spec the input clock to be a little faster than it really is. This makes timing tougher to meet, but you gain margin in your final image.
Another thing to think about: depending on the technology you are building for, you need to specify the device part number correctly, because of the speed grade. You might have a slow part but be compiling for a fast part; this will wreak havoc. Some vendors guard against loading the wrong image onto the wrong part, but some will let you hang yourself.
Finally, check your I/O signaling standard and termination specs. If you think signal integrity is at play here, then these will have to be considered. Make sure you aren't using LVDS or CML when you should be using HSSL, or whether you might need to add pull-ups to the signaling pins, etc.
Some background info: I was looking to run a script on a Red Hat server to read some data from /dev/random and use the Perl unpack() command to convert it to a hex string for usage later on (benchmarking database operations). I ran a few "head -1" on /dev/random and it seemed to be working out fine, but after calling it a few times, it would just kinda hang. After a few minutes, it would finally output a small block of text, then finish.
I switched to /dev/urandom (I really didn't want to; it's slower and I don't need that quality of randomness) and it worked fine for the first two or three calls, then it too began to hang.
I was wondering if it was the "head" command that was bombing it, so I tried doing some simple I/O using Perl, and it too was hanging.
As a last-ditch effort, I used the "dd" command to dump some data out of it directly to a file instead of to the terminal. All I asked of it was 1 MB of data, but it took 3 minutes to produce ~400 bytes before I killed it.
I checked the process lists, CPU and memory were basically untouched. What exactly could cause /dev/random to crap out like this and what can I do to prevent/fix it in the future?
Edit: Thanks for the help guys! It seems that I had random and urandom mixed up. I've got the script up and running now. Looks like I learned something new today. :)
On most Linux systems, /dev/random is powered by actual entropy gathered from the environment. If your system isn't delivering a large amount of data from /dev/random, it likely means that you're not generating enough environmental randomness to feed it.
I'm not sure why you think /dev/urandom is "slower" or higher quality. It reuses an internal entropy pool to generate pseudorandomness - making it slightly lower quality - but it doesn't block. Generally, applications that don't require high-level or long-term cryptography can use /dev/urandom reliably.
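As a quick sanity check from C, a read from /dev/urandom returns immediately (a minimal sketch):

    #include <stdio.h>

    int main(void) {
        unsigned char buf[16];
        FILE *f = fopen("/dev/urandom", "rb");   /* never blocks once the system is up */
        if (!f || fread(buf, 1, sizeof buf, f) != sizeof buf) {
            perror("/dev/urandom");
            return 1;
        }
        fclose(f);
        for (size_t i = 0; i < sizeof buf; i++) printf("%02x", (unsigned)buf[i]);
        printf("\n");
        return 0;
    }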
Try waiting a little while then reading from /dev/urandom again. It's possible that you've exhausted the internal entropy pool reading so much from /dev/random, breaking both generators - allowing your system to create more entropy should replenish them.
See Wikipedia for more info about /dev/random and /dev/urandom.
This question is pretty old, but it's still relevant, so I'm going to give my answer. Many CPUs today come with a built-in hardware random number generator (RNG). Many systems also come with a Trusted Platform Module (TPM) that provides an RNG as well. There are also other options that can be purchased, but chances are your computer already has something.
You can use rngd from the rng-tools package on most Linux distros to feed more random data into the pool. For example, on Fedora 18 all I had to do to enable seeding from the TPM and the CPU RNG (RDRAND instruction) was:
# systemctl enable rngd
# systemctl start rngd
You can compare the speed with and without rngd. It's a good idea to run rngd -v -f from the command line; that will show you the detected entropy sources. Make sure all modules necessary to support your sources are loaded. To use the TPM, it needs to be activated through tpm-tools. Update: here is a nice howto.
BTW, I've read some concerns on the Internet about TPM RNGs often being broken in various ways, but I haven't read anything concrete against the RNGs found in Intel, AMD and VIA chips. Using more than one source would be best if you really care about randomness quality.
urandom is good for most use cases (except sometimes during early boot). Most programs nowadays use urandom instead of random. Even openssl does that. See myths about urandom and comparison of random interfaces.
In recent Fedora and RHEL/CentOS, rng-tools also supports jitter entropy. You can use it if you lack hardware options, or if you just trust it more than your hardware.
UPDATE: another option for more entropy is HAVEGED (of questioned quality). On virtual machines there is the KVM/QEMU virtio-rng device (recommended).
UPDATE 2: Since Linux 5.6, the kernel does its own jitter entropy.
Use /dev/urandom; it's cryptographically secure.
A good read: http://www.2uo.de/myths-about-urandom/
"If you are unsure about whether you should use /dev/random or /dev/urandom, then probably you want to use the latter."
When in doubt during early boot about whether enough entropy has been gathered, use the getrandom() system call instead [1] (available since Linux kernel 3.17).
It's the best of both worlds:
it blocks until enough entropy has been gathered (but only once!),
after that it will never block again.
[1] git kernel commit
If you want more entropy for /dev/random then you'll either need to purchase a hardware RNG or use one of the *_entropyd daemons in order to generate it.
If you are using randomness for testing (not cryptography), then repeatable randomness is better; you can get this with pseudo-randomness starting from a known seed. There is usually a good library function for this in most languages.
It is repeatable, for when you find a problem and are trying to debug, and it does not eat up entropy. Maybe seed the pseudo-random generator from /dev/urandom and record the seed in the test log. Perl has a pseudo-random number generator you can use.
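In C the same pattern might look like this (a sketch; Perl's srand/rand follow the same idea):

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        unsigned int seed;
        if (argc > 1) {
            seed = (unsigned int)strtoul(argv[1], NULL, 0);   /* replay a logged seed */
        } else {
            FILE *f = fopen("/dev/urandom", "rb");            /* one-time seed, cheap on entropy */
            if (!f || fread(&seed, sizeof seed, 1, f) != 1) { perror("seed"); return 1; }
            fclose(f);
        }
        fprintf(stderr, "test seed: %u\n", seed);             /* record it in the test log */

        srand(seed);                                          /* deterministic from here on */
        for (int i = 0; i < 5; i++)
            printf("%d\n", rand());
        return 0;
    }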
This fixed it for me.
Use new SecureRandom() instead of SecureRandom.getInstanceStrong()
Some more info can be found here:
https://tersesystems.com/blog/2015/12/17/the-right-way-to-use-securerandom/
/dev/random should be pretty fast these days. However, I did notice that on OS X reading small numbers of bytes from /dev/urandom was really slow. The workaround seemed to be to use arc4random instead: https://github.com/crystal-lang/crystal/pull/11974
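On macOS/BSD the workaround is only a few lines (a sketch; arc4random_buf is declared in <stdlib.h> there):

    #include <stdlib.h>
    #include <stdio.h>

    int main(void) {
        unsigned char buf[16];
        arc4random_buf(buf, sizeof buf);    /* fills the buffer; no fd, no blocking */
        for (size_t i = 0; i < sizeof buf; i++) printf("%02x", (unsigned)buf[i]);
        printf("\n");
        return 0;
    }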