How does one safely compare two strings with bounded length in such a way that each comparison takes the same time? Hashing unfortunately has a timing attack vulnerability.
Is there any way to compare two strings without hashing in a way that is not vulnerable to timing-attacks?
TL;DR: Use assembly.
Constant Time code is really hard to pull off. To be truly constant time you need:
a constant time algorithm,
a constant time implementation of said algorithm.
What does "constant time algorithm" mean?
The example of string comparison is great. Most of the time, you want the comparison to take as little time as possible, which means bailing out at the first difference:
fn simple_compare(a: &str, b: &str) -> bool {
    if a.len() != b.len() { return false; }
    for (a, b) in a.bytes().zip(b.bytes()) {
        if a != b { return false; }
    }
    true
}
The constant-time version of the algorithm, however, should take the same time regardless of the input:
the input should always have the same size,
the time taken to compute the result should be identical no matter where the difference is located (if any).
The algorithm Lukas gave is almost right:
/// Prerequisite: a.len() == b.len()
fn ct_compare(a: &str, b: &str) -> bool {
    debug_assert!(a.len() == b.len());
    a.bytes().zip(b.bytes())
        .fold(0, |acc, (a, b)| acc | (a ^ b)) == 0
}
What does "constant time implementation" mean?
Even if the algorithm is constant time, the implementation may not be.
If the exact same sequence of CPU instructions is not used, then on some architectures one instruction could be faster while another is slower, and the implementation would no longer be constant time.
If the algorithm uses table look-up, then there could be more or less cache misses.
Can you write a constant time implementation of string comparison in Rust?
No.
The Rust language could potentially be suited to the task, however its toolchain is not:
the LLVM optimizer will wreak havoc with your algorithm, short-circuiting it, eliminating unnecessary reads, now or in the future,
the LLVM backends will wreak havoc with your implementation, picking different instructions.
The long and short of it is that, today, the only way to get a constant time implementation from Rust is to write said implementation in assembly.
Writing a timing-attack-safe string comparison algorithm yourself is pretty easy in theory. There are many resources online on how to do it in other languages. The important part is to trick the optimizer into not optimizing your code in a way you don't want. Here is one example Rust implementation which uses the algorithm described here:
fn ct_compare(a: &str, b: &str) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.bytes().zip(b.bytes())
        .fold(0, |acc, (a, b)| acc | (a ^ b)) == 0
}
(Playground)
Of course, this algorithm can be easily generalized to everything that is AsRef<[u8]>. This is left as an exercise to the reader ;-)
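For those who want a starting point, a minimal sketch of such a generalization might look like this (the signature and name are my own, not taken from any crate):

fn ct_compare<A: AsRef<[u8]>, B: AsRef<[u8]>>(a: A, b: B) -> bool {
    let (a, b) = (a.as_ref(), b.as_ref());
    if a.len() != b.len() {
        return false;
    }
    // Accumulate differences with XOR/OR so the loop always runs to the end.
    a.iter().zip(b.iter()).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}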
It looks like there is a crate already offering these kinds of comparisons: consistenttime. I haven't tested it, but the documentation looks quite good.
For those looking for a crate providing such implementation, you can use rust-crypto that provides the function fixed_time_eq.
The implementation is very similar to the one of Lukas Kalbertodt.
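A usage sketch, assuming the function lives at crypto::util::fixed_time_eq and compares byte slices (worth double-checking against the crate's documentation):

use crypto::util::fixed_time_eq;

fn main() {
    let a = b"secret token";
    let b = b"secret tokem";
    // Compares the two byte slices without bailing out at the first difference.
    println!("{}", fixed_time_eq(a, b));
}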
Related
I've been recently experimenting with different hash functions in Rust. I started off with the fasthash crate, where many algorithms are implemented; e.g., murmur3 is then called as
let hval = murmur3::hash32_with_seed(&tmp_read_buff, 123 as u32);
This works very fast (e.g., a few seconds for 100,000,000 short inputs). I also stumbled upon FxHash, the algorithm used a lot internally in Firefox (at least initially?). I rolled my own version of hashing a byte array with this algorithm as follows:
use rustc_hash::FxHasher;
use std::hash::Hasher;
fn hash_with_fx(read_buff: &[u8]) -> u64 {
    let mut hasher = FxHasher::default();
    for el in read_buff {
        hasher.write_u8(*el);
    }
    return hasher.finish();
}
This works; however, it's about 5x slower. I'd really like to know if I'm missing something apparent here, or how I could achieve similar or better speeds to fasthash's murmur3, for example. My intuition is that with FxHash the core operation is very simple,
self.hash = self.hash.rotate_left(5).bitxor(i).wrapping_mul(K);
hence it should be one of the fastest.
The FxHasher documentation mentions that:
the speed of the hash function itself is much higher because it works on up to 8 bytes at a time.
But your algorithm completely removes this possibility because you are processing each byte individually. You can make it a lot faster by hashing in chunks of 8 bytes.
fn hash_with_fx(read_buff: &[u8]) -> u64 {
    let mut hasher = FxHasher::default();
    let mut chunks = read_buff.chunks_exact(8);
    for bytes in &mut chunks {
        // unwrap is ok because `chunks_exact` provides the guarantee that
        // the `bytes` slice is exactly the requested length
        let int = u64::from_be_bytes(bytes.try_into().unwrap());
        hasher.write_u64(int);
    }
    for byte in chunks.remainder() {
        hasher.write_u8(*byte);
    }
    hasher.finish()
}
For very small inputs (especially for numbers like 7 bytes) it may introduce a small extra overhead compared with your original code. But, for larger inputs, it ought to be significantly faster.
Bonus material
It should be possible to remove a few extra instructions in the loop by using unwrap_unchecked instead of unwrap. It's completely sound to do so in this case, but it may not be worth introducing unsafe code into your codebase. I would measure the difference before deciding to include unsafe code.
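For reference, a sketch of what that inner loop could look like with unwrap_unchecked; the safety argument is the same chunks_exact guarantee as in the comment above:

for bytes in &mut chunks {
    // SAFETY: `chunks_exact(8)` guarantees every chunk is exactly 8 bytes,
    // so the slice-to-array conversion cannot fail.
    let int = u64::from_be_bytes(unsafe { bytes.try_into().unwrap_unchecked() });
    hasher.write_u64(int);
}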
I'd like to try to eliminate bounds checking in code generated by Rust. I have variables that are rarely zero and my code paths ensure they do not run into trouble. But because they can be zero, I cannot use NonZeroU64. When I am sure they are non-zero, how can I signal this to the compiler?
For example, if I have the following function, I know it will be non-zero. Can I tell the compiler this or do I have to have the unnecessary check?
pub fn f(n: u64) -> u32 {
    n.trailing_zeros()
}
I can wrap the number in NonZeroU64 when I am sure, but then I've already incurred the check, which defeats the purpose ...
Redundant checks within a single function body can usually be optimized out. So you just need to convert the number to NonZeroU64 before calling trailing_zeros(), and rely on the compiler to optimize away the redundant checks.
use std::num::NonZeroU64;

pub fn g(n: NonZeroU64) -> u32 {
    n.trailing_zeros()
}

pub fn other_fun(n: u64) -> u32 {
    if n != 0 {
        println!("Do something with non-zero!");
        let n = NonZeroU64::new(n).unwrap();
        g(n)
    } else {
        42
    }
}
In the above code, the if n != 0 check makes sure n cannot be zero within the block, and the compiler is smart enough to remove the unwrap call, making NonZeroU64::new(n).unwrap() a zero-cost operation. You can check the asm to verify that.
core::intrinsics::assume
Informs the optimizer that a condition is always true. If the
condition is false, the behavior is undefined.
No code is generated for this intrinsic, but the optimizer will try to
preserve it (and its condition) between passes, which may interfere
with optimization of surrounding code and reduce performance. It
should not be used if the invariant can be discovered by the optimizer
on its own, or if it does not enable any significant optimizations.
This intrinsic does not have a stable counterpart.
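As a sketch of how that could be applied to the function above (nightly-only, and undefined behavior if the assumption is ever wrong):

#![feature(core_intrinsics)]

pub fn f(n: u64) -> u32 {
    // SAFETY: the caller must guarantee that `n` is never zero here.
    unsafe { core::intrinsics::assume(n != 0) };
    n.trailing_zeros()
}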
Is there any difference between these two ways to get a lowercase Vec<char>? This version iterates over chars, converts them and collects the results:
fn lower(s: &str) -> Vec<char> {
    s.chars().flat_map(|c| c.to_lowercase()).collect()
}
and this version first converts to a String and then collects the chars of that:
fn lower_via_string(s: &str) -> Vec<char> {
    s.to_lowercase().chars().collect()
}
A short look at the code for str::to_lowercase immediately revealed a counterexample: It appears that Σ at the end of words receives special treatment from str::to_lowercase, which chars()-then-char::to_lowercase can't give, so the results differ on "xΣ ".
Playground
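A minimal demonstration of that difference (easy to verify on the Playground):

fn main() {
    let s = "xΣ ";
    let via_string: Vec<char> = s.to_lowercase().chars().collect();
    let via_chars: Vec<char> = s.chars().flat_map(|c| c.to_lowercase()).collect();
    // `str::to_lowercase` maps the word-final Σ to ς, while the char-by-char
    // version always produces σ, so the two results differ.
    assert_ne!(via_string, via_chars);
    println!("{:?} vs {:?}", via_string, via_chars);
}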
Before looking at the code of str::to_lowercase, I thought: well, it should be really easy to find a counterexample with a fuzzer. I messed up the setup at first and it didn't find anything, but I have since got it right, so I'll add it for completeness:
cargo new theyrenotequal
cd theyrenotequal
cargo fuzz init
cat >fuzz/fuzz_targets/fuzz_target_1.rs
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &str| {
    if data.to_lowercase().chars().collect::<Vec<_>>()
        != data
            .chars()
            .flat_map(|c| c.to_lowercase())
            .collect::<Vec<_>>()
    {
        panic!("Fuxxed: {}", data)
    }
});
cargo fuzz run fuzz_target_1 -Zbuild-std
That spat out "AΣ#ӮѮ" after 8 million iterations.
To complete @Caesar's answer: in case the behavioral difference doesn't matter, there is still a performance difference.
String::to_lowercase() allocates a new String and fills it with the characters, while char::to_lowercase() produces them on the fly, so the former is expected to be much slower. There is no fundamental reason there couldn't be a version of str::to_lowercase() that returns an iterator and avoids the allocation penalty; it just hasn't been done yet.
I like using partial application because it permits (among other things) splitting a complicated function call into something more readable.
An example of partial application:
fn add(x: i32, y: i32) -> i32 {
    x + y
}

fn main() {
    let add7 = |x| add(7, x);
    println!("{}", add7(35));
}
Is there overhead to this practice?
Here is the kind of thing I like to do (from real code):
fn foo(n: u32, mut things: Vec<Things>) {
    let create_new_multiplier = |thing| ThingMultiplier::new(thing, n); // ThingMultiplier is an Iterator
    let new_things = things.clone().into_iter().flat_map(create_new_multiplier);
    things.extend(new_things);
}
This is purely stylistic: I do not like to nest things too deeply.
There should not be a performance difference between defining the closure before it's used versus defining and using it directly. There is a type-system difference, though: the compiler doesn't fully know how to infer types in a closure that isn't immediately called.
In code:
let create_new_multiplier = |thing| ThingMultiplier::new(thing, n);
things.clone().into_iter().flat_map(create_new_multiplier)
will be the exact same as
things.clone().into_iter().flat_map(|thing| {
    ThingMultiplier::new(thing, n)
})
In general, there should not be a performance cost for using closures. This is what Rust means by "zero cost abstraction": the programmer could not have written it better themselves.
The compiler converts a closure into implementations of the Fn* traits on an anonymous struct. At that point, all the normal compiler optimizations kick in. Because of techniques like monomorphization, it may even be faster. This does mean that you need to do normal profiling to see if they are a bottleneck.
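As a rough illustration (a hand-written simplification, not the exact code the compiler emits), a closure such as |x| add(n, x) is conceptually an anonymous struct that stores its captures and exposes the body as a call:

fn add(x: i32, y: i32) -> i32 {
    x + y
}

// Hand-rolled stand-in for the anonymous closure type of `|x| add(n, x)`;
// the real compiler-generated type implements the `Fn` traits rather than
// an inherent `call` method.
struct AddN {
    n: i32,
}

impl AddN {
    fn call(&self, x: i32) -> i32 {
        add(self.n, x)
    }
}

fn main() {
    let add7 = AddN { n: 7 };
    println!("{}", add7.call(35)); // prints 42, same as the closure version
}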
In your particular example, yes, extend can get inlined as a loop, containing another loop for the flat_map which in turn just puts ThingMultiplier instances into the same stack slots holding n and thing.
But you're barking up the wrong efficiency tree here. Instead of wondering whether an allocation of a small struct holding two fields gets optimized away you should rather wonder how efficient that clone is, especially for large inputs.
When doing integer arithmetic with checks for overflows, calculations often need to compose several arithmetic operations. A straightforward way of chaining checked arithmetic in Rust uses checked_* methods and Option chaining:
fn calculate_size(elem_size: usize,
                  length: usize,
                  offset: usize)
                  -> Option<usize> {
    elem_size.checked_mul(length)
        .and_then(|acc| acc.checked_add(offset))
}
However, this tells the compiler to generate a branch for each elementary operation. I have encountered a more unrolled approach using the overflowing_* methods:
fn calculate_size(elem_size: usize,
                  length: usize,
                  offset: usize)
                  -> Option<usize> {
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    if oflo1 | oflo2 {
        None
    } else {
        Some(acc)
    }
}
Continuing the computation regardless of overflows and aggregating the overflow flags with bitwise OR ensures that at most one branch is taken in the entire evaluation (provided that the implementations of overflowing_* generate branchless code). This optimization-friendly approach is more cumbersome and requires some caution in dealing with intermediate values.
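For instance, extending the same pattern to a hypothetical three-operation expression (an extra padding term, purely for illustration) still needs only one branch at the very end:

fn calculate_padded_size(elem_size: usize,
                         length: usize,
                         offset: usize,
                         padding: usize)
                         -> Option<usize> {
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    let (acc, oflo3) = acc.overflowing_add(padding);
    // All overflow flags are ORed together, so only one branch is taken at the end.
    if oflo1 | oflo2 | oflo3 {
        None
    } else {
        Some(acc)
    }
}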
Does anyone have experience with how the Rust compiler optimizes either of the patterns above on various CPU architectures, to tell whether the explicit unrolling is worthwhile, especially for more complex expressions?
You can use the playground to check how LLVM optimizes things: just click on "LLVM IR" or "ASM" instead of "Run". Stick a #[inline(never)] on the function you wish to check, and make sure to pass it run-time arguments so as to avoid constant folding. As in here:
use std::env;

#[inline(never)]
fn calculate_size(elem_size: usize,
                  length: usize,
                  offset: usize)
                  -> Option<usize> {
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    if oflo1 | oflo2 {
        None
    } else {
        Some(acc)
    }
}

fn main() {
    // Skip the program name so only the actual arguments are parsed.
    let vec: Vec<usize> = env::args().skip(1).map(|s| s.parse().unwrap()).collect();
    let result = calculate_size(vec[0], vec[1], vec[2]);
    println!("{:?}", result);
}
The answer you'll get, however, is that the overflow intrinsics in Rust and LLVM have been coded for convenience and not performance, unfortunately. This means that while the explicit unrolling optimizes well, counting on LLVM to optimize the checked code is not realistic for now.
Normally this is not an issue; but for a performance hotspot, you may want to unroll manually.
Note: this lack of performance is also the reason that overflow checking is disabled by default in Release mode.