Computing u32 hash with FxHasher fast - rust

I've recently been experimenting with different hash functions in Rust. I started off with the fasthash crate, which implements many algorithms; e.g., murmur3 is called as
let hval = murmur3::hash32_with_seed(&tmp_read_buff, 123 as u32);
This is very fast (e.g., a few seconds for 100,000,000 short inputs). I also stumbled upon FxHash, the algorithm used heavily inside Firefox (at least initially?). I rolled my own version of hashing a byte array with this algorithm as follows:
use rustc_hash::FxHasher;
use std::hash::Hasher;

fn hash_with_fx(read_buff: &[u8]) -> u64 {
    let mut hasher = FxHasher::default();
    for el in read_buff {
        hasher.write_u8(*el);
    }
    hasher.finish()
}
This works; however, it's about 5x slower. I'd really like to know whether I'm missing something obvious here, and how I could achieve speeds similar to or better than fasthash's murmur3. My intuition is that with FxHash the core operation is very simple,
self.hash = self.hash.rotate_left(5).bitxor(i).wrapping_mul(K);
hence it should be one of the fastest.

The FxHasher documentation mentions that:
the speed of the hash function itself is much higher because it works on up to 8 bytes at a time.
But your algorithm completely removes this possibility because you are processing each byte individually. You can make it a lot faster by hashing in chunks of 8 bytes.
fn hash_with_fx(read_buff: &[u8]) -> u64 {
    let mut hasher = FxHasher::default();
    let mut chunks = read_buff.chunks_exact(8);
    for bytes in &mut chunks {
        // unwrap is ok because `chunks_exact` provides the guarantee that
        // the `bytes` slice is exactly the requested length
        let int = u64::from_be_bytes(bytes.try_into().unwrap());
        hasher.write_u64(int);
    }
    for byte in chunks.remainder() {
        hasher.write_u8(*byte);
    }
    hasher.finish()
}
For very small inputs (especially lengths like 7 bytes) it may introduce a small extra overhead compared with your original code, but for larger inputs it ought to be significantly faster.
Bonus material
It should be possible to remove a few extra instructions in the loop by using unwrap_unchecked instead of unwrap. It's completely sound to do so in this case, but it may not be worth introducing unsafe code into your codebase. I would measure the difference before deciding to include unsafe code.
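A minimal sketch of that variant, assuming the same function as above and changing only the conversion inside the loop:

    // SAFETY: `chunks_exact(8)` guarantees that `bytes` is exactly 8 bytes
    // long, so the conversion to `[u8; 8]` can never fail.
    let int = u64::from_be_bytes(unsafe { bytes.try_into().unwrap_unchecked() });

(Result::unwrap_unchecked is stable since Rust 1.58.)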

Related

What is the most efficient way of taking a number of integer user inputs and storing it in a Vec<i32>?

I was trying to use Rust for competitive coding and I was wondering what the most efficient way of storing user input in a Vec is. I have come up with a method, but I am afraid that it is slow and redundant.
Here is my code:
use std::io;

fn main() {
    let mut input = String::new();
    io::stdin().read_line(&mut input).expect("cant read line");
    let input: Vec<&str> = input.split(" ").collect();
    let input: Vec<String> = input.iter().map(|x| x.to_string()).collect();
    let input: Vec<i32> = input.iter().map(|x| x.trim().parse().unwrap()).collect();
    println!("{:?}", input);
}
PS: I am new to Rust.
I see these ways of improving the performance of the code:
Although not really relevant for std::io::stdin(), std::io::BufReader can have a great effect when reading e.g. from std::fs::File. Buffer capacity can also matter.
Using a locked stdin: let si = std::io::stdin(); let si = si.lock();
Avoiding allocations by keeping vectors around and using extend instead of collect, if the code reads multiple lines (unlike the sample you posted in the question); see the sketch after this list.
Maybe avoiding temporary vectors altogether and just chaining Iterator operations together, or using a loop like for line in input.split(...) { ... }. This may affect performance either way; you need to experiment to find out.
Avoiding to_string() and just storing references into the input buffer (which can also be used to parse() into i32). Note that this may invite the famous Rust borrowing and lifetime complexity.
Maybe finding some fast SIMD-enhanced string-to-int parser instead of libstd's parse().
Maybe streaming the results to the algorithm instead of collecting everything into a Vec first. This can be beneficial, especially if multiple threads can be used. For performance, you would still likely want to send data in chunks, not one i32 at a time.
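A minimal sketch combining the locked-stdin and buffer-reuse points, assuming multiple lines of integers are read until EOF:

use std::io::{self, BufRead};

fn main() {
    let stdin = io::stdin();
    let mut handle = stdin.lock();
    let mut line = String::new();
    let mut numbers: Vec<i32> = Vec::new();
    loop {
        line.clear();
        // read_line returns Ok(0) at end of input
        if handle.read_line(&mut line).expect("cant read line") == 0 {
            break;
        }
        // extend the existing Vec instead of collecting a fresh one per line
        numbers.extend(line.split_whitespace().map(|x| x.parse::<i32>().unwrap()));
    }
    println!("{:?}", numbers);
}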
Yes, there are some changes you can make that will make your code more precise, simpler, and faster.
Better code:
use std::io;

fn main() {
    let mut input = String::new();
    io::stdin().read_line(&mut input).unwrap();
    let input: Vec<i32> = input.split_whitespace().map(|x| x.parse().unwrap()).collect();
    println!("{:?}", input);
}
Explanation
The input.split_whitespace() call returns an iterator over the elements separated by any kind of whitespace, including line breaks. This saves the time spent splitting on a single space with input.split(" ") and then iterating over everything again to call .trim() on each string slice to remove surrounding whitespace.
(You can also check out input.split_ascii_whitespace() if you want to restrict the split to ASCII whitespace.)
There was no need for the line input.iter().map(|x| x.to_string()).collect(), since you can also call .trim() on a string slice.
This saves time both at runtime and while coding, since .collect() is only called once and there is just one iteration.

How can I use Rayon to split a big range into chunks of ranges and have each thread find within a chunk?

I am making a program that brute forces a password by parallelization. At the moment the password to crack is already available as plain text, I'm just attempting to brute force it anyway.
I have a function called generate_char_array() which, based on an integer seed, converts base and returns a u8 slice of characters to try and check. This goes through the alphabet first for 1 character strings, then 2, etc.
let found_string_index = (0..1e12 as u64).into_par_iter().find_any(|i| {
    let mut array = [0u8; 20];
    let bytes = generate_char_array(*i, &mut array);
    &password_bytes == &bytes
});
With the found string index (or seed integer rather), I can generate the found string.
The problem is that Rayon parallelizes this by splitting the arbitrarily large integer range into thread_count-large slices (e.g., for 4 threads, 0..2.5e11, 2.5e11..5e11, etc.). This is not good, because the end of the range corresponds to arbitrarily long passwords (10+ characters, I don't know), whereas most passwords (including the fixed "zzzzz" I tend to try) are much shorter. As a result, the first thread does all the work, and the rest just waste time testing passwords that are far too long and synchronizing; performance actually ends up worse than single-threaded.
How could I instead split the arbitrarily big range (it doesn't actually have to have an end) into chunks of ranges and have each thread search within a chunk? That would make the workers in different threads actually useful.
This goes through the alphabet first for 1 character strings, then 2
You wish to impose some sequencing on your data processing, but the whole point of Rayon is to go in parallel.
Instead, use regular iterators to sequentially go up in length and then use parallel iterators inside a specific length to quickly process all of the values of that length.
Since you haven't provided enough code for a runnable example, I've made this rough approximation to show the general shape of such a solution:
extern crate rayon;

use rayon::iter::{IntoParallelRefIterator, ParallelIterator};
use std::ops::RangeInclusive;

type Seed = u8;

const LENGTHS: RangeInclusive<usize> = 1..=3;
const SEEDS: RangeInclusive<Seed> = 0..=std::u8::MAX;

fn find<F>(test_password: F) -> Option<(usize, Seed)>
where
    F: Fn(usize, Seed) -> bool + Sync,
{
    // Rayon doesn't support RangeInclusive yet
    let seeds: Vec<_> = SEEDS.collect();

    // Step 1-by-1 through the lengths, sequentially
    LENGTHS
        .flat_map(|length| {
            // In parallel, investigate every value in this length
            // This doesn't do that, but it shows how the parallelization
            // would be introduced
            seeds
                .par_iter()
                .find_any(|&&seed| test_password(length, seed))
                .map(|&seed| (length, seed))
        })
        .next()
}

fn main() {
    let pass = find(|l, s| {
        println!("{}, {}", l, s);
        // Actually generate and check the password based on the search criteria
        l == 3 && s == 250
    });

    println!("Found password length and seed: {:?}", pass);
}
This can "waste" a little time at the end of each length as the parallel threads spin down one-by-one before spinning back up for the next length, but that seems unlikely to be a primary concern.
This is a version of what I suggested in my comment.
The main loop is parallel and is only over the first byte of each attempt. For each first byte, do the full brute force search for the remainder.
// inclusive range: cover all 256 possible first bytes
let matched_bytes = (0..=0xFFu8).into_par_iter().filter_map(|n| {
    let mut array = [0u8; 8];
    // the first byte is always the same in this run
    array[0] = n;
    // The highest byte is 0 because it's provided by the outer loop
    (0..=0x0FFFFFFFFFFFFFFF as u64).into_iter().filter_map(|i| {
        // pass a slice so that the first byte is not affected
        generate_char_array(i, &mut array[1..8]);
        if &password_bytes[..] == &array[0..password_bytes.len()] {
            Some(array.clone())
        } else {
            None
        }
    }).next()
}).find_any(|_| true);

println!("found = {:?}", matched_bytes);
Also, even for a brute-force method, this is probably still highly inefficient!
If Rayon splits the slices as you described, then apply simple math to balance the password lengths:
let found_string_index = (0..max_val as u64).into_par_iter().find_any(|i| {
    let mut array = [0u8; 20];
    let v = i / span + (i % span) * num_cpu;
    let bytes = generate_char_array(v, &mut array);
    &password_bytes == &bytes
});
The span value depends on the number of CPUs (the number of threads used by Rayon), in your case:
let num_cpu = 4;
let span = 2.5e11 as u64;
let max_val = span * num_cpu;
Note the performance of this approach is highly dependent on how Rayon performs the split of the sequence on parallel threads. Verify that it works as you reported in the question.

Least memory usage and best performance

I need to store a billion "appearances" and I am looking for the most efficient way to store these with respect to both memory usage and performance. What are, for example, the differences in those respects for a1, a2, a3 in:
struct Appearance<'a> {
    identity: &'a u64,
    role: &'a str,
}

struct AnotherAppearance<'a>((&'a u64, &'a str));

fn main() {
    let thing = 42;
    let hair_color = "hair color";

    let a1 = Appearance { identity: &thing, role: &hair_color };
    let a2 = AnotherAppearance((&thing, &hair_color));
    let a3 = (&thing, &hair_color);
}
Are there better ways to work with such a structure? Also, is there a way to get detailed information about a1, a2, a3 so that I could see how they are represented in memory for myself?
First, as Ijedrz noted, all your proposed alternatives have the same size. In fact, from the compiler's point of view, they're all identical.
If you're after smaller memory sizes, you might be better off using something like:
struct Appearance {
    identity: u32,
    role: InternedString,
}
First of all, u32 has 4 billion distinct values, so you definitely don't need a u64 for a billion records. Aside from that, a &u64 is going to be the same size as a u64 on a 64-bit machine, so there's not much point in using it. That a u32 is half the size is a bonus.
Beyond that, &str seems incredibly wasteful: it takes two pointers for data which, I assume, is unlikely to change much. If there are many more Appearances than roles, your best bet is to intern the strings and reduce the field to a pointer (or even better, a u32 ID that indirects through another table). There is no interned string type in the standard library, but interners are not that hard to implement if you can't find a suitable crate. Such a structure (assuming InternedString is a u32 ID) would be 8 bytes versus your 24 bytes.
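For illustration, a minimal sketch of such an interner (the names Interner and InternedString are ours, not a standard API):

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct InternedString(u32);

#[derive(Default)]
struct Interner {
    ids: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    // Returns the existing ID for `s`, or allocates a new one
    fn intern(&mut self, s: &str) -> InternedString {
        if let Some(&id) = self.ids.get(s) {
            return InternedString(id);
        }
        let id = self.strings.len() as u32;
        self.ids.insert(s.to_string(), id);
        self.strings.push(s.to_string());
        InternedString(id)
    }

    // Looks the string back up by ID
    fn resolve(&self, id: InternedString) -> &str {
        &self.strings[id.0 as usize]
    }
}

Comparing two InternedStrings is then a single u32 comparison, which also addresses the performance point below.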
If performance is what you're after, that depends on how you use the structures. That said, &u64 is slower than u64, so changing that will probably help. As for the string, it depends on how you use it. If you mostly do comparisons, an interned string will be faster because you can compare those with a single comparison; comparing regular strings can be much slower since you have to actually look at the contents.
All three variants seem to have the same size:
use std::mem::size_of;
println!("a1: {}", size_of::<Appearance>()); // a1: 24
println!("a2: {}", size_of::<AnotherAppearance>()); // a2: 24
println!("a3: {}", size_of::<(&u64, &str)>()); // a3: 24
So I would just use the one that is most descriptive, i.e. Appearance.

How to compare strings in constant time?

How does one safely compare two strings with bounded length in such a way that each comparison takes the same time? Hashing unfortunately has a timing attack vulnerability.
Is there any way to compare two strings without hashing in a way that is not vulnerable to timing-attacks?
TL;DR: Use assembly.
Constant Time code is really hard to pull off. To be truly constant time you need:
a constant time algorithm,
a constant time implementation of said algorithm.
What does "constant time algorithm" mean?
The example of string comparison is great. Most of the time, you want the comparison to take as little time as possible, which means bailing out at the first difference:
fn simple_compare(a: &str, b: &str) -> bool {
    if a.len() != b.len() { return false; }

    for (a, b) in a.bytes().zip(b.bytes()) {
        if a != b { return false; }
    }

    true
}
The constant-time version of the algorithm, however, should take the same time regardless of the input:
the input should always have the same size,
the time taken to compute the result should be identical no matter where the difference is located (if any).
The algorithm Lukas gave is almost right:
/// Prerequisite: a.len() == b.len()
fn ct_compare(a: &str, b: &str) -> bool {
    debug_assert!(a.len() == b.len());

    a.bytes().zip(b.bytes())
        .fold(0, |acc, (a, b)| acc | (a ^ b)) == 0
}
What does "constant time implementation" mean?
Even if the algorithm is constant time, the implementation may not be.
If the exact same sequence of CPU instructions is not used, then on some architectures one instruction could be faster while another is slower, and the implementation would lose its constant-time property.
If the algorithm uses table look-ups, then there could be more or fewer cache misses.
Can you write a constant time implementation of string comparison in Rust?
No.
The Rust language could potentially be suited to the task, however its toolchain is not:
the LLVM optimizer will wreak havoc with your algorithm, short-circuiting it, eliminating unnecessary reads, now or in the future,
the LLVM backends will wreak havoc with your implementation, picking different instructions.
The long and short of it is that, today, the only way to get a constant-time implementation from Rust is to write said implementation in assembly.
Writing a timing-attack-safe string comparison algorithm yourself is pretty easy in theory. There are many resources online on how to do it in other languages. The important part is to trick the optimizer into not optimizing your code in a way you don't want. Here is one example Rust implementation which uses the algorithm described here:
fn ct_compare(a: &str, b: &str) -> bool {
    if a.len() != b.len() {
        return false;
    }

    a.bytes().zip(b.bytes())
        .fold(0, |acc, (a, b)| acc | (a ^ b)) == 0
}
Of course, this algorithm can be easily generalized to everything that is AsRef<[u8]>. This is left as an exercise to the reader ;-)
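For instance, a minimal sketch of such a generalization (our code, not part of the original answer) could look like:

fn ct_compare<A: AsRef<[u8]>, B: AsRef<[u8]>>(a: A, b: B) -> bool {
    let (a, b) = (a.as_ref(), b.as_ref());
    if a.len() != b.len() {
        return false;
    }

    // The same fold as above, but over byte slices instead of str
    a.iter().zip(b).fold(0, |acc, (a, b)| acc | (a ^ b)) == 0
}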
It looks like there is a crate already offering these kinds of comparisons: consistenttime. I haven't tested it, but the documentation looks quite good.
For those looking for a crate providing such an implementation, you can use rust-crypto, which provides the function fixed_time_eq.
The implementation is very similar to Lukas Kalbertodt's.
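A usage sketch, assuming the function lives at crypto::util::fixed_time_eq as it did in the rust-crypto crate at the time (check the crate documentation before relying on this):

extern crate crypto;

use crypto::util::fixed_time_eq;

fn main() {
    let a = b"secret-token";
    let b = b"secret-token";
    // The comparison time depends only on the length, not the contents
    println!("{}", fixed_time_eq(a, b)); // true
}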

What happens if I call Vec::from_raw_parts with a smaller capacity than the pointer actually has?

I have a vector of u8 that I want to interpret as a vector of u32. It is assumed that the bytes are in the right order. I don't want to allocate new memory and copy bytes after casting. I got the following to work:
use std::mem;

fn reinterpret(mut v: Vec<u8>) -> Option<Vec<u32>> {
    let v_len = v.len();
    v.shrink_to_fit();
    if v_len % 4 != 0 {
        None
    } else {
        let v_cap = v.capacity();
        let v_ptr = v.as_mut_ptr();
        println!("{:?}|{:?}|{:?}", v_len, v_cap, v_ptr);
        let v_reinterpret = unsafe { Vec::from_raw_parts(v_ptr as *mut u32, v_len / 4, v_cap / 4) };
        println!("{:?}|{:?}|{:?}",
                 v_reinterpret.len(),
                 v_reinterpret.capacity(),
                 v_reinterpret.as_ptr());
        println!("{:?}", v_reinterpret);
        println!("{:?}", v); // v is still alive, but is same as rebuilt
        mem::forget(v);
        Some(v_reinterpret)
    }
}

fn main() {
    let v: Vec<u8> = vec![1, 1, 1, 1, 1, 1, 1, 1];
    let test = reinterpret(v);
    println!("{:?}", test);
}
However, there's an obvious problem here. From the shrink_to_fit documentation:
It will drop down as close as possible to the length but the allocator may still inform the vector that there is space for a few more elements.
Does this mean that my capacity may still not be a multiple of the size of u32 after calling shrink_to_fit? If in from_raw_parts I set capacity to v_len/4 with v.capacity() not an exact multiple of 4, do I leak those 1-3 bytes, or will they go back into the memory pool because of mem::forget on v?
Is there any other problem I am overlooking here?
I think moving v into reinterpret guarantees that it's not accessible from that point on, so there's only one owner from the mem::forget(v) call onwards.
This is an old question, and it looks like it has a working solution in the comments. I've just written up what exactly goes wrong here, and some solutions that one might create/use in today's Rust.
This is undefined behavior
Vec::from_raw_parts is an unsafe function, and thus you must satisfy its invariants, or you invoke undefined behavior.
Quoting from the documentation for Vec::from_raw_parts:
ptr needs to have been previously allocated via String/Vec (at least, it's highly likely to be incorrect if it wasn't).
T needs to have the same size and alignment as what ptr was allocated with. (T having a less strict alignment is not sufficient; the alignment really needs to be equal to satisfy the dealloc requirement that memory must be allocated and deallocated with the same layout.)
length needs to be less than or equal to capacity.
capacity needs to be the capacity that the pointer was allocated with.
So, to answer your question, if capacity is not equal to the capacity of the original vec, then you've broken this invariant. This gives you undefined behavior.
Note that the requirement isn't on size_of::<T>() * capacity either, though, which brings us to the next topic.
Is there any other problem I am overlooking here?
Three things.
First, the function as written disregards another requirement of from_raw_parts: specifically, T must have the same size and alignment as the original T. u32 is four times as big as u8, so this again breaks the requirement. Even if capacity * size remains the same, neither size nor capacity does on its own. This function will never be sound as implemented.
Second, even if all of the above were valid, you've also ignored alignment. u32 must be aligned to 4-byte boundaries, while a Vec<u8> is only guaranteed to be aligned to a 1-byte boundary.
A comment on the OP mentions:
I think on x86_64, misalignment will have performance penalty
It's worth noting that while this may be true at the machine level, it is not true for Rust. The Rust reference explicitly states: "A value of alignment n must only be stored at an address that is a multiple of n." This is a hard requirement.
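A quick way to see both guarantees involved, using nothing beyond std (the second number printed depends on the allocator):

use std::mem::align_of;

fn main() {
    // u32 values must live at addresses that are multiples of 4
    println!("align_of::<u32>() = {}", align_of::<u32>()); // 4

    // A Vec<u8>'s buffer is only guaranteed 1-byte alignment, so this
    // remainder may be anything from 0 to 3
    let v: Vec<u8> = vec![0; 8];
    println!("buffer address % 4 = {}", v.as_ptr() as usize % 4);
}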
Why the exact type requirement?
Vec::from_raw_parts seems pretty strict, and that's for a reason. In Rust, the allocator API operates not only on an allocation size, but on a Layout, which is the combination of the size, the number of things, and the alignment of individual elements. In C with malloc, all the allocator can rely on is that the size is the same, plus some minimum alignment. In Rust, though, it's allowed to rely on the entire Layout, and to invoke undefined behavior if it doesn't match.
So in order to correctly deallocate the memory, Vec needs to know the exact type that it was allocated with. By converting a Vec<u8> into a Vec<u32>, it no longer knows this information, and so it can no longer properly deallocate the memory.
Alternative - Transforming slices
Vec::from_raw_parts's strictness comes from the fact that it needs to deallocate the memory. If we create a borrowing slice, &[u32] instead, we no longer need to deal with it! There is no capacity when turning a &[u8] into &[u32], so we should be all good, right?
Well, almost. You still have to deal with alignment. Primitives are generally aligned to their size, so a [u8] is only guaranteed to be aligned to 1-byte boundaries, while [u32] must be aligned to a 4-byte boundary.
If you want to chance it, though, and create a [u32] if possible, there's a function for that - <[T]>::align_to:
pub unsafe fn align_to<U>(&self) -> (&[T], &[U], &[T])
This will trim off any misaligned values at the start and end, and then give you a slice in the middle of your new type. It's unsafe, but the only invariant you need to satisfy is that the elements in the middle slice are valid.
It's sound to reinterpret 4 u8 values as a u32 value, so we're good.
Putting it all together, a sound version of the original function would look like this. This operates on borrowed rather than owned values, but given that reinterpreting an owned Vec is instant-undefined-behavior in any case, I think it's safe to say this is the closest sound function:
fn reinterpret(v: &[u8]) -> Option<&[u32]> {
    let (trimmed_front, u32s, trimmed_back) = unsafe { v.align_to::<u32>() };
    if trimmed_front.is_empty() && trimmed_back.is_empty() {
        Some(u32s)
    } else {
        // either alignment % 4 != 0 or len % 4 != 0, so we can't do this op
        None
    }
}

fn main() {
    let v: Vec<u8> = vec![1, 1, 1, 1, 1, 1, 1, 1];
    let test = reinterpret(&v);
    println!("{:?}", test);
}
As a note, this could also be done with std::slice::from_raw_parts rather than align_to. However, that requires manually dealing with the alignment, and all it really gives is more things we need to ensure we're doing right. Well, that and compatibility with older compilers - align_to was introduced in 2018 in Rust 1.30.0, and wouldn't have existed when this question was asked.
Alternative - Copying
If you do need a Vec<u32> for long term data storage, I think the best option is to just allocate new memory. The old memory is allocated for u8s anyways, and wouldn't work.
This can be made fairly simple with some functional programming:
// On editions before 2021, TryInto must be imported explicitly
use std::convert::TryInto;

fn reinterpret(v: &[u8]) -> Option<Vec<u32>> {
    let v_len = v.len();
    if v_len % 4 != 0 {
        None
    } else {
        let result = v
            .chunks_exact(4)
            .map(|chunk: &[u8]| -> u32 {
                let chunk: [u8; 4] = chunk.try_into().unwrap();
                u32::from_ne_bytes(chunk)
            })
            .collect();
        Some(result)
    }
}
First, we use <[T]>::chunks_exact to iterate over chunks of 4 u8s. Next, try_into to convert from &[u8] to [u8; 4]. The &[u8] is guaranteed to be length 4, so this never fails.
We use u32::from_ne_bytes to convert the bytes into a u32 using native endianness. If interacting with a network protocol, or on-disk serialization, then using from_be_bytes or from_le_bytes may be preferable. And finally, we collect to turn our result back into a Vec<u32>.
As a last note, a truly general solution might use both of these techniques. If we change the return type to Cow<'_, [u32]>, we could return aligned, borrowed data if it works, and allocate a new array if it doesn't! Not quite the best of both worlds, but close.
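A minimal sketch of that combination, reusing the logic from the two alternatives above (the name reinterpret_cow is ours):

use std::borrow::Cow;
use std::convert::TryInto;

fn reinterpret_cow(v: &[u8]) -> Option<Cow<'_, [u32]>> {
    let (front, u32s, back) = unsafe { v.align_to::<u32>() };
    if front.is_empty() && back.is_empty() {
        // Aligned and an exact multiple of 4 bytes: borrow in place
        Some(Cow::Borrowed(u32s))
    } else if v.len() % 4 == 0 {
        // Right length but misaligned: fall back to copying
        Some(Cow::Owned(
            v.chunks_exact(4)
                .map(|chunk| u32::from_ne_bytes(chunk.try_into().unwrap()))
                .collect(),
        ))
    } else {
        None
    }
}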
