How do I elegantly take first 4 bytes of SHA256 in Rust? Specifically from generic_array produced by:
sha2::Sha256::new().chain(b"blabla_93794926").result();
(without using unsafe Rust)
This should work:
let first_bytes = &(Sha256::new().chain(b"blabla_93794926").result()[0..4]);
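If an integer is more convenient than a slice, the first four bytes can also be converted explicitly. Here is a minimal sketch using a plain 32-byte array as a stand-in for the GenericArray digest (which likewise derefs to a `[u8]` slice); `first_four_as_u32` is a hypothetical helper name and the digest value is made up:

```rust
// Hypothetical helper: interpret the first four bytes of a digest slice
// as a big-endian u32. Works on anything that derefs to `[u8]`,
// including the GenericArray returned by sha2's `result()`.
fn first_four_as_u32(digest: &[u8]) -> u32 {
    u32::from_be_bytes(digest[0..4].try_into().unwrap())
}

fn main() {
    // A made-up 32-byte digest standing in for an actual SHA-256 result.
    let mut digest = [0u8; 32];
    digest[..4].copy_from_slice(&[0xde, 0xad, 0xbe, 0xef]);

    // Borrowing the first four bytes as a slice, as in the answer above:
    let first_bytes: &[u8] = &digest[0..4];
    assert_eq!(first_bytes, &[0xde, 0xad, 0xbe, 0xef]);

    // Converting them to an integer instead:
    assert_eq!(first_four_as_u32(&digest), 0xdead_beef);
}
```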
I've recently been experimenting with different hash functions in Rust. I started off with the fasthash crate, where many algorithms are implemented; e.g., murmur3 is then called as
let hval = murmur3::hash32_with_seed(&tmp_read_buff, 123u32);
This works very fast (e.g., a few seconds for 100,000,000 short inputs). I also stumbled upon FxHash, the algorithm used heavily inside Firefox (at least initially?). I rolled my own version of hashing a byte array with this algorithm as follows:
use rustc_hash::FxHasher;
use std::hash::Hasher;
fn hash_with_fx(read_buff: &[u8]) -> u64 {
    let mut hasher = FxHasher::default();
    for el in read_buff {
        hasher.write_u8(*el);
    }
    return hasher.finish();
}
This works; however, it's about 5x slower. I'd really like to know if I'm missing something obvious here, and how I could achieve speeds similar to or better than fasthash's murmur3. My intuition is that with FxHash the core operation is very simple,
self.hash = self.hash.rotate_left(5).bitxor(i).wrapping_mul(K);
hence it should be one of the fastest.
The FxHasher documentation mentions that:
the speed of the hash function itself is much higher because it works on up to 8 bytes at a time.
But your algorithm completely removes this possibility because you are processing each byte individually. You can make it a lot faster by hashing in chunks of 8 bytes.
fn hash_with_fx(read_buff: &[u8]) -> u64 {
    let mut hasher = FxHasher::default();
    let mut chunks = read_buff.chunks_exact(8);
    for bytes in &mut chunks {
        // unwrap is ok because `chunks_exact` provides the guarantee that
        // the `bytes` slice is exactly the requested length
        let int = u64::from_be_bytes(bytes.try_into().unwrap());
        hasher.write_u64(int);
    }
    for byte in chunks.remainder() {
        hasher.write_u8(*byte);
    }
    hasher.finish()
}
For very small inputs (especially awkward lengths like 7 bytes) it may introduce a little extra overhead compared with your original code, but for larger inputs it ought to be significantly faster.
Bonus material
It should be possible to remove a few extra instructions in the loop by using unwrap_unchecked instead of unwrap. It's completely sound to do so in this case, but it may not be worth introducing unsafe code into your codebase. I would measure the difference before deciding to include unsafe code.
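As a sketch of that comparison, here are checked and unchecked variants side by side, using the standard library's DefaultHasher as a stand-in for FxHasher so the snippet is self-contained (the function names are made up for illustration):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Safe variant: `unwrap` always succeeds because `chunks_exact(8)`
// guarantees each chunk is exactly 8 bytes long.
fn hash_checked(read_buff: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    let mut chunks = read_buff.chunks_exact(8);
    for bytes in &mut chunks {
        hasher.write_u64(u64::from_be_bytes(bytes.try_into().unwrap()));
    }
    for byte in chunks.remainder() {
        hasher.write_u8(*byte);
    }
    hasher.finish()
}

// Unchecked variant: skips the (always-passing) length check.
// Sound only because `chunks_exact` upholds the length invariant.
fn hash_unchecked(read_buff: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    let mut chunks = read_buff.chunks_exact(8);
    for bytes in &mut chunks {
        let arr: [u8; 8] = unsafe { bytes.try_into().unwrap_unchecked() };
        hasher.write_u64(u64::from_be_bytes(arr));
    }
    for byte in chunks.remainder() {
        hasher.write_u8(*byte);
    }
    hasher.finish()
}

fn main() {
    let data = b"some input that is longer than eight bytes";
    assert_eq!(hash_checked(data), hash_unchecked(data));
}
```

Both variants must produce identical hashes; only the generated code for the slice-to-array conversion differs, which is why measuring is the right way to decide whether the `unsafe` is worth it.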
The documentation of copy_to_bytes() says:
Consumes len bytes inside self and returns new instance of Bytes with this data.
This function may be optimized by the underlying type to avoid actual copies. For example, Bytes implementation will do a shallow copy (ref-count increment).
I wonder how copy_to_bytes() actually behaves when applied to a Chain. Does it copy both Bytes instances, or just one? Or does it just increase the ref-count?
use bytes::*;

fn main() {
    let slice1 = &[1, 2, 3];
    let slice2 = &[4, 5, 6];
    let a = Bytes::from_static(slice1);
    let b = Bytes::from_static(slice2);
    let chained = a.chain(b).copy_to_bytes(slice1.len() + slice2.len());
}
I also asked this question as an issue on the tokio-rs repository but got no answer.
They are just treated as if they were appended.
As you quoted from the documentation:
Consumes len bytes inside self and returns new instance of Bytes with this data.
This function may be optimized by the underlying type to avoid actual copies. For example, Bytes implementation will do a shallow copy (ref-count increment).
So, yes, the values are copied into the bytes buffer.
I want to use something like the blake2AsHex function in Rust. That function exists in JavaScript, but I am looking for a corresponding function in Rust. So far, using the Substrate primitive:
pub fn blake2_256(data: &[u8]) -> [u8; 32]
// Do a Blake2 256-bit hash and return result.
I am getting a different value.
When I execute this in console:
util_crypto.blake2AsHex("0x0000000000000000000000000000000000000000000000000000000000000001")
I get the desired value: 0x33e423980c9b37d048bd5fadbd4a2aeb95146922045405accc2f468d0ef96988. However, when I execute this rust code:
let res = hex::encode(&blake2_256("0x0000000000000000000000000000000000000000000000000000000000000001".as_bytes()));
println!("File Hash encoding: {:?}", res);
I get a different value:
47016246ca22488cf19f5e2e274124494d272c69150c3db5f091c9306b6223fc
Hence, how can I implement blake2AsHex in Rust?
Again, you have an issue with data types here.
"0x0000000000000000000000000000000000000000000000000000000000000001".as_bytes()
converts the string's characters to bytes; it does not decode the hexadecimal representation. You need to build the byte array that the string represents, and then it should work.
You are already using hex::encode for bytes to hex string... you should be using hex::decode for hex string to bytes:
https://docs.rs/hex/0.3.1/hex/fn.decode.html
Decodes a hex string into raw bytes.
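To make the fix concrete: stripping the "0x" prefix and decoding the remaining hex digits yields the 32-byte array that should be fed to blake2_256. Below is a stdlib-only sketch of the same decoding step that hex::decode performs, with `decode_hex` as a hypothetical helper name:

```rust
// Minimal stdlib-only sketch: strip an optional "0x" prefix and decode
// pairs of hex digits into bytes. In practice `hex::decode` does the
// same job with proper error reporting.
fn decode_hex(s: &str) -> Option<Vec<u8>> {
    let s = s.strip_prefix("0x").unwrap_or(s);
    if s.len() % 2 != 0 {
        return None; // a valid hex encoding has an even number of digits
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}

fn main() {
    let bytes = decode_hex(
        "0x0000000000000000000000000000000000000000000000000000000000000001",
    )
    .unwrap();
    // These 32 bytes, not the 66-character string, are what should be
    // passed to `blake2_256`.
    assert_eq!(bytes.len(), 32);
    assert_eq!(bytes[31], 1);
}
```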
Rust has FromStr, however as far as I can see this only takes Unicode text input. Is there an equivalent to this for [u8] arrays?
By "parse" I mean take ASCII characters and return an integer, like C's atoi does.
Or do I need to either...
Convert the u8 array to a string first, then call FromStr.
Call out to libc's atoi.
Write an atoi in Rust.
In nearly all cases the first option is reasonable; however, there are cases where files may be very large, with no predefined encoding... or contain mixed binary and text, where it's most straightforward to read integer numbers as bytes.
No, the standard library has no such feature, but it doesn't need one.
As stated in the comments, the raw bytes can be converted to a &str via:
str::from_utf8
str::from_utf8_unchecked
Neither of these perform extra allocation. The first one ensures the bytes are valid UTF-8, the second does not. Everyone should use the checked form until such time as profiling proves that it's a bottleneck, then use the unchecked form once it's proven safe to do so.
If bytes deeper in the data need to be parsed, a slice of the raw bytes can be obtained before conversion:
use std::str;

fn main() {
    let raw_data = b"123132";
    let the_bytes = &raw_data[1..4];
    let the_string = str::from_utf8(the_bytes).expect("not UTF-8");
    let the_number: u64 = the_string.parse().expect("not a number");
    assert_eq!(the_number, 231);
}
As with other code, these lines can be extracted into a function or a trait to allow for reuse. However, once that path is followed, it would be a good idea to look into one of the many great crates aimed at parsing. This is especially true if there's a need to parse binary data in addition to textual data.
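One possible shape for that extraction, sketched here with a hypothetical `parse_bytes` helper:

```rust
use std::str;

// One way to extract the parsing steps into a reusable helper.
// Returns `None` if the bytes are not UTF-8 or do not parse as the
// requested type.
fn parse_bytes<T: str::FromStr>(bytes: &[u8]) -> Option<T> {
    str::from_utf8(bytes).ok()?.parse().ok()
}

fn main() {
    let raw_data = b"123132";
    let n: u64 = parse_bytes(&raw_data[1..4]).unwrap();
    assert_eq!(n, 231);

    // Non-numeric input yields None instead of panicking.
    let bad: Option<u64> = parse_bytes(b"abc");
    assert!(bad.is_none());
}
```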
I do not know of any way in the standard library, but maybe the atoi crate works for you? Full disclosure: I am its author.
use atoi::atoi;
let (number, digits) = atoi::<u32>(b"42 is the answer"); //returns (42,2)
You can check if the second element of the tuple is a zero to see if the slice starts with a digit.
let (number, digits) = atoi::<u32>(b"x"); //returns (0,0)
let (number, digits) = atoi::<u32>(b"0"); //returns (0,1)
I want to perform a very simple task, but I cannot manage to stop the compiler from complaining.
fn transform(s: String) -> String {
    let bytes = s.as_bytes();
    format!("{}/{}", bytes[0..2], bytes[2..4])
}
[u8] does not have a constant size known at compile-time.
Any tips for making this operation work as intended?
Indeed, the size of a [u8] isn't known at compile time. The size of &[u8] however is known at compile time because it's just a pointer plus a usize representing the length of sequence.
format!("{:?}/{:?}", &bytes[0..2], &bytes[2..4])
Rust strings are encoded in UTF-8, so working with strings this way is generally a bad idea, because a single Unicode character may consist of multiple bytes.
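If the goal is an "ab/cd"-style output rather than a debug print of raw bytes, one option is to slice the string itself with `get`, which returns `None` instead of panicking when an index falls inside a multi-byte character. A sketch (the Option-returning signature is my own choice, not the question's):

```rust
// Slice the string itself instead of its raw bytes. `str::get` returns
// `None` when the range is out of bounds or does not land on UTF-8
// character boundaries, rather than panicking.
fn transform(s: &str) -> Option<String> {
    Some(format!("{}/{}", s.get(0..2)?, s.get(2..4)?))
}

fn main() {
    assert_eq!(transform("abcd"), Some(String::from("ab/cd")));
    // 'é' occupies bytes 1..3 here, so the 0..2 range would split it:
    assert_eq!(transform("aébc"), None);
}
```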