How to properly encode a VarInt? - rust

I have a YAML file with test cases for encoding and decoding elements, which are guaranteed to be correct. The left-hand side represents the expected encoded bytes, and the right-hand side contains the original number. For VarInts, the test cases are:
examples:
"\0": 0
"\u0001": 1
"\u000A": 10
"\u00c8\u0001": 200
"\u00e8\u0007": 1000
"\u00a9\u0046": 9001
"\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u0001": -1
The first three examples work correctly when interpreted as unsigned numbers. However, the fourth example (200) and the subsequent ones don't yield the correct results.
Specifically for 200, I have the following minimally reproducible example:
use bytes::BufMut;
use integer_encoding::VarIntWriter;

fn main() {
    let value = "\u{00c8}\u{0001}";
    // "È\u{1}"
    println!("Expected encoded number as a string: {:?}", value);
    let buf: &[u8] = value.as_bytes();
    // [195, 136, 1]
    println!("Expected encoded number as a byte array: {:?}", buf);

    let num_as_i32: i32 = 200;
    let mut wr = vec![].writer();
    wr.write_varint(num_as_i32).unwrap();
    let encoded_result_as_i32: Vec<u8> = wr.into_inner();
    // [144, 3] (integer_encoding zigzag-encodes signed integers)
    println!("Encoded result as i32: {:?}", encoded_result_as_i32);

    let num_as_u32: u32 = 200;
    let mut wr2 = vec![].writer();
    wr2.write_varint(num_as_u32).unwrap();
    let encoded_result_as_u32: Vec<u8> = wr2.into_inner();
    // [200, 1]
    println!("Encoded result as u32: {:?}", encoded_result_as_u32);
}
The result [200, 1] seems to make sense as it matches the hex values for "\u00c8\u0001", but it doesn't match the supposedly expected value of [195, 136, 1].
The last example (-1) should be encoded as 1 according to the protobuf VarInt reference, so there seems to be something I'm missing about that as well.
Is there something wrong with the string interpretation of the expected encoded values? Or is something missing in the encoding process?

The issue here is that "\u00c8\u0001" needs to be read as the byte array [200, 1], not as a UTF-8 string, which gets incorrectly re-encoded as [195, 136, 1].
The encoding itself is correct; the solution is either to read the expected values as raw bytes without converting them to UTF-8, or, if they have already been converted to UTF-8, to map them back to the original byte array.
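As a sketch of the second option (the helper name here is my own, not from the original fix): since each \u escape below 0x100 parses to a char whose scalar value is the intended byte, you can map the chars back to bytes:

```rust
// Hypothetical helper: recover the intended raw bytes from a string whose
// chars are really byte values 0-255 (returns None if any char is >= 0x100).
fn chars_to_bytes(s: &str) -> Option<Vec<u8>> {
    s.chars().map(|c| u8::try_from(u32::from(c)).ok()).collect()
}

fn main() {
    let value = "\u{00c8}\u{0001}";
    // as_bytes() would give the UTF-8 form [195, 136, 1];
    // mapping char-by-char recovers the intended [200, 1]
    let bytes = chars_to_bytes(value).unwrap();
    assert_eq!(bytes, [200, 1]);
}
```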
This would be more adequate as a separate question, so I'm closing this one.
Kudos to #cafce25 for the help!
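On the -1 test case from the question: the test file uses plain two's-complement varints, where -1 becomes ten 0xFF…0x01 bytes, while protobuf's sint types use zigzag encoding, where -1 maps to 1. A minimal sketch of the zigzag mapping, assuming 64-bit width (formula from the protobuf encoding docs):

```rust
// Zigzag maps signed to unsigned so small negative numbers stay short:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
fn zigzag(n: i64) -> u64 {
    ((n << 1) ^ (n >> 63)) as u64
}

fn main() {
    assert_eq!(zigzag(-1), 1);
    // a plain varint instead reinterprets -1 as u64::MAX, which
    // encodes to nine 0xFF bytes and a trailing 0x01, as in the test file
    assert_eq!(-1i64 as u64, u64::MAX);
}
```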
Edit: solution

Related

How to parse 16bit hex from char iterator

Say I have a str s with the value foo004Frtz and I want to use an iterator to parse the string.
Goal is to in the end parse the 004F into a u16.
let it = s.chars().into_iter();
Assuming that it is at the correct position (second o), how do I parse the u16 from the iterator?
I tried:
let x = u16::from_str_radix(hex, 16); // Should be set to 79
but have a hard time constructing hex from the iterator.
EDIT:
Make value a bit more complicated.
Thx to #Chayim I figured out one solution:
let hex = &it.as_str()[..4];
let x = u16::from_str_radix(&hex, 16).unwrap();
Not sure though how efficient as_str() is...
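On the efficiency concern: Chars::as_str simply returns the unconsumed remainder of the underlying slice, so it is O(1) with no allocation. A self-contained version of the solution (the input string is taken from the question):

```rust
fn main() {
    let s = "foo004Frtz";
    let mut it = s.chars();
    // advance the iterator past "foo"
    for _ in 0..3 {
        it.next();
    }
    // as_str() is cheap: it just views the remaining, unconsumed &str
    let hex = &it.as_str()[..4];
    let x = u16::from_str_radix(hex, 16).unwrap();
    assert_eq!(x, 79); // 0x004F
}
```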

Why is the string being converted to bytes in this Rust tutorial?

I am reading the rust tutorial and in this section the tutorial converts a string into a byte array
like so:
fn first_word(s: &String) -> usize {
    let bytes = s.as_bytes();
    for (i, &item) in bytes.iter().enumerate() {
        if item == b' ' {
            return i;
        }
    }
    s.len()
}
They state that this conversion is because we want to find the first instance of the space character so we need to compare to it. My question is why do we need to convert to bytes? What if instead of converting the string to bytes we convert the byte ' ' into a String and compare to that?
Strings in Rust are UTF-8 encoded. You can iterate over chars but that will be a bit slower because Unicode code points are variable length, and the char type is 4 bytes long so you can't fit as many in a cache line.
A space has the same byte representation regardless of whether you are using ASCII or UTF-8 encoding, so this is an easy optimisation. It's also the same amount of code as iterating over chars.
But, probably more importantly, the function in question returns an index for where the character is found. Finding the index by iterating over chars would tell you how many Unicode code points to skip to get to that position, but you'd have to iterate again each time you wanted to use the index, because each preceding code point could be anywhere from 1 to 4 bytes long. An index into bytes is much more straightforward and efficient.
For example, with a byte index:
let words = String::from("Hello there");
let index = first_word(&words); // byte index
// just a slice
let first_word = std::str::from_utf8(&words.as_bytes()[0..index]).unwrap();
Indexing Unicode code points:
let words = String::from("Hello there");
let index = first_word(&words); // code point index
// having to iterate again, and allocate a new String
let first_word: String = words.chars().take(index).collect();
Any method to take a slice here would involve calculating the byte position first.

Rust split vector of bytes by specific bytes

I have a file containing information that I want to load in the application. The file starts with some header lines as strings, followed by multiple entries, each terminated by ';'. Entries exist for different types and therefore vary in length, but all values are separated by ','.
Example:
\Some heading
\Some other heading
I,003f,3f3d00ed,"Some string",00ef,
0032,20f3
;
Y,02d1,0000,0000,"Name of element",
00000007,0,
00000000,0,
;
Y,02d1,0000,0000,"Name of element",30f0,2d0f,02sd,
00000007,0,
00000000,0,
;
I is one type of element
Y is another type of element
What I want to achieve is to bring the elements into different structs to work with. Most of the values are numbers but some are strings.
What I was able to achieve is:
Import the file as Vec<u8>
Put it in a string (can't do that directly, because there may be UTF-8 problems in elements I'm not interested in)
Split it to a Vec<&str> by ';'
Pass the strings to functions depending on their type
Split it to a Vec by '\n'
Split it to a Vec by ','
Reading out the data I need and interpret from the strings (str::from_str_radix for example)
Build the struct and return it
This doesn't seem to be the way to go, since I start with bytes, allocate them as a string, and then allocate again when parsing most of the values into numbers.
So my question is:
Can I split the Vec<u8> into multiple vectors separated by ';' (byte 59), split these further by '\n', and split those further by ','?
I assume it would be more performant to parse the bytes directly into the correct data type. Or is my concern wrong?
Can I split the Vec<u8> into multiple vectors separated by ';' (byte 59), split these further by '\n', and split those further by ','?
Usually that is not going to work if the other bytes may appear in other places, like embedded in the strings.
Then there is also the question of how the strings are encoded, whether there are escape sequences, etc.
I assume it would be more performant to parse the bytes directly into the correct data type. Or is my concern wrong?
Reading the entire file into memory and then performing several copies from one Vec to another Vec and another and so on is going to be slower than a single pass with a state machine of some kind. Not to mention it will make working with files bigger than memory extremely slow or impossible.
I wouldn't worry about performance until you have a working algorithm, in particular if you have to work with an undocumented, non-trivial, third-party format and you are not experienced at reading binary formats.
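To the first question: yes — the standard library's slice::split takes a predicate, so the byte buffer can be split at b';' (and then b'\n' and b','), borrowing sub-slices without copying. A minimal sketch on a shortened, made-up record (note the caveat above: this misfires if a delimiter byte can appear inside a quoted string):

```rust
fn main() {
    let data: &[u8] = b"I,003f,3f3d;Y,02d1,0000;";
    // split records on ';' (byte 59), keeping only non-empty ones
    let records: Vec<&[u8]> = data
        .split(|&b| b == b';')
        .filter(|r| !r.is_empty())
        .collect();
    assert_eq!(records.len(), 2);
    // split one record into fields on ','
    let fields: Vec<&[u8]> = records[0].split(|&b| b == b',').collect();
    assert_eq!(fields[1], b"003f");
    // parse a hex field straight from the borrowed bytes, no owned String
    let n = u16::from_str_radix(std::str::from_utf8(fields[1]).unwrap(), 16).unwrap();
    assert_eq!(n, 0x003f);
}
```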
An answer not exactly for this question, but I got directed here and didn't find a better answer.
The function below takes a byte slice and a separator slice and splits the first by the second.
fn split_bytes<'a>(bs: &'a [u8], pred: &'a [u8]) -> Vec<&'a [u8]> {
    // collect the (start, end) positions of every match
    // (note: overlapping matches are not handled)
    let mut indexes: Vec<(usize, usize)> = Vec::new();
    for (index, el) in bs.windows(pred.len()).enumerate() {
        if el == pred {
            indexes.push((index, index + pred.len()));
        }
    }
    // walk the matches back to front, peeling off the trailing segment each time
    indexes.reverse();
    let mut cur = bs;
    let mut res: Vec<&[u8]> = Vec::new();
    for (start, end) in indexes {
        let (left, right) = cur.split_at(end);
        res.push(right);
        cur = &left[..start];
    }
    // don't drop the segment before the first match
    res.push(cur);
    res.reverse();
    res
}
Here is a demo link in the Rust playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=ff653e2be80d4f73542e26dc37c46f13

[] operator for strings, link with slices for vectors

Why do you have to walk over the string to find the nᵗʰ letter of a string when you do s[n] where s is a string. (According to https://doc.rust-lang.org/book/strings.html)
From what I understood, a string is an array of chars, and a char is 4 bytes (or a 4-byte number). So wouldn't getting the nth letter be similar to doing v[4*n..4*n+4], where v is a vector?
What is the cost of v[i..j] ?
I would assume that the cost of v[i..j] is j-i and so that the cost of s[n] should be 4.
Note: The second edition of The Rust Programming Language has an improved and smooth explanation to Strings in Rust, which you might wish to read as well. The answer below, although still accurate, quotes from the first edition of the book.
I will try to clarify these misconceptions about strings in Rust by quoting from the book (https://doc.rust-lang.org/book/strings.html).
A ‘string’ is a sequence of Unicode scalar values encoded as a stream of UTF-8 bytes. All strings are guaranteed to be a valid encoding of UTF-8 sequences.
With this in mind, plus the fact that code points are encoded in UTF-8 with a variable number of bytes (1 to 4, depending on the character), strings in Rust, whether they are &str or String, are not arrays of characters and cannot be treated as such. It is further explained why under Slicing:
Because strings are valid UTF-8, they do not support indexing:
let s = "hello";
println!("The first letter of s is {}", s[0]); // ERROR!!!
Usually, access to a vector with [] is very fast. But, because each character in a UTF-8 encoded string can be multiple bytes, you have to walk over the string to find the nᵗʰ letter of a string. This is a significantly more expensive operation, and we don’t want to be misleading.
Unlike what was suggested in the question, one cannot do s[n]: although in theory this would allow us to fetch the nth byte in constant time, that byte is not guaranteed to make any sense on its own.
What is the cost of v[i..j] ?
The cost of slicing is actually constant, because it is done at byte-level:
You can get a slice of a string with slicing syntax:
let dog = "hachiko";
let hachi = &dog[0..5];
But note that these are byte offsets, not character offsets. So this will fail at runtime:
let dog = "忠犬ハチ公";
let hachi = &dog[0..2];
with this error:
thread '' panicked at 'index 0 and/or 2 in 忠犬ハチ公 do not lie on character boundary'
Basically, slicing is acceptable and will yield a new view of that string, so no copies are made. However, it should only be used when you are completely sure that the offsets are right in terms of character boundaries.
In order to iterate over each character of a string, you may instead call chars():
let c = s.chars().nth(n);
Even with that in mind, note that handling Unicode characters might not be exactly what you want if you wish to handle character modifiers in UTF-8 (which are scalar values by themselves but should not be treated individually either). Quoting now from the str API:
fn chars(&self) -> Chars
Returns an iterator over the chars of a string slice.
As a string slice consists of valid UTF-8, we can iterate through a string slice by char. This method returns such an iterator.
It's important to remember that char represents a Unicode Scalar Value, and may not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want.
Remember, chars may not match your human intuition about characters:
let y = "y̆";
let mut chars = y.chars();
assert_eq!(Some('y'), chars.next()); // not 'y̆'
assert_eq!(Some('\u{0306}'), chars.next());
assert_eq!(None, chars.next());
The unicode_segmentation crate provides a means to define grapheme cluster boundaries:
extern crate unicode_segmentation;
use unicode_segmentation::UnicodeSegmentation;
let s = "a̐éö̲\r\n";
let g = UnicodeSegmentation::graphemes(s, true).collect::<Vec<&str>>();
let b: &[_] = &["a̐", "é", "ö̲", "\r\n"];
assert_eq!(g, b);
If you do want to treat the string as an array of codepoints (which isn't strictly the same as characters; there are combining marks, emoji with separate skin-tone modifiers, etc.), you can collect it into a Vec:
fn main() {
    let s = "£10 🙃!";
    for (i, c) in s.char_indices() {
        println!("{} {}", i, c);
    }
    let v: Vec<char> = s.chars().collect();
    println!("v[5] = {}", v[5]);
}
Play link
With bonus demonstration of some varying character widths, this outputs:
0 £
2 1
3 0
4
5 🙃
9 !
v[5] = !

How do I reverse a string in 0.9?

How do I reverse a string in Rust 0.9?
According to rosettacode.org this worked in 0.8:
let reversed:~str = "一二三四五六七八九十".rev_iter().collect();
... but I can't get iterators working on strings in 0.9.
Also tried std::str::StrSlice::bytes_rev but I haven't figured out a clean way to convert the result back into a string without the compiler choking.
First of all, iterating over bytes and reversing will break multibyte characters; you want to iterate over chars:
let s = ~"abc";
let s2: ~str = s.chars_rev().collect();
println!("{:?}", s2);
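For reference, the modern (post-1.0) Rust equivalent of this char-wise reversal:

```rust
fn main() {
    let s = "一二三四五六七八九十";
    // reverse by char (Unicode scalar values), not by byte
    let reversed: String = s.chars().rev().collect();
    assert_eq!(reversed, "十九八七六五四三二一");
    println!("{}", reversed);
}
```

Reversing chars can still split grapheme clusters (e.g. combining marks); the unicode_segmentation crate handles those cases.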
