I am working with big numbers in Rust, my inputs come in as hex strings and I am using a hash function which has done all the heavy lifting with large numbers, all I need to do is to provide it with decimal string versions of the numbers (not hex).
What would be the best practice to convert hex strings to their equivalent decimal representation, whilst staying within strings in order not to cause any overflow issues?
I don't think implementing this by hand is a trivial problem. It will involve a lot of error-prone math on strings and it will be quite slow in the end. String based math in general is quite slow.
There are excellent big integer libraries out there, though. For example num:
use num::{BigInt, Num};
fn convert_hex_to_dec(hex_str: &str) -> String {
BigInt::from_str_radix(hex_str, 16).unwrap().to_string()
}
fn main() {
let hex_str = "978d54b635d7a829d5e3d9bee5a56018ba5e01c10";
let dec_str = convert_hex_to_dec(hex_str);
println!("{}", dec_str);
}
13843350254437927549667668733810160916738275810320
If you planned to do your entire math with strings, I'd advice using BigInts instead.
Related
Just had this error pop up while messing around with some graphics for a terminal interface...
thread 'main' panicked at 'byte index 2 is not a char boundary; it is inside '░' (bytes 1..4) of ░▒▓█', src/main.rs:38:6
Can I not use these characters, or do I need to work some magic to support what I thought were default ASCII characters?
(Here's the related code for those wondering.)
// Example call with the same parameters that led to this issue.
charlist(" ░▒▓█".to_string(), 0.66);
// Returns the n-th character in a string.
// (Where N is a float value from 0 to 1,
// 0 being the start of the string and 1 the end.)
fn charlist<'a>(chars: &'a String, amount: f64) -> &'a str {
let chl: f64 = chars.chars().count() as f64; // Length of the string
let chpos = -((amount*chl)%chl) as i32; // Scalar converted to integer position in the string
&chars[chpos as usize..chpos as usize+1] // Slice the single requested character
}
There are couple misconceptions you seem to have. So let me address them in order.
░, ▒, ▓ and █ are not ASCII characters! They are unicode code points. You can determine this with following simple experiment.
fn main() {
let slice = " ░▒▓█";
for c in slice.chars() {
println!("{}, {}", c, c.len_utf8());
}
}
This code has output:
, 1
░, 3
▒, 3
▓, 3
█, 3
As you can see this "boxy" characters have a length of 3 bytes each! Rust uses utf-8 encoding for all of it's strings. This leads to another misconception.
I this line &chars[chpos as usize..chpos as usize+1] you are trying to get a slice of one byte in length. String slices in rust are indexed with bytes. But you tried to slice in the middle of a character (it has length of 3 bytes). In general characters in utf-8 encoding can be from one to four bytes in length. To get char's length in bytes you can use method len_utf8.
You can get an iterator of characters in a string slice using method chars. Then getting n-th character is as easy as using iterators method nth So the following is true:
assert_eq!(" ░▒▓█".chars().nth(3).unwrap(), '▒');
If you want to have also indices of this chars you can use method char_indices.
Using f64 values to represent nth character is odd and I would encourage you rethink if you really want to do this. But if you do you have two options.
You must remember that since characters have a variable length, string's slice method len doesn't return number of characters, but slice's length in bytes. To know how many characters are in the string you have no other option than iterating over it. So if you for example want to have a middle character you must first know how many there are. I can think of two ways you can do this.
You can either collect characters for Vec<char> (or something similar). Then you will know how many characters are there and can in O(1) index nth one. However this will result in additional memory allocation.
You can fist count how many characters there are with slice.chars().len(). Then calculate position of the nth one and get it by again iterating over chars and getting the nth one (as I showed above). This won't result in any additional memory allocation, but it will have complexity of O(2n), since you will have to iterate over whole string twice.
Which one you pick is up to you. You will have to make a compromise.
This isn't really a correctness problem, but prefer using &str over &String in the arguments of functions (as it will provide more flexibility to your callers). And you don't have to specify lifetime if you have only one reference in the arguments and the other one is in the returned type. Rust will infer that they have to have the same lifetime.
This isn't the exact use case, but it is basically what I am trying to do:
let mut username = "John_Smith";
println!("original username: {}",username);
username.set_char_at(4,'.'); // <------------- The part I don't know how to do
println!("new username: {}",username);
I can't figure out how to do this in constant time and using no additional space. I know I could use "replace" but replace is O(n). I could make a vector of the characters but that would require additional space.
I think you could create another variable that is a pointer using something like as_mut_slice, but this is deemed unsafe. Is there a safe way to replace a character in a string in constant time and space?
As of Rust 1.27 you can now use String::replace_range:
let mut username = String::from("John_Smith");
println!("original username: {}", username); // John_Smith
username.replace_range(4..5, ".");
println!("new username: {}", username); // John.Smith
(playground)
replace_range won't work with &mut str. If the size of the range and the size of the replacement string aren't the same, it has to be able to resize the underlying String, so &mut String is required. But in the case you ask about (replacing a single-byte character with another single-byte character) its memory usage and time complexity are both O(1).
There is a similar method on Vec, Vec::splice. The primary difference between them is that splice returns an iterator that yields the removed items.
In general ? For any pair of characters ? It's impossible.
A string is not an array. It may be implemented as an array, in some limited contexts.
Rust supports Unicode, which brings some challenges:
a Unicode code point might is an integral between 0 and 224
a grapheme may be composed of multiple Unicode code points
In order to represent this, a Rust string is (for now) a UTF-8 bytes sequence:
a single Unicode code point might be represented by 1 to 4 bytes
a grapheme might be represented by 1 or more bytes (no upper limit)
and therefore, the very notion of "replacing character i" brings a few challenges:
the position of character i is between the index i and the end of the string, it requires reading the string from the beginning to know exactly where though, which is O(N)
switching the i-th character in-place for another requires that both characters take up exactly the same amount of bytes
In general ? It's impossible.
In a particular and very specific case where the byte index is known and the byte encoding is known coincide length-wise, it is doable by directly modifying the byte sequence return by as_mut_bytes which is duly marked unsafe since you may inadvertently corrupt the string (remember, this bytes sequence must be a UTF-8 sequence).
If you want to handle only ASCII there is separate type for that:
use std::ascii::{AsciiCast, OwnedAsciiCast};
fn main() {
let mut ascii = "ascii string".to_string().into_ascii();
*ascii.get_mut(6) = 'S'.to_ascii();
println!("result = {}", ascii);
}
There are some missing pieces (like into_ascii for &str) but it does what you want.
Current implementaion of to_/into_ascii fails if input string is invalid ascii. There is to_ascii_opt (old naming of methods that might fail) but will probably be renamed to to_ascii in the future (and failing method removed or renamed).
To compare two Strings, ignoring case, it looks like I first need to convert to a lower case version of the string:
let a_lower = a.to_lowercase();
let b_lower = b.to_lowercase();
a_lower.cmp(&b_lower)
Is there a method that compares strings, ignoring case, without creating the temporary lower case strings, i.e. that iterates over the characters, performs the to-lowercase conversion there and compares the result?
There is no built-in method, but you can write one to do exactly as you described, assuming you only care about ASCII input.
use itertools::{EitherOrBoth::*, Itertools as _}; // 0.9.0
use std::cmp::Ordering;
fn cmp_ignore_case_ascii(a: &str, b: &str) -> Ordering {
a.bytes()
.zip_longest(b.bytes())
.map(|ab| match ab {
Left(_) => Ordering::Greater,
Right(_) => Ordering::Less,
Both(a, b) => a.to_ascii_lowercase().cmp(&b.to_ascii_lowercase()),
})
.find(|&ordering| ordering != Ordering::Equal)
.unwrap_or(Ordering::Equal)
}
As some comments below have pointed out, case-insensitive comparison is not going to work properly for UTF-8, without operating on the whole string, and even then there are multiple representations of some case conversions, which could give unexpected results.
With those caveats, the following will work for a lot of extra cases compared with the ASCII version above (e.g. most accented Latin characters) and may be satisfactory, depending on your requirements:
fn cmp_ignore_case_utf8(a: &str, b: &str) -> Ordering {
a.chars()
.flat_map(char::to_lowercase)
.zip_longest(b.chars().flat_map(char::to_lowercase))
.map(|ab| match ab {
Left(_) => Ordering::Greater,
Right(_) => Ordering::Less,
Both(a, b) => a.cmp(&b),
})
.find(|&ordering| ordering != Ordering::Equal)
.unwrap_or(Ordering::Equal)
}
If you are only working with ASCII, you can use eq_ignore_ascii_case:
assert!("Ferris".eq_ignore_ascii_case("FERRIS"));
UNICODE
The best way for supporting UNICODE is using to_lowercase() or to_uppercase().
This is because UNICODE has many caveats and these functions handles most situations. There are some locale specific strings not handled correctly.
let left = "first".to_string();
let right = "FiRsT".to_string();
assert!(left.to_lowercase() == right.to_lowercase());
Efficiency
It is possible to iterate and return on first non-equal character, so in essence you only allocate one character at a time. However iterating using chars function does not account for all situations UNICODE can throw at us.
See the answer by Peter Hall for details on this.
ASCII
Most efficient if only using ASCII is to use eq_ignore_ascii_case (as per Ibraheem Ahmed's answer). This is does not allocate/copy temporaries.
This is only good if your code controls at least one side of the comparison and you are certain that it will only include ASCII.
assert!("Ferris".eq_ignore_ascii_case("FERRIS"));
Locale
Rusts case functions are best effort regarding locales and do not handle all locales. To support proper internationalisation, you will need to look for other crates that do this.
I have a piece of text with characters of different bytelength.
let text = "Hello привет";
I need to take a slice of the string given start (included) and end (excluded) character indices. I tried this
let slice = &text[start..end];
and got the following error
thread 'main' panicked at 'byte index 7 is not a char boundary; it is inside 'п' (bytes 6..8) of `Hello привет`'
I suppose it happens since Cyrillic letters are multi-byte and the [..] notation takes chars using byte indices. What can I use if I want to slice using character indices, like I do in Python:
slice = text[start:end] ?
I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?
Possible solutions to codepoint slicing
I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?
If you know the exact byte indices, you can slice a string:
let text = "Hello привет";
println!("{}", &text[2..10]);
This prints "llo пр". So the problem is to find out the exact byte position. You can do that fairly easily with the char_indices() iterator (alternatively you could use chars() with char::len_utf8()):
let text = "Hello привет";
let end = text.char_indices().map(|(i, _)| i).nth(8).unwrap();
println!("{}", &text[2..end]);
As another alternative, you can first collect the string into Vec<char>. Then, indexing is simple, but to print it as a string, you have to collect it again or write your own function to do it.
let text = "Hello привет";
let text_vec = text.chars().collect::<Vec<_>>();
println!("{}", text_vec[2..8].iter().cloned().collect::<String>());
Why is this not easier?
As you can see, neither of these solutions is all that great. This is intentional, for two reasons:
As str is a simply UTF8 buffer, indexing by unicode codepoints is an O(n) operation. Usually, people expect the [] operator to be a O(1) operation. Rust makes this runtime complexity explicit and doesn't try to hide it. In both solutions above you can clearly see that it's not O(1).
But the more important reason:
Unicode codepoints are generally not a useful unit
What Python does (and what you think you want) is not all that useful. It all comes down to the complexity of language and thus the complexity of unicode. Python slices Unicode codepoints. This is what a Rust char represents. It's 32 bit big (a few fewer bits would suffice, but we round up to a power of 2).
But what you actually want to do is slice user perceived characters. But this is an explicitly loosely defined term. Different cultures and languages regard different things as "one character". The closest approximation is a "grapheme cluster". Such a cluster can consist of one or more unicode codepoints. Consider this Python 3 code:
>>> s = "Jürgen"
>>> s[0:2]
'Ju'
Surprising, right? This is because the string above is:
0x004A LATIN CAPITAL LETTER J
0x0075 LATIN SMALL LETTER U
0x0308 COMBINING DIAERESIS
...
This is an example of a combining character that is rendered as part of the previous character. Python slicing does the "wrong" thing here.
Another example:
>>> s = "fire"
>>> s[0:2]
'fir'
Also not what you'd expect. This time, fi is actually the ligature fi, which is one codepoint.
There are far more examples where Unicode behaves in a surprising way. See the links at the bottom for more information and examples.
So if you want to work with international strings that should be able to work everywhere, don't do codepoint slicing! If you really need to semantically view the string as a series of characters, use grapheme clusters. To do that, the crate unicode-segmentation is very useful.
Further resources on this topic:
Blogpost "Let's stop ascribing meaning to unicode codepoints"
Blogpost "Breaking our Latin-1 assumptions
http://utf8everywhere.org/
A UTF-8 encoded string may contain characters which consists of multiple bytes. In your case, п starts at index 6 (inclusive) and ends at position 8 (exclusive) so indexing 7 is not the start of the character. This is why your error occurred.
You may use str::char_indices() for solving this (remember, that getting to a position in a UTF-8 string is O(n)):
fn get_utf8_slice(string: &str, start: usize, end: usize) -> Option<&str> {
assert!(end >= start);
string.char_indices().nth(start).and_then(|(start_pos, _)| {
string[start_pos..]
.char_indices()
.nth(end - start - 1)
.map(|(end_pos, _)| &string[start_pos..end_pos])
})
}
playground
You may use str::chars() if you are fine with getting a String:
let string: String = text.chars().take(end).skip(start).collect();
Here is a function which retrieves a utf8 slice, with the following pros:
handle all edge cases (empty input, 0-width output ranges, out-of-scope ranges);
never panics;
use start-inclusive, end-exclusive ranges.
pub fn utf8_slice(s: &str, start: usize, end: usize) -> Option<&str> {
let mut iter = s.char_indices()
.map(|(pos, _)| pos)
.chain(Some(s.len()))
.skip(start)
.peekable();
let start_pos = *iter.peek()?;
for _ in start..end { iter.next(); }
Some(&s[start_pos..*iter.peek()?])
}
Why do you have to walk over the string to find the nᵗʰ letter of a string when you do s[n] where s is a string. (According to https://doc.rust-lang.org/book/strings.html)
From what I understood, a string is an array of chars and a char is an array of 4 bytes or a number of 4 bytes. So is getting the nth letter would be similar as doing this : v[4*n..4*n+4] where v is a vector ?
What is the cost of v[i..j] ?
I would assume that the cost of v[i..j] is j-i and so that the cost of s[n] should be 4.
Note: The second edition of The Rust Programming Language has an improved and smooth explanation to Strings in Rust, which you might wish to read as well. The answer below, although still accurate, quotes from the first edition of the book.
I will try to clarify these misconceptions about strings in Rust by quoting from the book (https://doc.rust-lang.org/book/strings.html).
A ‘string’ is a sequence of Unicode scalar values encoded as a stream of UTF-8 bytes. All strings are guaranteed to be a valid encoding of UTF-8 sequences.
With this in mind, plus that UTF-8 code points are variably sized (1 to 4 bytes depending on the character), all strings in Rust, whether they are &str or String, are not arrays of characters, and can not be treated like such. It is further explained why on Slicing:
Because strings are valid UTF-8, they do not support indexing:
let s = "hello";
println!("The first letter of s is {}", s[0]); // ERROR!!!
Usually, access to a vector with [] is very fast. But, because each character in a UTF-8 encoded string can be multiple bytes, you have to walk over the string to find the nᵗʰ letter of a string. This is a significantly more expensive operation, and we don’t want to be misleading.
Unlike what was mentioned in the question, one cannot do s[n], because although in theory this would allows us to fetch the nth byte in constant time, that byte is not guaranteed to make any sense on its own.
What is the cost of v[i..j] ?
The cost of slicing is actually constant, because it is done at byte-level:
You can get a slice of a string with slicing syntax:
let dog = "hachiko";
let hachi = &dog[0..5];
But note that these are byte offsets, not character offsets. So this will fail at runtime:
let dog = "忠犬ハチ公";
let hachi = &dog[0..2];
with this error:
thread '' panicked at 'index 0 and/or 2 in 忠犬ハチ公 do not lie on
character boundary'
Basically, slicing is acceptable and will yield a new view of that string, so no copies are made. However, it should only be used when you are completely sure that the offsets are right in terms of character boundaries.
In order to iterate over each character of a string, you may instead call chars():
let c = s.chars().nth(n);
Even with that in mind, note that handling Unicode character might not be exactly what you want if you wish to handle character modifiers in UTF-8 (which are scalar values by themselves but should not be treated individually either). Quoting now from the str API:
fn chars(&self) -> Chars
Returns an iterator over the chars of a string slice.
As a string slice consists of valid UTF-8, we can iterate through a string slice by char. This method returns such an iterator.
It's important to remember that char represents a Unicode Scalar Value, and may not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want.
Remember, chars may not match your human intuition about characters:
let y = "y̆";
let mut chars = y.chars();
assert_eq!(Some('y'), chars.next()); // not 'y̆'
assert_eq!(Some('\u{0306}'), chars.next());
assert_eq!(None, chars.next());
The unicode_segmentation crate provides a means to define grapheme cluster boundaries:
extern crate unicode_segmentation;
use unicode_segmentation::UnicodeSegmentation;
let s = "a̐éö̲\r\n";
let g = UnicodeSegmentation::graphemes(s, true).collect::<Vec<&str>>();
let b: &[_] = &["a̐", "é", "ö̲", "\r\n"];
assert_eq!(g, b);
If you do want to treat the string as an array of codepoints (which isn't strictly the same as characters; there are combining marks, emoji with separate skin-tone modifiers, etc.), you can collect it into a Vec:
fn main() {
let s = "£10 🙃!";
for (i,c) in s.char_indices() {
println!("{} {}", i, c);
}
let v: Vec<char> = s.chars().collect();
println!("v[5] = {}", v[5]);
}
Play link
With bonus demonstration of some varying character widths, this outputs:
0 £
2 1
3 0
4
5 🙃
9 !
v[5] = !