What is an efficient way to compare strings while ignoring case?

To compare two Strings, ignoring case, it looks like I first need to convert to a lower case version of the string:
let a_lower = a.to_lowercase();
let b_lower = b.to_lowercase();
a_lower.cmp(&b_lower)
Is there a method that compares strings, ignoring case, without creating the temporary lower case strings, i.e. that iterates over the characters, performs the to-lowercase conversion there and compares the result?

There is no built-in method, but you can write one to do exactly as you described, assuming you only care about ASCII input.
use itertools::{EitherOrBoth::*, Itertools as _}; // 0.9.0
use std::cmp::Ordering;
fn cmp_ignore_case_ascii(a: &str, b: &str) -> Ordering {
    a.bytes()
        .zip_longest(b.bytes())
        .map(|ab| match ab {
            Left(_) => Ordering::Greater,
            Right(_) => Ordering::Less,
            Both(a, b) => a.to_ascii_lowercase().cmp(&b.to_ascii_lowercase()),
        })
        .find(|&ordering| ordering != Ordering::Equal)
        .unwrap_or(Ordering::Equal)
}
As some comments below have pointed out, case-insensitive comparison will not work properly for UTF-8 without operating on the whole string, and even then some case conversions have multiple representations, which could give unexpected results.
With those caveats, the following will work for a lot of extra cases compared with the ASCII version above (e.g. most accented Latin characters) and may be satisfactory, depending on your requirements:
fn cmp_ignore_case_utf8(a: &str, b: &str) -> Ordering {
    a.chars()
        .flat_map(char::to_lowercase)
        .zip_longest(b.chars().flat_map(char::to_lowercase))
        .map(|ab| match ab {
            Left(_) => Ordering::Greater,
            Right(_) => Ordering::Less,
            Both(a, b) => a.cmp(&b),
        })
        .find(|&ordering| ordering != Ordering::Equal)
        .unwrap_or(Ordering::Equal)
}

If you are only working with ASCII, you can use eq_ignore_ascii_case:
assert!("Ferris".eq_ignore_ascii_case("FERRIS"));

Unicode
The best way to support Unicode is to use to_lowercase() or to_uppercase().
Unicode has many caveats, and these functions handle most situations. Some locale-specific strings are still not handled correctly.
let left = "first".to_string();
let right = "FiRsT".to_string();
assert!(left.to_lowercase() == right.to_lowercase());
Efficiency
It is possible to iterate and return on the first non-equal character, so in essence you only allocate one character at a time. However, iterating with the chars function does not account for every situation Unicode can throw at us.
See the answer by Peter Hall for details on this.
ASCII
The most efficient option, if you are only using ASCII, is eq_ignore_ascii_case (as per Ibraheem Ahmed's answer). It does not allocate or copy temporaries.
This is only a good fit if your code controls at least one side of the comparison and you are certain it will only ever contain ASCII.
assert!("Ferris".eq_ignore_ascii_case("FERRIS"));
Locale
Rust's case functions are best-effort with regard to locales and do not handle all of them. To support proper internationalisation, you will need to look at other crates that provide it.


How do I change a character in a string? [duplicate]

This isn't the exact use case, but it is basically what I am trying to do:
let mut username = "John_Smith";
println!("original username: {}", username);
username.set_char_at(4, '.'); // <------------- The part I don't know how to do
println!("new username: {}", username);
I can't figure out how to do this in constant time and using no additional space. I know I could use "replace" but replace is O(n). I could make a vector of the characters but that would require additional space.
I think you could create another variable that is a pointer using something like as_mut_slice, but this is deemed unsafe. Is there a safe way to replace a character in a string in constant time and space?
As of Rust 1.27 you can now use String::replace_range:
let mut username = String::from("John_Smith");
println!("original username: {}", username); // John_Smith
username.replace_range(4..5, ".");
println!("new username: {}", username); // John.Smith
(playground)
replace_range won't work with &mut str. If the size of the range and the size of the replacement string aren't the same, it has to be able to resize the underlying String, so &mut String is required. But in the case you ask about (replacing a single-byte character with another single-byte character) its memory usage and time complexity are both O(1).
There is a similar method on Vec, Vec::splice. The primary difference between them is that splice returns an iterator that yields the removed items.
In general? For any pair of characters? It's impossible.
A string is not an array. It may be implemented as an array, in some limited contexts.
Rust supports Unicode, which brings some challenges:
a Unicode code point is an integer between 0 and 0x10FFFF
a grapheme may be composed of multiple Unicode code points
In order to represent this, a Rust string is (for now) a UTF-8 byte sequence:
a single Unicode code point might be represented by 1 to 4 bytes
a grapheme might be represented by 1 or more bytes (no upper limit)
and therefore, the very notion of "replacing character i" brings a few challenges:
the byte position of character i lies somewhere between byte index i and the end of the string; finding it exactly requires reading the string from the beginning, which is O(N)
switching the i-th character in-place for another requires that both characters take up exactly the same amount of bytes
In general? It's impossible.
In the particular and very specific case where the byte index is known and the two characters' encodings have the same length, it is doable by directly modifying the byte sequence returned by as_bytes_mut, which is duly marked unsafe since you may inadvertently corrupt the string (remember, this byte sequence must remain valid UTF-8).
If you want to handle only ASCII, there used to be a separate type for that (this answer predates Rust 1.0):
use std::ascii::{AsciiCast, OwnedAsciiCast};
fn main() {
    let mut ascii = "ascii string".to_string().into_ascii();
    *ascii.get_mut(6) = 'S'.to_ascii();
    println!("result = {}", ascii);
}
There are some missing pieces (like into_ascii for &str) but it does what you want.
The implementation of to_/into_ascii at the time failed if the input string was not valid ASCII; there was also a to_ascii_opt variant (the old naming for methods that might fail). Note that this std::ascii API was removed before Rust 1.0 stabilized; nowadays the third-party ascii crate provides a comparable AsciiString type.

How to convert big hex strings to decimal strings?

I am working with big numbers in Rust, my inputs come in as hex strings and I am using a hash function which has done all the heavy lifting with large numbers, all I need to do is to provide it with decimal string versions of the numbers (not hex).
What would be the best practice to convert hex strings to their equivalent decimal representation, whilst staying within strings in order not to cause any overflow issues?
I don't think implementing this by hand is a trivial problem. It will involve a lot of error-prone math on strings, and it will be quite slow in the end; string-based math in general is slow.
There are excellent big integer libraries out there, though. For example num:
use num::{BigInt, Num};
fn convert_hex_to_dec(hex_str: &str) -> String {
    BigInt::from_str_radix(hex_str, 16).unwrap().to_string()
}
fn main() {
    let hex_str = "978d54b635d7a829d5e3d9bee5a56018ba5e01c10";
    let dec_str = convert_hex_to_dec(hex_str);
    println!("{}", dec_str);
}
13843350254437927549667668733810160916738275810320
If you planned to do your entire math with strings, I'd advise using BigInts throughout instead.

How to match all whitespace? [duplicate]

This question already has an answer here:
Is there a way to use custom patterns such as a regex or functions in a match?
Context: Rust has the match construct, which is really useful to make a (possibly) exhaustive list of cases and their corresponding results. The problem is: how do I create a case which encompasses a subset of many cases?
Regarding my specific problem, I'm making a lexer which reads a string character-by-character and spits out tokens. Its main function looks like this:
(...)
fn new(input: &str) -> Lexer {
    let mut characters = input.chars();
    for c in characters {
        let mut token: Option<Token> = match c {
            '+' => Some(Token::Add),
            '-' => Some(Token::Minus),
            '*' => Some(Token::Mul),
            '/' => Some(Token::Div),
            'e' => Some(Token::EulersNum),
            'π' => Some(Token::Pi),
            '(' => Some(Token::LeftParen),
            ')' => Some(Token::RightParen),
            ' ' | '\t' | '\n' => continue, // Whitespace
            _ => None,
        };
        if token == None {
            continue;
        }
    }
    todo!()
}
(...)
Now, the most important part, for the purposes of this question, is the line commented with 'Whitespace'. The problem with my handling of whitespace is that it may not correspond to the actual set of whitespace characters in a given string format. Sure, I could handle all the different kinds of ASCII whitespace, but what about Unicode? Making an exhaustive list of whitespace characters is not only cumbersome, but also obfuscates the code. It should be left to the language, not to its users.
Is it possible to just match it with a 'Whitespace' expression, such as:
(...)
Whitespace => continue,
(...)
And if so, how do I do it?
You could use char::is_whitespace() in a match guard:
match c {
    '+' => Some(Token::Add),
    '-' => Some(Token::Minus),
    '*' => Some(Token::Mul),
    '/' => Some(Token::Div),
    c if c.is_whitespace() => Some(Token::Whitespace),
    _ => None,
};
Playground link
Sure, I could handle all the different kinds of ASCII whitespace, but what about Unicode?
If you need it, use a library that provides whitespace detection for other string formats, or more complex matching capabilities.
Otherwise, use functions like char::is_whitespace() if you just need to match against Unicode whitespace.
Making an exhaustive list of whitespace characters is not only cumbersome, but also obfuscates the code. It should be left to the language, not to its users.
No: match and pattern matching in general are general-purpose tools. Rust is not a language specialized for string processing, so adding support for "whitespace matching" to pattern matching does not make sense.
Rust has basic ASCII, UTF-8, UTF-16, etc. support for practical reasons, but that's about it. Adding more complex pieces to the standard library is debatable.

Slice a string containing Unicode chars

I have a piece of text with characters of different bytelength.
let text = "Hello привет";
I need to take a slice of the string given start (included) and end (excluded) character indices. I tried this
let slice = &text[start..end];
and got the following error
thread 'main' panicked at 'byte index 7 is not a char boundary; it is inside 'п' (bytes 6..8) of `Hello привет`'
I suppose it happens since Cyrillic letters are multi-byte and the [..] notation takes chars using byte indices. What can I use if I want to slice using character indices, like I do in Python:
slice = text[start:end] ?
I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?
Possible solutions to codepoint slicing
I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?
If you know the exact byte indices, you can slice a string:
let text = "Hello привет";
println!("{}", &text[2..10]);
This prints "llo пр". So the problem is to find out the exact byte position. You can do that fairly easily with the char_indices() iterator (alternatively you could use chars() with char::len_utf8()):
let text = "Hello привет";
let end = text.char_indices().map(|(i, _)| i).nth(8).unwrap();
println!("{}", &text[2..end]);
As another alternative, you can first collect the string into Vec<char>. Then, indexing is simple, but to print it as a string, you have to collect it again or write your own function to do it.
let text = "Hello привет";
let text_vec = text.chars().collect::<Vec<_>>();
println!("{}", text_vec[2..8].iter().cloned().collect::<String>());
Why is this not easier?
As you can see, neither of these solutions is all that great. This is intentional, for two reasons:
As str is simply a UTF-8 buffer, indexing by Unicode codepoints is an O(n) operation. People usually expect the [] operator to be an O(1) operation, and Rust makes this runtime complexity explicit rather than hiding it. In both solutions above you can clearly see that it's not O(1).
But the more important reason:
Unicode codepoints are generally not a useful unit
What Python does (and what you think you want) is not all that useful. It all comes down to the complexity of language and thus the complexity of Unicode. Python slices Unicode codepoints; that is what a Rust char represents. A char is 32 bits big (21 bits would suffice, but we round up to a power of two).
But what you actually want to do is slice user perceived characters. But this is an explicitly loosely defined term. Different cultures and languages regard different things as "one character". The closest approximation is a "grapheme cluster". Such a cluster can consist of one or more unicode codepoints. Consider this Python 3 code:
>>> s = "Jürgen"
>>> s[0:2]
'Ju'
Surprising, right? This is because the string above is:
0x004A LATIN CAPITAL LETTER J
0x0075 LATIN SMALL LETTER U
0x0308 COMBINING DIAERESIS
...
This is an example of a combining character that is rendered as part of the previous character. Python slicing does the "wrong" thing here.
Another example:
>>> s = "ﬁre"
>>> s[0:2]
'ﬁr'
Also not what you'd expect. This time, what looks like fi is actually the ligature ﬁ (U+FB01), which is a single codepoint.
There are far more examples where Unicode behaves in a surprising way. See the links at the bottom for more information and examples.
So if you want to work with international strings that should be able to work everywhere, don't do codepoint slicing! If you really need to semantically view the string as a series of characters, use grapheme clusters. To do that, the crate unicode-segmentation is very useful.
Further resources on this topic:
Blogpost "Let's stop ascribing meaning to unicode codepoints"
Blogpost "Breaking our Latin-1 assumptions
http://utf8everywhere.org/
A UTF-8 encoded string may contain characters that consist of multiple bytes. In your case, п occupies bytes 6 (inclusive) to 8 (exclusive), so byte index 7 is not a character boundary. This is why your error occurred.
You may use str::char_indices() for solving this (remember, that getting to a position in a UTF-8 string is O(n)):
fn get_utf8_slice(string: &str, start: usize, end: usize) -> Option<&str> {
    assert!(end >= start);
    string.char_indices().nth(start).and_then(|(start_pos, _)| {
        string[start_pos..]
            .char_indices()
            .map(|(i, _)| i)
            .chain(Some(string.len() - start_pos))
            .nth(end - start)
            .map(|end_pos| &string[start_pos..start_pos + end_pos])
    })
}
playground
You may use str::chars() if you are fine with getting a String:
let string: String = text.chars().take(end).skip(start).collect();
Here is a function that retrieves a UTF-8 slice, with the following pros:
it handles all edge cases (empty input, zero-width output ranges, out-of-scope ranges);
it never panics;
it uses start-inclusive, end-exclusive ranges.
pub fn utf8_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    let mut iter = s.char_indices()
        .map(|(pos, _)| pos)
        .chain(Some(s.len()))
        .skip(start)
        .peekable();
    let start_pos = *iter.peek()?;
    for _ in start..end {
        iter.next();
    }
    Some(&s[start_pos..*iter.peek()?])
}

What is the easiest way to determine if a character is in Unicode range, in Rust?

I'm looking for easiest way to determine if a character in Rust is between two Unicode values.
For example, I want to know if a character s is between [#x1-#x8] or [#x10FFFE-#x10FFFF]. Is there a function that does this already?
The simplest way for me to match a character was this (using the ..= inclusive-range pattern syntax; older Rust wrote these ranges as ...):
fn match_char(data: &char) -> bool {
    match *data {
        '\x01'..='\x08' |
        '\u{10FFFE}'..='\u{10FFFF}' => true,
        _ => false,
    }
}
Pattern matching a character was the easiest route for me, compared to a bunch of if statements. It might not be the most performant solution, but it served me very well.
The simplest way, assuming they are not Unicode categories (in which case you would need a Unicode library), is to use the regular comparison operators:
(s >= '\x01' && s <= '\x08') || s == '\u{10FFFE}' || s == '\u{10FFFF}'
(In case you weren't aware of the literal forms of these things: \xXX escapes cover hexadecimal values up to 0x7F, and \u{XXXXXX} escapes cover any Unicode scalar value. As a matter of fact, char::from_u32(0x10FFFE).unwrap() works too and is just as efficient; it is just less easily readable.)
