How to partition a string at a fixed index? - rust

I have a String (in particular a SHA1 hex digest) that I would like to split into two substrings: the first two characters and the rest of the string. Is there a clean way to do this in Rust?

If you know that your string contains only ASCII characters (as is the case with SHA digests), you can use slices directly:
let s = "13e3f28a65a42bf6258cbd1d883d1ce3dac8f085";
let first = &s[..2]; // "13"
let rest = &s[2..]; // "e3f28a65a42bf6258cbd1d883d1ce3dac8f085"
It won't work correctly if your string contains non-ASCII characters because slicing uses byte offsets, and if any index used in slicing points into the middle of a code point representation, your program will panic.

There's a split_at method since Rust 1.4, use it like this:
let s = "13e3f28a65a42bf6258cbd1d883d1ce3dac8f085";
let (first, last) = s.split_at(2);
assert_eq!("13", first);
assert_eq!("e3f28a65a42bf6258cbd1d883d1ce3dac8f085", last);
Note that the index is a byte position and must lie on a character boundary. In this case this works because you know that your input string is ASCII.
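If the input is not guaranteed to be ASCII, the boundary can be checked before splitting. The following is a minimal sketch, not part of the original answer; `checked_split_at` is a hypothetical helper name, built on the standard `str::is_char_boundary` and `str::split_at` methods.

```rust
/// Splits `s` at `byte_index`, returning `None` instead of panicking when the
/// index is past the end or falls inside a multi-byte character.
/// (`checked_split_at` is a hypothetical helper name.)
fn checked_split_at(s: &str, byte_index: usize) -> Option<(&str, &str)> {
    // `is_char_boundary` is false for indices beyond `s.len()` as well.
    if s.is_char_boundary(byte_index) {
        Some(s.split_at(byte_index))
    } else {
        None
    }
}
```

The caller then decides what to do with `None` rather than having the program panic.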

If you are expecting two Strings instead of slices, you can use the chars() method and some Iterator methods to obtain them.
let text = "abcdefg".to_string();
let start: String = text.chars().take(2).collect();
let end: String = text.chars().skip(2).collect();
If you don't want to do heap allocations, you can use slices instead:
let start: &str = text.slice_chars(0, 2);
let end: &str = text.slice_chars(2, text.char_len());
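Note that the slices version required unstable Rust (nightly builds, not the beta) at the time of writing. Neither slice_chars nor char_len exists in current stable Rust, so this version no longer compiles.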
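On stable Rust, a comparable zero-allocation split by character position can be sketched with `char_indices`. This is an illustration under the assumption that "character" means a `char` (Unicode scalar value); `split_at_char` is a hypothetical helper name.

```rust
/// Splits `s` after its first `n` chars (Unicode scalar values) without
/// allocating. If `s` has fewer than `n` chars, the whole string is returned
/// as the first piece and the second piece is empty.
fn split_at_char(s: &str, n: usize) -> (&str, &str) {
    // Byte offset at which the n-th char starts, or the end of the string.
    let byte_index = s
        .char_indices()
        .nth(n)
        .map(|(idx, _)| idx)
        .unwrap_or(s.len());
    s.split_at(byte_index)
}
```

Finding the offset is O(n) in the number of chars skipped, since UTF-8 has to be walked from the start.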

Here is a way to efficiently split a String into two Strings, for the case where you own the string data. The allocation of the input string is retained in the first piece by just using truncation.
/// Split a **String** at a particular index
///
/// **Panic** if **byte_index** is not a character boundary
fn split_string(mut s: String, byte_index: usize) -> (String, String) {
    let tail = s[byte_index..].into();
    s.truncate(byte_index);
    (s, tail)
}
Note: The .into() method is from the generic conversion trait Into and in this case it converts &str into String.
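The standard library also offers `String::split_off`, which does the same two steps in one call: it keeps the head in the original allocation and returns the tail as a new String. A minimal sketch follows; the function name `split_string_off` is only for illustration.

```rust
/// Splits an owned String at `byte_index` using `String::split_off`.
/// Panics if `byte_index` is not on a char boundary or is past the end.
fn split_string_off(mut s: String, byte_index: usize) -> (String, String) {
    // `split_off` truncates `s` in place and returns the removed tail.
    let tail = s.split_off(byte_index);
    (s, tail)
}
```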

Related

How do I get a substring of a String object using a character position range?

Say I have a struct Foo that owns a string:
struct Foo {
    owned_string: String,
}
I want to implement some methods on this struct that return substrings from the owned String. For efficiency reasons, I don't want to allocate any new memory for this, I just want the return values to point to the original String.
Let's say I know the substring I want, it's characters 10 through 15.
I can't just slice it like self.owned_string[10..16], since that would give me bytes, not characters.
I can take the characters and collect them into a new String object, like self.owned_string.chars().skip(9).take(6).collect::<String>(), but that creates a new String object. String objects own their strings (AFAIK), so presumably new memory was allocated for this, which is not what I want.
How do I create string slices that reference a substring of a String object, but using character positions? (Without allocating any new memory)
You can use char_indices() then slice the string according to the positions the iterator gives you:
let mut iter = s.char_indices();
let (start, _) = iter.nth(10).unwrap();
let (end, _) = iter.nth(5).unwrap();
let slice = &s[start..end];
However, note that as mentioned in the documentation of chars():
It's important to remember that char represents a Unicode Scalar Value, and might not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want. This functionality is not provided by Rust's standard library, check crates.io instead.
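The `char_indices` approach can be wrapped in a helper that returns `None` instead of panicking when the string is too short. This is a sketch only; `char_range` is a hypothetical name, and positions are counted in `char`s (not grapheme clusters), with the end position exclusive.

```rust
/// Returns the sub-slice covering chars `start..end` (char positions, end
/// exclusive), or `None` if `start > end` or the string has fewer than `end`
/// chars. No allocation takes place.
fn char_range(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offset of every char, plus the end of the string as a final boundary.
    let mut boundaries = s
        .char_indices()
        .map(|(idx, _)| idx)
        .chain(std::iter::once(s.len()));
    let start_byte = boundaries.nth(start)?;
    let end_byte = if end == start {
        start_byte
    } else {
        // `nth` already consumed `start + 1` boundaries, so skip the remainder.
        boundaries.nth(end - start - 1)?
    };
    Some(&s[start_byte..end_byte])
}
```

A method on a struct such as Foo could return `char_range(&self.owned_string, 10, 16)` to borrow chars 10 through 15 from the owned String.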
@ChayimFriedman's answer is of course correct, I just wanted to contribute a more telling example:
fn print_string(s: &str) {
    println!("String: {}", s);
}
fn main() {
    let s: String = "🤣😄😁😆😅".to_string();
    let mut iter = s.char_indices();
    // Retrieve the position of the char at pos 1
    let (start, _) = iter.nth(1).unwrap();
    // Now the next char will be at position `2`. Which would be
    // equivalent of querying `.next()` or `.nth(0)`.
    // So if we query for `nth(2)` we query 3 characters; meaning
    // the position of character 4.
    let (end, _) = iter.nth(2).unwrap();
    // Gives you a &str, which is exactly what you want.
    // A reference to a substring, zero allocations, zero overhead.
    let substring = &s[start..end];
    print_string(&s);
    print_string(substring);
}
String: 🤣😄😁😆😅
String: 😄😁😆
I've done it with smileys because smileys are definitely multi-byte unicode characters.
As @ChayimFriedman already noted, the reason why we have to iterate through the char_indices is that Unicode characters are variably sized in UTF-8. They can be anywhere from 1 to 4 bytes long, so the only way to find out where the character boundaries are is to actually read the string up to the character we desire.
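To see the variable width concretely: in UTF-8 a `char` occupies between 1 and 4 bytes, and the standard `char::len_utf8` method reports how many. The helper name `utf8_widths` below is made up for this sketch.

```rust
/// Returns the UTF-8 encoded length, in bytes, of each char in `s`.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}
```

For the string "aß€🤣" this yields widths 1, 2, 3 and 4, which add up to the string's `len()` of 10 bytes.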

trying to trim and lowercase my String in rust

I'm trying to trim and lowercase my String.
Currently I have
use dialoguer::Input;
let input: String = Input::new()
    .with_prompt("Guess a 5 letter word")
    .interact_text()
    .unwrap();
let guess: &str = input.as_str(); // trim and lowercase
I'm trying to transform String into a trimmed and lowercased &str but some functions are only on &str and others only on String so I'm having trouble coming up with an elegant solution.
TL;DR: End goal have guess be a trimmed and lowercase &str
Rust stdlib is not about elegance, but about correctness and efficiency.
In your particular case, trim() is defined as str::trim(&self) -> &str because it always returns a substring of the original string, so it does not need to copy or allocate a new string, just compute the begin and end, and do the slice.
But to_lowercase() is defined as str::to_lowercase(&self) -> String because it changes each of its characters to the lowercase equivalent, so it must allocate and fill a new String.
You may think that if you own the string you can mutate it to lowercase in place. But that will not work in general because there is no 1-to-1 mapping between lowercase and uppercase letters. Think of, for example, ß <-> SS in German.
Naturally, you may know that your string only has ASCII characters... if so you can also use str::make_ascii_lowercase(&mut self) that does the change in-place, but only for ASCII characters that do have the 1-to-1 map.
So, summing up, the more ergonomic code would be, to trim input and copy to an owned lowercase:
let guess : String = input.trim().to_lowercase();
Or if you absolutely want to avoid allocating an extra string, but you are positive that only ASCII characters matter:
let mut input = input; //you could also add the mut above
input.make_ascii_lowercase();
let guess: &str = input.trim();
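The two variants can be packaged as small helpers. This is a sketch with hypothetical names (`normalize`, `normalize_ascii`); only `str::trim`, `str::to_lowercase` and `str::make_ascii_lowercase` come from the standard library.

```rust
/// Trims surrounding whitespace and lowercases using full Unicode rules.
/// Allocates a new String because lowercasing may change the byte length.
fn normalize(input: &str) -> String {
    input.trim().to_lowercase()
}

/// ASCII-only variant: lowercases in place, then returns a trimmed view
/// into the same buffer. Non-ASCII characters are left untouched.
fn normalize_ascii(input: &mut String) -> &str {
    input.make_ascii_lowercase();
    input.trim()
}
```

The first returns an owned String the caller can keep; the second borrows from the input, so the input must outlive the returned `&str`.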
Try this:
let s = " aBcD ";
let s2 = s.trim().to_lowercase();
println!("[{s}], [{s2}]");
The above will work if s is &str (as in my example) or String and it will print:
[ aBcD ], [abcd]
So the last line in your code (if you insist on having guess as &str) should become:
let guess: &str = &input.trim().to_lowercase();
Otherwise if you write just:
let guess = input.trim().to_lowercase();
, guess will be of type String, as that's what to_lowercase() returns.

How to get the last character of a &str?

In Python, this would be final_char = mystring[-1]. How can I do the same in Rust?
I have tried
mystring[mystring.len() - 1]
but I get the error the type 'str' cannot be indexed by 'usize'
That is how you get the last char (which may not be what you think of as a "character"):
mystring.chars().last().unwrap();
Use unwrap only if you are sure that there is at least one char in your string.
Warning: About the general case (doing the same thing as mystring[-n] in Python): UTF-8 strings are not meant to be used through indexing, because indexing is not an O(1) operation (a string in Rust is not an array of characters).
However, if you want to index from the end like in Python, you must do this in Rust:
mystring.chars().rev().nth(n - 1) // Python: mystring[-n]
and check if there is such a character.
If you miss the simplicity of Python syntax, you can write your own extension:
trait StrExt {
    fn from_end(&self, n: usize) -> char;
}
impl<'a> StrExt for &'a str {
    fn from_end(&self, n: usize) -> char {
        self.chars().rev().nth(n).expect("Index out of range in 'from_end'")
    }
}
fn main() {
    println!("{}", "foobar".from_end(2)) // prints 'b'
}
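If a panic on short strings is not acceptable, the same lookup can return an `Option`. A minimal sketch, with `char_from_end` as a hypothetical name and `n` counted from zero (0 is the last char, unlike Python's -1):

```rust
/// Returns the char `n` positions from the end (0 = last char), or `None`
/// if the string has `n` or fewer chars.
fn char_from_end(s: &str, n: usize) -> Option<char> {
    s.chars().rev().nth(n)
}
```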
One option is to use slices. Here's an example:
let len = my_str.len();
let final_str = &my_str[len-1..];
This returns a string slice from position len-1 through the end of the string. That is to say, the last byte of your string. If your string consists of only ASCII values, then you'll get the final character of your string.
This only works with ASCII values because each of them requires exactly one byte of storage. If the last character is anything else, Rust will panic at runtime, because the slice would start in the middle of a multi-byte character. This is what happens when you try to slice out one byte from a 2-byte character.
For a more detailed explanation, please see the strings section of the Rust book.
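A slice of the last character that also works for non-ASCII text can be sketched by asking `char_indices` for the byte offset of the final char. `last_char_slice` is a hypothetical helper name.

```rust
/// Returns the last char of `s` as a string slice, or `None` if `s` is empty.
/// Works for multi-byte characters because the slice starts at the byte
/// offset where the final char begins.
fn last_char_slice(s: &str) -> Option<&str> {
    s.char_indices().next_back().map(|(idx, _)| &s[idx..])
}
```

Because `char_indices` can be iterated from the back, this only looks at the end of the string rather than scanning it from the start.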
As @Boiethios mentioned
let last_ch = mystring.chars().last().unwrap();
Or
let last_ch = codes.chars().rev().nth(0).unwrap();
I would rather have (how hard is that!?)
let last_ch = codes.chars(-1); // Not implemented as of rustc 1.56.1

Efficiently extract prefix substrings

Currently I'm using the following function to extract prefix substrings:
fn prefix(s: &String, k: usize) -> String {
    s.chars().take(k).collect::<String>()
}
This can then be used for comparisons like so:
let my_string = "ACGT".to_string();
let same = prefix(&my_string, 3) == prefix(&my_string, 2);
However, this allocates a new String for each call to prefix, in addition to the processing for the iteration. Most other languages I'm familiar with have an efficient way to do a comparison like this, using just a view of the strings. Is there a way in Rust?
Yes, you can take subslices of strings using the Index operation:
fn prefix(s: &str, k: usize) -> &str {
    &s[..k]
}
fn main() {
    let my_string = "ACGT".to_string();
    let same = prefix(&my_string, 3) == prefix(&my_string, 2);
    println!("{}", same);
}
Note that slicing a string uses bytes as the unit, not characters. It is up to the programmer to ensure that the slice lengths lie on valid UTF-8 boundaries. Additionally, you have to ensure that you don't try to slice past the end of the string. Breaking either of these will result in a panic!.
A bit more defensive version would be
fn prefix(s: &str, k: usize) -> &str {
    let idx = s.char_indices().nth(k).map(|(idx, _)| idx).unwrap_or(s.len());
    &s[0..idx]
}
The key difference is that we use the char_indices iterator, which tells us the byte offsets corresponding to a character. Indexing into a UTF-8 string is an O(n) operation, and Rust doesn't want to hide that algorithmic complexity from you. This still isn't even complete, because there can be combining characters, for example. Dealing with strings is hard, thanks to the complexity of human language.
Most other languages I'm familiar with have an efficient way
Doubtful :-) To be efficient in time, they'd have to know how many bytes to skip ahead for every character. Either they'd have to keep a lookup table for every string or use a fixed-size character encoding. Both of those solutions can use more memory than needed, and a fixed size encoding doesn't even work when you have combining characters, for example.
Of course, other languages could just say "LOL, strings are just arrays of bytes, good luck with treating them correctly", and efficiently ignore your character encoding...
Two additional notes
Your predicate doesn't really make sense. A string of 2 letters will never match one of 3 letters. For strings to match, they must have the same number of bytes.
You should never need to take &String as a function argument. Taking a &str is a more accepting argument in all cases except for one teeny tiny little case that no one needs: knowing the capacity of a String, but without being able to modify the string.
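If the underlying goal is to test whether two strings agree on their first k characters, the iterators can be compared directly with `Iterator::eq`, with no allocation and no slicing. This is a sketch of one possible reading of the question; `shares_prefix` is a hypothetical name.

```rust
/// Returns true if `a` and `b` agree on their first `k` chars. Strings
/// shorter than `k` chars only match if they are entirely equal.
fn shares_prefix(a: &str, b: &str, k: usize) -> bool {
    a.chars().take(k).eq(b.chars().take(k))
}
```

Comparison stops at the first differing char, so the cost is bounded by k.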
While Shepmaster's answer is absolutely correct for the general case of string slicing, I'd like to add that sometimes there are easier ways.
If you know in advance the set of characters you're working with ("ATGC" example suggests you're working with nucleobases, so it is possible that these are all the characters you need) then you can use slices of bytes &[u8] instead of string slices &str. You can always get a byte slice out of a string slice and a Vec<u8> out of a String, if necessary:
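let s: String = "ATGC".into();
let ss: &str = &s;
// Borrow the bytes of the string slice (str has as_bytes(), not as_slice()).
let bs: &[u8] = ss.as_bytes();
// into_bytes() consumes the String, so clone it while `ss` and `bs` still borrow `s`.
let b: Vec<u8> = s.clone().into_bytes();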
Also, there are byte slice and byte character literals, just prefix regular string/char literals with b:
let sl: &[u8] = b"ATGC";
let bl: u8 = b'G';
Working with byte slices gives you constant-time indexing (and thus slicing) operations, so checking for prefix equality is easy (like Shepmaster's first variant, but without the possibility of panics unless k is too large):
fn prefix(s: &[u8], k: usize) -> &[u8] {
    &s[..k]
}
If you need, you can turn byte slices/vectors back to strings. This operation, of course, checks validity of UTF-8 encoding so it may fail, but if you only work with ASCII, you can safely ignore these errors and just unwrap():
let ss2: &str = std::str::from_utf8(bs).unwrap();
let s2: String = String::from_utf8(b).unwrap();
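The ASCII assumption can be verified once up front, and the remaining panic for an oversized k can be avoided with the slice method `get`. A minimal sketch; `ascii_prefix` is a hypothetical name.

```rust
/// Returns the first `k` bytes of `s` if `s` is pure ASCII and at least `k`
/// bytes long; otherwise returns `None`.
fn ascii_prefix(s: &str, k: usize) -> Option<&[u8]> {
    if s.is_ascii() {
        // `get` returns None instead of panicking when `k` is out of range.
        s.as_bytes().get(..k)
    } else {
        None
    }
}
```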

How do you iterate over a string by character

I have a string and I need to scan for every occurrence of "foo" and read all the text following it until a second ". Since Rust does not have a contains function for strings, I need to iterate by characters scanning for it. How would I do this?
Edit: Rust's &str has contains() and find() methods.
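With those methods the scan does not need a manual character loop. The sketch below is one interpretation of the task: for each occurrence of a marker, collect the text up to the next terminator character. The name `texts_after` and its parameters are hypothetical; `str::match_indices` and `str::find` are standard.

```rust
/// For every occurrence of `marker` in `haystack`, returns the text that
/// follows it up to (not including) the next `terminator`. Occurrences with
/// no terminator after them are skipped.
fn texts_after<'a>(haystack: &'a str, marker: &str, terminator: char) -> Vec<&'a str> {
    let mut found = Vec::new();
    for (start, matched) in haystack.match_indices(marker) {
        // Everything after this occurrence of the marker.
        let rest = &haystack[start + matched.len()..];
        if let Some(end) = rest.find(terminator) {
            found.push(&rest[..end]);
        }
    }
    found
}
```

The returned slices borrow from `haystack`, so nothing is copied.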
I need to iterate by characters scanning for it.
The .chars() method returns an iterator over characters in a string. e.g.
for c in my_str.chars() {
    // do something with `c`
}
for (i, c) in my_str.chars().enumerate() {
    // do something with character `c` and index `i`
}
If you are interested in the byte offsets of each char, you can use char_indices.
Look into .peekable(), and use peek() for looking ahead. It's wrapped like this because it supports UTF-8 codepoints instead of being a simple vector of characters.
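As an illustration of lookahead with `peekable()`, the sketch below groups consecutive repeated chars into runs. The function `run_lengths` is a made-up example, not something from the question; it only shows how `peek()` inspects the next char without consuming it.

```rust
/// Groups consecutive identical chars into (char, count) runs,
/// e.g. "aaabcc" becomes [('a', 3), ('b', 1), ('c', 2)].
fn run_lengths(s: &str) -> Vec<(char, usize)> {
    let mut runs = Vec::new();
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        let mut count = 1;
        // Look ahead without consuming; advance only while the next char repeats.
        while chars.peek() == Some(&c) {
            chars.next();
            count += 1;
        }
        runs.push((c, count));
    }
    runs
}
```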
You could also create a vector of chars and work on it from there, but that's more time and space intensive:
let my_chars: Vec<_> = my_str.chars().collect();
The concept of a "character" is very ambiguous and can mean many different things depending on the type of data you are working with. The most obvious answer is the chars method. However, it iterates over Unicode code points (scalar values), which may not be what you expect. What looks like a single "character" to you may actually be made up of multiple Unicode code points, which can lead to unexpected results:
"a̐".chars() // => ['a', '\u{310}']
For a lot of string processing, you want to work with graphemes. A grapheme consists of one or more unicode code points represented as a string slice. These map better to the human perception of "characters". To create an iterator of graphemes, you can use the unicode-segmentation crate:
use unicode_segmentation::UnicodeSegmentation;
for grapheme in my_str.graphemes(true) {
    // ...
}
If you are working with raw ASCII then none of the above applies to you, and you can simply use the bytes iterator:
for byte in my_str.bytes() {
    // ...
}
Although, if you are working with ASCII then arguably you shouldn't be using String/&str at all and instead use Vec<u8>/&[u8] directly.
fn main() {
    let s = "Rust is a programming language";
    for i in s.chars() {
        print!("{}", i);
    }
}
Output: Rust is a programming language
I use the chars() method to iterate over each element of the string.
