How to "crop" characters off the beginning of a string in Rust?

I want a function that can take two arguments (string, number of letters to crop off front) and return the same string except with the letters before character x gone.
If I write
let mut example = "stringofletters";
CropLetters(example, 3);
println!("{}", example);
then the output should be:
ingofletters
Is there any way I can do this?

In many uses it would make sense to simply return a slice of the input, avoiding any copy. Converting @Shepmaster's solution to use immutable slices:
fn crop_letters(s: &str, pos: usize) -> &str {
match s.char_indices().skip(pos).next() {
Some((pos, _)) => &s[pos..],
None => "",
}
}
fn main() {
let example = "stringofletters"; // works with a String if you take a reference
let cropped = crop_letters(example, 3);
println!("{}", cropped);
}
Advantages over the mutating version are:
No copy is needed. You can call cropped.to_string() if you want a newly allocated result; but you don't have to.
It works with static string slices as well as mutable String etc.
The disadvantage is that if you really do have a mutable string you want to modify, it would be slightly less efficient as you'd need to allocate a new String.

Issues with your original code:
Functions use snake_case, types and traits use CamelCase.
"foo" is a string literal of type &str. These may not be changed. You will need something that has been heap-allocated, such as a String.
The call crop_letters(stringofletters, 3) would transfer ownership of stringofletters to the method, which means you wouldn't be able to use the variable anymore. You must pass in a mutable reference (&mut).
Rust strings are not ASCII, they are UTF-8. You need to figure out how many bytes each character requires. char_indices is a good tool here.
You need to handle the case of when the string is shorter than 3 characters.
Once you have the byte position of the new beginning of the string, you can use drain to move a chunk of bytes out of the string. We just drop these bytes and let the String move over the remaining bytes.
fn crop_letters(s: &mut String, pos: usize) {
match s.char_indices().nth(pos) {
Some((pos, _)) => {
s.drain(..pos);
}
None => {
s.clear();
}
}
}
fn main() {
let mut example = String::from("stringofletters");
crop_letters(&mut example, 3);
assert_eq!("ingofletters", example);
}
See Chris Emerson's answer if you don't actually need to modify the original String.

I found this answer which I don't consider really idiomatic:
fn crop_with_allocation(string: &str, len: usize) -> String {
string.chars().skip(len).collect()
}
fn crop_without_allocation(string: &str, len: usize) -> &str {
// Note: `len` is a byte offset here, not a character count.
if string.len() < len {
return "";
}
// Panics if `len` does not fall on a UTF-8 character boundary.
&string[len..]
}
fn main() {
let example = "stringofletters"; // works with a String if you take a reference
let cropped = crop_with_allocation(example, 3);
println!("{}", cropped);
let cropped = crop_without_allocation(example, 3);
println!("{}", cropped);
}
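One caveat worth spelling out: crop_without_allocation counts bytes, while crop_with_allocation counts chars, so the two only agree on ASCII input. The sketch below illustrates the difference. Both function names are made up for this example; crop_bytes_checked is a hypothetical helper that uses str::get to return None instead of panicking when the offset is not a char boundary.

```rust
// Hypothetical helper: crop `len` *bytes*, returning None when `len` is out
// of range or falls inside a multi-byte character (where `&s[len..]` would panic).
fn crop_bytes_checked(s: &str, len: usize) -> Option<&str> {
    s.get(len..)
}

// Crop `len` *chars* (Unicode scalar values), allocating a new String.
fn crop_chars(s: &str, len: usize) -> String {
    s.chars().skip(len).collect()
}

fn main() {
    // ASCII: one byte per char, so both approaches agree.
    assert_eq!(crop_bytes_checked("stringofletters", 3), Some("ingofletters"));
    assert_eq!(crop_chars("stringofletters", 3), "ingofletters");

    // Cyrillic letters take two bytes each in UTF-8.
    let s = "\u{416}\u{424}1"; // "ЖФ1"
    assert_eq!(crop_bytes_checked(s, 1), None); // byte 1 is inside the first letter
    assert_eq!(crop_bytes_checked(s, 2), Some("\u{424}1")); // two bytes = one char here
    assert_eq!(crop_chars(s, 1), "\u{424}1");
    println!("ok");
}
```

If the input may contain non-ASCII text, the char_indices-based versions in the other answers are the safer choice.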

my version
fn crop_str(s: &str, n: usize) -> &str {
let mut it = s.chars();
for _ in 0..n {
it.next();
}
it.as_str()
}
#[test]
fn test_crop_str() {
assert_eq!(crop_str("123", 1), "23");
assert_eq!(crop_str("ЖФ1", 1), "Ф1");
assert_eq!(crop_str("ЖФ1", 2), "1");
}

Related

String recursion in Rust

In Rust, I am trying to obtain all possible combinations of a-z characters up to a fixed length with no repeating letters.
For example, for a limited set of a-f and a length of 3 I should get:
abc
abd
abe
abf
acb
acd
ace
acf
adb
... etc
I've been struggling to do this through recursion and have been banging my head on ownership and borrows. The only way I've managed to do it is as follows, but this is cloning strings all over the place and is very inefficient. There are probably standard permutation/combination functions for this in the standard library, I don't know, but I'm interested in understanding how this can be done manually.
fn main() {
run(&String::new());
}
fn run(target: &String) {
for a in 97..123 { // ASCII a..z
if !target.contains(char::from(a)) {
let next = target.clone() + char::from(a).to_string().as_str(); // Working but terrible
if next.len() == 3 { // Required string size
println!("{}", next);
} else {
run(&next);
}
}
}
}
First off, a couple of remarks:
&String is kind of an anti-pattern that is rarely seen. It serves no purpose; all the functionality that String has over str requires mutability. So it should either be &mut String or &str.
97..123 is uncommon ... use 'a'..='z'.
Now to the actual problem:
As long as you pass a non-mutable string into the recursion, you won't get around cloning the data. I'd make the string mutable, then you can simply append and remove single characters from it.
Like this:
fn main() {
run(&mut String::new());
}
fn run(target: &mut String) {
for a in 'a'..='z' {
if !target.contains(a) {
target.push(a);
if target.len() == 3 {
// Required string size
println!("{}", target);
} else {
run(target);
}
target.pop();
}
}
}
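To check the push/pop approach against the smaller a-f example from the question, here is a sketch that collects the results into a Vec instead of printing them. The names permutations and collect_permutations, and the alphabet, len, and out parameters, are illustrative additions and not part of the answer above.

```rust
// Illustrative variant of `run`: the alphabet, target length and output
// vector are parameters so the result can be inspected.
fn permutations(alphabet: &[char], len: usize, current: &mut String, out: &mut Vec<String>) {
    if current.chars().count() == len {
        out.push(current.clone()); // the only clone: one per finished result
        return;
    }
    for &c in alphabet {
        if !current.contains(c) {
            current.push(c); // choose a letter
            permutations(alphabet, len, current, out); // extend the prefix
            current.pop(); // undo the choice (backtrack)
        }
    }
}

// Convenience wrapper that owns the scratch buffer and the output vector.
fn collect_permutations(alphabet: &[char], len: usize) -> Vec<String> {
    let mut out = Vec::new();
    permutations(alphabet, len, &mut String::new(), &mut out);
    out
}

fn main() {
    let alphabet: Vec<char> = ('a'..='f').collect();
    let all = collect_permutations(&alphabet, 3);
    println!("{} results, first: {}, last: {}", all.len(), all[0], all[all.len() - 1]);
}
```

The single scratch String is reused across the whole recursion, so the only allocations are the finished results themselves.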
For completeness, here is an alternative that avoids recursion. It uses .permutations() from the itertools crate.
Also, it's probably cleaner to return the String from the function directly instead of passing a mutable reference as an argument.
use std::ops::RangeInclusive;
use itertools::Itertools;
fn main() {
println!("result: {}",combine('a'..='d', 3));
println!("result: {}",combine('a'..='g', 4));
println!("result: {}",combine('a'..='c', 3));
println!("result: {}",combine('a'..='c', 4)); // assertion fail
}
fn combine(range: RangeInclusive<char>, depth: usize) -> String
{
assert!( *range.end() as usize - *range.start() as usize + 1 >= depth);
let perms = range.permutations(depth);
let mut result = String::new();
perms.for_each(|mut item| {
item.push(' ');
result += &item.into_iter().collect::<String>();
});
result.pop(); // pop last superfluous space char
result
}

How to convert a string of digits into a vector of digits?

I'm trying to store a string (or str) of digits, e.g. 12345 into a vector, such that the vector contains {1,2,3,4,5}.
As I'm totally new to Rust, I'm having problems with the types (String, str, char, ...) but also the lack of any information about conversion.
My current code looks like this:
fn main() {
let text = "731671";
let mut v: Vec<i32>;
let mut d = text.chars();
for i in 0..text.len() {
v.push( d.next().to_digit(10) );
}
}
You're close!
First, the index loop for i in 0..text.len() is not necessary since you're going to use an iterator anyway. It's simpler to loop directly over the iterator: for ch in text.chars(). Not only that, but your index loop and the character iterator are likely to diverge, because len() returns you the number of bytes and chars() returns you the Unicode scalar values. Being UTF-8, the string is likely to have fewer Unicode scalar values than it has bytes.
Next hurdle is that to_digit(10) returns an Option, telling you that there is a possibility the character won't be a digit. You can check whether to_digit(10) returned the Some variant of an Option with if let Some(digit) = ch.to_digit(10).
Pieced together, the code might now look like this:
fn main() {
let text = "731671";
let mut v = Vec::new();
for ch in text.chars() {
if let Some(digit) = ch.to_digit(10) {
v.push(digit);
}
}
println!("{:?}", v);
}
Now, this is rather imperative: you're making a vector and filling it digit by digit, all by yourself. You can try a more declarative or functional approach by applying a transformation over the string:
fn main() {
let text = "731671";
let v: Vec<u32> = text.chars().flat_map(|ch| ch.to_digit(10)).collect();
println!("{:?}", v);
}
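The mismatch between len() and chars() described in the answer above is easy to observe. A small sketch; the helper name byte_and_char_len and the sample strings are arbitrary choices for illustration:

```rust
// Returns (number of bytes, number of Unicode scalar values).
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let ascii = "731671";
    let mixed = "7\u{e9}3"; // U+00E9 (e with acute accent) occupies two bytes in UTF-8

    // For ASCII the two counts agree; for other text they diverge.
    assert_eq!(byte_and_char_len(ascii), (6, 6));
    assert_eq!(byte_and_char_len(mixed), (4, 3));

    // Looping over chars() directly never runs past the end of the iterator.
    let digits: Vec<u32> = mixed.chars().filter_map(|ch| ch.to_digit(10)).collect();
    assert_eq!(digits, [7, 3]);
    println!("{:?}", digits);
}
```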
ArtemGr's answer is pretty good, but their version will skip any characters that aren't digits. If you'd rather have it fail on bad digits, you can use this version instead:
fn to_digits(text: &str) -> Option<Vec<u32>> {
text.chars().map(|ch| ch.to_digit(10)).collect()
}
fn main() {
println!("{:?}", to_digits("731671"));
println!("{:?}", to_digits("731six71"));
}
Output:
Some([7, 3, 1, 6, 7, 1])
None
To mention the quick and dirty elephant in the room: if you REALLY know your string contains only digits in the range '0'..='9', then you can avoid memory allocations and copies and use the underlying &[u8] representation of the string from str::as_bytes directly. Subtract b'0' from each element whenever you access it.
If you are doing competitive programming, this is one of the worthwhile speed and memory optimizations.
fn main() {
let text = "12345";
let digit = text.as_bytes();
println!("Text = {:?}", text);
println!("value of digit[3] = {}", digit[3] - b'0');
}
Output:
Text = "12345"
value of digit[3] = 4
This solution combines ArtemGr's + notriddle's solutions:
fn to_digits(string: &str) -> Vec<u32> {
let opt_vec: Option<Vec<u32>> = string
.chars()
.map(|ch| ch.to_digit(10))
.collect();
match opt_vec {
Some(vec_of_digits) => vec_of_digits,
None => vec![],
}
}
In my case, I implemented this function as a trait method on &str.
pub trait ExtraProperties {
fn to_digits(self) -> Vec<u32>;
}
impl ExtraProperties for &str {
fn to_digits(self) -> Vec<u32> {
let opt_vec: Option<Vec<u32>> = self
.chars()
.map(|ch| ch.to_digit(10))
.collect();
match opt_vec {
Some(vec_of_digits) => vec_of_digits,
None => vec![],
}
}
}
In this way, I transform &str to a vector containing digits.
fn main() {
let cnpj: &str = "123456789";
let nums: Vec<u32> = cnpj.to_digits();
println!("cnpj: {cnpj}"); // cnpj: 123456789
println!("nums: {nums:?}"); // nums: [1, 2, 3, 4, 5, 6, 7, 8, 9]
}
See the Rust Playground.

Using str and String interchangeably

Suppose I'm trying to do a fancy zero-copy parser in Rust using &str, but sometimes I need to modify the text (e.g. to implement variable substitution). I really want to do something like this:
fn main() {
let mut v: Vec<&str> = "Hello there $world!".split_whitespace().collect();
for t in v.iter_mut() {
if (t.contains("$world")) {
*t = &t.replace("$world", "Earth");
}
}
println!("{:?}", &v);
}
But of course the String returned by t.replace() doesn't live long enough. Is there a nice way around this? Perhaps there is a type which means "ideally a &str but if necessary a String"? Or maybe there is a way to use lifetime annotations to tell the compiler that the returned String should be kept alive until the end of main() (or have the same lifetime as v)?
Rust has exactly what you want in the form of the Cow (clone-on-write) type.
use std::borrow::Cow;
fn main() {
let mut v: Vec<_> = "Hello there $world!".split_whitespace()
.map(|s| Cow::Borrowed(s))
.collect();
for t in v.iter_mut() {
if t.contains("$world") {
*t.to_mut() = t.replace("$world", "Earth");
}
}
println!("{:?}", &v);
}
As @sellibitze correctly notes, to_mut() creates a new String, which causes a heap allocation to store the previously borrowed value. If you are sure you only have borrowed strings, then you can use
*t = Cow::Owned(t.replace("$world", "Earth"));
In case the Vec contains Cow::Owned elements, this would still throw away the allocation. You can prevent that by using the following very fragile and unsafe code inside your for loop. It manipulates the bytes of the UTF-8 string directly and relies on the fact that, once the $ sign is removed, the replacement "Earth" is exactly as long in bytes as "world".
let mut last_pos = 0; // so we don't start at the beginning every time
while let Some(pos) = t[last_pos..].find("$world") {
let p = pos + last_pos; // find returns an offset relative to last_pos
last_pos = p + 5; // resume after the 5 bytes of "Earth" (absolute offset)
unsafe {
let s = t.to_mut().as_mut_vec(); // operating on Vec is easier
s.remove(p); // remove $ sign
for (c, sc) in "Earth".bytes().zip(&mut s[p..]) {
*sc = c;
}
}
}
Note that this is tailored exactly to the "$world" -> "Earth" mapping. Any other mappings require careful consideration inside the unsafe code.
Use std::borrow::Cow, specifically as Cow<'a, str>, where 'a is the lifetime of the string being parsed.
use std::borrow::Cow;
fn main() {
let mut v: Vec<Cow<'static, str>> = vec![];
v.push("oh hai".into());
v.push(format!("there, {}.", "Mark").into());
println!("{:?}", v);
}
Produces:
["oh hai", "there, Mark."]
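The same idea can be packaged as a function that only allocates when a substitution actually happens. A minimal sketch; the function name substitute is made up for illustration and is not part of either answer:

```rust
use std::borrow::Cow;

// Hypothetical helper: borrow the token unchanged when there is nothing to
// substitute, and allocate a new String only when "$world" is present.
fn substitute(token: &str) -> Cow<'_, str> {
    if token.contains("$world") {
        Cow::Owned(token.replace("$world", "Earth"))
    } else {
        Cow::Borrowed(token)
    }
}

fn main() {
    let v: Vec<Cow<str>> = "Hello there $world!"
        .split_whitespace()
        .map(substitute)
        .collect();

    // Untouched tokens still point into the original input.
    assert!(matches!(v[0], Cow::Borrowed("Hello")));
    // Only the substituted token owns a heap allocation.
    assert!(matches!(v[2], Cow::Owned(_)));
    println!("{:?}", v);
}
```

Because Cow<str> dereferences to str, the rest of the parser can treat both variants uniformly.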

Iterate over a string, n elements at a time

I'm trying to iterate over a string, but in slices of length n instead of iterating over every character. The following code accomplishes this manually, but is there a more functional way to do this?
fn main() {
let string = "AAABBBCCC";
let offset = 3;
for (i, _) in string.chars().enumerate() {
if i % offset == 0 {
println!("{}", &string[i..(i+offset)]);
}
}
}
I would use a combination of Peekable and Take:
fn main() {
let string = "AAABBBCCC";
let mut z = string.chars().peekable();
while z.peek().is_some() {
let chunk: String = z.by_ref().take(3).collect();
println!("{}", chunk);
}
}
In other cases, Itertools::chunks might do the trick:
extern crate itertools;
use itertools::Itertools;
fn main() {
let string = "AAABBBCCC";
for chunk in &string.chars().chunks(3) {
for c in chunk {
print!("{}", c);
}
println!();
}
}
Standard warning about splitting strings
Be aware of issues with bytes / characters / code points / graphemes whenever you start splitting strings. With anything more complicated than ASCII characters, one character is not one byte and string slicing operates on bytes! There is also the concept of Unicode code points, but multiple Unicode characters may combine to form what a human thinks of as a single character. This stuff is non-trivial.
If you actually just have ASCII data, it may be worth it to store it as such, perhaps in a Vec<u8>. At the very least, I'd create a newtype that wraps a &str, only exposes ASCII-safe methods, and validates that the data is ASCII when created.
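A rough sketch of the newtype idea from the paragraph above. AsciiStr and its methods are hypothetical names chosen for this example, not an existing standard-library type:

```rust
// Hypothetical newtype: a &str that has been checked to be pure ASCII,
// so byte offsets and character offsets coincide.
#[derive(Debug, Clone, Copy, PartialEq)]
struct AsciiStr<'a>(&'a str);

impl<'a> AsciiStr<'a> {
    // Validate once, at construction time.
    fn new(s: &'a str) -> Option<AsciiStr<'a>> {
        if s.is_ascii() {
            Some(AsciiStr(s))
        } else {
            None
        }
    }

    // Chunks of `n` bytes (== `n` characters for ASCII); the last chunk may
    // be shorter. Panics if `n` is zero, like slice::chunks.
    fn chunks(self, n: usize) -> impl Iterator<Item = &'a str> {
        self.0
            .as_bytes()
            .chunks(n)
            // Cannot fail: every sub-slice of ASCII bytes is valid UTF-8.
            .map(|c| std::str::from_utf8(c).unwrap())
    }
}

fn main() {
    let s = AsciiStr::new("AAABBBCC").expect("input is ASCII");
    let chunks: Vec<&str> = s.chunks(3).collect();
    assert_eq!(chunks, ["AAA", "BBB", "CC"]);

    // Non-ASCII input is rejected up front instead of panicking later.
    assert!(AsciiStr::new("caf\u{e9}").is_none());
    println!("{:?}", chunks);
}
```

Doing the validation in the constructor means the byte-based chunking inside the type never has to worry about char boundaries.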
chunks() is not available for &str because it is not really well-defined on strings - do you want chunks with length in bytes, or characters, or grapheme clusters? If you know in advance that your string is in ASCII you can use the following code:
use std::str;
fn main() {
let string = "AAABBBCCC";
for chunk in str_chunks(string, 3) {
println!("{}", chunk);
}
}
fn str_chunks<'a>(s: &'a str, n: usize) -> Box<dyn Iterator<Item = &'a str> + 'a> {
Box::new(s.as_bytes().chunks(n).map(|c| str::from_utf8(c).unwrap()))
}
However, it will break immediately if your strings have non-ASCII characters inside them. I'm pretty sure that it is possible to implement an iterator which splits a string into chunks of code points or grapheme clusters; there is just no such thing in the standard library at the moment.
You can always implement your own iterator. Of course that still requires quite some code, but it's not at the location where you are working with the string. Therefore your loop stays readable.
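struct StringChunks<'a> {
    s: &'a str,
    step: usize,
    n: usize, // number of chars remaining in `s`
}
impl<'a> StringChunks<'a> {
    fn new(s: &'a str, step: usize) -> StringChunks<'a> {
        StringChunks {
            s,
            step,
            n: s.chars().count(),
        }
    }
}
impl<'a> Iterator for StringChunks<'a> {
    type Item = &'a str;
    fn next(&mut self) -> Option<&'a str> {
        // Stop when fewer than `step` chars remain (a shorter tail is dropped);
        // a zero step would never make progress.
        if self.step == 0 || self.step > self.n {
            return None;
        }
        // Byte offset of the char at index `step`, or the end of the string.
        // (Replaces the unstable `slice_chars`, which no longer exists.)
        let split = self
            .s
            .char_indices()
            .nth(self.step)
            .map_or(self.s.len(), |(i, _)| i);
        let (ret, rest) = self.s.split_at(split);
        self.s = rest;
        self.n -= self.step;
        Some(ret)
    }
}
fn main() {
    let string = "AAABBBCCC";
    for s in StringChunks::new(string, 3) {
        println!("{}", s);
    }
}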
Note that this splits after n Unicode chars (scalar values), so graphemes or similar might end up split up. A trailing chunk shorter than n is dropped.

Edit string in place with a function

I am trying to edit a string in place by passing it to mutate(), see below.
Simplified example:
fn mutate(string: &mut &str) -> &str {
string[0] = 'a'; // mutate string
string
}
fn do_something(string: &str) {
println!("{}", string);
}
fn main() {
let string = "Hello, world!";
loop {
string = mutate(&mut string);
do_something(string);
}
}
But I get the following compilation error:
main.rs:1:33: 1:37 error: missing lifetime specifier [E0106]
main.rs:1 fn mutate(string: &mut &str) -> &str {
^~~~
main.rs:1:33: 1:37 help: this function's return type contains a borrowed value, but the signature does not say which one of `string`'s 2 elided lifetimes it is borrowed from
main.rs:1 fn mutate(string: &mut &str) -> &str {
^~~~
Why do I get this error and how can I achieve what I want?
You can't change a string slice at all. &mut &str is not an appropriate type anyway, because it literally is a mutable pointer to an immutable slice. And all string slices are immutable.
In Rust strings are valid UTF-8 sequences, and UTF-8 is a variable-width encoding. Consequently, in general changing a character may change the length of the string in bytes. This can't be done with slices (because they always have fixed length) and it may cause reallocation for owned strings. Moreover, in 99% of cases changing a character inside a string is not what you really want.
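The variable-width point can be made concrete. A small sketch (the helper name len_before_after and the sample strings are arbitrary) showing that replacing one character with another can change the byte length of the string:

```rust
// Returns the byte length of `s` before and after replacing `from` with `to`.
fn len_before_after(s: &str, from: char, to: char) -> (usize, usize) {
    let replaced = s.replace(from, &to.to_string());
    (s.len(), replaced.len())
}

fn main() {
    // 'i' is one byte in UTF-8; U+00EF (i with diaeresis) needs two.
    assert_eq!('i'.len_utf8(), 1);
    assert_eq!('\u{ef}'.len_utf8(), 2);

    // Same number of characters, but the string grows by one byte.
    let (before, after) = len_before_after("naive", 'i', '\u{ef}');
    assert_eq!((before, after), (5, 6));
    println!("{} -> {} bytes", before, after);
}
```

A fixed-length slice has no room for that extra byte, which is why the edit needs an owned, growable String.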
In order to do what you want with Unicode code points you need to do something like this:
// `idx` is the byte offset at which the character to replace starts,
// because char_indices() yields byte offsets.
fn replace_char_at(s: &str, idx: usize, c: char) -> String {
    let mut r = String::with_capacity(s.len());
    for (i, d) in s.char_indices() {
        r.push(if i == idx { c } else { d });
    }
    r
}
However, this has O(n) efficiency because it has to iterate through the original slice, and it also won't work correctly with complex characters - it may replace a letter but leave an accent or vice versa.
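A more correct way to process text is to iterate over grapheme clusters, which handles diacritics and similar things correctly (mostly). In current Rust this needs the unicode-segmentation crate, since grapheme_indices is no longer in the standard library:
use unicode_segmentation::UnicodeSegmentation; // external crate

// `idx` is again a byte offset, here of the start of the grapheme cluster.
fn replace_grapheme_at(s: &str, idx: usize, c: &str) -> String {
    let mut r = String::with_capacity(s.len());
    for (i, g) in s.grapheme_indices(true) {
        r.push_str(if i == idx { c } else { g });
    }
    r
}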
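There is also some support for pure ASCII strings. The Ascii-typed API this answer was originally written against (into_ascii, to_ascii) is no longer part of the standard library; with current std the same idea can be expressed with is_ascii and into_bytes:
fn replace_ascii_char_at(s: String, idx: usize, c: char) -> String {
    assert!(s.is_ascii() && c.is_ascii(), "ASCII only");
    let mut bytes = s.into_bytes();
    bytes[idx] = c as u8; // lossless: `c` was just checked to be ASCII
    String::from_utf8(bytes).unwrap()
}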
It will panic if either s contains non-ASCII characters or c is not an ASCII character.
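For the original question, one way to get in-place editing is to own the data as a String and pass a &mut String. The sketch below uses String::replace_range; the function name set_first_char is an assumption made for this example and stands in for the question's mutate.

```rust
// Hypothetical replacement for `mutate`: overwrite the first character of an
// owned String in place. The String may grow or shrink if the old and new
// characters have different UTF-8 lengths. An empty String is left unchanged.
fn set_first_char(s: &mut String, c: char) {
    if let Some(first) = s.chars().next() {
        let mut buf = [0u8; 4];
        // `..first.len_utf8()` ends on a char boundary, so this cannot panic.
        s.replace_range(..first.len_utf8(), c.encode_utf8(&mut buf));
    }
}

fn main() {
    let mut string = String::from("Hello, world!");
    set_first_char(&mut string, 'a');
    assert_eq!(string, "aello, world!");
    println!("{}", string);
}
```

Since the function mutates through the reference, there is nothing to return and no lifetime annotation to get wrong.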
