Efficient String trim

I have a String value and I want to trim() it. I can do something like:
let trimmed = s.trim().to_string();
But that will always create a new String instance, even though in real life the string is much more likely to be already trimmed. In order to avoid the redundant new String creation, I could do something like this:
let ss = s.trim();
let trimmed = if ss.len() == s.len() { s } else { ss.to_string() };
But that is quite verbose. Is there a more concise way to do the above?

I can't think of a more concise way to do it. As it is, it seems maximally concise: You need to tell the compiler to trim the string, and then if the strings are the same length return the original string, otherwise make a new string. There isn't a "trim but return the original string if they're equal" method on String.
That said, you could make your own trait TrimOwned which had such a method, for example (implementation courtesy of StackOverflower):
trait TrimOwned {
    fn trim_owned(self) -> Self;
}
impl TrimOwned for String {
    fn trim_owned(self) -> Self {
        let s = self.trim();
        if s.len() == self.len() {
            self
        } else {
            s.to_string()
        }
    }
}
fn main() {
    let left = " left".to_string();
    let right = "right ".to_string();
    let both = " both ".to_string();
    println!("'{}'->'{}'", left.clone(), left.trim_owned());
    println!("'{}'->'{}'", right.clone(), right.trim_owned());
    println!("'{}'->'{}'", both.clone(), both.trim_owned());
}

Sorry, I cannot give you a more concise version. But I can tell you that there still is more room for premature optimization:
When trimming on the right, you only need to shorten the string; there is no need to reallocate.
You can't easily do the "shorten the string" trick when trimming on the left, since a string must always start at index 0 of its allocated area. But when you allocate a new string, you end up copying its contents anyway, so you might as well just move the string content inside the already allocated area.
One nifty and reasonably fast way of implementing this is String::drain:
fn trim_owned(mut trim: String) -> String {
    trim.drain(trim.trim_end().len()..);
    trim.drain(..(trim.len() - trim.trim_start().len()));
    trim
}
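As a rough check of the claim that the in-place approach avoids a new allocation, the sketch below compares the buffer address before and after trimming. The name trim_in_place is made up for this illustration, and String::truncate is swapped in for the first drain call since it does the same job on the right-hand side.

```rust
/// Trims in place: shortens on the right, then shifts the content left.
/// `trim_in_place` is a hypothetical name used only for this sketch.
fn trim_in_place(mut s: String) -> String {
    // Right side: only the length changes, the buffer is untouched.
    let end = s.trim_end().len();
    s.truncate(end);
    // Left side: remove the leading whitespace bytes, which moves the
    // remaining content to the front of the same buffer.
    let start = s.len() - s.trim_start().len();
    s.drain(..start);
    s
}

fn main() {
    let s = String::from("  both  ");
    let ptr_before = s.as_ptr();
    let trimmed = trim_in_place(s);
    assert_eq!(trimmed, "both");
    // Same heap buffer as before: nothing was reallocated.
    assert_eq!(trimmed.as_ptr(), ptr_before);
    println!("'{}'", trimmed);
}
```

The capacity of the returned String stays at its original value, so the bytes that were trimmed off are not released.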

Related

The best way to enumerate through a string in Rust? (chars() vs as_bytes())

I'm new to Rust, and I'm learning it using Rust Book.
Recently, I found this function there:
// Returns the number of characters in the first
// word of the given string
fn first_word(s: &String) -> usize {
    let bytes = s.as_bytes();
    for (i, &item) in bytes.iter().enumerate() {
        if item == b' ' {
            return i;
        }
    }
    s.len()
}
As you can see, the authors use the String::as_bytes() method to enumerate through the string. They then compare each byte against the byte literal b' ' to check whether we've reached the end of the first word.
As far as I know, there is another option, which looks much better:
fn first_word(s: &String) -> usize {
    for (i, item) in s.chars().enumerate() {
        if item == ' ' {
            return i;
        }
    }
    s.len()
}
Here, I'm using String::chars() method, and the function looks much cleaner.
So the question is: is there any difference between these two things? If so, which one is better and why?

If your string happens to be purely ASCII (where there is only one byte per character), the two functions should behave identically.
However, Rust strings are UTF-8, where a single character can be encoded as multiple bytes. Using s.chars() should therefore be preferred: your function will still work as expected if the string contains non-ASCII characters.
As @eggyal points out, Rust has a str::split_whitespace method which returns an iterator over words, and this method splits on all whitespace (instead of just spaces). You could use it like so:
fn first_word(s: &String) -> usize {
    if let Some(word) = s.split_whitespace().next() {
        word.len()
    } else {
        s.len()
    }
}
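One detail worth spelling out: the index produced by chars().enumerate() counts characters, while s.len() and slice ranges count bytes, so the two agree only for ASCII input. Searching the bytes for b' ' is itself safe on UTF-8, because ASCII bytes never occur inside a multi-byte sequence. If a byte offset is what you need (for slicing, say), str::char_indices gives both at once. A minimal sketch, with first_word_end as a made-up name:

```rust
/// Returns the byte offset at which the first word ends.
/// `first_word_end` is a hypothetical name used only for this sketch.
fn first_word_end(s: &str) -> usize {
    // char_indices yields (byte offset, char) pairs.
    for (i, c) in s.char_indices() {
        if c == ' ' {
            return i;
        }
    }
    s.len()
}

fn main() {
    let s = "héllo wörld";
    let byte_idx = first_word_end(s);
    // Character position of the first space, for comparison.
    let char_idx = s.chars().position(|c| c == ' ').unwrap_or(0);
    // 'é' takes two bytes, so the byte offset is one larger.
    assert_eq!(byte_idx, 6);
    assert_eq!(char_idx, 5);
    // Only the byte offset is valid as a slice boundary here.
    assert_eq!(&s[..byte_idx], "héllo");
    println!("{} {}", byte_idx, char_idx);
}
```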

How to parse &str with named parameters?

I am trying to find the best way to parse a &str and extract out the COMMAND_TYPE and named parameters. The named parameters can be anything.
Here is the proposed string (it can be changed).
COMMAND_TYPE(param1:2222,param2:"the quick \"brown\" fox, blah,", param3:true)
I have been trying a few ways to extract the COMMAND_TYPE, which seems fairly simple:
pub fn parse_command(command: &str) -> Option<String> {
    let mut matched = String::new();
    let mut chars = command.chars();
    while let Some(next) = chars.next() {
        if next != '(' {
            matched.push(next);
        } else {
            break;
        }
    }
    if matched.is_empty() {
        None
    } else {
        Some(matched)
    }
}
Extracting the parameters from within the brackets seems straightforward too:
pub fn parse_params(command: &str) -> Option<&str> {
    let start = command.find("(");
    let end = command.rfind(")");
    if start.is_some() && end.is_some() {
        Some(&command[start.unwrap() + 1..end.unwrap()])
    } else {
        None
    }
}
I have been looking at the nom crate and that seems fairly powerful (and complicated), so I am not sure if I really need to use it.
How do I extract the named parameters in between the brackets into a HashMap?

Your code seems to work for extracting the command and the full parameter list. If you don't need to parse something more complex than that, you can probably avoid using nom as a dependency.
But you will probably have problems if you want to parse each parameter individually: your format seems broken. In your example, there are no escape characters for either the double quotes or the commas, so param2 can't be extracted cleanly.
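If the format is pinned down so that a backslash escapes the next character inside a quoted value (the sample line in the question is written that way), a small hand-written scanner may be enough to fill a HashMap without pulling in nom. The following is only a sketch under that assumption. parse_named_params and insert_pair are made-up names, all values are kept as strings, and whitespace outside quotes is simply dropped.

```rust
use std::collections::HashMap;

/// Splits `key:value` pairs on commas that are outside double quotes.
/// Assumption: inside quotes, a backslash makes the next character literal.
fn parse_named_params(params: &str) -> HashMap<String, String> {
    let mut map = HashMap::new();
    let mut current = String::new();
    let mut in_quotes = false;
    let mut chars = params.chars();
    while let Some(c) = chars.next() {
        match c {
            // Escaped character inside a quoted value: keep it verbatim.
            '\\' if in_quotes => {
                if let Some(next) = chars.next() {
                    current.push(next);
                }
            }
            // Quotes toggle the state and are not stored in the value.
            '"' => in_quotes = !in_quotes,
            // An unquoted comma ends the current pair.
            ',' if !in_quotes => {
                insert_pair(&mut map, &current);
                current.clear();
            }
            // Whitespace outside quotes is ignored.
            c if c.is_whitespace() && !in_quotes => {}
            _ => current.push(c),
        }
    }
    insert_pair(&mut map, &current);
    map
}

/// Stores one `key:value` pair, splitting at the first colon.
fn insert_pair(map: &mut HashMap<String, String>, pair: &str) {
    if let Some((key, value)) = pair.split_once(':') {
        map.insert(key.to_string(), value.to_string());
    }
}

fn main() {
    let input = r#"param1:2222,param2:"the quick \"brown\" fox, blah,", param3:true"#;
    let map = parse_named_params(input);
    assert_eq!(map["param1"], "2222");
    assert_eq!(map["param2"], "the quick \"brown\" fox, blah,");
    assert_eq!(map["param3"], "true");
    println!("{:?}", map);
}
```

This does no error reporting: an unterminated quote or a pair without a colon is silently accepted or skipped. That is where a parser library starts to pay off.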

Why does to_ascii_lowercase return a String rather than a Cow<str>?

str::to_ascii_lowercase returns a String. Why doesn't it return a Cow<str> just like to_string_lossy or String::from_utf8_lossy?
The same applies to str::to_ascii_uppercase.

The reason you might want it to return a Cow<str> is presumably that the string may already be lower case. However, detecting this case might also degrade performance when the string is not already lower case, which intuitively seems like the most common scenario.
You can, of course, create your own function that checks whether the string is already lower case, calls to_ascii_lowercase() only when needed, and returns a Cow<str>:
use std::borrow::Cow;

fn my_to_ascii_lowercase<'a>(s: &'a str) -> Cow<'a, str> {
    let bytes = s.as_bytes();
    // Borrow the input unchanged when it has no ASCII uppercase bytes.
    if !bytes.iter().any(u8::is_ascii_uppercase) {
        Cow::Borrowed(s)
    } else {
        Cow::Owned(s.to_ascii_lowercase())
    }
}
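To see which branch is taken, a caller can match on the returned Cow. The sketch below restates the idea under a different, made-up name (to_ascii_lowercase_cow) so that it compiles on its own:

```rust
use std::borrow::Cow;

/// Hypothetical helper: borrows when nothing needs to change.
fn to_ascii_lowercase_cow(s: &str) -> Cow<'_, str> {
    if s.bytes().any(|b| b.is_ascii_uppercase()) {
        Cow::Owned(s.to_ascii_lowercase())
    } else {
        Cow::Borrowed(s)
    }
}

fn main() {
    let already = to_ascii_lowercase_cow("hello");
    let changed = to_ascii_lowercase_cow("Hello");
    // No allocation for input that is already lower case.
    assert!(matches!(already, Cow::Borrowed(_)));
    // A new String only when a byte actually had to change.
    assert!(matches!(changed, Cow::Owned(_)));
    assert_eq!(changed, "hello");
    // Both variants deref to &str, so callers can treat them alike.
    println!("{} {}", already.len(), changed.len());
}
```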

Lifetime of references in closures

I need a closure to refer to parts of an object in its enclosing environment. The object is created within the environment and is scoped to it, but once created it could be safely moved to the closure.
The use case is a function that does some preparatory work and returns a closure that will do the rest of the work. The reason for this design is a set of execution constraints: the first part of the work involves allocation, and the remainder must do no allocation. Here is a minimal example:
fn stage_action() -> Box<dyn Fn()> {
    // split a freshly allocated string into pieces
    let string = String::from("a:b:c");
    let substrings = vec![&string[0..1], &string[2..3], &string[4..5]];
    // the returned closure refers to the substrings vector of
    // slices without any further allocation or modification
    Box::new(move || {
        for sub in substrings.iter() {
            println!("{}", sub);
        }
    })
}
fn main() {
    let action = stage_action();
    // ...executed some time later:
    action();
}
This fails to compile, correctly stating that &string[0..1] and others must not outlive string. But if string were moved into the closure, there would be no problem. Is there a way to force that to happen, or another approach that would allow the closure to refer to parts of an object created just outside of it?
I've also tried creating a struct with the same functionality to make the move fully explicit, but that doesn't compile either. Again, compilation fails with the error that &later[0..1] and others only live until the end of function, but "borrowed value must be valid for the static lifetime".
Even completely avoiding a Box doesn't appear to help - the compiler complains that the object doesn't live long enough.

There's nothing specific to closures here; it's the equivalent of:
fn main() {
    let string = String::from("a:b:c");
    let substrings = vec![&string[0..1], &string[2..3], &string[4..5]];
    let string = string;
}
You are attempting to move the String while there are outstanding borrows. In my example here, it's to another variable; in your example it's to the closure's environment. Either way, you are still moving it.
Additionally, you are trying to move the substrings into the same closure environment as the owning string. That makes the entire problem equivalent to Why can't I store a value and a reference to that value in the same struct?:
struct Environment<'a> {
    string: String,
    substrings: Vec<&'a str>,
}
fn thing<'a>() -> Environment<'a> {
    let string = String::from("a:b:c");
    let substrings = vec![&string[0..1], &string[2..3], &string[4..5]];
    Environment {
        string: string,
        substrings: substrings,
    }
}
"The object is created within the environment and is scoped to it"
I'd disagree; string and substrings are created outside of the closure's environment and moved into it. It's that move that's tripping you up.
"once created it could be safely moved to the closure."
In this case that's true, but only because you, the programmer, can guarantee that the address of the string data inside the String will remain constant. You know this for two reasons:
String is internally implemented with a heap allocation, so moving the String doesn't move the string data.
The String will never be mutated, which could cause the string to reallocate, invalidating any references.
The easiest solution for your example is to simply convert the slices to Strings and let the closure own them completely. This may even be a net benefit if that means you can free a large string in favor of a few smaller strings.
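A minimal sketch of that owned-strings approach follows. All allocation happens while staging, and the closure only reads what it owns. Splitting on ':' instead of fixed ranges and returning the piece count are choices made for this illustration, not part of the original example.

```rust
fn stage_action() -> impl Fn() -> usize {
    let string = String::from("a:b:c");
    // Allocation happens here, during staging: each piece gets its own String.
    let substrings: Vec<String> = string.split(':').map(String::from).collect();
    // `string` is dropped when this function returns; the closure
    // captures only `substrings`, which it owns outright.
    move || {
        for sub in &substrings {
            println!("{}", sub);
        }
        // Returned only so the result can be checked in this sketch.
        substrings.len()
    }
}

fn main() {
    let action = stage_action();
    // ...executed some time later:
    assert_eq!(action(), 3);
}
```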
Otherwise, you meet the criteria laid out under "There is a special case where the lifetime tracking is overzealous" in Why can't I store a value and a reference to that value in the same struct?, so you can use crates like:
owning_ref
use owning_ref::RcRef; // 0.4.1
use std::rc::Rc;
fn stage_action() -> impl Fn() {
    let string = RcRef::new(Rc::new(String::from("a:b:c")));
    let substrings = vec![
        string.clone().map(|s| &s[0..1]),
        string.clone().map(|s| &s[2..3]),
        string.clone().map(|s| &s[4..5]),
    ];
    move || {
        for sub in &substrings {
            println!("{}", &**sub);
        }
    }
}
fn main() {
    let action = stage_action();
    action();
}
ouroboros
use ouroboros::self_referencing; // 0.2.3
fn stage_action() -> impl Fn() {
    #[self_referencing]
    struct Thing {
        string: String,
        #[borrows(string)]
        substrings: Vec<&'this str>,
    }
    let thing = ThingBuilder {
        string: String::from("a:b:c"),
        substrings_builder: |s| vec![&s[0..1], &s[2..3], &s[4..5]],
    }
    .build();
    move || {
        thing.with_substrings(|substrings| {
            for sub in substrings {
                println!("{}", sub);
            }
        })
    }
}
fn main() {
    let action = stage_action();
    action();
}
Note that I'm no expert user of either of these crates, so these examples may not be the best use of them.

error: use of moved value - should I use "&" or "mut" or something else?

My code:
enum MyEnum1 {
    //....
}
struct Struct1 {
    field1: MyEnum1,
    field2: String
}
fn fn1(a: Struct1, b: String, c: String) -> String {
    let let1 = fn2(a.field1);
    let let2 = fn3(let1, b, c);
    format!("{} something 123 {}", let1, let2)
}
fn fn2(a: MyEnum1) -> String {
    //....
}
fn fn3(a: MyEnum1, b: Struct1) -> String {
    //....
}
error: use of moved value: `a.field1`
error: use of moved value: `let1`
How can I fix them? Should I add & to the parameters of fn2 and fn3? Or mut? I can't understand the idea of how to fix these kinds of errors.

These errors come from the most important concept in Rust: ownership. You should read the official book, especially the chapter on ownership. That will help you understand "how to fix these kinds of errors".
In short, specifically in your code, the problem is that String is a non-copyable type, that is, String values are not copied when passed to functions or assigned to local variables, they are moved. This means that wherever they were before, they are not accessible from there anymore.
Let's look at your function:
enum MyEnum1 {
    //....
}
struct Struct1 {
    field1: MyEnum1,
    field2: String
}
fn fn1(a: Struct1, b: String, c: String) -> String {
    let let1 = fn2(a.field1);
    let let2 = fn3(let1, b, c);
    format!("{} something 123 {}", let1, let2)
}
fn fn2(a: MyEnum1) -> String {
    //....
}
None of the types here are automatically copyable (they don't implement the Copy trait). String is not copyable because it is a heap-allocated string and copying would need a fresh allocation (an expensive operation that is better not left implicit). MyEnum1 is not copyable because it does not implement Copy (with #[derive(Copy, Clone)], for example; it is unclear whether it can be made copyable because you didn't provide its variants). Struct1 is not copyable because it contains non-copyable types.
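To illustrate the Copy distinction, here is a minimal sketch with a hypothetical fieldless enum, Mode, standing in for an enum whose variants allow deriving Copy. Passing it by value copies it, so the original stays usable:

```rust
// `Mode` and `describe` are hypothetical names used only for this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Mode {
    Fast,
    Slow,
}

fn describe(m: Mode) -> &'static str {
    match m {
        Mode::Fast => "fast",
        Mode::Slow => "slow",
    }
}

fn main() {
    let m = Mode::Fast;
    let first = describe(m); // `m` is copied, not moved
    let second = describe(m); // so it can be passed again
    assert_eq!(first, second);
    assert_eq!(m, Mode::Fast);
    println!("{} {}", first, describe(Mode::Slow));
}
```

Removing Copy from the derive list turns the second call into a "use of moved value" error.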
In fn1 you invoke fn2, passing it field1 and getting a String back. Then you immediately pass this String to fn3. Because String is not copyable, whatever is stored in let1 is moved into the called function, making let1 inaccessible. This is what the "use of moved value" error is about. (The code you provided can't cause the "use of moved value: a.field1" error, so it probably came from the parts you omitted, but the basic idea is exactly the same.)
There are several ways to fix these errors, but the most natural and common one is indeed to use borrowed references. In general if you only want to read some non-copyable value in a function you should pass it there by reference:
fn use_myenum(e: &MyEnum1)
For strings and arrays, however, the better way would be to pass slices:
fn use_str(s: &str) { ... }
let s: String = ...;
use_str(&s); // here &String is automatically converted to &str
You can find more on slices in the book.
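Putting this together, one possible borrowing version of fn1 is sketched below. The enum variants, the bodies of fn2 and fn3, and the three-argument signature of fn3 are assumptions filled in for illustration, since the question elides them.

```rust
// Variants and function bodies are hypothetical; the question omits them.
#[derive(Debug)]
enum MyEnum1 {
    A,
    B,
}

struct Struct1 {
    field1: MyEnum1,
    field2: String,
}

// Borrow the enum instead of taking ownership of it.
fn fn2(a: &MyEnum1) -> String {
    format!("{:?}", a)
}

// Take string slices: the caller keeps its Strings.
fn fn3(a: &str, b: &str, c: &str) -> String {
    format!("{}-{}-{}", a, b, c)
}

fn fn1(a: &Struct1, b: &str, c: &str) -> String {
    let let1 = fn2(&a.field1);
    let let2 = fn3(&let1, b, c); // &String coerces to &str
    // `let1` was only borrowed above, so it is still usable here.
    format!("{} something 123 {}", let1, let2)
}

fn main() {
    let a = Struct1 { field1: MyEnum1::A, field2: String::from("x") };
    let out = fn1(&a, "b", "c");
    assert_eq!(out, "A something 123 A-b-c");
    // `a` was only borrowed, so its fields remain accessible.
    println!("{} {} {}", out, a.field2, fn2(&MyEnum1::B));
}
```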
