Most efficient way to keep collection of string references

What is the most efficient way to keep a collection of references to strings in Rust?
Specifically, I have the following as the beginning of some code to parse command line arguments (option parsing to be added):
let args: Vec<String> = env::args().collect();
let mut files: Vec<&String> = Vec::new();
let mut i = 1;
while i < args.len() {
    let arg = &args[i];
    i += 1;
    if arg.as_bytes()[0] != b'-' {
        files.push(arg);
        continue;
    }
}
args is as recommended in https://doc.rust-lang.org/book/ch12-01-accepting-command-line-arguments.html declared as Vec<String>. As I understand it, that means new strings are constructed, which is mildly surprising; I would've expected that the command line arguments already exist in memory, and it would only be necessary to make a vector of references to the existing strings. But the compiler seems to concur that it needs to be Vec<String>.
It would seem inefficient to do the same for files; there is surely no need for further copying. Instead, I have declared it as Vec<&String>, which as I understand it, means only creating a vector of references to the existing strings, which is optimal. (Not that it makes a measurable performance difference for command line arguments, but I want to figure this out now, so I can get it right later when dealing with much larger data.)
Where I am slightly confused is that Rust seems to frequently recommend &str over &String, and indeed the compiler is happy to have files hold either &String or &str.
My best guess right now is that str, being an object that refers to a slice of a string, is most efficient when you want to keep a reference to just part of the string, but when you know you want the whole string, it is better to skip the overhead of creating a slice object, and just keep &String.
Is the above correct, or am I missing something?

args is as recommended in https://doc.rust-lang.org/book/ch12-01-accepting-command-line-arguments.html declared as Vec<String>. As I understand it, that means new strings are constructed, which is mildly surprising; I would've expected that the command line arguments already exist in memory
The command-line arguments do exist in memory, but:
- they are not String, they are not even guaranteed to be UTF-8
- they are not in a Vec layout
Fundamentally there isn't even any prescription as to their storage: all you know is that they're C strings (nul-terminated) and you get an array of pointers to those, whose last element is a null pointer.
Which is why args is an iterator of String: it will lazily decode and validate each argument as you request it, in fact you can check its source code:
pub fn args() -> Args {
    Args { inner: args_os() }
}

#[stable(feature = "env", since = "1.0.0")]
impl Iterator for Args {
    type Item = String;
    fn next(&mut self) -> Option<String> {
        self.inner.next().map(|s| s.into_string().unwrap())
    }
    fn size_hint(&self) -> (usize, Option<usize>) {
        self.inner.size_hint()
    }
}
Now I couldn't tell you why args_os yields OsString rather than OsStr, I would assume portability of some sort (e.g. some platforms might not guarantee the args data lives for the entirety of the program).
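Since args() unwraps the result of into_string(), it panics on an argument that is not valid Unicode. As a minimal sketch of decoding the arguments yourself, args_os() can be consumed directly; the helper name decode_args and the choice to fall back to a lossy conversion are assumptions made for illustration, not something the answer prescribes.

```rust
use std::env;
use std::ffi::OsString;

/// Turns raw OS arguments into Strings without panicking.
/// (`decode_args` is a hypothetical helper; the lossy fallback is one possible policy.)
fn decode_args<I>(raw: I) -> Vec<String>
where
    I: IntoIterator<Item = OsString>,
{
    raw.into_iter()
        .map(|arg| {
            // into_string() hands the original OsString back on failure,
            // so it can be converted lossily instead of unwrapping.
            arg.into_string()
                .unwrap_or_else(|bad| bad.to_string_lossy().into_owned())
        })
        .collect()
}

fn main() {
    let args = decode_args(env::args_os());
    println!("{:?}", args);
}
```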
My best guess right now is that str, being an object that refers to a slice of a string, is most efficient when you want to keep a reference to just part of the string, but when you know you want the whole string, it is better to skip the overhead of creating a slice object, and just keep &String.
Is the above correct, or am I missing something?
&String exists only for regularity (in the sense that it's a natural outgrowth of shared references and String existing concurrently), it's not actually useful: an &String only lets you access readonly / immutable methods of String, all of which are really provided by str aside from capacity() (which is rarely useful) and a handful of methods duplicated from str to String (I assume for efficiency) like len or is_empty.
&str is also generally more efficient than &String: while its size is 2 words (pointer, length) rather than one (pointer), it points directly to the relevant data rather than pointing to a pointer to the relevant data (and requiring a dereference to access the length property). As such, &String is rarely considered useful, and clippy warns against it in function parameters by default (likewise &Vec<T>, since &[T] is usually better for the same reason).
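As a minimal sketch of this advice, the files vector from the question could hold &str rather than &String. The helper name positional_args is hypothetical, and starts_with stands in for the byte-index check so that an empty argument does not panic.

```rust
use std::env;
use std::mem::size_of;

/// Collects the arguments that do not look like options, borrowing from `args`.
/// (`positional_args` is a hypothetical helper name, not part of the question.)
fn positional_args(args: &[String]) -> Vec<&str> {
    args.iter()
        .skip(1) // args[0] is conventionally the program name
        .map(String::as_str) // &String -> &str; no string data is copied
        .filter(|arg| !arg.starts_with('-'))
        .collect()
}

fn main() {
    let args: Vec<String> = env::args().collect();
    let files = positional_args(&args);
    println!("{:?}", files);

    // &str is a (pointer, length) pair; &String is a single pointer.
    println!(
        "&str: {} bytes, &String: {} bytes",
        size_of::<&str>(),
        size_of::<&String>()
    );
}
```

The vector still only borrows from args, so nothing beyond the vector of slices itself is allocated.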

Related

Rust Manipulating Strings in Functions

I'm new to Rust, and I want to process strings in a function in Rust and then return a struct that contains the results of that processing to use in more functions. This is very simplified and a bit messier because of all my attempts to get this working, but:
struct Strucc<'a> {
    string: &'a str,
    booool: bool
}
fn do_stuff2<'a>(input: &'a str) -> Result<Strucc, &str> {
    let to_split = input.to_lowercase();
    let splitter = to_split.split("/");
    let mut array: Vec<&str> = Vec::new();
    for split in splitter {
        array.push(split);
    }
    let var = array[0];
    println!("{}", var);
    let result = Strucc{
        string: array[0],
        booool: false
    };
    Ok(result)
}
The issue is that to convert the &str to lowercase, I have to create a new String that's owned by the function.
As I understand it, the reason this won't compile is because when I split the new String I created, all the &strs I get from it are substrings of the String, which are all still owned by the function, and so when the value is returned and that String goes out of scope, the value in the struct I returned gets erased.
I tried to fix this with lifetimes (as you can see in the function definition), but from what I can tell I'd have to give the String a lifetime which I can't do as far as I'm aware because it isn't borrowed. Either that or I need to make the struct own that String (which I also don't understand how to do, nor does it seem reasonable as I'd have to make the struct mutable).
Also as a sidenote: Previously I have tried to just use a String in the struct but I want to define constants which won't work with that, and I still don't think it would solve the issue. I've also tried to use .clone() in various places just in case but had no luck (though I know why this shouldn't work anyway).
I have been looking for some solution for this for hours and it feels like such a small step so I feel I may be asking the wrong questions or have missed something simple but please explain it like I'm five because I'm very confused.
I think you misunderstand what &str actually is. &str is just a pointer to the string data plus a length. The point of &str is to be an immutable reference to a specific string, which enables all sorts of nice optimizations. When you attempt to turn the &str lowercase, Rust needs somewhere to put the data, and the only place to put it would be a String, because Strings own their data. Take a look at this post for more information.
Your goal is unachievable without Strucc containing a String: .to_lowercase() has to create new data, and that data must be stored somewhere that outlives the function before anything can refer to it. The best place to put the resulting data is the returned struct, i.e. Strucc, and therefore Strucc must contain a String.
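A minimal sketch of what that could look like, reusing the names from the question. The &'static str error type is an assumption, since the question never shows an error path.

```rust
// The struct now owns its text, so it needs no lifetime parameter.
struct Strucc {
    string: String,
    booool: bool,
}

// The error type is a placeholder; the question does not show when Err is returned.
fn do_stuff2(input: &str) -> Result<Strucc, &'static str> {
    let lowered = input.to_lowercase();
    // split always yields at least one piece, so the fallback is never reached in practice.
    let first = lowered.split('/').next().unwrap_or("");
    Ok(Strucc {
        string: first.to_owned(), // copy the piece out of the local String
        booool: false,
    })
}

fn main() {
    let result = do_stuff2("Hello/World").unwrap();
    println!("{} {}", result.string, result.booool); // prints "hello false"
}
```

The struct does not have to be mutable for this; it simply owns the String it was built with.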
Also as a sidenote: Previously I have tried to just use a String in the struct but I want to define constants which won't work with that, and I still don't think it would solve the issue.
You can use "x".to_owned() to create a String from a string literal.
If you're trying to create a global constant, look at once_cell's lazy global initialization.
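Another option for the constants concern, which the answer above does not mention, is to store a Cow<'static, str> in the struct. This is only a sketch: the constant DEFAULT and the function lowered are hypothetical names chosen for illustration.

```rust
use std::borrow::Cow;

struct Strucc {
    // Either a borrowed string literal or an owned String.
    string: Cow<'static, str>,
    booool: bool,
}

// A constant can borrow a literal, so no allocation is needed here.
const DEFAULT: Strucc = Strucc {
    string: Cow::Borrowed("default"),
    booool: true,
};

// Values computed at run time own their data instead.
fn lowered(input: &str) -> Strucc {
    Strucc {
        string: Cow::Owned(input.to_lowercase()),
        booool: false,
    }
}

fn main() {
    let a = DEFAULT;
    let b = lowered("HELLO");
    println!("{} {} / {} {}", a.string, a.booool, b.string, b.booool);
}
```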

Confused about ownership in situations involving lines and map

fn problem() -> Vec<&'static str> {
    let my_string = String::from("First Line\nSecond Line");
    my_string.lines().collect()
}
This fails with the compilation error:
|
7 | my_string.lines().collect()
| ---------^^^^^^^^^^^^^^^^^^
| |
| returns a value referencing data owned by the current function
| `my_string` is borrowed here
I understand what this error means - it's to stop you returning a reference to a value which has gone out of scope. Having looked at the type signatures of the functions involved, it appears that the problem is with the lines method, which borrows the string it's called on. But why does this matter? I'm iterating over the lines of the string in order to get a vector of the parts, and what I'm returning is this "new" vector, not anything that would (illegally) directly reference my_string.
(I'm aware I could fix this particular example very easily by just using the string literal rather than converting to an "owned" string with String::from. This is a toy example to reproduce the problem - in my "real" code the string variable is read from a file, so I obviously can't use a literal.)
What's even more mysterious to me is that the following variation on the function, which to me ought to suffer from the same problem, works fine:
fn this_is_ok() -> Vec<i32> {
    let my_string = String::from("1\n2\n3\n4");
    my_string.lines().map(|n| n.parse().unwrap()).collect()
}
The reason can't be map doing some magic, because this also fails:
fn also_fails() -> Vec<&'static str> {
    let my_string = String::from("First Line\nSecond Line");
    my_string.lines().map(|s| s).collect()
}
I've been playing about for quite a while, trying various different functions inside the map - and some pass and some fail, and I've honestly no idea what the difference is. And all this is making me realise that I have very little handle on how Rust's ownership/borrowing rules work in non-trivial cases, even though I thought I at least understood the basics. So if someone could give me a relatively clear and comprehensive guide to what is going on in all these examples, and how it might be possible to fix those which fail, in some straightforward way, I would be extremely grateful!
The key is in the type of the value yielded by lines: &str. In order to avoid unnecessary clones, lines returns slices that borrow from the string it's called on, and when you collect them into a Vec, that Vec's elements are simply references into your string. If the function were allowed to return that Vec, the string would be dropped at the end of the function and every reference in the Vec would dangle. Remember, &str is a borrowed string, and String is an owned string.
The parsing works because you take those &strs then you read them into an i32, so the data is transferred to a new value and you no longer need a reference to the original string.
To fix your problem, simply use str::to_owned to convert each element into a String:
fn problem() -> Vec<String> {
    let my_string = String::from("First Line\nSecond Line");
    my_string.lines().map(|v| v.to_owned()).collect()
}
It should be noted that to_string also works, and that to_owned is actually part of the ToOwned trait, so it is useful for other borrowed types as well.
For references to sized values (str is unsized so this doesn't apply), such as an Iterator<Item = &i32>, you can simply use Iterator::cloned to clone every element so they are no longer references.
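A small sketch of the Iterator::cloned case, with a hypothetical helper named owned_values: the collected vector holds copies, so it stays usable after the source is gone.

```rust
/// Copies every referenced i32 into a new, independent Vec.
/// (`owned_values` is a hypothetical name used only for this illustration.)
fn owned_values<'a, I>(refs: I) -> Vec<i32>
where
    I: Iterator<Item = &'a i32>,
{
    refs.cloned().collect()
}

fn main() {
    let numbers = vec![1, 2, 3];
    let copies = owned_values(numbers.iter());
    drop(numbers); // the copies do not borrow from `numbers`
    println!("{:?}", copies); // prints "[1, 2, 3]"
}
```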
An alternative solution would be to take the String as an argument so it, and therefore references to it, can live past the scope of the function:
fn problem(my_string: &str) -> Vec<&str> {
    my_string.lines().collect()
}
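Seen from the caller's side, this might look like the sketch below, where the caller owns the String (for example, file contents) and keeps it alive for as long as the line slices are used. The function is renamed lines_of here purely for illustration.

```rust
/// Same idea as above: the returned slices borrow from `text`.
fn lines_of(text: &str) -> Vec<&str> {
    text.lines().collect()
}

fn main() {
    // Stands in for data read from a file at run time.
    let contents = String::from("First Line\nSecond Line");
    let lines = lines_of(&contents);
    // `contents` is still alive here, so the borrowed slices are valid.
    println!("{:?}", lines); // prints ["First Line", "Second Line"]
}
```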
The problem here is that this line:
let my_string = String::from("First Line\nSecond Line");
copies the string data to a buffer allocated on the heap (so no longer 'static). Then lines returns references to that heap-allocated buffer.
Note that &str also implements a lines method, so you don't need to copy the string data to the heap, you can use your string directly:
fn problem() -> Vec<&'static str> {
    let my_string = "First Line\nSecond Line";
    my_string.lines().collect()
}
which avoids all unnecessary allocations and copying.

How to return a Result containing a formatted string, as &str [duplicate]

There are several questions that seem to be about the same problem I'm having. For example see here and here. Basically I'm trying to build a String in a local function, but then return it as a &str. Slicing isn't working because the lifetime is too short. I can't use str directly in the function because I need to build it dynamically. However, I'd also prefer not to return a String since the nature of the object this is going into is static once it's built. Is there a way to have my cake and eat it too?
Here's a minimal non-compiling reproduction:
fn return_str<'a>() -> &'a str {
    let mut string = "".to_string();
    for i in 0..10 {
        string.push_str("ACTG");
    }
    &string[..]
}
No, you cannot do it. There are at least two explanations why it is so.
First, remember that references are borrowed, i.e. they point to some data but do not own it, it is owned by someone else. In this particular case the string, a slice to which you want to return, is owned by the function because it is stored in a local variable.
When the function exits, all its local variables are destroyed; this involves calling destructors, and the destructor of String frees the memory used by the string. However, you want to return a borrowed reference pointing to the data allocated for that string. It means that the returned reference immediately becomes dangling - it points to invalid memory!
Rust was created, among other things, to prevent such problems. Therefore, in Rust it is impossible to return a reference pointing into local variables of the function, which is possible in languages like C.
There is also another explanation, slightly more formal. Let's look at your function signature:
fn return_str<'a>() -> &'a str
Remember that lifetime and generic parameters are, well, parameters: they are set by the caller of the function. For example, some other function may call it like this:
let s: &'static str = return_str();
This requires 'a to be 'static, but that is of course impossible: your function does not return a reference to static memory, it returns a reference with a strictly shorter lifetime. Such a function definition would be unsound, so the compiler rejects it.
Anyway, in such situations you need to return a value of an owned type, in this particular case it will be an owned String:
fn return_str() -> String {
    let mut string = String::new();
    for _ in 0..10 {
        string.push_str("ACTG");
    }
    string
}
In certain cases, you are passed a string slice and may conditionally want to create a new string. In these cases, you can return a Cow. This allows for the reference when possible and an owned String otherwise:
use std::borrow::Cow;

fn return_str<'a>(name: &'a str) -> Cow<'a, str> {
    if name.is_empty() {
        let name = "ACTG".repeat(10);
        name.into()
    } else {
        name.into()
    }
}
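To make the two branches visible, here is a small usage sketch. The function is renamed sequence_or_default for illustration; its body follows the Cow version above.

```rust
use std::borrow::Cow;

/// Borrows `name` when it is non-empty, otherwise builds a new String.
fn sequence_or_default(name: &str) -> Cow<'_, str> {
    if name.is_empty() {
        Cow::Owned("ACTG".repeat(10))
    } else {
        Cow::Borrowed(name)
    }
}

fn main() {
    let borrowed = sequence_or_default("GATTACA");
    let owned = sequence_or_default("");

    // No allocation happened for the first call; the second one owns its data.
    assert!(matches!(borrowed, Cow::Borrowed(_)));
    assert!(matches!(owned, Cow::Owned(_)));

    // Cow<str> dereferences to str, so both can be used like a &str.
    println!("{} {}", borrowed, owned.len()); // prints "GATTACA 40"
}
```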
You can choose to leak memory to convert a String to a &'static str:
fn return_str() -> &'static str {
    let string = "ACTG".repeat(10);
    Box::leak(string.into_boxed_str())
}
This is a really bad idea in many cases as the memory usage will grow forever every time this function is called.
If you wanted to return the same string every call, see also:
How to create a static string at compile time
The problem is that you are trying to create a reference to a string that will disappear when the function returns.
A simple solution in this case is to pass in the empty string to the function. This will explicitly ensure that the referred string will still exist in the scope where the function returns:
fn return_str(s: &mut String) -> &str {
    for _ in 0..10 {
        s.push_str("ACTG");
    }
    &s[..]
}

fn main() {
    let mut s = String::new();
    let s = return_str(&mut s);
    assert_eq!("ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG", s);
}
Code in Rust Playground:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=2499ded42d3ee92d6023161fe82e9b5f
This is an old question but a very common one. There are many answers but none of them addresses the glaring misconception people have about the strings and string slices, which stems from not knowing their true nature.
But let's start with the obvious question before addressing the implied one: Can we return a reference to a local variable?
What we are asking to achieve is the textbook definition of a dangling pointer. Local variables are dropped when the function completes its execution. In other words, they are popped off the execution stack, and any reference to them from then on would point to garbage data.
The best course of action is to return either the string or a clone of it. No need to obsess over the speed.
However, I believe the essence of the question is whether a String can be converted into a &str that outlives it. The answer is no, and this is where the misconception lies:
Borrowing a String (for example with &s[..] or s.as_str()) does give you a genuine &str, but that slice points into the String's heap-allocated buffer and is valid only while the String is alive. Only string literals are stored in the data section of the executable and have the 'static lifetime. Borrowing a String therefore never produces a &'static str.
You can check out this post for detailed explanation:
What are the differences between Rust's `String` and `str`?
Now, there may be a workaround for this particular use case if you absolutely need static text:
Since you use combinations of four bases A, C, G, T, in groups of four, you can create a list of all possible outcomes as &str and use them through some data structure. You will have to jump through some hoops, but it is certainly doable.
If it is possible to create the resulting string once, in a static, this is a solution that does not leak memory:
#[macro_use]
extern crate lazy_static;

fn return_str<'a>() -> &'a str {
    lazy_static! {
        static ref STRING: String = {
            "ACTG".repeat(10)
        };
    }
    &STRING
}
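For comparison, a similar effect can be sketched with the standard library alone, assuming a toolchain where std::sync::OnceLock is available (Rust 1.70 or later). This swaps lazy_static for OnceLock and is not what the answer above uses.

```rust
use std::sync::OnceLock;

fn return_str() -> &'static str {
    // Initialized on the first call and then reused; nothing is leaked per call.
    static STRING: OnceLock<String> = OnceLock::new();
    STRING.get_or_init(|| "ACTG".repeat(10))
}

fn main() {
    let first = return_str();
    let second = return_str();
    // Both calls return a slice of the same stored String.
    assert!(std::ptr::eq(first, second));
    println!("{}", first.len()); // prints "40"
}
```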
Yes you can - the method replace_range provides a work around -
let a = "0123456789";
//println!("{}",a[3..5]); fails - doesn't have a size known at compile-time
let mut b = String::from(a);
b.replace_range(5..,"");
b.replace_range(0..2,"");
println!("{}",b); //succeeds
It took blood sweat and tears to achieve this!
