How do I go beyond the starting index of a string slice? - rust

I have a string slice:
fn my_func(s: &str) {
let chars: Vec<char> = s.chars().collect();
let some_char = chars[5];
//.........
}
A string slice is basically a pointer into a string, and that pointer can point to any character of the string, not necessarily the 0th or 1st one, right? If so, is there a way to go back N characters and refer to the -Nth character of the original string? Not the -Nth character of the slice, but of the original string.
fn my_func(s: &str) {
let chars: Vec<char> = s.chars().collect();
//let some_char2 = chars[-10]; // minus ???
//.........
}
Provided that I know that it may cause a memory segmentation error and am fine with it.

it's a pointer that can point to any character of a String, not necessarily the 0-th or 1st one, right?
Yes.
is there a way to go back N characters and refer to -N character of an original String? Not to -Nth character of a slice, but an original String.
The slice itself retains no knowledge of whatever originally created it (which might not even be a String), it only has a pointer and a length.
Provided that I know that it may cause memory segmentation error and am fine with it.
You can always do that with unsafe and raw pointers, but as the name suggests this is wildly unsafe: not only are there no guarantees whatsoever about what lives where, but the other segments might be mutably borrowed or any other nonsense, which lands you straight in UB. A segfault is the least of your worries.
What are you actually trying to achieve?
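If the original string is still available to the caller, a safe alternative is to pass it along with the slice and work out where the slice sits inside it. The following is only a sketch under that assumption; offset_in and char_before are hypothetical helper names, not standard library functions.

```rust
/// Hypothetical helper: byte offset of `sub` within `parent`, or `None`
/// if `sub` does not lie inside `parent`'s memory.
fn offset_in(parent: &str, sub: &str) -> Option<usize> {
    let parent_start = parent.as_ptr() as usize;
    let sub_start = sub.as_ptr() as usize;
    let offset = sub_start.checked_sub(parent_start)?;
    if offset + sub.len() <= parent.len() {
        Some(offset)
    } else {
        None
    }
}

/// Hypothetical helper: the character `n` positions before the start of `sub`
/// in `parent` (`n = 1` is the character immediately before the slice).
fn char_before(parent: &str, sub: &str, n: usize) -> Option<char> {
    let offset = offset_in(parent, sub)?;
    // `offset` is a char boundary because `sub` is itself a valid `str`.
    parent[..offset].chars().rev().nth(n.checked_sub(1)?)
}

fn main() {
    let original = String::from("hello, world");
    let slice = &original[7..]; // "world"
    println!("{:?}", char_before(&original, slice, 1)); // Some(' ')
    println!("{:?}", char_before(&original, slice, 7)); // Some('h')
    println!("{:?}", char_before(&original, slice, 8)); // None
}
```

No unsafe is needed here: the addresses are only compared, never dereferenced, and every access goes through the bounds-checked parent slice.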

Related

What is the difference between the following two programs in terms of string usage? [duplicate]

Why does Rust have String and str? What are the differences between String and str? When does one use String instead of str and vice versa? Is one of them getting deprecated?
String is the dynamic heap string type, like Vec: use it when you need to own or modify your string data.
str is an immutable[1] sequence of UTF-8 bytes of dynamic length somewhere in memory. Since the size is unknown, one can only handle it behind a pointer. This means that str most commonly[2] appears as &str: a reference to some UTF-8 data, normally called a "string slice" or just a "slice". A slice is just a view onto some data, and that data can be anywhere, e.g.
In static storage: a string literal "foo" is a &'static str. The data is hardcoded into the executable and loaded into memory when the program runs.
Inside a heap allocated String: String dereferences to a &str view of the String's data.
On the stack: e.g. the following creates a stack-allocated byte array, and then gets a view of that data as a &str:
use std::str;
let x: &[u8] = &[b'a', b'b', b'c'];
let stack_str: &str = str::from_utf8(x).unwrap();
In summary, use String if you need owned string data (like passing strings to other threads, or building them at runtime), and use &str if you only need a view of a string.
This is identical to the relationship between a vector Vec<T> and a slice &[T], and is similar to the relationship between by-value T and by-reference &T for general types.
[1] A str is fixed-length; you cannot write bytes beyond the end, or leave trailing invalid bytes. Since UTF-8 is a variable-width encoding, this effectively forces all strs to be immutable in many cases. In general, mutation requires writing more or fewer bytes than there were before (e.g. replacing an a (1 byte) with an ä (2+ bytes) would require making more room in the str). There are specific methods that can modify a &mut str in place, mostly those that handle only ASCII characters, like make_ascii_uppercase.
[2] Dynamically sized types allow things like Rc<str> for a sequence of reference counted UTF-8 bytes since Rust 1.2. Rust 1.21 allows easily creating these types.
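As a rough illustration of the "view onto data that can be anywhere" idea, the sketch below passes text from static storage, the heap, the stack, and an Rc<str> to the same function. first_word is a hypothetical helper made up for this example.

```rust
use std::rc::Rc;

/// Hypothetical helper: returns the text up to the first space.
/// Taking `&str` lets callers pass a view of any kind of string storage.
fn first_word(s: &str) -> &str {
    s.split(' ').next().unwrap_or("")
}

fn main() {
    let literal: &'static str = "static storage"; // stored in the binary
    let owned: String = String::from("heap storage"); // owned, on the heap
    let bytes: [u8; 13] = *b"stack storage"; // byte array on the stack
    let on_stack: &str = std::str::from_utf8(&bytes).unwrap();
    let shared: Rc<str> = Rc::from("shared storage"); // reference counted

    println!("{}", first_word(literal)); // static
    println!("{}", first_word(&owned)); // heap (&String coerces to &str)
    println!("{}", first_word(on_stack)); // stack
    println!("{}", first_word(&shared)); // shared (&Rc<str> coerces to &str)
}
```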
I have a C++ background and I found it very useful to think about String and &str in C++ terms:
A Rust String is like a std::string; it owns the memory and does the dirty job of managing memory.
A Rust &str is like a char* (but a little more sophisticated); it points us to the beginning of a chunk in the same way you can get a pointer to the contents of std::string.
Are either of them going to disappear? I do not think so. They serve two purposes:
String keeps the buffer and is very practical to use. &str is lightweight and should be used to "look" into strings. You can search, split, parse, and even replace chunks without needing to allocate new memory.
A &str can look inside a String, just as it can point to a string literal. The following code needs to copy the literal string into the String's managed memory:
let a: String = "hello rust".into();
The following code lets you use the literal itself without a copy (read-only though):
let a: &str = "hello rust";
It is str that is analogous to String, not the slice of it.
An str is a string literal, basically a pre-allocated text:
"Hello World"
This text has to be stored somewhere, so it is stored in the data section of the executable file along with the program’s machine code, as a sequence of bytes ([u8]). Because text can be of any length, it is dynamically sized:
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ H │ e │ l │ l │ o │ │ W │ o │ r │ l │ d │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ 72 │ 101 │ 108 │ 108 │ 111 │ 32 │ 87 │ 111 │ 114 │ 108 │ 100 │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
We need a way to access a stored text and that is where the slice comes in.
A slice, [T], is a view into a block of memory. Whether mutable or not, a slice always borrows, and that is why it is always behind a pointer, &.
Let's explain the meaning of being dynamically sized. Some programming languages, like C, append a zero byte (\0) at the end of their strings and keep a record of the starting address. To determine a string's length, the program has to walk through the raw bytes from the starting position until it finds this zero byte. So a text can be of any length, hence it is dynamically sized.
However, Rust takes a different approach: it uses a slice. A slice stores the address where a str starts and how many bytes it takes. This is better than appending a zero byte because, for a literal, the length is computed in advance during compilation. Since text can be of any size, from the type system's perspective it is still dynamically sized.
So the "Hello World" expression returns a fat pointer, containing both the address of the actual data and its length. This pointer will be our handle to the actual data, and it will also be stored in our program. Now the data is behind a pointer, and the compiler knows the pointer's size at compile time.
Since text is stored in the source code, it will be valid for the entire lifetime of the running program, hence will have the static lifetime.
So the return value of the "Hello World" expression should reflect these two characteristics, and it does:
let s: &'static str = "Hello World";
You may ask why its type is written as str and not as [u8]. It is because the data is always guaranteed to be a valid UTF-8 sequence. Not all UTF-8 characters are a single byte; some take up to 4 bytes. So [u8] would be inaccurate.
If you disassemble a compiled Rust program and inspect the executable file, you will see that multiple strs are stored adjacent to each other in the data section without any indication of where one starts and the other ends.
The compiler takes it even further. If identical static text is used at multiple locations in your program, the Rust compiler will optimize your program and create a single binary block in the executable's data section, and each slice in your code will point to this binary block.
For example, the compiler creates a single contiguous binary block with the content of "Hello World" for the following code, even though we use three different literals with "Hello World":
let x: &'static str = "Hello World";
let y: &'static str = "Hello World";
let z: &'static str = "Hello World";
String, on the other hand, is a specialized type that stores its value as a vector of u8. Here is how the String type is defined in the source code:
pub struct String {
vec: Vec<u8>,
}
Being a vector means it is heap allocated and resizable like any other vector value.
Being specialized means it does not permit arbitrary access and enforces checks that the data is always valid UTF-8. Other than that, it is just a vector.
So a String is a resizable buffer holding UTF-8 text. This buffer is allocated on the heap, so it can grow as needed or requested. We can fill this buffer any way we see fit. We can change its content.
If you look carefully, the vec field is kept private to enforce validity. Since it is private, we cannot create a String instance directly. It is kept private because not every stream of bytes produces valid UTF-8 characters, and direct interaction with the underlying bytes may corrupt the string. We write the u8 bytes through methods, and those methods run certain checks. Being private and allowing only controlled interaction via methods is what provides these guarantees.
There are several methods defined on the String type to create a String instance; new is one of them:
pub const fn new() -> String {
String { vec: Vec::new() }
}
We can use it to create a valid String.
let s = String::new();
println!("{}", s);
Unfortunately, it does not accept an input parameter, so the result is a valid but empty string. It will grow like any other vector when its capacity is not enough to hold the assigned value, but application performance will take a hit, as growing requires re-allocation.
We can fill the underlying vector with initial values from different sources:
From a string literal
let a = "Hello World";
let s = String::from(a);
Please note that a str is still created and its content is copied to the heap-allocated vector via String::from. If we check the executable binary, we will see the raw bytes of "Hello World" in the data section. This is a very important detail some people miss.
From raw parts
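// ManuallyDrop keeps the original String from freeing the buffer a second time
let mut s = std::mem::ManuallyDrop::new(String::from("hello"));
let ptr = s.as_mut_ptr();
let len = s.len();
let capacity = s.capacity();
// unsafe: ptr, len and capacity must come from a String that is not dropped elsewhere
let s = unsafe { String::from_raw_parts(ptr, len, capacity) };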
From a character
let ch = 'c';
let s = ch.to_string();
From vector of bytes
let hello_world = vec![72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100];
// We know it is valid sequence, so we can use unwrap
let hello_world = String::from_utf8(hello_world).unwrap();
println!("{}", hello_world); // Hello World
Here we have another important detail. A vector might have any value, there is no guarantee its content will be a valid UTF-8, so Rust forces us to take this into consideration by returning a Result<String, FromUtf8Error> rather than a String.
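To make the Result handling concrete, here is a small sketch of what happens with bytes that are not valid UTF-8. decode is a hypothetical helper written for this example; String::from_utf8, FromUtf8Error::utf8_error, Utf8Error::valid_up_to and String::from_utf8_lossy are the standard library calls it relies on.

```rust
/// Hypothetical helper: decode bytes into a String, or report how many
/// leading bytes were valid UTF-8 when decoding fails.
fn decode(bytes: Vec<u8>) -> Result<String, usize> {
    String::from_utf8(bytes).map_err(|e| e.utf8_error().valid_up_to())
}

fn main() {
    // 0xFF never appears in valid UTF-8, so the second vector is rejected.
    println!("{:?}", decode(vec![72, 105])); // Ok("Hi")
    println!("{:?}", decode(vec![72, 105, 0xFF])); // Err(2)

    // If losing information is acceptable, invalid bytes can instead be
    // replaced with U+FFFD (the replacement character).
    let lossy = String::from_utf8_lossy(&[72, 105, 0xFF]).into_owned();
    println!("{}", lossy);
}
```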
From input buffer
use std::io::{self, Read};
fn main() -> io::Result<()> {
let mut buffer = String::new();
let stdin = io::stdin();
let mut handle = stdin.lock();
handle.read_to_string(&mut buffer)?;
Ok(())
}
Or from any other type that implements ToString trait
Since String is a vector under the hood, it will exhibit some vector characteristics:
a pointer: The pointer points to an internal buffer that stores the data.
length: The length is the number of bytes currently stored in the buffer.
capacity: The capacity is the size of the buffer in bytes. So, the length will always be less than or equal to the capacity.
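A minimal sketch of how these three values behave, assuming we reserve space up front with String::with_capacity. fits_without_realloc is a hypothetical helper used only to spell out the length and capacity relationship.

```rust
/// Hypothetical helper: true if `extra` fits in the spare capacity of `s`,
/// in which case pushing it does not reallocate.
fn fits_without_realloc(s: &String, extra: &str) -> bool {
    s.len() + extra.len() <= s.capacity()
}

fn main() {
    // Reserve room up front; the capacity is at least what was requested.
    let mut s = String::with_capacity(16);
    println!("len = {}, capacity >= 16: {}", s.len(), s.capacity() >= 16);

    let before = s.as_ptr();
    assert!(fits_without_realloc(&s, "Hello World"));
    s.push_str("Hello World"); // 11 bytes, fits in the reserved buffer
    assert_eq!(s.as_ptr(), before); // same buffer, no reallocation
    assert!(s.len() <= s.capacity()); // length never exceeds capacity

    // Pushing more than the spare capacity forces the buffer to grow.
    let big = "x".repeat(s.capacity());
    assert!(!fits_without_realloc(&s, &big));
    s.push_str(&big);
    println!("len = {}, capacity = {}", s.len(), s.capacity());
}
```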
And it delegates some properties and methods to vectors:
pub fn capacity(&self) -> usize {
self.vec.capacity()
}
Most of the examples use String::from, so people get confused and wonder why you would create a String from another string.
It is a long read, hope it helps.
They are actually completely different. First off, a str is nothing but a type level thing; it can only be reasoned about at the type level because it's a so-called dynamically-sized type (DST). The size the str takes up cannot be known at compile time and depends on runtime information — it cannot be stored in a variable because the compiler needs to know at compile time what the size of each variable is. A str is conceptually just a row of u8 bytes with the guarantee that it forms valid UTF-8. How large is the row? No one knows until runtime hence it can't be stored in a variable.
The interesting thing is that a &str or any other pointer to a str like Box<str> does exist at runtime. This is a so-called "fat pointer"; it's a pointer with extra information (in this case the size of the thing it's pointing at) so it's twice as large. In fact, a &str is quite close to a String (but not to a &String). A &str is two words: one pointer to the first byte of a str and another number that describes how many bytes long the str is.
Contrary to what is said, a str does not need to be immutable. If you can get a &mut str as an exclusive pointer to the str, you can mutate it and all the safe functions that mutate it guarantee that the UTF-8 constraint is upheld because if that is violated then we have undefined behaviour as the library assumes this constraint is true and does not check for it.
So what is a String? That's three words: two are the same as for &str, and the third is the capacity of the str buffer on the heap, that is, how much it can hold before it is full and has to reallocate. The buffer a String manages is always on the heap (a str in general is not necessarily on the heap). The String basically owns a str, as they say; it controls it and can resize and reallocate it when it sees fit. So a String is, as said, closer to a &str than to a str.
Another thing is a Box<str>; this also owns a str and its runtime representation is the same as a &str but it also owns the str unlike the &str but it cannot resize it because it does not know its capacity so basically a Box<str> can be seen as a fixed-length String that cannot be resized (you can always convert it into a String if you want to resize it).
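The Box<str> relationship can be sketched as follows. round_trip is a hypothetical helper for this example; String::into_boxed_str and str::into_string are the standard conversions in each direction.

```rust
use std::mem::size_of;

/// Hypothetical helper: freeze `text` into a Box<str>, then turn it back
/// into a String so that it can grow again.
fn round_trip(text: &str, suffix: &str) -> String {
    // into_boxed_str drops any spare capacity: only pointer + length remain.
    let boxed: Box<str> = String::from(text).into_boxed_str();
    // into_string converts back without copying the text.
    let mut s: String = boxed.into_string();
    s.push_str(suffix);
    s
}

fn main() {
    // Box<str> and &str are both (pointer, length); String adds a capacity.
    println!("{}", size_of::<Box<str>>() == size_of::<&str>()); // true
    println!("{}", size_of::<String>() > size_of::<Box<str>>()); // true
    println!("{}", round_trip("fixed", "-length")); // fixed-length
}
```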
A very similar relationship exists between [T] and Vec<T> except there is no UTF-8 constraint and it can hold any type whose size is not dynamic.
The use of str on the type level is mostly to create generic abstractions with &str; it exists on the type level to be able to conveniently write traits. In theory str as a type thing didn't need to exist and only &str but that would mean a lot of extra code would have to be written that can now be generic.
&str is super useful for having multiple different substrings of a String without copying. As said, a String owns the str on the heap that it manages, and if you could only create a substring of a String as a new String, it would have to be copied, because everything in Rust can have only one owner to deal with memory safety. So for instance you can slice a string:
let string: String = "a string".to_string();
let substring1: &str = &string[1..3];
let substring2: &str = &string[2..4];
We have two different substring strs of the same string. string is the one that owns the actual full str buffer on the heap and the &str substrings are just fat pointers to that buffer on the heap.
str, only used as &str, is a string slice, a reference to a UTF-8 byte array.
String is what used to be ~str, a growable, owned UTF-8 byte array.
Rust &str and String
String:
Rust's owned string type. The string itself lives on the heap and is therefore mutable and can alter its size and contents.
Because a String is owned, the memory on the heap is freed when the variable that owns the string goes out of scope.
Variables of type String are fat pointers (pointer + associated metadata)
The fat pointer is 3 * 8 bytes (wordsize) long and consists of the following 3 elements:
Pointer to actual data on the heap, it points to the first character
Length of the string in bytes (not the number of characters)
Capacity of the string on the heap
&str:
Rust's non-owned string type, immutable by default. The string itself lives somewhere else in memory, usually on the heap or in 'static memory.
Because a &str does not own the string, the string's memory is not freed when a &str variable goes out of scope.
Variables of type &str are fat pointers (pointer + associated metadata)
The fat pointer is 2 * 8 bytes (wordsize) long and consists of the following 2 elements:
Pointer to the actual data, it points to the first character
Length of the string in bytes (not the number of characters)
Example:
use std::mem;
fn main() {
// on 64 bit architecture:
println!("{}", mem::size_of::<&str>()); // 16
println!("{}", mem::size_of::<String>()); // 24
let string1: &'static str = "abc";
// string1 points to 'static memory, which lives for the whole program
let ptr = string1.as_ptr();
let len = string1.len();
println!("{}, {}", unsafe { *ptr as char }, len); // a, 3
// the string is 3 bytes long, so len is 3
// pointer to the first character points to letter a
{
let mut string2: String = "def".to_string();
let ptr = string2.as_ptr();
let len = string2.len();
let capacity = string2.capacity();
println!("{}, {}, {}", unsafe { *ptr as char }, len, capacity); // d, 3, 3
// pointer to the first character points to letter d
// the string is 3 bytes long, so len is 3
// string2 now has 3 bytes of space on the heap
string2.push_str("ghijk"); // we can mutate a String; its length and capacity will also change
println!("{}, {}", string2, string2.capacity()); // defghijk, 8
} // memory of string2 on the heap will be freed here because owner goes out of scope
}
std::string::String is simply a vector of u8. You can find its definition in the source code. It's heap-allocated and growable.
#[derive(PartialOrd, Eq, Ord)]
#[stable(feature = "rust1", since = "1.0.0")]
pub struct String {
vec: Vec<u8>,
}
str is a primitive type, also called string slice. A string slice has fixed size. A literal string like let test = "hello world" has &'static str type. test is a reference to this statically allocated string.
&str cannot be modified, for example,
let mut word = "hello world";
word[0] = 's'; // error: str cannot be indexed by an integer
word.push('\n'); // error: &str has no push method
str does have mutable slice &mut str, for example:
pub fn split_at_mut(&mut self, mid: usize) -> (&mut str, &mut str)
let mut s = "Per Martin-Löf".to_string();
{
let (first, last) = s.split_at_mut(3);
first.make_ascii_uppercase();
assert_eq!("PER", first);
assert_eq!(" Martin-Löf", last);
}
assert_eq!("PER Martin-Löf", s);
But a small change to UTF-8 can change its byte length, and a slice cannot reallocate its referent.
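The byte-length point can be checked directly. This is a sketch only; replace_char is a hypothetical helper that builds a new String, which is what a growing replacement needs because a slice cannot resize its buffer. The character ä is written as '\u{e4}' to avoid any ambiguity in how it is encoded in the source file.

```rust
/// Hypothetical helper: replace every `from` with `to` by building a new
/// String, and report the byte length before and after.
fn replace_char(s: &str, from: char, to: char) -> (String, usize, usize) {
    let replaced: String = s.chars().map(|c| if c == from { to } else { c }).collect();
    let new_len = replaced.len();
    (replaced, s.len(), new_len)
}

fn main() {
    println!("{}", 'a'.len_utf8()); // 1
    println!("{}", '\u{e4}'.len_utf8()); // 2 (the letter ä)

    // One character changed, but the text is now one byte longer.
    let (text, before, after) = replace_char("Martin", 'a', '\u{e4}');
    println!("{} {} {}", text, before, after); // Märtin 6 7
}
```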
In simple words, String is a datatype stored on the heap (just like Vec), and you have access to that location.
&str is a slice type. That means it is just a reference to string data that already exists somewhere, for example in a String on the heap or in a string literal in the binary.
&str doesn't do any allocation at runtime. So, for memory reasons, you can use &str over String. But, keep in mind that when using &str you might have to deal with explicit lifetimes.
For C# and Java people:
Rust's String === StringBuilder
Rust's &str === (immutable) string
I like to think of a &str as a view on a string, like an interned string in Java / C# where you can't change it, only create a new one.
Some Usages
example_1.rs
fn main(){
let hello = String::from("hello");
let any_char = hello[0]; // error: String cannot be indexed by an integer
}
example_2.rs
fn main(){
let hello = String::from("hello");
for c in hello.chars() {
println!("{}",c);
}
}
example_3.rs
fn main(){
let hello = String::from("String are cool");
let any_char = &hello[5..6]; // = let any_char: &str = &hello[5..6];
println!("{:?}",any_char);
}
Shadowing
fn main() {
let s: &str = "hello"; // &str
let s: String = s.to_uppercase(); // String
println!("{}", s) // HELLO
}
function
fn say_hello(to_whom: &str) { //type coercion
println!("Hey {}!", to_whom)
}
fn main(){
let string_slice: &'static str = "you";
let string: String = string_slice.into(); // &str => String
say_hello(string_slice);
say_hello(&string);// &String
}
Concat
// A String lives on the heap and can grow or shrink.
// The size of a &str is fixed.
fn main(){
let a = "Foo";
let b = "Bar";
let c = a + b; //error
// let c = a.to_string() + b;
}
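For reference, here are a few ways the concatenation above can be made to compile. This is a sketch rather than a recommendation of one over the others; concat_all is a hypothetical helper that just collects the variants.

```rust
/// Hypothetical helper: concatenate two string slices in four different ways.
fn concat_all(a: &str, b: &str) -> [String; 4] {
    // `+` needs an owned String on the left-hand side.
    let with_plus = a.to_string() + b;
    // format! builds a new String from any displayable pieces.
    let with_format = format!("{}{}", a, b);
    // concat joins a slice of string slices.
    let with_concat = [a, b].concat();
    // push_str appends to an existing String in place.
    let mut with_push = String::from(a);
    with_push.push_str(b);
    [with_plus, with_format, with_concat, with_push]
}

fn main() {
    for s in concat_all("Foo", "Bar") {
        println!("{}", s); // FooBar
    }
}
```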
Note that String and &str are different types and for 99% of the time, you only should care about &str.
In Rust, str is a primitive type that represents a sequence of Unicode scalar values, also known as a string slice. This means that it is a read-only view into a string, and it does not own the memory that it points to. On the other hand, String is a growable, mutable, owned string type. This means that when you create a String, it will allocate memory on the heap to store the contents of the string, and it will deallocate this memory when the String goes out of scope. Because String is growable and mutable, you can change the contents of a String after you have created it.
In general, str is used when you want to refer to a string slice that is stored in another data structure, such as a String. String is used when you want to create and own a string value.
String is an Object.
&str is a pointer at a part of object.
In these 3 different types
let noodles = "noodles".to_string();
let oodles = &noodles[1..];
let poodles = "ಠ_ಠ"; // this is string literal
A String has a resizable buffer holding UTF-8 text. The buffer is allocated on the heap, so it can resize its buffer as needed or requested. In the example, "noodles" is a String that owns an eight-byte buffer, of which seven are in use. You can think of a String as a Vec that is guaranteed to hold well-formed UTF-8; in fact, this is how String is implemented.
A &str is a reference to a run of UTF-8 text owned by someone else: it “borrows” the text. In the example, oodles is a &str referring to the last six bytes of the text belonging to "noodles", so it represents the text “oodles.” Like other slice references, a &str is a fat pointer, containing both the address of the actual data and its length. You can think of a &str as being nothing more than a &[u8] that is guaranteed to hold well-formed UTF-8.
A string literal is a &str that refers to preallocated text, typically stored in read-only memory along with the program’s machine code. In the preceding example, poodles is a string literal, pointing to seven bytes that are created when the program begins execution and that last until it exits.
This is how they are stored in memory
Reference: Programming Rust, by Jim Blandy, Jason Orendorff, Leonora F. S. Tindall
Here is a quick and easy explanation.
String - A growable, ownable heap-allocated data structure. It can be coerced to a &str.
str - is (now, as Rust evolves) mutable, fixed-length string that lives on the heap or in the binary. You can only interact with str as a borrowed type via a string slice view, such as &str.
Usage considerations:
Prefer String if you want to own or mutate a string - such as passing the string to another thread, etc.
Prefer &str if you want to have a read-only view of a string.

How do I get a substring of a String object using a character position range?

Say I have a struct Foo that owns a string:
struct Foo {
owned_string: String
}
I want to implement some methods on this struct that return substrings from the owned String. For efficiency reasons, I don't want to allocate any new memory for this, I just want the return values to point to the original String.
Let's say I know the substring I want, it's characters 10 through 15.
I can't just slice it like self.owned_string[10..16], since that would give me bytes, not characters.
I can take the characters and collect them into a new String object, like self.owned_string.chars().skip(9).take(6).collect::<String>(), but that creates a new String object. String objects own their strings (AFAIK), so presumably new memory was allocated for this, which is not what I want.
How do I create string slices that reference a substring of a String object, but using character positions? (Without allocating any new memory)
You can use char_indices() then slice the string according to the positions the iterator gives you:
let mut iter = s.char_indices();
let (start, _) = iter.nth(10).unwrap();
let (end, _) = iter.nth(5).unwrap();
let slice = &s[start..end];
However, note that as mentioned in the documentation of chars():
It’s important to remember that char represents a Unicode Scalar Value, and might not match your idea of what a ‘character’ is. Iteration over grapheme clusters may be what you actually want. This functionality is not provided by Rust’s standard library, check crates.io instead.
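The snippet above panics if the string is too short, including the case where the range ends exactly at the last character. A slightly more defensive sketch of the same char_indices idea is shown below; char_slice is a hypothetical helper, and it walks the string twice for simplicity.

```rust
/// Hypothetical helper: the sub-slice of `s` covering characters
/// `start..end` (end exclusive), or `None` if the range is out of bounds.
fn char_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offset of the character at position `pos`. The position one past
    // the last character maps to `s.len()`, so ranges may end at the end.
    let byte_at = |pos: usize| {
        s.char_indices()
            .map(|(i, _)| i)
            .chain(std::iter::once(s.len()))
            .nth(pos)
    };
    let from = byte_at(start)?;
    let to = byte_at(end)?;
    Some(&s[from..to]) // borrows from `s`; nothing is allocated
}

fn main() {
    let s = "🤣😄😁😆😅";
    println!("{:?}", char_slice(s, 1, 4)); // Some("😄😁😆")
    println!("{:?}", char_slice(s, 3, 5)); // Some("😆😅")
    println!("{:?}", char_slice(s, 2, 6)); // None
}
```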
@ChayimFriedman's answer is of course correct; I just wanted to contribute a more telling example:
fn print_string(s: &str) {
println!("String: {}", s);
}
fn main() {
let s: String = "🤣😄😁😆😅".to_string();
let mut iter = s.char_indices();
// Retrieve the position of the char at pos 1
let (start, _) = iter.nth(1).unwrap();
// Now the next char will be at position `2`. Which would be
// equivalent of querying `.next()` or `.nth(0)`.
// So if we query for `nth(2)` we query 3 characters; meaning
// the position of character 4.
let (end, _) = iter.nth(2).unwrap();
// Gives you a &str, which is exactly what you want.
// A reference to a substring, zero allocations, zero overhead.
let substring = &s[start..end];
print_string(&s);
print_string(substring);
}
String: 🤣😄😁😆😅
String: 😄😁😆
I've done it with smileys because smileys are definitely multi-byte unicode characters.
As @ChayimFriedman already noted, the reason we have to iterate through char_indices is that Unicode characters are variably sized. In UTF-8 they can be anywhere from 1 to 4 bytes long, so the only way to find out where the character boundaries are is to actually read the string up to the character we desire.

Why do I need to use &str when defining a string literal in Rust?

Why can I not use str here?
let question: &str = "why";
What's the difference between str and &str?
I get that & denotes a reference, but I'm confused about what &str is referencing.
A str is a sequence of UTF-8 encoded bytes of unknown length, somewhere in memory.
Because its size is not known at compile time, it can't be put on the stack directly, instead, a reference must be used.
A string literal (i.e. the "why" syntax) creates a space in the data segment of the binary, and returns a reference to that location, which is an &str (in particular, an &'static str, because it is never dropped).
If you write let question: str = "why";, this won't compile for the same reason: let i: i32 = &123; won't compile.
P.S. ("hello") is not a tuple, it is just a &str in brackets. If you want to make a tuple with a single element, add a trailing comma: let hello: (&str,) = ("hello",);
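The difference in what the compiler knows about str and &str can be seen with the size functions from std::mem. This is just an illustrative sketch; sizes is a hypothetical helper.

```rust
use std::mem::{size_of, size_of_val};

/// Hypothetical helper: (size of the reference itself, size of the bytes it
/// points to).
fn sizes(s: &str) -> (usize, usize) {
    // The reference has a fixed size known at compile time: pointer + length.
    // The size of the str behind it is only known through the reference.
    (size_of::<&str>(), size_of_val(s))
}

fn main() {
    let question: &str = "why";
    let (reference, data) = sizes(question);
    println!("{}", reference == 2 * size_of::<usize>()); // true
    println!("{}", data); // 3

    // let bare: str = *question; // does not compile: `str` is not `Sized`
}
```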

Why is capitalizing the first letter of a string so convoluted in Rust?

I'd like to capitalize the first letter of a &str. It's a simple problem and I hope for a simple solution. Intuition tells me to do something like this:
let mut s = "foobar";
s[0] = s[0].to_uppercase();
But &strs can't be indexed like this. The only way I've been able to do it seems overly convoluted. I convert the &str to an iterator, convert the iterator to a vector, upper case the first item in the vector, which creates an iterator, which I index into, creating an Option, which I unwrap to give me the upper-cased first letter. Then I convert the vector into an iterator, which I convert into a String, which I convert to a &str.
let s1 = "foobar";
let mut v: Vec<char> = s1.chars().collect();
v[0] = v[0].to_uppercase().nth(0).unwrap();
let s2: String = v.into_iter().collect();
let s3 = &s2;
Is there an easier way than this, and if so, what? If not, why is Rust designed this way?
Why is it so convoluted?
Let's break it down, line-by-line
let s1 = "foobar";
We've created a literal string that is encoded in UTF-8. UTF-8 allows us to encode the 1,114,112 code points of Unicode in a manner that's pretty compact if you come from a region of the world that types in mostly characters found in ASCII, a standard created in 1963. UTF-8 is a variable length encoding, which means that a single code point might take from 1 to 4 bytes. The shorter encodings are reserved for ASCII, but many Kanji take 3 bytes in UTF-8.
let mut v: Vec<char> = s1.chars().collect();
This creates a vector of characters. A character is a 32-bit number that directly maps to a code point. If we started with ASCII-only text, we've quadrupled our memory requirements. If we had a bunch of characters from the astral plane, then maybe we haven't used that much more.
v[0] = v[0].to_uppercase().nth(0).unwrap();
This grabs the first code point and requests that it be converted to an uppercase variant. Unfortunately for those of us who grew up speaking English, there's not always a simple one-to-one mapping of a "small letter" to a "big letter". Side note: we call them upper- and lower-case because one box of letters was above the other box of letters back in the day.
This code would panic if to_uppercase produced no characters at all. I don't think that can happen, since a code point without an uppercase mapping is returned unchanged. It could also semantically fail when a code point has an uppercase variant that has multiple characters, such as the German ß. Note that ß may never actually be capitalized in The Real World; this is just the example I can always remember and search for. As of 2017-06-29, in fact, the official rules of German spelling have been updated so that both "ẞ" and "SS" are valid capitalizations!
let s2: String = v.into_iter().collect();
Here we convert the characters back into UTF-8 and require a new allocation to store them in, as the original variable was stored in constant memory so as to not take up memory at run time.
let s3 = &s2;
And now we take a reference to that String.
It's a simple problem
Unfortunately, this is not true. Perhaps we should endeavor to convert the world to Esperanto?
I presume char::to_uppercase already properly handles Unicode.
Yes, I certainly hope so. Unfortunately, Unicode isn't enough in all cases.
Thanks to huon for pointing out the Turkish I, where both the upper (İ) and lower case (i) versions have a dot. That is, there is no one proper capitalization of the letter i; it depends on the locale of the source text as well.
why the need for all data type conversions?
Because the data types you are working with are important when you are worried about correctness and performance. A char is 32-bits and a string is UTF-8 encoded. They are different things.
indexing could return a multi-byte, Unicode character
There may be some mismatched terminology here. A char is a multi-byte Unicode character.
Slicing a string is possible if you go byte-by-byte, but the standard library will panic if you are not on a character boundary.
One of the reasons that indexing a string to get a character was never implemented is because so many people misuse strings as arrays of ASCII characters. Indexing a string to set a character could never be efficient - you'd have to be able to replace 1-4 bytes with a value that is also 1-4 bytes, causing the rest of the string to bounce around quite a lot.
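A short sketch of the boundary rule: str::get returns None where plain range indexing would panic, and str::is_char_boundary reports whether a byte offset is safe to slice at. safe_slice is a hypothetical wrapper, and the ö in the example text is written as '\u{f6}' so its encoding is unambiguous.

```rust
/// Hypothetical helper: slice by byte range without panicking.
fn safe_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let s = "L\u{f6}f"; // "Löf": L is 1 byte, ö is 2 bytes, f is 1 byte
    println!("{:?}", safe_slice(s, 0, 1)); // Some("L")
    println!("{:?}", safe_slice(s, 1, 3)); // Some("ö")
    println!("{:?}", safe_slice(s, 0, 2)); // None: byte 2 is inside ö
    println!("{}", s.is_char_boundary(2)); // false
    // let bad = &s[0..2]; // would panic at run time
}
```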
to_uppercase could return an upper case character
As mentioned above, ß is a single character that, when capitalized, becomes two characters.
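This is why char::to_uppercase returns an iterator rather than a single char. A minimal sketch, with upper as a hypothetical helper that collects the result:

```rust
/// Hypothetical helper: uppercase a single char into a String, since the
/// result may be more than one char.
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

fn main() {
    println!("{}", upper('a')); // A
    println!("{}", upper('ß')); // SS (one char in, two chars out)
    println!("{}", upper('1')); // 1 (no uppercase mapping, returned unchanged)
    println!("{}", upper('i')); // I (the standard library ignores locale)
}
```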
Solutions
See also trentcl's answer which only uppercases ASCII characters.
Original
If I had to write the code, it'd look like:
fn some_kind_of_uppercase_first_letter(s: &str) -> String {
let mut c = s.chars();
match c.next() {
None => String::new(),
Some(f) => f.to_uppercase().chain(c).collect(),
}
}
fn main() {
println!("{}", some_kind_of_uppercase_first_letter("joe"));
println!("{}", some_kind_of_uppercase_first_letter("jill"));
println!("{}", some_kind_of_uppercase_first_letter("von Hagen"));
println!("{}", some_kind_of_uppercase_first_letter("ß"));
}
But I'd probably search for uppercase or unicode on crates.io and let someone smarter than me handle it.
Improved
Speaking of "someone smarter than me", Veedrac points out that it's probably more efficient to convert the iterator back into a slice after the first capital codepoints are accessed. This allows for a memcpy of the rest of the bytes.
fn some_kind_of_uppercase_first_letter(s: &str) -> String {
let mut c = s.chars();
match c.next() {
None => String::new(),
Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
}
}
Is there an easier way than this, and if so, what? If not, why is Rust designed this way?
Well, yes and no. Your code is, as the other answer pointed out, not correct, and will panic if you give it something like བོད་སྐད་ལ་. So doing this with Rust's standard library is even harder than you initially thought.
However, Rust is designed to encourage code reuse and make bringing in libraries easy. So the idiomatic way to capitalize a string is actually quite palatable:
extern crate inflector;
use inflector::Inflector;
let capitalized = "some string".to_title_case();
It's not especially convoluted if you are able to limit your input to ASCII-only strings.
Since Rust 1.23, str has a make_ascii_uppercase method (in older Rust versions, it was available through the AsciiExt trait). This means you can uppercase ASCII-only string slices with relative ease:
fn make_ascii_titlecase(s: &mut str) {
if let Some(r) = s.get_mut(0..1) {
r.make_ascii_uppercase();
}
}
This will turn "taylor" into "Taylor", but it won't turn "édouard" into "Édouard".
Use with caution.
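A small sketch of how this behaves in practice. The function is repeated so the example is self-contained, and `ascii_titlecased` is a hypothetical wrapper added only for illustration.

```rust
/// Same function as above, repeated so the sketch stands alone.
fn make_ascii_titlecase(s: &mut str) {
    // `get_mut(0..1)` is `None` for an empty string, or when byte 1 is not
    // a character boundary, so this never panics.
    if let Some(r) = s.get_mut(0..1) {
        r.make_ascii_uppercase();
    }
}

/// Hypothetical convenience wrapper that works on an owned copy.
fn ascii_titlecased(s: &str) -> String {
    let mut owned = s.to_owned();
    make_ascii_titlecase(&mut owned);
    owned
}
```

With a precomposed "é" (U+00E9, two bytes in UTF-8) the input is returned unchanged, because the first byte alone is not a complete character.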
I did it this way:
fn str_cap(s: &str) -> String {
format!("{}{}", s[..1].to_uppercase(), &s[1..])
}
If it is not an ASCII string:
fn str_cap(s: &str) -> String {
format!("{}{}", s.chars().next().unwrap().to_uppercase(),
s.chars().skip(1).collect::<String>())
}
The OP's approach taken further:
replace the first character with its uppercase representation
let mut s = "foobar".to_string();
let r = s.remove(0).to_uppercase().to_string() + &s;
or
let r = format!("{}{s}", s.remove(0).to_uppercase());
println!("{r}");
This works with Unicode characters as well, e.g. "😎foobar".
If the first character is guaranteed to be ASCII, it can be changed to a capital letter in place:
let mut s = "foobar".to_string();
if !s.is_empty() {
s[0..1].make_ascii_uppercase(); // Foobar
}
Panics with a non ASCII character in first position!
Since the method to_uppercase() returns a new String, you should be able to just add the remainder of the string like so.
This was tested with Rust 1.57+ but is likely to work in any version that supports slicing.
fn uppercase_first_letter(s: &str) -> String {
s[0..1].to_uppercase() + &s[1..]
}
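Note that the byte-range slicing above panics on an empty string and on a first character that is longer than one byte. A minimal sketch of a variant that avoids both, assuming you want to uppercase the first character rather than the first byte (the name `uppercase_first_char` is made up here):

```rust
/// Uppercases the first character (not the first byte) of `s`.
fn uppercase_first_char(s: &str) -> String {
    match s.chars().next() {
        None => String::new(),
        Some(first) => {
            // Split at the end of the first character, which may be
            // anywhere from 1 to 4 bytes long.
            let (head, tail) = s.split_at(first.len_utf8());
            head.to_uppercase() + tail
        }
    }
}
```

This is essentially the same idea as the iterator-based versions earlier, just expressed with `split_at`.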
Here's a version that is a bit slower than @Shepmaster's improved version, but also more idiomatic:
fn capitalize_first(s: &str) -> String {
let mut chars = s.chars();
chars
.next()
.map(|first_letter| first_letter.to_uppercase())
.into_iter()
.flatten()
.chain(chars)
.collect()
}
This is how I solved this problem. Notice that I had to check whether self is non-ASCII (or empty) before transforming to uppercase.
trait TitleCase {
fn title(&self) -> String;
}
impl TitleCase for &str {
fn title(&self) -> String {
if !self.is_ascii() || self.is_empty() {
return String::from(*self);
}
let (head, tail) = self.split_at(1);
head.to_uppercase() + tail
}
}
pub fn main() {
println!("{}", "bruno".title());
println!("{}", "b".title());
println!("{}", "🦀".title());
println!("{}", "ß".title());
println!("{}", "".title());
println!("{}", "བོད་སྐད་ལ".title());
}
Output
Bruno
B
🦀
ß
བོད་སྐད་ལ
Inspired by the get_mut examples, I wrote something like this:
fn make_capital(in_str: &str) -> String {
let mut v = String::from(in_str);
// get_mut returns None for an empty string or a multi-byte first character
if let Some(s) = v.get_mut(0..1) {
s.make_ascii_uppercase();
}
v
}

Efficiently extract prefix substrings

Currently I'm using the following function to extract prefix substrings:
fn prefix(s: &String, k: usize) -> String {
s.chars().take(k).collect::<String>()
}
This can then be used for comparisons like so:
let my_string = "ACGT".to_string();
let same = prefix(&my_string, 3) == prefix(&my_string, 2);
However, this allocates a new String for each call to prefix, in addition to the processing for the iteration. Most other languages I'm familiar with have an efficient way to do a comparison like this, using just a view of the strings. Is there a way in Rust?
Yes, you can take subslices of strings using the Index operation:
fn prefix(s: &str, k: usize) -> &str {
&s[..k]
}
fn main() {
let my_string = "ACGT".to_string();
let same = prefix(&my_string, 3) == prefix(&my_string, 2);
println!("{}", same);
}
Note that slicing a string uses bytes as the unit, not characters. It is up to the programmer to ensure that the slice lengths lie on valid UTF-8 boundaries. Additionally, you have to ensure that you don't try to slice past the end of the string. Breaking either of these will result in a panic!.
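If a panic is not acceptable, the standard library's `str::get` performs the same byte-based slicing but returns an `Option`. A minimal sketch (the name `checked_prefix` is made up for illustration):

```rust
/// Returns the first `k` bytes of `s`, or `None` if `k` is past the end of
/// the string or falls inside a multi-byte character.
fn checked_prefix(s: &str, k: usize) -> Option<&str> {
    s.get(..k)
}
```

The caller then decides what an invalid `k` should mean, instead of the program aborting.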
A bit more defensive version would be
fn prefix(s: &str, k: usize) -> &str {
let idx = s.char_indices().nth(k).map(|(idx, _)| idx).unwrap_or(s.len());
&s[0..idx]
}
The key difference is that we use the char_indices iterator, which tells us the byte offsets corresponding to a character. Indexing into a UTF-8 string is an O(n) operation, and Rust doesn't want to hide that algorithmic complexity from you. This still isn't even complete, because there can be combining characters, for example. Dealing with strings is hard, thanks to the complexity of human language.
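To illustrate the caveat about combining characters, here is the character-based `prefix` again (repeated so the sketch stands alone). The input used in the note below is an assumption chosen for the demonstration: an "e" followed by U+0301, the combining acute accent.

```rust
/// Character-based prefix: the first `k` chars of `s`, or all of `s` if it
/// has fewer than `k` chars.
fn prefix(s: &str, k: usize) -> &str {
    let idx = s
        .char_indices()
        .nth(k)
        .map(|(idx, _)| idx)
        .unwrap_or(s.len());
    &s[..idx]
}
```

Calling `prefix("e\u{301}tude", 1)` yields just "e", cutting the accent off the letter it belongs to, whereas the precomposed form `"\u{e9}tude"` keeps "é" intact. Handling this properly means working with grapheme clusters, which the standard library does not provide; a crate such as unicode-segmentation is the usual route.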
Most other languages I'm familiar with have an efficient way
Doubtful :-) To be efficient in time, they'd have to know how many bytes to skip ahead for every character. Either they'd have to keep a lookup table for every string or use a fixed-size character encoding. Both of those solutions can use more memory than needed, and a fixed size encoding doesn't even work when you have combining characters, for example.
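As a rough illustration of the memory trade-off, the sketch below compares the size of a string's UTF-8 data with the size of the same text stored as fixed-width `char` values (as a `Vec<char>` would). The helper name `encoded_sizes` is hypothetical.

```rust
/// Returns (bytes used as UTF-8, bytes used as fixed-width `char`s).
fn encoded_sizes(s: &str) -> (usize, usize) {
    let utf8_bytes = s.len();
    // Every `char` takes 4 bytes regardless of which character it holds.
    let fixed_width_bytes = s.chars().count() * std::mem::size_of::<char>();
    (utf8_bytes, fixed_width_bytes)
}
```

For an ASCII string like "ACGT" the fixed-width form is four times larger, which is the price of constant-time character indexing.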
Of course, other languages could just say "LOL, strings are just arrays of bytes, good luck with treating them correctly", and efficiently ignore your character encoding...
Two additional notes
Your predicate doesn't really make sense. A string of 2 letters will never match one of 3 letters. For strings to match, they must have the same number of bytes.
You should never need to take &String as a function argument. Taking a &str is a more accepting argument in all cases except for one teeny tiny little case that no one needs — knowing the capacity of a String, but without being able to modify the string.
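Both notes can be sketched together: a comparison that looks at the same number of bytes in two strings, taking `&str` so that a `&String`, a literal, or a subslice can all be passed. The name `same_prefix` and the choice to return `false` for an invalid `k` are assumptions made for this example.

```rust
/// Returns true if the first `k` bytes of `a` and `b` are equal.
/// Returns false if `k` is out of range or not on a character boundary
/// in either string.
fn same_prefix(a: &str, b: &str, k: usize) -> bool {
    match (a.get(..k), b.get(..k)) {
        (Some(x), Some(y)) => x == y,
        _ => false,
    }
}
```

Given `let owned = String::from("ACGT");`, a call such as `same_prefix(&owned, "ACGA", 3)` compiles because `&String` coerces to `&str` automatically.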
While Shepmaster's answer is absolutely correct for the general case of string slicing, I'd like to add that sometimes there are easier ways.
If you know in advance the set of characters you're working with ("ATGC" example suggests you're working with nucleobases, so it is possible that these are all the characters you need) then you can use slices of bytes &[u8] instead of string slices &str. You can always get a byte slice out of a string slice and a Vec<u8> out of a String, if necessary:
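let s: String = "ATGC".into();
let ss: &str = &s;
// borrow the bytes of the string slice
let bs: &[u8] = ss.as_bytes();
// clone first, because `s` is still borrowed by `ss` and `bs`
let b: Vec<u8> = s.clone().into_bytes();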
Also, there are byte slice and byte character literals, just prefix regular string/char literals with b:
let sl: &[u8] = b"ATGC";
let bl: u8 = b'G';
Working with byte slices gives you constant-time indexing (and thus slicing) operations, so checking for prefix equality is easy (like Shepmaster's first variant, but without the possibility of panics unless k is too large):
fn prefix(s: &[u8], k: usize) -> &[u8] {
&s[..k]
}
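If even the "k is too large" panic is unwanted, slices have a `get` method that returns an `Option` instead. A minimal sketch, with the made-up name `byte_prefix`:

```rust
/// Returns the first `k` bytes of `s`, or `None` if `s` is shorter than `k`.
fn byte_prefix(s: &[u8], k: usize) -> Option<&[u8]> {
    s.get(..k)
}
```

Because byte slices have no character boundaries to respect, running off the end of the data is the only way this can fail.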
If you need, you can turn byte slices/vectors back to strings. This operation, of course, checks validity of UTF-8 encoding so it may fail, but if you only work with ASCII, you can safely ignore these errors and just unwrap():
let ss2: &str = std::str::from_utf8(bs).unwrap();
let s2: String = String::from_utf8(b).unwrap();
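If the bytes might not be valid UTF-8 (for instance, when they come from a file rather than from a string you built yourself), it may be safer to handle the error than to unwrap. A small sketch, with the hypothetical helper name `bytes_to_str`:

```rust
/// Interprets `bytes` as UTF-8 text, returning `None` if they are not
/// valid UTF-8.
fn bytes_to_str(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}
```

`String::from_utf8` works the same way for an owned `Vec<u8>`, returning an `Err` that still contains the original bytes.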

Resources