I have a string and I need to scan for every occurrence of "foo" and read all the text following it until a second ". Since Rust does not have a contains function for strings, I need to iterate by characters scanning for it. How would I do this?
Edit: Rust's &str has both contains() and find() methods.
I need to iterate by characters scanning for it.
The .chars() method returns an iterator over characters in a string. e.g.
for c in my_str.chars() {
// do something with `c`
}
for (i, c) in my_str.chars().enumerate() {
// do something with character `c` and index `i`
}
If you are interested in the byte offsets of each char, you can use char_indices.
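As a small sketch of the difference from enumerate(): char_indices pairs each character with the byte offset where it starts, so the offsets skip ahead after a multi-byte character. The helper name char_offsets is made up for this example.

```rust
/// Returns each character of `text` paired with its starting byte offset.
/// (`char_offsets` is a hypothetical helper, not a standard library function.)
fn char_offsets(text: &str) -> Vec<(usize, char)> {
    text.char_indices().collect()
}

fn main() {
    // '\u{e9}' (é) occupies two bytes in UTF-8, so the offset after it jumps by 2.
    for (offset, c) in char_offsets("h\u{e9}llo") {
        println!("{offset}: {c}");
    }
}
```

These offsets can be used directly for slicing, which the indices from enumerate() cannot.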
Look into .peekable(), and use peek() for looking ahead. It's wrapped like this because it supports UTF-8 codepoints instead of being a simple vector of characters.
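Here is a minimal sketch of the look-ahead idea, assuming you want to consume a run of digits and stop before the first non-digit. The function take_digits is a hypothetical helper for illustration.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Consumes leading ASCII digits from `chars` and returns them as a String.
/// The first non-digit is only peeked at, so it stays in the iterator.
fn take_digits(chars: &mut Peekable<Chars<'_>>) -> String {
    let mut digits = String::new();
    while let Some(&c) = chars.peek() {
        if !c.is_ascii_digit() {
            break;
        }
        digits.push(c);
        chars.next(); // now actually consume the digit we peeked at
    }
    digits
}

fn main() {
    let mut chars = "123abc".chars().peekable();
    let digits = take_digits(&mut chars);
    let rest: String = chars.collect();
    println!("{digits} | {rest}");
}
```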
You could also create a vector of chars and work on it from there, but that's more time and space intensive:
let my_chars: Vec<_> = my_str.chars().collect();
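Coming back to the original problem: since &str does have find(), a char-by-char scan may not be needed at all. The sketch below is one possible reading of the question (for each occurrence of a marker, take the text after it up to the next double quote); the function name and that interpretation are my assumptions.

```rust
/// For every occurrence of `marker` in `text`, returns the text that follows
/// it up to (not including) the next double quote, or up to the end of the
/// input if no further quote exists.
fn text_after_marker<'a>(text: &'a str, marker: &str) -> Vec<&'a str> {
    text.match_indices(marker)
        .map(|(start, matched)| {
            // Everything after this occurrence of the marker.
            let rest = &text[start + matched.len()..];
            match rest.find('"') {
                Some(end) => &rest[..end],
                None => rest,
            }
        })
        .collect()
}

fn main() {
    let input = r#"foo=bar" x foo baz" tail"#;
    for piece in text_after_marker(input, "foo") {
        println!("[{piece}]");
    }
}
```

All offsets here come from match_indices and find, so the slices always land on character boundaries.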
The concept of a "character" is very ambiguous and can mean many different things depending on the type of data you are working with. The most obvious answer is the chars method. However, this does not work as advertised. What looks like a single "character" to you may actually be made up of multiple Unicode code points, which can lead to unexpected results:
"a̐".chars() // => ['a', '\u{310}']
For a lot of string processing, you want to work with graphemes. A grapheme consists of one or more Unicode code points represented as a string slice. These map better to the human perception of "characters". To create an iterator of graphemes, you can use the unicode-segmentation crate:
use unicode_segmentation::UnicodeSegmentation;
for grapheme in my_str.graphemes(true) {
// ...
}
If you are working with raw ASCII then none of the above applies to you, and you can simply use the bytes iterator:
for byte in my_str.bytes() {
// ...
}
Although, if you are working with ASCII then arguably you shouldn't be using String/&str at all and instead use Vec<u8>/&[u8] directly.
fn main() {
    let s = "Rust is a programming language";
    for i in s.chars() {
        print!("{}", i);
    }
}
Output: Rust is a programming language
I use the chars() method to iterate over each element of the string.
I have a string that I'm reading from input stream:
"['App01', 'App02', 'App03' , 'App04']"
What's the most efficient way to convert it to a Vec<String> type in Rust?
As a Code Golf answer:
serde_json::from_str(&s.replace('\'', "\"")).unwrap()
But both replace and from_str have to copy the string's data, so memory-wise it's probably not the best.
There are a lot of ways to parse a string like this, and it depends on what you mean by "efficient" (performance? lines of code?), but here's a possible solution:
fn main() {
let input = "['App01', 'App02', 'App03' , 'App04']";
// Start by removing the surrounding brackets.
let trimmed = input.trim_matches(['[', ']'].as_slice());
// Make the vector by:
let vector: Vec<String> = trimmed
.split(',') // separating the string parts by the comma
.map(str::trim) // cleaning each part by removing surrounding space
.map(|item| item.trim_matches('\'')) // then removing the single quotes
.map(String::from) // and converting each item from a slice to an owned String
.collect(); // and putting it all into a vector
println!("{vector:#?}");
}
And its output:
[
"App01",
"App02",
"App03",
"App04",
]
That's assuming you truly want an owned String. But it'd be a bit better if you opted to use a reference instead. To do that, it'd look like this instead:
let vector: Vec<&str> = trimmed
.split(',') // separating the string parts by the comma
.map(str::trim) // cleaning each part by removing surrounding space
.map(|item| item.trim_matches('\'')) // then removing the single quotes
.collect(); // and putting it all into a vector
Notice that the .map(String::from) call is gone and the resulting type is Vec<&str>. This avoids duplicating the strings and instead just uses a reference to the original input, using less memory overall.
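To make the "no duplication" point concrete, here is a sketch that wraps the borrowed variant in a function and checks that every item points into the original input. The names parse_list and borrows_from are hypothetical helpers for this example.

```rust
/// Splits a list such as "['a', 'b']" into slices borrowed from `input`.
/// Note: an empty list "[]" yields a single empty item with this approach.
fn parse_list(input: &str) -> Vec<&str> {
    input
        .trim()
        .trim_matches(['[', ']'].as_slice())
        .split(',')
        .map(|item| item.trim().trim_matches('\''))
        .collect()
}

/// Returns true if the bytes of `inner` lie inside the memory of `outer`.
fn borrows_from(outer: &str, inner: &str) -> bool {
    let outer_range = outer.as_bytes().as_ptr_range();
    let inner_range = inner.as_bytes().as_ptr_range();
    outer_range.start <= inner_range.start && inner_range.end <= outer_range.end
}

fn main() {
    let input = "['App01', 'App02', 'App03' , 'App04']";
    let items = parse_list(input);
    println!("{items:?}");
    // Every item is a view into `input`; no string data was copied.
    println!("{}", items.iter().all(|item| borrows_from(input, item)));
}
```

The trade-off is that the vector cannot outlive the input string it borrows from.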
Hope this helps.
Based on the Rust book, the String::len method returns the number of bytes composing the string, which may not correspond to the length in characters.
For example if we consider the following string in Japanese, len() would return 30, which is the number of bytes and not the number of characters, which would be 10:
let s = String::from("ラウトは難しいです！");
s.len() // returns 30.
The only way I have found to get the number of characters is using the following function:
s.chars().count()
which returns 10, and is the correct number of characters.
Is there any method on String that returns the characters count, aside from the one I am using above?
Is there any method on String that returns the characters count, aside from the one I am using above?
No. Using s.chars().count() is correct. Note that this is an O(N) operation (UTF-8 is a variable-width encoding, so the whole string has to be scanned), while getting the number of bytes is an O(1) operation.
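A quick sketch of the two counts side by side, assuming a small helper (byte_and_char_len is a made-up name):

```rust
/// Returns (number of bytes, number of `char`s) for `s`.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    // `len` reads a stored length; `chars().count()` walks the whole string.
    (s.len(), s.chars().count())
}

fn main() {
    for s in ["hello", "h\u{e9}llo", "難しい"] {
        let (bytes, chars) = byte_and_char_len(s);
        println!("{s}: {bytes} bytes, {chars} chars");
    }
}
```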
You can see all the methods on str for yourself.
As pointed out in the comments, a char is a specific concept:
It's important to remember that char represents a Unicode Scalar Value, and may not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want.
One such example is with precomposed characters:
fn main() {
    println!("{}", "e\u{301}".chars().count()); // 2: 'e' followed by a combining acute accent
    println!("{}", "\u{e9}".chars().count()); // 1: precomposed 'é'
}
You may prefer to use graphemes from the unicode-segmentation crate instead:
use unicode_segmentation::UnicodeSegmentation; // 1.6.0
fn main() {
    println!("{}", "e\u{301}".graphemes(true).count()); // 1
    println!("{}", "\u{e9}".graphemes(true).count()); // 1
}
I'm trying to write the Rust equivalent of the following C++ code:
result += consonants[rand() % consonants.length()];
It is meant to take a random character from the string consonants and append it to the string result.
I seem to have found a working Rust equivalent, but it's... monstrous, to say the least. What would be a more idiomatic equivalent?
format!("{}{}", result, consonants.chars().nth(rand::thread_rng().gen_range(1, consonants.chars().count())).unwrap().to_string());
A few things:
You don't need to use format!() here. There is String::push() which appends a single char.
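There is also the rand::sample() function, which can randomly choose multiple elements from an iterator (this is the rand API as of this writing; later releases of the crate reorganized it under rand::seq). This looks like the perfect fit!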
So let's see how this fits together! I created three different versions for different use cases.
1. Unicode string (the general case)
let consonants = "bcdfghjklmnpqrstvwxyz";
let mut result = String::new();
// Sample one element from the iterator, then take the first (and only)
// element of the returned vector.
result.push(rand::sample(&mut rand::thread_rng(), consonants.chars(), 1)[0]);
We sample only one element from the iterator and immediately push it to the string. Still not as short as with C's rand(), but please note that rand() is considered harmful for any kind of serious use! Using C++'s <random> header is a lot better, but will require a little bit more code, too. Additionally, your C++ version can't handle multi-byte characters (e.g. UTF-8 encoding), while the Rust version has full UTF-8 support.
2. ASCII string
However, if you only want to have a string with English consonants, then UTF-8 is not needed and we can make use of O(1) indexing, by using a byte slice:
use rand::{thread_rng, Rng};
let consonants = b"bcdfghjklmnpqrstvwxyz";
let mut result = String::new();
// `choose` returns Option<&u8>; `cloned()` converts it into Option<u8>,
// `unwrap()` is fine because we know `consonants` is not empty,
// and `into()` converts the `u8` into a `char`.
result.push(thread_rng().choose(consonants).cloned().unwrap().into());
3. Collection of characters with Unicode support
As mentioned in the comments, you probably just want a collection of characters ("consonants"). This means, we don't have to use a string, but rather an array of chars. So here is one last version which does have UTF-8 support and avoids O(n) indexing:
use rand::{thread_rng, Rng};
// If you need to avoid the heap allocation here, you can create a static
// array like this: let consonants = ['b', 'c', 'd', ...];
let consonants: Vec<_> = "bcdfghjklmnpqrstvwxyz".chars().collect();
let mut result = String::new();
result.push(*thread_rng().choose(&consonants).unwrap());
Rust has FromStr, however as far as I can see this only takes Unicode text input. Is there an equivalent to this for [u8] arrays?
By "parse" I mean take ASCII characters and return an integer, like C's atoi does.
Or do I need to either...
Convert the u8 array to a string first, then call FromStr.
Call out to libc's atoi.
Write an atoi in Rust.
In nearly all cases the first option is reasonable; however, there are cases where files may be very large, with no predefined encoding, or contain mixed binary and text, where it's most straightforward to read integer numbers as bytes.
No, the standard library has no such feature, but it doesn't need one.
As stated in the comments, the raw bytes can be converted to a &str via:
str::from_utf8
str::from_utf8_unchecked
Neither of these performs extra allocation. The first one ensures the bytes are valid UTF-8; the second does not (and is an unsafe function for that reason). Everyone should use the checked form until profiling proves that it's a bottleneck, and switch to the unchecked form only once it's proven safe to do so.
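To show what the checked form buys you, here is a sketch that reports where invalid input starts instead of panicking. The wrapper checked_text is a hypothetical helper; the calls it makes (str::from_utf8, Utf8Error::valid_up_to) are standard library.

```rust
use std::str;

/// Returns the text if `bytes` is valid UTF-8, otherwise the number of
/// leading bytes that were still valid.
fn checked_text(bytes: &[u8]) -> Result<&str, usize> {
    str::from_utf8(bytes).map_err(|e| e.valid_up_to())
}

fn main() {
    println!("{:?}", checked_text(b"123"));
    // 0xff can never appear in valid UTF-8, so this input is rejected.
    println!("{:?}", checked_text(b"12\xff3"));
}
```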
If bytes deeper in the data need to be parsed, a slice of the raw bytes can be obtained before conversion:
use std::str;
fn main() {
let raw_data = b"123132";
let the_bytes = &raw_data[1..4];
let the_string = str::from_utf8(the_bytes).expect("not UTF-8");
let the_number: u64 = the_string.parse().expect("not a number");
assert_eq!(the_number, 231);
}
As in other code, these lines can be extracted into a function or a trait to allow for reuse. However, once that path is followed, it would be a good idea to look into one of the many great crates aimed at parsing. This is especially true if there's a need to parse binary data in addition to textual data.
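One possible shape for such a function, as a sketch: the name parse_ascii_u64 and the choice to collapse both failure modes into None are my assumptions, not part of the standard library.

```rust
use std::str;

/// Parses `bytes` as a decimal `u64`.
/// Returns `None` if the bytes are not valid UTF-8 or not a number.
fn parse_ascii_u64(bytes: &[u8]) -> Option<u64> {
    str::from_utf8(bytes).ok()?.parse().ok()
}

fn main() {
    let raw_data = b"123132";
    println!("{:?}", parse_ascii_u64(&raw_data[1..4]));
    println!("{:?}", parse_ascii_u64(b"12a"));
}
```

Returning a Result with a dedicated error type would be the better choice if callers need to tell the two failures apart.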
I do not know of any way in the standard library, but maybe the atoi crate works for you? Full disclosure: I am its author.
use atoi::atoi;
let (number, digits) = atoi::<u32>(b"42 is the answer"); //returns (42,2)
You can check if the second element of the tuple is a zero to see if the slice starts with a digit.
let (number, digits) = atoi::<u32>(b"x"); //returns (0,0)
let (number, digits) = atoi::<u32>(b"0"); //returns (0,1)
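If adding a dependency is not an option, the same idea can be hand-rolled with the standard library only. This is a rough sketch and not the atoi crate's implementation; the name parse_leading_u32 and the decision to return None on overflow are assumptions made for the example.

```rust
/// Parses the leading ASCII digits of `bytes`.
/// Returns (value, number of digits consumed), or `None` on overflow.
fn parse_leading_u32(bytes: &[u8]) -> Option<(u32, usize)> {
    let mut value: u32 = 0;
    let mut used = 0;
    for &b in bytes {
        if !b.is_ascii_digit() {
            break;
        }
        // Checked arithmetic so that overly long numbers do not wrap around.
        value = value.checked_mul(10)?.checked_add(u32::from(b - b'0'))?;
        used += 1;
    }
    Some((value, used))
}

fn main() {
    println!("{:?}", parse_leading_u32(b"42 is the answer"));
    println!("{:?}", parse_leading_u32(b"x"));
}
```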
I'd like to capitalize the first letter of a &str. It's a simple problem and I hope for a simple solution. Intuition tells me to do something like this:
let mut s = "foobar";
s[0] = s[0].to_uppercase();
But &strs can't be indexed like this. The only way I've been able to do it seems overly convoluted. I convert the &str to an iterator, convert the iterator to a vector, upper case the first item in the vector, which creates an iterator, which I index into, creating an Option, which I unwrap to give me the upper-cased first letter. Then I convert the vector into an iterator, which I convert into a String, which I convert to a &str.
let s1 = "foobar";
let mut v: Vec<char> = s1.chars().collect();
v[0] = v[0].to_uppercase().nth(0).unwrap();
let s2: String = v.into_iter().collect();
let s3 = &s2;
Is there an easier way than this, and if so, what? If not, why is Rust designed this way?
Similar question
Why is it so convoluted?
Let's break it down, line-by-line
let s1 = "foobar";
We've created a literal string that is encoded in UTF-8. UTF-8 allows us to encode the 1,114,112 code points of Unicode in a manner that's pretty compact if you come from a region of the world that types in mostly characters found in ASCII, a standard created in 1963. UTF-8 is a variable length encoding, which means that a single code point might take from 1 to 4 bytes. The shorter encodings are reserved for ASCII, but many Kanji take 3 bytes in UTF-8.
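The variable length is easy to observe with char::len_utf8. A small sketch (utf8_lengths is a made-up helper name):

```rust
/// Returns the number of UTF-8 bytes each character of `s` occupies.
fn utf8_lengths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

fn main() {
    // 'a' (ASCII), 'é' (U+00E9), '難' (U+96E3) and '😎' (U+1F60E)
    println!("{:?}", utf8_lengths("a\u{e9}\u{96e3}\u{1f60e}"));
}
```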
let mut v: Vec<char> = s1.chars().collect();
This creates a vector of characters. A character is a 32-bit number that directly maps to a code point. If we started with ASCII-only text, we've quadrupled our memory requirements. If we had a bunch of characters from the astral plane, then maybe we haven't used that much more.
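A sketch of that memory comparison, counting only the character payload and ignoring the Vec's own bookkeeping and spare capacity (the helper name is hypothetical):

```rust
use std::mem::size_of;

/// Returns (bytes used by the UTF-8 text, bytes used by the same text as `char`s).
fn str_vs_char_vec_bytes(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count() * size_of::<char>())
}

fn main() {
    println!("{:?}", str_vs_char_vec_bytes("foobar")); // ASCII only: four times larger
    println!("{:?}", str_vs_char_vec_bytes("\u{1f60e}")); // astral plane: same size
}
```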
v[0] = v[0].to_uppercase().nth(0).unwrap();
This grabs the first code point and requests that it be converted to an uppercase variant. Unfortunately for those of us who grew up speaking English, there's not always a simple one-to-one mapping of a "small letter" to a "big letter". Side note: we call them upper- and lower-case because one box of letters was above the other box of letters back in the day.
This code would panic if to_uppercase yielded nothing for a code point, but that does not happen: a code point with no uppercase variant is yielded unchanged. It can still fail semantically when a code point has an uppercase variant that has multiple characters, such as the German ß, because only the first of them is kept. Note that ß may never actually be capitalized in The Real World; this is just the example I can always remember and search for. As of 2017-06-29, in fact, the official rules of German spelling have been updated so that both "ẞ" and "SS" are valid capitalizations!
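To see what char::to_uppercase actually yields in these cases, here is a small sketch (upper is a hypothetical helper that collects the iterator into a String):

```rust
/// Collects everything `char::to_uppercase` yields for `c` into a String.
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

fn main() {
    println!("{}", upper('a')); // one-to-one mapping
    println!("{}", upper('\u{df}')); // ß expands to two characters
    println!("{}", upper('1')); // no uppercase mapping: yielded unchanged
}
```

Taking only nth(0) of the ß result is what silently drops the second S.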
let s2: String = v.into_iter().collect();
Here we convert the characters back into UTF-8 and require a new allocation to store them in, as the original variable was stored in constant memory so as to not take up memory at run time.
let s3 = &s2;
And now we take a reference to that String.
It's a simple problem
Unfortunately, this is not true. Perhaps we should endeavor to convert the world to Esperanto?
I presume char::to_uppercase already properly handles Unicode.
Yes, I certainly hope so. Unfortunately, Unicode isn't enough in all cases.
Thanks to huon for pointing out the Turkish I, where both the upper (İ) and lower case (i) versions have a dot. That is, there is no one proper capitalization of the letter i; it depends on the locale of the source text as well.
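The standard library's case conversion takes no locale, which a short sketch makes visible (shout is a made-up wrapper around str::to_uppercase):

```rust
/// Uppercases `s` with the locale-independent rules of the standard library.
fn shout(s: &str) -> String {
    s.to_uppercase()
}

fn main() {
    // A Turkish reader would expect a dotted capital İ (U+0130) here,
    // but the result uses a plain ASCII 'I'.
    println!("{}", shout("istanbul"));
}
```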
why the need for all data type conversions?
Because the data types you are working with are important when you are worried about correctness and performance. A char is 32 bits and a string is UTF-8 encoded. They are different things.
indexing could return a multi-byte, Unicode character
There may be some mismatched terminology here. A char is a multi-byte Unicode character.
Slicing a string is possible if you go byte-by-byte, but the standard library will panic if you are not on a character boundary.
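If a panic is not acceptable, str::get returns an Option instead, and is_char_boundary lets you test an offset up front. A minimal sketch (safe_prefix is a hypothetical helper):

```rust
/// Returns the first `end` bytes of `s`, or `None` if `end` is out of range
/// or falls in the middle of a character.
fn safe_prefix(s: &str, end: usize) -> Option<&str> {
    s.get(..end)
}

fn main() {
    let s = "\u{e9}t\u{e9}"; // "été": é takes two bytes, t takes one
    println!("{:?}", safe_prefix(s, 1)); // inside the first é
    println!("{:?}", safe_prefix(s, 2)); // on a boundary
    println!("{}", s.is_char_boundary(1));
}
```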
One of the reasons that indexing a string to get a character was never implemented is because so many people misuse strings as arrays of ASCII characters. Indexing a string to set a character could never be efficient - you'd have to be able to replace 1-4 bytes with a value that is also 1-4 bytes, causing the rest of the string to bounce around quite a lot.
to_uppercase could return an upper case character
As mentioned above, ß is a single character that, when capitalized, becomes two characters.
Solutions
See also trentcl's answer which only uppercases ASCII characters.
Original
If I had to write the code, it'd look like:
fn some_kind_of_uppercase_first_letter(s: &str) -> String {
let mut c = s.chars();
match c.next() {
None => String::new(),
Some(f) => f.to_uppercase().chain(c).collect(),
}
}
fn main() {
println!("{}", some_kind_of_uppercase_first_letter("joe"));
println!("{}", some_kind_of_uppercase_first_letter("jill"));
println!("{}", some_kind_of_uppercase_first_letter("von Hagen"));
println!("{}", some_kind_of_uppercase_first_letter("ß"));
}
But I'd probably search for uppercase or unicode on crates.io and let someone smarter than me handle it.
Improved
Speaking of "someone smarter than me", Veedrac points out that it's probably more efficient to convert the iterator back into a slice after the first capital codepoints are accessed. This allows for a memcpy of the rest of the bytes.
fn some_kind_of_uppercase_first_letter(s: &str) -> String {
let mut c = s.chars();
match c.next() {
None => String::new(),
Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
}
}
Is there an easier way than this, and if so, what? If not, why is Rust designed this way?
Well, yes and no. Your code is, as the other answer pointed out, not correct, and will panic if you give it something like བོད་སྐད་ལ་. So doing this with Rust's standard library is even harder than you initially thought.
However, Rust is designed to encourage code reuse and make bringing in libraries easy. So the idiomatic way to capitalize a string is actually quite palatable:
extern crate inflector;
use inflector::Inflector;
let capitalized = "some string".to_title_case();
It's not especially convoluted if you are able to limit your input to ASCII-only strings.
Since Rust 1.23, str has a make_ascii_uppercase method (in older Rust versions, it was available through the AsciiExt trait). This means you can uppercase ASCII-only string slices with relative ease:
fn make_ascii_titlecase(s: &mut str) {
if let Some(r) = s.get_mut(0..1) {
r.make_ascii_uppercase();
}
}
This will turn "taylor" into "Taylor", but it won't turn "édouard" into "Édouard".
Use with caution.
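For completeness, a sketch of calling it on an owned String; the titlecased wrapper is a hypothetical convenience added for this example, while make_ascii_titlecase is the function from above.

```rust
fn make_ascii_titlecase(s: &mut str) {
    // `get_mut(0..1)` is `None` for an empty string or when the first
    // character is longer than one byte, so those inputs are left untouched.
    if let Some(r) = s.get_mut(0..1) {
        r.make_ascii_uppercase();
    }
}

/// Copies `input` and title-cases the copy in place.
fn titlecased(input: &str) -> String {
    let mut owned = input.to_owned();
    make_ascii_titlecase(&mut owned); // &mut String coerces to &mut str
    owned
}

fn main() {
    println!("{}", titlecased("taylor"));
    println!("{}", titlecased("\u{e9}douard"));
}
```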
I did it this way:
fn str_cap(s: &str) -> String {
format!("{}{}", (&s[..1].to_string()).to_uppercase(), &s[1..])
}
If it is not an ASCII string:
fn str_cap(s: &str) -> String {
format!("{}{}", s.chars().next().unwrap().to_uppercase(),
s.chars().skip(1).collect::<String>())
}
The OP's approach taken further:
replace the first character with its uppercase representation
let mut s = "foobar".to_string();
let r = s.remove(0).to_uppercase().to_string() + &s;
or
let r = format!("{}{s}", s.remove(0).to_uppercase());
println!("{r}");
This works with Unicode characters as well, e.g. "😎foobar".
If the first character is guaranteed to be ASCII, it can be changed to a capital letter in place:
let mut s = "foobar".to_string();
if !s.is_empty() {
s[0..1].make_ascii_uppercase(); // Foobar
}
This panics if the first character is not ASCII!
Since the method to_uppercase() returns a new String, you should be able to just add the remainder of the string like so.
This was tested in Rust version 1.57+ but is likely to work in any version that supports slicing. Note that it panics on an empty string or when the first character is longer than one byte.
fn uppercase_first_letter(s: &str) -> String {
s[0..1].to_uppercase() + &s[1..]
}
Here's a version that is a bit slower than @Shepmaster's improved version, but also more idiomatic:
fn capitalize_first(s: &str) -> String {
let mut chars = s.chars();
chars
.next()
.map(|first_letter| first_letter.to_uppercase())
.into_iter()
.flatten()
.chain(chars)
.collect()
}
This is how I solved this problem. Notice that I had to check whether self is ASCII before transforming to uppercase.
trait TitleCase {
fn title(&self) -> String;
}
impl TitleCase for &str {
fn title(&self) -> String {
if !self.is_ascii() || self.is_empty() {
return String::from(*self);
}
let (head, tail) = self.split_at(1);
head.to_uppercase() + tail
}
}
pub fn main() {
println!("{}", "bruno".title());
println!("{}", "b".title());
println!("{}", "🦀".title());
println!("{}", "ß".title());
println!("{}", "".title());
println!("{}", "བོད་སྐད་ལ".title());
}
Output
Bruno
B
🦀
ß
བོད་སྐད་ལ
Inspired by the get_mut examples, I coded something like this:
fn make_capital(in_str: &str) -> String {
    let mut v = String::from(in_str);
    // `get_mut(0..1)` is `None` for empty input or a multi-byte first character.
    if let Some(first) = v.get_mut(0..1) {
        first.make_ascii_uppercase();
    }
    v
}