How to shuffle a str in place - rust

I want to shuffle a String in place in Rust, but I seem to miss something. The fix is probably trivial...
use std::rand::{Rng, thread_rng};
fn main() {
// I want to shuffle this string...
let mut value: String = "SomeValue".to_string();
let mut bytes = value.as_bytes();
let mut slice: &mut [u8] = bytes.as_mut_slice();
thread_rng().shuffle(slice);
println!("{}", value);
}
The error I get is
<anon>:8:36: 8:41 error: cannot borrow immutable dereference of `&`-pointer `*bytes` as mutable
<anon>:8 let mut slice: &mut [u8] = bytes.as_mut_slice();
^~~~~
I read about String::as_mut_vec() but it's unsafe so I'd rather not use it.

There's no very good way to do this, partly due to the nature of the UTF-8 encoding of strings, and partly due to the inherent properties of Unicode and text.
There are at least three layers of things that could be shuffled in a UTF-8 string:
the raw bytes
the encoded codepoints
the graphemes
Shuffling raw bytes is likely to give an invalid UTF-8 string as output unless the string is entirely ASCII. Non-ASCII characters are encoded as special sequences of multiple bytes, and shuffling these will almost certainly not leave them in the right order at the end. Hence shuffling bytes is often not good.
Shuffling codepoints (char in Rust) makes a little bit more sense, but there is still the concept of "special sequences", where so-called combining characters can be layered on to a single letter adding diacritics etc (e.g. letters like ä can be written as a plus U+0308, the codepoint representing the diaeresis). Hence shuffling characters won't give an invalid UTF-8 string, but it may break up these codepoint sequences and give nonsense output.
This brings me to graphemes: the sequences of codepoints that make up a single visible character (like ä is still a single grapheme when written as one or as two codepoints). This will give the most reliably sensible answer.
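To make the codepoint-versus-grapheme distinction concrete, here is a small sketch using only the standard library. The helper name reverse_code_points is made up for illustration, and reversing is just a stand-in for any reordering of codepoints:

```rust
/// Reverse a string code point by code point (a stand-in for any reordering).
fn reverse_code_points(s: &str) -> String {
    s.chars().rev().collect()
}

fn main() {
    let precomposed = "\u{00E4}"; // "ä" as a single code point
    let combining = "a\u{0308}"; // 'a' followed by a combining diaeresis

    // Both render as "ä", but they differ in bytes and code points.
    println!("{} bytes, {} chars", precomposed.len(), precomposed.chars().count());
    println!("{} bytes, {} chars", combining.len(), combining.chars().count());

    // Reordering code points moves the diaeresis from the 'a' onto the 'b'.
    let reordered = reverse_code_points("a\u{0308}b");
    println!("{}", reordered == "b\u{0308}a");
}
```

The result is still valid UTF-8, but the diaeresis now attaches to a different letter, which is exactly the kind of nonsense output described above.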
Then, once you've decided which of these you want to shuffle, the shuffling strategy can be chosen:
if the string is guaranteed to be purely ASCII, shuffling the bytes with .shuffle is sensible (with the ASCII assumption, this is equivalent to the others)
otherwise, there's no standard way to operate in-place, one would get the elements as an iterator (.chars() for codepoints or .graphemes(true) for graphemes), place them into a vector with .collect::<Vec<_>>(), shuffle the vector, and then collect everything back into a new String with e.g. .iter().map(|x| *x).collect::<String>().
The difficulty of handling codepoints and graphemes is because UTF-8 does not encode them as fixed width, so there's no way to take a random codepoint/grapheme out and insert it somewhere else, or otherwise swap two elements efficiently, without just decoding everything into an external Vec.
Not being in-place is unfortunate, but strings are hard.
(If your strings are guaranteed to be ASCII, then using a type like the Ascii provided by ascii would be a good way to keep things straight, at the type-level.)
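The decode, shuffle, re-encode route for codepoints might look roughly like the sketch below. To keep it free of external crates, it uses a tiny xorshift generator as a hypothetical stand-in for a real RNG; XorShift, shuffle_chars and the seed parameter are all names invented here, and real code would more likely use the rand crate:

```rust
/// Hypothetical stand-in for a real RNG such as the rand crate (xorshift64).
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Shuffle the code points of `s` into a new String (Fisher-Yates).
fn shuffle_chars(s: &str, seed: u64) -> String {
    // Decode into a Vec<char> so that elements have a fixed width.
    let mut chars: Vec<char> = s.chars().collect();
    let mut rng = XorShift(seed.max(1)); // xorshift state must be non-zero
    for i in (1..chars.len()).rev() {
        // Pick j in 0..=i; modulo bias is ignored in this sketch.
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        chars.swap(i, j);
    }
    // Re-encode as UTF-8 in a freshly allocated String.
    chars.into_iter().collect()
}

fn main() {
    println!("{}", shuffle_chars("selam dünya", 42));
}
```

The output always contains the same codepoints as the input, but combining sequences can still be torn apart, as discussed above.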
As an example of the difference of the three things, take a look at:
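use unicode_segmentation::UnicodeSegmentation; // external crate; graphemes() is no longer part of std
fn main() {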
let s = "U͍̤͕̜̲̼̜n̹͉̭͜ͅi̷̪c̠͍̖̻o̸̯̖de̮̻͍̤";
println!("bytes: {}", s.bytes().count());
println!("chars: {}", s.chars().count());
println!("graphemes: {}", s.graphemes(true).count());
}
It prints:
bytes: 57
chars: 32
graphemes: 7
(Generate your own; it demonstrates putting multiple combining characters onto a single letter.)

Putting together the suggestion above:
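// Requires the rand (0.8) and unicode-segmentation crates;
// std::rand and str::graphemes are no longer part of std.
use rand::seq::SliceRandom;
use rand::thread_rng;
use unicode_segmentation::UnicodeSegmentation;
fn str_shuffled(s: &str) -> String {
    // Decode into grapheme clusters, shuffle those, then re-join them.
    let mut graphemes = s.graphemes(true).collect::<Vec<&str>>();
    graphemes.shuffle(&mut thread_rng());
    graphemes.into_iter().collect::<String>()
}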
fn main() {
println!("{}", str_shuffled("Hello, World!"));
println!("{}", str_shuffled("selam dünya"));
println!("{}", str_shuffled("你好世界"));
println!("{}", str_shuffled("γειά σου κόσμος"));
println!("{}", str_shuffled("Здравствулте мир"));
}

I am also a beginner with Rust, but what about:
fn main() {
// I want to shuffle this string...
let value = "SomeValue".to_string();
let mut bytes = value.into_bytes();
bytes[0] = bytes[1]; // Shuffle takes place.. sorry but std::rand::thread_rng is not available in the Rust installed on my current machine.
match String::from_utf8(bytes) { // Should not copy the contents according to documentation.
Ok(s) => println!("{}", s),
_ => println!("Error occurred!")
}
}
Also keep in mind that Rust default string encoding is UTF-8 when fiddling around with sequences of bytes. ;)
This was a great suggestion; it led me to the following solution, thanks!
use std::rand::{Rng, thread_rng};
fn main() {
// I want to shuffle this string...
let value: String = "SomeValue".to_string();
let mut bytes = value.into_bytes();
thread_rng().shuffle(&mut *bytes.as_mut_slice());
match String::from_utf8(bytes) { // Should not copy the contents according to documentation.
Ok(s) => println!("{}", s),
_ => println!("Error occurred!")
}
}
rustc 0.13.0-nightly (ad9e75938 2015-01-05 00:26:28 +0000)

Related

How to change a String into a Vec and can also modify Vec's value in Rust?

I want to change a String into a vector of bytes and also modify its values. I looked this up and found "How do I convert a string into a vector of bytes in rust?",
but that only gives me a reference, and I cannot modify the vector. I want a to be 0, b to be 1 and so on, so after converting to bytes I also need to subtract 97. Here is my attempt:
fn main() {
let s: String = "abcdefg".to_string();
let mut vec = (s.as_bytes()).clone();
println!("{:?}", vec);
for i in 0..vec.len() {
vec[i] -= 97;
}
println!("{:?}", vec);
}
but the compiler says
error[E0594]: cannot assign to `vec[_]`, which is behind a `&` reference
Can anyone help me to fix this?
You could get a Vec<u8> out of the String with the into_bytes method. An even better way, though, may be to iterate over the String's bytes with the bytes method, do the maths on the fly, and then collect the result:
fn main() {
let s = "abcdefg";
let vec: Vec<u8> = s.bytes().map(|b| b - b'a').collect();
println!("{:?}", vec); // [0, 1, 2, 3, 4, 5, 6]
}
But as @SvenMarnach correctly points out, this won't re-use s's buffer but will allocate a new one. So, unless you need s again, the into_bytes method will be more efficient.
Strings in Rust are encoded in UTF-8. The (safe) interface of the String type enforces that the underlying buffer always is valid UTF-8, so it can't allow direct arbitrary byte modifications. However, you can convert a String into a Vec<u8> using the into_bytes() method. You can then modify the vector, and potentially convert it back to a string using String::from_utf8() if desired. The last step will verify that the buffer still is valid UTF-8, and will fail if it isn't.
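A minimal sketch of that round trip, using only the standard library. The helper set_first_byte is a made-up name for illustration:

```rust
use std::string::FromUtf8Error;

/// Overwrite the first byte of a String and try to turn it back into a String.
fn set_first_byte(s: String, byte: u8) -> Result<String, FromUtf8Error> {
    // into_bytes consumes the String and hands over its buffer.
    let mut bytes: Vec<u8> = s.into_bytes();
    if let Some(first) = bytes.first_mut() {
        *first = byte;
    }
    // from_utf8 validates the bytes; it fails if they are no longer UTF-8.
    String::from_utf8(bytes)
}

fn main() {
    // Replacing an ASCII byte with another ASCII byte keeps the buffer valid.
    println!("{:?}", set_first_byte(String::from("abc"), b'x'));
    // 0xFF never appears in valid UTF-8, so the conversion back is rejected.
    println!("{}", set_first_byte(String::from("abc"), 0xFF).is_err());
}
```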
Instead of modifying the bytes of the string, you could also consider modifying the characters, which are potentially encoded by multiple bytes in the UTF-8 encoding. You can iterate over the characters of the string using the chars() method, convert each character to whatever you want, and then collect into a new string, or alternatively into a vector of integers, depending on your needs.
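For the character-based route, something along these lines should work. Both helper names are invented for this sketch, and letter_indices simply skips anything that is not a lowercase ASCII letter, which is an assumption about what the asker wants:

```rust
/// Map 'a' to 0, 'b' to 1, and so on, ignoring every other character.
fn letter_indices(s: &str) -> Vec<u32> {
    s.chars()
        .filter(|c| c.is_ascii_lowercase())
        .map(|c| c as u32 - 'a' as u32)
        .collect()
}

/// Convert each character and collect into a new String instead.
fn ascii_shout(s: &str) -> String {
    s.chars().map(|c| c.to_ascii_uppercase()).collect()
}

fn main() {
    println!("{:?}", letter_indices("abcdefg"));
    println!("{}", ascii_shout("abcdefg"));
}
```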
To understand what's going on, check the type of the vec variable. If you don't have an IDE/editor that can display the type to you, you can do this:
let mut vec: () = (s.as_bytes()).clone();
The resulting error message is explanative:
3 | let mut vec: () = (s.as_bytes()).clone();
| -- ^^^^^^^^^^^^^^^^^^^^^^ expected `()`, found `&[u8]`
So, what's happening is that the .clone() simply cloned the reference returned by as_bytes(), and didn't create a Vec<u8> from the &[u8]. In general, you can use .to_owned() in this kind of case, but in this specific case, using .into_bytes() on the String is best.
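For completeness, here is roughly what the .to_owned() variant looks like when the original string is still needed afterwards. The function name offsets_from_a is hypothetical, and the sketch assumes the input consists of lowercase ASCII letters only:

```rust
/// Copy the bytes of `s` into an owned Vec<u8> and subtract b'a' from each.
/// Assumes lowercase ASCII input; other bytes would underflow.
fn offsets_from_a(s: &str) -> Vec<u8> {
    // to_owned() on a &[u8] allocates a Vec<u8>;
    // clone() would only copy the reference.
    let mut vec: Vec<u8> = s.as_bytes().to_owned();
    for b in vec.iter_mut() {
        *b -= b'a';
    }
    vec
}

fn main() {
    let s = String::from("abcdefg");
    println!("{:?}", offsets_from_a(&s));
    println!("{}", s); // s is untouched and still usable
}
```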

How to generate a random String of alphanumeric chars?

The first part of the question is probably pretty common and there are enough code samples that explain how to generate a random string of alphanumerics. The piece of code I use is from here.
use rand::{thread_rng, Rng};
use rand::distributions::Alphanumeric;
fn main() {
let rand_string: String = thread_rng()
.sample_iter(&Alphanumeric)
.take(30)
.collect();
println!("{}", rand_string);
}
This piece of code does however not compile, (note: I'm on nightly):
error[E0277]: a value of type `String` cannot be built from an iterator over elements of type `u8`
--> src/main.rs:8:10
|
8 | .collect();
| ^^^^^^^ value of type `String` cannot be built from `std::iter::Iterator<Item=u8>`
|
= help: the trait `FromIterator<u8>` is not implemented for `String`
Ok, the elements that are generated are of type u8. So I guess this is an array or vector of u8:
use rand::{thread_rng, Rng};
use rand::distributions::Alphanumeric;
fn main() {
let r = thread_rng()
.sample_iter(&Alphanumeric)
.take(30)
.collect::<Vec<_>>();
let s = String::from_utf8_lossy(&r);
println!("{}", s);
}
And this compiles and works!
2dCsTqoNUR1f0EzRV60IiuHlaM4TfK
All good, except that I would like to ask if someone could explain what exactly happens regarding the types and how this can be optimised.
Questions
.sample_iter(&Alphanumeric) produces u8 and not chars?
How can I avoid the second variable s and directly interpret an u8 as a utf-8 character? I guess the representation in memory would not change at all?
The length of these strings should always be 30. How can I optimise the heap allocation of a Vec away? Also they could actually be char[] instead of Strings.
.sample_iter(&Alphanumeric) produces u8 and not chars?
Yes, this was changed in rand v0.8. You can see in the docs for 0.7.3:
impl Distribution<char> for Alphanumeric
But then in the docs for 0.8.0:
impl Distribution<u8> for Alphanumeric
How can I avoid the second variable s and directly interpret an u8 as a utf-8 character? I guess the representation in memory would not change at all?
There are a couple of ways to do this, the most obvious being to just cast every u8 to a char:
let s: String = thread_rng()
.sample_iter(&Alphanumeric)
.take(30)
.map(|x| x as char)
.collect();
Or, using the From<u8> instance of char:
let s: String = thread_rng()
.sample_iter(&Alphanumeric)
.take(30)
.map(char::from)
.collect();
Of course here, since you know every u8 must be valid UTF-8, you can use String::from_utf8_unchecked, which is faster than from_utf8_lossy (although probably around the same speed as the as char method):
let s = unsafe {
String::from_utf8_unchecked(
thread_rng()
.sample_iter(&Alphanumeric)
.take(30)
.collect::<Vec<_>>(),
)
};
If, for some reason, the unsafe bothers you and you want to stay safe, then you can use the slower String::from_utf8 and unwrap the Result so you get a panic instead of UB (even though the code should never panic or UB):
let s = String::from_utf8(
thread_rng()
.sample_iter(&Alphanumeric)
.take(30)
.collect::<Vec<_>>(),
).unwrap();
The length of these strings should always be 30. How can I optimise the heap allocation of a Vec away? Also they could actually be char[] instead of Strings.
First of all, trust me, you don't want arrays of chars. They are not fun to work with. If you want a stack string, have a u8 array then use a function like std::str::from_utf8 or the faster std::str::from_utf8_unchecked (again only usable since you know valid utf8 will be generated.)
As to optimizing the heap allocation away, refer to this answer. Basically, it's not possible without a bit of hackiness/ugliness (such as making your own function that collects an iterator into an array of 30 elements).
Once const generics are finally stabilized, there'll be a much prettier solution.
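Const generics over integers have been stable since Rust 1.51, so a helper of the kind mentioned above can now be sketched without macros. collect_array is a name made up here, and the byte source in main is only a stand-in; with rand 0.8 it could presumably be thread_rng().sample_iter(&Alphanumeric) instead:

```rust
/// Fill a fixed-size array from an iterator; None if the iterator runs short.
fn collect_array<I: Iterator<Item = u8>, const N: usize>(mut iter: I) -> Option<[u8; N]> {
    let mut buf = [0u8; N];
    for slot in buf.iter_mut() {
        *slot = iter.next()?;
    }
    Some(buf)
}

fn main() {
    // Stand-in byte source (an endlessly repeating alphanumeric alphabet).
    let source = b"abcdefghijklmnopqrstuvwxyz0123456789".iter().copied().cycle();
    // The array lives on the stack; no Vec or String is allocated.
    let buf: [u8; 30] = collect_array(source).expect("source is infinite");
    // Validating view of the stack buffer as a &str.
    let s = std::str::from_utf8(&buf).expect("ASCII is valid UTF-8");
    println!("{}", s);
}
```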
The first example in the docs for rand::distributions::Alphanumeric shows that if you want to convert the u8s into chars you should map them using the char::from function:
use rand::{thread_rng, Rng};
use rand::distributions::Alphanumeric;
fn main() {
let rand_string: String = thread_rng()
.sample_iter(&Alphanumeric)
.map(char::from) // map added here
.take(30)
.collect();
println!("{}", rand_string);
}

Simpler way to check if string start with a digit in Rust?

What I am currently using is this
fn main() {
let a = "abc123";
let b = "1a2b3c";
println!("{}", a[0..1].chars().all(char::is_numeric));
println!("{}", b[0..1].chars().all(char::is_numeric));
}
Is there a more idiomatic and/or simpler way to do this?
Note: The string is guaranteed to be non empty and made of ASCII characters.
If you are sure that it is non-empty and made out of ascii, you can operate directly on bytes (u8):
a.as_bytes()[0].is_ascii_digit()
or
(b'0'..=b'9').contains(&a.as_bytes()[0])
More general setting (and, in my opinion, more idiomatic):
a.chars().next().unwrap().is_numeric()
The reason all this looks a bit unwieldy is that there may be some things going wrong (that are easily overlooked in other languages):
string might be empty => leads us into Option/unwrap-land
strings in rust are UTF-8 (which basically complicates random-accessing into string; note that rust does not only consider 0-9 as numeric, as shown here)
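Both pitfalls can be side-stepped without unwrap. The following is only a sketch; starts_with_ascii_digit is a made-up helper name:

```rust
/// True if the string is non-empty and its first character is 0-9.
fn starts_with_ascii_digit(s: &str) -> bool {
    // next() is None for an empty string, which maps to false here.
    s.chars().next().map_or(false, |c| c.is_ascii_digit())
}

fn main() {
    println!("{}", starts_with_ascii_digit("1a2b3c"));
    println!("{}", starts_with_ascii_digit("abc123"));
    println!("{}", starts_with_ascii_digit(""));

    // U+0663 (Arabic-Indic digit three) is numeric, but not an ASCII digit.
    let three = '\u{0663}';
    println!("{} {}", three.is_numeric(), three.is_ascii_digit());
}
```

Whether is_numeric or is_ascii_digit is the right test depends on whether digits outside 0-9 should count.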
Starting from your original solution and parse:
fn main() {
let a = "abc123";
let b = "1a2b3c";
println!("{:?}", a[0..1].parse::<u8>().is_ok()); // false
println!("{:?}", b[0..1].parse::<u8>().is_ok()); // true
}
This works if the first character is guaranteed to be ASCII and the string is not empty.

Why is capitalizing the first letter of a string so convoluted in Rust?

I'd like to capitalize the first letter of a &str. It's a simple problem and I hope for a simple solution. Intuition tells me to do something like this:
let mut s = "foobar";
s[0] = s[0].to_uppercase();
But &strs can't be indexed like this. The only way I've been able to do it seems overly convoluted. I convert the &str to an iterator, convert the iterator to a vector, upper case the first item in the vector, which creates an iterator, which I index into, creating an Option, which I unwrap to give me the upper-cased first letter. Then I convert the vector into an iterator, which I convert into a String, which I convert to a &str.
let s1 = "foobar";
let mut v: Vec<char> = s1.chars().collect();
v[0] = v[0].to_uppercase().nth(0).unwrap();
let s2: String = v.into_iter().collect();
let s3 = &s2;
Is there an easier way than this, and if so, what? If not, why is Rust designed this way?
Why is it so convoluted?
Let's break it down, line-by-line
let s1 = "foobar";
We've created a literal string that is encoded in UTF-8. UTF-8 allows us to encode the 1,114,112 code points of Unicode in a manner that's pretty compact if you come from a region of the world that types in mostly characters found in ASCII, a standard created in 1963. UTF-8 is a variable length encoding, which means that a single code point might take from 1 to 4 bytes. The shorter encodings are reserved for ASCII, but many Kanji take 3 bytes in UTF-8.
let mut v: Vec<char> = s1.chars().collect();
This creates a vector of characters. A character is a 32-bit number that directly maps to a code point. If we started with ASCII-only text, we've quadrupled our memory requirements. If we had a bunch of characters from the astral plane, then maybe we haven't used that much more.
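The memory claim is easy to check. In this sketch the two helper names are invented, and only the element storage is counted (not the Vec's own header or spare capacity):

```rust
use std::mem::size_of;

/// Bytes used by the UTF-8 encoding of `s`.
fn utf8_bytes(s: &str) -> usize {
    s.len()
}

/// Bytes used by the elements of a Vec<char> holding the same text.
fn char_vec_bytes(s: &str) -> usize {
    let v: Vec<char> = s.chars().collect();
    v.len() * size_of::<char>()
}

fn main() {
    // ASCII-only text: one byte per character in UTF-8, four as char.
    println!("{} vs {}", utf8_bytes("foobar"), char_vec_bytes("foobar"));
    // U+1D11E (musical G clef) lies outside the BMP: four bytes either way.
    println!("{} vs {}", utf8_bytes("\u{1D11E}"), char_vec_bytes("\u{1D11E}"));
}
```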
v[0] = v[0].to_uppercase().nth(0).unwrap();
This grabs the first code point and requests that it be converted to an uppercase variant. Unfortunately for those of us who grew up speaking English, there's not always a simple one-to-one mapping of a "small letter" to a "big letter". Side note: we call them upper- and lower-case because one box of letters was above the other box of letters back in the day.
This code panics if the string is empty, because v[0] is then out of bounds. A code point with no uppercase variant is not a problem: to_uppercase then yields the character unchanged, so nth(0) always returns something. The code can still fail semantically when a code point has an uppercase variant made of multiple characters, such as the German ß, because only the first of them is kept. Note that ß may never actually be capitalized in The Real World; this is just the example I can always remember and search for. As of 2017-06-29, in fact, the official rules of German spelling have been updated so that both "ẞ" and "SS" are valid capitalizations!
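What char::to_uppercase actually yields in these cases can be checked directly. A small sketch, with upper as a made-up helper name; it relies on the documented behaviour that a char without an uppercase mapping is yielded unchanged:

```rust
/// Collect everything char::to_uppercase yields into a String.
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

fn main() {
    println!("{}", upper('a')); // one-to-one mapping
    println!("{}", upper('\u{00DF}')); // ß expands to two characters
    println!("{}", upper('1')); // no uppercase mapping: yielded unchanged
    println!("{}", upper('\u{0F56}') == "\u{0F56}"); // Tibetan letter, no case
}
```

Taking only nth(0) of the iterator therefore silently drops the second S for ß.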
let s2: String = v.into_iter().collect();
Here we convert the characters back into UTF-8 and require a new allocation to store them in, as the original variable was stored in constant memory so as to not take up memory at run time.
let s3 = &s2;
And now we take a reference to that String.
It's a simple problem
Unfortunately, this is not true. Perhaps we should endeavor to convert the world to Esperanto?
I presume char::to_uppercase already properly handles Unicode.
Yes, I certainly hope so. Unfortunately, Unicode isn't enough in all cases.
Thanks to huon for pointing out the Turkish I, where both the upper (İ) and lower case (i) versions have a dot. That is, there is no one proper capitalization of the letter i; it depends on the locale of the source text as well.
why the need for all data type conversions?
Because the data types you are working with are important when you are worried about correctness and performance. A char is 32-bits and a string is UTF-8 encoded. They are different things.
indexing could return a multi-byte, Unicode character
There may be some mismatched terminology here. A char is a multi-byte Unicode character.
Slicing a string is possible if you go byte-by-byte, but the standard library will panic if you are not on a character boundary.
One of the reasons that indexing a string to get a character was never implemented is because so many people misuse strings as arrays of ASCII characters. Indexing a string to set a character could never be efficient - you'd have to be able to replace 1-4 bytes with a value that is also 1-4 bytes, causing the rest of the string to bounce around quite a lot.
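To illustrate why "setting a character" cannot be a cheap indexed write, here is a sketch built on String::replace_range. The helper replace_first_char is hypothetical:

```rust
/// Replace the first character of `s`, whatever its encoded width.
fn replace_first_char(s: &mut String, replacement: char) {
    if let Some(first) = s.chars().next() {
        let mut buf = [0u8; 4];
        // The old character occupies first.len_utf8() bytes, the new one may
        // need more or fewer, so the rest of the string has to be shifted.
        s.replace_range(..first.len_utf8(), replacement.encode_utf8(&mut buf));
    }
}

fn main() {
    let mut s = String::from("foobar");
    replace_first_char(&mut s, 'é'); // 1 byte replaced by 2 bytes
    println!("{} ({} bytes)", s, s.len());
    replace_first_char(&mut s, 'F'); // 2 bytes replaced by 1 byte
    println!("{} ({} bytes)", s, s.len());
}
```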
to_uppercase could return an upper case character
As mentioned above, ß is a single character that, when capitalized, becomes two characters.
Solutions
See also trentcl's answer which only uppercases ASCII characters.
Original
If I had to write the code, it'd look like:
fn some_kind_of_uppercase_first_letter(s: &str) -> String {
let mut c = s.chars();
match c.next() {
None => String::new(),
Some(f) => f.to_uppercase().chain(c).collect(),
}
}
fn main() {
println!("{}", some_kind_of_uppercase_first_letter("joe"));
println!("{}", some_kind_of_uppercase_first_letter("jill"));
println!("{}", some_kind_of_uppercase_first_letter("von Hagen"));
println!("{}", some_kind_of_uppercase_first_letter("ß"));
}
But I'd probably search for uppercase or unicode on crates.io and let someone smarter than me handle it.
Improved
Speaking of "someone smarter than me", Veedrac points out that it's probably more efficient to convert the iterator back into a slice after the first capital codepoints are accessed. This allows for a memcpy of the rest of the bytes.
fn some_kind_of_uppercase_first_letter(s: &str) -> String {
let mut c = s.chars();
match c.next() {
None => String::new(),
Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
}
}
Is there an easier way than this, and if so, what? If not, why is Rust designed this way?
Well, yes and no. Your code is, as the other answer pointed out, not correct, and will panic if you give it something like བོད་སྐད་ལ་. So doing this with Rust's standard library is even harder than you initially thought.
However, Rust is designed to encourage code reuse and make bringing in libraries easy. So the idiomatic way to capitalize a string is actually quite palatable:
extern crate inflector;
use inflector::Inflector;
let capitalized = "some string".to_title_case();
It's not especially convoluted if you are able to limit your input to ASCII-only strings.
Since Rust 1.23, str has a make_ascii_uppercase method (in older Rust versions, it was available through the AsciiExt trait). This means you can uppercase ASCII-only string slices with relative ease:
fn make_ascii_titlecase(s: &mut str) {
if let Some(r) = s.get_mut(0..1) {
r.make_ascii_uppercase();
}
}
This will turn "taylor" into "Taylor", but it won't turn "édouard" into "Édouard". (playground)
Use with caution.
I did it this way:
fn str_cap(s: &str) -> String {
format!("{}{}", (&s[..1].to_string()).to_uppercase(), &s[1..])
}
If it is not an ASCII string:
fn str_cap(s: &str) -> String {
format!("{}{}", s.chars().next().unwrap().to_uppercase(),
s.chars().skip(1).collect::<String>())
}
The OP's approach taken further:
replace the first character with its uppercase representation
let mut s = "foobar".to_string();
let r = s.remove(0).to_uppercase().to_string() + &s;
or
let r = format!("{}{s}", s.remove(0).to_uppercase());
println!("{r}");
This works with Unicode characters as well, e.g. "😎foobar".
If the first character is guaranteed to be ASCII, it can be changed to a capital letter in place:
let mut s = "foobar".to_string();
if !s.is_empty() {
s[0..1].make_ascii_uppercase(); // Foobar
}
Panics with a non-ASCII character in the first position!
Since the method to_uppercase() returns a new string, you should be able to just add the remainder of the string like so.
This was tested in Rust version 1.57+ but is likely to work in any version that supports slices.
fn uppercase_first_letter(s: &str) -> String {
s[0..1].to_uppercase() + &s[1..]
}
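Slicing with s[0..1] panics on an empty string or when the first character is wider than one byte. A non-panicking variant of the same idea could use str::get, as in this sketch (uppercase_first_ascii is a made-up name, and returning None for those inputs is just one possible design choice):

```rust
/// Uppercase the first byte-wide character; None if the string is empty
/// or its first character is encoded in more than one byte.
fn uppercase_first_ascii(s: &str) -> Option<String> {
    // get(..1) returns None instead of panicking when 1 is out of range
    // or not on a character boundary.
    let head = s.get(..1)?;
    let tail = &s[1..];
    Some(head.to_uppercase() + tail)
}

fn main() {
    println!("{:?}", uppercase_first_ascii("taylor"));
    println!("{:?}", uppercase_first_ascii("édouard"));
    println!("{:?}", uppercase_first_ascii(""));
}
```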
Here's a version that is a bit slower than @Shepmaster's improved version, but also more idiomatic:
fn capitalize_first(s: &str) -> String {
let mut chars = s.chars();
chars
.next()
.map(|first_letter| first_letter.to_uppercase())
.into_iter()
.flatten()
.chain(chars)
.collect()
}
This is how I solved this problem, notice I had to check if self is not ascii before transforming to uppercase.
trait TitleCase {
fn title(&self) -> String;
}
impl TitleCase for &str {
fn title(&self) -> String {
if !self.is_ascii() || self.is_empty() {
return String::from(*self);
}
let (head, tail) = self.split_at(1);
head.to_uppercase() + tail
}
}
pub fn main() {
println!("{}", "bruno".title());
println!("{}", "b".title());
println!("{}", "🦀".title());
println!("{}", "ß".title());
println!("{}", "".title());
println!("{}", "བོད་སྐད་ལ".title());
}
Output
Bruno
B
🦀
ß
བོད་སྐད་ལ
Inspired by the get_mut examples, I coded something like this:
fn make_capital(in_str : &str) -> String {
let mut v = String::from(in_str);
v.get_mut(0..1).map(|s| { s.make_ascii_uppercase(); &*s });
v
}

Efficiently extract prefix substrings

Currently I'm using the following function to extract prefix substrings:
fn prefix(s: &String, k: usize) -> String {
s.chars().take(k).collect::<String>()
}
This can then be used for comparisons like so:
let my_string = "ACGT".to_string();
let same = prefix(&my_string, 3) == prefix(&my_string, 2);
However, this allocates a new String for each call to prefix, in addition to the processing for the iteration. Most other languages I'm familiar with have an efficient way to do a comparison like this, using just a view of the strings. Is there a way in Rust?
Yes, you can take subslices of strings using the Index operation:
fn prefix(s: &str, k: usize) -> &str {
&s[..k]
}
fn main() {
let my_string = "ACGT".to_string();
let same = prefix(&my_string, 3) == prefix(&my_string, 2);
println!("{}", same);
}
Note that slicing a string uses bytes as the unit, not characters. It is up to the programmer to ensure that the slice lengths lie on valid UTF-8 boundaries. Additionally, you have to ensure that you don't try to slice past the end of the string. Breaking either of these will result in a panic!.
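If a byte budget rather than a character count is what you have, one way to avoid both panics is to clamp the index and walk back to the nearest character boundary. A sketch, with prefix_floor as a hypothetical name:

```rust
/// Longest prefix of `s` that is at most `k` bytes long and valid UTF-8.
fn prefix_floor(s: &str, k: usize) -> &str {
    // Never slice past the end of the string.
    let mut end = k.min(s.len());
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    let s = "héllo"; // 'é' occupies bytes 1 and 2
    println!("{}", prefix_floor(s, 1));
    println!("{}", prefix_floor(s, 2)); // falls back to the boundary at 1
    println!("{}", prefix_floor(s, 3));
    println!("{}", prefix_floor(s, 100));
}
```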
A bit more defensive version would be
fn prefix(s: &str, k: usize) -> &str {
let idx = s.char_indices().nth(k).map(|(idx, _)| idx).unwrap_or(s.len());
&s[0..idx]
}
The key difference is that we use the char_indices iterator, which tells us the byte offsets corresponding to a character. Indexing into a UTF-8 string is an O(n) operation, and Rust doesn't want to hide that algorithmic complexity from you. This still isn't even complete, because there can be combining characters, for example. Dealing with strings is hard, thanks to the complexity of human language.
Most other languages I'm familiar with have an efficient way
Doubtful :-) To be efficient in time, they'd have to know how many bytes to skip ahead for every character. Either they'd have to keep a lookup table for every string or use a fixed-size character encoding. Both of those solutions can use more memory than needed, and a fixed size encoding doesn't even work when you have combining characters, for example.
Of course, other languages could just say "LOL, strings are just arrays of bytes, good luck with treating them correctly", and efficiently ignore your character encoding...
Two additional notes
Your predicate doesn't really make sense. A string of 2 letters will never match one of 3 letters. For strings to match, they must have the same amount of bytes.
You should never need to take &String as a function argument. Taking a &str is a more accepting argument in all cases except for one teeny tiny little case that no one needs — knowing the capacity of a String, but without being able to modify the string.
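The second note is easy to see in practice: a function taking &str accepts string literals, slices and &String alike, because &String coerces to &str. A tiny sketch (byte_len is a made-up function):

```rust
/// Accepts any string view: literals, slices, and borrowed Strings.
fn byte_len(s: &str) -> usize {
    s.len()
}

fn main() {
    let owned = String::from("ACGT");
    println!("{}", byte_len(&owned)); // &String coerces to &str
    println!("{}", byte_len("ACGT")); // string literal
    println!("{}", byte_len(&owned[..2])); // sub-slice, no allocation
}
```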
While Shepmaster's answer is absolutely correct for the general case of string slicing, I'd like to add that sometimes there are easier ways.
If you know in advance the set of characters you're working with ("ATGC" example suggests you're working with nucleobases, so it is possible that these are all the characters you need) then you can use slices of bytes &[u8] instead of string slices &str. You can always get a byte slice out of a string slice and a Vec<u8> out of a String, if necessary:
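let s: String = "ATGC".into();
let ss: &str = &s;
let bs: &[u8] = ss.as_bytes(); // borrows the bytes of s
let b: Vec<u8> = s.clone().into_bytes(); // into_bytes consumes a String, so clone while s is still borrowed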
Also, there are byte slice and byte character literals, just prefix regular string/char literals with b:
let sl: &[u8] = b"ATGC";
let bl: u8 = b'G';
Working with byte slices gives you constant-time indexing (and thus slicing) operations, so checking for prefix equality is easy. This is like Shepmaster's first variant, but without the possibility of a panic from a non-character boundary; it still panics if k is too large:
fn prefix(s: &[u8], k: usize) -> &[u8] {
&s[..k]
}
If you need, you can turn byte slices/vectors back to strings. This operation, of course, checks validity of UTF-8 encoding so it may fail, but if you only work with ASCII, you can safely ignore these errors and just unwrap():
let ss2: &str = std::str::from_utf8(bs).unwrap();
let s2: String = String::from_utf8(b).unwrap();
