Problem splitting regular ASCII symbols in a string


Just had this error pop up while messing around with some graphics for a terminal interface...
thread 'main' panicked at 'byte index 2 is not a char boundary; it is inside '░' (bytes 1..4) of ░▒▓█', src/main.rs:38:6
Can I not use these characters, or do I need to work some magic to support what I thought were default ASCII characters?
(Here's the related code for those wondering.)
// Example call with the same parameters that led to this issue.
charlist(" ░▒▓█".to_string(), 0.66);
// Returns the n-th character in a string.
// (Where N is a float value from 0 to 1,
// 0 being the start of the string and 1 the end.)
fn charlist<'a>(chars: &'a String, amount: f64) -> &'a str {
let chl: f64 = chars.chars().count() as f64; // Length of the string
let chpos = -((amount*chl)%chl) as i32; // Scalar converted to integer position in the string
&chars[chpos as usize..chpos as usize+1] // Slice the single requested character
}

There are a couple of misconceptions you seem to have, so let me address them in order.
░, ▒, ▓ and █ are not ASCII characters! They are Unicode code points outside the ASCII range. You can verify this with the following simple experiment.
fn main() {
let slice = " ░▒▓█";
for c in slice.chars() {
println!("{}, {}", c, c.len_utf8());
}
}
This code has output:
, 1
░, 3
▒, 3
▓, 3
█, 3
As you can see, these "boxy" characters have a length of 3 bytes each! Rust uses UTF-8 encoding for all of its strings. This leads to another misconception.
In this line, &chars[chpos as usize..chpos as usize+1], you are trying to get a slice of one byte in length. String slices in Rust are indexed by bytes, but you tried to slice in the middle of a character (which is 3 bytes long). In general, characters in UTF-8 encoding can be from one to four bytes in length. To get a char's length in bytes you can use the method len_utf8.
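To make the byte-indexing point concrete, here is a minimal sketch using str::get and str::is_char_boundary from the standard library. The helper names try_slice and char_boundaries are made up for this example. Unlike the [] operator, get returns None instead of panicking when a range cuts through a character.

```rust
/// Hypothetical helper: slice by byte range, returning None instead of
/// panicking when the range is out of bounds or splits a character.
fn try_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

/// Hypothetical helper: lists the byte offsets at which a slice of `s`
/// may legally start or end.
fn char_boundaries(s: &str) -> Vec<usize> {
    (0..=s.len()).filter(|&i| s.is_char_boundary(i)).collect()
}
```

For " ░▒▓█" the legal offsets are 0, 1, 4, 7, 10 and 13, so a range such as 1..2 is rejected while 1..4 yields "░".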
You can get an iterator over the characters of a string slice using the method chars. Getting the n-th character is then as easy as calling the iterator's nth method (which counts from zero), so the following holds:
assert_eq!(" ░▒▓█".chars().nth(2).unwrap(), '▒');
If you also want the byte indices of these chars, you can use the method char_indices.
Using f64 values to represent the nth character is odd, and I would encourage you to rethink whether you really want to do this. But if you do, read on.
You must remember that since characters have a variable length, the string slice's len method doesn't return the number of characters, but the slice's length in bytes. To know how many characters are in the string you have no other option than iterating over it. So if, for example, you want the middle character, you must first know how many there are. I can think of two ways you can do this.
You can collect the characters into a Vec<char> (or something similar). Then you will know how many characters there are and can index the nth one in O(1). However, this will result in an additional memory allocation.
You can first count how many characters there are with slice.chars().count(). Then calculate the position of the nth one and get it by iterating over chars again and taking the nth one (as I showed above). This won't result in any additional memory allocation, but it iterates over the string twice (still O(n), with roughly double the work).
Which one you pick is up to you. You will have to make a compromise.
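The two options can be sketched as follows, using the "middle character" example. The function names are made up for illustration, and for an even number of characters both pick the upper of the two middle ones.

```rust
/// Option 1: collect into a Vec<char> (one allocation), then index in O(1).
fn middle_char_collected(s: &str) -> Option<char> {
    let chars: Vec<char> = s.chars().collect();
    chars.get(chars.len() / 2).copied()
}

/// Option 2: count first, then walk again with nth (no allocation, two passes).
fn middle_char_two_pass(s: &str) -> Option<char> {
    let count = s.chars().count();
    s.chars().nth(count / 2)
}
```

If you look up many positions in the same string, collecting once is probably the better trade. For a single lookup the two-pass version avoids the allocation.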
This isn't really a correctness problem, but prefer &str over &String in function arguments (it gives your callers more flexibility). You also don't have to spell out the lifetime when there is only one reference among the arguments and the other one is in the return type. Rust's lifetime elision rules will give them the same lifetime.
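Putting the points above together, here is one possible rewrite of charlist. It is only a sketch under a few assumptions of mine: amounts outside 0.0..=1.0 are clamped rather than wrapped, 1.0 maps to the last character, and the function returns Option<char> instead of a one-character &str.

```rust
/// Maps `amount` in 0.0..=1.0 onto the characters of `chars` and returns the
/// selected one. Out-of-range amounts are clamped (an assumption of this sketch).
fn charlist(chars: &str, amount: f64) -> Option<char> {
    let count = chars.chars().count();
    if count == 0 {
        return None;
    }
    // Scale to a character index; 1.0 would land one past the end, so cap it.
    let index = ((amount.clamp(0.0, 1.0) * count as f64) as usize).min(count - 1);
    chars.chars().nth(index)
}
```

With this version, charlist(" ░▒▓█", 0.66) selects index 3, which is '▓'. Returning a char sidesteps byte slicing entirely. If you need a &str, you can find the byte range with char_indices instead.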

Related

How do I change a character in a string? [duplicate]

This isn't the exact use case, but it is basically what I am trying to do:
let mut username = "John_Smith";
println!("original username: {}",username);
username.set_char_at(4,'.'); // <------------- The part I don't know how to do
println!("new username: {}",username);
I can't figure out how to do this in constant time and using no additional space. I know I could use "replace" but replace is O(n). I could make a vector of the characters but that would require additional space.
I think you could create another variable that is a pointer using something like as_mut_slice, but this is deemed unsafe. Is there a safe way to replace a character in a string in constant time and space?

As of Rust 1.27 you can now use String::replace_range:
let mut username = String::from("John_Smith");
println!("original username: {}", username); // John_Smith
username.replace_range(4..5, ".");
println!("new username: {}", username); // John.Smith
replace_range won't work with &mut str. If the size of the range and the size of the replacement string aren't the same, it has to be able to resize the underlying String, so &mut String is required. But in the case you ask about (replacing a single-byte character with another single-byte character) its memory usage and time complexity are both O(1).
There is a similar method on Vec, Vec::splice. The primary difference between them is that splice returns an iterator that yields the removed items.

In general? For any pair of characters? It's impossible.
A string is not an array. It may be implemented as an array, in some limited contexts.
Rust supports Unicode, which brings some challenges:
a Unicode code point is an integer between 0 and 0x10FFFF (well below 2^24)
a grapheme may be composed of multiple Unicode code points
In order to represent this, a Rust string is (for now) a UTF-8 bytes sequence:
a single Unicode code point might be represented by 1 to 4 bytes
a grapheme might be represented by 1 or more bytes (no upper limit)
and therefore, the very notion of "replacing character i" brings a few challenges:
the byte position of character i lies somewhere between index i and the end of the string; finding out exactly where requires reading the string from the beginning, which is O(N)
switching the i-th character in-place for another requires that both characters take up exactly the same amount of bytes
In general? It's impossible.
In the particular and very specific case where the byte index is known and the old and new characters are known to have the same encoded length, it is doable by directly modifying the byte sequence returned by as_bytes_mut, which is duly marked unsafe since you may inadvertently corrupt the string (remember, this byte sequence must remain valid UTF-8).
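If you can live with the O(N) scan and a possible shift of the tail, the two challenges above can be handled in safe code. The following is a minimal sketch. The helper name replace_nth_char is made up, and "character" here means a single code point, not a grapheme.

```rust
/// Hypothetical helper: replaces the n-th code point of `s` with `replacement`.
/// Returns false if `s` has fewer than n + 1 code points.
fn replace_nth_char(s: &mut String, n: usize, replacement: char) -> bool {
    // O(n) scan: the byte offset of the n-th code point is not known up front.
    let found = s.char_indices().nth(n);
    match found {
        Some((start, old)) => {
            let end = start + old.len_utf8();
            let mut buf = [0u8; 4];
            // replace_range shifts the tail if the encoded lengths differ.
            s.replace_range(start..end, replacement.encode_utf8(&mut buf));
            true
        }
        None => false,
    }
}
```

When the old and new characters have the same encoded length, nothing after the replaced range has to move. Otherwise the rest of the string is shifted, which is why this cannot be constant time in general.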

If you want to handle only ASCII, there is a separate type for that:
use std::ascii::{AsciiCast, OwnedAsciiCast};
fn main() {
let mut ascii = "ascii string".to_string().into_ascii();
*ascii.get_mut(6) = 'S'.to_ascii();
println!("result = {}", ascii);
}
There are some missing pieces (like into_ascii for &str) but it does what you want.
The current implementation of to_/into_ascii fails if the input string is not valid ASCII. There is to_ascii_opt (the old naming for methods that might fail), but it will probably be renamed to to_ascii in the future (and the failing method removed or renamed).
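The traits imported above come from a very early version of Rust, and as far as I know they are not part of today's standard library. For ASCII-only data, a rough modern equivalent can be sketched with String::into_bytes and String::from_utf8. The helper name set_ascii_at is made up for this example.

```rust
/// Hypothetical helper: overwrite the byte at `index` in an ASCII-only string.
/// Returns None if the input or the replacement is not ASCII, or if the
/// index is out of range.
fn set_ascii_at(s: String, index: usize, replacement: u8) -> Option<String> {
    if !s.is_ascii() || !replacement.is_ascii() {
        return None;
    }
    let mut bytes = s.into_bytes();
    *bytes.get_mut(index)? = replacement;
    // Still pure ASCII, hence valid UTF-8; from_utf8 re-checks this safely.
    String::from_utf8(bytes).ok()
}
```

The byte write itself is O(1) and reuses the original allocation. The ASCII check and the final validation are linear, which is the price of staying in safe code here.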

Why is the string being converted to bytes in this Rust tutorial?

I am reading the rust tutorial and in this section the tutorial converts a string into a byte array
like so:
fn first_word(s: &String) -> usize {
let bytes = s.as_bytes();
for (i, &item) in bytes.iter().enumerate() {
if item == b' ' {
return i;
}
}
s.len()
}
They state that this conversion is because we want to find the first instance of the space character so we need to compare to it. My question is why do we need to convert to bytes? What if instead of converting the string to bytes we convert the byte ' ' into a String and compare to that?

Strings in Rust are UTF-8 encoded. You can iterate over chars but that will be a bit slower because Unicode code points are variable length, and the char type is 4 bytes long so you can't fit as many in a cache line.
A space has the same byte representation regardless of whether you are using ASCII or UTF-8 encoding, so this is an easy optimisation. It's also the same amount of code as iterating over chars.
But, probably more importantly, the function in question is returning an index for where the character is found. Finding the index by iterating over chars would tell you how many Unicode code points to skip to get to that position, but you'd have to iterate again each time you wanted to use the index, because each preceding code point could be anywhere from 1 to 4 bytes long. An index into bytes is much more straightforward and efficient.
For example, with a byte index:
let words = String::from("Hello there");
let index = first_word(&words); // byte index
// just a slice
let first_word = std::str::from_utf8(&words.as_bytes()[0..index]).unwrap();
Indexing Unicode code points:
let words = String::from("Hello there");
let index = first_word(&words); // code point index
// having to iterate again, and allocate a new String
let first_word: String = words.chars().take(index).collect();
Any method to take a slice here would involve calculating the byte position first.
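As a small sketch of why the byte index is convenient: because a space is a single byte that never occurs inside a multi-byte UTF-8 sequence, the index returned by a byte search is always a valid slice boundary. The name first_word_end is made up for this example, and first_word here returns the word itself rather than the index.

```rust
/// Returns the byte index of the first space, or the length if there is none.
fn first_word_end(s: &str) -> usize {
    s.bytes().position(|b| b == b' ').unwrap_or(s.len())
}

/// Slices out the first word without re-scanning or allocating.
fn first_word(s: &str) -> &str {
    &s[..first_word_end(s)]
}
```

This works for non-ASCII text too: for "привет мир" the space sits at byte index 12, and slicing up to it yields "привет".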

How capacity of []rune is determined when converting from a string

Can someone explain why I get a different capacity when converting the same string to []rune?
Take a look at this code
package main
import (
"fmt"
)
func main() {
input := "你好"
runes := []rune(input)
fmt.Printf("len %d\n", len(input))
fmt.Printf("len %d\n", len(runes))
fmt.Printf("cap %d\n", cap(runes))
fmt.Println(runes[:3])
}
Which returns
len 6
len 2
cap 2
panic: runtime error: slice bounds out of range [:3] with capacity 2
But when I comment out the fmt.Println(runes[:3]) line, it returns:
len 6
len 2
cap 32
See how the []rune capacity has changed from 2 to 32. How? Why?
If you want to test => Go playground

The capacity may change to whatever as long as the result slice of the conversion contains the runes of the input string. This is the only thing the spec requires and guarantees. The compiler may make decisions to use lower capacity if you pass it to fmt.Println() as this signals that the slice may escape. Again, the decision made by the compiler is out of your hands.
Escape means the value may escape from the function, and as such, it must be allocated on the heap (and not on the stack), because the stack may get destroyed / overwritten once the function returns, and if the value "escapes" from the function, its memory area must be retained as long as there is a reference to the value. The Go compiler performs escape analysis, and if it can't prove a value does not escape the function it's declared in, the value will be allocated on the heap.
See related question: Calculating sha256 gives different results after appending slices depending on if I print out the slice before or not

The reason the string and []rune return different results from len is that it's counting different things: len(string) returns the length in bytes (which may be more than the number of characters, for multi-byte characters), while len([]rune) returns the length of the rune slice, which is the number of runes, i.e. Unicode code points (generally the number of characters).
This blog post goes into detail how exactly Go treats text in various forms: https://blog.golang.org/strings

Slice a string containing Unicode chars

I have a piece of text with characters of different bytelength.
let text = "Hello привет";
I need to take a slice of the string given start (included) and end (excluded) character indices. I tried this
let slice = &text[start..end];
and got the following error
thread 'main' panicked at 'byte index 7 is not a char boundary; it is inside 'п' (bytes 6..8) of `Hello привет`'
I suppose it happens since Cyrillic letters are multi-byte and the [..] notation takes chars using byte indices. What can I use if I want to slice using character indices, like I do in Python:
slice = text[start:end] ?
I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?

Possible solutions to codepoint slicing
I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?
If you know the exact byte indices, you can slice a string:
let text = "Hello привет";
println!("{}", &text[2..10]);
This prints "llo пр". So the problem is to find out the exact byte position. You can do that fairly easily with the char_indices() iterator (alternatively you could use chars() with char::len_utf8()):
let text = "Hello привет";
let end = text.char_indices().map(|(i, _)| i).nth(8).unwrap();
println!("{}", &text[2..end]);
As another alternative, you can first collect the string into Vec<char>. Then, indexing is simple, but to print it as a string, you have to collect it again or write your own function to do it.
let text = "Hello привет";
let text_vec = text.chars().collect::<Vec<_>>();
println!("{}", text_vec[2..8].iter().cloned().collect::<String>());
Why is this not easier?
As you can see, neither of these solutions is all that great. This is intentional, for two reasons:
As str is simply a UTF-8 buffer, indexing by Unicode codepoints is an O(n) operation. Usually, people expect the [] operator to be an O(1) operation. Rust makes this runtime complexity explicit and doesn't try to hide it. In both solutions above you can clearly see that it's not O(1).
But the more important reason:
Unicode codepoints are generally not a useful unit
What Python does (and what you think you want) is not all that useful. It all comes down to the complexity of language and thus the complexity of Unicode. Python slices Unicode codepoints. This is what a Rust char represents. It's 32 bits wide (a few fewer bits would suffice, but we round up to a power of 2).
But what you actually want to do is slice user perceived characters. But this is an explicitly loosely defined term. Different cultures and languages regard different things as "one character". The closest approximation is a "grapheme cluster". Such a cluster can consist of one or more unicode codepoints. Consider this Python 3 code:
>>> s = "Jürgen"
>>> s[0:2]
'Ju'
Surprising, right? This is because the string above is:
0x004A LATIN CAPITAL LETTER J
0x0075 LATIN SMALL LETTER U
0x0308 COMBINING DIAERESIS
...
This is an example of a combining character that is rendered as part of the previous character. Python slicing does the "wrong" thing here.
Another example:
>>> s = "ﬁre"
>>> s[0:2]
'ﬁr'
Also not what you'd expect. This time, fi is actually the ligature ﬁ, which is one codepoint.
There are far more examples where Unicode behaves in a surprising way. See the links at the bottom for more information and examples.
So if you want to work with international strings that should be able to work everywhere, don't do codepoint slicing! If you really need to semantically view the string as a series of characters, use grapheme clusters. To do that, the crate unicode-segmentation is very useful.
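The Python behaviour shown above can be reproduced in Rust with nothing but chars(). This is a minimal sketch: take_code_points is a made-up helper, and the strings are written with explicit escapes so the combining mark and the ligature are visible in the source.

```rust
/// Hypothetical helper: returns the first `n` code points of `s`,
/// mirroring Python's s[0:n].
fn take_code_points(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

/// "Jürgen" written as 'u' followed by U+0308 COMBINING DIAERESIS.
const JUERGEN: &str = "Ju\u{0308}rgen";

/// "fire" written with U+FB01 LATIN SMALL LIGATURE FI.
const FIRE: &str = "\u{FB01}re";
```

Here take_code_points(JUERGEN, 2) gives "Ju" (the diaeresis is cut off), and take_code_points(FIRE, 2) gives the ligature plus "r". Slicing by code points is therefore no closer to user-perceived characters in Rust than it is in Python.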
Further resources on this topic:
Blogpost "Let's stop ascribing meaning to unicode codepoints"
Blogpost "Breaking our Latin-1 assumptions"
http://utf8everywhere.org/

A UTF-8 encoded string may contain characters which consist of multiple bytes. In your case, п starts at byte index 6 (inclusive) and ends at byte index 8 (exclusive), so index 7 is not the start of a character. This is why your error occurred.
You may use str::char_indices() for solving this (remember, that getting to a position in a UTF-8 string is O(n)):
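fn get_utf8_slice(string: &str, start: usize, end: usize) -> Option<&str> {
    assert!(end >= start);
    string.char_indices().nth(start).and_then(|(start_pos, _)| {
        // Byte offsets relative to start_pos, plus the end of the string,
        // so that `end` may point one past the last character.
        string[start_pos..]
            .char_indices()
            .map(|(pos, _)| pos)
            .chain(Some(string.len() - start_pos))
            .nth(end - start)
            .map(|end_pos| &string[start_pos..start_pos + end_pos])
    })
}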
You may use str::chars() if you are fine with getting a String:
let string: String = text.chars().take(end).skip(start).collect();

Here is a function which retrieves a UTF-8 slice, with the following advantages:
handles all edge cases (empty input, 0-width output ranges, out-of-scope ranges);
never panics;
uses start-inclusive, end-exclusive ranges.
pub fn utf8_slice(s: &str, start: usize, end: usize) -> Option<&str> {
let mut iter = s.char_indices()
.map(|(pos, _)| pos)
.chain(Some(s.len()))
.skip(start)
.peekable();
let start_pos = *iter.peek()?;
for _ in start..end { iter.next(); }
Some(&s[start_pos..*iter.peek()?])
}

[] operator for strings, link with slices for vectors

Why do you have to walk over the string to find the nᵗʰ letter of a string when you do s[n] where s is a string. (According to https://doc.rust-lang.org/book/strings.html)
From what I understood, a string is an array of chars and a char is an array of 4 bytes or a 4-byte number. So would getting the nth letter be similar to doing v[4*n..4*n+4], where v is a vector?
What is the cost of v[i..j] ?
I would assume that the cost of v[i..j] is j-i and so that the cost of s[n] should be 4.

Note: The second edition of The Rust Programming Language has an improved, smoother explanation of strings in Rust, which you might wish to read as well. The answer below, although still accurate, quotes from the first edition of the book.
I will try to clarify these misconceptions about strings in Rust by quoting from the book (https://doc.rust-lang.org/book/strings.html).
A ‘string’ is a sequence of Unicode scalar values encoded as a stream of UTF-8 bytes. All strings are guaranteed to be a valid encoding of UTF-8 sequences.
With this in mind, plus the fact that code points encoded in UTF-8 are variably sized (1 to 4 bytes depending on the character), all strings in Rust, whether they are &str or String, are not arrays of characters and cannot be treated as such. The book explains further:
Because strings are valid UTF-8, they do not support indexing:
let s = "hello";
println!("The first letter of s is {}", s[0]); // ERROR!!!
Usually, access to a vector with [] is very fast. But, because each character in a UTF-8 encoded string can be multiple bytes, you have to walk over the string to find the nᵗʰ letter of a string. This is a significantly more expensive operation, and we don’t want to be misleading.
Unlike what was mentioned in the question, one cannot do s[n]: although in theory this would allow us to fetch the nth byte in constant time, that byte is not guaranteed to make any sense on its own.
What is the cost of v[i..j] ?
The cost of slicing is actually constant, because it is done at byte-level:
You can get a slice of a string with slicing syntax:
let dog = "hachiko";
let hachi = &dog[0..5];
But note that these are byte offsets, not character offsets. So this will fail at runtime:
let dog = "忠犬ハチ公";
let hachi = &dog[0..2];
with this error:
thread '<main>' panicked at 'index 0 and/or 2 in 忠犬ハチ公 do not lie on
character boundary'
Basically, slicing is acceptable and will yield a new view of that string, so no copies are made. However, it should only be used when you are completely sure that the offsets are right in terms of character boundaries.
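One way to be sure the offsets are right is to derive them from char_indices rather than guessing. The following is a minimal sketch. prefix_chars is a made-up helper that returns the first n code points as a borrowed slice, so the search is O(n) but no copy is made.

```rust
/// Hypothetical helper: returns the first `n` code points of `s` as a slice.
/// If `s` has fewer than `n` code points, the whole string is returned.
fn prefix_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // byte_pos is where the (n + 1)-th code point starts, i.e. a valid boundary.
        Some((byte_pos, _)) => &s[..byte_pos],
        None => s,
    }
}
```

For "忠犬ハチ公" with n = 2 the computed byte offset is 6, so the slice is "忠犬" instead of a panic.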
In order to iterate over each character of a string, you may instead call chars():
let c = s.chars().nth(n);
Even with that in mind, note that handling individual Unicode scalar values might not be exactly what you want if you need to deal with combining modifiers (which are scalar values themselves but should not be treated individually either). Quoting now from the str API:
fn chars(&self) -> Chars
Returns an iterator over the chars of a string slice.
As a string slice consists of valid UTF-8, we can iterate through a string slice by char. This method returns such an iterator.
It's important to remember that char represents a Unicode Scalar Value, and may not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want.
Remember, chars may not match your human intuition about characters:
let y = "y̆";
let mut chars = y.chars();
assert_eq!(Some('y'), chars.next()); // not 'y̆'
assert_eq!(Some('\u{0306}'), chars.next());
assert_eq!(None, chars.next());
The unicode_segmentation crate provides a means to define grapheme cluster boundaries:
extern crate unicode_segmentation;
use unicode_segmentation::UnicodeSegmentation;
let s = "a̐éö̲\r\n";
let g = UnicodeSegmentation::graphemes(s, true).collect::<Vec<&str>>();
let b: &[_] = &["a̐", "é", "ö̲", "\r\n"];
assert_eq!(g, b);

If you do want to treat the string as an array of codepoints (which isn't strictly the same as characters; there are combining marks, emoji with separate skin-tone modifiers, etc.), you can collect it into a Vec:
fn main() {
let s = "£10 🙃!";
for (i,c) in s.char_indices() {
println!("{} {}", i, c);
}
let v: Vec<char> = s.chars().collect();
println!("v[5] = {}", v[5]);
}
With bonus demonstration of some varying character widths, this outputs:
0 £
2 1
3 0
4
5 🙃
9 !
v[5] = !
