Efficient truncating string copy `str` to `[u8]` (UTF-8-aware strlcpy)?

While Rust provides str.as_bytes, I'm looking to copy a string into a fixed-size buffer so that only whole Unicode scalar values are copied; if the string doesn't fit, it is truncated at a character boundary and a null terminator is written at the end. In C terms, I'd call this a UTF-8-aware strlcpy (that is, it copies into a fixed-size buffer and ensures it's null-terminated).
This is a function I came up with, but I expect there are better ways to do this in Rust:
// returns the number of bytes written (including the null terminator)
pub fn strlcpy_utf8(utf8_dst: &mut [u8], str_src: &str) -> usize {
let utf8_dst_len = utf8_dst.len();
if utf8_dst_len == 0 {
return 0;
}
let mut index: usize = 0;
if utf8_dst_len > 1 {
let mut utf8_buf: [u8; 4] = [0; 4];
for c in str_src.chars() {
let len_utf8 = c.len_utf8();
let index_next = index + len_utf8;
c.encode_utf8(&mut utf8_buf);
if index_next >= utf8_dst_len {
break;
}
utf8_dst[index..index_next].clone_from_slice(&utf8_buf[0..len_utf8]);
index = index_next;
}
}
utf8_dst[index] = 0;
return index + 1;
}
Note: I realize this isn't ideal since multiple Unicode scalar values may make up a single glyph; however, the result can at least be decoded back into a str.

Rust's str has a handy method char_indices for when you need to know the actual character boundaries. This would immediately simplify your function somewhat:
pub fn strlcpy_utf8(utf8_dst: &mut [u8], str_src: &str) -> usize {
let utf8_dst_len = utf8_dst.len();
if utf8_dst_len == 0 {
return 0;
}
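// last_index is the end of the last character that still fits,
// leaving one byte free for the null terminator
let mut last_index = 0;
for (idx, c) in str_src.char_indices() {
let end = idx + c.len_utf8();
if end >= utf8_dst_len {
break;
}
last_index = end;
}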
utf8_dst[0..last_index].copy_from_slice(&str_src.as_bytes()[0..last_index]);
utf8_dst[last_index] = 0;
return last_index + 1;
}
However, you don't need to iterate through every character: it's easy to find a boundary in UTF-8, and Rust has str::is_char_boundary() for exactly that. This lets you instead look backwards from the end:
pub fn strlcpy_utf8(utf8_dst: &mut [u8], str_src: &str) -> usize {
let utf8_dst_len = utf8_dst.len();
if utf8_dst_len == 0 {
return 0;
}
let mut last_index = std::cmp::min(utf8_dst_len - 1, str_src.len());
while last_index > 0 && !str_src.is_char_boundary(last_index) {
last_index -= 1;
}
utf8_dst[0..last_index].copy_from_slice(&str_src.as_bytes()[0..last_index]);
utf8_dst[last_index] = 0;
return last_index + 1;
}

Based on Chris Emerson's answer and @Matthieu-m's suggestion to remove a redundant check:
// returns the number of bytes written (including the null terminator)
pub fn strlcpy_utf8(utf8_dst: &mut [u8], str_src: &str) -> usize {
let utf8_dst_len = utf8_dst.len();
if utf8_dst_len == 0 {
return 0;
}
// truncate if 'str_src' is too long
let mut last_index = str_src.len();
if last_index >= utf8_dst_len {
last_index = utf8_dst_len - 1;
// no need to check last_index > 0 here,
// is_char_boundary covers that case
while !str_src.is_char_boundary(last_index) {
last_index -= 1;
}
}
utf8_dst[0..last_index].clone_from_slice(&str_src.as_bytes()[0..last_index]);
utf8_dst[last_index] = 0;
return last_index + 1;
}
@ChrisEmerson: I'm posting this since it's the code I'm going to use for my project. Feel free to update your answer with the changes if you like, and I'll remove this answer.
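To make the truncation behaviour concrete, here is a minimal sketch of the same backwards-scan idea together with a small check. The parameter names `dst` and `src` and the test strings are my own choices for illustration, not part of the answers above.

```rust
/// Copies as much of `src` as fits into `dst` without splitting a UTF-8
/// sequence, then writes a trailing 0 byte.
/// Returns the number of bytes written, terminator included.
pub fn strlcpy_utf8(dst: &mut [u8], src: &str) -> usize {
    if dst.is_empty() {
        return 0;
    }
    // Reserve one byte for the terminator, then back up to a char boundary.
    let mut end = src.len().min(dst.len() - 1);
    while !src.is_char_boundary(end) {
        end -= 1; // always terminates: index 0 is a boundary
    }
    dst[..end].copy_from_slice(&src.as_bytes()[..end]);
    dst[end] = 0;
    end + 1
}

fn main() {
    // "aé" is 3 bytes: 'a' (1 byte) followed by 'é' (2 bytes).
    let mut buf = [0xFFu8; 3];
    let written = strlcpy_utf8(&mut buf, "aé");
    // Only 'a' fits next to the terminator; 'é' is dropped whole.
    assert_eq!(written, 2);
    assert_eq!(&buf[..written], b"a\0");
    // The copied prefix (without the terminator) is still valid UTF-8.
    assert_eq!(std::str::from_utf8(&buf[..written - 1]), Ok("a"));
}
```

With a 3-byte buffer there is room for two payload bytes, but cutting at byte 2 would land in the middle of 'é', so the copy backs up to byte 1.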

How to properly initialize a struct in Rust, with good enough encapsulations?

How to properly initialize a struct in Rust, with good enough encapsulations?
Or more naively:
how to leverage object/instance methods in the initialization/constructing process of structs?
For example, as the initialization block in Kotlin:
private class BinaryIndexedTree(nums: IntArray) {
private val nNums = nums.size
private val fenwick = IntArray(nNums + 1) { 0 }
// where to put this block in Rust?
init {
for (idx in nums.indices) {
update(idx, nums[idx])
}
}
fun update(index: Int, value: Int) {
var idx = index + 1
while (idx <= nNums) {
fenwick[idx] += value
idx += (idx and -idx)
}
}
fun query(index: Int): Int {
var sum = 0
var idx = index + 1
while (idx > 0) {
sum += fenwick[idx]
idx -= (idx and -idx)
}
return sum
}
}
According to Rust Design Patterns, Rust has no constructors in the sense other languages do; the convention is to use an associated function.
Correspondingly, in Rust:
struct BinaryIndexedTree{
len_ns: isize,
fenwick: Vec<i32>,
}
impl BinaryIndexedTree{
pub fn new(nums: &Vec<i32>) -> Self{
let len_ns: usize = nums.len();
let fenwick: Vec<i32> = vec![0; len_ns + 1];
for (idx, num) in nums.iter().enumerate(){
// how to leverage `update()` for initialization
// update(idx as isize, num);
// or even earlier: where/how to put the initialization logic?
}
Self{
len_ns: len_ns as isize,
fenwick,
}
}
pub fn update(&mut self, index: isize, value: i32){
let mut idx = index + 1;
while idx <= self.len_ns{
self.fenwick[idx as usize] += value;
idx += (idx & -idx);
}
}
pub fn query(&self, index: isize) -> i32{
let mut sum: i32 = 0;
let mut idx = index + 1;
while idx > 0{
sum += self.fenwick[idx as usize];
idx -= (idx & -idx);
}
sum
}
}
Is there any way to properly leverage the update method?
As a rule of thumb, how should initialization work be handled after (all the fields of) the struct have been created?
The builder pattern is one way to go, but it introduces a lot more code just for initialization.
Yes, you can construct the struct then call a function on it before returning it. There is nothing special about the new function name or how the struct is constructed at the end of the function.
pub fn new(nums: &Vec<i32>) -> Self {
let len_ns: usize = nums.len();
let fenwick: Vec<i32> = vec![0; len_ns + 1];
// Construct an incomplete version of the struct.
let mut new_self = Self {
len_ns: len_ns as isize,
fenwick,
};
// Do stuff with the struct
for (idx, num) in nums.iter().enumerate(){
new_self.update(idx as isize, *num); // `num` is a `&i32`, so dereference it
}
// Return it
new_self
}
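For reference, here is a self-contained sketch of the whole type using that construct-then-mutate approach. It deviates from the question in two ways that are my own assumptions: indices are `usize` rather than `isize` (the lowest set bit is computed with `wrapping_neg`), and the length is read from the vector instead of being stored in a separate field.

```rust
struct BinaryIndexedTree {
    fenwick: Vec<i32>, // 1-based; fenwick[0] is unused
}

impl BinaryIndexedTree {
    pub fn new(nums: &[i32]) -> Self {
        // Step 1: build a valid (all-zero) tree.
        let mut tree = Self { fenwick: vec![0; nums.len() + 1] };
        // Step 2: reuse the ordinary `&mut self` method to fill it.
        for (idx, &num) in nums.iter().enumerate() {
            tree.update(idx, num);
        }
        tree
    }

    /// Adds `value` to the element at 0-based `index`.
    pub fn update(&mut self, index: usize, value: i32) {
        let mut idx = index + 1;
        while idx < self.fenwick.len() {
            self.fenwick[idx] += value;
            idx += idx & idx.wrapping_neg(); // add the lowest set bit
        }
    }

    /// Returns the sum of elements `0..=index`.
    pub fn query(&self, index: usize) -> i32 {
        let mut sum = 0;
        let mut idx = index + 1;
        while idx > 0 {
            sum += self.fenwick[idx];
            idx -= idx & idx.wrapping_neg(); // strip the lowest set bit
        }
        sum
    }
}

fn main() {
    let mut tree = BinaryIndexedTree::new(&[1, 2, 3, 4, 5]);
    assert_eq!(tree.query(2), 6); // 1 + 2 + 3
    tree.update(1, 10); // the element at index 1 becomes 12
    assert_eq!(tree.query(2), 16);
}
```

Because `new` only returns `tree` after the loop, callers never see the half-initialized value, which is what the Kotlin `init` block guarantees as well.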

Is there a way to update a string in place in rust?

Put another way: is it possible to URLify a string in place in Rust?
For example,
Problem statement: Replace whitespace with %20
Assumption: String will have enough capacity left to accommodate new characters.
Input: Hello how are you
Output: Hello%20how%20are%20you
I know there are ways to do this if we don't have to do this "in place". I am solving a problem that explicitly states that you have to update in place.
If there isn't any safe way to do this, is there any particular reason behind that?
[Edit]
I was able to solve this using an unsafe approach, but I would appreciate a better, more idiomatic approach if there is one.
fn space_20(sentence: &mut String) {
if !sentence.is_ascii() {
panic!("Invalid string");
}
let chars: Vec<usize> = sentence.char_indices().filter(|(_, ch)| ch.is_whitespace()).map(|(idx, _)| idx ).collect();
let char_count = chars.len();
if char_count == 0 {
return;
}
let sentence_len = sentence.len();
sentence.push_str(&"*".repeat(char_count*2)); // filling string with * so that bytes array becomes of required size.
unsafe {
let bytes = sentence.as_bytes_mut();
let mut final_idx = sentence_len + (char_count * 2) - 1;
let mut i = sentence_len - 1;
let mut char_ptr = char_count - 1;
loop {
if i != chars[char_ptr] {
bytes[final_idx] = bytes[i];
if final_idx == 0 {
// all elements are filled.
println!("all elements are filled.");
break;
}
final_idx -= 1;
} else {
bytes[final_idx] = '0' as u8;
bytes[final_idx - 1] = '2' as u8;
bytes[final_idx - 2] = '%' as u8;
// final_idx is of type usize cannot be less than 0.
if final_idx < 3 {
println!("all elements are filled at start.");
break;
}
final_idx -= 3;
// char_ptr is of type usize cannot be less than 0.
if char_ptr > 0 {
char_ptr -= 1;
}
}
if i == 0 {
// all elements are parsed.
println!("all elements are parsed.");
break;
}
i -= 1;
}
}
}
fn main() {
let mut sentence = String::with_capacity(1000);
sentence.push_str(" hello, how are you?");
// sentence.push_str("hello, how are you?");
// sentence.push_str(" hello, how are you? ");
// sentence.push_str(" ");
// sentence.push_str("abcd");
space_20(&mut sentence);
println!("{}", sentence);
}
An O(n) solution that neither uses unsafe nor allocates (provided that the string has enough capacity), using std::mem::take:
fn urlify_spaces(text: &mut String) {
const SPACE_REPLACEMENT: &[u8] = b"%20";
// operating on bytes for simplicity
let mut buffer = std::mem::take(text).into_bytes();
let old_len = buffer.len();
let space_count = buffer.iter().filter(|&&byte| byte == b' ').count();
let new_len = buffer.len() + (SPACE_REPLACEMENT.len() - 1) * space_count;
buffer.resize(new_len, b'\0');
let mut write_pos = new_len;
for read_pos in (0..old_len).rev() {
let byte = buffer[read_pos];
if byte == b' ' {
write_pos -= SPACE_REPLACEMENT.len();
buffer[write_pos..write_pos + SPACE_REPLACEMENT.len()]
.copy_from_slice(SPACE_REPLACEMENT);
} else {
write_pos -= 1;
buffer[write_pos] = byte;
}
}
*text = String::from_utf8(buffer).expect("invalid UTF-8 during URL-ification");
}
Basically, it calculates the final length of the string, sets up a reading pointer and a writing pointer, and translates the string from right to left. Since "%20" is longer than " ", the writing pointer always stays at or to the right of the reading pointer, so no byte is overwritten before it has been read.
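As a quick check of the "in place" claim, the sketch below restates the function in compact form and compares the heap pointer before and after. The pointer comparison is my own addition; it relies on the spare capacity being large enough that `resize` has no reason to reallocate.

```rust
fn urlify_spaces(text: &mut String) {
    const REPLACEMENT: &[u8] = b"%20";
    // Take ownership of the existing buffer; `text` is left empty meanwhile.
    let mut buf = std::mem::take(text).into_bytes();
    let old_len = buf.len();
    let spaces = buf.iter().filter(|&&b| b == b' ').count();
    buf.resize(old_len + (REPLACEMENT.len() - 1) * spaces, 0);
    // Walk right to left so unread bytes are never overwritten.
    let mut write = buf.len();
    for read in (0..old_len).rev() {
        if buf[read] == b' ' {
            write -= REPLACEMENT.len();
            buf[write..write + REPLACEMENT.len()].copy_from_slice(REPLACEMENT);
        } else {
            write -= 1;
            buf[write] = buf[read];
        }
    }
    *text = String::from_utf8(buf).expect("replacing ASCII spaces keeps UTF-8 valid");
}

fn main() {
    let mut s = String::with_capacity(64);
    s.push_str("Hello how are you");
    let ptr_before = s.as_ptr();
    urlify_spaces(&mut s);
    assert_eq!(s, "Hello%20how%20are%20you");
    // Same heap buffer: the spare capacity sufficed, so nothing was reallocated.
    assert_eq!(s.as_ptr(), ptr_before);
}
```

Working on bytes is safe here because the byte 0x20 never occurs inside a multi-byte UTF-8 sequence, so non-ASCII text passes through unchanged.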
Is it possible to do this without unsafe?
Yes, like this:
fn main() {
let mut my_string = String::from("Hello how are you");
let mut insert_positions = Vec::new();
let mut char_counter = 0;
for c in my_string.chars() {
if c == ' ' {
insert_positions.push(char_counter);
char_counter += 2; // Because we will insert two extra chars here later.
}
char_counter += c.len_utf8(); // count bytes: remove/insert take byte indices
}
for p in insert_positions.iter() {
my_string.remove(*p);
my_string.insert(*p, '0');
my_string.insert(*p, '2');
my_string.insert(*p, '%');
}
println!("{}", my_string);
}
But should you do it?
As discussed on Reddit, for example, this is almost never the recommended way of doing it: both remove and insert are O(n) operations, as noted in the documentation, so calling them once per space makes the whole replacement O(n^2) in the worst case.
Edit
A slightly better version:
fn main() {
let mut my_string = String::from("Hello how are you");
let mut insert_positions = Vec::new();
let mut char_counter = 0;
for c in my_string.chars() {
if c == ' ' {
insert_positions.push(char_counter);
char_counter += 2; // Because we will insert two extra chars here later.
}
char_counter += c.len_utf8(); // count bytes: remove/insert_str take byte indices
}
for p in insert_positions.iter() {
my_string.remove(*p);
my_string.insert_str(*p, "%20");
}
println!("{}", my_string);
}

Insertion sort algorithm gives overflow error

I'm trying to run the insertion sort algorithm shown below in Rust 1.15:
fn main() {
println!("The sorted set is now: {:?}", insertion_sort(vec![5,2,4,6,1,3]));
}
fn insertion_sort(set: Vec<i32>) -> Vec<i32> {
let mut A = set.to_vec();
for j in 1..set.len() {
let key = A[j];
let mut i = j - 1;
while (i >= 0) && (A[i] > key) {
A[i + 1] = A[i];
i = i - 1;
}
A[i + 1] = key;
}
A
}
I get the error:
thread 'main' panicked at 'attempt to subtract with overflow', insertion_sort.rs:12
note: Run with `RUST_BACKTRACE=1` for a backtrace
Why does an overflow happen here and how is the problem alleviated?
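The reason is that you tried to calculate 0 - 1 in the usize type, which is unsigned (non-negative). In debug builds Rust panics on integer overflow, which is the error you see; in release builds the value wraps around by default instead.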
Why usize? Because Rust expects usize for lengths and indices. You can explicitly convert them into/from signed ones e.g. isize.
fn main() {
println!("The sorted set is now: {:?}", insertion_sort(vec![5,2,4,6,1,3]));
}
fn insertion_sort(set: Vec<i32>) -> Vec<i32> {
let mut A = set.to_vec();
for j in 1..set.len() as isize {
let key = A[j as usize];
let mut i = j - 1;
while (i >= 0) && (A[i as usize] > key) {
A[(i + 1) as usize] = A[i as usize];
i = i - 1;
}
A[(i + 1) as usize] = key;
}
A
}
Another solution, which I recommend, is to avoid negative indices altogether. In this case you can shift the index by one, so that i plays the role of the old i + 1, like this:
fn main() {
println!("The sorted set is now: {:?}", insertion_sort(vec![5,2,4,6,1,3]));
}
fn insertion_sort(set: Vec<i32>) -> Vec<i32> {
let mut A = set.to_vec();
for j in 1..set.len() {
let key = A[j];
let mut i = j;
while (i > 0) && (A[i - 1] > key) {
A[i] = A[i - 1];
i = i - 1;
}
A[i] = key;
}
A
}
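The two possible outcomes of 0 - 1 on a usize can be observed directly with the standard checked and wrapping helpers, independent of build settings. The sketch below also shows an in-place variant of the sort on a slice; the name `insertion_sort_in_place` and the swap-based inner loop are my own choices, not from the answer above.

```rust
/// Sorts the slice in place; indices stay `usize` and never go below zero.
fn insertion_sort_in_place(items: &mut [i32]) {
    for j in 1..items.len() {
        let mut i = j;
        // Move items[j] left until its left neighbour is not larger.
        while i > 0 && items[i - 1] > items[i] {
            items.swap(i - 1, i);
            i -= 1;
        }
    }
}

fn main() {
    // `checked_sub` reports the overflow, `wrapping_sub` wraps around.
    let zero: usize = 0;
    assert_eq!(zero.checked_sub(1), None);
    assert_eq!(zero.wrapping_sub(1), usize::MAX);

    let mut data = vec![5, 2, 4, 6, 1, 3];
    insertion_sort_in_place(&mut data);
    assert_eq!(data, [1, 2, 3, 4, 5, 6]);
}
```

The `i > 0` test runs before `items[i - 1]` is evaluated, so the subtraction can never underflow.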

How to create a very large array? [duplicate]

I'm implementing comb sort. I'd like to create a fixed-size array on the stack, but it overflows the stack. When I move it to the heap (Rust by Example says that to allocate on the heap we must use Box), it still overflows the stack.
fn new_gap(gap: usize) -> usize {
let ngap = ((gap as f64) / 1.3) as usize;
if ngap == 9 || ngap == 10 {
return 11;
}
if ngap < 1 {
return 1;
}
return ngap;
}
fn comb_sort(a: &mut Box<[f64]>) {
// previously: [f64]
let xlen = a.len();
let mut gap = xlen;
let mut swapped: bool;
let mut temp: f64;
loop {
swapped = false;
gap = new_gap(gap);
for i in 0..(xlen - gap) {
if a[i] > a[i + gap] {
swapped = true;
temp = a[i];
a[i] = a[i + gap];
a[i + gap] = temp;
}
}
if !(gap > 1 || swapped) {
break;
}
}
}
const N: usize = 10000000;
fn main() {
let mut arr: Box<[f64]> = Box::new([0.0; N]); // previously: [f64; N] = [0.0; N];
for z in 0..(N) {
arr[z] = (N - z) as f64;
}
comb_sort(&mut arr);
for z in 1..(N) {
if arr[z] < arr[z - 1] {
print!("!")
}
}
}
The output:
thread '<main>' has overflowed its stack
Illegal instruction (core dumped)
Or
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
I know that my stack is not big enough, just as in C++ when creating a non-heap array that is too big inside a function, but this code uses the heap and still overflows the stack. What's really wrong with this code?
As far as I can tell, the code still creates the array on the stack first and only then moves it into the box.
It works for me if I switch to Vec<f64> in place of Box<[f64]> like this:
fn new_gap(gap: usize) -> usize {
let ngap = ((gap as f64) / 1.3) as usize;
if ngap == 9 || ngap == 10 {
return 11;
}
if ngap < 1 {
return 1;
}
return ngap;
}
fn comb_sort(a: &mut [f64]) {
// previously: &mut Box<[f64]>
let xlen = a.len();
let mut gap = xlen;
let mut swapped: bool;
let mut temp: f64;
loop {
swapped = false;
gap = new_gap(gap);
for i in 0..(xlen - gap) {
if a[i] > a[i + gap] {
swapped = true;
temp = a[i];
a[i] = a[i + gap];
a[i + gap] = temp;
}
}
if !(gap > 1 || swapped) {
break;
}
}
}
const N: usize = 10000000;
fn main() {
let mut arr: Vec<f64> = vec![0.0; N]; // allocated directly on the heap
//let mut arr: Box<[f64]> = Box::new([0.0; N]); // previously: [f64; N] = [0.0; N];
for z in 0..(N) {
arr[z] = (N - z) as f64;
}
comb_sort(arr.as_mut_slice());
for z in 1..(N) {
if arr[z] < arr[z - 1] {
print!("!")
}
}
}
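The unstable box syntax (nightly only, behind the box_syntax feature) was meant to support this kind of large allocation: no function call to Box::new is needed, so the array is never placed on the stack. It was never stabilized and has since been removed from the compiler, so the snippet below only builds on older nightly toolchains. For example: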
#![feature(box_syntax)]
fn main() {
let v = box [0i32; 5_000_000];
println!("{}", v[1_000_000])
}
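If a Box<[f64]> is really what you want on stable Rust, one option is to build it from a Vec. This is a minimal sketch; the helper name `big_boxed_slice` is hypothetical.

```rust
const N: usize = 10_000_000;

/// Builds the boxed slice on the heap from the start: `vec!` allocates its
/// buffer directly, and `into_boxed_slice` hands that buffer to a `Box`
/// (dropping any excess capacity) without copying through the stack.
fn big_boxed_slice() -> Box<[f64]> {
    vec![0.0; N].into_boxed_slice()
}

fn main() {
    let mut arr = big_boxed_slice();
    for (z, slot) in arr.iter_mut().enumerate() {
        *slot = (N - z) as f64;
    }
    assert_eq!(arr.len(), N);
    assert_eq!(arr[0], N as f64);
    assert_eq!(arr[N - 1], 1.0);
}
```

The result derefs to `[f64]`, so it can be passed as `&mut arr` to a function taking `&mut [f64]`.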
