Creating a string from Vec<char> [duplicate] - string

I've got a Vec<char> that I need to turn into a &str or String, but I'm unsure of the best way to do this. I've looked around and every resource I've found seems to be out-dated in some way. The answers in this question don't seem to be applicable for the newest build.
I'm using the nightly from 2015-03-19.

The iterator based approach with .collect should work, after updating for language changes:
char_vector.iter().cloned().collect::<String>();
(I've chosen to replace .map(|c| *c) with .cloned() but either works.)

If your vector can be consumed, you can also use into_iter to avoid the clone:
fn main() {
let char_vector = vec!['h', 'e', 'l', 'l', 'o'];
let str: String = char_vector.into_iter().collect();
println!("{}", str);
}
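Both approaches above can be sketched as two small functions on a current stable toolchain. The function names are hypothetical and only used for illustration. The borrowing version relies on String implementing FromIterator<&char>, so the explicit .cloned() becomes optional.

```rust
/// Builds a String from borrowed chars; the caller keeps the vector.
fn string_from_slice(chars: &[char]) -> String {
    // String implements FromIterator<&char>, so collecting the
    // borrowed iterator works without .cloned() or .map(|c| *c).
    chars.iter().collect()
}

/// Builds a String by consuming the vector.
fn string_from_vec(chars: Vec<char>) -> String {
    // into_iter yields owned chars, so nothing needs to be copied
    // out of references before collecting.
    chars.into_iter().collect()
}
```

Either way a new String buffer is allocated; the difference is only whether the original Vec<char> is still available afterwards.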

You can convert the Vec into a String without doing any allocations. It requires quite some unsafe code though:
#![feature(raw, unicode)]
use std::raw::Repr;
use std::slice::from_raw_parts_mut;
fn inplace_to_string(v: Vec<char>) -> String {
unsafe {
let mut i = 0;
{
let ch_v = &v[..];
let r = ch_v.repr();
let p: &mut [u8] = from_raw_parts_mut(r.data as *mut u8, r.len*4);
for ch in ch_v {
i += ch.encode_utf8(&mut p[i..i+4]).unwrap();
}
}
let p = v.as_ptr();
let cap = v.capacity()*4;
std::mem::forget(v);
let v = Vec::from_raw_parts(p as *mut u8, i, cap);
String::from_utf8_unchecked(v)
}
}
fn main() {
let char_vector = vec!['h', 'ä', 'l', 'l', 'ö'];
let str: String = char_vector.iter().cloned().collect();
let str2 = inplace_to_string(char_vector);
println!("{}", str);
println!("{}", str2);
}
Detailed Explanation
This creates a mutable u8 slice and a char slice over the same buffer at the same time (breaking all of Rust's aliasing guarantees). Note that the u8 slice is four times as long as the char slice, since a char always takes up 4 bytes.
let ch_v = &v[..];
let r = ch_v.repr();
let p: &mut [u8] = from_raw_parts_mut(r.data as *mut u8, r.len*4);
We need that to iterate over the chars and replace them with their UTF-8 encoded counterparts. Since the UTF-8 encoding of a char is never longer than the 4 bytes the char itself occupies, we can guarantee that we never overwrite any part we haven't read yet.
for ch in ch_v {
i += ch.encode_utf8(&mut p[i..i+4]).unwrap();
}
Since a char is always valid Unicode and the destination slice is always exactly 4 bytes long (the maximum number of bytes a UTF-8 encoded char needs), we can encode our chars without checking whether it worked (it always will). The encode_utf8 function returns the length of the UTF-8 representation, so our index i always points just past the last byte written.
Finally we need to do some cleaning up. Our vector is still of type Vec<char>. We get all the info we need (the pointer to the heap-allocated array and the capacity):
let p = v.as_ptr();
let cap = v.capacity()*4;
Then we release the previous vector from all obligations like freeing memory.
std::mem::forget(v);
and finally recreate the u8 vector with correct length and capacity, and directly turn it into a String. The conversion to String does not need to be checked, as we already know the utf8 is correct, since the original Vec<char> could only contain correct unicode chars.
let v = Vec::from_raw_parts(p as *mut u8, i, cap);
String::from_utf8_unchecked(v)
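The size argument in the explanation (a char never needs more UTF-8 bytes than the 4 bytes it occupies in memory) can be checked with safe code on current Rust. This is only a sketch: utf8_len is a hypothetical helper, and it uses the present-day encode_utf8 signature, which returns the encoded &mut str instead of an Option.

```rust
/// Returns the number of bytes the chars need once encoded as UTF-8.
fn utf8_len(chars: &[char]) -> usize {
    // 4 bytes is the maximum any single char needs in UTF-8.
    let mut buf = [0u8; 4];
    let mut total = 0;
    for ch in chars {
        // On current Rust, encode_utf8 returns the written &mut str
        // directly and panics if the buffer is too small.
        let encoded = ch.encode_utf8(&mut buf);
        // The encoded form never exceeds the in-memory size of a char.
        debug_assert!(encoded.len() <= std::mem::size_of::<char>());
        total += encoded.len();
    }
    total
}
```

For the example vector ['h', 'ä', 'l', 'l', 'ö'] this yields 7 bytes against 20 bytes of char storage. Note that the current documentation of Vec::from_raw_parts requires the element type to have the same alignment the buffer was allocated with, so the in-place trick above should not be carried over to modern Rust unchanged.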


How to create a &str from a single character? [duplicate]

I can't believe I'm asking this frankly, but how do I create a &str (or a String) when I have a single character?
The first thing to try for simple conversions is into().
It works for String because String implements From<char>.
let c: char = 'π';
let s: String = c.into();
You can't build a &str directly from a char. A &str is a reference type. The easiest solution is to build it from a string:
let s: &str = &s;
An alternative for most kinds of values is the format macro:
let s = format!("{}", c);
If you just need to use the &str locally and want to avoid heap allocation, you can use the char method encode_utf8:
fn main() {
let c = 'n';
let mut tmp = [0; 1];
let foo = c.encode_utf8(&mut tmp);
println!("str: {}", foo);
}
or
fn main() {
let tmp = [b'n'; 1];
let foo = std::str::from_utf8(&tmp).unwrap();
println!("str: {}", foo);
}
To work with every char you need a u8 array of length 4: [0; 4]. In UTF-8, ASCII chars are represented as a single byte, but all other characters require more bytes, with a maximum of 4.
This is a simplified example based on an answer from a very similar question:
Converting a char to &str
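A version of the encode_utf8 approach that works for any char, not just ASCII, might look like the sketch below. The helper name with_char_str and its closure-based shape are assumptions made for illustration; the point is only that the 4-byte buffer lives on the stack and the &str borrows from it.

```rust
/// Calls `f` with a &str view of `c` that lives in a stack buffer.
fn with_char_str<R>(c: char, f: impl FnOnce(&str) -> R) -> R {
    // Four bytes are enough for the UTF-8 encoding of any char.
    let mut buf = [0u8; 4];
    // encode_utf8 writes into buf and returns the written part as a str.
    let s: &str = c.encode_utf8(&mut buf);
    f(s)
}
```

The closure is needed because the &str cannot outlive buf; returning it from the function would not compile.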

Difference between double quotes and single quotes in Rust

I was doing Advent of Code 2020, day 3, in Rust to practice a little because I am new to Rust, and my code would or would not compile depending on whether I used single or double quotes for my "tree" variable.
The first code snippet does not compile and fails with the error: expected u8, found &[u8; 1]
use std::fs;
fn main() {
let text: String = fs::read_to_string("./data/text").unwrap();
let vec: Vec<&str> = text.lines().collect();
let vec_vertical_len = vec.len();
let vec_horizontal_len = vec[0].len();
let mut i_pointer: usize = 0;
let mut j_pointer: usize = 0;
let mut tree_counter: usize = 0;
let tree = b"#";
loop {
i_pointer += 3;
j_pointer += 1;
if j_pointer >= vec_vertical_len {
break;
}
let i_index = i_pointer % vec_horizontal_len;
let character = vec[j_pointer].as_bytes()[i_index];
if character == tree {
tree_counter += 1
}
}
println!("{}", tree_counter);
}
The second snippet compiles and gives the right answer.
use std::fs;
fn main() {
let text: String = fs::read_to_string("./data/text").unwrap();
let vec: Vec<&str> = text.lines().collect();
let vec_vertical_len = vec.len();
let vec_horizontal_len = vec[0].len();
let mut i_pointer: usize = 0;
let mut j_pointer: usize = 0;
let mut tree_counter: usize = 0;
let tree = b'#';
loop {
i_pointer += 3;
j_pointer += 1;
if j_pointer >= vec_vertical_len {
break;
}
let i_index = i_pointer % vec_horizontal_len;
let character = vec[j_pointer].as_bytes()[i_index];
if character == tree {
tree_counter += 1
}
}
println!("{}", tree_counter);
}
I did not find any reference explaining what is going on when using single or double quotes. Can someone help me?
The short answer is that it works similarly to Java: single quotes for characters and double quotes for strings.
let a: char = 'k';
let b: &'static str = "k";
The b'' or b"" prefix means "take what I have here and interpret it as a byte literal or byte string literal instead".
let a: u8 = b'k';
let b: &'static [u8; 1] = b"k";
The reason string literals result in references is how they are stored in the compiled binary. It would be too inefficient to store a copy of a string constant inside each function that uses it, so the bytes of each literal are embedded once in the binary's read-only data. When your program is being executed, you are taking a reference to those bytes (hence the static lifetime).
Going further down the rabbit hole, single quotes technically hold a code point (more precisely, a Unicode scalar value: any code point except the surrogates). This is essentially what you might think of as a character. So a Unicode character is also a single code point even though it may be multiple bytes long. A code point always fits into a u32, so you can safely convert any char with as u32, but not the other way around, since not all u32 values are valid code points. This also means b'\u{x}' is not valid, since \u{x} may produce characters that do not fit within a single byte.
// U+1F600 is a unicode smiley face
let a: char = '\u{1F600}';
assert_eq!(a as u32, 0x1F600);
However, you might find it interesting to know that since Rust strings are stored as UTF-8, code points above 127 occupy multiple bytes in a string, even those between 128 and 255 that would fit into a single byte on their own. As you may already know, UTF-8 is simply a way of converting code points to bytes and back again.
let foo: &'static str = "\u{1F600}";
let foo_chars: Vec<char> = foo.chars().collect();
let foo_bytes: Vec<u8> = foo.bytes().collect();
assert_eq!(foo_chars.len(), 1);
assert_eq!(foo_bytes.len(), 4);
assert_eq!(foo_chars[0] as u32, 0x1F600);
assert_eq!(foo_bytes, vec![240, 159, 152, 128]);
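Applied to the question's tree counter, the literal types work out as in the following sketch. The function count_trees is a hypothetical reduction of the original loop to a single line of input, used only to show where the u8 versus &[u8; 1] mismatch comes from.

```rust
/// Counts occurrences of the byte b'#' in one line of the map.
fn count_trees(line: &str) -> usize {
    let tree: u8 = b'#'; // byte literal: a single u8
    let tree_str: &[u8; 1] = b"#"; // byte string literal: reference to an array
    line.as_bytes()
        .iter()
        // Comparing a u8 with tree_str directly would not compile
        // (u8 vs &[u8; 1]); indexing the array yields a u8 again.
        .filter(|&&b| b == tree && b == tree_str[0])
        .count()
}
```

So if a byte string literal is what you have, tree_str[0] gives a value that can be compared with the elements of as_bytes().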

Cast vector of i8 to vector of u8 in Rust? [duplicate]

Is there a better way to cast Vec<i8> to Vec<u8> in Rust except for these two?
1. creating a copy by mapping and casting every entry
2. using std::mem::transmute
Option (1) is slow, and for option (2) the docs say "transmute should be the absolute last resort".
A bit of background maybe: I'm getting a Vec<i8> from the unsafe gl::GetShaderInfoLog() call and want to create a string from this vector of chars by using String::from_utf8().
The other answers provide excellent solutions for the underlying problem of creating a string from Vec<i8>. To answer the question as posed, creating a Vec<u8> from data in a Vec<i8> can be done without copying or transmuting the vector. As pointed out by @trentcl, transmuting the vector directly constitutes undefined behavior because Vec is allowed to have a different layout for different types.
The correct (though still requiring the use of unsafe) way to transfer a vector's data without copying it is:
1. obtain the *mut i8 pointer to the data in the vector, along with its length and capacity
2. leak the original vector to prevent it from freeing the data
3. use Vec::from_raw_parts to build a new vector, giving it the pointer cast to *mut u8 - this is the unsafe part, because we are vouching that the pointer contains valid and initialized data, that it is not in use by other objects, and so on.
This is not UB because the new Vec is given the pointer of the correct type from the start. Code (playground):
fn vec_i8_into_u8(v: Vec<i8>) -> Vec<u8> {
// ideally we'd use Vec::into_raw_parts, but it's unstable,
// so we have to do it manually:
// first, make sure v's destructor doesn't free the data
// it thinks it owns when it goes out of scope
let mut v = std::mem::ManuallyDrop::new(v);
// then, pick apart the existing Vec
let p = v.as_mut_ptr();
let len = v.len();
let cap = v.capacity();
// finally, adopt the data into a new Vec
unsafe { Vec::from_raw_parts(p as *mut u8, len, cap) }
}
fn main() {
let v = vec![-1i8, 2, 3];
assert!(vec_i8_into_u8(v) == vec![255u8, 2, 3]);
}
transmute on a Vec is always, 100% wrong, causing undefined behavior, because the layout of Vec is not specified. However, as the page you linked also mentions, you can use raw pointers and Vec::from_raw_parts to perform this correctly. user4815162342's answer shows how.
(std::mem::transmute is the only item in the Rust standard library whose documentation consists mostly of suggestions for how not to use it. Take that how you will.)
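However, in this case, from_raw_parts is also unnecessary. The best way to deal with C strings in Rust is with the wrappers in std::ffi, CStr and CString. There may be better ways to work this into your real code, but here's one way you could use CStr to borrow the log buffer as a &str:
use std::ffi::CStr;
use gl::types::{GLchar, GLsizei};
const BUF_SIZE: usize = 1000;
// from_bytes_with_nul takes a &[u8], so the buffer is declared as bytes
// and only the pointer handed to OpenGL is cast to *mut GLchar.
let mut info_log: Vec<u8> = vec![0; BUF_SIZE];
// Must be initialized before a reference to it is passed to the FFI call.
let mut len: GLsizei = 0;
unsafe {
gl::GetShaderInfoLog(shader, BUF_SIZE as GLsizei, &mut len, info_log.as_mut_ptr() as *mut GLchar);
}
// len excludes the nul terminator, so the slice is extended by one byte.
let log = CStr::from_bytes_with_nul(&info_log[..len as usize + 1])
.expect("Slice must be nul terminated and contain no nul bytes")
.to_str()
.expect("Slice must be valid UTF-8 text");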
Notice there is no unsafe code except to call the FFI function; you could also use with_capacity + set_len (as in wasmup's answer) to skip initializing the Vec to 1000 zeros, and use from_bytes_with_nul_unchecked to skip checking the validity of the returned string.
See this:
fn get_compilation_log(&self) -> String {
let mut len = 0;
unsafe { gl::GetShaderiv(self.id, gl::INFO_LOG_LENGTH, &mut len) };
assert!(len > 0);
let mut buf = Vec::with_capacity(len as usize);
let buf_ptr = buf.as_mut_ptr() as *mut gl::types::GLchar;
unsafe {
gl::GetShaderInfoLog(self.id, len, std::ptr::null_mut(), buf_ptr);
buf.set_len(len as usize);
};
match String::from_utf8(buf) {
Ok(log) => log,
Err(vec) => panic!("Could not convert compilation log from buffer: {}", vec),
}
}
See ffi:
let s = CStr::from_ptr(strz_ptr).to_str().unwrap();
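For comparison, a fully safe route from an i8 buffer to a String could look like the sketch below. It deliberately takes the copying approach the question calls slow, and it has no OpenGL dependency; the helper name log_to_string and the Option-based error handling are assumptions for illustration.

```rust
use std::ffi::CStr;

/// Copies an i8 buffer to bytes and reads it as a nul-terminated string.
/// Returns None if there is no nul byte or the text is not valid UTF-8.
fn log_to_string(buf: &[i8]) -> Option<String> {
    // Reinterpret each i8 as the u8 with the same bit pattern.
    let bytes: Vec<u8> = buf.iter().map(|&b| b as u8).collect();
    // from_bytes_with_nul wants exactly one nul, at the very end,
    // so cut the slice right after the first nul byte.
    let end = bytes.iter().position(|&b| b == 0)?;
    let cstr = CStr::from_bytes_with_nul(&bytes[..=end]).ok()?;
    cstr.to_str().ok().map(str::to_owned)
}
```

For a short shader log the extra copy is unlikely to matter, and it avoids both transmute and from_raw_parts.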

How can I append a char or &str to a String without first converting it to String?

I am attempting to write a lexer for fun, however something keeps bothering me.
let mut chars: Vec<char> = Vec::new();
let mut contents = String::new();
let mut tokens: Vec<&String> = Vec::new();
let mut append = String::new();
//--snip--
for _char in chars {
append += &_char.to_string();
append = append.trim().to_string();
if append.contains("print") {
println!("print found at: \n{}", append);
append = "".to_string();
}
}
Any time I want to do something as simple as append a &str to a String I have to convert it using .to_string, String::from(), .to_owned, etc.
Is there something I am doing wrong, so that I don't have to constantly do this, or is this the primary way of appending?
If you're trying to do something with a type, check the documentation. From the documentation for String:
push: "Appends the given char to the end of this String."
push_str: "Appends a given string slice onto the end of this String."
It's important to understand the differences between String and &str, and why different methods accept and return each of them.
A &str or &mut str is usually preferred in function arguments and return types. That's because they are just pointers to data, so nothing needs to be copied or moved when they are passed around.
A String is returned when a function needs to do some new allocation, while &str and &mut str are slices into an existing String. Even though &mut str is mutable, you can't mutate it in a way that increases its length because that would require additional allocation.
The trim function is able to return a &str slice because that doesn't involve mutating the original string - a trimmed string is just a substring, which a slice perfectly describes. But sometimes that isn't possible; for example, a function that pads a string with an extra character would have to return a String because it would be allocating new memory.
You can reduce the number of type conversions in your code by choosing different methods:
for c in chars {
append.push(c); // append += &_char.to_string();
append = append.trim().to_string();
if append.contains("print") {
println!("print found at: \n{}", append);
append.clear(); // append = "".to_string();
}
}
There isn't anything like a trim_in_place method for String, so the way you have done it is probably the only way.
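If the reallocation from trim().to_string() matters, one possible sketch of an in-place trim built from existing String methods is shown below. The name trim_in_place is hypothetical (it is not a standard library method); the function only combines truncate and drain.

```rust
/// Removes leading and trailing whitespace without allocating a new String.
fn trim_in_place(s: &mut String) {
    // trim_end returns a prefix of s, so its length is the new end.
    let end = s.trim_end().len();
    s.truncate(end);
    // The bytes dropped by trim_start are exactly the leading whitespace.
    let start = s.len() - s.trim_start().len();
    // drain removes that range and shifts the rest to the front.
    s.drain(..start);
}
```

In the lexer loop this would replace append = append.trim().to_string(); with trim_in_place(&mut append);, at the cost of shifting the remaining bytes when leading whitespace is removed.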

How to find the starting offset of a string slice of another string? [duplicate]

Given a string and a slice referring to some substring, is it possible to find the starting and ending index of the slice?
I have a ParseString function which takes in a reference to a string, and tries to parse it according to some grammar:
ParseString(inp_string: &str) -> Result<(), &str>
If the parsing is fine, the result is just Ok(()), but if there's some error, it usually is in some substring, and the error instance is Err(e), where e is a slice of that substring.
When given the substring where the error occurs, I want to say something like "Error from characters x to y", where x and y are the starting and ending indices of the erroneous substring.
I don't want to encode the position of the errors directly in Err, because I'm nesting these invocations, and the offsets in the nested slice might not correspond to some slice in the top-level string.
As long as all of your string slices borrow from the same string buffer, you can calculate offsets with simple pointer arithmetic. You need the following methods:
str::as_ptr(): Returns the pointer to the start of the string slice
A way to get the difference between two pointers. Right now, the easiest way is to just cast both pointers to usize (which is always a no-op) and then subtract those. On 1.47.0+, there is a method offset_from() which is slightly nicer.
Here is working code (Playground):
fn get_range(whole_buffer: &str, part: &str) -> (usize, usize) {
let start = part.as_ptr() as usize - whole_buffer.as_ptr() as usize;
let end = start + part.len();
(start, end)
}
fn main() {
let input = "Everyone ♥ Ümläuts!";
let part1 = &input[1..7];
println!("'{}' has offset {:?}", part1, get_range(input, part1));
let part2 = &input[7..16];
println!("'{}' has offset {:?}", part2, get_range(input, part2));
}
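The get_range function above subtracts addresses unconditionally, so it would underflow or return nonsense if part does not actually borrow from whole_buffer. A more defensive variant, still based on casting to usize rather than on offset_from, might look like this sketch; the name checked_range and the Option return type are assumptions.

```rust
/// Returns the byte range of `part` inside `whole`, or None if `part`
/// does not lie entirely within the memory of `whole`.
fn checked_range(whole: &str, part: &str) -> Option<(usize, usize)> {
    let whole_start = whole.as_ptr() as usize;
    let part_start = part.as_ptr() as usize;
    // None if part begins before whole.
    let start = part_start.checked_sub(whole_start)?;
    let end = start.checked_add(part.len())?;
    // None if part extends past the end of whole.
    if end <= whole.len() {
        Some((start, end))
    } else {
        None
    }
}
```

The returned indices are byte offsets, as in the original, so for text with multi-byte characters they are not character counts.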
Rust actually used to have an unstable method for doing exactly this, but it was removed due to being obsolete, which was a bit odd considering the replacement didn't remotely have the same functionality.
That said, the implementation isn't that big, so you can just add the following to your code somewhere:
pub trait SubsliceOffset {
/**
Returns the byte offset of an inner slice relative to an enclosing outer slice.
Examples
```ignore
let string = "a\nb\nc";
let lines: Vec<&str> = string.lines().collect();
assert!(string.subslice_offset_stable(lines[0]) == Some(0)); // &"a"
assert!(string.subslice_offset_stable(lines[1]) == Some(2)); // &"b"
assert!(string.subslice_offset_stable(lines[2]) == Some(4)); // &"c"
assert!(string.subslice_offset_stable("other!") == None);
```
*/
fn subslice_offset_stable(&self, inner: &Self) -> Option<usize>;
}
impl SubsliceOffset for str {
fn subslice_offset_stable(&self, inner: &str) -> Option<usize> {
let self_beg = self.as_ptr() as usize;
let inner = inner.as_ptr() as usize;
if inner < self_beg || inner > self_beg.wrapping_add(self.len()) {
None
} else {
Some(inner.wrapping_sub(self_beg))
}
}
}
You can remove the _stable suffix if you don't need to support old versions of Rust; it's just there to avoid a name conflict with the now-removed subslice_offset method.
