Most operations on str in Rust create newly allocated Strings. I understand that, because UTF-8 is complex, one cannot generally know the size of the output beforehand, so the output must be growable. However, I might have my own buffer, such as a Vec<u8>, that I'd like to grow. Is there any way to specify an existing output container to string operations?
e.g.,
let s = "my string";
let mut v: Vec<u8> = Vec::with_capacity(100); // explicit allocation
s.upper_into(&mut v); // hypothetical method: no new allocation here if the result fits in `v`
-EDIT-
This is, of course, just for one case. I'd love to be able to treat all of the str methods this way, including for example those in sentencecase, without having to copy their internal logic.
You can walk char-by-char and use char::to_uppercase():
let mut uppercase = String::with_capacity(100);
uppercase.extend(s.chars().flat_map(char::to_uppercase));
I think it does not handle everything correctly, but this is exactly what str::to_uppercase() does too, so I assume it's OK.
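If the target really is a Vec<u8> rather than a String, the same char-by-char idea can be sketched as below. The function name upper_into is hypothetical (borrowed from the question, not a std API); it encodes each uppercased char into a small stack buffer and appends the bytes to the caller's vector, so the vector only reallocates if its capacity is exceeded.

```rust
/// Hypothetical helper (not part of std): appends the uppercase form of `s`
/// to `out` as UTF-8 bytes, reusing whatever capacity `out` already has.
fn upper_into(s: &str, out: &mut Vec<u8>) {
    // A single char never needs more than 4 bytes in UTF-8.
    let mut tmp = [0u8; 4];
    for c in s.chars().flat_map(char::to_uppercase) {
        out.extend_from_slice(c.encode_utf8(&mut tmp).as_bytes());
    }
}
```

Calling upper_into("my string", &mut v) with a v created by Vec::with_capacity(100) appends the uppercased bytes to v without a fresh allocation, as long as the result fits.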
I'm new to Rust, and I want to process strings in a function in Rust and then return a struct that contains the results of that processing to use in more functions. This is very simplified and a bit messier because of all my attempts to get this working, but:
struct Strucc<'a> {
string: &'a str,
booool: bool
}
fn do_stuff2<'a>(input: &'a str) -> Result<Strucc, &str> {
let to_split = input.to_lowercase();
let splitter = to_split.split("/");
let mut array: Vec<&str> = Vec::new();
for split in splitter {
array.push(split);
}
let var = array[0];
println!("{}", var);
let result = Strucc{
string: array[0],
booool: false
};
Ok(result)
}
The issue is that to convert the &str to lowercase, I have to create a new String that's owned by the function.
As I understand it, the reason this won't compile is because when I split the new String I created, all the &strs I get from it are substrings of the String, which are all still owned by the function, and so when the value is returned and that String goes out of scope, the value in the struct I returned gets erased.
I tried to fix this with lifetimes (as you can see in the function definition), but from what I can tell I'd have to give the String a lifetime which I can't do as far as I'm aware because it isn't borrowed. Either that or I need to make the struct own that String (which I also don't understand how to do, nor does it seem reasonable as I'd have to make the struct mutable).
Also as a sidenote: Previously I have tried to just use a String in the struct but I want to define constants which won't work with that, and I still don't think it would solve the issue. I've also tried to use .clone() in various places just in case but had no luck (though I know why this shouldn't work anyway).
I have been looking for some solution for this for hours and it feels like such a small step so I feel I may be asking the wrong questions or have missed something simple but please explain it like I'm five because I'm very confused.
I think you misunderstand what &str actually is. &str is just a pointer to the string data plus a length. The point of &str is to be an immutable reference to a specific string, which enables all sorts of nice optimizations. When you attempt to turn the &str lowercase, Rust needs somewhere to put the data, and the only place to put it would be a String, because Strings own their data. Take a look at this post for more information.
Your goal is unachievable without Strucc containing a String, since .to_lowercase() has to create new data, and you have to allocate the resulting data somewhere in order to own a reference to it. The best place to put the resulting data would be the returned struct, i.e. Strucc, and therefore Strucc must contain a String.
Also as a sidenote: Previously I have tried to just use a String in the struct but I want to define constants which won't work with that, and I still don't think it would solve the issue.
You can use "x".to_owned() to create a String from a string literal.
If you're trying to create a global constant, look at once_cell's lazy global initialization.
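As a minimal sketch of the owned-String approach described above: the struct and function names are reused from the question, while the error message and the "empty first segment is an error" rule are assumptions made only for illustration.

```rust
#[derive(Debug, PartialEq)]
struct Strucc {
    // The struct owns its data, so it no longer needs a lifetime parameter.
    string: String,
    booool: bool,
}

fn do_stuff2(input: &str) -> Result<Strucc, &'static str> {
    // `lowered` is a new String owned by this function.
    let lowered = input.to_lowercase();
    // `split` always yields at least one (possibly empty) piece.
    let first = lowered.split('/').next().unwrap_or("");
    if first.is_empty() {
        // Assumed rule, for illustration only.
        return Err("input has no leading segment");
    }
    // Copy the piece into its own String so it can outlive `lowered`.
    Ok(Strucc { string: first.to_owned(), booool: false })
}
```

The struct does not have to be mutable for this to work: it is built once with the owned String and can then be passed around immutably.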
I was trying to use rust for competitive coding and I was wondering what is the most efficient way of storing user input in a Vec. I have come up with a method but I am afraid that it is slow and redundant.
Here is my code:
use std::io;
fn main() {
let mut input = String::new();
io::stdin().read_line(&mut input).expect("cant read line");
let input:Vec<&str> = input.split(" ").collect();
let input:Vec<String> = input.iter().map(|x| x.to_string()).collect();
let input:Vec<i32> = input.iter().map(|x| x.trim().parse().unwrap()).collect();
println!("{:?}", input);
}
PS: I am new to rust.
I see these ways of improving the performance of the code:
Although not really relevant for std::io::stdin(), std::io::BufReader may have great effect for reading e.g. from std::fs::File. Buffer capacity can also matter.
Using locked stdin: let si = std::io::stdin(); let mut si = si.lock();
Avoiding allocations by keeping vectors around and using extend instead of collect, if the code reads multiple lines (unlike the sample you posted in the question).
Maybe avoiding temporary vectors altogether and just chaining Iterator operations together, or using a loop like for line in input.split(...) { ... }. It may affect performance either way; you need to experiment to find out.
Avoiding to_string() and just storing references into the input buffer (which can also be used to parse() into i32). Note that this may invite the famous Rust borrowing and lifetime complexity.
Maybe finding some fast SIMD-enhanced string to int parser instead of libstd's parse().
Maybe streaming the result to algorithm instead of collecting everything to a Vec first. This can be beneficial especially if multiple threads can be used. For performance, you would still likely need to send data in chunks, not by one single i32.
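A few of the points above (locked stdin, reusing buffers, no temporary Vec of Strings) can be combined in a sketch like the following. The function name read_ints and its signature are made up for illustration; it is generic over BufRead so the same code works with a locked stdin or an in-memory reader.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: reads one line from `reader` and replaces the
/// contents of `out` with its whitespace-separated integers.
/// `line` and `out` are reused between calls to avoid reallocating.
/// Returns Ok(false) once the input is exhausted.
fn read_ints<R: BufRead>(
    reader: &mut R,
    line: &mut String,
    out: &mut Vec<i32>,
) -> io::Result<bool> {
    line.clear();
    out.clear();
    if reader.read_line(line)? == 0 {
        return Ok(false); // end of input
    }
    for token in line.split_ascii_whitespace() {
        // Parse straight from the borrowed slice; no intermediate String.
        let n = token
            .parse::<i32>()
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        out.push(n);
    }
    Ok(true)
}
```

For real input, create the reader once with let stdin = io::stdin(); let mut reader = stdin.lock(); and call read_ints(&mut reader, &mut line, &mut nums) in a loop. Whether this is measurably faster than the simple version depends on the input size, so it is worth benchmarking.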
Yes, there are some changes you can make that will make your code more precise, simpler and faster.
Better code:
use std::io;
fn main() {
let mut input = String::new();
io::stdin().read_line(&mut input).unwrap();
let input: Vec<i32> = input.split_whitespace().map(|x| x.parse().unwrap()).collect();
println!("{:?}", input);
}
Explanation
input.split_whitespace() returns an iterator over the substrings separated by any kind of whitespace, including line breaks. This saves the time spent splitting on a single space with input.split(" ") and then iterating again to call .trim() on each string slice to remove any surrounding whitespace.
(You can also check out input.split_ascii_whitespace() if you want to restrict the split to ASCII whitespace.)
There was no need for input.iter().map(|x| x.to_string()).collect(), since you can also call .trim() on a string slice.
This saves some time in both the runtime and coding process, since the .collect() method is only used once and there was just one iteration.
I've created a few non-trivial parsers in nom, so I'm pretty familiar with it at this point. All the parsers I've created until now always provide the entire input slice to the parser.
I'd like to create a streaming parser, which I assume means that I can continue to feed bytes into the parser until it is complete. I've had a hard time finding any documentation or examples that illustrate this, and I also question my assumption of what a "streaming parser" is.
My questions are:
Is my understanding of what a streaming parser is correct?
If so, are there any good examples of a parser using this technique?
nom parsers neither maintain a buffer to feed more data into, nor do they maintain "state" where they previously needed more bytes.
But if you take a look at the IResult structure you see that you can return a partial result or indicate that you need more data.
There seem to be some structures provided to handle streaming: I think you are supposed to create a Consumer from a parser using the consumer_from_parser! macro, implement a Producer for your data source, and call run until it returns None (and start again when you have more data). Examples and docs seem to be mostly missing so far - see bottom of https://github.com/Geal/nom :)
Also it looks like most functions and macros in nom are not documented well (or at all) regarding their behavior when hitting the end of the input. For example take_until! returns Incomplete if the input isn't long enough to contain the substr to look for, but returns an error if the input is long enough but doesn't contain substr.
Also nom mostly uses either &[u8] or &str for input; you can't signal an actual "end of stream" through these types. You could implement your own input type (related traits: nom::{AsBytes,Compare,FindSubstring,FindToken,InputIter,InputLength,InputTake,Offset,ParseTo,Slice}) to add a "reached end of stream" flag, but the nom provided macros and functions won't be able to interpret it.
All in all I'd recommend splitting streamed input through some other means into chunks you can handle with simple non-streaming parsers (maybe even use synom instead of nom).
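The "split the stream into chunks first" idea can be sketched without nom at all. The framing below (one length byte followed by that many payload bytes) is a made-up format purely for illustration, as is the name take_frame; each frame it returns could then be handed to an ordinary non-streaming parser.

```rust
/// Hypothetical framing helper: tries to split one complete frame off the
/// front of `buf`. Assumed format: 1 length byte, then that many payload bytes.
/// Returns None (leaving `buf` untouched) if the frame is not complete yet.
fn take_frame(buf: &mut Vec<u8>) -> Option<Vec<u8>> {
    let len = *buf.first()? as usize;
    if buf.len() < 1 + len {
        return None; // need more bytes from the stream
    }
    let frame = buf[1..1 + len].to_vec();
    buf.drain(..1 + len); // drop the consumed bytes, keep the rest
    Some(frame)
}
```

The caller keeps appending newly received bytes to buf and calls take_frame in a loop until it returns None, which plays the role that Incomplete plays in a streaming parser.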
Here is a minimal working example. As @Stefan wrote, "I'd recommend splitting streamed input through some other means into chunks you can handle".
What somewhat works (and I'd be glad for suggestions on how to improve it) is to call bytes() on the File, take only as many bytes as necessary, and pass them to nom::bytes::streaming::take.
let reader = file.bytes();
let buf = reader.take(length).collect::<B>()?;
let (_input, chunk) = take(length)(&*buf)...;
The complete function can look like this:
/// Parse the first handful of bytes and return the bytes interpreted as UTF8
fn parse_first_bytes(file: std::fs::File, length: usize) -> Result<String> {
type B = std::result::Result<Vec<u8>, std::io::Error>;
let reader = file.bytes();
let buf = reader.take(length).collect::<B>()?;
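// Note: Finish::finish panics on Err::Incomplete, which the streaming
// `take` returns when `buf` holds fewer than `length` bytes.
let (_input, chunk) = take(length)(&*buf)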
.finish()
.map_err(|nom::error::Error { input: _, code: _ }| eyre!("..."))?;
let s = String::from_utf8_lossy(chunk);
Ok(s.to_string())
}
Here is the rest of main for an implementation similar to Unix' head command.
use color_eyre::Result;
use eyre::eyre;
use nom::{bytes::streaming::take, Finish};
use std::{fs::File, io::Read, path::PathBuf};
use structopt::StructOpt;
#[derive(Debug, StructOpt)]
#[structopt(about = "A minimal example of parsing a file only partially.
This implements the POSIX 'head' utility.")]
struct Args {
/// Input File
#[structopt(parse(from_os_str))]
input: PathBuf,
/// Number of bytes to consume
#[structopt(short = "c", default_value = "32")]
num_bytes: usize,
}
fn main() -> Result<()> {
let args = Args::from_args();
let file = File::open(args.input)?;
let head = parse_first_bytes(file, args.num_bytes)?;
println!("{}", head);
Ok(())
}
I have a bunch of long immutable strings, which I would like to store in a HashSet.
I need a bunch of mappings with these strings as keys.
I would like to use references to these strings as keys in these mappings to avoid copying strings.
This is what I eventually arrived at. My only concern is the extra copy I need to make at line 5.
let mut strings: HashSet<String> = HashSet::new(); // 1
let mut map: HashMap<&String, u8> = HashMap::new(); // 2
// 3
let s = "very long string".to_string(); // 4
strings.insert(s.clone()); // 5
let s_ref = strings.get(&s).unwrap(); // 6
map.insert(s_ref, 5); // 7
To avoid this cloning I found two workarounds:
Use Rc for string (adds overhead and code clutter)
Use unsafe code
Is there any sensible way to remove this excessive cloning?
It seems to me that what you are looking for is String Interning. There is a library, string-cache, which was developed as part of the Servo project which may be of help.
In any case, the basics are simple:
a long-lived pool of String, which guarantees they will not be moving at all
a look-up system to avoid inserting duplicates in the pool
You can use a typed arena to store your String, and then store &str to those strings without copying them (they will live for as long as the arena lives). Use a HashSet<&str> on top to avoid duplicates, and you're set.
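A typed arena needs an external crate, so here is only a std-based approximation of the same idea, under the assumption that all strings are known before the maps are built: fill the HashSet first (it acts as the pool and removes duplicates), then stop mutating it and borrow &str keys from it. The function name build_map and the choice of the string length as the mapped value are made up for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Builds a map whose keys borrow from `pool` instead of copying the strings.
/// The map cannot outlive `pool`, and `pool` cannot be modified while the
/// map exists; an arena-based interner lifts that second restriction.
fn build_map(pool: &HashSet<String>) -> HashMap<&str, usize> {
    pool.iter().map(|s| (s.as_str(), s.len())).collect()
}
```

Each String is moved into the pool exactly once (strings.insert(s) without the clone), and every map built afterwards stores only a pointer and a length per key.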
I'm getting into Rust programming to realize a small program and I'm a little bit lost in string conversions.
In my program, I have a vector as follows:
let mut name: Vec<winnt::WCHAR> = Vec::new();
WCHAR is the same as a u16 on my Windows machine.
I hand over the Vec<u16> to a C function (as a pointer) which fills it with data. I then need to convert the string contained in the vector into a &str. However, no matter what I try, I cannot manage to get this conversion working.
The only thing I managed to get working is to convert it to a WideString:
widestr = unsafe { WideCString::from_ptr_str(name.as_ptr()) };
But this seems to be a step in the wrong direction.
What is the best way to convert the Vec<u16> to a &str, under the assumption that the vector holds a valid, null-terminated string?
I then need to convert the string contained in the vector into a &str. However, no matter, what I try, I can not manage to get this conversion working.
There's no way of making this a "free" conversion.
A &str is a Unicode string encoded with UTF-8. This is a byte-oriented encoding. If you have UTF-16 (or the different but common UCS-2 encoding), there's no way to read one as the other. That's equivalent to trying to read a JPEG image as a PDF. Both chunks of data might be a string, but the encoding is important.
The first question is "do you really need to do that?". Many times, you can take data from one function and shovel it back into another function, never looking at it. If you can get away with that, that might be the best answer.
If you do need to transform it, then you have to deal with the errors that can occur. An arbitrary array of 16-bit integers may not be valid UTF-16 or UCS-2. These encodings have edge cases that can easily produce invalid strings. Null-termination is another aspect - Unicode actually allows for embedded NUL characters, so a null-terminated string can't hold all possible Unicode characters!
Once you've ensured that the encoding is valid [1] and figured out how many entries in the input vector comprise the string, then you have to decode the input format and re-encode to the output format. This is likely to require some kind of new allocation, so you are most likely to end up with a String, which can then be used most anywhere a &str can be used.
There is a built-in method to convert UTF-16 data to a String: String::from_utf16. Note that it returns a Result to allow for these error cases. There's also String::from_utf16_lossy, which replaces invalid encoded parts with the Unicode replacement character.
let name = [0x68, 0x65, 0x6c, 0x6c, 0x6f];
let a = String::from_utf16(&name);
let b = String::from_utf16_lossy(&name);
println!("{:?}", a);
println!("{:?}", b);
If you are starting from a pointer to a u16 or WCHAR, you will need to convert to a slice first by using slice::from_raw_parts. If you have a null-terminated string, you need to find the NUL yourself and slice the input appropriately.
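For the case in the question, where the data is already in a Vec<u16>, finding the NUL and decoding can be sketched as follows. The function name wide_to_string is made up; treating a missing NUL as "use the whole buffer" is an assumption that may not suit every API.

```rust
use std::string::FromUtf16Error;

/// Hypothetical helper: decodes the UTF-16 data in `buf` up to (not
/// including) the first NUL code unit, or the whole buffer if there is none.
fn wide_to_string(buf: &[u16]) -> Result<String, FromUtf16Error> {
    let len = buf.iter().position(|&unit| unit == 0).unwrap_or(buf.len());
    String::from_utf16(&buf[..len])
}
```

Calling wide_to_string(&name) on the filled vector yields an owned String, and &s or s.as_str() then gives the &str wherever one is needed.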
[1] This is actually a great way of using types; a &str is guaranteed to be UTF-8 encoded, so no further check needs to be made. Similarly, the WideCString is likely to perform a check once upon construction and then can skip the check on later uses.
This is my simple hack for this case. It keeps only the low byte of each UTF-16 code unit, so it is correct only for text in the ASCII/Latin-1 range; adapt it for your own case:
let mut v = vec![0u16; MAX_PATH as usize];
// imaginary win32 function
win32_function(v.as_mut_ptr());
let mut path = String::new();
for val in v.iter() {
let c: u8 = (*val & 0xFF) as u8;
if c == 0 {
break;
} else {
path.push(c as char);
}
}