So, I'm working on porting a string tokenizer that I wrote in Python over to Rust, and I've run into an issue I can't seem to get past with lifetimes and structs.
So, the process is basically:
Get an array of files
Convert each file to a Vec<String> of tokens
Use a Counter and UniCase to get counts of individual instances of tokens from each vec
Save that count in a struct, along with some other data
(Future) do some processing on the set of structs to accumulate the total data alongside the per-file data
struct Corpus<'a> {
words: Counter<UniCase<&'a String>>,
parts: Vec<CorpusPart<'a>>
}
pub struct CorpusPart<'a> {
percent_of_total: f32,
word_count: usize,
words: Counter<UniCase<&'a String>>
}
fn process_file(entry: &DirEntry) -> CorpusPart {
let mut contents = read_to_string(entry.path())
.expect("Could not load contents.");
let tokens = tokenize(&mut contents);
let counted_words = collect(&tokens);
CorpusPart {
percent_of_total: 0.0,
word_count: tokens.len(),
words: counted_words
}
}
pub fn tokenize(normalized: &mut String) -> Vec<String> {
// snip ...
}
pub fn collect(results: &Vec<String>) -> Counter<UniCase<&'_ String>> {
results.iter()
.map(|w| UniCase::new(w))
.collect::<Counter<_>>()
}
However, when I try to return CorpusPart it complains that it is trying to reference a local variable tokens. How can/should I deal with this? I tried adding lifetime annotations, but couldn't figure it out...
Essentially, I no longer need the Vec<String>, but I do need some of the Strings that were in it for the counter.
Any help is appreciated, thank you!
The issue here is that you are throwing away the Vec<String>, but still referencing the elements inside it. If you no longer need the Vec<String>, but still require some of its contents, you have to transfer ownership of those Strings to something else.
I assume you want Corpus and CorpusPart to both point to the same Strings, so you are not duplicating Strings needlessly. If that is the case, either Corpus or CorpusPart must own the Strings, and the one that doesn't own them references the Strings owned by the other. (Sounds more complicated than it actually is.)
I will assume CorpusPart owns the Strings, and Corpus just points to them:
use std::fs::DirEntry;
use std::fs::read_to_string;
pub struct UniCase<a> {
test: a
}
impl<a> UniCase<a> {
fn new(item: a) -> UniCase<a> {
UniCase {
test: item
}
}
}
type Counter<a> = Vec<a>;
struct Corpus<'a> {
words: Counter<UniCase<&'a String>>, // Will reference the strings in CorpusPart (I assume you implemented this elsewhere)
parts: Vec<CorpusPart>
}
pub struct CorpusPart {
percent_of_total: f32,
word_count: usize,
words: Counter<UniCase<String>> // Has ownership of the strings
}
fn process_file(entry: &DirEntry) -> CorpusPart {
let mut contents = read_to_string(entry.path())
.expect("Could not load contents.");
let tokens = tokenize(&mut contents);
let length = tokens.len(); // Cache the length, as tokens will no longer be valid once passed to collect
let counted_words = collect(tokens);
CorpusPart {
percent_of_total: 0.0,
word_count: length,
words: counted_words
}
}
pub fn tokenize(normalized: &mut String) -> Vec<String> {
Vec::new()
}
pub fn collect(results: Vec<String>) -> Counter<UniCase<String>> {
results.into_iter() // Use into_iter() to consume the Vec that is passed in, and take ownership of the internal items
.map(|w| UniCase::new(w))
.collect::<Counter<_>>()
}
I aliased Counter<a> to Vec<a>, as I don't know what Counter you are using.
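For the Corpus side, which the code above leaves out, here is a rough sketch of how corpus-wide totals could borrow the strings owned by the parts. It is only an illustration under some assumptions: a plain std HashMap stands in for Counter, tokens are lowercased instead of being wrapped in UniCase (which is not quite the same kind of case folding), and Part, count_owned and totals are names made up for this example.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for CorpusPart: it owns its tokens.
struct Part {
    words: HashMap<String, usize>,
}

// Consume the Vec so the map takes ownership of each String.
// Lowercasing is an assumption standing in for UniCase.
fn count_owned(tokens: Vec<String>) -> Part {
    let mut words: HashMap<String, usize> = HashMap::new();
    for token in tokens {
        *words.entry(token.to_lowercase()).or_insert(0) += 1;
    }
    Part { words }
}

// The totals only borrow the strings; the parts must outlive them.
fn totals<'a>(parts: &'a [Part]) -> HashMap<&'a str, usize> {
    let mut total: HashMap<&'a str, usize> = HashMap::new();
    for part in parts {
        for (word, count) in &part.words {
            *total.entry(word.as_str()).or_insert(0) += *count;
        }
    }
    total
}
```

Note that the totals are computed from a borrowed slice of parts rather than stored next to the parts in one struct. A struct in which one field borrows from another field of the same struct is self-referential, and safe Rust will not let you build it in the obvious way, so keeping the owner and the borrower separate is probably the easier route.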
I have a struct that has a field that is a BTreeMap whose value is another struct that implements From<&[u8]>
MyStruct {
...
btree: BTreeMap<String, MyOtherStruct>
...
}
MyOtherStruct implements From<&[u8]> because i'm recovering it from a file.
impl From<&[u8]> for MyOtherStruct {
fn from(stream: &[u8]) -> Self {
...
}
}
I read the file that has a list of MyOtherStruct, and I have a function that parses the stream and returns an array of streams, which represents the streams of each struct MyOtherStruct
fn read_file(path: &PathBuf) -> Vec<u8> {
....
}
fn find_streams(stream: &[u8]) -> Vec<&[u8]> {
....
}
Then to build MyStruct, I take the array of streams and for each stream I create MyOtherStruct from the stream
fn main() {
let file_content = read_file(&PathBuf::from("path"));
let streams = find_streams(&file_content);
let mut my_other_structs = BTreeMap::<String, MyOtherStruct>::new();
// here is where i collect my items
streams.iter().for_each(|s| {
let item = MyOtherStruct::from(*s);
my_other_structs.insert(String::from("some key"), item);
});
....
....
}
The question is in the part where I collect my items. Before using a for_each I used a map but the compiler gave me an error that said the trait 'FromIterator<IndexEntry>' is not implemented for 'BTreeMap<std::string::String, IndexEntry>'.
Of course I understand what the compiler error refers to, so I copied the signature of the trait I needed, pasted it into the editor and implemented it.
impl FromIterator<MyOtherStruct> for BTreeMap<String, MyOtherStruct> {
fn from_iter<T: IntoIterator<Item = MyOtherStruct>>(iter: T) -> Self {
let mut btree = BTreeMap::new();
iter.into_iter().for_each(|e| {
btree.insert(String::from("some key"), e);
});
btree
}
}
so, then instead of doing it this way
let mut my_other_structs = BTreeMap::<String, MyOtherStruct>::new();
streams.iter().for_each(|s| {
let item = MyOtherStruct::from(*s);
my_other_structs.insert(String::from("some key"), item);
});
it looked something like this
let my_other_structs = streams.iter()
.map(|s| MyOtherStruct::from(*s) )
.collect();
My question is, beyond cosmetics, is there any significant difference in the way things look on the back end? When assembling my BTreeMap one way or the other.
I mean I love how it looks when I do it with the FromIterator and just use a .map where I need it, but internally I do a for_each and it's the same thing I'm doing the other way without augmenting a .map on top of it.
so is there any relevant difference in this case?
map().collect() is more idiomatic for a couple of reasons. For one, the documentation of for_each itself recommends a plain for loop unless for_each makes the code more readable, for example at the end of a longer iterator chain.
The second and more important reason is that .collect() hands the whole iterator to the collection, which can use the iterator's size hint and preallocate storage where that applies (Vec and HashMap do; a BTreeMap has no capacity to reserve, but it is still free to build itself in bulk). So it should perform as well as or better than for_each(insert).
Your FromIterator<MyOtherStruct> implementation could also be streamlined using the existing impl<K: Ord, V> FromIterator<(K, V)> for BTreeMap<K, V>, like this:
impl FromIterator<MyOtherStruct> for BTreeMap<String, MyOtherStruct> {
fn from_iter<T: IntoIterator<Item = MyOtherStruct>>(iter: T) -> Self {
iter.into_iter()
.map(|e| (String::from("some key"), e))
.collect()
}
}
Or depending on your actual uses just do that directly instead of implementing FromIterator in the first place.
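A minimal sketch of that "do it directly" variant, collecting (key, value) tuples straight into the map with the standard library impl. Entry is a hypothetical stand-in for MyOtherStruct, and the key0/key1 naming is made up here only so that the keys are distinct (with a constant "some key" every insert would overwrite the previous one).

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for MyOtherStruct.
#[derive(Debug, PartialEq)]
struct Entry {
    len: usize,
}

impl From<&[u8]> for Entry {
    fn from(stream: &[u8]) -> Self {
        Entry { len: stream.len() }
    }
}

// No custom FromIterator needed: BTreeMap already collects from
// an iterator of (key, value) tuples.
fn build(streams: &[&[u8]]) -> BTreeMap<String, Entry> {
    streams
        .iter()
        .enumerate()
        .map(|(i, s)| (format!("key{}", i), Entry::from(*s)))
        .collect()
}
```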
I'm trying to create a constructor for my struct, which would store an iterator over a String read from a file. The problem is that once the function returns, the String is dropped and the compiler complains that new() returns a value referencing data owned by the current function. Is there a way to associate the String with the struct somehow so that it is not dropped after return?
I think I understand the complaint here, but I don't understand how to deal with it, because I want the constructor to handle both the file reading and the iterator creation.
pub struct CharStream<'a> {
input: std::str::Chars<'a>,
filename: String,
}
impl<'a> CharStream<'a> {
pub fn new(filename: String) -> CharStream<'a> {
let mut file = File::open(&filename).unwrap();
let mut input = String::new();
file.read_to_string(&mut input);
CharStream {
input: input.chars(), // Create an iterator over `input`
filename: filename,
}
// `input` is dropped here
}
}
I would rename CharStream into FileContents and let it own both the filename and contents of the file as Strings. Then when you need to produce a TokenIter to iterate over chunks of chars from the contents you can then create the Chars<'a> on-demand and pass it to TokenIter then. Complete example:
use std::fs;
use std::str::Chars;
struct FileContents {
filename: String,
contents: String,
}
impl FileContents {
fn new(filename: String) -> Self {
let contents = fs::read_to_string(&filename).unwrap();
FileContents { filename, contents }
}
fn token_iter(&self) -> TokenIter<'_> {
TokenIter {
chars: self.contents.chars(),
}
}
}
struct TokenIter<'a> {
chars: Chars<'a>,
}
struct Token; // represents some chunk of chars
impl<'a> Iterator for TokenIter<'a> {
type Item = Token;
fn next(&mut self) -> Option<Self::Item> {
self.chars.next(); // call as many times as necessary to create token
Some(Token) // return created token here
}
}
fn example(filename: String) {
let file_contents = FileContents::new(filename);
let tokens = file_contents.token_iter();
for token in tokens {
// more processing here
}
}
The iterator returned by String::chars() is only valid as long as the original string input lives. input is dropped at the end of new, so the iterator cannot be returned from the function.
To solve this, you'd want to store the input string in the struct as well, but then you run into other problems because one struct member can't have a reference to another member of the same struct. One reason for this is that the struct would become immovable, since moving it would invalidate the reference.
The simplest solution is probably to collect the chars into a Vec<char> and store that vector inside the CharStream. Then add a usize index and write your own Iterator<Item = char> implementation.
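A rough sketch of that Vec<char> variant, assuming the constructor should return an io::Result instead of unwrapping. The from_text constructor and the filename accessor are additions made here for illustration (from_text mainly so the iterator can be tried without touching the file system).

```rust
use std::fs;
use std::io;

pub struct CharStream {
    chars: Vec<char>, // owned copy of the file contents
    pos: usize,       // index of the next char to hand out
    filename: String,
}

impl CharStream {
    // Reads the file and copies its chars into the struct.
    pub fn new(filename: String) -> io::Result<Self> {
        let input = fs::read_to_string(&filename)?;
        Ok(Self::from_text(filename, &input))
    }

    // Hypothetical helper: build a stream from text already in memory.
    pub fn from_text(filename: String, text: &str) -> Self {
        CharStream { chars: text.chars().collect(), pos: 0, filename }
    }

    pub fn filename(&self) -> &str {
        &self.filename
    }
}

impl Iterator for CharStream {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        let c = self.chars.get(self.pos).copied()?;
        self.pos += 1;
        Some(c)
    }
}
```

The price is four bytes per character instead of the UTF-8 encoding, which is why the next option below is the more memory-efficient one.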
Another approach (more memory-efficient) is to store the String itself, and create the Chars iterator on demand, but that would of course result in a different API.
Solutions involving RefCell or similar wrappers are probably also possible.
TL;DR: I want to implement trait std::io::Write that outputs to a memory buffer, ideally String, for unit-testing purposes.
I must be missing something simple.
Similar to another question, Writing to a file or stdout in Rust, I am working on code that can work with any std::io::Write implementation.
It operates on structure defined like this:
pub struct MyStructure {
writer: Box<dyn Write>,
}
Now, it's easy to create an instance writing to either a file or stdout:
impl MyStructure {
pub fn use_stdout() -> Self {
let writer = Box::new(std::io::stdout());
MyStructure { writer }
}
pub fn use_file<P: AsRef<Path>>(path: P) -> Result<Self> {
let writer = Box::new(File::create(path)?);
Ok(MyStructure { writer })
}
pub fn printit(&mut self) -> Result<()> {
self.writer.write(b"hello")?;
Ok(())
}
}
But for unit testing, I also need to have a way to run the business logic (here represented by method printit()) and trap its output, so that its content can be checked in the test.
I cannot figure out how to implement this. This playground code shows how I would like to use it, but it does not compile because it breaks borrowing rules.
// invalid code - does not compile!
fn main() {
let mut buf = Vec::new(); // This buffer should receive output
let mut x2 = MyStructure { writer: Box::new(buf) };
x2.printit().unwrap();
// now, get the collected output
let output = std::str::from_utf8(buf.as_slice()).unwrap().to_string();
// here I want to analyze the output, for instance in unit-test asserts
println!("Output to string was {}", output);
}
Any idea how to write the code correctly? I.e., how to implement a writer on top of a memory structure (String, Vec, ...) that can be accessed afterwards?
Something like this does work:
let mut buf = Vec::new();
{
// Use the buffer by a mutable reference
//
// Also, we're doing it inside another scope
// to help the borrow checker
let mut x2 = MyStructure { writer: Box::new(&mut buf) };
x2.printit().unwrap();
}
let output = std::str::from_utf8(buf.as_slice()).unwrap().to_string();
println!("Output to string was {}", output);
However, in order for this to work, you need to modify your type and add a lifetime parameter:
pub struct MyStructure<'a> {
writer: Box<dyn Write + 'a>,
}
Note that in your case (where you omit the + 'a part) the compiler assumes that you use 'static as the lifetime of the trait object:
// Same as your original variant
pub struct MyStructure {
writer: Box<dyn Write + 'static>
}
This limits the set of types which could be used here, in particular, you cannot use any kinds of borrowed references. Therefore, for maximum genericity we have to be explicit here and define a lifetime parameter.
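Putting the pieces together, a self-contained sketch of the lifetime-parameter version might look like the following. The with_writer constructor and the capture function are names invented for this example, and write_all is used instead of write so that a partial write cannot silently drop bytes.

```rust
use std::io::{self, Write};

pub struct MyStructure<'a> {
    writer: Box<dyn Write + 'a>,
}

impl<'a> MyStructure<'a> {
    // Hypothetical constructor: accepts any writer that lives for 'a,
    // including a mutable reference to a buffer.
    pub fn with_writer(writer: impl Write + 'a) -> Self {
        MyStructure { writer: Box::new(writer) }
    }

    pub fn printit(&mut self) -> io::Result<()> {
        self.writer.write_all(b"hello")
    }
}

// Runs the business logic against an in-memory buffer and returns
// whatever was written, e.g. for asserting on it in a unit test.
fn capture() -> String {
    let mut buf = Vec::new();
    {
        let mut x = MyStructure::with_writer(&mut buf);
        x.printit().unwrap();
    } // `x` is dropped here, which ends the mutable borrow of `buf`
    String::from_utf8(buf).unwrap()
}
```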
Also note that depending on your use case, you can use generics instead of trait objects:
pub struct MyStructure<W: Write> {
writer: W
}
In this case the types are fully visible at any point of your program, and therefore no additional lifetime annotation is needed.
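As a sketch of the generic variant: because the struct owns a concrete W, it can simply hand the writer back when you are done. The new and into_inner methods are hypothetical additions for this example, not something from the original code.

```rust
use std::io::{self, Write};

pub struct MyStructure<W: Write> {
    writer: W,
}

impl<W: Write> MyStructure<W> {
    pub fn new(writer: W) -> Self {
        MyStructure { writer }
    }

    pub fn printit(&mut self) -> io::Result<()> {
        self.writer.write_all(b"hello")
    }

    // Hypothetical helper: consume the struct and return the writer,
    // so a test can inspect what was written.
    pub fn into_inner(self) -> W {
        self.writer
    }
}
```

In a test you would construct it with MyStructure::new(Vec::new()), call printit(), and then turn the result of into_inner() into a String. No borrow is involved, so no extra scope is needed.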
I want to write a macro that generates varying structs from an integer argument. For example, make_struct!(3) might generate something like this:
pub struct MyStruct3 {
field_0: u32,
field_1: u32,
field_2: u32
}
What's the best way to transform that "3" literal into a number that I can use to generate code? Should I be using macro_rules! or a proc-macro?
You need a procedural attribute macro and quite a bit of pipework. An example implementation is on GitHub; bear in mind that it is pretty rough around the edges, but it works well enough to start with.
The aim is to have the following:
#[derivefields(u32, "field", 3)]
struct MyStruct {
foo: u32
}
transpile to:
struct MyStruct {
pub field_0: u32,
pub field_1: u32,
pub field_2: u32,
foo: u32
}
To do this, first, we're going to establish a couple of things. We're going to need a struct to easily store and retrieve our arguments:
struct MacroInput {
pub field_type: syn::Type,
pub field_name: String,
pub field_count: u64
}
The rest is pipework:
impl Parse for MacroInput {
fn parse(input: ParseStream) -> syn::Result<Self> {
let field_type = input.parse::<syn::Type>()?;
let _comma = input.parse::<syn::token::Comma>()?;
let field_name = input.parse::<syn::LitStr>()?;
let _comma = input.parse::<syn::token::Comma>()?;
let count = input.parse::<syn::LitInt>()?;
Ok(MacroInput {
field_type: field_type,
field_name: field_name.value(),
field_count: count.base10_parse()? // report a bad count as a compile error instead of panicking
})
}
}
This implements syn::parse::Parse for our struct and allows us to use syn::parse_macro_input!() to easily parse our arguments.
#[proc_macro_attribute]
pub fn derivefields(attr: TokenStream, item: TokenStream) -> TokenStream {
let input = syn::parse_macro_input!(attr as MacroInput);
let mut found_struct = false; // We actually need a struct
item.into_iter().map(|r| {
match &r {
&proc_macro::TokenTree::Ident(ref ident) if ident.to_string() == "struct" => { // react on keyword "struct" so we don't randomly modify non-structs
found_struct = true;
r
},
&proc_macro::TokenTree::Group(ref group) if group.delimiter() == proc_macro::Delimiter::Brace && found_struct => { // The braced body of the struct
let mut stream = proc_macro::TokenStream::new();
stream.extend((0..input.field_count).fold(vec![], |mut state:Vec<proc_macro::TokenStream>, i| {
let field_name_str = format!("{}_{}", input.field_name, i);
let field_name = Ident::new(&field_name_str, Span::call_site());
let field_type = input.field_type.clone();
state.push(quote!(pub #field_name: #field_type,
).into());
state
}).into_iter());
stream.extend(group.stream());
proc_macro::TokenTree::Group(
proc_macro::Group::new(
proc_macro::Delimiter::Brace,
stream
)
)
}
_ => r
}
}).collect()
}
The modifier creates a new TokenStream and adds our fields first. This is extremely important: assume that the struct provided is struct Foo { bar: u8 }; appending last would cause a parse error due to a missing ,. Prepending allows us to not have to care about this, since a trailing comma in a struct is not a parse error.
Once we have this TokenStream, we successively extend() it with the generated tokens from quote::quote!(); this allows us to not have to build the token fragments ourselves. One gotcha is that the field name needs to be converted to an Ident (it gets quoted otherwise, which isn't something we want).
We then return this modified TokenStream as a TokenTree::Group to signify that this is indeed a block delimited by braces.
In doing so, we also solved a few problems:
Since structs without named members (pub struct Foo(u32) for example) never actually have an opening brace, this macro is a no-op for them
It will no-op any item that isn't a struct
It will also no-op structs without a member
I am just beginning to learn Rust and I’m struggling to handle the lifetimes.
I’d like to have a struct with a String in it which will be used to buffer lines from stdin. Then I’d like to have a method on the struct which returns the next character from the buffer, or if all of the characters from the line have been consumed it will read the next line from stdin.
The documentation says that Rust strings aren’t indexable by character because that is inefficient with UTF-8. As I’m accessing the characters sequentially it should be fine to use an iterator. However, as far as I understand, iterators in Rust are tied to the lifetime of the thing they’re iterating and I can’t work out how I could store this iterator in the struct alongside the String.
Here is the pseudo-Rust that I’d like to achieve. Obviously it doesn’t compile.
struct CharGetter {
/* Buffer containing one line of input at a time */
input_buf: String,
/* The position within input_buf of the next character to
* return. This needs a lifetime parameter. */
input_pos: std::str::Chars
}
impl CharGetter {
fn next(&mut self) -> Result<char, io::Error> {
loop {
match self.input_pos.next() {
/* If there is still a character left in the input
* buffer then we can just return it immediately. */
Some(n) => return Ok(n),
/* Otherwise get the next line */
None => {
io::stdin().read_line(&mut self.input_buf)?;
/* Reset the iterator to the beginning of the
* line. Obviously this doesn’t work because it’s
* not obeying the lifetime of input_buf */
self.input_pos = self.input_buf.chars();
}
}
}
}
}
I am trying to do the Synacor challenge. This involves implementing a virtual machine where one of the opcodes reads a character from stdin and stores it in a register. I have this part working fine. The documentation states that whenever the program inside the VM reads a character it will keep reading until it reads a whole line. I wanted to take advantage of this to add a “save” command to my implementation. That means that whenever the program asks for a character, I will read a line from the input. If the line is “save”, I will save the state of the VM and then continue to get another line to feed to the VM. Each time the VM executes the input opcode, I need to be able to give it one character at a time from the buffered line until the buffer is depleted.
My current implementation is here. My plan was to add input_buf and input_pos to the Machine struct which represents the state of the VM.
As thoroughly described in Why can't I store a value and a reference to that value in the same struct?, in general you can't do this because it truly is unsafe. When you move memory, you invalidate references. This is why a lot of people use Rust - to not have invalid references which lead to program crashes!
Let's look at your code:
io::stdin().read_line(&mut self.input_buf)?;
self.input_pos = self.input_buf.chars();
Between these two lines, you've left self.input_pos in a bad state. If a panic occurs, then the destructor of the object has the opportunity to access invalid memory! Rust is protecting you from an issue that most people never think about.
As also described in that answer:
There is a special case where the lifetime tracking is overzealous:
when you have something placed on the heap. This occurs when you use a
Box<T>, for example. In this case, the structure that is moved
contains a pointer into the heap. The pointed-at value will remain
stable, but the address of the pointer itself will move. In practice,
this doesn't matter, as you always follow the pointer.
Some crates provide ways of representing this case, but they require
that the base address never move. This rules out mutating vectors,
which may cause a reallocation and a move of the heap-allocated
values.
Remember that a String is just a vector of bytes with extra preconditions added.
Instead of using one of those crates, we can also roll our own solution, which means we (read: you) get to accept all the responsibility for ensuring that we aren't doing anything wrong.
The trick here is to ensure that the data inside the String never moves and no accidental references are taken.
use std::{mem, str::Chars};
/// I believe this struct to be safe because the String is
/// heap-allocated (stable address) and will never be modified
/// (stable address). `chars` will not outlive the struct, so
/// lying about the lifetime should be fine.
///
/// TODO: What about during destruction?
/// `Chars` shouldn't have a destructor...
struct OwningChars {
_s: String,
chars: Chars<'static>,
}
impl OwningChars {
fn new(s: String) -> Self {
let chars = unsafe { mem::transmute(s.chars()) };
OwningChars { _s: s, chars }
}
}
impl Iterator for OwningChars {
type Item = char;
fn next(&mut self) -> Option<Self::Item> {
self.chars.next()
}
}
You might even think about putting just this code into a module so that you can't accidentally muck about with the innards.
Here's the same code using the ouroboros crate to create a self-referential struct containing the String and a Chars iterator:
use ouroboros::self_referencing; // 0.4.1
use std::str::Chars;
#[self_referencing]
pub struct IntoChars {
string: String,
#[borrows(string)]
chars: Chars<'this>,
}
// All these implementations are based on what `Chars` implements itself
impl Iterator for IntoChars {
type Item = char;
#[inline]
fn next(&mut self) -> Option<Self::Item> {
self.with_mut(|me| me.chars.next())
}
#[inline]
fn count(mut self) -> usize {
self.with_mut(|me| me.chars.count())
}
#[inline]
fn size_hint(&self) -> (usize, Option<usize>) {
self.with(|me| me.chars.size_hint())
}
#[inline]
fn last(mut self) -> Option<Self::Item> {
self.with_mut(|me| me.chars.last())
}
}
impl DoubleEndedIterator for IntoChars {
#[inline]
fn next_back(&mut self) -> Option<Self::Item> {
self.with_mut(|me| me.chars.next_back())
}
}
impl std::iter::FusedIterator for IntoChars {}
// And an extension trait for convenience
trait IntoCharsExt {
fn into_chars(self) -> IntoChars;
}
impl IntoCharsExt for String {
fn into_chars(self) -> IntoChars {
IntoCharsBuilder {
string: self,
chars_builder: |s| s.chars(),
}
.build()
}
}
Here's the same code using the rental crate to create a self-referential struct containing the String and a Chars iterator:
#[macro_use]
extern crate rental; // 0.5.5
rental! {
mod into_chars {
pub use std::str::Chars;
#[rental]
pub struct IntoChars {
string: String,
chars: Chars<'string>,
}
}
}
use into_chars::IntoChars;
// All these implementations are based on what `Chars` implements itself
impl Iterator for IntoChars {
type Item = char;
#[inline]
fn next(&mut self) -> Option<Self::Item> {
self.rent_mut(|chars| chars.next())
}
#[inline]
fn count(mut self) -> usize {
self.rent_mut(|chars| chars.count())
}
#[inline]
fn size_hint(&self) -> (usize, Option<usize>) {
self.rent(|chars| chars.size_hint())
}
#[inline]
fn last(mut self) -> Option<Self::Item> {
self.rent_mut(|chars| chars.last())
}
}
impl DoubleEndedIterator for IntoChars {
#[inline]
fn next_back(&mut self) -> Option<Self::Item> {
self.rent_mut(|chars| chars.next_back())
}
}
impl std::iter::FusedIterator for IntoChars {}
// And an extension trait for convenience
trait IntoCharsExt {
fn into_chars(self) -> IntoChars;
}
impl IntoCharsExt for String {
fn into_chars(self) -> IntoChars {
IntoChars::new(self, |s| s.chars())
}
}
This answer doesn’t address the general problem of trying to store an iterator in the same struct as the object that it is iterating over. However, in this particular case we can get around the problem by storing an integer byte index into the string instead of the iterator. Rust will let you create a string slice using this byte index and then we can use that to extract the next character starting from that point. Next we just need to update the byte index by the number of bytes the code point takes up in UTF-8. We can do this with char::len_utf8().
This would work like the below:
struct CharGetter {
// Buffer containing one line of input at a time
input_buf: String,
// The byte position within input_buf of the next character to
// return.
input_pos: usize,
}
impl CharGetter {
fn next(&mut self) -> Result<char, std::io::Error> {
loop {
// Get an iterator over the string slice starting at the
// next byte position in the string
let mut input_pos = self.input_buf[self.input_pos..].chars();
// Try to get a character from the temporary iterator
match input_pos.next() {
// If there is still a character left in the input
// buffer then we can just return it immediately.
Some(n) => {
// Move the position along by the number of bytes
// that this character occupies in UTF-8
self.input_pos += n.len_utf8();
return Ok(n);
},
// Otherwise get the next line
None => {
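self.input_buf.clear();
// read_line returns Ok(0) at end of input; report that
// as an error instead of looping forever
if std::io::stdin().read_line(&mut self.input_buf)? == 0 {
return Err(std::io::ErrorKind::UnexpectedEof.into());
}
// Reset the position to the beginning of the line.
self.input_pos = 0;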
}
}
}
}
}
In practice this isn’t really any safer than storing the iterator, because the input_pos variable is still effectively doing the same job as an iterator and its validity still depends on input_buf not being modified. Presumably, if something else modified the buffer in the meantime, the program could panic when the string slice is created, because the index might no longer be at a character boundary.
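If that panic is a concern, one option (a sketch, not something the code above does) is to slice with str::get, which returns None for an index that is out of range or not on a character boundary instead of panicking. next_char_at is a hypothetical free function written only to show the idea.

```rust
// Returns the char starting at byte offset `pos` together with the
// byte offset of the following char, or None if `pos` is past the
// end, at the end, or in the middle of a multi-byte character.
fn next_char_at(buf: &str, pos: usize) -> Option<(char, usize)> {
    let c = buf.get(pos..)?.chars().next()?;
    Some((c, pos + c.len_utf8()))
}
```

The caller then has to decide what a None in the middle of the buffer means, but at least a stale index turns into a value that can be handled rather than a crash.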