I'm rather new to Rust and have put together a little experiment that blows my understanding of lifetime annotations entirely out of the water. This is compiled with rust-0.13.0-nightly, and there's a playpen version of the code here.
The meat of the program is the function 'recognize', which, together with the function 'lex', is responsible for allocating String instances. I'm sure the code is a bit goofy, so in addition to getting the lifetimes right enough to compile, I would also happily accept some guidance on making this idiomatic.
#[deriving(Show)]
enum Token<'a> {
Field(&'a std::string::String),
}
#[deriving(Show)]
struct LexerState<'a> {
character: int,
field: int,
tokens: Vec<Token<'a>>,
str_buf: &'a std::string::String,
}
// The goal with recognize is to:
//
// * gather all A .. z into a temporary string buffer str_buf
// * on ',', move buffer into a Field token
// * store the completely extracted field in LexerState's tokens attribute
//
// I think I'm not understanding how to specify the lifetimes and mutability
// correctly.
fn recognize<'a, 'r>(c: char, ctx: &'r mut LexerState<'a>) -> &'r mut LexerState<'a> {
match c {
'A' ... 'z' => {
ctx.str_buf.push(c);
},
',' => {
ctx.tokens.push(Field(ctx.str_buf));
ctx.field += 1;
ctx.str_buf = &std::string::String::new();
},
_ => ()
};
ctx.character += 1;
ctx
}
fn lex<'a, I, E>(it: &mut I)
-> LexerState<'a> where I: Iterator<Result<char, E>> {
let mut ctx = LexerState { character: 0, field: 0,
tokens: Vec::new(), str_buf: &std::string::String::new() };
for val in *it {
let c:char = val.ok().expect("wtf");
recognize(c, &mut ctx);
}
ctx
}
fn main() {
let tokens = lex(&mut std::io::stdio::stdin().chars());
println!("{}", tokens)
}
In this case, you're constructing new strings rather than borrowing existing strings, so you'd use an owned string directly:
use std::mem;
#[deriving(Show)]
enum Token {
Field(String),
}
#[deriving(Show)]
struct LexerState {
character: int,
field: int,
tokens: Vec<Token>,
str_buf: String,
}
// The goal with recognize is to:
//
// * gather all A .. z into a temporary string buffer str_buf
// * on ',', move buffer into a Field token
// * store the completely extracted field in LexerState's tokens attribute
//
// I think I'm not understanding how to specify the lifetimes and mutability
// correctly.
fn recognize<'r>(c: char, ctx: &'r mut LexerState) -> &'r mut LexerState {
match c {
'A' ...'z' => { ctx.str_buf.push(c); }
',' => {
ctx.tokens.push(Field(mem::replace(&mut ctx.str_buf,
String::new())));
ctx.field += 1;
}
_ => (),
};
ctx.character += 1;
ctx
}
fn lex<I, E>(it: &mut I) -> LexerState where I: Iterator<Result<char, E>> {
let mut ctx =
LexerState{
character: 0,
field: 0,
tokens: Vec::new(),
str_buf: String::new(),
};
for val in *it {
let c: char = val.ok().expect("wtf");
recognize(c, &mut ctx);
}
ctx
}
fn main() {
let tokens = lex(&mut std::io::stdio::stdin().chars());
println!("{}" , tokens)
}
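For readers on current stable Rust, here is a minimal sketch of the same owned-String idea. It is not the code from the answer above: it assumes `std::mem::take` in place of `mem::replace(.., String::new())`, swaps `int` for `usize`, and reads from any iterator of `char` instead of stdin, purely to keep the example small.

```rust
use std::mem;

#[derive(Debug, PartialEq)]
enum Token {
    Field(String),
}

#[derive(Debug, Default)]
struct LexerState {
    character: usize,
    field: usize,
    tokens: Vec<Token>,
    str_buf: String,
}

// Feeds one character into the lexer state.
fn recognize(c: char, ctx: &mut LexerState) {
    match c {
        // Same range as the original code ('A' through 'z').
        'A'..='z' => ctx.str_buf.push(c),
        ',' => {
            // mem::take moves the buffer out and leaves an empty String behind,
            // so the token owns its text and no lifetimes are needed.
            let field = mem::take(&mut ctx.str_buf);
            ctx.tokens.push(Token::Field(field));
            ctx.field += 1;
        }
        _ => {}
    }
    ctx.character += 1;
}

// Runs the lexer over any source of characters (an assumption; the
// original reads from stdin).
fn lex(input: impl IntoIterator<Item = char>) -> LexerState {
    let mut ctx = LexerState::default();
    for c in input {
        recognize(c, &mut ctx);
    }
    ctx
}
```

Calling `lex("ab,cd,e".chars())` leaves two `Field` tokens in `tokens` and the unterminated `e` in `str_buf`, since a field is only emitted when a comma is seen.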
Is there a way in Rust to initialize the first n elements of an array manually, and specify a default value to be used for the rest?
Specifically, when initializing structs, we can specify some fields, and use .. to initialize the remaining fields from another struct, e.g.:
let foo = Foo {
x: 1,
y: 2,
..Default::default()
};
Is there a similar mechanism for initializing part of an array manually? e.g.
let arr: [i32; 5] = [1, 2, ..3];
to get [1, 2, 3, 3, 3]?
Edit: I realized this can be done on stable. For the original answer, see below.
I had to wrestle with the compiler a bit so that it could infer the type of the array, but it works:
// A workaround for the equivalent method on `MaybeUninit` being unstable.
// Copy-paste from https://doc.rust-lang.org/stable/src/core/mem/maybe_uninit.rs.html#943-953.
pub unsafe fn maybe_uninit_array_assume_init<T, const N: usize>(
array: [core::mem::MaybeUninit<T>; N],
) -> [T; N] {
// SAFETY:
// * The caller guarantees that all elements of the array are initialized
// * `MaybeUninit<T>` and T are guaranteed to have the same layout
// * `MaybeUninit` does not drop, so there are no double-frees
// And thus the conversion is safe
(&array as *const _ as *const [T; N]).read()
}
macro_rules! array_with_default {
(#count) => { 0usize };
(#count $e:expr, $($rest:tt)*) => { 1usize + array_with_default!(#count $($rest)*) };
[$($e:expr),* ; $default:expr; $default_size:expr] => {{
// There is no hygiene for items, so we use unique names here.
#[allow(non_upper_case_globals)]
const __array_with_default_EXPRS_LEN: usize = array_with_default!(#count $($e,)*);
#[allow(non_upper_case_globals)]
const __array_with_default_DEFAULT_SIZE: usize = $default_size;
let mut result = unsafe { ::core::mem::MaybeUninit::<
[::core::mem::MaybeUninit<_>; {
__array_with_default_EXPRS_LEN + __array_with_default_DEFAULT_SIZE
}],
>::uninit().assume_init() };
let mut dest = result.as_mut_ptr();
$(
let expr = $e;
unsafe {
::core::ptr::write((*dest).as_mut_ptr(), expr);
dest = dest.add(1);
}
)*
for default_value in [$default; __array_with_default_DEFAULT_SIZE] {
unsafe {
::core::ptr::write((*dest).as_mut_ptr(), default_value);
dest = dest.add(1);
}
}
unsafe { maybe_uninit_array_assume_init(result) }
}};
}
Playground.
Based on the example from @Denys, here is a macro that works on nightly. Note that I had problems matching the .. syntax (I'm not sure it's impossible; I just didn't put much time into it):
#![feature(generic_const_exprs)]
#![allow(incomplete_features)]
use std::mem::MaybeUninit;
pub fn concat_arrays<T, const N: usize, const M: usize>(a: [T; N], b: [T; M]) -> [T; N + M] {
unsafe {
let mut result = MaybeUninit::<[T; N + M]>::uninit();
let dest = result.as_mut_ptr().cast::<[T; N]>();
dest.write(a);
let dest = dest.add(1).cast::<[T; M]>();
dest.write(b);
result.assume_init()
}
}
macro_rules! array_with_default {
[$($e:expr),* ; $default:expr; $default_size:expr] => {
concat_arrays([$($e),*], [$default; $default_size])
};
}
fn main() {
dbg!(array_with_default![1, 2; 3; 7]);
}
Playground.
As another option, you can build a default-filled array and then modify just the positions you need at runtime:
#![feature(explicit_generic_args_with_impl_trait)]
fn array_with_default_and_positions<T: Copy, const SIZE: usize>(
default: T,
init_values: impl IntoIterator<Item = (usize, T)>,
) -> [T; SIZE] {
let mut res = [default; SIZE];
for (i, e) in init_values.into_iter() {
res[i] = e;
}
res
}
Playground
Notice the use of #![feature(explicit_generic_args_with_impl_trait)], which is nightly-only. It can be avoided by taking a slice instead, since T and usize are Copy:
fn array_with_default_and_positions_v2<T: Copy, const SIZE: usize>(
default: T,
init_values: &[(usize, T)],
) -> [T; SIZE] {
let mut res = [default; SIZE];
for &(i, e) in init_values {
res[i] = e;
}
res
}
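If the values you want to set always form a prefix, as in the [1, 2, ..3] example from the question, the same "fill with the default, then overwrite" idea can be sketched on stable without any index pairs. Both helper names below are hypothetical, and the second variant assumes `std::array::from_fn` is available (Rust 1.63 or later).

```rust
/// Hypothetical helper: copies `prefix` into the front of an array of
/// length N and fills the remaining slots with `default`.
/// Panics if the prefix does not fit.
fn array_with_prefix<T: Copy, const N: usize>(prefix: &[T], default: T) -> [T; N] {
    assert!(prefix.len() <= N, "prefix longer than the array");
    let mut out = [default; N];
    out[..prefix.len()].copy_from_slice(prefix);
    out
}

/// Variant for types that are Clone but not Copy. Unlike the version
/// above, it silently ignores prefix elements beyond index N - 1.
fn array_with_prefix_clone<T: Clone, const N: usize>(prefix: &[T], default: T) -> [T; N] {
    std::array::from_fn(|i| prefix.get(i).cloned().unwrap_or_else(|| default.clone()))
}
```

With a type annotation on the binding, `let arr: [i32; 5] = array_with_prefix(&[1, 2], 3);` lets the compiler infer N from the annotation, so no turbofish is needed.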
I am implementing a robot that takes orders like L (turn left), R (turn right) and M (move forward). These orders may be followed by a count, as in M3LMR2 (move 3 steps, turn left, move one step, turn right twice). This is the equivalent of MMMLMRR.
I coded the robot structure that can understand the following enum:
pub enum Message {
TurnLeft(i8),
TurnRight(i8),
MoveForward(i8),
}
Robot::execute(&mut self, orders: Vec<Message>) is doing its job correctly.
Now, I am struggling to write something decent for the string parsing, juggling with &str, String, char and unsafe slicings because tokens can be 1 or more characters.
I have tried regular expression matching (almost worked), but I really want to tokenize the string:
fn capture(orders: &String, start: &usize, end: &usize) -> Message {
unsafe {
let order = orders.get_unchecked(start..end);
// …
};
Message::TurnLeft(1) // temporary
}
pub fn parse_orders(orders: String) -> Result<Vec<Message>, String> {
let mut messages = vec![];
let mut start: usize = 0;
let mut end: usize = 0;
while end < orders.len() && end != start {
end += 1;
match orders.get(end) {
Some('0'...'9') => continue,
_ => {
messages.push(capture(&orders, &start, &end));
start = end;
}
}
}
Ok(messages)
}
This doesn't compile and is clumsy.
The idea is to write a parser that turns the order string into a vector of Message:
let messages = parse_orders("M3LMR2");
println!("Messages => {:?}", messages);
// would print
// [Message::MoveForward(3), Message::TurnLeft(1), Message::MoveForward(1), Message::TurnRight(2)]
What would be an efficient and elegant way to do that?
You can do this very simply with an iterator, using parse and some basic String processing:
#[derive(Debug, PartialEq, Clone)]
enum Message {
TurnLeft(u8),
TurnRight(u8),
MoveForward(u8),
}
struct RobotOrders(String);
impl RobotOrders {
fn new(source: impl Into<String>) -> Self {
RobotOrders(source.into())
}
}
impl Iterator for RobotOrders {
type Item = Message;
fn next(&mut self) -> Option<Message> {
self.0.chars().next()?;
let order = self.0.remove(0);
let n_digits = self.0.chars().take_while(char::is_ascii_digit).count();
let mut number = self.0.clone();
self.0 = number.split_off(n_digits);
let number = number.parse().unwrap_or(1);
Some(match order {
'L' => Message::TurnLeft(number),
'R' => Message::TurnRight(number),
'M' => Message::MoveForward(number),
_ => unimplemented!(),
})
}
}
fn main() {
use Message::*;
let orders = RobotOrders::new("M3LMR2");
let should_be = [MoveForward(3), TurnLeft(1), MoveForward(1), TurnRight(2)];
assert!(orders.eq(should_be.iter().cloned()));
}
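The iterator above clones the remaining string on every call to next. If that matters, a sketch of an alternative is to walk the characters once with a Peekable iterator and report bad input as an error instead of panicking. The function name `parse_orders` and the Result<Vec<Message>, String> signature are borrowed from the question; the error messages are made up for illustration.

```rust
#[derive(Debug, PartialEq, Clone)]
enum Message {
    TurnLeft(u8),
    TurnRight(u8),
    MoveForward(u8),
}

fn parse_orders(orders: &str) -> Result<Vec<Message>, String> {
    let mut messages = Vec::new();
    let mut chars = orders.chars().peekable();
    while let Some(order) = chars.next() {
        // Collect the optional run of digits that follows the order letter.
        let mut digits = String::new();
        while let Some(d) = chars.peek().copied().filter(char::is_ascii_digit) {
            digits.push(d);
            chars.next();
        }
        // A missing count means "once"; a count that does not fit in u8 is an error.
        let count: u8 = if digits.is_empty() {
            1
        } else {
            digits
                .parse()
                .map_err(|e| format!("bad count {:?}: {}", digits, e))?
        };
        messages.push(match order {
            'L' => Message::TurnLeft(count),
            'R' => Message::TurnRight(count),
            'M' => Message::MoveForward(count),
            other => return Err(format!("unknown order {:?}", other)),
        });
    }
    Ok(messages)
}
```

Here `parse_orders("M3LMR2")` yields the four messages listed in the question, while an unknown letter or an oversized count such as `M999` comes back as an Err.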
I'm trying to build a basic web crawler in Rust, which I'm trying to port to html5ever. As of right now, I have a function with a struct inside that is supposed to return a Vec<String>. It gets this Vec from the struct in the return statement. Why does it always return an empty vector? (Does it have anything to do with the lifetime parameters?)
fn find_urls_in_html<'a>(
original_url: &Url,
raw_html: String,
fetched_cache: &Vec<String>,
) -> Vec<String> {
#[derive(Clone)]
struct Sink<'a> {
original_url: &'a Url,
returned_vec: Vec<String>,
fetched_cache: &'a Vec<String>,
}
impl<'a> TokenSink for Sink<'a> {
type Handle = ();
fn process_token(&mut self, token: Token, _line_number: u64) -> TokenSinkResult<()> {
trace!("token {:?}", token);
match token {
TagToken(tag) => {
if tag.kind == StartTag && tag.attrs.len() != 0 {
let _attribute_name = get_attribute_for_elem(&tag.name);
if _attribute_name == None {
return TokenSinkResult::Continue;
}
let attribute_name = _attribute_name.unwrap();
for attribute in &tag.attrs {
if &attribute.name.local != attribute_name {
continue;
}
trace!("element {:?} found", tag);
add_urls_to_vec(
repair_suggested_url(
self.original_url,
(&attribute.name.local, &attribute.value),
),
&mut self.returned_vec,
&self.fetched_cache,
);
}
}
}
ParseError(error) => {
warn!("error parsing html for {}: {:?}", self.original_url, error);
}
_ => {}
}
return TokenSinkResult::Continue;
}
}
let html = Sink {
original_url: original_url,
returned_vec: Vec::new(),
fetched_cache: fetched_cache,
};
let mut byte_tendril = ByteTendril::new();
{
let tendril_push_result = byte_tendril.try_push_bytes(&raw_html.into_bytes());
if tendril_push_result.is_err() {
warn!("error pushing bytes to tendril: {:?}", tendril_push_result);
return Vec::new();
}
}
let mut queue = BufferQueue::new();
queue.push_back(byte_tendril.try_reinterpret().unwrap());
let mut tok = Tokenizer::new(html.clone(), std::default::Default::default()); // default default! default?
let feed = tok.feed(&mut queue);
return html.returned_vec;
}
The output ends with no warning (and a panic, caused by another function due to this being empty). Can anyone help me figure out what's going on?
Thanks in advance.
When I initialize the Tokenizer, I use:
let mut tok = Tokenizer::new(html.clone(), std::default::Default::default());
The problem is that I'm handing the Tokenizer html.clone() instead of html. As a result, it pushes URLs into the returned_vec of the clone, and the returned_vec of html stays empty. Changing a few things, such as using a variable with mutable references instead of a clone, fixes this problem.
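The effect can be reproduced without html5ever. In the sketch below, `Sink` and `Driver` are hypothetical stand-ins for the question's Sink and for the Tokenizer; they only model who owns the vector and do no real HTML tokenizing. Whether the real Tokenizer lets you read its sink back the same way depends on the html5ever version, so treat that part as an assumption.

```rust
// Hypothetical stand-in for the question's Sink.
#[derive(Clone, Default)]
struct Sink {
    returned_vec: Vec<String>,
}

// Hypothetical stand-in for the Tokenizer: it owns whatever sink it is given.
struct Driver {
    sink: Sink,
}

impl Driver {
    fn new(sink: Sink) -> Self {
        Driver { sink }
    }

    // Pushes every whitespace-separated word into the sink it owns.
    fn feed(&mut self, input: &str) {
        for word in input.split_whitespace() {
            self.sink.returned_vec.push(word.to_string());
        }
    }
}

// Mirrors the bug: the driver fills the clone, the original stays empty.
fn collect_with_clone(input: &str) -> Vec<String> {
    let sink = Sink::default();
    let mut driver = Driver::new(sink.clone());
    driver.feed(input);
    sink.returned_vec
}

// One possible fix: move the sink in and read the results back from the driver.
fn collect_by_move(input: &str) -> Vec<String> {
    let mut driver = Driver::new(Sink::default());
    driver.feed(input);
    driver.sink.returned_vec
}
```

`collect_with_clone` always returns an empty vector no matter what is fed in, which matches the behaviour described in the question, and lifetimes play no part in it.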
I'm trying to create some sets of Strings and then merge some of these sets so that they have the same tag (of type usize). Once I initialize the map, I start adding strings:
self.clusters.make_set("a");
self.clusters.make_set("b");
When I call self.clusters.find("a") and self.clusters.find("b"), different values are returned, which is fine because I haven't merged the sets yet. Then I call the following method to merge two sets
let _ = self.clusters.union("a", "b");
If I call self.clusters.find("a") and self.clusters.find("b") now, I get the same value. However, when I call the finalize() method and try to iterate through the map, the original tags are returned, as if I never merged the sets.
self.clusters.finalize();
for (address, tag) in &self.clusters.map {
self.clusterizer_writer.write_all(format!("{};{}\n", address,
self.clusters.parent[*tag]).as_bytes()).unwrap();
}
// to output all keys with the same tag as a list.
let a: Vec<(usize, Vec<String>)> = {
let mut x = HashMap::new();
for (k, v) in self.clusters.map.clone() {
x.entry(v).or_insert_with(Vec::new).push(k)
}
x.into_iter().collect()
};
I can't figure out why this is the case, but I'm relatively new to Rust; maybe it's an issue with pointers?
Instead of "a" and "b", I'm actually using something like utils::arr_to_hex(&input.outpoint.txid) of type String.
This is the Rust implementation of the Union-Find algorithm that I am using:
/// Tarjan's Union-Find data structure.
#[derive(RustcDecodable, RustcEncodable)]
pub struct DisjointSet<T: Clone + Hash + Eq> {
set_size: usize,
parent: Vec<usize>,
rank: Vec<usize>,
map: HashMap<T, usize>, // Each T entry is mapped onto a usize tag.
}
impl<T> DisjointSet<T>
where
T: Clone + Hash + Eq,
{
pub fn new() -> Self {
const CAPACITY: usize = 1000000;
DisjointSet {
set_size: 0,
parent: Vec::with_capacity(CAPACITY),
rank: Vec::with_capacity(CAPACITY),
map: HashMap::with_capacity(CAPACITY),
}
}
pub fn make_set(&mut self, x: T) {
if self.map.contains_key(&x) {
return;
}
let len = &mut self.set_size;
self.map.insert(x, *len);
self.parent.push(*len);
self.rank.push(0);
*len += 1;
}
/// Returns Some(num), num is the tag of subset in which x is.
/// If x is not in the data structure, it returns None.
pub fn find(&mut self, x: T) -> Option<usize> {
let pos: usize;
match self.map.get(&x) {
Some(p) => {
pos = *p;
}
None => return None,
}
let ret = DisjointSet::<T>::find_internal(&mut self.parent, pos);
Some(ret)
}
/// Implements path compression.
fn find_internal(p: &mut Vec<usize>, n: usize) -> usize {
if p[n] != n {
let parent = p[n];
p[n] = DisjointSet::<T>::find_internal(p, parent);
p[n]
} else {
n
}
}
/// Union the subsets to which x and y belong.
/// If it returns Ok(usize), it is the tag for unified subset.
/// If it returns Err(), at least one of x and y is not in the disjoint-set.
pub fn union(&mut self, x: T, y: T) -> Result<usize, ()> {
let x_root;
let y_root;
let x_rank;
let y_rank;
match self.find(x) {
Some(x_r) => {
x_root = x_r;
x_rank = self.rank[x_root];
}
None => {
return Err(());
}
}
match self.find(y) {
Some(y_r) => {
y_root = y_r;
y_rank = self.rank[y_root];
}
None => {
return Err(());
}
}
// Implements union-by-rank optimization.
if x_root == y_root {
return Ok(x_root);
}
if x_rank > y_rank {
self.parent[y_root] = x_root;
return Ok(x_root);
} else {
self.parent[x_root] = y_root;
if x_rank == y_rank {
self.rank[y_root] += 1;
}
return Ok(y_root);
}
}
/// Forces all laziness, updating every tag.
pub fn finalize(&mut self) {
for i in 0..self.set_size {
DisjointSet::<T>::find_internal(&mut self.parent, i);
}
}
}
I think you're just not extracting the information out of your DisjointSet struct correctly.
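To make that concrete: in the question's DisjointSet, union only rewrites entries of parent, while map keeps the tag each key received in make_set. Grouping by the raw value from map (as the second loop in the question does) therefore ignores every merge. Here is a minimal sketch of grouping by root instead; the free functions `root` and `groups` are hypothetical and operate on plain copies of the map and parent fields.

```rust
use std::collections::HashMap;

/// Follows parent links until reaching a root (no path compression, for brevity).
fn root(parent: &[usize], mut n: usize) -> usize {
    while parent[n] != n {
        n = parent[n];
    }
    n
}

/// Groups keys by the root of their tag rather than by the raw tag stored in `map`.
/// Groups and their members are sorted so the result is deterministic.
fn groups(map: &HashMap<String, usize>, parent: &[usize]) -> Vec<Vec<String>> {
    let mut by_root: HashMap<usize, Vec<String>> = HashMap::new();
    for (key, &tag) in map {
        by_root.entry(root(parent, tag)).or_default().push(key.clone());
    }
    let mut out: Vec<Vec<String>> = by_root.into_values().collect();
    for group in &mut out {
        group.sort();
    }
    out.sort();
    out
}
```

In the question's own code, the equivalent change would be to key the HashMap on `self.clusters.parent[v]` after `finalize()` rather than on `v`.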
I got sniped by this and implemented union find. First, with a basic usize implementation:
pub struct UnionFinderImpl {
parent: Vec<usize>,
}
Then with a wrapper for more generic types:
pub struct UnionFinder<T: Hash> {
rev: Vec<Rc<T>>,
fwd: HashMap<Rc<T>, usize>,
uf: UnionFinderImpl,
}
Both structs implement a groups() method that returns a Vec<Vec<>> of groups. Clone isn't required because I used Rc.
Playground
Out of habit from interpreted programming languages, I want to rewrite many values based on their key. I assumed that I would store all the information in the struct prepared for this project. So I started iterating:
struct Container {
x: String,
y: String,
z: String
}
impl Container {
// (...)
fn load_data(&self, data: &HashMap<String, String>) {
let valid_keys = vec_of_strings![ // It's simple vector with Strings
"x", "y", "z"
] ;
for key_name in &valid_keys {
if data.contains_key(key_name) {
self[key_name] = Some(data.get(key_name);
// It's invalid of course but
// I do not know how to write it correctly.
// For example, in PHP I would write it like this:
// $this[$key_name] = $data[$key_name];
}
}
}
// (...)
}
Maybe macros? I tried to use them, but key_name is always interpreted literally; I cannot get the value of key_name instead.
How can I do this without repeating the code for each value?
With macros, I always advocate starting from the direct code, then seeing what duplication there is. In this case, we'd start with
fn load_data(&mut self, data: &HashMap<String, String>) {
if let Some(v) = data.get("x") {
self.x = v.clone();
}
if let Some(v) = data.get("y") {
self.y = v.clone();
}
if let Some(v) = data.get("z") {
self.z = v.clone();
}
}
Note the number of differences:
The method must take &mut self.
It's inefficient to check if a value is there and then get it separately.
We need to clone the value because we only have a reference.
We cannot store an Option in a String.
Once you have your code working, you can see how to abstract things. Always start by trying to use "lighter" abstractions (functions, traits, etc.). Only after exhausting those would I start bringing in macros. Let's start by using stringify:
if let Some(v) = data.get(stringify!(x)) {
self.x = v.clone();
}
Then you can extract out a macro:
macro_rules! thing {
($this: ident, $data: ident, $($name: ident),+) => {
$(
if let Some(v) = $data.get(stringify!($name)) {
$this.$name = v.clone();
}
)+
};
}
impl Container {
fn load_data(&mut self, data: &HashMap<String, String>) {
thing!(self, data, x, y, z);
}
}
fn main() {
// Assumes Container also derives Debug and Default.
let mut c = Container::default();
let d: HashMap<_, _> = vec![("x".into(), "alpha".into())].into_iter().collect();
c.load_data(&d);
println!("{:?}", c);
}
Full disclosure: I don't think this is a good idea.
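As an illustration of the "lighter" abstraction mentioned above, here is a sketch that uses a plain method instead of a macro. The helper `field_mut` is hypothetical, and the derives on Container are an assumption added so the example stands on its own.

```rust
use std::collections::HashMap;

#[derive(Debug, Default)]
struct Container {
    x: String,
    y: String,
    z: String,
}

impl Container {
    /// Hypothetical helper: maps a key name to the matching field, if any.
    fn field_mut(&mut self, key: &str) -> Option<&mut String> {
        match key {
            "x" => Some(&mut self.x),
            "y" => Some(&mut self.y),
            "z" => Some(&mut self.z),
            _ => None,
        }
    }

    /// Copies every entry whose key names a field; unknown keys are ignored.
    fn load_data(&mut self, data: &HashMap<String, String>) {
        for (key, value) in data {
            if let Some(field) = self.field_mut(key) {
                *field = value.clone();
            }
        }
    }
}
```

The field names still appear once each in the match, but the key-to-field mapping lives in one place and the compiler checks that every arm refers to a real field.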