Struct property accessible from method but not from outside - rust

I'm trying to build a basic web crawler in Rust, which I'm trying to port to html5ever. As of right now, I have a function with a struct inside that is supposed to return a Vec<String>. It gets this Vec from the struct in the return statement. Why does it always return an empty vector? (Does it have anything to do with the lifetime parameters?)
fn find_urls_in_html<'a>(
    original_url: &Url,
    raw_html: String,
    fetched_cache: &Vec<String>,
) -> Vec<String> {
    #[derive(Clone)]
    struct Sink<'a> {
        original_url: &'a Url,
        returned_vec: Vec<String>,
        fetched_cache: &'a Vec<String>,
    }
    impl<'a> TokenSink for Sink<'a> {
        type Handle = ();
        fn process_token(&mut self, token: Token, _line_number: u64) -> TokenSinkResult<()> {
            trace!("token {:?}", token);
            match token {
                TagToken(tag) => {
                    if tag.kind == StartTag && tag.attrs.len() != 0 {
                        let _attribute_name = get_attribute_for_elem(&tag.name);
                        if _attribute_name == None {
                            return TokenSinkResult::Continue;
                        }
                        let attribute_name = _attribute_name.unwrap();
                        for attribute in &tag.attrs {
                            if &attribute.name.local != attribute_name {
                                continue;
                            }
                            trace!("element {:?} found", tag);
                            add_urls_to_vec(
                                repair_suggested_url(
                                    self.original_url,
                                    (&attribute.name.local, &attribute.value),
                                ),
                                &mut self.returned_vec,
                                &self.fetched_cache,
                            );
                        }
                    }
                }
                ParseError(error) => {
                    warn!("error parsing html for {}: {:?}", self.original_url, error);
                }
                _ => {}
            }
            return TokenSinkResult::Continue;
        }
    }
    let html = Sink {
        original_url: original_url,
        returned_vec: Vec::new(),
        fetched_cache: fetched_cache,
    };
    let mut byte_tendril = ByteTendril::new();
    {
        let tendril_push_result = byte_tendril.try_push_bytes(&raw_html.into_bytes());
        if tendril_push_result.is_err() {
            warn!("error pushing bytes to tendril: {:?}", tendril_push_result);
            return Vec::new();
        }
    }
    let mut queue = BufferQueue::new();
    queue.push_back(byte_tendril.try_reinterpret().unwrap());
    let mut tok = Tokenizer::new(html.clone(), std::default::Default::default()); // default default! default?
    let feed = tok.feed(&mut queue);
    return html.returned_vec;
}
The output ends with no warning (and a panic, caused by another function due to this being empty). Can anyone help me figure out what's going on?
Thanks in advance.

When I initialize the Tokenizer, I use:
let mut tok = Tokenizer::new(html.clone(), std::default::Default::default());
The problem is that I'm telling the Tokenizer to use html.clone() instead of html. As a result, it writes to the returned_vec of the cloned object, not to html's, so html.returned_vec is still empty when the function returns. Handing the Tokenizer html itself rather than a copy (for example through a variable holding a mutable reference) fixes the problem.
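To see the effect in isolation, here is a minimal std-only sketch. Driver is a hypothetical stand-in for the Tokenizer (it is not html5ever's API); the only property it is assumed to share is that it takes ownership of the sink it is given. The function names are made up for illustration.

```rust
#[derive(Clone, Default)]
struct Sink {
    returned_vec: Vec<String>,
}

// Hypothetical stand-in for a tokenizer: it owns the sink it is given.
struct Driver {
    sink: Sink,
}

impl Driver {
    fn new(sink: Sink) -> Self {
        Driver { sink }
    }

    // Pushes every whitespace-separated word into the sink this driver owns.
    fn feed(&mut self, input: &str) {
        for word in input.split_whitespace() {
            self.sink.returned_vec.push(word.to_string());
        }
    }
}

// Mirrors the bug: the driver fills a clone, the original stays empty.
fn collect_with_clone(input: &str) -> Vec<String> {
    let sink = Sink::default();
    let mut driver = Driver::new(sink.clone());
    driver.feed(input);
    sink.returned_vec
}

// Moves the sink in and reads the result back from the driver's own sink.
fn collect_by_move(input: &str) -> Vec<String> {
    let mut driver = Driver::new(Sink::default());
    driver.feed(input);
    driver.sink.returned_vec
}
```

collect_with_clone returns an empty vector whatever the input is, while collect_by_move returns the collected words.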

Related

Lifetime issue assigning reference from conditional

I'm quite new to Rust and I'm having an issue with lifetimes. I believe I understand what is happening and why, but I can't work out how to solve it.
For simplicity I created this short "clone" of what I'm actually trying to do; the real code uses async-stripe. I have annotated the example code with links to the real code in case it is relevant.
There is the following structure:
// https://github.com/arlyon/async-stripe/blob/9f1a84144a23cc7b2124a1252ee15dc646ce0215/src/resources/generated/subscription.rs#L385
struct ObjectA<'a> {
    field: i32,
    object_b_id: Option<&'a str>,
}
// https://github.com/arlyon/async-stripe/blob/9f1a84144a23cc7b2124a1252ee15dc646ce0215/src/resources/generated/subscription.rs#L570
impl<'a> ObjectA<'a> {
    fn new(field: i32) -> Self {
        return Self {
            field,
            object_b_id: Default::default(),
        };
    }
}
// https://github.com/arlyon/async-stripe/blob/9f1a84144a23cc7b2124a1252ee15dc646ce0215/src/resources/generated/subscription.rs#L210
fn persist_obj_a(obj_a: ObjectA<'_>) {}
// ---
// https://github.com/arlyon/async-stripe/blob/9f1a84144a23cc7b2124a1252ee15dc646ce0215/src/resources/generated/payment_method.rs#L18
struct ObjectB {
    id: ObjectBId,
}
// https://github.com/arlyon/async-stripe/blob/9f1a84144a23cc7b2124a1252ee15dc646ce0215/src/ids.rs#L518
struct ObjectBId {
    value: String,
}
impl ObjectBId {
    fn as_str(&self) -> &str {
        return self.value.as_str();
    }
}
// This is a wrapper around https://github.com/arlyon/async-stripe/blob/9f1a84144a23cc7b2124a1252ee15dc646ce0215/src/resources/generated/payment_method.rs#L128 that just returns the first one found (if any, hence the Option)
fn load_object_b() -> Option<ObjectB> {
    return Some(ObjectB {
        id: ObjectBId {
            value: String::from("some_id"),
        },
    });
}
What I'm trying to do is load the ObjectB with load_object_b and use its ID in an ObjectA.
OK, so on to my attempts.
First attempt
fn first_try(condition: bool) {
    let mut obj_a = ObjectA::new(1);
    if condition {
        match load_object_b() {
            Some(obj_b) => obj_a.object_b_id = Some(obj_b.id.as_str()),
            None => (),
        }
    }
    persist_obj_a(obj_a);
}
In here I get
obj_b.id does not live long enough
Which I can understand, since as far as I can tell obj_b only exists during the match arm and is dropped at the end of it.
Second attempt
fn second_try(condition: bool) {
    let mut obj_a = ObjectA::new(1);
    if condition {
        let obj_b = load_object_b();
        match obj_b {
            Some(ref obj_b) => obj_a.object_b_id = Some(obj_b.id.as_str()),
            None => (),
        }
    }
    persist_obj_a(obj_a);
}
Here I get
obj_b.0 does not live long enough
Which I guess is still the same idea, just in a different place, since from my understanding obj_b now only lives within the scope of the if block.
Third and last attempt
I ended up "solving" it with:
fn third_try(condition: bool) {
    let mut obj_a = ObjectA::new(1);
    let obj_b = load_object_b();
    let obj_b_id = match obj_b {
        Some(ref obj_b) => Some(obj_b.id.as_str()),
        None => None,
    };
    if condition {
        obj_a.object_b_id = obj_b_id;
    }
    persist_obj_a(obj_a);
}
Here I moved obj_b so that it lives as long as obj_a, which solves the issue I was having.
My problem with this solution is that I feel I'm wasting resources by making the (possibly expensive) request in load_object_b even when the condition means I'm not going to use the result.
I'm not sure whether I'm missing something very obvious or just heading in the wrong direction overall, but I would appreciate some light on what I might be doing wrong.
This should work, I think:
fn third_try(condition: bool) {
    let mut obj_a = ObjectA::new(1);
    let obj_b = if condition { load_object_b() } else { None };
    obj_a.object_b_id = obj_b.as_ref().map(|o| o.id.as_str());
    persist_obj_a(obj_a);
}
Rust allows you to have conditionally initialized variables. You can declare obj_b outside of the if, but only initialize it inside the if. The compiler will ensure you can use it only where it is initialized.
fn second_try(condition: bool) {
    let mut obj_a = ObjectA::new(1);
    let obj_b;
    if condition {
        obj_b = load_object_b();
        match obj_b {
            Some(ref obj_b) => obj_a.object_b_id = Some(obj_b.id.as_str()),
            None => (),
        }
    }
    persist_obj_a(obj_a);
}
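The deferred-initialization pattern can also be tried without any Stripe types. The following is a minimal sketch; Record, load_name and describe are hypothetical names chosen for illustration, and load_name stands in for the possibly expensive request.

```rust
// Holds a borrowed label, like ObjectA holds a borrowed id.
struct Record<'a> {
    label: Option<&'a str>,
}

// Stand-in for the expensive lookup.
fn load_name() -> Option<String> {
    Some(String::from("some_id"))
}

fn describe(condition: bool) -> String {
    let mut record = Record { label: None };
    // Declared in the outer scope so the storage outlives the `if`,
    // but only initialized (and only loaded) when the condition holds.
    let name;
    if condition {
        name = load_name();
        record.label = name.as_deref();
    }
    match record.label {
        Some(label) => format!("label={}", label),
        None => String::from("no label"),
    }
}
```

Because name is declared outside the if, the borrow stored in record stays valid until the end of the function, and load_name is never called when condition is false.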

How to use actix field stream by two consumers?

I have an actix web service and would like to parse the contents of a multipart field with async-gcode while it streams in, and also store the contents, e.g. in a database.
However, I have no clue how to feed the stream into the Parser and at the same time collect the bytes into a Vec<u8> or a String.
The first problem I face is that field is a stream of actix_web::web::Bytes and not of u8.
#[post("/upload")]
pub async fn upload_job(
    mut payload: Multipart,
) -> Result<HttpResponse, Error> {
    let mut contents: Vec<u8> = Vec::new();
    while let Ok(Some(mut field)) = payload.try_next().await {
        let content_disp = field.content_disposition().unwrap();
        match content_disp.get_name().unwrap() {
            "file" => {
                while let Some(chunk) = field.next().await {
                    contents.append(&mut chunk.unwrap().to_vec());
                    // already parse the contents
                    // and additionally store contents somewhere
                }
            }
            _ => (),
        }
    }
    Ok(HttpResponse::Ok().finish())
}
Any hint or suggestion is very much appreciated.
One option is to wrap field in a struct and implement the Stream trait for it.
use actix_multipart::{Field, Multipart};
use actix_web::{post, Error, HttpResponse};
use async_gcode::{Error as PError, Parser};
use bytes::BytesMut;
use futures::stream::StreamExt;
use futures_util::stream::Stream;
use futures_util::TryStreamExt;
use std::cell::RefCell;
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Context, Poll};
pub struct Wrapper {
    field: Field,
    // Shared with the caller (Rc) so the collected bytes are still reachable
    // after the parser has taken ownership of the wrapper.
    buffer: Rc<RefCell<BytesMut>>,
    index: usize,
}
impl Wrapper {
    pub fn new(field: Field, buffer: Rc<RefCell<BytesMut>>) -> Self {
        buffer.borrow_mut().truncate(0);
        Wrapper {
            field,
            buffer,
            index: 0
        }
    }
}
impl Stream for Wrapper {
    type Item = Result<u8, PError>;
    fn poll_next(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
    ) -> Poll<Option<Result<u8, PError>>> {
        loop {
            // Hand out the buffered bytes one at a time.
            let next_byte = self.buffer.borrow().get(self.index).copied();
            if let Some(b) = next_byte {
                self.index += 1;
                return Poll::Ready(Some(Ok(b)));
            }
            // Buffer exhausted: pull the next chunk, then loop to yield its first byte.
            let polled = Pin::new(&mut self.field).poll_next(cx);
            match polled {
                Poll::Ready(Some(Ok(chunk))) => self.buffer.borrow_mut().extend_from_slice(&chunk),
                Poll::Pending => return Poll::Pending,
                Poll::Ready(None) => return Poll::Ready(None),
                Poll::Ready(Some(Err(_))) => return Poll::Ready(Some(Err(PError::BadNumberFormat/* ??? */)))
            }
        }
    }
}
#[post("/upload")]
pub async fn upload_job(
    mut payload: Multipart,
) -> Result<HttpResponse, Error> {
    while let Ok(Some(field)) = payload.try_next().await {
        let content_disp = field.content_disposition().unwrap();
        match content_disp.get_name().unwrap() {
            "file" => {
                let contents = Rc::new(RefCell::new(BytesMut::new()));
                // Rc::clone shares the same buffer; cloning the RefCell itself would copy it.
                let w = Wrapper::new(field, Rc::clone(&contents));
                let mut p = Parser::new(w);
                while let Some(_res) = p.next().await {
                    // Do something with results
                };
                // Do something with the buffer
                let _a = contents.borrow().first().copied();
            }
            _ => (),
        }
    }
    Ok(HttpResponse::Ok().finish())
}
Copying the Bytes from the Field will no longer be necessary once Bytes::try_unsplit is implemented (https://github.com/tokio-rs/bytes/issues/287).
The answer from dmitryvm (thanks for your effort) showed me that there are actually two problems: first, flattening the Bytes into u8s and, second, "splitting" the stream so that it feeds both a buffer for later storage and the async-gcode parser.
This shows how I solved it:
#[post("/upload")]
pub async fn upload_job(
    mut payload: Multipart,
) -> Result<HttpResponse, Error> {
    let mut contents: Vec<u8> = Vec::new();
    while let Ok(Some(mut field)) = payload.try_next().await {
        let content_disp = field.content_disposition().unwrap();
        match content_disp.get_name().unwrap() {
            "file" => {
                let field_stream = field
                    .map_err(|_| async_gcode::Error::BadNumberFormat) // Translate error
                    .map_ok(|y| { // Translate Bytes into stream with Vec<u8>
                        contents.extend_from_slice(&y); // Copy and store for later usage
                        stream::iter(y).map(Result::<_, async_gcode::Error>::Ok)
                    })
                    .try_flatten(); // Flatten the streams of u8's
                let mut parser = Parser::new(field_stream);
                while let Some(gcode) = parser.next().await {
                    // Process result from parser
                }
            }
            _ => (),
        }
    }
    Ok(HttpResponse::Ok().finish())
}
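The same "copy each chunk aside, then flatten it into bytes" idea can be tried without actix or async-gcode. Below is a synchronous sketch using only std iterators; it is an analogue of the stream pipeline above, not the actix or futures API, and tee_and_flatten and count_lines_and_keep are hypothetical names. Counting newlines stands in for the byte-level parser.

```rust
// Yields every byte of every chunk, and copies each chunk into `copy`
// at the moment the consumer reaches it (the iterator is lazy).
fn tee_and_flatten<'a>(
    chunks: Vec<Vec<u8>>,
    copy: &'a mut Vec<u8>,
) -> impl Iterator<Item = u8> + 'a {
    chunks.into_iter().flat_map(move |chunk| {
        copy.extend_from_slice(&chunk); // keep the bytes for later storage
        chunk // a Vec<u8> flattens into its individual bytes
    })
}

// Stand-in consumer: counts newline bytes, then returns the stored copy too.
fn count_lines_and_keep(chunks: Vec<Vec<u8>>) -> (usize, Vec<u8>) {
    let mut copy = Vec::new();
    let lines = tee_and_flatten(chunks, &mut copy)
        .filter(|&byte| byte == b'\n')
        .count();
    (lines, copy)
}
```

The mutable borrow of copy lasts only as long as the iterator, so the buffer can be stored as soon as the consumer has finished.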

Borrowing the mutable member used inside the loop

The problem I want to solve is:
Given a recursively nested data structure, e.g. a JSON tree, and a path pointing to a (possibly non-existent) element inside it, return a mutable reference to the element that is closest to the given path.
Example: if we have a JSON document of the form { a: { b: { c: "foo" } } } and a path a.b.d, we want a mutable pointer to the value stored under key "b".
This is what I've got so far:
use std::collections::HashMap;
enum Json {
    Number(i64),
    Bool(bool),
    String(String),
    Array(Vec<Json>),
    Object(HashMap<String, Json>)
}
struct Pointer<'a, 'b> {
    value: &'a mut Json,
    path: Vec<&'b str>,
    position: usize
}
/// Return a mutable pointer to JSON element having shared
/// the nearest common path with provided JSON.
fn nearest_mut<'a,'b>(obj: &'a mut Json, path: Vec<&'b str>) -> Pointer<'a,'b> {
    let mut i = 0;
    let mut current = obj;
    for &key in path.iter() {
        match current {
            Json::Array(array) => {
                match key.parse::<usize>() {
                    Ok(index) => {
                        match array.get_mut(index) {
                            Some(inner) => current = inner,
                            None => break,
                        }
                    },
                    _ => break,
                }
            },
            Json::Object(map) => {
                match map.get_mut(key) {
                    Some(inner) => current = inner,
                    None => break
                }
            },
            _ => break,
        };
        i += 1;
    }
    Pointer { path, position: i, value: current }
}
The problem is that this doesn't pass Rust's borrow checker, as current is borrowed as a mutable reference twice: once inside the match statement and once at the end of the function, when constructing the Pointer.
I've tried different approaches, but haven't figured out how to achieve the goal (maybe by going down the unsafe path).
I completely misread your question and I owe you an apology.
You cannot do it in one pass - you're going to need to do a read-only pass to find the nearest path (or exact path), and then a read-write pass to actually extract the reference, or pass a mutator function in the form of a closure.
I've implemented the two-pass method for you. Do note that it is still pretty performant:
fn nearest_mut<'a, 'b>(obj: &'a mut Json, path: Vec<&'b str>) -> Pointer<'a, 'b> {
    let valid_path = nearest_path(obj, path);
    exact_mut(obj, valid_path).unwrap()
}
fn exact_mut<'a, 'b>(obj: &'a mut Json, path: Vec<&'b str>) -> Option<Pointer<'a, 'b>> {
    let mut i = 0;
    let mut target = obj;
    for token in path.iter() {
        i += 1;
        // borrow checker gets confused about `target` being mutably borrowed too many times because of the loop
        // this once-per-loop binding makes the scope clearer and circumvents the error
        let target_once = target;
        let target_opt = match *target_once {
            Json::Object(ref mut map) => map.get_mut(*token),
            Json::Array(ref mut list) => match token.parse::<usize>() {
                Ok(t) => list.get_mut(t),
                Err(_) => None,
            },
            _ => None,
        };
        if let Some(t) = target_opt {
            target = t;
        } else {
            return None;
        }
    }
    Some(Pointer {
        path,
        position: i,
        value: target,
    })
}
/// Read-only pass: return the longest prefix of `path` that
/// resolves to an existing element of `obj`.
fn nearest_path<'a, 'b>(obj: &'a Json, path: Vec<&'b str>) -> Vec<&'b str> {
    let mut target = obj;
    let mut valid_paths = vec![];
    for token in path.iter() {
        // only shared borrows are taken here, so no per-iteration rebinding is needed
        let target_opt = match *target {
            Json::Object(ref map) => map.get(*token),
            Json::Array(ref list) => match token.parse::<usize>() {
                Ok(t) => list.get(t),
                Err(_) => None,
            },
            _ => None,
        };
        if let Some(t) = target_opt {
            target = t;
            valid_paths.push(*token);
        } else {
            return valid_paths;
        }
    }
    valid_paths
}
The principle is simple: I reused the method I wrote in my initial answer to get the nearest valid path (or the exact path).
From there, I feed that straight into the function from my original answer, and since I am certain the path is valid (from the prior function call), I can safely unwrap() :-)
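For reference, the two-pass idea can be condensed into a self-contained sketch that compiles on its own. It is a simplified variant, not the answer's exact code: it returns a plain reference plus the matched depth instead of a Pointer, it keeps only the Json variants needed here, and valid_prefix_len, descend_mut and nearest are hypothetical names.

```rust
use std::collections::HashMap;

enum Json {
    String(String),
    Array(Vec<Json>),
    Object(HashMap<String, Json>),
}

/// Read-only pass: number of leading path segments that resolve to a value.
fn valid_prefix_len(root: &Json, path: &[&str]) -> usize {
    let mut current = root;
    let mut depth = 0;
    for key in path {
        let next = match current {
            Json::Object(map) => map.get(*key),
            Json::Array(items) => key.parse::<usize>().ok().and_then(|i| items.get(i)),
            _ => None,
        };
        match next {
            Some(inner) => {
                current = inner;
                depth += 1;
            }
            None => break,
        }
    }
    depth
}

/// Mutable pass: follow a prefix that the read-only pass already validated.
fn descend_mut<'a>(root: &'a mut Json, prefix: &[&str]) -> &'a mut Json {
    let mut current = root;
    for key in prefix {
        let node = current; // move the reference: one mutable borrow per iteration
        current = match node {
            Json::Object(map) => map.get_mut(*key).expect("prefix was validated"),
            Json::Array(items) => {
                let index: usize = key.parse().expect("prefix was validated");
                items.get_mut(index).expect("prefix was validated")
            }
            _ => unreachable!("prefix was validated"),
        };
    }
    current
}

/// Nearest existing element for `path`, plus how many segments matched.
fn nearest<'a>(root: &'a mut Json, path: &[&str]) -> (&'a mut Json, usize) {
    let depth = valid_prefix_len(root, path);
    (descend_mut(root, &path[..depth]), depth)
}
```

For the document { a: { b: { c: "foo" } } } and the path a.b.d, nearest returns the object stored under "b" together with a depth of 2.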

How can I set a struct field value by string name?

Out of habit from interpreted programming languages, I want to rewrite many values based on their key. I assumed that I would store all the information in the struct prepared for this project. So I started iterating:
struct Container {
    x: String,
    y: String,
    z: String
}
impl Container {
    // (...)
    fn load_data(&self, data: &HashMap<String, String>) {
        let valid_keys = vec_of_strings![ // It's simple vector with Strings
            "x", "y", "z"
        ];
        for key_name in &valid_keys {
            if data.contains_key(key_name) {
                self[key_name] = Some(data.get(key_name);
                // It's invalid of course but
                // I do not know how to write it correctly.
                // For example, in PHP I would write it like this:
                // $this[$key_name] = $data[$key_name];
            }
        }
    }
    // (...)
}
Maybe macros? I tried to use them, but key_name is always interpreted literally; I cannot get the value of key_name instead.
How can I do this without repeating the code for each value?
With macros, I always advocate starting from the direct code, then seeing what duplication there is. In this case, we'd start with
fn load_data(&mut self, data: &HashMap<String, String>) {
    if let Some(v) = data.get("x") {
        self.x = v.clone();
    }
    if let Some(v) = data.get("y") {
        self.y = v.clone();
    }
    if let Some(v) = data.get("z") {
        self.z = v.clone();
    }
}
Note the number of differences:
The method must take &mut self.
It's inefficient to check if a value is there and then get it separately.
We need to clone the value because we only have a reference.
We cannot store an Option in a String.
Once you have your code working, you can see how to abstract things. Always start by trying to use "lighter" abstractions (functions, traits, etc.). Only after exhausting those would I start bringing in macros. Let's start by using stringify!:
if let Some(v) = data.get(stringify!(x)) {
    self.x = v.clone();
}
Then you can extract out a macro:
use std::collections::HashMap;
// Debug and Default are required by main below.
#[derive(Debug, Default)]
struct Container {
    x: String,
    y: String,
    z: String,
}
macro_rules! thing {
    ($this: ident, $data: ident, $($name: ident),+) => {
        $(
            if let Some(v) = $data.get(stringify!($name)) {
                $this.$name = v.clone();
            }
        )+
    };
}
impl Container {
    fn load_data(&mut self, data: &HashMap<String, String>) {
        thing!(self, data, x, y, z);
    }
}
fn main() {
    let mut c = Container::default();
    let d: HashMap<_, _> = vec![("x".into(), "alpha".into())].into_iter().collect();
    c.load_data(&d);
    println!("{:?}", c);
}
Full disclosure: I don't think this is a good idea.
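In the spirit of trying "lighter" abstractions first, a plain method that maps a key to a field reference avoids the macro entirely. This is a sketch of one possible alternative, not part of the answer above; field_mut is a hypothetical helper name.

```rust
use std::collections::HashMap;

#[derive(Debug, Default)]
struct Container {
    x: String,
    y: String,
    z: String,
}

impl Container {
    // Maps a key to the matching field; unknown keys yield None.
    fn field_mut(&mut self, name: &str) -> Option<&mut String> {
        match name {
            "x" => Some(&mut self.x),
            "y" => Some(&mut self.y),
            "z" => Some(&mut self.z),
            _ => None,
        }
    }

    // Copies every recognised key from `data` into the matching field.
    fn load_data(&mut self, data: &HashMap<String, String>) {
        for (key, value) in data {
            if let Some(field) = self.field_mut(key) {
                *field = value.clone();
            }
        }
    }
}
```

The field names still have to be listed once, in the match, but the lookup stays ordinary code that the compiler checks field by field.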

What's an idiomatic way to delete a value from HashMap if it is empty?

The following code works, but it doesn't look nice as the definition of is_empty is too far away from the usage.
fn remove(&mut self, index: I, primary_key: &Rc<K>) {
    let is_empty;
    {
        let ks = self.data.get_mut(&index).unwrap();
        ks.remove(primary_key);
        is_empty = ks.is_empty();
    }
    // I have to wrap `ks` in an inner scope so that we can borrow `data` mutably.
    if is_empty {
        self.data.remove(&index);
    }
}
Is there a way to drop the variables used in the condition before entering the if branch, e.g.
if {ks.is_empty()} {
    self.data.remove(&index);
}
Whenever you have a double look-up of a key, you need to think of the Entry API.
With the entry API, you get a handle to a key-value pair and can:
read the key,
read/modify the value,
remove the entry entirely (getting the key and value back).
It's extremely powerful.
In this case:
use std::collections::HashMap;
use std::collections::hash_map::Entry;
fn remove(hm: &mut HashMap<i32, String>, index: i32) {
    if let Entry::Occupied(o) = hm.entry(index) {
        if o.get().is_empty() {
            o.remove_entry();
        }
    }
}
fn main() {
    let mut hm = HashMap::new();
    hm.insert(1, String::from(""));
    remove(&mut hm, 1);
    println!("{:?}", hm);
}
I did this in the end:
// Requires: use std::collections::hash_map::Entry::{Occupied, Vacant};
match self.data.entry(index) {
    Occupied(mut occupied) => {
        let is_empty = {
            let ks = occupied.get_mut();
            ks.remove(primary_key);
            ks.is_empty()
        };
        if is_empty {
            occupied.remove();
        }
    },
    Vacant(_) => unreachable!()
}
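The same approach can be written as a self-contained function. This is a minimal sketch that assumes data maps a u32 index to a HashSet<String> of primary keys (the question does not show the real types); remove_key is a hypothetical name, and a missing index returns false instead of panicking.

```rust
use std::collections::hash_map::Entry;
use std::collections::{HashMap, HashSet};

// Removes `primary_key` from the set stored under `index` and drops the
// whole entry once the set is empty. Returns whether the key was present.
fn remove_key(
    data: &mut HashMap<u32, HashSet<String>>,
    index: u32,
    primary_key: &str,
) -> bool {
    if let Entry::Occupied(mut occupied) = data.entry(index) {
        let removed = occupied.get_mut().remove(primary_key);
        if occupied.get().is_empty() {
            // Single look-up: the entry handle removes itself.
            occupied.remove();
        }
        removed
    } else {
        false
    }
}
```

The entry handle is obtained once and used for the modification, the emptiness check and the removal, so the key is hashed only a single time.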
