How to write HTML source to an HTML file with fantoccini? - rust

I'm making a scraper and want to download the HTML source of a website so that it can be parsed. I'm using fantoccini with WebDriver to log in to the site. I have an asynchronous funct
Now that I'm logged in, what do I need to do to extract the HTML source?
What I have so far is this:
let htmlsrc = client.source();
let mut file = File::create("htmlsrc.html").unwrap();
fs::write("htmlsrc.txt", htmlsrc).await?;
However this gives me this error:
error[E0277]: the trait bound `impl Future: AsRef<[u8]>` is not satisfied
--> src/main.rs:44:30
|
44 | fs::write("htmlsrc.txt", htmlsrc).await?;
| --------- ^^^^^^^ the trait `AsRef<[u8]>` is not implemented for `impl Future`
I'm new to Rust, so I'm not very sure of what I'm doing.
Any help would be appreciated!
The full code is this:
use fantoccini::{ClientBuilder, Locator, Client};
use std::time::Duration;
use tokio::time::sleep;
use std::fs::File;
use tokio::fs;
// let's set up the sequence of steps we want the browser to take
#[tokio::main]
async fn main() -> Result<(), fantoccini::error::CmdError> {
let s_email = "email";
let password = "pass";
// Connecting using "native" TLS (with feature `native-tls`; on by default)
let mut c = ClientBuilder::native()
.connect("http://localhost:4444").await
.expect("failed to connect to WebDriver");
// first, go to the Managebac login page
c.goto("https://bavarianis.managebac.com/login").await?;
// define email field with css selector
let mut email_field = c.wait().for_element(Locator::Css("#session_login")).await?;
// type in email
email_field.send_keys(s_email).await?;
// define password field with css selector
let mut pass_field = c.wait().for_element(Locator::Css("#session_password")).await?;
// type in password
pass_field.send_keys(password).await?;
// define sign in button with xpath
let signin_button = "/html/body/main/div/div/form/div[2]/input";
let signin_button = c.wait().for_element(Locator::XPath(signin_button)).await?;
// click sign in
signin_button.click().await?;
let htmlsrc = c.source();
let mut file = File::create("htmlsrc.html").unwrap();
fs::write("htmlsrc.txt", htmlsrc).await?;
//temp to observe
sleep(Duration::from_millis(6000)).await;
//c.close().await?;
Ok(())
}
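The E0277 error above says that fs::write was handed a future rather than bytes: source() is asynchronous, so its result has to be awaited before it can be written. In the question's code that would presumably be `let htmlsrc = c.source().await?;`. Below is a minimal, standard-library-only sketch of the same idea. `fetch_source`, `save_source`, and the toy `block_on` executor are hypothetical stand-ins (for `c.source()` and for the tokio runtime); they are not fantoccini or tokio APIs.

```rust
use std::future::Future;
use std::io;
use std::path::Path;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical stand-in for `c.source()`: an async call that yields the page HTML.
async fn fetch_source() -> io::Result<String> {
    Ok(String::from("<html><body>logged in</body></html>"))
}

// Await the future first. The resulting String implements AsRef<[u8]>,
// so it can be handed to a write function; the un-awaited future cannot.
async fn save_source(path: &Path) -> io::Result<()> {
    let html = fetch_source().await?;
    std::fs::write(path, html)
}

// Waker that does nothing; sufficient for futures that never actually wait.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Toy executor for this sketch only (a real program would use tokio).
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

Calling `block_on(save_source(&path))` writes the HTML to `path`. Passing `fetch_source()` directly to `std::fs::write` without `.await` would fail to compile with the same kind of trait-bound error as in the question.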

Related

future cannot be sent between threads safely after Mutex

I've been trying to move from postgres to tokio_postgres but struggle with some async.
use scraper::Html;
use std::sync::Arc;
use tokio::sync::Mutex;
use tokio::task;
struct Url {}
impl Url {
fn scrapped_home(&self, symbol: String) -> Html {
let url = format!(
"https://finance.yahoo.com/quote/{}?p={}&.tsrc=fin-srch", symbol, symbol
);
let response = reqwest::blocking::get(url).unwrap().text().unwrap();
scraper::Html::parse_document(&response)
}
}
#[derive(Clone)]
struct StockData {
symbol: String,
}
#[tokio::main]
async fn main() {
let stock_data = StockData { symbol: "".to_string() };
let url = Url {};
let mut uri_test: Arc<Mutex<Html>> = Arc::new(Mutex::from(url.scrapped_home(stock_data.clone().symbol)));
let mut uri_test_closure = Arc::clone(&uri_test);
let uri = task::spawn_blocking(|| {
uri_test_closure.lock()
});
}
Without putting a mutex on
url.scrapped_home(stock_data.clone().symbol),
I would get the error that a runtime cannot be dropped in a context where blocking is not allowed, so I put it inside spawn_blocking. Then I got the error that Cell cannot be shared between threads safely. This, from what I could gather, is because Cell isn't Sync. I then wrapped it in a Mutex. This, however, still throws "Cell cannot be shared between threads safely".
Now, is that because it contains a reference to a Cell and therefore isn't memory-safe? If so, would I need to implement Sync for Html? And how?
Html is from the scraper crate.
UPDATE:
Sorry, here's the error.
error: future cannot be sent between threads safely
--> src/database/queries.rs:141:40
|
141 | let uri = task::spawn_blocking(|| {
| ________________________________________^
142 | | uri_test_closure.lock()
143 | | });
| |_________^ future is not `Send`
|
= help: within `tendril::tendril::NonAtomic`, the trait `Sync` is not implemented for `Cell<usize>`
note: required by a bound in `spawn_blocking`
--> /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.20.1/src/task/blocking.rs:195:12
|
195 | R: Send + 'static,
| ^^^^ required by this bound in `spawn_blocking`
UPDATE:
Adding Cargo.toml as requested:
[package]
name = "reprod"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
reqwest = { version = "0.11", features = ["json", "blocking"] }
tokio = { version = "1", features = ["full"] }
tokio-postgres = "0"
scraper = "0.12.0"
UPDATE: Added original sync code:
fn main() {
let stock_data = StockData { symbol: "".to_string() };
let url = Url {};
url.scrapped_home(stock_data.clone().symbol);
}
UPDATE: Thanks to Kevin I was able to get it to work. As he pointed out Html was neither Send nor Sync. This part of the Rust lang doc helped me to understand how message passing works.
pub fn scrapped_home(&self, symbol: String) -> Html {
let (tx, rx) = mpsc::channel();
let url = format!(
"https://finance.yahoo.com/quote/{}?p={}&.tsrc=fin-srch", symbol, symbol
);
thread::spawn(move || {
let val = reqwest::blocking::get(url).unwrap().text().unwrap();
tx.send(val).unwrap();
});
scraper::Html::parse_document(&rx.recv().unwrap())
}
Afterwards I had some sort of epiphany and got it to work with tokio as well, without message passing:
pub async fn scrapped_home(&self, symbol: String) -> Html {
let url = format!(
"https://finance.yahoo.com/quote/{}?p={}&.tsrc=fin-srch", symbol, symbol
);
let response = task::spawn_blocking(move || {
reqwest::blocking::get(url).unwrap().text().unwrap()
}).await.unwrap();
scraper::Html::parse_document(&response)
}
I hope that this might help someone.
This illustrates it a bit more clearly now: you're trying to return a tokio::sync::MutexGuard across a thread boundary. When you call this:
let mut uri_test: Arc<Mutex<Html>> = Arc::new(Mutex::from(url.scrapped_home(stock_data.clone().symbol)));
let mut uri_test_closure = Arc::clone(&uri_test);
let uri = task::spawn_blocking(|| {
uri_test_closure.lock()
});
The uri_test_closure.lock() call (tokio::sync::Mutex::lock()) doesn't end with a semicolon, which means the closure returns the result of the call: a future that, once awaited, yields a MutexGuard. But you can't return that across a thread boundary here, because it is not Send when the wrapped type is Html.
I suggest you read up on the linked lock() call, as well as blocking_lock() and such there.
I'm not certain of the point of your call to task::spawn_blocking here. If you're trying to illustrate a use case for something, that's not coming across.
Edit:
The problem is deeper. Html is both !Send and !Sync, which means you can't even wrap it up in an Arc<Mutex<Html>> or Arc<Mutex<Option<Html>>> or whatever. You need to get the data from another thread in another way, and not as that "whole" object. See this post on the Rust users forum for more detailed information. But whatever you're wrapping must be Send, and that struct is explicitly not.
So if a type is Send and !Sync, you can wrap in a Mutex and an Arc. But if it's !Send, you're hooped, and need to use message passing, or other synchronization mechanisms.
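To make the message-passing route concrete, here is a minimal sketch using only the standard library. `ParsedDoc` is a hypothetical stand-in for scraper's Html (its Rc field makes it !Send and !Sync), and `fetch_body` is a hypothetical stand-in for the blocking HTTP request. The point is that only the plain String crosses the thread boundary; the !Send value is built on the receiving thread.

```rust
use std::rc::Rc;
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for scraper's Html: the Rc field makes it !Send and !Sync.
struct ParsedDoc {
    text: Rc<str>,
}

// Hypothetical stand-in for the blocking HTTP request.
// It returns a plain String, which is Send.
fn fetch_body(symbol: &str) -> String {
    format!("<html><title>{symbol}</title></html>")
}

fn load_doc(symbol: String) -> ParsedDoc {
    let (tx, rx) = mpsc::channel();

    // The worker thread only ever touches Send data (the String).
    let worker = thread::spawn(move || {
        tx.send(fetch_body(&symbol)).expect("receiver dropped");
    });

    let body = rx.recv().expect("worker sent nothing");
    worker.join().expect("worker panicked");

    // The !Send value is created here, on the thread that will use it.
    ParsedDoc { text: Rc::from(body) }
}
```

Trying to move a `ParsedDoc` into the `thread::spawn` closure, or to return one from it, would be rejected by the compiler for the same reason Html is rejected above.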

How do I insert a dynamic byte string into a vector?

I need to create a packet to send to the server. For this purpose I use a vector with the byteorder crate. When I try to append a string, the Rust compiler tells me I am calling an unsafe function and gives me an error.
use byteorder::{LittleEndian, WriteBytesExt};
fn main () {
let login = "test";
let packet_length = 30 + (login.len() as i16);
let mut packet = Vec::new();
packet.write_u8(0x00);
packet.write_i16::<LittleEndian>(packet_length);
packet.append(&mut Vec::from(String::from("game name ").as_bytes_mut()));
// ... rest code
}
The error is:
packet.append(&mut Vec::from(String::from("game name ").as_bytes_mut()));
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ call to unsafe function
Here is a playground to reproduce: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=381c6d14660d47beaece15d068b3dc6a
What is the correct way to insert a string as bytes into a vector?
The unsafe function called was as_bytes_mut(). This creates a mutable reference with exclusive access to the bytes representing the string, allowing you to modify them. You do not really need a mutable reference in this case; as_bytes() would have sufficed.
However, there is a more idiomatic way. Vec<u8> also functions as a writer (it implements std::io::Write), so you can use one of its methods, or even the write! macro, to write encoded text to it.
use std::io::Write;
use byteorder::{LittleEndian, WriteBytesExt};
fn main () -> Result<(), std::io::Error> {
let login = "test";
let packet_length = 30 + (login.len() as i16);
let mut packet = Vec::new();
packet.write_u8(0x00)?;
packet.write_i16::<LittleEndian>(packet_length)?;
let game_name = String::from("game name");
write!(packet, "{} ", game_name)?;
Ok(())
}
Playground
See also:
Use write! macro with a string instead of a string literal
What's the de-facto way of reading and writing files in Rust 1.x?
You can use .extend() on the Vec and pass in the bytes representation of the String:
use byteorder::{LittleEndian, WriteBytesExt};
fn main() {
let login = "test";
let packet_length = 30 + (login.len() as i16);
let mut packet = Vec::new();
// Writing to a Vec<u8> cannot fail, but the Result still has to be handled.
packet.write_u8(0x00).unwrap();
packet.write_i16::<LittleEndian>(packet_length).unwrap();
let string = String::from("game name ");
packet.extend(string.as_bytes());
}
Playground
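For comparison, the same packet prefix can be sketched with the standard library alone, which makes the safe as_bytes() path explicit. `build_packet` is a hypothetical helper name; the layout (one zero byte, a little-endian i16 length, then the string bytes) is taken from the question's code, and the rest of the packet is left out.

```rust
// Builds the start of the packet using only the standard library.
// `build_packet` is a hypothetical helper, not part of the byteorder crate.
fn build_packet(login: &str, game_name: &str) -> Vec<u8> {
    // Same length formula as in the question.
    let packet_length = 30 + login.len() as i16;

    let mut packet: Vec<u8> = Vec::new();
    packet.push(0x00);
    // to_le_bytes() yields the little-endian encoding as a [u8; 2].
    packet.extend_from_slice(&packet_length.to_le_bytes());
    // as_bytes() borrows the UTF-8 bytes immutably; no unsafe is needed.
    packet.extend_from_slice(game_name.as_bytes());
    packet
}
```

For example, `build_packet("test", "game name ")` produces the zero byte, the length 34 encoded as two little-endian bytes, and then the ten bytes of the string.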

How do you create a Rust Quick XML reader from either a file or URL?

I have the following code using the quick_xml library:
use quick_xml::Reader;
use std::io::BufRead;
use std::path::Path;
use std::io::BufReader;
/// Returns an XML stream either from a file or a URL.
fn get_xml_stream(source: &str) -> Result<Reader<impl BufRead>, Error> {
let local_path = Path::new(source);
// Try to read a local file first.
if local_path.is_file() {
let reader =
Reader::from_file(source).context(format!("couldn't read file {:?}", source))?;
return Ok(reader);
}
// Try to fetch a remote file.
let response = reqwest::get(source).context(format!(
"File not found and failed fetching from remote URL {}",
source
))?;
if !response.status().is_success() {
return Err(format_err!("XML download failed with {:#?}", response));
}
Ok(Reader::from_reader(BufReader::new(response)))
}
The return type is dynamic: a Reader that either has data from a file or response body.
Compilation error:
error[E0308]: mismatched types
--> src/main.rs:225:43
|
225 | Ok(Reader::from_reader(BufReader::new(response)))
| ^^^^^^^^ expected struct `std::fs::File`, found struct `reqwest::response::Response`
|
= note: expected type `std::fs::File`
found type `reqwest::response::Response`
The compiler thinks we always want to read from a file, but this is a response stream here. How can I tell the compiler to accept both types of buffered readers in the XML reader?
Returning impl SomeTrait means the function returns one concrete type that implements that trait and you just don't want to spell out what type it is. It doesn't mean it can return heterogeneous types.
Box<dyn BufRead> is the right choice here:
use failure::{Error, format_err, ResultExt}; // failure = "0.1.6"
use quick_xml::Reader; // quick-xml = "0.17.2"
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::Path;
/// Returns an XML stream either from a file or a URL.
fn get_xml_stream(source: &str) -> Result<Reader<Box<dyn BufRead>>, Error> {
let local_path = Path::new(source);
if local_path.is_file() {
let file = File::open(local_path)?;
let reader = BufReader::new(file);
Ok(Reader::from_reader(Box::new(reader)))
} else {
let response = reqwest::get(source).context(format!(
"File not found and failed fetching from remote URL {}",
source
))?;
if !response.status().is_success() {
return Err(format_err!("XML download failed with {:#?}", response));
}
let reader = BufReader::new(response);
Ok(Reader::from_reader(Box::new(reader)))
}
}
As a side note, mixing local path and remote URL is not a good idea. A mere local_path.is_file() is not enough to sanitize the input. You've been warned.
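The Box<dyn BufRead> pattern from the answer can also be tried without quick_xml or reqwest. In the sketch below, `open_source` is a hypothetical function, and the in-memory Cursor is only a stand-in for the HTTP response body: the two branches produce different concrete reader types, and boxing them behind the trait object gives the function a single return type.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, Cursor};
use std::path::Path;

// Returns a buffered reader over a local file if `source` names one;
// otherwise treats `source` itself as the content (a stand-in for a
// downloaded response body in this sketch).
fn open_source(source: &str) -> io::Result<Box<dyn BufRead>> {
    let path = Path::new(source);
    if path.is_file() {
        // Concrete type here: BufReader<File>.
        Ok(Box::new(BufReader::new(File::open(path)?)))
    } else {
        // Concrete type here: Cursor<Vec<u8>>, which also implements BufRead.
        Ok(Box::new(Cursor::new(source.as_bytes().to_vec())))
    }
}
```

With `impl BufRead` as the return type, the second branch would be rejected with the same "mismatched types" error as in the question, because the first `return` fixes the concrete type.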

How to convert from std::io::Bytes to &[u8]

I am trying to write the contents of an HTTP Response to a file.
extern crate reqwest;
use std::io::Write;
use std::fs::File;
fn main() {
let mut resp = reqwest::get("https://www.rust-lang.org").unwrap();
assert!(resp.status().is_success());
// Write contents to disk.
let mut f = File::create("download_file").expect("Unable to create file");
f.write_all(resp.bytes());
}
But I get the following compile error:
error[E0308]: mismatched types
--> src/main.rs:12:17
|
12 | f.write_all(resp.bytes());
| ^^^^^^^^^^^^ expected &[u8], found struct `std::io::Bytes`
|
= note: expected type `&[u8]`
found type `std::io::Bytes<reqwest::Response>`
You cannot. Checking the docs for io::Bytes, there are no appropriate methods. That's because io::Bytes is an iterator that returns things byte-by-byte so there may not even be a single underlying slice of data.
If you only had io::Bytes, you would need to collect the iterator into a Vec:
let data: Result<Vec<_>, _> = resp.bytes().collect();
let data = data.expect("Unable to read data");
f.write_all(&data).expect("Unable to write data");
However, in most cases you have access to the type that implements Read, so you could instead use Read::read_to_end:
let mut data = Vec::new();
resp.read_to_end(&mut data).expect("Unable to read data");
f.write_all(&data).expect("Unable to write data");
In this specific case, you can use io::copy to directly copy from the Response to the file because Response implements io::Read and File implements io::Write:
extern crate reqwest;
use std::io;
use std::fs::File;
fn main() {
let mut resp = reqwest::get("https://www.rust-lang.org").unwrap();
assert!(resp.status().is_success());
// Write contents to disk.
let mut f = File::create("download_file").expect("Unable to create file");
io::copy(&mut resp, &mut f).expect("Unable to copy data");
}
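The three approaches above can be compared side by side with any reader. The sketch below is generic over Read and Write so it needs no network access; `via_collect`, `via_read_to_end`, and `via_copy` are hypothetical helper names, and in a test an in-memory std::io::Cursor can stand in for the response.

```rust
use std::io::{self, Read, Write};

// Collect the byte-by-byte iterator into a Vec; each item is an io::Result<u8>,
// so the first read error aborts the collection.
fn via_collect<R: Read>(src: R) -> io::Result<Vec<u8>> {
    src.bytes().collect()
}

// Read everything into a buffer in one call.
fn via_read_to_end<R: Read>(mut src: R) -> io::Result<Vec<u8>> {
    let mut data = Vec::new();
    src.read_to_end(&mut data)?;
    Ok(data)
}

// Stream from reader to writer without holding the whole body in a Vec first.
// Returns the number of bytes copied.
fn via_copy<R: Read, W: Write>(mut src: R, mut dst: W) -> io::Result<u64> {
    io::copy(&mut src, &mut dst)
}
```

When the destination is a file, `via_copy` is usually the natural choice, since it avoids buffering the entire download in memory.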

tokio-curl: capture output into a local `Vec` - may outlive borrowed value

I do not know Rust well enough to understand lifetimes and closures yet...
Trying to collect the downloaded data into a vector using tokio-curl:
extern crate curl;
extern crate futures;
extern crate tokio_core;
extern crate tokio_curl;
use std::io::{self, Write};
use std::str;
use curl::easy::Easy;
use tokio_core::reactor::Core;
use tokio_curl::Session;
fn main() {
// Create an event loop that we'll run on, as well as an HTTP `Session`
// which we'll be routing all requests through.
let mut lp = Core::new().unwrap();
let mut out = Vec::new();
let session = Session::new(lp.handle());
// Prepare the HTTP request to be sent.
let mut req = Easy::new();
req.get(true).unwrap();
req.url("https://www.rust-lang.org").unwrap();
req.write_function(|data| {
out.extend_from_slice(data);
io::stdout().write_all(data).unwrap();
Ok(data.len())
})
.unwrap();
// Once we've got our session, issue an HTTP request to download the
// rust-lang home page
let request = session.perform(req);
// Execute the request, and print the response code as well as the error
// that happened (if any).
let mut req = lp.run(request).unwrap();
println!("{:?}", req.response_code());
println!("out: {}", str::from_utf8(&out).unwrap());
}
Produces an error:
error[E0373]: closure may outlive the current function, but it borrows `out`, which is owned by the current function
--> src/main.rs:25:24
|
25 | req.write_function(|data| {
| ^^^^^^ may outlive borrowed value `out`
26 | out.extend_from_slice(data);
| --- `out` is borrowed here
|
help: to force the closure to take ownership of `out` (and any other referenced variables), use the `move` keyword, as shown:
| req.write_function(move |data| {
Investigating further, I see that Easy::write_function requires the 'static lifetime, but the example of how to collect output from the curl-rust docs uses Transfer::write_function instead:
use curl::easy::Easy;
let mut data = Vec::new();
let mut handle = Easy::new();
handle.url("https://www.rust-lang.org/").unwrap();
{
let mut transfer = handle.transfer();
transfer.write_function(|new_data| {
data.extend_from_slice(new_data);
Ok(new_data.len())
}).unwrap();
transfer.perform().unwrap();
}
println!("{:?}", data);
The Transfer::write_function does not require the 'static lifetime:
impl<'easy, 'data> Transfer<'easy, 'data> {
/// Same as `Easy::write_function`, just takes a non `'static` lifetime
/// corresponding to the lifetime of this transfer.
pub fn write_function<F>(&mut self, f: F) -> Result<(), Error>
where F: FnMut(&[u8]) -> Result<usize, WriteError> + 'data
{
...
But I can't use a Transfer instance on tokio-curl's Session::perform because it requires the Easy type:
pub fn perform(&self, handle: Easy) -> Perform {
transfer.easy is a private field that is directly passed to session.perform.
Is this an issue with tokio-curl? Maybe it should mark the transfer.easy field as public or implement a new function like perform_transfer? Is there another way to collect output using tokio-curl per transfer?
The first thing you have to understand when using the futures library is that you don't have any control over what thread the code is going to run on.
In addition, the documentation for curl's Easy::write_function says:
Note that the lifetime bound on this function is 'static, but that is often too restrictive. To use stack data consider calling the transfer method and then using write_function to configure a callback that can reference stack-local data.
The most straight-forward solution is to use some type of locking primitive to ensure that only one thread at a time may have access to the vector. You also have to share ownership of the vector between the main thread and the closure:
use std::sync::Mutex;
use std::sync::Arc;
let out = Arc::new(Mutex::new(Vec::new()));
let out_closure = out.clone();
// ...
req.write_function(move |data| {
let mut out = out_closure.lock().expect("Unable to lock output");
// ...
}).expect("Cannot set writing function");
// ...
let out = out.lock().expect("Unable to lock output");
println!("out: {}", str::from_utf8(&out).expect("Data was not UTF-8"));
Unfortunately, the tokio-curl library does not currently support using the Transfer type that would allow for stack-based data.
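The Arc<Mutex<Vec<u8>>> pattern above can be tried in isolation. In this sketch, `collect_chunks` is a hypothetical function and std::thread::spawn stands in for the event loop: like Easy::write_function, it requires a 'static closure, so the closure owns a clone of the Arc instead of borrowing the vector.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Feeds `chunks` to a 'static callback on another thread and returns
// everything the callback collected.
fn collect_chunks(chunks: Vec<Vec<u8>>) -> Vec<u8> {
    let out = Arc::new(Mutex::new(Vec::<u8>::new()));
    let out_closure = Arc::clone(&out);

    // `move` transfers the cloned Arc into the closure, so the closure
    // borrows nothing from this stack frame.
    let mut write_function = move |data: &[u8]| -> usize {
        let mut out = out_closure.lock().expect("Unable to lock output");
        out.extend_from_slice(data);
        data.len()
    };

    // Stand-in for the library invoking the callback as data arrives.
    let worker = thread::spawn(move || {
        for chunk in &chunks {
            write_function(chunk.as_slice());
        }
    });
    worker.join().expect("worker panicked");

    // Copy the collected bytes out while the lock is held.
    let collected = out.lock().expect("Unable to lock output").clone();
    collected
}
```

Dropping the `move` keyword and the Arc, and capturing a local Vec by reference instead, reproduces the "closure may outlive the current function" error from the question.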
