Moving a &[&str] into a thread - rust

As the title already says, I'm trying to move a &[&str] into a thread. Well, actually, the code below works, but I have two problems with it:
let args2: Vec<_> = args.iter().map(|arg| { arg.to_string() }).collect(); seems a bit verbose to convert a &[&str] into a Vec<String>. Can this be done "nicer"?
If I understand it correctly, the strings get copied twice: first by the let cmd2 and let args2 statements; then by moving them inside the move closure. Is this correct? And if so, can it be done with one copy?
I'm aware of thread::scoped, but it is deprecated at the moment. I'm also coding this to learn a bit more about Rust, so comments about "unrusty" code are appreciated too.
use std::process::{Command,Output};
use std::thread;
use std::thread::JoinHandle;
pub struct Process {
joiner: JoinHandle<Output>,
}
impl Process {
pub fn new(cmd: &str, args: &[&str]) -> Process {
// Copy the strings for the thread
let cmd2 = cmd.to_string();
let args2: Vec<_> = args.iter().map(|arg| { arg.to_string() }).collect();
let child = thread::spawn(move || {
Command::new(cmd2).args(&args2[..]).output().unwrap_or_else(|e| {
panic!("Failed to execute process: {}", e)
})
});
Process { joiner: child }
}
}

let args2: Vec<_> = args.iter().map(|arg| { arg.to_string() }).collect(); seems a bit verbose to convert a &[&str] into a Vec<String>. Can this be done "nicer"?
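I don't think so. There are a few minor variations of this that also work (e.g. args.iter().cloned().map(String::from).collect();), but I can't think of one that is substantially nicer. One minor point: when this was written, using to_string to convert a &str to a String wasn't quite as efficient as using String::from or to_owned, because it went through the formatting machinery. Newer compilers specialize to_string for str, so the difference no longer matters in practice.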
If I understand it correctly, the strings get copied twice: first by the let cmd2 and let args2 statements; then by moving them inside the move closure. Is this correct? And if so, can it be done with one copy?
No, the strings are only copied where you call to_string. Strings don't implement Copy, so they're never copied implicitly. If you try to access the strings after they have been moved to the closure, you will get a compiler error.
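To make the single copy visible, here is a minimal sketch. The helper name spawn_joined is hypothetical and not part of the original code; it only illustrates that the character data is duplicated when the owned values are created, and that the move closure merely transfers ownership.

```rust
use std::thread;

/// Hypothetical helper: copies the borrowed strings once, then moves the
/// owned copies into a new thread and returns what the thread produced.
fn spawn_joined(cmd: &str, args: &[&str]) -> String {
    // The only copies of the string data happen on these two lines.
    let cmd = cmd.to_owned();
    let args: Vec<String> = args.iter().map(|&s| s.to_owned()).collect();

    // `move` transfers ownership of `cmd` and `args` into the closure;
    // the heap buffers themselves are not duplicated.
    let handle = thread::spawn(move || format!("{} {}", cmd, args.join(" ")));

    // Using `cmd` or `args` here would be a compile error: they were moved.
    handle.join().expect("thread panicked")
}
```

Calling spawn_joined("echo", &["a", "b"]) builds the string inside the thread and hands it back through join().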

Related

How to store one of two constants in a value, where the constants share traits?

Depending on configuration, I need to select either stdout or sink once, and pass the result as an output destination to subsequent output calls.
My Java and C++ experience tell me that abstracting away from the concrete type is wise and makes room for future design changes. This code however won't compile:
let out = if std::env::var("LOG").is_ok() {
std::io::stdout()
} else {
std::io::sink()
};
Stating...
`if` and `else` have incompatible types
What is the Rust-o-matic way of solving this?
Dynamic dispatch using trait objects is probably what you need:
use std::io::{self, Write};
use std::env;
fn get_output() -> Box<dyn Write> {
if env::var("LOG").is_ok() {
Box::new(io::stdout())
} else {
Box::new(io::sink())
}
}
let out = get_output();
The approach from Peter's answer is probably what you need, but it does require an extra allocation. (Which probably doesn't matter in the least in this case, but could matter in other scenarios.) If you are only passing out downward, i.e. as argument to functions, you can avoid the allocation by using two variables to store the different outputs:
let (mut stdout, mut sink);
let out: &mut dyn Write = if std::env::var("LOG").is_ok() {
stdout = std::io::stdout();
&mut stdout
} else {
sink = std::io::sink();
&mut sink
};
// ...proceed to use out...
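As a sketch of what "passing out downward" can look like, the snippet below wraps the two-variable trick in a function and hands the reference to a consumer. The names emit and log_message, and the bool parameter that replaces the environment lookup, are assumptions made for illustration only.

```rust
use std::io::{self, Write};

/// Hypothetical consumer: accepts any writer behind a trait-object reference.
fn emit(out: &mut dyn Write, msg: &str) -> io::Result<()> {
    writeln!(out, "{}", msg)
}

/// Picks a destination without boxing, then passes it downward.
fn log_message(enabled: bool, msg: &str) -> io::Result<()> {
    // Declared up front so whichever one gets initialized outlives `out`.
    let (mut stdout, mut sink);
    let out: &mut dyn Write = if enabled {
        stdout = io::stdout();
        &mut stdout
    } else {
        sink = io::sink();
        &mut sink
    };
    emit(out, msg)
}
```

Because emit only sees &mut dyn Write, it also works with other writers such as a Vec<u8>, which is handy in tests.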

Rust chunks method with owned values?

I'm trying to perform a parallel operation on several chunks of strings at a time, and I'm running into an issue with the borrow checker:
(for context, identifiers is a Vec<String> from a CSV file, client is reqwest and target is an Arc<String> that is write once read many)
use futures::{stream, StreamExt};
use std::sync::Arc;
async fn nop(
person_ids: &[String],
target: &str,
url: &str,
) -> String {
let noop = format!("{} {}", target, url);
let noop2 = person_ids.iter().for_each(|f| {f.as_str();});
"Some text".into()
}
#[tokio::main]
async fn main() {
let target = Arc::new(String::from("sometext"));
let url = "http://example.com";
let identifiers = vec!["foo".into(), "bar".into(), "baz".into(), "qux".into(), "quux".into(), "quuz".into(), "corge".into(), "grault".into(), "garply".into(), "waldo".into(), "fred".into(), "plugh".into(), "xyzzy".into()];
let id_sets: Vec<&[String]> = identifiers.chunks(2).collect();
let responses = stream::iter(id_sets)
.map(|person_ids| {
let target = target.clone();
tokio::spawn( async move {
let resptext = nop(person_ids, target.as_str(), url).await;
})
})
.buffer_unordered(2);
responses
.for_each(|b| async { })
.await;
}
Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=e41c635e99e422fec8fc8a581c28c35e
Given chunks yields a Vec<&[String]>, the compiler complains that identifiers doesn't live long enough because it potentially goes out of scope while the slices are being referenced. Realistically this won't happen because there's an await. Is there a way to tell the compiler that this is safe, or is there another way of getting chunks as a set of owned Strings for each thread?
There was a similarly asked question that used into_owned() as a solution, but when I try that, rustc complains about the slice size not being known at compile time in the request_user function.
EDIT: Some other questions as well:
Is there a more direct way of using target in each thread without needing Arc? From the moment it is created, it never needs to be modified, just read from. If not, is there a way of pulling it out of the Arc that doesn't require the .as_str() method?
How do you handle multiple error types within the tokio::spawn() block? In the real world use, I'm going to receive quick_xml::Error and reqwest::Error within it. It works fine without tokio spawn for concurrency.
Is there a way to tell the compiler that this is safe, or is there another way of getting chunks as a set of owned Strings for each thread?
You can chunk a Vec<T> into a Vec<Vec<T>> without cloning by using the itertools crate:
use itertools::Itertools;
fn main() {
let items = vec![
String::from("foo"),
String::from("bar"),
String::from("baz"),
];
let chunked_items: Vec<Vec<String>> = items
.into_iter()
.chunks(2)
.into_iter()
.map(|chunk| chunk.collect())
.collect();
for chunk in chunked_items {
println!("{:?}", chunk);
}
}
["foo", "bar"]
["baz"]
This is based on the answers here.
Your issue here is that id_sets is a Vec of slices that borrow from identifiers. The spawned task may outlive the function that owns identifiers (tokio::spawn requires its future to be 'static, even with async move), so the borrowed slices are not guaranteed to still be valid.
Your solution to the immediate problem is to convert the Vec<&[String]> to a Vec<Vec<String>> type.
A way of accomplishing that would be:
let id_sets: Vec<Vec<String>> = identifiers
.chunks(2)
.map(|x: &[String]| x.to_vec())
.collect();
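The same idea can be sketched without tokio, using plain std threads so that the example stays self-contained. The functions process and run_chunks are hypothetical stand-ins for the real request code, not part of the question.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the real request: joins a chunk of ids.
fn process(ids: &[String], target: &str) -> String {
    format!("{}:{}", target, ids.join(","))
}

/// Splits `identifiers` into owned chunks and handles each on its own thread.
fn run_chunks(identifiers: &[String], size: usize, target: Arc<String>) -> Vec<String> {
    let id_sets: Vec<Vec<String>> = identifiers
        .chunks(size)
        .map(|chunk| chunk.to_vec()) // clone each slice into an owned Vec
        .collect();

    let handles: Vec<_> = id_sets
        .into_iter()
        .map(|ids| {
            let target = Arc::clone(&target);
            // `ids` and `target` are both owned, so the closure is 'static.
            thread::spawn(move || process(&ids, &target))
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

Note that &target coerces from &Arc<String> to &str at the call site, so an explicit .as_str() is not needed there. The to_vec() call does clone the strings; the itertools approach above avoids that if the original Vec is no longer needed.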

Read file character-by-character in Rust

Is there an idiomatic way to process a file one character at a time in Rust?
This seems to be roughly what I'm after:
let mut f = io::BufReader::new(try!(fs::File::open("input.txt")));
for c in f.chars() {
println!("Character: {}", c.unwrap());
}
But Read::chars is still unstable as of Rust v1.6.0.
I considered using Read::read_to_string, but the file may be large and I don't want to read it all into memory.
Let's compare 4 approaches.
1. Read::chars
You could copy the Read::chars implementation, but it is marked unstable with the note
the semantics of a partial read/write of where errors happen is currently unclear and may change
so some care must be taken. Anyway, this seems to be the best approach.
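For illustration, here is a hand-rolled sketch of the idea. It is not the standard library's Read::chars implementation; read_char is a hypothetical function that decodes one UTF-8 sequence at a time and never holds more than four bytes. It assumes the reader is buffered (for example wrapped in a BufReader), since it issues very small reads, and it does not retry on ErrorKind::Interrupted.

```rust
use std::io::{self, Read};

/// Reads one UTF-8 encoded char from `reader`.
/// Returns Ok(None) at a clean end of input.
fn read_char<R: Read>(reader: &mut R) -> io::Result<Option<char>> {
    let mut buf = [0u8; 4];
    // Read the leading byte; zero bytes read means end of input.
    if reader.read(&mut buf[..1])? == 0 {
        return Ok(None);
    }
    // The leading byte determines the length of the sequence.
    let len = match buf[0] {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF7 => 4,
        _ => {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "invalid UTF-8 lead byte",
            ))
        }
    };
    // Fetch the continuation bytes, if any (a no-op when len == 1).
    reader.read_exact(&mut buf[1..len])?;
    // from_utf8 rejects overlong or otherwise malformed sequences.
    let s = std::str::from_utf8(&buf[..len])
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    Ok(s.chars().next())
}
```

A caller would loop with while let Some(c) = read_char(&mut reader)? { ... } until the function returns Ok(None).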
2. flat_map
The flat_map alternative does not compile:
use std::io::{BufRead, BufReader};
use std::fs::File;
pub fn main() {
let mut f = BufReader::new(File::open("input.txt").expect("open failed"));
for c in f.lines().flat_map(|l| l.expect("lines failed").chars()) {
println!("Character: {}", c);
}
}
The problem is that chars borrows from the string, but l.expect("lines failed") lives only inside the closure, so the compiler gives the error borrowed value does not live long enough.
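One way around the borrow problem, shown here only as a sketch, is to have the closure return owned characters. The function name chars_via_flat_map is hypothetical. The cost is an extra Vec<char> allocation per line on top of the String that lines() already allocates, so this trades efficiency for a single loop.

```rust
use std::io::BufRead;

/// Collects every character from `reader`, line by line.
/// Note: `lines()` strips the line terminators.
fn chars_via_flat_map<R: BufRead>(reader: R) -> Vec<char> {
    reader
        .lines()
        .flat_map(|l| {
            let line = l.expect("lines failed");
            // Return an owned Vec<char> so that nothing borrows from
            // `line` after the closure ends.
            line.chars().collect::<Vec<char>>()
        })
        .collect()
}
```

In real code the final collect() would usually be replaced by a for loop over the iterator, so the characters are processed as they arrive.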
3. Nested for
This code
use std::io::{BufRead, BufReader};
use std::fs::File;
pub fn main() {
let mut f = BufReader::new(File::open("input.txt").expect("open failed"));
for line in f.lines() {
for c in line.expect("lines failed").chars() {
println!("Character: {}", c);
}
}
}
works, but it allocates a new string for each line. Besides, if there is no line break in the input file, the whole file would be loaded into memory.
4. BufRead::read_until
A memory-efficient alternative to approach 3 is to use BufRead::read_until, reusing a single buffer to read each line:
use std::io::{BufRead, BufReader};
use std::fs::File;
pub fn main() {
let mut f = BufReader::new(File::open("input.txt").expect("open failed"));
let mut buf = Vec::<u8>::new();
while f.read_until(b'\n', &mut buf).expect("read_until failed") != 0 {
// this moves the ownership of the read data to s
// there is no allocation
let s = String::from_utf8(buf).expect("from_utf8 failed");
for c in s.chars() {
println!("Character: {}", c);
}
// this returns the ownership of the read data to buf
// there is no allocation
buf = s.into_bytes();
buf.clear();
}
}
I cannot use lines() because my file could be a single line that is gigabytes in size. This is an improvement on @malbarbo's recommendation of copying Read::chars from an old version of Rust. The utf8-chars crate already adds .chars() to BufRead for you.
Inspecting their repository, it doesn't look like they load more than 4 bytes at a time.
Your code will look the same as it did before Rust removed Read::chars:
use std::io::stdin;
use utf8_chars::BufReadCharsExt;
fn main() {
for c in stdin().lock().chars().map(|x| x.unwrap()) {
println!("{}", c);
}
}
Add the following to your Cargo.toml:
[dependencies]
utf8-chars = "1.0.0"
There are two solutions that make sense here.
First, you could copy the implementation of Read::chars() and use it; that would make it completely trivial to move your code over to the standard library implementation if/when it stabilizes.
On the other hand, you could simply iterate line by line (using f.lines()) and then use line.chars() on each line to get the chars. This is a little more hacky, but it will definitely work.
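If you only wanted one loop, you could use flat_map(), but note that a closure like |line| line.chars() will not compile as written, because chars() borrows from the line; the closure has to return owned characters instead.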

Cannot move data out of a Mutex

Consider the following code example. I have a vector of JoinHandles that I need to iterate over to join the threads back to the main thread; however, upon doing so I get the error error: cannot move out of borrowed content.
let threads = Arc::new(Mutex::new(Vec::new()));
for _x in 0..100 {
let handle = thread::spawn(move || {
//do some work
});
threads.lock().unwrap().push(handle);
}
for t in threads.lock().unwrap().iter() {
t.join();
}
Unfortunately, you can't do this directly. Once a Mutex has consumed the data structure you gave it, you can't get it back by value through the lock. You can only get a &mut reference to it, which won't allow moving out of it. So even into_iter() won't work: it takes self by value, which it can't get from a MutexGuard.
There is a workaround, however. You can use Arc<Mutex<Option<Vec<_>>>> instead of Arc<Mutex<Vec<_>>> and then just take() the value out of the mutex:
for t in threads.lock().unwrap().take().unwrap().into_iter() {
}
Then into_iter() will work just fine as the value is moved into the calling thread.
Of course, you will need to construct the vector and push to it appropriately:
let threads = Arc::new(Mutex::new(Some(Vec::new())));
...
threads.lock().unwrap().as_mut().unwrap().push(handle);
However, the best way is to just drop the Arc<Mutex<..>> layer altogether (of course, if this value is not used from other threads).
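As a sketch of that last suggestion, when the handles are only ever touched by the spawning thread, a plain Vec is enough. The function name spawn_and_join and the doubled value standing in for real work are assumptions for illustration.

```rust
use std::thread::{self, JoinHandle};

/// Spawns `n` workers and joins them all, returning their results in order.
fn spawn_and_join(n: u32) -> Vec<u32> {
    // A plain Vec owned by the spawning thread; no Arc or Mutex required.
    let handles: Vec<JoinHandle<u32>> = (0..n)
        .map(|x| thread::spawn(move || x * 2)) // stand-in for real work
        .collect();

    // into_iter() yields each JoinHandle by value, which join() requires.
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

Since join() consumes the handle, iterating by value with into_iter() is what makes this compile, in contrast to iter() in the question.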
As referenced in How to take ownership of T from Arc<Mutex<T>>?, this is now possible without any trickery by using Arc::try_unwrap and Mutex::into_inner:
let threads = Arc::new(Mutex::new(Vec::new()));
for _x in 0..100 {
let handle = thread::spawn(move || {
println!("{}", _x);
});
threads.lock().unwrap().push(handle);
}
let threads_unwrapped: Vec<JoinHandle<_>> = Arc::try_unwrap(threads).unwrap().into_inner().unwrap();
for t in threads_unwrapped.into_iter() {
t.join().unwrap();
}
Play around with it in this playground to verify.
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=9d5635e7f778bc744d1fb855b92db178
While drain is a good solution, you can also do the following:
// with a copy
let built_words: Arc<Mutex<Vec<String>>> = Arc::new(Mutex::new(vec![]));
let result: Vec<String> = built_words.lock().unwrap().clone();
// using drain
let mut locked_result = built_words.lock().unwrap();
let mut result: Vec<String> = vec![];
result.extend(locked_result.drain(..));
I would prefer to clone the data so that the original value stays in place. Note that clone() copies every String, so it costs more than drain for large vectors.
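A further option that the answers above do not mention is std::mem::take, which swaps an empty Vec into the mutex and returns the old contents without cloning. The helper name take_all is hypothetical; this is a sketch assuming the shared value may be left empty afterwards.

```rust
use std::mem;
use std::sync::{Arc, Mutex};

/// Moves the Vec out of the mutex, leaving an empty Vec behind.
fn take_all(shared: &Arc<Mutex<Vec<String>>>) -> Vec<String> {
    let mut guard = shared.lock().unwrap();
    // Replaces the contents with Vec::default() and returns the old
    // value; no element is cloned.
    mem::take(&mut *guard)
}
```

Unlike Arc::try_unwrap, this works while other clones of the Arc are still alive, because it only needs the lock.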

How to create chaining API after read_to_string was changed to take a buffer?

I'm trying to port my library clog to the latest Rust version.
Rust changed a lot in the previous month and so I'm scratching my head over this code asking myself if there's really no way anymore to write this in a completely chained way?
fn get_last_commit () -> String {
let output = Command::new("git")
.arg("rev-parse")
.arg("HEAD")
.output()
.ok().expect("error invoking git rev-parse");
let encoded = String::from_utf8(output.stdout).ok().expect("error parsing output of git rev-parse");
encoded
}
In an older version of Rust the code could be written like that
pub fn get_last_commit () -> String {
Command::new("git")
.arg("rev-parse")
.arg("HEAD")
.spawn()
.ok().expect("failed to invoke rev-parse")
.stdout.as_mut().unwrap().read_to_string()
.ok().expect("failed to get last commit")
}
It seems there is no read_to_string() method anymore that doesn't take a buffer which makes it hard to implement a chaining API unless I'm missing something.
UPDATE
Ok, I figured I can use map to get it chaining.
fn get_last_commit () -> String {
Command::new("git")
.arg("rev-parse")
.arg("HEAD")
.output()
.map(|output| {
String::from_utf8(output.stdout).ok().expect("error reading into string")
})
.ok().expect("error invoking git rev-parse")
}
Actually I wonder if I could use and_then, but it seems the error types don't line up correctly ;)
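One way to make the error types line up for and_then, sketched here under the assumption that returning a Result instead of panicking is acceptable, is to convert both errors into Box<dyn Error>. The helper decode is hypothetical and exists mainly so the conversion can be shown in isolation.

```rust
use std::error::Error;
use std::io;
use std::process::Command;

/// Decodes raw command output; both error types become Box<dyn Error>.
fn decode(raw: io::Result<Vec<u8>>) -> Result<String, Box<dyn Error>> {
    raw.map_err(|e| Box::new(e) as Box<dyn Error>) // io::Error
        .and_then(|bytes| {
            // FromUtf8Error is boxed the same way, so the types agree.
            String::from_utf8(bytes).map_err(|e| Box::new(e) as Box<dyn Error>)
        })
}

/// Chained variant of get_last_commit that returns errors to the caller.
fn get_last_commit() -> Result<String, Box<dyn Error>> {
    decode(
        Command::new("git")
            .arg("rev-parse")
            .arg("HEAD")
            .output()
            .map(|output| output.stdout),
    )
}
```

The caller can still write get_last_commit().expect("...") to get the original panicking behavior.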
As others have said, this was changed to allow reusing buffers/avoiding allocations.
Another alternative is to use read_to_string and manually provide the buffer:
pub fn get_last_commit () -> String {
let mut string = String::new();
Command::new("git")
.arg("rev-parse")
.arg("HEAD")
.spawn()
.ok().expect("failed to invoke rev-parse")
.stdout.as_mut().unwrap()
.read_to_string(&mut string)
.ok().expect("failed to get last commit");
string
}
This API was changed so that you didn't have to re-allocate a new String each time. However, as you've noticed, there's some convenience loss if you don't care about allocation. It might be a good idea to suggest re-adding this back in, like what happened with Vec::from_elem. Maybe open a small RFC?
While it may make sense to try to add this back to the standard library, here's a version of read_to_string that allocates on its own that you can use today:
use std::io::{self, Cursor, Read};
trait MyRead: Read {
fn read_full_string(&mut self) -> io::Result<String> {
let mut s = String::new();
let r = self.read_to_string(&mut s);
r.map(|_| s)
}
}
impl<T> MyRead for T where T: Read {}
fn main() {
let bytes = b"hello";
let mut input = Cursor::new(bytes);
let s = input.read_full_string();
println!("{}", s.unwrap());
}
This should allow you to use the chaining style you had before.
