Capture of moved value that cannot be copied - rust

I'm using Kuchiki to parse some HTML and making HTTP requests using hyper to concurrently operate on results through scoped_threadpool.
I select and iterate over listings. I decide the number of threads to allocate in the threadpool based on the number of listings:
    let listings = document.select("table.listings").unwrap();
    let mut pool = Pool::new(listings.count() as u32);
    pool.scoped(|scope| {
        for listing in listings {
            do_stuff_with(listing);
        }
    });
When I try this I get capture of moved value: listings. listings is a kuchiki::iter::Select<kuchiki::iter::Elements<kuchiki::iter::Descendants>>, which is non-copyable -- so I get neither an implicit copy, nor can I call an explicit .clone().
Inside the pool I can just do document.select("table.listings") again and it will work, but this seems unnecessary to me since I already used it to get the count. I don't need listings after the loop either.
Is there any way for me to use a non-copyable value in a closure?

Sadly, I think it's not possible the way you want to do it.
Your listings.count() consumes the iterator listings. You can avoid moving the iterator by writing listings.by_ref().count(), but that won't have the desired effect either: count() still consumes all elements of the iterator, so every subsequent call to next() will yield None.
The only way to achieve your goal is to somehow get the length of the iterator listings without consuming its elements. The trait ExactSizeIterator was built for exactly this purpose, but it seems that kuchiki::iter::Select doesn't implement it. Note that implementing it may also be impossible for that kind of iterator.
Edit: As @delnan suggested, another possibility is of course to collect the iterator into a Vec. This has some disadvantages, but may be a good idea in your case.
Let me also note, that you probably shouldn't create one thread for every line in the SELECT result set. Usually threadpools use approximately as many threads as there are CPUs.
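To illustrate the collect-into-Vec workaround without pulling in kuchiki, here is a minimal sketch using plain Strings to stand in for the non-copyable matches; the function name and setup are illustrative only:

```rust
// Sketch: collect the iterator into a Vec so its length is available
// without consuming the items. `String` stands in for kuchiki's
// non-copyable node matches.
fn count_and_items<I: Iterator<Item = String>>(iter: I) -> (u32, Vec<String>) {
    let items: Vec<String> = iter.collect();
    (items.len() as u32, items)
}

fn main() {
    // Imagine this is `document.select("table.listings").unwrap()`:
    let fake_listings = ["a", "b", "c"].iter().map(|s| s.to_string());

    let (thread_count, listings) = count_and_items(fake_listings);
    assert_eq!(thread_count, 3);

    // The Vec can still be iterated afterwards, by reference or by value:
    for listing in &listings {
        println!("processing {listing}");
    }
}
```

The cost is one allocation holding all matches at once, which is usually acceptable when the count is needed up front anyway.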

Related

How to map to references if it's not an Iterator<Item=&T>?

I have a function that receives an Iterator<Item = impl AsRef<str>> and I want to get an array of substrings from this iterator. The issue is that when mapping, .map() takes ownership of each item, so I can't return the result of as_ref(). How can I turn this iterator into an iterator of references, like with Vec::iter? Sample code:
    fn a(lines: impl Iterator<Item = impl AsRef<str>>) {
        println!("{:?}", lines.map(|s| s.as_ref()).collect::<Vec<&str>>());
    }

    a(vec!["one".to_string(), "two".to_string()].iter());
It's a bit problematic to do with just an iterator. As the signature describes, the iterator may very well own the items and transfer ownership to you inside the closure. Consider taking Item = &impl AsRef<str> instead. This should not be a problem to provide at the caller site.
EDIT
It is actually tricky on the caller side too. I'm using a BufRead that returns an Iterator<Item = String>. How would you turn this iterator into an Iterator<Item = &String>?
BufRead will read lines as you demand them and it is designed to wait on the input. To use it properly and with memory efficiency, I'd recommend processing each line in a loop and discarding it before reading the next.
That's the solution I found but it seems very costly having to allocate an entire Vec just to get the references of items I already own in the iterator.
The iterator does not in fact own all the strings at the same time. Instead, it reads them in batches (or it should), and it does not store them internally for you to reference later (or at least it should not).
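The loop-and-discard approach recommended above can be sketched as follows; the function name and the use of Cursor as a stand-in input are illustrative:

```rust
use std::io::{BufRead, Cursor};

// Sketch: process each line as it is produced and drop it before the
// next read, instead of collecting references into a Vec.
fn first_chars<R: BufRead>(reader: R) -> Vec<char> {
    let mut out = Vec::new();
    for line in reader.lines() {
        let line = line.expect("read error"); // owned String, alive for one iteration
        if let Some(c) = line.chars().next() {
            out.push(c);
        }
        // `line` is dropped here; no buffer of Strings accumulates.
    }
    out
}

fn main() {
    // Cursor implements BufRead, so it can stand in for a file or stdin:
    assert_eq!(first_chars(Cursor::new("one\ntwo\nthree\n")), vec!['o', 't', 't']);
}
```

Only whatever you extract from each line outlives the loop, which keeps memory usage proportional to one line rather than the whole input.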

How can I share the data without locking whole part of it?

Consider the following scenarios.
    let mut map = HashMap::new();
    map.insert(2, 5);
    thread::scope(|s| {
        s.spawn(|_| {
            map.insert(1, 5);
        });
        s.spawn(|_| {
            let d = map.get(&2).unwrap();
        });
    }).unwrap();
This code does not compile because we borrow the variable map mutably in the first spawned closure and borrow it again in the second. The classical solution is wrapping map in Arc<Mutex<...>>. But in the above code, we don't need to lock the whole hashmap: although the two threads concurrently access the same hashmap, they access completely different regions of it.
So I want to share map through thread without using lock, but how can I acquire it? I'm also open to use unsafe rust...
in the above code, we don't need to lock the whole hashmap
Actually, we do.
Every insert into the HashMap may trigger its reallocation, if the map is at capacity at that point. Now, imagine the following sequence of events:
Second thread calls get and retrieves reference to the value (at runtime it'll be just an address).
First thread calls insert.
Map gets reallocated, the old chunk of memory is now invalid.
Second thread dereferences the previously-retrieved reference - boom, we get UB!
So, if you need to insert something in the map concurrently, you have to synchronize that somehow.
For the standard HashMap, the only way to do this is to lock the whole map, since the reallocation invalidates every element. If you used something like DashMap, which synchronizes access internally and therefore allows inserting through shared reference, this would require no locking from your side - but can be more cumbersome in other parts of API (e.g. you can't return a reference to the value inside the map - get method returns RAII wrapper, which is used for synchronization), and you can run into unexpected deadlocks.
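A minimal std-only sketch of the "lock the whole map" approach, using std::thread::scope (stable since Rust 1.63); the function name is illustrative:

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread;

// Both threads go through the Mutex, so an insert can never reallocate
// the table while another thread holds a reference into it.
fn concurrent_inserts() -> usize {
    let map = Mutex::new(HashMap::new());
    map.lock().unwrap().insert(2, 5);

    thread::scope(|s| {
        s.spawn(|| {
            // insert may reallocate, so it must hold the lock
            map.lock().unwrap().insert(1, 5);
        });
        s.spawn(|| {
            // copy the value out while the guard is alive instead of
            // keeping a reference into the map past the lock
            let d = map.lock().unwrap().get(&2).copied();
            assert_eq!(d, Some(5));
        });
    });

    let len = map.lock().unwrap().len();
    len
}

fn main() {
    assert_eq!(concurrent_inserts(), 2);
}
```

Note that the second closure copies the value out under the lock rather than holding a &i32 across it, which is exactly the discipline DashMap's RAII guards enforce for you.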

Is reading from a file with multiple threads considered undefined behavior in rust?

Example seen below. It seems like this might be UB by definition, but it remains unclear to me.
    use std::{fs, thread};

    fn main() {
        let mut threads = Vec::new();
        for _ in 0..100 {
            let thread = thread::spawn(move || {
                fs::read_to_string("./config/init.json")
                    .unwrap()
                    .trim()
                    .to_string()
            });
            threads.push(thread);
        }
        for handler in threads {
            handler.join().unwrap();
        }
    }
On most operating systems only individual read operations are guaranteed to be atomic. read_to_string may perform multiple distinct reads, which means that it's not guaranteed to be atomic between multiple threads/processes. If another process is modifying this file concurrently, read_to_string could return a mixture of data from before and after the modification. In other words, each read_to_string operation is not guaranteed to return an identical result, and some may even fail while others succeed if another process deletes the file while the program is running.
However, none of this behavior is classified as "undefined." Absent hardware problems, you are guaranteed to get back a std::io::Result<String> in a valid state, which is something you can reason about. Once UB is invoked, you can no longer reason about the state of the program.
By way of analogy, consider a choose your own adventure book. At the end of each segment you'll have some instructions like "If you choose to go into the cave, go to page 53. If you choose to take the path by the river, go to page 20." Then you turn to the appropriate page and keep reading. This is a bit like Result -- if you have an Ok you do one thing, but if you have an Err you do another thing.
Once undefined behavior is invoked, this kind of choice no longer makes sense because the program is in a state where the rules of the language no longer apply. The program could do anything, including deleting random files from your hard drive. In the book analogy, the book caught fire. Trying to follow the rules of the book no longer makes any sense, and you hope the book doesn't burn your house down with it.
In Rust you're not supposed to be able to invoke UB without using the unsafe keyword, so if you don't see that keyword anywhere then UB isn't on the table.

How to parallely `map(...)` on a custom, single-threaded iterator in Rust?

I have a MyReader that implements Iterator and produces Buffers where Buffer : Send. MyReader produces a lot of Buffers very quickly, but I have a CPU-intensive job to perform on each Buffer (.map(|buf| ...)) that is my bottleneck, and then gather the results (ordered). I want to parallelize the CPU intense work - hopefully to N threads, that would use work stealing to perform them as fast as the number of cores allows.
Edit: To be more precise: I am working on rdedup. MyStruct is Chunker, which reads an io::Read (typically stdio), finds parts (chunks) of data, and yields them. Then map() is supposed, for each chunk, to calculate its sha256 digest, compress, encrypt, save, and return the digest as the result of map(...). The digest of the saved data is used to build an index of the data. The order in which chunks are processed by map(...) does not matter, but the digest returned from each map(...) needs to be collected in the same order that the chunks were found. The actual save-to-file step is offloaded to yet another thread (the writer thread). actual code of PR in question
I hoped I could use rayon for this, but rayon expects an iterator that is already parallelizable - e.g. a Vec<...> or something like that. I have found no way to get a par_iter from MyReader - my reader is very single-threaded in nature.
There is simple_parallel but documentation says it's not recommended for general use. And I want to make sure everything will just work.
I could just take an spmc queue implementation and a custom thread pool, but I was hoping for an existing solution that is optimized and tested.
There's also pipeliner but doesn't support ordered map yet.
In general, preserving order is a pretty tough requirement as far as parallelization goes.
You could try to hand-make it with a typical fan-out/fan-in setup:
a single producer which tags inputs with a sequential monotonically increasing ID,
a thread pool which consumes from this producer and then sends the result toward the final consumer,
a consumer which buffers and reorders results so as to process them in sequential order.
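This hand-made fan-out/fan-in setup can be sketched with std channels alone; the function name and u64 payloads are illustrative, and the single shared Mutex around the task receiver is the simplest (not the fastest) way to get an spmc queue:

```rust
use std::collections::HashMap;
use std::sync::{mpsc, Mutex};
use std::thread;

// Fan-out/fan-in: tag inputs with sequential IDs, process them on a
// worker pool in any order, and reorder results by ID at the consumer.
fn ordered_parallel_map<F>(inputs: Vec<u64>, workers: usize, f: F) -> Vec<u64>
where
    F: Fn(u64) -> u64 + Send + Sync,
{
    let (task_tx, task_rx) = mpsc::channel::<(usize, u64)>();
    let (result_tx, result_rx) = mpsc::channel::<(usize, u64)>();
    let task_rx = Mutex::new(task_rx); // share one receiver among workers

    thread::scope(|s| {
        // Fan-out: each worker pulls tagged tasks and sends tagged results.
        for _ in 0..workers {
            let result_tx = result_tx.clone();
            let (task_rx, f) = (&task_rx, &f);
            s.spawn(move || loop {
                // Note: recv() blocks while holding the lock, which is
                // fine for a sketch but serializes idle waiting.
                let task = task_rx.lock().unwrap().recv();
                match task {
                    Ok((id, x)) => result_tx.send((id, f(x))).unwrap(),
                    Err(_) => break, // producer dropped its sender: done
                }
            });
        }
        drop(result_tx); // workers hold the remaining clones

        // Producer: tag each input with a monotonically increasing ID.
        let n = inputs.len();
        for (id, x) in inputs.into_iter().enumerate() {
            task_tx.send((id, x)).unwrap();
        }
        drop(task_tx); // signal "no more work"

        // Fan-in: buffer out-of-order results and emit them in ID order.
        let mut buffer = HashMap::new();
        let mut next = 0;
        let mut out = Vec::with_capacity(n);
        for (id, y) in result_rx {
            buffer.insert(id, y);
            while let Some(y) = buffer.remove(&next) {
                out.push(y);
                next += 1;
            }
        }
        out
    })
}

fn main() {
    let out = ordered_parallel_map((0..100).collect(), 4, |x| x * 2);
    assert_eq!(out, (0..100).map(|x| x * 2).collect::<Vec<_>>());
}
```

The unbounded channels mean there is no back-pressure here; a bounded queue (e.g. mpsc::sync_channel) would add it.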
Or you could raise the level of abstraction.
Of specific interest here: Future.
A Future represents the result of a computation, which may or may not have happened yet. A consumer receiving an ordered list of Future can simply wait on each one, and let buffering occur naturally in the queue.
For bonus points, if you use a fixed size queue, you automatically get back-pressure on the consumer.
And therefore I would recommend building something on top of CpuPool.
The setup is going to be:
    use std::sync::mpsc::{Receiver, Sender};

    fn produce(sender: Sender<...>) {
        let pool = CpuPool::new_num_cpus();
        for chunk in reader {
            let future = pool.spawn_fn(|| /* do work */);
            sender.send(future);
        }
        // Dropping the sender signals there's no more work to the consumer
    }

    fn consume(receiver: Receiver<...>) {
        while let Ok(future) = receiver.recv() {
            let item = future.wait().expect("Computation Error?");
            /* do something with item */
        }
    }

    fn main() {
        let (sender, receiver) = std::sync::mpsc::channel();
        std::thread::spawn(move || consume(receiver));
        produce(sender);
    }
There is now a dpc-pariter crate. Simply replace iter.map(fn) with iter.parallel_map(fn), which will perform work in parallel while preserving result order. From the docs:
* drop-in replacement for standard iterators(*)
* preserves order
* lazy, somewhat like single-threaded iterators
* panic propagation
* support for iterating over borrowed values using scoped threads
* backpressure
* profiling methods (useful for analyzing pipelined processing bottlenecks)
Also, Rayon has an open issue with a great in-depth discussion of various implementation details and limitations.

Is it possible to have safe mutable aliasing to non-overlapping memory?

I'm looking for a way to take a large object and break it into smaller mutable child objects, which can be processed in parallel.
Something like:
    struct PixelBuffer { data: Vec<u32>, width: u32, height: u32 }
    struct PixelBlock { data: Vec<u32> }

    impl PixelBuffer {
        fn decompose<'a>(&'a mut self) -> Vec<Guard<'a, PixelBlock>> {
            ...
        }
    }
Where the resulting PixelBlocks can be processed in parallel, and the parent PixelBuffer will remain locked until all Guard<PixelBlock>s are dropped.
This is effectively mutable pointer aliasing; the large data block in PixelBuffer will be directly modified via each PixelBlock.
However, each PixelBlock is a non-overlapping segment of the internal data in PixelBuffer.
You can certainly do this in unsafe code (internal buffer is a raw pointer; generate a new external pointer for each PixelBlock); but is it possible to achieve the same result using safe code?
(NB. I'm open to using a data block allocated from libc::malloc if that'll help?)
This works fine and is a natural consequence of how, e.g., iterators work: the next method hands out a sequence of values that are not lifetime-connected to the reference they come from, i.e. fn next(&mut self) -> Option<Self::Item>. This automatically means that any iterator that yields &mut pointers (like, slice.iter_mut()) is yielding pointers to non-overlapping memory, because anything else would be incorrect.
One way to use this in parallel is something like my simple_parallel library, e.g. Pool::for_.
(You'll need to give more details about the internals of PixelBuffer to be more specific about how to do it in this case.)
There is no way to completely avoid unsafe Rust, because the compiler cannot currently evaluate the safety of sub-slices. However, the standard library contains code that provides a safe wrapper that you can use.
Read up on std::slice::Chunks and std::slice::ChunksMut.
Sample code: https://play.rust-lang.org/?gist=ceec5be3e1530c0a6d3b&version=stable
However, your next problem is sending the slices to separate threads, because the best way to do that would be thread::scoped, which is currently deprecated due to some safety problems that were discovered this year...
Also, keep in mind that Vec<_> owns its contents, whereas slices are just a view. Generally, you want to write most functions in terms of slices, and keep only one "Vec" to hold the data.
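The chunks_mut approach can be sketched end to end as follows. The deprecated thread::scoped mentioned above was eventually replaced by the sound std::thread::scope (stable since Rust 1.63), which is used here; the function name and chunk contents are illustrative:

```rust
use std::thread;

// Sketch: split one buffer into non-overlapping mutable sub-slices with
// chunks_mut() and hand each to its own scoped thread. No unsafe code
// and no locking, because the slices provably cannot overlap.
fn fill_in_parallel(data: &mut [u32], chunk_size: usize) {
    thread::scope(|s| {
        for (i, chunk) in data.chunks_mut(chunk_size).enumerate() {
            s.spawn(move || {
                // Each thread mutates only its own disjoint region.
                for x in chunk.iter_mut() {
                    *x = i as u32;
                }
            });
        }
    }); // all threads are joined here, before the borrow of `data` ends
}

fn main() {
    let mut buffer = vec![0u32; 8];
    fill_in_parallel(&mut buffer, 4);
    assert_eq!(buffer, vec![0, 0, 0, 0, 1, 1, 1, 1]);
}
```

For a real PixelBuffer, chunk_size would be width * rows_per_block so each block covers whole rows.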
