How to leverage concurrency for processing a large array of data in Rust?

I am a JavaScript developer building a desktop application with Tauri. I am used to single-threaded languages and am trying to get comfortable with the concept of concurrency.
My problem can be summarized as follows. I receive a JSON array of length 500 from the backend. I loop through this array and perform some asynchronous operations, like making a network call, for each element. At the end, I aggregate, structure, and return the data. This entire process takes around 25-35 seconds on my machine.
I wanted to leverage concurrency to reduce the time this operation requires. One possible solution I thought of was to create n threads, say 8, and process the data in parallel.
async fn main() {
    // variables that will hold the final structured data
    let app_data;
    let app_to_bundle_map;
    // create 8 threads
    let thread_1 = thread::spawn(|| {});
    let thread_2 = thread::spawn(|| {});
    // .. and so on
    for i in 1..500 {
        let thread_assigned = i % 8;
        match thread_assigned {
            1 => {
                // process_data() on thread 1 and insert into app_data & app_to_bundle_map.
                // But how do I assign process_data() to the thread's closure?
                // How do I make sure thread 1 is available for use?
            }
            2 => {
                // process_data() on thread 2 and insert into app_data & app_to_bundle_map
            }
            _ => {
                // process_data() on thread _ and insert into app_data & app_to_bundle_map
            }
        }
    }
}
fn process_data(item: String, data: &Data) {
    // perform some heavy operations like
    make_a_network_call(item.url)
    // perform more operations and modify the function argument
    more_processing()
}
async fn make_a_network_call(url: String) -> String {
    let client = reqwest::Client::builder().build().unwrap();
    let res: Result<Response, Error> = client.get(url).send().await;
    match res {
        Ok(res) => {
            // return response
        }
        Err(err) => {
            format!(r#"{{"error": "{placeholder}"}}"#, placeholder = err.to_string())
        }
    }
}
Another option I thought of was dividing my data of size 500 into 8 parts and then processing them in parallel. Is this a better approach? Or are both approaches wrong? If so, what do you suggest is the correct way to solve such problems in Rust? Overall, my final goal is to reduce the time from 25-35 seconds to less than 10 seconds. Looking forward to everybody's insights. Thank you in advance.

Concurrency is hard [citation needed]. What you are trying to do here is handle it manually. That could of course work, but it will be a pain, especially if you are a beginner. Luckily there are fantastic libraries out there that handle the concurrency part for you, so you should check whether they already provide what you need. From your description I am not quite sure whether you are CPU bound or IO bound.
If you are CPU bound, you should look at the rayon crate. It makes it easy to iterate over a collection in parallel.
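Here is a minimal sketch of that, assuming a CPU-bound process function; the function and the data are placeholders, not from your question:

use rayon::prelude::*;

// Stand-in for real CPU-heavy work.
fn process(item: &str) -> usize {
    item.len()
}

fn main() {
    let items: Vec<String> = (0..500).map(|i| i.to_string()).collect();
    // par_iter() splits the work across a thread pool sized to the CPU count.
    let results: Vec<usize> = items.par_iter().map(|s| process(s)).collect();
    println!("processed {} items", results.len());
}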
If you are IO bound, you should look at async Rust. There are many libraries, but I would recommend tokio to begin with. It is production-ready and puts great emphasis on networking. You will, however, need to learn a bit about async Rust, as it requires a different mental model than normal synchronous code.
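For the IO-bound path, here is a minimal runnable sketch, assuming tokio plus the futures crate; the URLs, the fetch helper, and the limit of 8 in-flight requests are illustrative assumptions, not from your question:

use futures::stream::{self, StreamExt};

// Placeholder for the real network call.
async fn fetch(url: String) -> Result<String, reqwest::Error> {
    reqwest::get(&url).await?.text().await
}

#[tokio::main]
async fn main() {
    let urls: Vec<String> = (0..500).map(|i| format!("https://example.com/item/{i}")).collect();
    // buffer_unordered(8) keeps at most 8 requests in flight at once.
    let results: Vec<_> = stream::iter(urls)
        .map(fetch)
        .buffer_unordered(8)
        .collect()
        .await;
    println!("finished {} calls", results.len());
}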
And regardless of which one you choose, you should familiarize yourself with channels. They are a great and easy tool for passing data around, including from one thread to another.
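For example, a minimal sketch with std's mpsc channel (the workers and values are placeholders):

use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();
    for i in 0..8 {
        let tx = tx.clone();
        thread::spawn(move || {
            // Each worker sends its result back to the main thread.
            tx.send(i * 2).unwrap();
        });
    }
    drop(tx); // drop the original sender so the receive loop can finish
    for result in rx {
        println!("got {result}");
    }
}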

Related

Rust: Safe multi threading with recursion

I'm new to Rust.
For learning purposes, I'm writing a simple program to search for files in Linux, and it uses a recursive function:
fn ffinder(base_dir: String, prmtr: &'static str, e: bool, h: bool) -> std::io::Result<()> {
    let mut handle_vec = vec![];
    let pth = std::fs::read_dir(&base_dir)?;
    for p in pth {
        let p2 = p?.path().clone();
        if p2.is_dir() {
            if !h { // search doesn't include hidden directories
                let sstring: String = get_fname(p2.display().to_string());
                let slice: String = sstring[..1].to_string();
                if slice != ".".to_string() {
                    let handle = thread::spawn(move || {
                        ffinder(p2.display().to_string(), prmtr, e, h);
                    });
                    handle_vec.push(handle);
                }
            } else { // search includes hidden directories
                let handle2 = thread::spawn(move || {
                    ffinder(p2.display().to_string(), prmtr, e, h);
                });
                handle_vec.push(handle2);
            }
        } else {
            let handle3 = thread::spawn(move || {
                if compare(rmv_underline(get_fname(p2.display().to_string())), rmv_underline(prmtr.to_string()), e) {
                    println!("File found at: {}", p2.display().to_string().blue());
                }
            });
            handle_vec.push(handle3);
        }
    }
    for h in handle_vec {
        h.join().unwrap();
    }
    Ok(())
}
I've tried to use multithreading (thread::spawn); however, it can create too many threads, blowing past the OS limit, which breaks the program execution.
Is there a way to multithread with recursion, using a safe, limited (fixed) number of threads?
As one of the commenters mentioned, this is an absolutely perfect case for using Rayon. The blog post mentioned doesn't show how Rayon might be used in recursion, only making an allusion to crossbeam's scoped threads with a broken link. However, Rayon provides its own scoped threads implementation that solves your problem as well, in that it only uses as many threads as you have cores available, avoiding the error you ran into.
Here's the documentation for it:
https://docs.rs/rayon/1.0.1/rayon/fn.scope.html
Here's an example from some code I recently wrote. Basically what it does is recursively scan a folder, and each time it nests into a folder it creates a new job to scan that folder while the current thread continues. In my own tests it vastly outperforms a single threaded approach.
// Imports assumed by this excerpt:
use std::fs::{self, DirEntry};
use std::path::{Path, PathBuf};
use std::sync::mpsc::{self, Sender};
use rayon::Scope;

let source = PathBuf::from("/foo/bar/");
let (tx, rx) = mpsc::channel();
rayon::scope(|s| scan(&source, tx, s));

fn scan<'a, U: AsRef<Path>>(
    src: &U,
    tx: Sender<(Result<DirEntry, std::io::Error>, u64)>,
    scope: &Scope<'a>,
) {
    let dir = fs::read_dir(src).unwrap();
    dir.into_iter().for_each(|entry| {
        let info = entry.as_ref().unwrap();
        let path = info.path();
        if path.is_dir() {
            let tx = tx.clone();
            scope.spawn(move |s| scan(&path, tx, s)) // Recursive call here
        } else {
            // dbg!("{}", path.as_os_str().to_string_lossy());
            let size = info.metadata().unwrap().len();
            tx.send((entry, size)).unwrap();
        }
    });
}
I'm not an expert on Rayon, but I'm fairly certain the threading strategy works like this:
Rayon creates a pool of threads to match the number of logical cores you have available in your environment. The first call to the scoped function creates a job that the first available thread "steals" from the queue of jobs available. Each time we make another recursive call, it doesn't necessarily execute immediately, but it creates a new job that an idle thread can then "steal" from the queue. If all of the threads are busy, the job queue just fills up each time we make another recursive call, and each time a thread finishes its current job it steals another job from the queue.
The full code can be found here: https://github.com/1Dragoon/fcp
(Note that the repo is a work in progress, and the code there may well be broken at the time you're reading this.)
A caveat: I'm more of a sysadmin than an actual developer, so I don't know if this is the ideal approach. From Rayon's documentation linked earlier:
scope() is a more flexible building block compared to join(), since a loop can be used to spawn any number of tasks without recursing
The language of that is a bit confusing. I'm not sure what they mean by "without recursing". join() seems intended for when you already know the tasks ahead of time and want to execute them in parallel as threads become available, whereas scope() seems aimed at creating jobs only when you need them, rather than having to know everything in advance. Either that or I'm understanding their meaning backwards.
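For comparison, here is a minimal sketch of the join() style; the divide-and-conquer sum is an illustrative assumption, not code from the repo above:

// Recursive divide-and-conquer with rayon::join; both closures may run
// in parallel, and join() returns once both results are ready.
fn sum(slice: &[i64]) -> i64 {
    if slice.len() < 1024 {
        return slice.iter().sum(); // small enough: do it sequentially
    }
    let (left, right) = slice.split_at(slice.len() / 2);
    let (a, b) = rayon::join(|| sum(left), || sum(right));
    a + b
}

fn main() {
    let data: Vec<i64> = (0..1_000_000).collect();
    println!("{}", sum(&data));
}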

How to convert ParallelIterator back to sequential Iterator?

I'm iterating over several gigabytes of input items from a database. On each input item, I'm doing some CPU-intensive processing which produces one or more new output items, tens of gigabytes in total. The output items are then stored in another database table.
I have gotten a nice speedup by using Rayon for parallel processing. However, the database API is not thread-safe; it's Send but not Sync, so the I/O must be serialized.
Ideally, I would just want to write:
input_database
    .read_items()
    .par_bridge() // Start parallelism.
    .flat_map_iter(|input_item| {
        // produce an Iterator<Item = OutputItem>
    })
    .ser_bridge() // End parallelism. This function does not exist.
    .for_each(|output_item| {
        output_database.write_item(output_item);
    });
Basically I want the opposite of par_bridge(); something that runs on the thread where it's called, reads items from each thread, and produces them serially. But in the current implementation of Rayon, this doesn't seem to exist. I'm not sure whether this is because it's theoretically impossible, or whether it doesn't fit into the current design of the library.
The output is too big to collect it all into a Vec first; it needs to be streamed into the database directly.
By the way, I'm not married to Rayon; if there's another crate that is more suitable, I'm happy to make the switch.
You can wrap your output database in an Arc<Mutex> to prevent parallel accesses:
let output_database = Arc::new(Mutex::new(output_database));
input_database
    .read_items()
    .par_bridge() // Start parallelism.
    .flat_map_iter(|input_item| {
        // produce an Iterator<Item = OutputItem>
    })
    .for_each_with(output_database, |output_database, output_item| {
        output_database.lock().unwrap().write_item(output_item);
    });
I assume the order doesn't matter, so the output data doesn't need to be ordered.
You could use an mpsc::channel to transfer your data from the for_each closure to your database API, e.g.
use std::sync::mpsc;

let (tx, rx) = mpsc::channel();
input_database
    .read_items()
    .par_bridge() // Start parallelism.
    .flat_map_iter(|input_item| {
        // produce an Iterator<Item = OutputItem>
    })
    .for_each(move |output_item| {
        tx.send(output_item).unwrap();
    });
and in a second thread you can use the rx variable to receive the data and write it to the database.
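Here is a minimal runnable sketch of that receiving side, with a stand-in OutputDatabase type; only the channel-plus-writer-thread pattern is the point, the database API is assumed:

use std::sync::mpsc;
use std::thread;

struct OutputDatabase;
impl OutputDatabase {
    fn write_item(&mut self, item: String) {
        println!("wrote {item}"); // stand-in for the real write
    }
}

fn main() {
    let mut output_database = OutputDatabase;
    let (tx, rx) = mpsc::channel::<String>();

    // The writer thread owns the Send-but-not-Sync handle and drains
    // the channel serially.
    let writer = thread::spawn(move || {
        for item in rx {
            output_database.write_item(item);
        }
    });

    // Stand-in for the parallel pipeline: clone tx per worker.
    for i in 0..4 {
        let tx = tx.clone();
        thread::spawn(move || tx.send(format!("item {i}")).unwrap());
    }
    drop(tx); // close the channel so the writer loop terminates

    writer.join().unwrap();
}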

Scala Iterator for multithreading

I am using a Scala Iterator for a waiting loop inside a synchronized block:
anObject.synchronized {
  if (Try(anObject.foo()).isFailure) {
    Iterator.continually {
      anObject.wait()
      Try(anObject.foo())
    }.dropWhile(_.isFailure).next()
  }
  anObject.notifyAll()
}
Is it acceptable to use Iterator with concurrency and multithreading? If not, why, and what should be used instead, and how?
There are some details, if it matters: anObject is a mutable queue, and there are multiple producers and consumers of the queue, so the block above is the code of one such producer or consumer. anObject.foo is a simplified stand-in for a function that either enqueues (for a producer) or dequeues (for a consumer) data to/from the queue.
Iterator is internally mutable, so you have to take that into consideration if you use it in a multi-threaded environment. If you can guarantee that you won't end up in a situation where e.g.
2 threads check hasNext()
one of them calls next() - it happens to be the last element
the other calls next() - NPE
(or similar), then you should be OK. In your example the Iterator doesn't even leave the scope, so the errors shouldn't come from the Iterator.
However, in your code I see an issue with having anObject.wait() and anObject.notifyAll() next to each other - if you call .wait then you never reach the .notifyAll that would unblock it. You can check in the REPL that this hangs:
@ val anObject = new Object { def foo() = throw new Exception }
anObject: {def foo(): Nothing} = ammonite.$sess.cmd21$$anon$1@126ae0ca
@ anObject.synchronized {
    if (Try(anObject.foo()).isFailure) {
      Iterator.continually {
        anObject.wait()
        Try(anObject.foo())
      }.dropWhile(_.isFailure).next()
    }
    anObject.notifyAll()
  }
// waits indefinitely
I would suggest changing the design to NOT rely on wait and notifyAll. However, from your code it is hard to say what you want to achieve, so I cannot tell whether this is more of a Promise-Future case, a monix.Observable, a monix.Task, or something else.
If your use case is a queue with producers and consumers, then it sounds like a use case for reactive streams - e.g. FS2 + Monix, but it could be FS2 + IO or something from Akka Streams:
val queue: Queue[Task, Item] // depending on the use case the queue might need to be bounded

// in one part of the application
queue.enqueue1(item) // Task[Unit]

// in another part of the application
queue
  .dequeue
  .evalMap { item =>
    // ...
    result: Task[Result]
  }
  .compile
  .drain
This approach would require some change in how you think about designing an application, because you would no longer work with threads directly; instead you would design a flow of data, declaring what is sequential and what can be done in parallel, with threads becoming just an implementation detail.

Join futures with limited concurrency

I have a large vector of Hyper HTTP request futures and want to resolve them into a vector of results. Since there is a limit on the maximum number of open files, I want to limit concurrency to N futures.
I've experimented with Stream::buffer_unordered, but it seems to execute the futures one by one.
We've used code like this in a project to avoid opening too many TCP sockets. These futures have Hyper futures within, so it seems exactly the same case.
// Convert the iterator into a `Stream`. We will process
// `PARALLELISM` futures at the same time, but with no specified
// order.
let all_done =
    futures::stream::iter(iterator_of_futures.map(Ok))
        .buffer_unordered(PARALLELISM);

// Everything after here is just using the stream in
// some manner, not directly related

let mut successes = Vec::with_capacity(LIMIT);
let mut failures = Vec::with_capacity(LIMIT);

// Pull values off the stream, dividing them into success and
// failure buckets.
let mut all_done = all_done.into_future();
loop {
    match core.run(all_done) {
        Ok((None, _)) => break,
        Ok((Some(v), next_all_done)) => {
            successes.push(v);
            all_done = next_all_done.into_future();
        }
        Err((v, next_all_done)) => {
            failures.push(v);
            all_done = next_all_done.into_future();
        }
    }
}
This is used in a piece of example code, so the event loop (core) is explicitly driven. Watching the number of file handles used by the program showed that it was capped. Additionally, before this bottleneck was added, we quickly ran out of allowable file handles, whereas afterward we did not.
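That snippet targets the old futures 0.1 API. On a modern tokio runtime, one alternative sketch caps concurrency with a Semaphore instead; the limit of 10 and the task body here are illustrative assumptions:

use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // At most 10 tasks may hold a permit (i.e. be "in flight") at once.
    let semaphore = Arc::new(Semaphore::new(10));
    let mut handles = Vec::new();
    for i in 0..100 {
        // The loop itself waits here until a permit frees up.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            let _permit = permit; // released when the task finishes
            // ... the actual HTTP request would go here ...
            i
        }));
    }
    for handle in handles {
        let _result = handle.await.unwrap();
    }
}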

What language level support (if any) does Swift have for asynchronous programming?

Asynchronous programming is a must for responsive user interfaces when applications have to communicate over unpredictable networks (e.g. smartphone applications). The user interface must remain responsive while waiting for results to come back from servers somewhere on the internet.
In most languages, the application programmer has to implement their own state machines (maybe using closures) to respond to asynchronous callbacks and/or coordinate multiple threads using locks.
Both of these are very error prone and not for the faint hearted!
(C# introduced the async keyword to help with this; only time (at least 5 years) will tell if it is a good solution.)
Does Swift have any built in support to assist the writing of asynchronous code?
While it isn't a built-in language feature, it may be interesting to note that it's possible to implement C# style async/await for Swift, and that because of the special syntax afforded to the last closure argument of a function call, it even looks like it might be part of the language.
If anyone is interested, you can get code for this on Bitbucket. Here's a quick taster of what's possible:
let task = async { () -> () in
    let fetch = async { (t: Task<NSData>) -> NSData in
        let req = NSURLRequest(URL: NSURL.URLWithString("http://www.google.com"))
        let queue = NSOperationQueue.mainQueue()
        var data: NSData!
        NSURLConnection.sendAsynchronousRequest(req,
            queue: queue,
            completionHandler: { (r: NSURLResponse!, d: NSData!, error: NSError!) -> Void in
                data = d
                Async.wake(t)
            })
        Async.suspend()
        return data!
    }
    let data = await(fetch)
    let str = NSString(bytes: data.bytes, length: data.length,
        encoding: NSUTF8StringEncoding)
    println(str)
}
Also, if you want something like @synchronized, try this:
func synchronized(obj: AnyObject, blk: () -> ()) {
    objc_sync_enter(obj)
    blk()
    objc_sync_exit(obj)
}

var str = "A string we can synchronise on"
synchronized(str) {
    println("The string is locked here")
}
Swift's approach to asynchronous programming is the same as Objective C's: use Grand Central Dispatch. You can pass closures to gcd dispatch_ functions, just as in ObjC. However, for aesthetic reasons, you can also pass your closure (block) after the close parentheses:
dispatch_async(dispatch_get_main_queue()) {
    println("async hello world")
}
