Parallel collection for unique string counting

I need a highly parallel collection, something like Java's concurrent skip list. The general task: I am working on a server that counts the unique words in all the messages it receives. I don't care what the words are, only how many there are. Once in a while I get a get_count message, at which point I reset the counter and start all over.
I am always bottlenecked on the post_words function. The same thing runs in 5s in Java and 80s in Rust. I have tried the experimental skiplist set from crossbeam and got the same result. The other issue is string allocation. Any ideas?
//Dashset from https://docs.rs/dashmap/4.0.2/dashmap/struct.DashSet.html
type Words = DashSet<String>;
let set: Arc<Words> = Arc::new(DashSet::with_capacity(100000));
// for each new socket I create
let set = set.clone();
//Word processing
fn post_words(client: i32, data: Vec<u8>, db: &Words) -> Response {
let mut decoder = GzDecoder::new(data.as_slice());
let mut input = String::new();
decoder.read_to_string(&mut input).unwrap();
//The bottleneck
for word in input.split_whitespace() {
db.insert(String::from(word));
}
let mut response = Response::new();
response.status = Response_Status::OK;
return response
}

Since I don't need to know the words, I can keep only a hash of each one. Now I don't have to allocate a String for each word. This improved my solution from 80s to 1.7s.
type Words = DashSet<u64>;
async fn post_words(client: i32, data: Vec<u8>, db: &Words) -> Response {
let mut decoder = GzDecoder::new(data.as_slice());
let mut input = String::new();
decoder.read_to_string(&mut input).unwrap();
for word in input.split_whitespace() {
let mut s = DefaultHasher::new();
word.hash(&mut s);
db.insert( s.finish());
}
let mut response = Response::new();
response.status = Response_Status::OK;
return response
}
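The hashing idea above can be tried without any external crates. The following is only a minimal sketch under stated assumptions: a std Mutex<HashSet<u64>> stands in for DashSet, and hash_word and count_unique_words are hypothetical names that do not appear in the original code. Each thread builds a local set and merges it once, so the lock is taken once per message instead of once per word. Keep in mind that two different words can in principle hash to the same u64, which would undercount by one.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;
use std::thread;

/// Hash one word to a u64 without allocating a String (hypothetical helper).
fn hash_word(word: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    word.hash(&mut hasher);
    hasher.finish()
}

/// Count distinct words across several messages, one scoped thread per message.
/// A std Mutex<HashSet<u64>> is an assumption standing in for DashSet<u64>.
fn count_unique_words(messages: &[&str]) -> usize {
    let shared: Mutex<HashSet<u64>> = Mutex::new(HashSet::new());
    thread::scope(|scope| {
        for message in messages {
            let shared = &shared;
            scope.spawn(move || {
                // Build a thread-local set first, then merge it under one lock.
                let local: HashSet<u64> =
                    message.split_whitespace().map(hash_word).collect();
                shared.lock().unwrap().extend(local);
            });
        }
    });
    shared.into_inner().unwrap().len()
}
```

Calling count_unique_words with a slice of already-decompressed message bodies returns the number of distinct hashes seen; resetting the counter amounts to starting again with an empty set.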

Related

What is the most efficient way to read the first line of a file separately to the rest of the file?

I am trying to figure out the best way to read the contents of a file. The problem is that I need to read the first line separately, because it has to be parsed as a usize, which I need for the dimension of an Array2 from ndarray.
I tried the following:
use ndarray::prelude::*;
use std::io::{BufRead, BufReader};
use std::fs;
fn read_inputfile(geom_filename: &str) -> (Vec<i32>, Array2<f64>, usize) {
//* Step 1: Read the coord data from input
println!("Inputfile: {geom_filename}");
let geom_file = fs::File::open(geom_filename).expect("Geometry file not found!");
let geom_file_reader = BufReader::new(geom_file);
let geom_file_lines: Vec<String> = geom_file_reader
.lines()
.map(|line| line.expect("Failed to read line!"))
.collect();
//* Read no of atoms first for array size
let no_atoms: usize = geom_file_lines[0].parse().unwrap();
let mut Z_vals: Vec<i32> = Vec::new();
let mut geom_matr: Array2<f64> = Array2::zeros((no_atoms, 3));
for (atom_idx, line) in geom_file_lines[1..].iter().enumerate() {
//* into_iter would do the same
let line_split: Vec<&str> = line.split_whitespace().collect();
Z_vals.push(line_split[0].parse().unwrap());
(0..3).for_each(|cart_coord| {
geom_matr[(atom_idx, cart_coord)] = line_split[cart_coord + 1].parse().unwrap();
});
}
(Z_vals, geom_matr, no_atoms)
}
Does this not kind of defeat the purpose of the BufReader? I am still relatively new to Rust, so I might have misunderstood something, but I thought that one uses the BufReader so that the whole file does not need to be read into memory.
With the Vec<String> for geom_file_lines I am most likely loading the whole file into memory again, right?

Does this not kind of defeat the purpose of the BufReader?
It very much does, yes. lines() gives you an iterator, so you can read them without loading all of them into memory at once. You force them all into memory, though, as you call collect().
Simply don't do that. Use the iterator as an iterator, especially since you convert the Vec back to an iterator later via geom_file_lines[1..].iter().
Like this:
use ndarray::prelude::*;
use std::fs;
use std::io::{BufRead, BufReader};
pub fn read_inputfile(geom_filename: &str) -> (Vec<i32>, Array2<f64>, usize) {
//* Step 1: Read the coord data from input
println!("Inputfile: {geom_filename}");
let geom_file = fs::File::open(geom_filename).expect("Geometry file not found!");
let geom_file_reader = BufReader::new(geom_file);
let mut geom_file_lines = geom_file_reader
.lines()
.map(|line| line.expect("Failed to read line!"));
//* Read no of atoms first for array size
let no_atoms: usize = geom_file_lines.next().unwrap().parse().unwrap();
let mut z_vals: Vec<i32> = Vec::new();
let mut geom_matr: Array2<f64> = Array2::zeros((no_atoms, 3));
for (atom_idx, line) in geom_file_lines.enumerate() {
let line_split: Vec<&str> = line.split_whitespace().collect();
z_vals.push(line_split[0].parse().unwrap());
(0..3).for_each(|cart_coord| {
geom_matr[(atom_idx, cart_coord)] = line_split[cart_coord + 1].parse().unwrap();
});
}
(z_vals, geom_matr, no_atoms)
}
You can apply the same logic in your for loop:
for (atom_idx, line) in geom_file_lines.enumerate() {
let mut line_split = line.split_whitespace();
z_vals.push(line_split.next().unwrap().parse().unwrap());
(0..3).for_each(|cart_coord| {
geom_matr[(atom_idx, cart_coord)] = line_split.next().unwrap().parse().unwrap();
});
}
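To try the streaming approach without a file or ndarray, here is a minimal sketch under stated assumptions: read_geometry is a hypothetical name, it accepts any BufRead so an in-memory buffer can stand in for the file, and a Vec<[f64; 3]> stands in for the Array2. The first line is pulled off with next(), and the remaining lines are consumed one at a time.

```rust
use std::io::BufRead;

/// Parse a geometry listing: the first line is the atom count and each
/// following line is "Z x y z". Only one line is held in memory at a time.
fn read_geometry<R: BufRead>(reader: R) -> (Vec<i32>, Vec<[f64; 3]>) {
    let mut lines = reader
        .lines()
        .map(|line| line.expect("Failed to read line!"));
    // Read the first line separately to learn how many atoms follow.
    let no_atoms: usize = lines
        .next()
        .expect("input is empty")
        .trim()
        .parse()
        .expect("first line must be the atom count");
    let mut z_vals: Vec<i32> = Vec::with_capacity(no_atoms);
    let mut coords: Vec<[f64; 3]> = Vec::with_capacity(no_atoms);
    for line in lines.take(no_atoms) {
        let mut fields = line.split_whitespace();
        z_vals.push(fields.next().expect("missing Z").parse().expect("bad Z"));
        let mut xyz = [0.0_f64; 3];
        for slot in xyz.iter_mut() {
            *slot = fields
                .next()
                .expect("missing coordinate")
                .parse()
                .expect("bad coordinate");
        }
        coords.push(xyz);
    }
    (z_vals, coords)
}
```

For a real file, pass BufReader::new(file); for a quick experiment, std::io::Cursor::new("2\n8 0.0 0.0 0.1\n1 0.0 1.5 -0.5\n") works as the reader.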

Why Sequential is faster than parallel?

The code counts the frequency of each word in an article. I implemented three versions: sequential, multi-threaded, and multi-threaded with a thread pool.
I measured the running time of all three and found that the sequential version is the fastest. I use the article (data) at 37423.txt; the code is at play.rust-lang.org.
Below are just the single-threaded and multi-threaded versions (without the thread pool version):
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::SystemTime;
pub fn word_count(article: &str) -> HashMap<String, i64> {
let now1 = SystemTime::now();
let mut map = HashMap::new();
for word in article.split_whitespace() {
let count = map.entry(word.to_string()).or_insert(0);
*count += 1;
}
let after1 = SystemTime::now();
let d1 = after1.duration_since(now1);
println!("single: {:?}", d1.as_ref().unwrap());
map
}
fn word_count_thread(word_vec: Vec<String>, counts: &Arc<Mutex<HashMap<String, i64>>>) {
let mut p_count = HashMap::new();
for word in word_vec {
*p_count.entry(word).or_insert(0) += 1;
}
let mut counts = counts.lock().unwrap();
for (word, count) in p_count {
*counts.entry(word.to_string()).or_insert(0) += count;
}
}
pub fn mt_word_count(article: &str) -> HashMap<String, i64> {
let word_vec = article
.split_whitespace()
.map(|x| x.to_owned())
.collect::<Vec<String>>();
let count = Arc::new(Mutex::new(HashMap::new()));
let len = word_vec.len();
let q1 = len / 4;
let q2 = len / 2;
let q3 = q1 * 3;
let part1 = word_vec[..q1].to_vec();
let part2 = word_vec[q1..q2].to_vec();
let part3 = word_vec[q2..q3].to_vec();
let part4 = word_vec[q3..].to_vec();
let now2 = SystemTime::now();
let count1 = count.clone();
let count2 = count.clone();
let count3 = count.clone();
let count4 = count.clone();
let handle1 = thread::spawn(move || {
word_count_thread(part1, &count1);
});
let handle2 = thread::spawn(move || {
word_count_thread(part2, &count2);
});
let handle3 = thread::spawn(move || {
word_count_thread(part3, &count3);
});
let handle4 = thread::spawn(move || {
word_count_thread(part4, &count4);
});
handle1.join().unwrap();
handle2.join().unwrap();
handle3.join().unwrap();
handle4.join().unwrap();
let x = count.lock().unwrap().clone();
let after2 = SystemTime::now();
let d2 = after2.duration_since(now2);
println!("muti: {:?}", d2.as_ref().unwrap());
x
}
My results are: single: 7.93ms, muti: 15.78ms, threadpool: 25.33ms
I split the article into parts before starting the timer.
I want to know whether there is a problem with the code.

First, you may want to know that the single-threaded version spends most of its time parsing whitespace (and on I/O, but the file is small, so it will be in the OS cache on the second run). At most ~20% of the runtime is the counting that you parallelized, according to a cargo flamegraph of the single-threaded code.
The multi-threaded version is a mess of thread creation, copying, and hashmap overhead. To make sure it's not a "too little data" problem, I used 100x your input txt file, and I'm measuring a 2x slowdown over the single-threaded version. According to the time command, it uses 2x CPU time compared to wall-clock time, so it seems to do some parallel work.
I'm not sure what to make of the profile exactly, but it's clear that memory management overhead has increased a lot. You seem to be copying strings for each hashmap, while an ideal word count would probably do zero string copying while counting.
More generally I think it's a bad idea to join the results in the threads - the way you do it (as opposed to a map-reduce pattern) the thread needs a global lock, so you could just pass the results back to the main thread instead for combining. I'm not sure if this is the main problem here, though.
Optimization
To avoid string copying, use HashMap<&str, i64> instead of HashMap<String, i64>. This requires some changes (lifetime annotations and thread::scope()). It makes mt_word_count() about 6x faster compared to the old version.
With a large input I'm now measuring a 4x speedup compared to word_count(), which is the best you can hope for with four threads. The multi-threaded version is now also faster overall, but only by ~20% or so, for the reasons explained above. (Note that the single-threaded baseline has also profited from the same &str optimization. Also, many things could still be improved or optimized, but I'll stop here.)
fn word_count_thread<'t>(word_vec: Vec<&'t str>, counts: &Arc<Mutex<HashMap<&'t str, i64>>>) {
let mut p_count = HashMap::new();
for word in word_vec {
*p_count.entry(word).or_insert(0) += 1;
}
let mut counts = counts.lock().unwrap();
for (word, count) in p_count {
*counts.entry(word).or_insert(0) += count;
}
}
pub fn mt_word_count<'t>(article: &'t str) -> HashMap<&'t str, i64> {
let word_vec = article.split_whitespace().collect::<Vec<&str>>();
// (skipping 16 unmodified lines)
let x = thread::scope(|scope| {
let handle1 = scope.spawn(move || {
word_count_thread(part1, &count1);
});
let handle2 = scope.spawn(move || {
word_count_thread(part2, &count2);
});
let handle3 = scope.spawn(move || {
word_count_thread(part3, &count3);
});
let handle4 = scope.spawn(move || {
word_count_thread(part4, &count4);
});
handle1.join().unwrap();
handle2.join().unwrap();
handle3.join().unwrap();
handle4.join().unwrap();
count.lock().unwrap().clone()
});
let after2 = SystemTime::now();
let d2 = after2.duration_since(now2);
println!("muti: {:?}", d2.as_ref().unwrap());
x
}
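The earlier suggestion to pass results back to the main thread instead of merging under a global lock could look roughly like the sketch below. It is not the code that was measured above: count_chunk, word_count_map_reduce, and the n_threads parameter are hypothetical names chosen for illustration. Each scoped thread returns its own map through its join handle, and the calling thread does the merge, so no Arc or Mutex is involved.

```rust
use std::collections::HashMap;
use std::thread;

/// Count the words in one chunk; the words are borrowed, so nothing is copied.
fn count_chunk<'t>(words: &[&'t str]) -> HashMap<&'t str, i64> {
    let mut counts = HashMap::new();
    for &word in words {
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}

/// Map-reduce word count: every scoped thread returns its own map and the
/// calling thread merges them.
fn word_count_map_reduce<'t>(article: &'t str, n_threads: usize) -> HashMap<&'t str, i64> {
    let n_threads = n_threads.max(1);
    let words: Vec<&'t str> = article.split_whitespace().collect();
    // Round up so that at most n_threads chunks are produced; never zero.
    let chunk_len = ((words.len() + n_threads - 1) / n_threads).max(1);
    let mut total: HashMap<&'t str, i64> = HashMap::new();
    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = words
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || count_chunk(chunk)))
            .collect();
        // Reduce step: merge each partial map on the calling thread.
        for handle in handles {
            for (word, count) in handle.join().unwrap() {
                *total.entry(word).or_insert(0) += count;
            }
        }
    });
    total
}
```

Whether this beats the locked version depends on the input; the split_whitespace pass still runs on a single thread, which the answer identified as the dominant cost.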

How to create threads in a for loop and get the return value from each?

I am writing a program that pings a set of targets 100 times, and stores each RTT value returned from the ping into a vector, thus giving me a set of RTT values for each target. Say I have n targets, I would like all of the pinging to be done concurrently. The rust code looks like this:
let mut sample_rtts_map = HashMap::new();
for addr in targets.to_vec() {
let mut sampleRTTvalues: Vec<f32> = vec![];
//sample_rtts_map.insert(addr, sampleRTTvalues);
thread::spawn(move || {
while sampleRTTvalues.len() < 100 {
let sampleRTT = ping(addr);
sampleRTTvalues.push(sampleRTT);
// thread::sleep(Duration::from_millis(5000));
}
});
}
The hashmap is used to tell which vector of values belongs to which target. The problem is, how do I retrieve the updated sampleRTTvalues from each thread after the thread is done executing? I would like something like:
let (name, sampleRTTvalues) = thread::spawn(...)
The name being the name of the thread, and sampleRTTvalues being the vector. However, since I'm creating the threads in a for loop, each thread is instantiated the same way, so how do I differentiate them?
Is there some better way to do this? I've looked into schedulers, futures, etc., but it seems my case can be handled with simple threads.

I got the desired behavior with the following code:
use std::thread;
use std::sync::mpsc;
use std::collections::HashMap;
use rand::Rng;
use std::net::{Ipv4Addr,Ipv6Addr,IpAddr};
const RTT_ONE: IpAddr = IpAddr::V4(Ipv4Addr::new(127,0,0,1));
const RTT_TWO: IpAddr = IpAddr::V6(Ipv6Addr::new(0,0,0,0,0,0,0,1));
const RTT_THREE: IpAddr = IpAddr::V4(Ipv4Addr::new(127,0,1,1));//I don't know how IP addresses work; forgive me if this is invalid, but you get the idea
fn ping(address: IpAddr) -> f32 {
rand::thread_rng().gen_range(5.0..107.0)
}
fn main() {
let targets = [RTT_ONE,RTT_TWO,RTT_THREE];
let mut sample_rtts_map: HashMap<IpAddr,Vec<f32>> = HashMap::new();
for addr in targets.into_iter() {
let (sample_values,moved_values) = mpsc::channel();
let mut sampleRTTvalues: Vec<f32> = vec![];
thread::spawn(move || {
while sampleRTTvalues.len() < 100 {
let sampleRTT = ping(addr);
sampleRTTvalues.push(sampleRTT);
//thread::sleep(Duration::from_millis(5000));
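}
//send the finished vector back; without this, recv() below would block forever
sample_values.send(sampleRTTvalues).unwrap();
});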
sample_rtts_map.insert(addr,moved_values.recv().unwrap());
}
}
Note that use rand::Rng can be removed in a real implementation, as it is only there to make the example work. The channel passes data from the spawned thread to the main thread, and recv waits until the data is ready before it is added to the hash map. Because recv is called inside the loop, each target finishes before the next thread is spawned. If that is a problem (it takes a long time, etc.), you can use try_recv instead of recv, which returns immediately with a recoverable error if the value is not ready yet, or with the value if it is.

You can use a std::sync::mpsc channel to collect your data:
use std::collections::HashMap;
use std::sync::mpsc::channel;
use std::thread;
fn ping(_: &str) -> f32 { 0.0 }
fn main() {
let targets = ["a", "b"]; // just for example
let mut sample_rtts_map = HashMap::new();
let (tx, rx) = channel();
for addr in targets {
let tx = tx.clone();
thread::spawn(move || {
for _ in 0..100 {
let sampleRTT = ping(addr);
tx.send((addr, sampleRTT)).unwrap();
}
});
}
drop(tx);
// exit the loop once every thread's tx has been dropped
while let Ok((addr, sampleRTT)) = rx.recv() {
sample_rtts_map.entry(addr).or_insert(vec![]).push(sampleRTT);
}
println!("sample_rtts_map: {:?}", sample_rtts_map);
}
This runs all pinging threads concurrently and collects the data in the main thread synchronously, so we can avoid using locks. Do not forget to drop the sender in the main thread after cloning it for every pinging thread, or the main thread will hang forever.
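Since the question asks for something like let (name, values) = thread::spawn(...), it may help to see that thread::spawn already returns a JoinHandle whose join() yields the closure's return value. The sketch below is an alternative to the channel, not a replacement for the answers above; ping is a hypothetical stub returning a fixed value, and collect_rtts is an illustrative name. Pairing each handle with its address at spawn time is what tells the threads apart.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical stand-in for a real ping; returns a fixed RTT in ms.
fn ping(_addr: &str) -> f32 {
    1.0
}

/// Spawn one thread per target, then join each handle to get its Vec back.
fn collect_rtts(targets: &[&'static str], samples: usize) -> HashMap<&'static str, Vec<f32>> {
    // Spawn everything first so that the targets are pinged concurrently.
    let handles: Vec<_> = targets
        .iter()
        .map(|&addr| {
            let handle = thread::spawn(move || {
                (0..samples).map(|_| ping(addr)).collect::<Vec<f32>>()
            });
            // Keep the address next to its handle to identify the result.
            (addr, handle)
        })
        .collect();
    // join() blocks until that thread finishes and yields its return value.
    handles
        .into_iter()
        .map(|(addr, handle)| (addr, handle.join().expect("ping thread panicked")))
        .collect()
}
```

The difference from the channel version is that results arrive only when a thread has finished all of its samples, whereas the channel delivers each sample as soon as it is measured.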

How can I obtain AMQP message headers values using lapin?

How can I obtain the values in the message headers of an AMQP message via crate lapin (RabbitMQ client)?
I am trying to obtain the values of message headers from the lapin::message::Delivery struct.
I am using Delivery.properties.headers() which returns Option<amq_protocol_types::FieldTable>
How do I read the values in the FieldTable?
Are there any examples that show how to do so?
let mut consumer = channel
.basic_consume(
"hello",
"my_consumer",
BasicConsumeOptions::default(),
FieldTable::default(),
)
.await?;
while let Some(delivery) = consumer.next().await {
let (_, delivery2) = delivery.expect("error in consumer");
message_cnt+=1;
let payload_str:String = match String::from_utf8(delivery2.data.to_owned()) {//delivery.data is of type Vec<u8>
Ok(v) => v,
Err(e) => panic!("Invalid UTF-8 sequence: {}", e),
};
let log_message:String=format!("message_cnt is:{}, delivery_tag is:{}, exchange is:{}, routing_key is:{}, redelivered is:{}, properties is:'{:?}', received data is:'{:?}'"
,&message_cnt
,&delivery2.delivery_tag
,&delivery2.exchange
,&delivery2.routing_key
,&delivery2.redelivered
,&delivery2.properties//lapin::BasicProperties Contains the properties and the headers of the message.
,&payload_str
);
let amqp_msg_headers_option:&Option<amq_protocol_types::FieldTable>=delivery2.properties.headers();
let amqp_msg_headers:&amq_protocol_types::FieldTable=match amqp_msg_headers_option{
None=>{
let bt=backtrace::Backtrace::new();
let log_message=format!(">>>>>At receive_message_from_amqp(), message received has no headers, backtrace is '{:?}'",&bt);
slog::error!(my_slog_logger,"{}",log_message);
let custom_error=std::io::Error::new(std::io::ErrorKind::Other, &log_message.to_string()[..]);
return std::result::Result::Err(Box::new(custom_error));
}
,Some(amqp_msg_headers)=>{amqp_msg_headers}
};
if amqp_msg_headers.contains_key("worker_id"){
//let worker_id2:String=amqp_msg_headers.get("worker_id").into();
let amqp_msg_headers_btm:&std::collections::BTreeMap<amq_protocol_types::ShortString, lapin::types::AMQPValue>=amqp_msg_headers.inner();
let worker_id2_option=amqp_msg_headers_btm.get(lapin::types::AMQPValue::ShortString("worker_id".into()));
}
delivery2
.ack(BasicAckOptions::default())
.await?;
}

Well, I found a solution of sorts, which relies on serde_json. I used serde_json to convert the amq_protocol_types::FieldTable, obtained from the properties field of lapin::message::Delivery, into a serde_json::Value.
I then processed the serde_json::Value to obtain the header key and value.
Below is the function I have written for the above logic.
fn extract_amqp_msg_headers_values(
my_slog_logger:&slog::Logger
,amqp_msg_headers_basic_properties:&lapin::BasicProperties
)->std::result::Result<std::collections::HashMap<String,String>, Box<std::io::Error>> {
let mut amqp_msg_headers_hm:std::collections::HashMap<String,String>=std::collections::HashMap::new();
let amqp_msg_headers_option=amqp_msg_headers_basic_properties.headers();
let amqp_msg_headers=match amqp_msg_headers_option{
None=>{
let bt=backtrace::Backtrace::new();
let log_message=format!(">>>>>At extract_amqp_msg_headers_values(), message received has no headers, backtrace is '{:?}'",&bt);
slog::error!(my_slog_logger,"{}",log_message);
let custom_error=std::io::Error::new(std::io::ErrorKind::Other, &log_message.to_string()[..]);
return std::result::Result::Err(Box::new(custom_error));
}
,Some(amqp_msg_headers)=>{amqp_msg_headers}
};
//let mut worker_id2:String="".to_owned();
let amqp_msg_headers_btm:&std::collections::BTreeMap<amq_protocol_types::ShortString, lapin::types::AMQPValue>=amqp_msg_headers.inner();
let amqp_msg_headers_serde_value_option=serde_json::to_value(&amqp_msg_headers_btm);
let amqp_msg_headers_serde_value:serde_json::value::Value=match amqp_msg_headers_serde_value_option{
Err(err)=>{
let bt=backtrace::Backtrace::new();
let log_message=format!(">>>>>At extract_amqp_msg_headers_values(), pos 2b, some error has been encountered transforming amqp_msg_headers to json, amqp_msg_headers is:'{:?}', error is:'{:?}', backtrace is '{:?}'",&amqp_msg_headers,&err,&bt);
slog::error!(my_slog_logger,"{}",log_message);
let custom_error=std::io::Error::new(std::io::ErrorKind::Other, &log_message.to_string()[..]);
return std::result::Result::Err(Box::new(custom_error));
}
,Ok(serde_json_value)=>{serde_json_value}
};
let serde_json_map:&serde_json::Map<String,serde_json::Value>=amqp_msg_headers_serde_value.as_object().unwrap();
let amqp_msg_headers_serde_value_key_vec2:Vec<String>=serde_json_map.keys().cloned().collect();
for amqp_msg_headers_serde_value_key2 in amqp_msg_headers_serde_value_key_vec2{
let amqp_msg_headers_serde_value2:&serde_json::value::Value=&serde_json_map[&amqp_msg_headers_serde_value_key2];
let serde_json_map3:&serde_json::Map<String,serde_json::Value>=amqp_msg_headers_serde_value2.as_object().unwrap();
let some_header_key:String=remove_quotes(&amqp_msg_headers_serde_value_key2).to_owned();
let amqp_msg_headers_serde_value_key_vec3:Vec<String>=serde_json_map3.keys().cloned().collect();
for amqp_msg_headers_serde_value_key3 in amqp_msg_headers_serde_value_key_vec3{
let some_header_value:String=remove_quotes(&serde_json_map3[&amqp_msg_headers_serde_value_key3].to_string()).to_owned();
amqp_msg_headers_hm.insert(some_header_key.to_owned(),some_header_value.to_owned());
}
}
//slog::info!(my_slog_logger," amqp_msg_headers_hm is:'{:?}'",&amqp_msg_headers_hm);
Ok(amqp_msg_headers_hm)
}
/// This is a general purpose function to strip double quotes surrounding the supplied &str value.
pub fn remove_quotes(some_str:&str)->&str
{
// The trimmed string is a slice to the original string, hence no new
// allocation is performed
let chars_to_trim: &[char] = &['"',' ', ','];
let some_str: &str = some_str.trim_matches(chars_to_trim);
//println!("some_str is:'{}'", some_str);
return some_str;
}

Why do String::from(&str) and &str.to_string() behave differently in Rust?

fn main() {
let string = "Rust Programming".to_string();
let mut slice = &string[5..12].to_string(); // Doesn't work...why?
let mut slice = string[5..12].to_string(); // works
let mut slice2 = String::from(&string[5..12]); // Works
slice.push('p');
println!("slice: {}, slice2: {}, string: {}", slice,slice2,string);
}
What is happening here? Please explain.

The main issue here is that & has lower precedence than a method call.
So the actual code is
let mut slice = &(string[5..12].to_string());
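So you are taking a shared reference to a temporary String. The temporary is kept alive for as long as slice is, but slice has type &String, so slice.push('p') cannot mutate the String through it.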

You should wrap the reference in parentheses and call the method on the result.
fn main() {
let string = "Rust Programming".to_string();
let mut slice = (&string[5..12]).to_string(); // ! -- this should work -- !
let mut slice = string[5..12].to_string(); // works
let mut slice2 = String::from(&string[5..12]); // Works
slice.push('p');
println!("slice: {}, slice2: {}, string: {}", slice,slice2,string);
}
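The difference between the two spellings becomes visible once the types are written out. This is a small sketch with a hypothetical function name, slice_copies, that keeps both forms side by side; the explicit type annotations are added here for illustration and are not part of the original snippets.

```rust
/// Returns (copy read through a reference, owned copy that was then mutated).
fn slice_copies(string: &str) -> (String, String) {
    // Parsed as &(string[5..12].to_string()): a shared reference to a
    // temporary String. Reading through it is fine; push() would not compile.
    let borrowed: &String = &string[5..12].to_string();
    // Parenthesized: take the &str slice first, then copy it into an owned String.
    let mut owned: String = (&string[5..12]).to_string();
    owned.push('p');
    (borrowed.clone(), owned)
}
```

With the input "Rust Programming", bytes 5..12 are "Program", so the first value is the unmodified slice and the second has the extra 'p' appended.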
