Unknown Size of HashMap | Passing Dynamic Functions into Multithreading - multithreading

I've written a program to process my financial transactions but it's starting to run a little slower now that I'm adding more data to it. I decided to write it in Rust. I'm fluent in JS, TS, Python, bash/shell scripting. I need to feed the entire history into the application at this time. Currently my program is single-threaded. My thought is that if I use multi-threading strategically I should be able to cut the run-time down.
Here is how I attempted to implement multi-threading:
for row in lines[1..].iter() {
thread::spawn(|| {
process_transaction(row, &rules)
});
}
Rules is a hashmap that looks like this.
type CustomRule = Box<dyn Fn(&Transaction) -> &'static str>;
type RuleHashMap = HashMap<&'static str, CustomRule>;
Row is a Transaction structure w/ stuff and some functions implemented into it. (Regex, gt/lt matching)
The key will be a regex string and the value will be that custom function. This custom function has to be put in a Box because it's size needs to go onto the heap?
The idea is that I should be able to just quickly iterate over a set of regex patterns then run the corresponding additional logic needed to process that transaction. I am not mutating Transaction or Rules in any way and I'm only printing a result. Here's an example rule:
rules.insert(r"(?i)7-ELEVEN|EXXONMOBIL|CIRCLE K|SUNOCO|SHEETZ|A-PLUS|RACEWAY|SHELLSERVICE|Shell SERVICE|QUICK NEASY|QUICK N EASY|FAS MART|BP|ROYAL MART|CITG|gas|wawa", Box::new(|t:&Transaction|{
match t.less_than(15.0) {
true => "expenses:convience",
false => "expenses:business:gas"
}
}));
The compiler suggested I update the type to implement Send
type CustomRule = dyn Fn(&Transaction) + Send + 'static;
I added that but now it says it does not know the size of the HashMap at compile time. I get this message from the compiler
= help: the trait `Sized` is not implemented for `(dyn for<'r> Fn(&'r Transaction) + Send + 'static)`
note: required by a bound in `HashMap`
What is this? I'm new to lower level programming like this and I want to understand what is actually happening versus just copying blindly. Especially when playing w/ threads. Was putting that custom rule function into a Box<> type the wrong move? Am I making this more complicated than needed.

If all of your closures are "pure" functions that don't capture any state, then you can just store fn pointers in your map instead:
let mut rules: HashMap<&'static str, fn(&Transaction) -> &'static str> = HashMap::new();
rules.insert(r"some long regex", |t: &Transaction|{
"expenses:convience"
});
You say
The idea is that I should be able to just quickly iterate over a set of regex patterns then run the corresponding additional logic needed to process that transaction.
If you're actually iterating, why even use a HashMap? Just use a Vec or array of tuples like so:
let mut rules: [(&'static str, fn(&Transaction) -> &'static str); 1] = [
(r"some long regex", |t: &Transaction|{
"expenses:convience"
}),
];
You probably also should compile the Regexs only once, rather than compiling and running them every time. I recommend putting them in statics with once_cell::sync::Lazy.

Related

Stable alternative to collect_into - or - how do I collect a sized queue?

I have some pipeline which manipulates an iterator to a very big data set, and at the end, I wish to just keep the N top values.
I wrote a wrapper around a Vec - a struct which holds the Vec and its max size, and implements insertion such that the data in the vec is always ordered, and values which are too small would get ignored (could have also used a BTreeSet, if N is large enough).
Anyway, I thought I'd use it as follows:
let mut q = SizedQueue(5);
<my iterator pipleline>.collect_into(&mut q);
but I was disappointed to discover that collect_into is unstable, and could potentially be dropped because it might be deemed unnecessary, the reason given is that it could be done differently.
My question is - how could it be done differently (other than me just implementing a Trait for Iterator with this functionality myself)?
collect_into() is just a convenient shortcut to calling Extend::extend():
let mut q = SizedQueue(5);
q.extend(<my iterator pipleline>);
Of course, you need to implement Extend for your type. A simple implementation may look like:
impl<T: PartialOrd> Extend<T> for SizedQueue<T> {
fn extend<I: IntoIterator<Item = T>>(&mut self, iter: I) {
for item in iter {
self.push(item);
}
}
}
But if this is only for one use site where you call extend(), you may as well just inline it and loop and push().

Ergonomically passing a slice of trait objects

I am converting a variety of types to String when they are passed to a function. I'm not concerned about performance as much as ergonomics, so I want the conversion to be implicit. The original, less generic implementation of the function simply used &[impl Into<String>], but I think that it should be possible to pass a variety of types at once without manually converting each to a string.
The key is that ideally, all of the following cases should be valid calls to my function:
// String literals
perform_tasks(&["Hello", "world"]);
// Owned strings
perform_tasks(&[String::from("foo"), String::from("bar")]);
// Non-string types
perform_tasks(&[1,2,3]);
// A mix of any of them
perform_tasks(&["All", 3, String::from("types!")]);
Some various signatures I've attempted to use:
fn perform_tasks(items: &[impl Into<String>])
The original version fails twice; it can't handle numeric types without manual conversion, and it requires all of the arguments to be the same type.
fn perform_tasks(items: &[impl ToString])
This is slightly closer, but it still requires all of the arguments to be of one type.
fn perform_tasks(items: &[&dyn ToString])
Doing it this way is almost enough, but it won't compile unless I manually add a borrow on each argument.
And that's where we are. I suspect that either Borrow or AsRef will be involved in a solution, but I haven't found a way to get them to handle this situation. For convenience, here is a playground link to the final signature in use (without the needed references for it to compile), alongside the various tests.
The following way works for the first three cases if I understand your intention correctly.
pub fn perform_tasks<I, A>(values: I) -> Vec<String>
where
A: ToString,
I: IntoIterator<Item = A>,
{
values.into_iter().map(|s| s.to_string()).collect()
}
As the other comments pointed out, Rust does not support an array of mixed types. However, you can do one extra step to convert them into a &[&dyn fmt::Display] and then call the same function perform_tasks to get their strings.
let slice: &[&dyn std::fmt::Display] = &[&"All", &3, &String::from("types!")];
perform_tasks(slice);
Here is the playground.
If I understand your intention right, what you want is like this
fn main() {
let a = 1;
myfn(a);
}
fn myfn(i: &dyn SomeTrait) {
//do something
}
So it's like implicitly borrow an object as function argument. However, Rust won't let you to implicitly borrow some objects since borrowing is quite an important safety measure in rust and & can help other programmers quickly identified which is a reference and which is not. Thus Rust is designed to enforce the & to avoid confusion.

A cell with interior mutability allowing arbitrary mutation actions

Standard Cell struct provides interior mutability but allows only a few mutation methods such as set(), swap() and replace(). All of these methods change the whole content of the Cell.
However, sometimes more specific manipulations are needed, for example, to change only a part of data contained in the Cell.
So I tried to implement some kind of universal Cell, allowing arbitrary data manipulation.
The manipulation is represented by user-defined closure that accepts a single argument - &mut reference to the interior data of the Cell, so the user itself can deside what to do with the Cell interior. The code below demonstrates the idea:
use std::cell::UnsafeCell;
struct MtCell<Data>{
dcell: UnsafeCell<Data>,
}
impl<Data> MtCell<Data>{
fn new(d: Data) -> MtCell<Data> {
return MtCell{dcell: UnsafeCell::new(d)};
}
fn exec<F, RetType>(&self, func: F) -> RetType where
RetType: Copy,
F: Fn(&mut Data) -> RetType
{
let p = self.dcell.get();
let pd: &mut Data;
unsafe{ pd = &mut *p; }
return func(pd);
}
}
// test:
type MyCell = MtCell<usize>;
fn main(){
let c: MyCell = MyCell::new(5);
println!("initial state: {}", c.exec(|pd| {return *pd;}));
println!("state changed to {}", c.exec(|pd| {
*pd += 10; // modify the interior "in place"
return *pd;
}));
}
However, I have some concerns regarding the code.
Is it safe, i.e can some safe but malicious closure break Rust mutability/borrowing/lifetime rules by using this "universal" cell?
I consider it safe since lifetime of the interior reference parameter prohibits its exposition beyond the closure call time. But I still have doubts (I'm new to Rust).
Maybe I'm re-inventing the wheel and there exist some templates or techniques solving the problem?
Note: I posted the question here (not on code review) as it seems more related to the language rather than code itself (which represents just a concept).
[EDIT] I'd want zero cost abstraction without possibility of runtime failures, so RefCell is not perfect solution.
This is a very common pitfall for Rust beginners.
Is it safe, i.e can some safe but malicious closure break Rust mutability/borrowing/lifetime rules by using this "universal" cell? I consider it safe since lifetime of the interior reference parameter prohibits its exposition beyond the closure call time. But I still have doubts (I'm new to Rust).
In a word, no.
Playground
fn main() {
let mt_cell = MtCell::new(123i8);
mt_cell.exec(|ref1: &mut i8| {
mt_cell.exec(|ref2: &mut i8| {
println!("Double mutable ref!: {:?} {:?}", ref1, ref2);
})
})
}
You're absolutely right that the reference cannot be used outside of the closure, but inside the closure, all bets are off! In fact, pretty much any operation (read or write) on the cell within the closure is undefined behavior (UB), and may cause corruption/crashes anywhere in your program.
Maybe I'm re-inventing the wheel and there exist some templates or techniques solving the problem?
Using Cell is often not the best technique, but it's impossible to know what the best solution is without knowing more about the problem.
If you insist on Cell, there are safe ways to do this. The unstable (ie. beta) Cell::update() method is literally implemented with the following code (when T: Copy):
pub fn update<F>(&self, f: F) -> T
where
F: FnOnce(T) -> T,
{
let old = self.get();
let new = f(old);
self.set(new);
new
}
Or you could use Cell::get_mut(), but I guess that defeats the whole purpose of Cell.
However, usually the best way to change only part of a Cell is by breaking it up into separate Cells. For example, instead of Cell<(i8, i8, i8)>, use (Cell<i8>, Cell<i8>, Cell<i8>).
Still, IMO, Cell is rarely the best solution. Interior mutability is a common design in C and many other languages, but it is somewhat more rare in Rust, at least via shared references and Cell, for a number of reasons (e.g. it's not Sync, and in general people don't expect interior mutability without &mut). Ask yourself why you are using Cell and if it is really impossible to reorganize your code to use normal &mut references.
IMO the bottom line is actually about safety: if no matter what you do, the compiler complains and it seems that you need to use unsafe, then I guarantee you that 99% of the time either:
There's a safe (but possibly complex/unintuitive) way to do it, or
It's actually undefined behavior (like in this case).
EDIT: Frxstrem's answer also has better info about when to use Cell/RefCell.
Your code is not safe, since you can call c.exec inside c.exec to get two mutable references to the cell contents, as demonstrated by this snippet containing only safe code:
let c: MyCell = MyCell::new(5);
c.exec(|n| {
// need `RefCell` to access mutable reference from within `Fn` closure
let n = RefCell::new(n);
c.exec(|m| {
let n = &mut *n.borrow_mut();
// now `n` and `m` are mutable references to the same data, despite using
// no unsafe code. this is BAD!
})
})
In fact, this is exactly the reason why we have both Cell and RefCell:
Cell only allows you to get and set a value and does not allow you to get a mutable reference from an immutable one (thus avoiding the above issue), but it does not have any runtime cost.
RefCell allows you to get a mutable reference from an immutable one, but needs to perform checks at runtime to ensure that this is safe.
As far as I know, there's not really any safe way around this, so you need to make a choice in your code between no runtime cost but less flexibility, and more flexibility but with a small runtime cost.

Does partial application in Rust have overhead?

I like using partial application, because it permits (among other things) to split a complicated function call, that is more readable.
An example of partial application:
fn add(x: i32, y: i32) -> i32 {
x + y
}
fn main() {
let add7 = |x| add(7, x);
println!("{}", add7(35));
}
Is there overhead to this practice?
Here is the kind of thing I like to do (from a real code):
fn foo(n: u32, things: Vec<Things>) {
let create_new_multiplier = |thing| ThingMultiplier::new(thing, n); // ThingMultiplier is an Iterator
let new_things = things.clone().into_iter().flat_map(create_new_multiplier);
things.extend(new_things);
}
This is purely visual. I do not like to imbricate too much the stuff.
There should not be a performance difference between defining the closure before it's used versus defining and using it it directly. There is a type system difference — the compiler doesn't fully know how to infer types in a closure that isn't immediately called.
In code:
let create_new_multiplier = |thing| ThingMultiplier::new(thing, n);
things.clone().into_iter().flat_map(create_new_multiplier)
will be the exact same as
things.clone().into_iter().flat_map(|thing| {
ThingMultiplier::new(thing, n)
})
In general, there should not be a performance cost for using closures. This is what Rust means by "zero cost abstraction": the programmer could not have written it better themselves.
The compiler converts a closure into implementations of the Fn* traits on an anonymous struct. At that point, all the normal compiler optimizations kick in. Because of techniques like monomorphization, it may even be faster. This does mean that you need to do normal profiling to see if they are a bottleneck.
In your particular example, yes, extend can get inlined as a loop, containing another loop for the flat_map which in turn just puts ThingMultiplier instances into the same stack slots holding n and thing.
But you're barking up the wrong efficiency tree here. Instead of wondering whether an allocation of a small struct holding two fields gets optimized away you should rather wonder how efficient that clone is, especially for large inputs.

&self move field containing Box - Move out of borrowed content

I have a struct with a field containing references to other structs (that I did not define)
struct HtmlHandlebars {
user_helpers: Vec<(String, Box<HelperDef + 'static>)>,
}
And HtmlHandlebars has to implement a function
fn render(&self, ...) -> &self
And in that function I would need to move the Box to another function. Something like this:
fn render(&self, ...) -> &self {
let mut handlebars = Handlebars::new();
for (name, helper) in self.user_helpers {
handlebars.register_helper(&name, helper);
}
}
But I am kind of stuck because:
I can't move the Box references because I am borrowing self
I can't copy the Box references because that struct does not implement copy
I can't modify &self to &mut self because that causes other problems...
Maybe I am doing it completely wrong.. Is there something else I can do? What are my options?
If you need a more complete overview of the code, you can find it here
PS: I had no idea how to describe the situation in the title, feel free to change it
The code you've written is trying to consume the Vec and its elements. In general, you can iterate over &self.user_helpers which will give you references to the elements rather than consuming them. That is (modulo silly typos in the pattern):
for &(ref name, ref helper) in self.user_helpers {
handlebars.register_helper(name, helper);
}
See also: The Rust Programming Language on iterating vectors.
There is a problem with that though: Handlebars needs ownership of the helpers. It may very well be that you have to create a new Handlebars object every time you render, but in that case you also need to be able to create all the helpers every time you create a new Handlebars. There is no way to take ownership of the boxes without taking at least a &mut reference to the Vec. You need at least mutable access to take the handlers out of the struct. And if you did that, you wouldn't have the handlers around for the next time render() is called. Depending on how configurable the set of handlers is, you could have a function that constructs the Vec<Box<HelperDef + 'static>> out of thin air when you need it, or you could maintain a list of callbacks that construct Box<HelperDef + 'static> for you.

Resources