I am trying to find the fastest way to flip the value of a Boolean in Rust, i.e.
false => true
true => false
For my application I do not care about the current value of the Boolean, only that it is flipped. In my application (the Sieve of Atkin, an improved version of the Sieve of Eratosthenes) this will need to be performed a large number of times, so it would be good to have it run as fast as possible. Currently my code is:
item[i] = !item[i]
Because (as mentioned) the current value of item[i] is irrelevant, I am sure there is a faster (possibly bitwise) way to do this. However, I'm a bit of a Rust newbie and haven't been able to find it. Can anyone advise me on a better way?
Thanks,
It doesn't matter. The compiler will outsmart you on this and pick the fastest method it knows; the exact syntax you use is not important. However, this is only the case when you remember to turn on optimizations. You may think this is obvious, but it is an extremely common mistake. Optimizations are enabled by passing --release when building or running your project with cargo. If you forget this step, the compiler won't even attempt to speed up your code, and timing code execution becomes meaningless.
What matters more is how you access the memory where the boolean resides. Try to keep the memory you are working with on the stack if you are doing a lot of work on one value or a small region at a time. Cache locality also means that reading adjacent cache lines is faster than jumping between places in memory. Memory is more likely to be in the cache if you accessed it recently or the CPU guesses you are about to access it.
There are also crates like bit-vec and bitvec which reduce each boolean to a single bit. This is great for improving memory usage (an 8x improvement, to be exact), but comes at a very small cost to performance. I would avoid the bitvec crate, though: about a month ago I did some benchmarks and the performance was absolutely abysmal.
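For example, with the bit-vec crate the usage looks roughly like this (a sketch from memory, so check the crate docs for the exact API):

use bit_vec::BitVec;

fn main() {
    // One bit per flag instead of one byte per bool.
    let mut sieve = BitVec::from_elem(1_000, false);

    // Flipping a bit is a read followed by a write of the negated value.
    let old = sieve.get(42).unwrap();
    sieve.set(42, !old);

    assert_eq!(sieve.get(42), Some(true));
}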
Do you need to work on a single boolean at a time? Try to work with entire words of memory if possible. Bitwise operations on a u64 will likely take the exact same amount of time, but you get 64x the productivity.
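As a sketch of what that looks like, a single XOR on a u64 flips 64 packed booleans at once:

fn flip_all(words: &mut [u64]) {
    // One XOR per word toggles 64 packed booleans at a time.
    for w in words.iter_mut() {
        *w ^= u64::MAX;
    }
}

fn main() {
    let mut bits = vec![0u64; 16]; // 1024 packed booleans
    flip_all(&mut bits);
    assert!(bits.iter().all(|&w| w == u64::MAX));
}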
Since in Rust a boolean variable is represented as an 8-bit unsigned integer with 0 for false and 1 for true, the compiler can implement negation without a branch by computing the XOR of the value with 1.
That being said, while I'm not familiar with the Sieve of Atkin, at least for the Sieve of Eratosthenes, you really want to use bitfields rather than booleans. But the same trick can be used there to avoid a branch.
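As an illustration of that trick (my sketch, not part of the original answer), flipping bit i of a word-packed bitfield is a single branchless XOR:

// Flip the i-th bit of a word-packed bitfield.
fn flip_bit(bits: &mut [u64], i: usize) {
    // Branchless: XOR with a one-bit mask toggles the bit,
    // just like `b ^ 1` does for a byte-sized bool.
    bits[i / 64] ^= 1 << (i % 64);
}

fn main() {
    let mut sieve = vec![0u64; 8]; // 512 flags
    flip_bit(&mut sieve, 70);
    flip_bit(&mut sieve, 70);
    assert_eq!(sieve[1], 0); // flipped twice, so back to 0
}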
I am learning Rust and in the official tutorial, the author assigned the value 5 to a variable like so:
let x: i32 = 5;
I thought this was weird as one could use u8 as the type and the program would run fine. This got me thinking, are there any advantages to using a lower bit number? Is it faster?
The main advantage is that they use less memory. A Vec<i32> with 1 billion elements will use 4 GB, while a Vec<u8> will use 1 GB. This can be a significant advantage regardless of speed.
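You can confirm the element sizes behind that arithmetic with std::mem::size_of:

fn main() {
    assert_eq!(std::mem::size_of::<i32>(), 4);
    assert_eq!(std::mem::size_of::<u8>(), 1);
    // 1 billion elements:
    println!("{} bytes vs {} bytes",
             4_usize * 1_000_000_000,  // ~4 GB for Vec<i32>
             1_usize * 1_000_000_000); // ~1 GB for Vec<u8>
}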
Arithmetic on smaller integer types is not faster in general on modern CPUs. There are some issues with using only part of a register, but optimizers will almost certainly resolve these performance problems for you.
When you have a lot of integers and the optimizer can make use of vectorization (for example adding your 1 billion integers in the vector) then smaller types will typically yield better performance, because more of them fit in a SIMD register.
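As a sketch (my example, not the answerer's): a tight loop over u8 gives the vectorizer many more lanes per register, as long as you widen before accumulating so the sum cannot overflow the narrow element type:

fn sum_bytes(data: &[u8]) -> u64 {
    // Sixteen u8 lanes fit in one 128-bit SIMD register, versus
    // four for i32; widening to u64 keeps the total from overflowing.
    data.iter().map(|&b| u64::from(b)).sum()
}

fn main() {
    let data = vec![7u8; 1_000_000];
    assert_eq!(sum_bytes(&data), 7_000_000);
}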
If you use them just as one scalar stack variable like in your example, I highly doubt there will be a difference in 99% of cases. Here other considerations are more important:
A bigger type makes overflows less likely; perhaps you miscalculated your maximum possible value.
For public interfaces, bigger types are more future-proof.
It's better to cast from i8 to i32 than the other way round, as the sketch below shows.
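Widening conversions are lossless and available via From, while narrowing with as silently wraps:

fn main() {
    let small: i8 = -5;
    let wide: i32 = i32::from(small); // lossless: every i8 value fits in an i32
    assert_eq!(wide, -5);

    let big: i32 = 300;
    let narrowed = big as i8; // wraps: 300 does not fit in an i8
    assert_eq!(narrowed, 44);
}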
Should we use u32/i32 or their lower variants (u8/i8, u16/i16) when dealing with limited-range numbers like "days in month", which ranges from 1 to 30, or "score of a subject", which ranges from 0 to 100? Or why shouldn't we?
Is there any optimization or benefit to using the lower variants (i.e., are they more memory efficient)?
Summary
Correctness should be prioritized over performance and correctness-wise (for ranges like 1–100), all solutions (u8, u32, ...) are equally bad. The best solution would be to create a new type to benefit from strong typing.
The rest of my answer tries to justify this claim and discusses different ways of creating the new type.
More explanation
Let's take a look at the "score of subject" example: the only legal values are 0–100. I'd argue that correctness-wise, using u8 and u32 is equally bad: in both cases, your variable can hold values that are not legal in your semantic context; that's bad!
And arguing that u8 is better because there are fewer illegal values is like arguing that wrestling a bear is better than walking through New York, because you only have one possibility of dying (blood loss from a bear attack) as opposed to the many possibilities of death (car accident, knife attack, drowning, ...) in New York.
So what we want is a type that guarantees to hold only legal values. We want to create a new type that does exactly this. However, there are multiple ways to proceed; each with different advantages and disadvantages.
(A) Make the inner value public
struct ScoreOfSubject(pub u8);
Advantage: at least APIs are easier to understand, because the parameter is already explained by the type. What is easier to understand:
add_record("peter", 75, 47) or
add_record("peter", StudentId(75), ScoreOfSubject(47))?
I'd say the latter one ;-)
Disadvantage: we don't actually do any range checking and illegal values can still occur; bad!
(B) Make inner value private and supply a range checking constructor
struct ScoreOfSubject(u8);

impl ScoreOfSubject {
    pub fn new(value: u8) -> Self {
        // Reject anything outside the legal 0-100 range.
        assert!(value <= 100);
        ScoreOfSubject(value)
    }

    pub fn get(&self) -> u8 { self.0 }
}
Advantage: we enforce legal values with very little code, yeah :)
Disadvantage: working with the type can be annoying. Pretty much every operation requires the programmer to pack & unpack the value.
(C) Add a bunch of implementations (in addition to (B))
(the code would impl Add<_>, impl Display and so on)
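A minimal sketch of what those impls could look like, building on the ScoreOfSubject type from (B) (my illustration, not code from the original answer):

use std::fmt;
use std::ops::Add;

impl Add for ScoreOfSubject {
    type Output = ScoreOfSubject;

    fn add(self, rhs: ScoreOfSubject) -> ScoreOfSubject {
        // Route through the checked constructor so the 0-100
        // invariant is enforced after the operation, too.
        ScoreOfSubject::new(self.get() + rhs.get())
    }
}

impl fmt::Display for ScoreOfSubject {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.get())
    }
}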
Advantage: the programmer can use the type and do all useful operations on it directly -- with range checking! This is pretty optimal.
Please take a look at Matthieu M.'s comment:
[...] generally multiplying scores together, or dividing them, does not produce a score! Strong typing not only enforces valid values, it also enforces valid operations, so that you don't actually divide two scores together to get another score.
I think this is a very important point I failed to make clear before. Strong typing prevents the programmer from executing illegal operations on values (operations that don't make any sense). A good example is the crate cgmath which distinguishes between point and direction vectors, because both support different operations on them. You can find additional explanation here.
Disadvantage: a lot of code :(
Luckily the disadvantage can be reduced by the Rust macro/compiler plugin system. There are crates like newtype_derive or bounded_integer that do this kind of code generation for you (disclaimer: I have never worked with them).
But now you say: "you can't be serious? Am I supposed to spend my time writing new types?".
Not necessarily, but if you are working on production code (== at least somewhat important), then my answer is: yes, you should.
A no-answer answer: I doubt you would see any difference in benchmarks, unless you do A LOT of arithmetic or process HUGE arrays of numbers.
You should probably just go with the type which makes more sense (no reason to use negatives or have an upper bound in millions for a day of month) and provides the methods you need (e.g. you can't perform abs() directly on an unsigned integer).
There could be major benefits using smaller types but you would have to benchmark your application on your target platform to be sure.
The first and most easily realized benefit from the lower memory footprint is better caching. Not only is your data more likely to fit into the cache, but it is also less likely to discard other data in the cache, potentially improving a completely different part of your application. Whether or not this is triggered depends on what memory your application touches and in what order. Do the benchmarks!
Network data transfers have an obvious benefit from using smaller types.
Smaller data allows "larger" instructions. A 128-bit SIMD unit can handle 4 values of 32-bit data OR 16 values of 8-bit data, making certain operations 4 times faster. In benchmarks I've made, these instructions do indeed execute 4 times faster, BUT the whole application improved by less than 1%, and the code became more of a mess. Shaping your program to make better use of SIMD can be tricky.
As for the signed/unsigned discussion, unsigned has slightly better properties which a compiler may or may not take advantage of.
I was recently working on an implementation of calculating moving average from a stream of input, using Data.Sequence. I figured I could get the whole operation to be O(n) by using a deque.
My first attempt was (in my opinion) a bit more straightforward to read, but not a true deque. It looked like:
let newsequence = (|>) sequence n
...
let dropFrontTotal = fromIntegral (newtotal - index newsequence 0)
let newsequence' = drop 1 newsequence
...
According to the hackage docs for Data.Sequence, index should take O(log(min(i,n-i))) while drop should also take O(log(min(i,n-i))).
Here's my question:
If I do drop 1 someSequence, doesn't this mean a time complexity of O(log(min(1, (length someSequence)))), which in this case means: O(log(1))?
If so, isn't O(log(1)) effectively constant?
I had the same question for index someSequence 0: shouldn't that operation end up being O(log(0))?
Ultimately, I had enough doubts about my understanding that I resorted to using Criterion to benchmark the two implementations to prove that the index/drop version is slower (and the amount it's slower by grows with the input). The informal results on my machine can be seen at the linked gist.
I still don't really understand how to calculate time complexity for these operations, though, and I would appreciate any clarification anyone can provide.
What you suggest looks correct to me.
As a minor caveat, remember that these are amortized complexity bounds, so a single operation could require more than constant time, but a long chain of operations will only require a constant times the length of the chain.
If you use criterion to benchmark and "reset" the state at every computation, you might see non-constant time costs, because the "reset" prevents the amortization. It really depends on how you perform the test. If you start from a sequence and perform a long chain of operations on it, it should be OK. If you repeat a single operation many times using the same operands, then it might not be.
Further, I guess bounds such as O(log(...)) should actually be read as O(log(1 + ...)) -- you can't realistically have O(log(1)) = O(0) or, worse, O(log(0)) = O(-inf) as a complexity bound.
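For comparison, here is the sliding-window moving average written against a genuine deque (sketched in Rust, where VecDeque gives amortized O(1) push/pop at both ends, so processing the whole stream is O(n)):

use std::collections::VecDeque;

// Each element is pushed once and popped at most once, so the
// whole stream is processed in O(n) overall.
fn moving_averages(input: &[f64], window: usize) -> Vec<f64> {
    let mut deque: VecDeque<f64> = VecDeque::with_capacity(window);
    let mut total = 0.0;
    let mut out = Vec::new();
    for &x in input {
        deque.push_back(x);
        total += x;
        if deque.len() > window {
            // Amortized O(1), unlike index/drop on a balanced tree.
            total -= deque.pop_front().unwrap();
        }
        if deque.len() == window {
            out.push(total / window as f64);
        }
    }
    out
}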
I'm going to have around 1000 strings that need to be sorted alphabetically.
std::set, from what I've read, is sorted; std::vector is not. std::set seems to be the simpler solution, but if I were to use a std::vector, all I would need to do is use std::sort to alphabetize the strings.
My application may or may not be performance critical, so performance isn't necessarily the issue here (yet), but since I'll need to iterate through the container to write the strings to the file, I've read that iterating through a std::set is a tad bit slower than iterating through a std::vector.
I know it probably won't matter, but I'd like to hear which one you all would go with in this situation.
Which STL container would best suit my needs? Thanks.
std::vector with a one-time call to std::sort seems like the simplest way to do what you are after, computationally speaking. std::set provides dynamic lookups by key, which you don't really need here, and things will get more complicated if you have to deal with duplicates.
Make sure you use reserve to pre-allocate the memory for the vector, since you know the size ahead of time in this case. This prevents reallocation as you add to the vector (which is very expensive).
Also, if you use [] notation instead of push_back() you may save a little time by skipping push_back()'s capacity check, but note that [] is only valid for elements that already exist, so you would need resize() rather than reserve() for that. In any case the difference is really academic and insubstantial, and perhaps not even true with compiler optimizations. :-)
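The whole pattern is small; here it is sketched in Rust for consistency with the rest of this thread (Vec::with_capacity plays the role of reserve):

fn sorted_strings(input: impl IntoIterator<Item = String>, expected: usize) -> Vec<String> {
    // Pre-allocate so the pushes below never trigger a reallocation.
    let mut v = Vec::with_capacity(expected);
    v.extend(input);
    // One O(n log n) sort at the end instead of maintaining a sorted set.
    v.sort();
    v
}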
I often see programs like this, where Int64 is an absolute performance killer on 32-bit platforms. My question is now:
If I need a specific word length for my task (in my case a RNG), is Int64 efficient on 64-bit platforms or will it still use the C-calls? And how efficient is converting an Int64 to an Int?
On a 64-bit system Int64 should be fine, though I don't know for sure.
More importantly, if you're doing crypto or random number generation you MUST use the data type that the algorithm says to use, and also be careful of signedness. If you do not do this you will have wrong results, which might mean that your cryptography is not secure or your random number generator is not really random (RNGs are hard, and many look random but aren't).
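As an illustration of why the exact width matters, here is one step of a 64-bit linear congruential generator (sketched in Rust; the constants are Knuth's MMIX ones). The "mod 2^64" is part of the algorithm's definition, so substituting a different width produces a different stream:

// One step of Knuth's MMIX linear congruential generator.
// Wrapping arithmetic on exactly 64 bits is part of the spec.
fn lcg_next(state: u64) -> u64 {
    state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407)
}

fn main() {
    let mut s = 42_u64;
    for _ in 0..3 {
        s = lcg_next(s);
        println!("{s}");
    }
}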
For any other type of work use Integer wherever you can, or even better, make your program polymorphic using the Integral type class. Then, if you think your program is slower than it should be, profile it to determine where you should concentrate when you try to speed it up. If you use the Integral type class, changing from Integer to Int is easy. Haskell should be smart enough to specialize (most) code that uses polymorphism to avoid overheads.
Interesting article on 64 bit performance here:
Isn’t my code going to be faster on 64-bit???
As the article states, the big bottleneck isn't the processor, it's the cache and memory I/O.