TL;DR:
Is it possible to access textures atomically in WGSL?
By atomically, I mean as specified in the "Atomic Operations" section of the documentation for OpenGL's GL_TEXTURE_*.
If not, will changing to GLSL work in WGPU?
Background:
Hi, recently I have been experimenting with WGPU and WGSL, specifically trying to create a cellular automaton and storing its data in a texture_storage_2d.
I was having problems with the fact that accessing the texture concurrently caused race conditions that made cells disappear (if two cells try to advance to the same point at the same time, they will overwrite one another).
I did some research and couldn't find any solution to my problem in the WGSL spec, but I found something similar in OpenGL and GLSL: atomic operations on OpenGL's GL_TEXTURE_* textures (whereas atomics in WGSL exist, AFAIK, only for u32 or i32).
Is there something like GL_TEXTURE_* in WGSL?
Or is there some alternative that I am not aware of?
And is changing to GLSL (while staying with WGPU) the only solution? Will it even work?
To answer the first part, there are no atomic texture operations in WGSL.
The Solution to the Problem
original reddit discussion
After doing some tests I confirmed two things:
I managed to successfully implement an atomic texture (code below).
When the texture is very large (my tests were on a 2000 X 2000 texture) the race conditions described do not occur. This can probably be explained by bank conflicts but I haven't researched it enough to know for sure.
Code
The following snippet is paraphrased from my original code; it is not tested but should work.
@group(0) @binding(0) var texture: texture_storage_2d<rg32uint, read_write>;

struct Locks {
    locks: array<array<atomic<u32>, 50>, 50>,
};
@group(0) @binding(1) var<storage, read_write> locks: Locks;

// Returns true if this invocation acquired the lock for `location`.
fn lock(location: vec2<u32>) -> bool {
    let lock_ptr = &locks.locks[location.y][location.x];
    let original_lock_value = atomicLoad(lock_ptr);
    if (original_lock_value > 0u) {
        return false;
    }
    return atomicAdd(lock_ptr, 1u) == original_lock_value;
}

fn unlock(location: vec2<u32>) {
    atomicStore(&locks.locks[location.y][location.x], 0u);
}
Ideally, I'd use atomicCompareExchangeWeak instead of that somewhat complex logic in lock, but atomicCompareExchangeWeak didn't seem to work on my machine so I created similar logic myself.
Just to clarify, reading from the texture should be possible at any time but writing to the texture at location should be done only if lock(location) returned true.
Don't forget to call unlock after every write and between shader calls to reset the locks :)
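On the wgpu side, the locks live in an ordinary storage buffer that you can zero between dispatches to reset them. Here is a minimal host-side sketch, assuming a wgpu::Device/wgpu::Queue and a 50 x 50 lock grid matching the shader above (names and sizes are illustrative, not taken from the original code):
// Hypothetical host-side helpers for the `locks` storage buffer.
const GRID: u64 = 50;
const LOCKS_SIZE: u64 = GRID * GRID * std::mem::size_of::<u32>() as u64;

fn create_locks_buffer(device: &wgpu::Device) -> wgpu::Buffer {
    device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("locks"),
        size: LOCKS_SIZE,
        // STORAGE so the shader can bind it, COPY_DST so it can be overwritten from the CPU.
        usage: wgpu::BufferUsages::STORAGE | wgpu::BufferUsages::COPY_DST,
        mapped_at_creation: false,
    })
}

// Call this between dispatches so the next pass starts with every cell unlocked.
fn reset_locks(queue: &wgpu::Queue, locks: &wgpu::Buffer) {
    queue.write_buffer(locks, 0, &vec![0u8; LOCKS_SIZE as usize]);
}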
I want to monitor traces of my Node.js application, and I am using the @opentelemetry/exporter-trace-otlp-grpc library for this purpose.
Now, I want to receive these traces in a Rust application. Unfortunately, it seems like there is no OTLP receiver implementation in Rust as of now!
What is the best possible way (as of now) to collect these traces in my Rust application (preferably based on HTTP or gRPC)?
Thanks in advance!
Based on your comment, you should rephrase your question. You don't want to receive traces in Rust; what you want is the context propagated from the upstream application to your Rust process.
You need to use one of the many possible context propagators in your applications. If you haven't already, I would suggest using W3C trace context propagation: https://www.w3.org/TR/trace-context/. All the SDKs have implementations of these propagators.
What you would do is set the global propagator in both the Node and Rust applications; the propagation should then happen automatically, unless your Rust framework is not yet supported by OTEL instrumentation, in which case you need to do it manually.
In JS application
const api = require("@opentelemetry/api");
const { W3CTraceContextPropagator } = require("@opentelemetry/core");
/* Set Global Propagator */
api.propagation.setGlobalPropagator(new W3CTraceContextPropagator());
In Rust application
use opentelemetry::global;
use opentelemetry::sdk::propagation::TraceContextPropagator;
...
global::set_text_map_propagator(TraceContextPropagator::new());
If for some reason your web server/client doesn't support automatic context propagation (OTEL Rust is not very mature yet), you would have to inject into and extract from the headers yourself to achieve the propagation.
Manually Inject
global::get_text_map_propagator(|propagator| {
    propagator.inject_context(&cx, &mut your_headers)
});
Manually Extract
let parent_cx = global::get_text_map_propagator(|propagator| {
    propagator.extract(&HeaderExtractor(req.headers()))
});
let span = tracer.start_with_context("foo", &parent_cx);
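To make the manual path concrete, here is a small self-contained sketch of injecting and extracting the context through a plain HashMap carrier (this assumes the opentelemetry crate's Injector/Extractor impls for HashMap<String, String>; in a real service you would use your framework's header map instead, e.g. via opentelemetry_http::HeaderExtractor as above). The span and carrier names are illustrative:
use std::collections::HashMap;

use opentelemetry::global;
use opentelemetry::sdk::propagation::TraceContextPropagator;
use opentelemetry::trace::Tracer;
use opentelemetry::Context;

fn main() {
    global::set_text_map_propagator(TraceContextPropagator::new());

    // Upstream side: serialize the current context into W3C
    // `traceparent`/`tracestate` entries in the carrier.
    let mut carrier: HashMap<String, String> = HashMap::new();
    global::get_text_map_propagator(|propagator| {
        propagator.inject_context(&Context::current(), &mut carrier)
    });

    // Downstream side: rebuild the context from the carrier and start
    // a span parented to the remote trace.
    let parent_cx = global::get_text_map_propagator(|propagator| propagator.extract(&carrier));
    let tracer = global::tracer("example");
    let _span = tracer.start_with_context("handle_request", &parent_cx);
}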
Is it possible to benchmark programs in Rust? If yes, how? For example, how would I get the execution time of a program in seconds?
It might be worth noting two years later (to help any future Rust programmers who stumble on this page) that there are now tools to benchmark Rust code as a part of one's test suite.
(From the guide link below) Using the #[bench] attribute, one can use the standard Rust tooling to benchmark methods in their code.
#![feature(test)]
extern crate test;

use test::Bencher;

#[bench]
fn bench_xor_1000_ints(b: &mut Bencher) {
    b.iter(|| {
        // Use `test::black_box` to prevent compiler optimizations from disregarding
        // unused values.
        test::black_box((0u32..1000).fold(0, |old, new| old ^ new));
    });
}
For the command cargo bench this outputs something like:
running 1 test
test bench_xor_1000_ints ... bench: 375 ns/iter (+/- 148)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured
Links:
The Rust Book (section on benchmark tests)
"The Nightly Book" (section on the test crate)
test::Bencher docs
For measuring time without adding third-party dependencies, you can use std::time::Instant:
fn main() {
    use std::time::Instant;
    let now = Instant::now();

    // Code block to measure.
    {
        my_function_to_measure();
    }

    let elapsed = now.elapsed();
    println!("Elapsed: {:.2?}", elapsed);
}
There are several ways to benchmark your Rust program. For most real benchmarks, you should use a proper benchmarking framework as they help with a couple of things that are easy to screw up (including statistical analysis). Please also read the "Why writing benchmarks is hard" section at the very bottom!
Quick and easy: Instant and Duration from the standard library
To quickly check how long a piece of code runs, you can use the types in std::time. The module is fairly minimal, but it is fine for simple time measurements. You should use Instant instead of SystemTime as the former is a monotonically increasing clock and the latter is not. Example (Playground):
use std::time::Instant;
let before = Instant::now();
workload();
println!("Elapsed time: {:.2?}", before.elapsed());
The underlying platform-specific implementations of std's Instant are specified in the documentation. In short: currently (and probably forever) you can assume that it uses the best precision that the platform can provide (or something very close to it). From my measurements and experience, this is typically around 20 ns.
If std::time does not offer enough features for your case, you could take a look at chrono. However, for measuring durations, it's unlikely you need that external crate.
Using a benchmarking framework
Using frameworks is often a good idea, because they try to prevent you from making common mistakes.
Rust's built-in benchmarking framework (nightly only)
Rust has a convenient built-in benchmarking feature, which is unfortunately still unstable as of 2019-07. You have to add the #[bench] attribute to your function and make it accept one &mut test::Bencher argument:
#![feature(test)]
extern crate test;

use test::Bencher;

#[bench]
fn bench_workload(b: &mut Bencher) {
    b.iter(|| workload());
}
Executing cargo bench will print:
running 1 test
test bench_workload ... bench: 78,534 ns/iter (+/- 3,606)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out
Criterion
The crate criterion is a framework that runs on stable, but it is a bit more complicated than the built-in solution. It does more sophisticated statistical analysis, offers a richer API, produces more information and can even automatically generate plots.
See the "Quickstart" section for more information on how to use Criterion.
Why writing benchmarks is hard
There are many pitfalls when writing benchmarks. A single mistake can make your benchmark results meaningless. Here is a list of important but commonly forgotten points:
Compile with optimizations: rustc -O (or -C opt-level=3) or cargo build --release. When you are executing your benchmarks with cargo bench, Cargo will automatically enable optimizations. This step is important as there are often large performance differences between optimized and unoptimized Rust code.
Repeat the workload: only running your workload once is almost always useless. There are many things that can influence your timing: overall system load, the operating system doing stuff, CPU throttling, file system caches, and so on. So repeat your workload as often as possible. For example, Criterion runs every benchmark for at least 5 seconds (even if the workload only takes a few nanoseconds). All measured times can then be analyzed, with mean and standard deviation being the standard tools.
Make sure your benchmark isn't completely removed: benchmarks are very artificial by nature. Usually, the result of your workload is not inspected, as you only want to measure the duration. However, this means that a good optimizer could remove your whole benchmark because it does not have side effects (well, apart from the passage of time). So to trick the optimizer, you have to somehow use your result value so that your workload cannot be removed. An easy way is to print the result. A better solution is something like black_box. This function basically hides a value from LLVM, in that LLVM cannot know what will happen with the value. Nothing happens, but LLVM doesn't know that. That is the point.
Good benchmarking frameworks use a black box in several situations. For example, the closure given to the iter method (for both the built-in and the Criterion Bencher) can return a value. That value is automatically passed into a black_box.
Beware of constant values: similarly to the point above, if you specify constant values in a benchmark, the optimizer might generate code specifically for that value. In extreme cases, your whole workload could be constant-folded into a single constant, meaning that your benchmark is useless. Pass all constant values through black_box to avoid LLVM optimizing too aggressively.
Beware of measurement overhead: measuring a duration takes time itself. That is usually only tens of nanoseconds, but it can influence your measured times. So for all workloads that are faster than a few tens of nanoseconds, you should not measure each execution individually. You could execute your workload 100 times and measure how long all 100 executions took; dividing that by 100 gives you the average time of a single execution (a small sketch of this batching trick is shown after this list). The benchmarking frameworks mentioned above also use this trick. Criterion also has a few methods for measuring very short workloads that have side effects (like mutating something).
Many other things: sadly, I cannot list all difficulties here. If you want to write serious benchmarks, please read more online resources.
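As an illustration of the batching point above, here is a rough std-only sketch that measures 100 executions in one go and uses std::hint::black_box so the workload cannot be optimized away (the workload itself is just a placeholder):
use std::hint::black_box;
use std::time::Instant;

// Placeholder workload.
fn workload(n: u64) -> u64 {
    (0..n).fold(0, |acc, x| acc ^ x)
}

fn main() {
    const RUNS: u32 = 100;

    let start = Instant::now();
    for _ in 0..RUNS {
        // black_box hides input and output from the optimizer so the loop
        // body cannot be constant-folded or removed entirely.
        black_box(workload(black_box(1000)));
    }
    let elapsed = start.elapsed();

    println!("average: {:.2?} per run", elapsed / RUNS);
}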
If you simply want to time a piece of code, you can use the time crate. The time crate has meanwhile been deprecated, though; a follow-up crate is chrono.
Add time = "*" to your Cargo.toml.
Add
extern crate time;
use time::PreciseTime;
before your main function and
let start = PreciseTime::now();
// whatever you want to do
let end = PreciseTime::now();
println!("{} seconds for whatever you did.", start.to(end));
Complete example
Cargo.toml
[package]
name = "hello_world" # the name of the package
version = "0.0.1" # the current version, obeying semver
authors = [ "you#example.com" ]
[[bin]]
name = "rust"
path = "rust.rs"
[dependencies]
rand = "*" # Or a specific version
time = "*"
rust.rs
extern crate rand;
extern crate time;
use rand::Rng;
use time::PreciseTime;
fn main() {
    // Create a vector of 10000000 random integers
    //let mut array: [i32; 10000000] = [0; 10000000];
    let n = 10000000;
    let mut array = Vec::new();

    // Fill the array
    let mut rng = rand::thread_rng();
    for _ in 0..n {
        //array[i] = rng.gen::<i32>();
        array.push(rng.gen::<i32>());
    }

    // Sort
    let start = PreciseTime::now();
    array.sort();
    let end = PreciseTime::now();

    println!("{} seconds for sorting {} integers.", start.to(end), n);
}
This answer is outdated! The time crate does not offer any advantages over std::time with regard to benchmarking. Please see the other answers for up-to-date information.
You might try timing individual components within the program using the time crate.
A quick way to find out the execution time of a program, regardless of implementation language, is to run time prog on the command line. For example:
~$ time sleep 4
real 0m4.002s
user 0m0.000s
sys 0m0.000s
The most interesting measurement is usually user, which measures the actual amount of work done by the program, regardless of what's going on in the system (sleep is a pretty boring program to benchmark). real measures the actual time that elapsed, and sys measures the amount of work done by the OS on behalf of the program.
Currently, there is no interface to any of the following Linux functions:
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts)
getrusage
times (manpage: man 2 times)
The available ways to measure the CPU time and hotspots of a Rust program on Linux are:
/usr/bin/time program
perf stat program
perf record --freq 100000 program; perf report
valgrind --tool=callgrind program; kcachegrind callgrind.out.*
The output of perf report and valgrind depends on the availability of debugging information in the program. It may not work.
I created a small crate for this (measure_time), which logs or prints the time until end of scope.
#[macro_use]
extern crate measure_time;

fn main() {
    print_time!("measure function");
    do_stuff();
}
Another solution for measuring execution time is to create a custom type, for example a struct, and implement the Drop trait for it.
For example:
struct Elapsed(&'static str, std::time::SystemTime);

impl Drop for Elapsed {
    fn drop(&mut self) {
        println!(
            "operation {} finished in {} ms",
            self.0,
            self.1.elapsed().unwrap_or_default().as_millis()
        );
    }
}

impl Elapsed {
    pub fn start(op: &'static str) -> Elapsed {
        let now = std::time::SystemTime::now();
        Elapsed(op, now)
    }
}
And using it in some function:
fn some_heavy_work() {
    let _exec_time = Elapsed::start("some_heavy_work_fn");
    // Here's some code.
}
When the function ends, the drop method for _exec_time will be called and the message will be printed.
I'm trying to get some performance metrics using the flame crate with code I've written using Rayon:
extern crate flame;
extern crate rayon;

use rayon::prelude::*;

flame::start("TAG-A");
// Assume vec is a Vec<i32>
vec.par_iter_mut().filter(|a| **a == 1).for_each(|b| func(b));
// func(b) operates on each i32 and sends some results to a channel
flame::end("TAG-A");

// More code, but unrelated
flame::dump_stdout();
This works fine, but it only gives information for the entire parallel iterator. I would like to get some more fine-grained details on the function func.
I've tried adding a start/end within the function, but the runtime information is only available when I call flame::commit_thread(), and then it seems to only print this to stdout. Ideally I'd like to print out the time spent within a given tag when I call dump at the end of my code.
Is there a way to dump tags from all threads? The documentation for flame isn't great.
I have a MyReader that implements Iterator and produces Buffers, where Buffer: Send. MyReader produces a lot of Buffers very quickly, but I have a CPU-intensive job to perform on each Buffer (.map(|buf| ...)) that is my bottleneck, and then I gather the results (ordered). I want to parallelize the CPU-intensive work, hopefully across N threads that would use work stealing to perform it as fast as the number of cores allows.
Edit: To be more precise, I am working on rdedup. MyReader is a Chunker, which reads an io::Read (typically stdin), finds parts (chunks) of the data and yields them. Then map() is supposed to, for each chunk, calculate its sha256 digest, compress, encrypt, save it, and return the digest as the result of map(...). The digest of the saved data is used to build an index of the data. The order in which chunks are processed by map(...) does not matter, but the digest returned from each map(...) needs to be collected in the same order that the chunks were found. The actual save-to-file step is offloaded to yet another thread (the writer thread). Actual code of the PR in question.
I hoped I could use rayon for this, but rayon expects an iterator that is already parallelizable, e.g. a Vec<...> or something like that. I have found no way to get a par_iter from MyReader; my reader is very single-threaded in nature.
There is simple_parallel, but the documentation says it's not recommended for general use, and I want to make sure everything will just work.
I could just take an spmc queue implementation and a custom thread pool, but I was hoping for an existing solution that is optimized and tested.
There's also pipeliner, but it doesn't support ordered map yet.
In general, preserving order is a pretty tough requirement as far as parallelization goes.
You could try to hand-make it with a typical fan-out/fan-in setup:
a single producer which tags inputs with a sequential monotonically increasing ID,
a thread pool which consumes from this producer and then sends the result toward the final consumer,
a consumer which buffers and reorders results so as to process them in sequential order (a sketch of such a reordering consumer is shown below).
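A rough std-only sketch of that fan-in/reordering step, assuming each worker sends (sequence_id, result) pairs over a channel (all names here are illustrative):
use std::collections::HashMap;
use std::sync::mpsc::Receiver;

// Receives (sequence_id, result) pairs in arbitrary order and hands the
// results to `handle` strictly in sequence order.
fn reorder_consumer<T>(receiver: Receiver<(u64, T)>, mut handle: impl FnMut(T)) {
    let mut next_expected: u64 = 0;
    let mut pending: HashMap<u64, T> = HashMap::new();

    while let Ok((id, item)) = receiver.recv() {
        pending.insert(id, item);
        // Drain everything that is now in order.
        while let Some(item) = pending.remove(&next_expected) {
            handle(item);
            next_expected += 1;
        }
    }
}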
Or you could raise the level of abstraction.
Of specific interest here: Future.
A Future represents the result of a computation, which may or may not have happened yet. A consumer receiving an ordered list of Future can simply wait on each one, and let buffering occur naturally in the queue.
For bonus points, if you use a fixed-size queue, you automatically get back-pressure on the producer.
And therefore I would recommend building something on top of CpuPool (from the futures-cpupool crate).
The setup is going to be:
extern crate futures;
extern crate futures_cpupool;

use std::sync::mpsc::{Receiver, Sender};

use futures::Future;
use futures_cpupool::CpuPool;

fn produce(sender: Sender<...>) {
    let pool = CpuPool::new_num_cpus();

    for chunk in reader {
        let future = pool.spawn_fn(|| /* do work */);
        sender.send(future);
    }

    // Dropping the sender signals there's no more work to the consumer
}

fn consume(receiver: Receiver<...>) {
    while let Ok(future) = receiver.recv() {
        let item = future.wait().expect("Computation Error?");
        /* do something with item */
    }
}

fn main() {
    let (sender, receiver) = std::sync::mpsc::channel();

    std::thread::spawn(move || consume(receiver));

    produce(sender);
}
There is now a dpc-pariter crate. Simply replace iter.map(fn) with iter.parallel_map(fn), which will perform work in parallel while preserving result order. From the docs:
* drop-in replacement for standard iterators(*)
* preserves order
* lazy, somewhat like single-threaded iterators
* panic propagation
* support for iterating over borrowed values using scoped threads
* backpressure
* profiling methods (useful for analyzing pipelined processing bottlenecks)
Also, Rayon has an open issue with a great in-depth discussion of various implementation details and limitations.
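A minimal sketch of what that replacement might look like, assuming the crate is pulled in under the package name pariter and its iterator extension trait is in scope (check the crate docs for the exact paths; the closure is a placeholder workload):
use pariter::IteratorExt as _;

fn main() {
    let input = 0u32..1_000;

    // Runs the closure on a pool of worker threads, but yields results
    // in the same order as the input.
    let results: Vec<u64> = input
        .parallel_map(|i| {
            // CPU-heavy work would go here.
            u64::from(i) * u64::from(i)
        })
        .collect();

    assert_eq!(results[3], 9);
}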