Webassembly trig functions possible? - trigonometry

Does webassembly have support for trig functions? I mean like built in support because it seems we have to import those from javascript. It would be great if we had things like:
f32.sin
f32.cos
If it makes any sense. If they dont exist I assume its because the implementation is very system dependent.
The problem:
Imagine we had some really complex and computationally expensive formula which involves these math functions. I would like to compute everything in webassembly without relying on imports or subdividing my code where one part is run in webassembly and the other in javascript.
Apart from that I do believe semantically things look neater if trig functions were built in.

Trigonometric functions aren't available in WebAssembly though it's been discussed before 1, 2, 3. In general, WebAssembly provides opcodes for things that exist efficiently as CPU instructions, and trigonometric functions just don't have efficient CPU versions nowadays (even the x86 sin / cos is slow).
Further, we don't want to mandate specific precision bounds at this point in time. It's an art-form to specify trigonometric functions with the right precision and there hasn't been strong interest thus far.
In the future, we'd like to see code sharing of something like libm.wasm, which would remove the code download burden.
An argument could be made that implementations could be more efficient if one had ISA information, but we'd need someone championing that usecase with data to standardize it. I expect the first such case to be fast sqrt and inverse sqrt approximations, which are often used in games and do have good ISA support. I haven't heard of real performance-sensitive trigonometric uses otherwise in the context of WebAssembly. Not that I don't think they'd exist: it's a simple question of priorities for the Community Group.

Related

How to reliably influence generated code at near machine level using GHC?

While this may sound as theoretical question, suppose I decide to invest and build a mission-critical application written in Haskell. A year later I find that I absolutely need to improve performance of some very thin bottleneck and this will require optimizing memory access close to raw machine capabilities.
Some assumptions:
It isn't realtime system - occasional latency spikes are tolerable (from interrupts, thread scheduling irregularities, occasional GC etc.)
It isn't a numeric problem - data layout and cache-friendly access patterns are most important (avoiding pointer chasing, reducing conditional jumps etc.)
Code may be tied to specific GHC release (but no forking)
Performance goal requires inplace modification of pre-allocated offheap arrays taking alignment into account (C strings, bit-packed fields etc.)
Data is statically bounded in arrays and allocations are rarely if ever needed
What mechanisms does GHC offer to perfom this kind of optimization? By saying reliably I mean that if source change causes code to no longer perform, it is correctible in source code without rewriting it in assembly.
Is it already possible using GHC-specific extensions and libraries?
Would custom FFI help avoid C calling convention overhead?
Could a special purpose compiler plugin do it through a restricted source DSL?
Could source code generator from a "high-level" assembly (LLVM?) be solution?
It sounds like you're looking for unboxed arrays. "unboxed" in haskell-land means "has no runtime heap representation". You can usually learn whether some part of your code is compiled to an unboxed loop (a loop that performs no allocation), say, by looking at the core representation (this is a very haskell-like language, that's the first stage in compilation). So e.g. you might see Int# in the core output which means an integer which has no heap representation (it's gonna be in a register).
When optimizing haskell code we regularly look at core and expect to be able to manipulate or correct for performance regressions by changing the source code (e.g. adding a strictness annotation, or fiddling with a function such that it can be inlined). This isn't always fun, but will be fairly stable especially if you are pinning your compiler version.
Back to unboxed arrays: GHC exposes a lot of low-level primops in GHC.Prim, in particular it sounds like you want mutable unboxed arrays (MutableByteArray). The primitive package exposes these primops behind a slightly safer, friendlier API and is what you should use (and depend on if writing your own library).
There are many other libraries that implement unboxed arrays, such as vector, and which are built on MutableByteArray, but the point is that operations on that structure generate no garbage and likely compile down to pretty predictable machine instructions.
You might also like to check out this technique if you're doing numeric work and want to use a particular instruction or implement some loop directly in assembly.
GHC also has a very powerful FFI, and you can research about how to write portions of your program in C and interop; haskell supports pinned arrays among other structures for this purpose.
If you need more control than those give you then haskell is likely the wrong language. It's impossible to tell from your description if this is the case for your problem (Your requirements seem contradictory: you need to be able to write a carefully cache-tuned algorithm, but arbitrary GC pauses are okay?).
One last note: you can't rely on GHC's native code generator to perform any of the low-level strength reduction optimizations that e.g. GCC performs (GHC's NCG will probably never ever know about bit-twiddling hacks, autovectorization, etc. etc.). Instead you can try the LLVM backend, but whether you see a speedup in your program is by no means guaranteed.

Why devectorization in Julia is encouraged?

Seems like writing devectorized code is encouraged in Julia.
There is even a package that tries to do that for you.
My question is why?
First of all, speaking from the user experience aspect, vectorized code is more concise (less code, then less likelihood of bugs), more clear (hence easier to debug), more natural way of writing code (at least for someone who comes from scientific computing background, whom Julia tries to cater to). Being able to write something like vector'vector or vector'Matrix*vector is very important, because it corresponds to actual mathematical representation, and this is how scientific computing guys think of it in their head (not in nested loops). And I hate the fact that this is not the best way to write this, and reparsing it into loops will be faster.
At the moment it seems like there is a conflict between the goal of writing the code that is fast and the code that is concise/clear.
Secondly, what is the technical reason for this? Ok, I understand that vectorized code creates extra temporaries, etc., but vectorized functions (for example, broadcast(), map(), etc.) have a potential of multithreading them, and I think that the benefit of multithreading can outweigh the overhead of temporaries and other disadvantages of vectorized functions making them faster than regular for loops.
Do current implementations of vectorized functions in Julia do implicit multithreading under the hood?
If not, is there work / plans to add implicit concurrency to vectorized functions and to make them faster than loops?
For easy reading I decided to turn my comment marathon above into an answer.
The core development statement behind Julia is "we are greedy". The core devs want it to do everything, and do it fast. In particular, note that the language is supposed to solve the "two-language problem", and at this stage, it looks like it will accomplish this by the time v1.0 hits.
In the context of your question, this means that everything you are asking about is either already a part of Julia, or planned for v1.0.
In particular, this means that if your programming problem lends itself to vectorized code, then write vectorized code. If it is more natural to use loops, use loops.
By the time v1.0 hits, most vectorized code should be as fast, or faster, than equivalent code in Matlab. In many cases, this development goal has already been achieved, since many vector/matrix operations in Julia are sent to the appropriate BLAS routines by the compiler.
Regarding multi-threading, native multi-threading is currently being implemented for Julia, and I believe an experimental set of routines is already available on the master branch. The relevant issue page is here. Implicit multithreading for some vector/matrix operations is already in theory available in Julia, since Julia calls BLAS. I'm not sure if it is switched on by default.
Be aware though, that many vectorized operations will still (currently) be much faster in MATLAB, since MATLAB have been writing specialised multi-threaded C libraries for years and then calling them under the hood. Once Julia has native multi-threading, I expect Julia will overtake MATLAB, since at that point the entire dev community can scour the standard Julia packages and upgrade them to take advantage of native multi-threading wherever possible.
In contrast, MATLAB does not have native multi-threading, so you are relying on Mathworks to provide specialised multi-threaded routines in the form of underlying C libraries.
You can and should write vector'*matrix*vector (or perhaps dot(vector, matrix*vector) if you prefer a scalar output). For things like matrix multiplication, you're much better off using vectorized notation, because it calls the underlying BLAS libraries which are more heavily optimized than code produced by basically any language/compiler combination.
In other places, as you say you can benefit from devectorization by avoiding temporary intermediates: for example, if x is a vector, the expression
y = exp(x).*x + 5
creates 3 temporary vectors: one for a = exp(x), one for b = a.*x and one for y = b + 5. In contrast,
y = [exp(z)*z+5 for z in x]
creates no temporary intermediates. Since loops and comprehensions in julia are fast, there is no disadvantage to writing the devectorized version, and in fact it should perform slightly better (especially with performance annotations like #simd, where appropriate).
The arrival of threads may change things (making vectorized exp faster than a "naive" exp), but in general I'd say you should regard this as an "orthogonal" issue: julia will likely make multithreading so easy to use that you yourself might write operations using multiple threads, and consequently the vectorized "library" routine still has no advantage over code you might write yourself. In other words, you might use multiple threads but still write devectorized code to avoid those temporaries.
In the longer term, a "sufficiently smart compiler" may avoid temporaries by automatically devectorizing some of these operations, but this is a much harder route, with potential traps for the unwary, than it may seem.
Your statement that "vectorized code is always more concise, and easier to understand" is, however, not true: many times while writing Matlab code, you have to go to extremes to come up with a vectorized way of writing what are actually simple operations when thought of in terms of loops. You can search the mailing lists for countless examples; one that I recall on SO is How to find connected components in a matrix using Julia.

Expression trees vs IL.Emit for runtime code specialization

I recently learned that it is possible to generate C# code at runtime and I would like to put this feature to use. I have code that does some very basic geometric calculations like computing line-plane intersections and I think I could gain some performance benefits by generating specialized code for some of the methods because many of the calculations are performed for the same plane or the same line over and over again. By specializing the code that computes the intersections I think I should be able to gain some performance benefits.
The problem is that I'm not sure where to begin. From reading a few blog posts and browsing MSDN documentation I've come across two possible strategies for generating code at runtime: Expression trees and IL.Emit. Using expression trees seems much easier because there is no need to learn anything about OpCodes and various other MSIL related intricacies but I'm not sure if expression trees are as fast as manually generated MSIL. So are there any suggestions on which method I should go with?
The performance of both is generally same, as expression trees internally are traversed and emitted as IL using the same underlying system functions that you would be using yourself. It is theoretically possible to emit a more efficient IL using low-level functions, but I doubt that there would be any practically important performance gain. That would depend on the task, but I have not come of any practical optimisation of emitted IL, compared to one emitted by expression trees.
I highly suggest getting the tool called ILSpy that reverse-compiles CLR assemblies. With that you can look at the code actually traversing the expression trees and actually emitting IL.
Finally, a caveat. I have used expression trees in a language parser, where function calls are bound to grammar rules that are compiled from a file at runtime. Compiled is a key here. For many problems I came across, when what you want to achieve is known at compile time, then you would not gain much performance by runtime code generation. Some CLR JIT optimizations might be also unavailable to dynamic code. This is only an opinion from my practice, and your domain would be different, but if performance is critical, I would rather look at native code, highly optimized libraries. Some of the work I have done would be snail slow if not using LAPACK/MKL. But that is only a piece of the advice not asked for, so take it with a grain of salt.
If I were in your situation, I would try alternatives from high level to low level, in increasing "needed time & effort" and decreasing reusability order, and I would stop as soon as the performance is good enough for the time being, i.e.:
first, I'd check to see if Math.NET, LAPACK or some similar numeric library already has similar functionality, or I can adapt/extend the code to my needs;
second, I'd try Expression Trees;
third, I'd check Roslyn Project (even though it is in prerelease version);
fourth, I'd think about writing common routines with unsafe C code;
[fifth, I'd think about quitting and starting a new career in a different profession :) ],
and only if none of these work out, would I be so hopeless to try emitting IL at run time.
But perhaps I'm biased against low level approaches; your expertise, experience and point of view might be different.

Haskell, FFI, and Grand Central Dispatch?

I'm considering a functional language that will play well with my environment of C/Objective-C under FreeBSD, OSX, iOS. It looks like my best bet is to create functional-language libraries for specific functions, written in Haskell, compile with GHC, and use FFI to call this functional code as a standard C call.
My question is, how do I handle concurrency in this situation? One motivation for using a functional language is that for my problems where I want to operate on immutable datasets, I want to get a lot of parallelization going. However, using the approach I detail here, will I get ANY parallelization? It appears I can compile and dictate to use 2 threads, but is there any way to use GCD instead of threading (for all the reasons GCD is better than threads, such as the amount of parallelization automatically scaling per-platform)? Or, going with FFI as I describe, do I completely lose the ability to multithread?
This language seems like the best match for what I'm trying to do, but I want to learn if it's the right fit before I devote a significant amount of time to truly learn it
GHC's runtime replaces the need for GCD, doesn't it? And it already provides cross-platform parallelism.

Trivial mathematical problems as language benchmarks

Why do people insist on using trivial mathematical problems like finding numbers in the Fibonacci sequence for language benchmarks? Don't these usually get optimized to relativistic speeds? Isn't the brunt of the bottlenecks usually in I/O, system API calls, operations on strings and structures, processing large quantities of data, abstract object-oriented stuff, etc?
It is a throwback to the old days, when compiler technology for what we would now call basic math was still evolving rapidly.
Now, compiler evolution is more focused on exploiting new instructions for niche operations, 64-bit math, and so on.
Micro-benchmarks such as the ones you mention were useful, though, when evaluating the efficiency of the hotspot compiler when Java was first launched, and in evaluating the efficiency of .NET versus C/C++.
Your suggestion that I/O and system calls are the likely bottlenecks is correct, at least for some space of problems. But I notice you suggested string operations. One person's irrelevant micro-benchmark is another person's critical performance metric.
EDIT: ps, I also remember using linpack and other micro-benchmarks to compare versions of the JVM, and to compare vendors of the JVM. From v4 to v5 there was a big jump in perf, I guess the JIT compiler got more effective. Also, IBM's JVM was ahead of Sun's at that time, on Windows-x86.
Because if you want to benchmark the language/compiler, these "math problems" are good indicators of the "bare speed" of the generated code. Either they use the iterative solution, which is a tight loop and indicates how well can the compiler push the instructions to the processor, or they use the recursive solution, which indicates how does it handle recursive calls of short functions (inlining, tail-recursion etc.) (although the Ackermann function is usually used for that too).
Usually, the benchmark suite for the language contain tests benchmarking other parts as well - eg. gzip compression, text searching, object creation, virtual function call, exception throw/catch benchmarks.
The other things you've noticed, syscalls and IO are usually not included because
syscalls are in fact not that slow - applications don't spend significant porion of the time in the kernel, except for test specifically targeted at them or when something is seriously wrong with the program
syscall and IO performance does not depend on the language, but rather on the OS & hardware
I'd think a simple, well-established algorithm would remove the possibility that the benchmark is biased (whether through ignorance or malice) to favor one language. It is very difficult to write a complex program in two different languages exactly the same. Testing something like the efficiency of a multithreaded application in c# vs java, for example, would require developers skilled in multithreaded development both languages, and there would still be questions as to whether the benchmark app properly represents the general case, or if it is misrepresenting a special case that only one language handles well.
Back when the sieve of eratosthanes was a popular benchmark for C compilers, I thought it would be funny if one of the compiler authors would recognize the sieve code and replace it with a pre-computed lookup.

Resources