Why is there such big performance difference between browsers array operations?

Why is there such big performance difference between browsers array operations? - node.js

I was tuning my library goodcore and set up some performance tests to compare against native array functions. Then I ran them against Edge, FF, Chrome and Node 10.9 on my laptop. Of course my lib had mixed results but what was more interesting was that the difference between browsers were sometimes 30x between best and worst and it did not seem to vary completely between operations.
The arrays used are 10000 long with random ints between 0 and 100000.
EDIT Versions:
Chrome: 68.0.3440.106
FF: 62.0
Edge: 41.16299.371.0
Node: 10.9
Here are my results (only for native operations):
EDIT: now with correct values and also custom algorithms (no native)
The data shows ops/sec in Benchmark.js.
Is this due to datastructure implementation or micro optimizations?

Is this due to datastructure implementation or micro optimizations?
Yes.
Longer answer: probably both, but the only way to answer this for sure is to look at each browser's implementation in detail.
The larger differences that you've measured in particular look like they might be due to fundamentally different choices of data structures under the hood; however even with the same basic data structure, the efficiency of the rest of the implementation can make a huge difference (I've seen 10x - 100x).
Also, IMHO your results are somewhat suspicious: Chrome and Node use the same V8 engine and should have very similar performance. Results like "indexOf" or "splice(remove 1)", where you're seeing a ~10x difference between what should be the ~same result, indicate that something might be wrong in your benchmarks. And if those two results can't be trusted, then why would you have any more confidence in your Edge/Firefox results?
Speaking of benchmark quality: using only one type of array (only one size, only one type of contents, always dense) is another reason why your results probably don't reflect the full story; so please be careful with drawing any conclusions from this.
Why is there such big performance difference
Because making the Array built-in methods fast is a ton of engineering effort. Each browser's engineering team is doing their best to spend the time they have on the functionality they think matters the most. The result is that you'll see varying degrees of optimization in the various implementations.
If there are differences in chosen data structures under the hood (I don't know), then those are typically tradeoffs: one choice might be faster at X but slower at Y than another choice; or one might be faster but consume more memory; etc.

Related

How do I write a Multi threaded Alpha-Beta Search algorithm?

I'm trying to create a chess engine using an alpha beta minimax search algorithm, but the code is too slow. I've done all the optimisations I can think of, but it is still very slow in a single thread. I looked at the source code of some other engines to see how they do it and the chess programming wiki (https://www.chessprogramming.org/Parallel_Search#Parallel_Alpha-Beta), but the the code is beyond my level and I don't understand them. I couldn't find any written sources or code snippets either.
Can someone explain how to efficiently implement threading in an alpha-beta search algorithm? Thanks.

Alpha-beta is an inherently sequential algorithm as your alpha and beta values get updated continuously thorough the search and cutoffs are decided based on these values. For this reason getting any speedups by increasing the amount of threads is very hard, and the more threads you throw at it, the smaller the gains will be.
However there's still several ways to do it, most of them fairly complicated and they scale extremely poorly with more threads. The go to algorithm used to be the Young Brothers wait concept, it is a fairly complicated algo and it was used for example by Stockfish until a few years back. However with the increasing amount of cores available on modern computers the scaling was very poor and the code very complex. Today most modern engines use something called Lazy SMP. This algorithm is almost as simple as it can be and scales better than the others.
In lazy SMP all you have to do is start the exact same search as you would normally do, just on multiple threads. It relies on having a working transposition table through which the threads communicate with each other. The threads will never be exactly in sync and the randomness will lead each thread to explore slightly different parts of the search tree and then save their results into the transposition table, where it might be used by another thread. Of course there is a lot of repeating work done by each thread, however it is still better than trying to be clever about splitting the work and slowing down the algorithm, and this is especially true when you start scaling up the amount of threads.
I recommend you take a look at the chess programming wiki, where you can even find some pseudo code on how to implement it.
https://www.chessprogramming.org/Lazy_SMP
Though i should also point out that if what you are looking for is improving your time to depth, implementing multithreading won't do all that much for you (and in some extreme cases it might actually even slow it down!). What you need instead is more aggresive pruning of the search tree and more efficient implementation (eg. no memory allocations, so the garbage collector never has to run, etc.).

What's the overhead of the different forms of parallelism in Julia v0.5?

As the title states, what is the overhead of the different forms of parallelism, at least in the current implementation of Julia (v0.5, in case the implementation changes drastically in the future)? I am looking for some "practical measures", some general heuristics or ballparks to keep in my head for when it can be useful. For example, it's pretty obvious that multiprocessing won't give you gains in a loop like:
addprocs(4)
#parallel (+) for i=1:4
rand()
end
doesn't give you performance gains because each process is only taking one random number, but is there general heuristic for knowing when it will be worthwhile? Also, what about a heuristic for threading. It's surely a lower overhead than multiprocessing, but for example, with 4 threads, for what N is it a good idea to multithread:
A = rand(4)
Base.#threads (+) for i = 1:N
A[i%4+1]
end
(I know there isn't a threaded reduction right now, but let's act like there is, or edit with a better example). Sure, I can benchmark every example, but some good rules to keep in mind would go a long way.
In more concrete terms: what are some good rules of thumb?
How many numbers do you need to be adding/multiplying before threading gives performance enhancements, or before multiprocessing gives performance enhancements?
How much does the depend on Julia's current implementation?
How much does it depend on the number of threads/processes?
How much does the depend on the architecture? Are there good rules for knowing when the threshold should be higher/lower on a particular system?
What kinds of applications violate these heuristics?
Again, I'm not looking for hard rules, just general guidelines to guide development.

A few caveats: 1. I'm speaking from experience with version 0.4.6, (and prior), haven't played with 0.5 yet (but, as I hope my answer below demonstrates, I don't think this is essential vis-a-vis the response I give). 2. this isn't a fully comprehensive answer.
Nevertheless, from my experience, the overhead for multiple processes itself is very small provided that you aren't dealing with data movement issues. In other words, in my experience, any time that you ever find yourself in a situation of wishing something were faster than a single process on your CPU can manage, you're well past the point where parallelism will be beneficial. For instance, in the sum of random numbers example that you gave, I found through testing just now that the break-even point was somewhere around 10,000 random numbers. Anything more and parallelism was the clear winner. Generating 10,000 random number is trivial for modern computers, taking a tiny fraction of a second, and is well below the threshold where I'd start getting frustrated by the slowness of my scripts and want parallelism to speed them up.
Thus, I at least am of the opinion, that although there are probably even more wonderful things that the Julia developers could do to cut down on the overhead even more, at this point, anything pertinent to Julia isn't going to be so much of your limiting factor, at least in terms of the computation aspects of parallelism. I think that there are still improvements to be made in terms of enhancing both the ease and the efficiency of parallel data movement (I like the package that you've started on that topic as a good step. You and I would probably both agree there's still a ways more to go). But, the big limiting factors will be:
How much data do you need to be moving around between processes?
How much read/write to your memory do you need to be doing during your computations? (e.g. flops per read/write)
Aspect 1. might at times lean against using parallelism. Aspect 2. is more likely just to mean that you won't get so much benefit from it. And, at least as I interpret "overhead," neither of these really fall so directly into that specific consideration. And, both of these are, I believe, going to be far more heavily determined by your system hardware than by Julia.

Random output from Myrrix for the same input

I'm getting slightly different results each time I run Myrrix, even though I'm giving it the exact same input. (I'm only running the serving layer.)
Is this expected behavior and if so how much can I expect the results to vary?
My spontaneous guess would be that the algorithm Myrrix uses is inherently nondeterministic due to the fact that it's built to be massively parallelized --- would that be a correct assessment?

It is not related to the parallelism, but to the random initial conditions of the algorithm. You will see slightly different solutions each time. While that's normal, they shouldn't be too different. If they are, that indicates over-fitting: you might have a lot of features, or low lambda, for your data set. My first guess is that your data set is fairly small and the default of 30 features is quite big in comparison.

Expression trees vs IL.Emit for runtime code specialization

I recently learned that it is possible to generate C# code at runtime and I would like to put this feature to use. I have code that does some very basic geometric calculations like computing line-plane intersections and I think I could gain some performance benefits by generating specialized code for some of the methods because many of the calculations are performed for the same plane or the same line over and over again. By specializing the code that computes the intersections I think I should be able to gain some performance benefits.
The problem is that I'm not sure where to begin. From reading a few blog posts and browsing MSDN documentation I've come across two possible strategies for generating code at runtime: Expression trees and IL.Emit. Using expression trees seems much easier because there is no need to learn anything about OpCodes and various other MSIL related intricacies but I'm not sure if expression trees are as fast as manually generated MSIL. So are there any suggestions on which method I should go with?

The performance of both is generally same, as expression trees internally are traversed and emitted as IL using the same underlying system functions that you would be using yourself. It is theoretically possible to emit a more efficient IL using low-level functions, but I doubt that there would be any practically important performance gain. That would depend on the task, but I have not come of any practical optimisation of emitted IL, compared to one emitted by expression trees.
I highly suggest getting the tool called ILSpy that reverse-compiles CLR assemblies. With that you can look at the code actually traversing the expression trees and actually emitting IL.
Finally, a caveat. I have used expression trees in a language parser, where function calls are bound to grammar rules that are compiled from a file at runtime. Compiled is a key here. For many problems I came across, when what you want to achieve is known at compile time, then you would not gain much performance by runtime code generation. Some CLR JIT optimizations might be also unavailable to dynamic code. This is only an opinion from my practice, and your domain would be different, but if performance is critical, I would rather look at native code, highly optimized libraries. Some of the work I have done would be snail slow if not using LAPACK/MKL. But that is only a piece of the advice not asked for, so take it with a grain of salt.

If I were in your situation, I would try alternatives from high level to low level, in increasing "needed time & effort" and decreasing reusability order, and I would stop as soon as the performance is good enough for the time being, i.e.:
first, I'd check to see if Math.NET, LAPACK or some similar numeric library already has similar functionality, or I can adapt/extend the code to my needs;
second, I'd try Expression Trees;
third, I'd check Roslyn Project (even though it is in prerelease version);
fourth, I'd think about writing common routines with unsafe C code;
[fifth, I'd think about quitting and starting a new career in a different profession :) ],
and only if none of these work out, would I be so hopeless to try emitting IL at run time.
But perhaps I'm biased against low level approaches; your expertise, experience and point of view might be different.

Software to Tune/Calibrate Properties for Heuristic Algorithms

Today I read that there is a software called WinCalibra (scroll a bit down) which can take a text file with properties as input.
This program can then optimize the input properties based on the output values of your algorithm. See this paper or the user documentation for more information (see link above; sadly doc is a zipped exe).
Do you know other software which can do the same which runs under Linux? (preferable Open Source)
EDIT: Since I need this for a java application: should I invest my research in java libraries like gaul or watchmaker? The problem is that I don't want to roll out my own solution nor I have time to do so. Do you have pointers to an out-of-the-box applications like Calibra? (internet searches weren't successfull; I only found libraries)
I decided to give away the bounty (otherwise no one would have a benefit) although I didn't found a satisfactory solution :-( (out-of-the-box application)

Some kind of (Metropolis algorithm-like) probability selected random walk is a possibility in this instance. Perhaps with simulated annealing to improve the final selection. Though the timing parameters you've supplied are not optimal for getting a really great result this way.
It works like this:
You start at some point. Use your existing data to pick one that look promising (like the highest value you've got). Set o to the output value at this point.
You propose a randomly selected step in the input space, assign the output value there to n.
Accept the step (that is update the working position) if 1) n>o or 2) the new value is lower, but a random number on [0,1) is less than f(n/o) for some monotonically increasing f() with range and domain on [0,1).
Repeat steps 2 and 3 as long as you can afford, collecting statistics at each step.
Finally compute the result. In your case an average of all points is probably sufficient.
Important frill: This approach has trouble if the space has many local maxima with deep dips between them unless the step size is big enough to get past the dips; but big steps makes the whole thing slow to converge. To fix this you do two things:
Do simulated annealing (start with a large step size and gradually reduce it, thus allowing the walker to move between local maxima early on, but trapping it in one region later to accumulate precision results.
Use several (many if you can afford it) independent walkers so that they can get trapped in different local maxima. The more you use, and the bigger the difference in output values, the more likely you are to get the best maxima.
This is not necessary if you know that you only have one, big, broad, nicely behaved local extreme.
Finally, the selection of f(). You can just use f(x) = x, but you'll get optimal convergence if you use f(x) = exp(-(1/x)).
Again, you don't have enough time for a great many steps (though if you have multiple computers, you can run separate instances to get the multiple walkers effect, which will help), so you might be better off with some kind of deterministic approach. But that is not a subject I know enough about to offer any advice.

There are a lot of genetic algorithm based software that can do exactly that. Wrote a PHD about it a decade or two ago.
A google for Genetic Algorithms Linux shows a load of starting points.

Intrigued by the question, I did a bit of poking around, trying to get a better understanding of the nature of CALIBRA, its standing in academic circles and the existence of similar software of projects, in the Open Source and Linux world.
Please be kind (and, please, edit directly, or suggest editing) for the likely instances where my assertions are incomplete, inexact and even flat-out incorrect. While working in related fields, I'm by no mean an Operational Research (OR) authority!
[Algorithm] Parameter tuning problem is a relatively well defined problem, typically framed as one of a solution search problem whereby, the combination of all possible parameter values constitute a solution space and the parameter tuning logic's aim is to "navigate" [portions of] this space in search of an optimal (or locally optimal) set of parameters.
The optimality of a given solution is measured in various ways and such metrics help direct the search. In the case of the Parameter Tuning problem, the validity of a given solution is measured, directly or through a function, from the output of the algorithm [i.e. the algorithm being tuned not the algorithm of the tuning logic!].
Framed as a search problem, the discipline of Algorithm Parameter Tuning doesn't differ significantly from other other Solution Search problems where the solution space is defined by something else than the parameters to a given algorithm. But because it works on algorithms which are in themselves solutions of sorts, this discipline is sometimes referred as Metaheuristics or Metasearch. (A metaheuristics approach can be applied to various algorihms)
Certainly there are many specific features of the parameter tuning problem as compared to the other optimization applications but with regard to the solution searching per-se, the approaches and problems are generally the same.
Indeed, while well defined, the search problem is generally still broadly unsolved, and is the object of active research in very many different directions, for many different domains. Various approaches offer mixed success depending on the specific conditions and requirements of the domain, and this vibrant and diverse mix of academic research and practical applications is a common trait to Metaheuristics and to Optimization at large.
So... back to CALIBRA...
From its own authors' admission, Calibra has several limitations
Limit of 5 parameters, maximum
Requirement of a range of values for [some of ?] the parameters
Works better when the parameters are relatively independent (but... wait, when that is the case, isn't the whole search problem much easier ;-) )
CALIBRA is based on a combination of approaches, which are repeated in a sequence. A mix of guided search and local optimization.
The paper where CALIBRA was presented is dated 2006. Since then, there's been relatively few references to this paper and to CALIBRA at large. Its two authors have since published several other papers in various disciplines related to Operational Research (OR).
This may be indicative that CALIBRA hasn't been perceived as a breakthrough.

State of the art in that area ("parameter tuning", "algorithm configuration") is the SPOT package in R. You can connect external fitness functions using a language of your choice. It is really powerful.
I am working on adapters for e.g. C++ and Java that simplify the experimental setup, which requires some getting used to in SPOT. The project goes under name InPUT, and a first version of the tuning part will be up soon.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string