Why do pre-defined functions always perform faster than user-defined functions for the same functionality? Apologies if it's too basic - programming-languages

In every programming language, why do pre-defined functions always perform faster than user-defined functions for the same functionality, even when we write our code with minimal time complexity?
Is it related to some difference in resource utilization between these two kinds of functions?
This has been a long-standing question of mine; I'd appreciate at least one good answer.

What you stated is simply not true. The claim that in any language user-defined functions perform worse than builtins is a very bold one, and it needs to be backed by hard data acquired through profiling.
Being more down-to-earth though, we do indeed observe this behaviour from time to time with various platforms/languages/runtimes/libraries etc.
Mostly this comes down to two reasons:
Builtin functions are sometimes allowed not to be functions at all: the compiler is free to remove a function call completely (and replace it with something else, like a handful of machine instructions) as long as the observable behaviour of the code does not change.
In C, functions in the memcpy() family are a good example of this: in "release" builds, a good compiler for popular platforms usually replaces such calls directly with CPU instructions that copy ranges of memory (see the sketch below).
Builtin functions are often coded in platform-dependent assembly, so that, where applicable, the function's body is finely tuned, highly optimized low-level code written for the target CPU's extended instruction set (such as SSE and its later iterations on Intel platforms).
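For illustration (not specific to any one compiler), here is the kind of small, fixed-size memcpy() call that optimizing builds commonly lower to a plain load and store rather than an actual function call; the function name and scenario are made up:

    #include <string.h>
    #include <stdint.h>

    /* Copy an 8-byte timestamp out of a packet buffer.
     * With optimizations on, mainstream compilers commonly lower this
     * fixed-size memcpy to a single 8-byte load and store instead of a call. */
    uint64_t read_timestamp(const unsigned char *packet)
    {
        uint64_t ts;
        memcpy(&ts, packet, sizeof ts);  /* also the portable way to avoid unaligned-access UB */
        return ts;
    }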
If you meant not low-level languages like C but rather higher-level ones that target a virtual machine or are interpreted, then the answer should be even more obvious: the builtins are typically implemented in the runtime's native, already-compiled language, while user-defined code pays the interpretation or VM overhead.

Related

Is it possible to execute hand-coded bytecode scripts in the V8 engine?

I am in the early stages of a project that aims to estimate the power consumption of Javascript apps. Similar work has been done with Android apps through Java bytecode profiling, and I am hoping to apply a similar methodology using the bytecode generated by Ignition in the V8 engine. However, understandably, there seem to be more tools and resources available for granular analysis of Java bytecode.
My question is whether or not it is possible in V8 to have the engine run hand-coded bytecode scripts for testing purposes, rather than those generated through the compilation process from actual JS source.
The reason for doing this would be the development of energy cost functions at the bytecode instruction level. To accomplish this, I'm hoping to run the same instruction (or set of instructions) repeatedly in a loop on a machine connected to specialized hardware to measure power draw. These measurements would then be used to inform an estimate of the total power consumption of a program by analysing the composition of the bytecode generated by V8.
(V8 developer here.)
No, this is not possible.
Bytecode is an internal implementation detail. It intentionally doesn't have an external interface (you can't get it out of V8, and can't feed it back in), and is not standardized or documented -- in fact, it could change any day.
Also, bytecode typically isn't executed very much, because "hot" functions get tiered up (by an evolving set of compilers). So while you could create an artificial lab setting where you stress-test bytecode execution, that would be so far removed from reality that I doubt the results would be very useful. In particular, they would definitely not allow you to make any meaningful statements about actual power consumption of a JavaScript program.
The only way to measure power consumption of a JavaScript program is to run the program in question and measure power consumption while doing so. Due to JavaScript's dynamic/flexible nature, there are no static rules as simple as "every + operation takes X microjoules, every obj.prop load takes Y microjoules". The reality is "it depends". To give two obvious examples: adding two strings has a different cost from adding two integers (in terms of both time and power); and a monomorphic load is much cheaper than a megamorphic load, loading a simple property has a different cost from having to walk the prototype chain and call a getter, and optimization may or may not be able to avoid the load entirely, depending on various kinds of surrounding circumstances.

Can a programming language do anything outside of what the OS provides?

I am trying to gain a more holistic, general, and high-level understanding of programs and programming languages.
I would like to understand how they actually function. I understand that at the lowest level is machine code, which is 0s and 1s. Then you have assembly. Then you have higher-level languages where every instruction/function/method/call/routine (whatever you want to call it) maps to some instruction or group of instructions in assembly, right? The higher-level language cannot provide or do anything outside of what the lower-level language, assembly, provides, correct?
Similarly, since all code runs on an OS, that code can only do things that the OS provides. It is impossible for the code to do anything outside of what the OS actually provides, correct?
The computer has an instruction set, machine code, which defines what can be done on the computer. Assembly code is essentially a more convenient representation of that, so assembly code can do anything the machine can do. A higher-level programming language has to run on the machine, so it can't do anything the machine can't do, though it may well be able to express it more conveniently (e.g. print "foo" rather than several dozen machine-code instructions). It is a matter of implementation choice whether the compiler for that programming language generates machine code directly, or assembly code, or any other form that might need further processing.
This brings us to the question of whether it's possible for a program (regardless of what it's written in) to do something the OS does not explicitly provide for. I find this an odd way to express it, since the point of writing a program is to give you some capability that you didn't have before, so in some sense you only write programs for things the OS does not explicitly provide you with. The problem lies in defining what the OS "provides". If it's a general-purpose OS, then its designers likely intend to "provide" the ability for you to write a wide range of programs. The OS may choose to provide some convenient feature (say, the ability to create files) but if it did not provide such a feature, you could perhaps do it yourself, given suitable motivation (and, for the file creation example, the ability to do disk I/O - possibly requiring you to write a disk driver too).
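To make the print "foo" example and the notion of what the OS "provides" a little more concrete, here is a minimal sketch, assuming a POSIX system: the convenient library call on top, and underneath it the facility the OS actually exposes, a write() on file descriptor 1:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* High-level convenience: the C library buffers and formats for us... */
        printf("foo\n");
        fflush(stdout);

        /* ...but underneath, what the OS "provides" is (roughly) this:
         * a write() system call on file descriptor 1 (standard output). */
        write(STDOUT_FILENO, "foo\n", 4);

        return 0;
    }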
I am trying to gain a more holistic, general, and high-level understanding of programs and programming languages. I would like to understand how they actually function.
I recommend working toward an understanding of modern hardware and of how to get good performance and energy efficiency out of it, using the following as an example:
Using subword parallelism to improve a matrix multiply by a factor of 4.
Doubling that performance by unrolling the loop, demonstrating the value of instruction-level parallelism.
Doubling performance again by optimizing for caches using blocking (sketched below).
Finally, a speedup of 14 on 16 processors by using thread-level parallelism.
All four optimizations together add just 24 lines of C code to the matrix multiply example you find in
Computer Organization and Design: The Hardware/Software Interface (5th ed.)
or similar books.
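For reference, here is a sketch of the cache-blocking idea from the third item above; the dimensions, block size and function name are illustrative, not the book's exact 24-line example:

    #define N     512        /* matrix dimension (assumed to be a multiple of BLOCK here) */
    #define BLOCK 32         /* block edge chosen so three tiles fit in cache */

    /* C = C + A * B, processed in BLOCK x BLOCK tiles so that the working set
     * of each innermost loop nest stays cache-resident. Only a sketch of the
     * blocking idea, not the book's code. */
    void dgemm_blocked(const double A[N][N], const double B[N][N], double C[N][N])
    {
        for (int ii = 0; ii < N; ii += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                for (int kk = 0; kk < N; kk += BLOCK)
                    for (int i = ii; i < ii + BLOCK; ++i)
                        for (int j = jj; j < jj + BLOCK; ++j) {
                            double sum = C[i][j];
                            for (int k = kk; k < kk + BLOCK; ++k)
                                sum += A[i][k] * B[k][j];
                            C[i][j] = sum;
                        }
    }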
To make a point: it really is worth it to dig deeper than just "learning Python", even if that is a good start. Understanding the low level really affects high-level programming in many ways, and that's what you are after, according to your question.
There is also not just assembly down there - there are hardware description languages such as VHDL, which are discussed, for example, in this topic:
https://electronics.stackexchange.com/questions/132611/whats-the-motivation-in-using-verilog-or-vhdl-over-c

How to reliably influence generated code at near machine level using GHC?

While this may sound like a theoretical question, suppose I decide to invest in and build a mission-critical application written in Haskell. A year later I find that I absolutely need to improve the performance of some very thin bottleneck, and this will require optimizing memory access close to raw machine capabilities.
Some assumptions:
It isn't a realtime system - occasional latency spikes are tolerable (from interrupts, thread scheduling irregularities, occasional GC, etc.)
It isn't a numeric problem - data layout and cache-friendly access patterns are most important (avoiding pointer chasing, reducing conditional jumps etc.)
Code may be tied to specific GHC release (but no forking)
The performance goal requires in-place modification of pre-allocated off-heap arrays, taking alignment into account (C strings, bit-packed fields, etc.)
Data is statically bounded in arrays and allocations are rarely if ever needed
What mechanisms does GHC offer to perform this kind of optimization? By "reliably" I mean that if a source change causes the code to no longer perform, it is correctable in source code without rewriting it in assembly.
Is it already possible using GHC-specific extensions and libraries?
Would custom FFI help avoid C calling convention overhead?
Could a special purpose compiler plugin do it through a restricted source DSL?
Could a source code generator from a "high-level" assembly (LLVM?) be a solution?
It sounds like you're looking for unboxed arrays. "Unboxed" in Haskell-land means "has no runtime heap representation". You can usually learn whether some part of your code is compiled to an unboxed loop (a loop that performs no allocation), say, by looking at the Core representation (a very Haskell-like language that appears early in compilation). So e.g. you might see Int# in the Core output, which means an integer that has no heap representation (it's going to be in a register).
When optimizing Haskell code we regularly look at Core and expect to be able to manipulate or correct for performance regressions by changing the source code (e.g. adding a strictness annotation, or fiddling with a function so that it can be inlined). This isn't always fun, but it will be fairly stable, especially if you are pinning your compiler version.
Back to unboxed arrays: GHC exposes a lot of low-level primops in GHC.Prim, in particular it sounds like you want mutable unboxed arrays (MutableByteArray). The primitive package exposes these primops behind a slightly safer, friendlier API and is what you should use (and depend on if writing your own library).
There are many other libraries that implement unboxed arrays, such as vector, and which are built on MutableByteArray, but the point is that operations on that structure generate no garbage and likely compile down to pretty predictable machine instructions.
You might also like to check out this technique if you're doing numeric work and want to use a particular instruction or implement some loop directly in assembly.
GHC also has a very powerful FFI, and you can research how to write portions of your program in C and interoperate with them; Haskell supports pinned arrays, among other structures, for this purpose.
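For illustration only (the function name and task are made up), this is the sort of small, allocation-free C routine one might push behind the FFI when the inner loop has to control memory layout and access patterns exactly; the Haskell-side plumbing is omitted:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical hot loop exported to Haskell via the FFI:
     * count set bits over a pre-allocated, pinned byte buffer,
     * with no allocation and a predictable, cache-friendly access pattern. */
    size_t popcount_buffer(const uint8_t *buf, size_t len)
    {
        size_t total = 0;
        for (size_t i = 0; i < len; ++i) {
            uint8_t b = buf[i];
            while (b) {                    /* Kernighan's trick: clear lowest set bit */
                b &= (uint8_t)(b - 1);
                ++total;
            }
        }
        return total;
    }

On the Haskell side, such a routine would be bound with a foreign import ccall declaration and handed a pointer into a pinned buffer.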
If you need more control than those give you, then Haskell is likely the wrong language. It's impossible to tell from your description whether this is the case for your problem (your requirements seem contradictory: you need to be able to write a carefully cache-tuned algorithm, but arbitrary GC pauses are okay?).
One last note: you can't rely on GHC's native code generator to perform any of the low-level strength reduction optimizations that e.g. GCC performs (GHC's NCG will probably never ever know about bit-twiddling hacks, autovectorization, etc. etc.). Instead you can try the LLVM backend, but whether you see a speedup in your program is by no means guaranteed.

Haskell for mission-critical systems [duplicate]

I've been curious to understand whether it is possible to apply the power of Haskell to the embedded realtime world, and in googling I found the Atom package. I'd assume that in the complex case the code might have all the classical C bugs - crashes, memory corruption, etc. - which would then need to be traced back to the original Haskell code that caused them. So, this is the first part of the question: "If you have had experience with Atom, how did you deal with the task of debugging low-level bugs in the compiled C code and fixing them in the original Haskell code?"
I searched for some more examples for Atom; this blog post mentions the resulting C code being 22 KLOC (and obviously no code is included); the included example is a toy. This and this reference have a bit more practical code, but that is where it ends. And the reason I put "sizable" in the subject is that I'm most interested in whether you might share your experiences of working with generated C code in the range of 300 KLOC+.
As I am a Haskell newbie, there may obviously be other ways that I did not find due to my unknown unknowns, so any other pointers for self-education in this area would be greatly appreciated - and this is the second part of the question: "What would be some other practical methods (if any) of doing real-time development in Haskell?" If multicore is also in the picture, that's an extra plus :-)
(About usage of Haskell itself for this purpose: from what I read in this blog post, garbage collection and laziness in Haskell make it rather nondeterministic scheduling-wise, but maybe in two years something has changed. The "Real world Haskell programming" question on SO was the closest that I could find to this topic.)
Note: "real-time" above would be closer to "hard realtime" - I'm curious whether it is possible to ensure that the pause time, when the main task is not executing, stays under 0.5 ms.
At Galois we use Haskell for two things:
Soft real time (OS device layers, networking), where 1-5 ms response times are plausible. GHC generates fast code, and has plenty of support for tuning the garbage collector and scheduler to get the right timings.
For true real-time systems, EDSLs are used to generate code for other languages that provide stronger timing guarantees, e.g. Cryptol, Atom and Copilot.
So be careful to distinguish the EDSL (Copilot or Atom) from the host language (Haskell).
Some examples of critical systems (and in some cases real-time systems), either written in or generated from Haskell, produced by Galois:
EDSLs
Copilot: A Hard Real-Time Runtime Monitor -- a DSL for real-time avionics monitoring
Equivalence and Safety Checking in Cryptol -- a DSL for cryptographic components of critical systems
Systems
HaLVM -- a lightweight microkernel for embedded and mobile applications
TSE -- a cross-domain (security level) network appliance
It will be a long time before there is a Haskell system that fits in small memory and can guarantee sub-millisecond pause times. The community of Haskell implementors just doesn't seem to be interested in this kind of target.
There is healthy interest in using Haskell or something Haskell-like to compile down to something very efficient; for example, Bluespec compiles to hardware.
I don't think it will meet your needs, but if you're interested in functional programming and embedded systems you should learn about Erlang.
Andrew,
Yes, it can be tricky to debug problems through the generated code back to the original source. One thing Atom provides is a means to probe internal expressions, then leaves it up to the user how to handle these probes. For vehicle testing, we build a transmitter (in Atom) and stream the probes out over a CAN bus. We can then capture this data, format it, and then view it with tools like GTKWave, either in post-processing or in real time. For software simulation, probes are handled differently. Instead of getting probe data from a CAN protocol, hooks are made into the C code to lift the probe values directly. The probe values are then used in the unit testing framework (distributed with Atom) to determine whether a test passes or fails and to calculate simulation coverage.
I don't think Haskell, or other garbage-collected languages, are very well suited to hard-realtime systems, as GCs tend to amortize their runtime costs into short pauses.
Writing in Atom is not exactly programming in Haskell, as Haskell here can be seen as purely a preprocessor for the actual program you are writing.
I think Haskell is an awesome preprocessor, and using DSELs like Atom is probably a great way to create sizable hard-realtime systems, but I don't know whether Atom fits the bill or not. If it doesn't, I'm pretty sure it is possible (and I encourage anyone who tries!) to implement a DSEL that does.
Having a very strong pre-processor like Haskell for a low-level language opens up a huge window of opportunity to implement abstractions through code-generation that are much more clumsy when implemented as C code text generators.
I've been fooling around with Atom. It is pretty cool, but I think it is best for small systems. Yes, it runs in trucks and buses and implements real-world, critical applications, but that doesn't mean those applications are necessarily large or complex. It really is for hard-real-time apps and goes to great lengths to make every operation take the exact same amount of time. For example, instead of an if/else statement that conditionally executes one of two code branches that might differ in running time, it has a "mux" statement that always executes both branches before conditionally selecting one of the two computed values, so the total execution time is the same whichever value is selected (sketched below).
It doesn't have any significant type system other than built-in types (comparable to C's) that are enforced through GADT values passed through the Atom monad. The author is working on a static verification tool that analyzes the output C code, which is pretty cool (it uses an SMT solver), but I think Atom would benefit from more source-level features and checks. Even in my toy-sized app (an LED flashlight controller), I made a number of newbie errors that someone more experienced with the package might avoid, but they resulted in buggy output code that I would rather have had caught by the compiler instead of through testing. On the other hand, it's still at version 0.1.something, so improvements are undoubtedly coming.
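To illustrate the mux idea in general terms (this is not Atom's actual API or generated output), the generated C conceptually computes both arms every step and only then selects one:

    #include <stdint.h>

    /* Conceptual shape of a "mux": both candidate values are computed every
     * step, then one is selected, so execution time does not depend on the
     * condition. Illustrative only -- not Atom's actual generated code. */
    static int32_t mux_i32(int cond, int32_t on_true, int32_t on_false)
    {
        return cond ? on_true : on_false;   /* both arguments already evaluated */
    }

    void step(int sensor_ok, int32_t reading, int32_t *state)
    {
        int32_t filtered = (reading + *state) / 2;   /* arm 1: accept the new sample   */
        int32_t held     = *state;                   /* arm 2: hold the previous value */
        *state = mux_i32(sensor_ok, filtered, held); /* select after computing both    */
    }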

Trivial mathematical problems as language benchmarks

Why do people insist on using trivial mathematical problems like finding numbers in the Fibonacci sequence for language benchmarks? Don't these usually get optimized to relativistic speeds? Isn't the brunt of the bottlenecks usually in I/O, system API calls, operations on strings and structures, processing large quantities of data, abstract object-oriented stuff, etc?
It is a throwback to the old days, when compiler technology for what we would now call basic math was still evolving rapidly.
Now, compiler evolution is more focused on exploiting new instructions for niche operations, 64-bit math, and so on.
Micro-benchmarks such as the ones you mention were useful, though, when evaluating the efficiency of the hotspot compiler when Java was first launched, and in evaluating the efficiency of .NET versus C/C++.
Your suggestion that I/O and system calls are the likely bottlenecks is correct, at least for some space of problems. But I notice you suggested string operations. One person's irrelevant micro-benchmark is another person's critical performance metric.
EDIT: P.S., I also remember using Linpack and other micro-benchmarks to compare versions of the JVM, and to compare vendors of the JVM. From v4 to v5 there was a big jump in perf; I guess the JIT compiler got more effective. Also, IBM's JVM was ahead of Sun's at that time on Windows-x86.
Because if you want to benchmark the language/compiler, these "math problems" are good indicators of the "bare speed" of the generated code. Either they use the iterative solution, which is a tight loop and indicates how well the compiler can push the instructions to the processor, or they use the recursive solution, which indicates how it handles recursive calls of short functions (inlining, tail recursion etc.), although the Ackermann function is usually used for that too.
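For concreteness, the two usual variants of such a Fibonacci benchmark look something like this in C (a sketch, not any particular suite's code):

    #include <stdint.h>

    /* Iterative version: a tight loop, measures how well the compiler keeps
     * the work in registers and schedules the instructions. */
    uint64_t fib_iter(unsigned n)
    {
        uint64_t a = 0, b = 1;
        for (unsigned i = 0; i < n; ++i) {
            uint64_t next = a + b;
            a = b;
            b = next;
        }
        return a;
    }

    /* Naive recursive version: exponentially many short calls, measures call
     * overhead, inlining and (the lack of) tail-call handling. */
    uint64_t fib_rec(unsigned n)
    {
        return n < 2 ? n : fib_rec(n - 1) + fib_rec(n - 2);
    }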
Usually, the benchmark suite for a language contains tests benchmarking other parts as well - e.g. gzip compression, text searching, object creation, virtual function calls, exception throw/catch benchmarks.
The other things you've noticed, syscalls and IO, are usually not included because:
syscalls are in fact not that slow - applications don't spend a significant portion of their time in the kernel, except for tests specifically targeted at them or when something is seriously wrong with the program (see the timing sketch below);
syscall and IO performance does not depend on the language, but rather on the OS and hardware.
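As a rough illustration of how one might check the first point, a harness can time a cheap syscall in a tight loop; this is only a sketch assuming a POSIX system, and the numbers will vary with OS and hardware, which is exactly the second point:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        enum { ITERATIONS = 1000000 };
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < ITERATIONS; ++i)
            (void)getpid();                 /* a cheap system call */
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
        printf("avg per call: %.1f ns\n", ns / ITERATIONS);
        return 0;
    }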
I'd think a simple, well-established algorithm would remove the possibility that the benchmark is biased (whether through ignorance or malice) to favor one language. It is very difficult to write a complex program in two different languages in exactly the same way. Testing something like the efficiency of a multithreaded application in C# vs Java, for example, would require developers skilled in multithreaded development in both languages, and there would still be questions as to whether the benchmark app properly represents the general case, or whether it misrepresents a special case that only one language handles well.
Back when the sieve of Eratosthenes was a popular benchmark for C compilers, I thought it would be funny if one of the compiler authors recognized the sieve code and replaced it with a pre-computed lookup.

Resources