After reading some articles about Whole-Stage Code Generation, my understanding is that Spark does bytecode optimizations to convert a query plan into an optimized execution plan.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-whole-stage-codegen.html
My next question is: even after these bytecode optimizations, the conversion of those bytecode instructions to machine code instructions could still be a bottleneck, because it is done by the JIT alone at runtime, and the JIT needs enough runs before that optimization takes place.
So does Spark do anything related to dynamic/runtime conversion of the optimized bytecode (the outcome of whole-stage codegen) to machine code, or does it rely on the JIT to convert those bytecode instructions to machine code instructions? If it relies on the JIT, then certain uncertainties are involved.
Spark does bytecode optimizations to convert a query plan into an optimized execution plan.
Spark SQL does not do bytecode optimizations.
Spark SQL simply uses the CollapseCodegenStages physical preparation rule and eventually converts a query plan into single-method Java source code (which Janino compiles and turns into bytecode).
So does Spark do anything related to dynamic/runtime conversion of the optimized bytecode
No.
Speaking of the JIT, WholeStageCodegenExec does check whether the whole-stage codegen produces code that is "too long", i.e. whose compiled size could exceed the spark.sql.codegen.hugeMethodLimit internal Spark SQL property (which is 8000 by default and is the value of HugeMethodLimit in the OpenJDK JVM settings).
The maximum bytecode size of a single compiled Java function generated by whole-stage codegen. When the compiled function exceeds this threshold, the whole-stage codegen is deactivated for this subtree of the current query plan. The default value is 8000 and this is a limit in the OpenJDK JVM implementation.
There are not that many physical operators that support CodegenSupport, so reviewing their doConsume and doProduce methods should reveal whether, if at all, the JIT might not kick in.
Related
I noticed that as my project grows, the release compilation/build time gets slower at a faster pace than I expected (and hoped for). I decided to look into what I could do to improve compilation speed. I am not talking about the initial build time, which involves compilation of dependencies and is largely irrelevant.
One thing that seems to be helping significantly is the incremental = true profile setting. On my project, it seems to shorten build time by ~40% on 4+ cores. With fewer cores the gains are even larger, as builds with incremental = true don't seem to use (much) parallelization. With the default (for --release) incremental = false build times are 3-4 times slower on a single core, compared to 4+ cores.
What are the reasons to refrain from using incremental = true for production builds? I don't see any (significant) increase in binary size or storage size of cached objects. I read somewhere it is possible that incremental builds lead to slightly worse performance of the built binary. Is that the only reason to consider or are there others, like stability, etc.?
I know this could vary, but is there any data available on how much of a performance impact might be expected on real-world applications?
Don't use an incremental build for production releases, because it is:
not reproducible (i.e. you can't get the exact same binary by compiling it again) and
quite possibly subtly broken (incremental compilation is way more complex and way less tested than clean compilation, in particular with optimizations turned on).
What is the difference between the HotSpot JVM interpreter and the JIT? I got confused by a statement in a book I read, that the interpreter executes the code line by line. Does that mean the interpreter translates the bytecode to machine code and then executes it?
A compiler and an interpreter do the same job of converting high-level code into machine-understandable code.
An interpreter takes a single instruction of the code, translates it into intermediate code and then into machine code, executes it, then takes the next instruction, and continues this way for all the instructions.
A JIT compiler performs optimizations and converts the entire code into machine-understandable code at once, by scanning all of the code rather than one instruction at a time.
You can find more details here: http://www.whizlabs.com/blog/what-is-just-in-time-compiler-difference-between-compiler-and-interpreter/
I would like to know how to profile a __device__ function that is inside a __global__ function with Nsight 2.2 on Visual Studio 2010. I need to know which function is consuming the most resources and time. I have CUDA 5.0 on a compute capability 2.0 device.
Nsight Visual Studio Edition 3.0 CUDA Profiler introduces source correlated experiments. The Profile CUDA Activity supports the following source level experiments:
Instruction Count - Collects instructions executed, thread instructions executed, active thread histogram, predicated thread histogram for every user instruction in the kernel. Information on syscalls (printf) is not collected.
Divergent Branch - Collects branch taken, branch not taken, and divergence count for flow control instructions.
Memory Transactions - Collects transaction counts, ideal transaction counter, and requested bytes for global, local, and shared memory instructions.
This information is collected per SASS instruction. If the kernel is compiled with -lineinfo (--generate-line-info) the information can be rolled up to PTX and high level source code. Since this data is rolled up from SASS some statistics may not be intuitive to the high level source. For example a branch statistic may show 100% not taken when you expected 100% taken. If you look at the SASS code you may see that the compiler reversed the conditional.
Please also note that on optimized builds the compiler is sometimes unable to maintain line table information.
At this time, hardware performance counters and timing are only available at the kernel level.
Device code timing can be done using clock() and clock64(), as mentioned in the comments. This is a very advanced technique that requires both the ability to understand SASS and to interpret the results with respect to the SM warp schedulers.
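For illustration, here is a minimal sketch of that technique; the kernel timedKernel and the __device__ function work are hypothetical stand-ins, and each thread simply records the elapsed SM clock cycles around the call and writes them to global memory for inspection on the host:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical __device__ function to be timed.
    __device__ float work(float x) {
        for (int i = 0; i < 100; ++i)
            x = x * 1.0001f + 0.5f;
        return x;
    }

    // Each thread measures the SM clock cycles spent in work().
    __global__ void timedKernel(float *out, long long *cycles) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        long long start = clock64();
        float r = work((float)tid);
        long long stop = clock64();
        out[tid] = r;                 // keep the result so work() is not optimized away
        cycles[tid] = stop - start;   // raw per-thread cycle count
    }

    int main() {
        const int n = 256;
        float *dOut;
        long long *dCycles;
        cudaMalloc(&dOut, n * sizeof(float));
        cudaMalloc(&dCycles, n * sizeof(long long));

        timedKernel<<<1, n>>>(dOut, dCycles);

        long long hCycles[n];
        cudaMemcpy(hCycles, dCycles, n * sizeof(long long), cudaMemcpyDeviceToHost);
        printf("thread 0 spent %lld cycles in work()\n", hCycles[0]);

        cudaFree(dOut);
        cudaFree(dCycles);
        return 0;
    }

As the answer notes, these are raw SM cycle counts per thread, so interpreting them still requires accounting for how the warps were scheduled.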
I'm after a method of converting a single program to run on multiple computers on a network (think "grid computing").
I'm using MSVC 2007 and C++ (non-.NET).
The program I've written is ideally suited for parallel programming (it's doing analysis of scientific data), so the more computers the better.
The classic answer for this would be MPI (Message Passing Interface). It requires a bit of work to get your program to work well with message passing, but the end result is that you can easily launch your executable across a cluster of machines that are running an MPI daemon.
There are several implementations. I've worked with MPICH, but I might consider doing this with Boost MPI (which didn't exist last time I was in the neighborhood).
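To give a feel for the programming model, here is a minimal sketch (the work split and the analysis loop are made up, not taken from the question): each process computes a partial result over its share of the items and the pieces are combined on rank 0.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's id
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes

        // Split 1000 hypothetical work items evenly across the processes.
        const int totalItems = 1000;
        int chunk = (totalItems + size - 1) / size;
        int begin = rank * chunk;
        int end = (begin + chunk < totalItems) ? begin + chunk : totalItems;

        double localSum = 0.0;
        for (int i = begin; i < end; ++i)
            localSum += i * 0.5;                // stand-in for the real analysis

        // Combine the partial results on rank 0.
        double globalSum = 0.0;
        MPI_Reduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("result = %f (from %d processes)\n", globalSum, size);

        MPI_Finalize();
        return 0;
    }

Launched with something like mpiexec -n 8 myprog across the cluster, the same executable runs in every process and the MPI runtime handles the messaging between machines.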
Firstly, this topic is covered here:
https://stackoverflow.com/questions/2258332/distributed-computing-in-c
Secondly, a search for "C++ grid computing library", "grid computing for visual studio" and "C++ distributed computing library" returned the following:
OpenMP+OpenMPI. OpenMP handles running a single C++ program on multiple CPU cores within the same machine, while OpenMPI handles the messaging between multiple machines. OpenMP + OpenMPI = grid computing (a hybrid sketch of this combination follows this list).
POP-C++, see http://gridgroup.hefr.ch/popc/.
Xoreax Grid Engine, see http://www.xoreax.com/high_performance_grid_computing.htm. Xoreax focuses on speeding up Visual Studio builds, but the Xoreax Grid Engine can also be applied to generic applications. Quoting http://www.xoreax.com/xge_xoreax_grid_engine.htm: "Once a task-set (a set of tasks for distribution along with their dependency definitions) is defined through one of the interfaces described below, it can be executed on any machine running an IncrediBuild Agent." See also the accompanying CodeProject article at http://www.codeproject.com/KB/showcase/Xoreax-Grid.aspx
Alchemi, see http://www.codeproject.com/KB/threads/alchemi.aspx.
RightScale, see http://www.rightscale.com/pdf/Grid-Whitepaper-Technical.pdf. A quote from the examples section of this paper: "Pharmaceutical protein analysis: Several million protein compound comparisons were performed in less than a day – a task that would have taken over a week on the customer’s internal resources ..."
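As a rough sketch of the OpenMP+MPI combination from the first item above (the analyze function and the work split are invented for illustration), each MPI process represents one machine and OpenMP spreads that process's slice of the work across its local cores:

    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    // Hypothetical per-item analysis.
    static double analyze(int item) {
        return item * 0.001;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int totalItems = 100000;
        double localSum = 0.0;

        // OpenMP parallelizes this rank's share of the items across local cores.
        #pragma omp parallel for reduction(+:localSum)
        for (int i = rank; i < totalItems; i += size)
            localSum += analyze(i);

        // MPI combines the per-machine results.
        double globalSum = 0.0;
        MPI_Reduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %f using %d processes x up to %d threads each\n",
                   globalSum, size, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }

Compiled with an MPI compiler wrapper plus the compiler's OpenMP flag, this is the usual pattern behind "OpenMP within a node, MPI between nodes".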
I'm writing a ray tracer.
Recently, I added threading to the program to exploit the additional cores on my i5 Quad Core.
In a weird turn of events the debug version of the application is now running slower, but the optimized build is running faster than before I added threading.
I'm passing the "-g -pg" flags to gcc for the debug build and the "-O3" flag for the optimized build.
Host system: Ubuntu Linux 10.4 AMD64.
I know that debug symbols add significant overhead to the program, but the relative performance has always been maintained. I.e. a faster algorithm will always run faster in both debug and optimization builds.
Any idea why I'm seeing this behavior?
Debug version is compiled with "-g3 -pg". Optimized version with "-O3".
Optimized no threading: 0m4.864s
Optimized threading: 0m2.075s
Debug no threading: 0m30.351s
Debug threading: 0m39.860s
Debug threading after "strip": 0m39.767s
Debug no threading (no-pg): 0m10.428s
Debug threading (no-pg): 0m4.045s
This convinces me that "-g3" is not to blame for the odd performance delta, but rather the "-pg" switch. It's likely that the "-pg" option adds some sort of locking mechanism to measure thread performance.
Since "-pg" is broken on threaded applications anyway, I'll just remove it.
What do you get without the -pg flag? That's not debugging symbols (which don't affect the code generation), that's for profiling (which does).
It's quite plausible that profiling in a multithreaded process requires additional locking which slows the multithreaded version down, even to the point of making it slower than the non-multithreaded version.
You are talking about two different things here: debug symbols and compiler optimization. If you use the strongest optimization settings the compiler has to offer, you do so at the cost of losing symbols that are useful in debugging.
Your application is not running slower due to debugging symbols, it's running slower because of less optimization done by the compiler.
Debugging symbols are not 'overhead' beyond the fact that they occupy more disk space. Code compiled at maximum optimization (-O3) should not be adding debug symbols. That's a flag that you would set when you have no need for said symbols.
If you need debugging symbols, you gain them at the expense of losing compiler optimization. However, once again, this is not 'overhead', it's just the absence of compiler optimization.
Is the profile code inserting instrumentation calls in enough functions to hurt you?
If you single-step at the assembly language level, you'll find out pretty quick.
Multithreaded code execution time is not always measured as expected by gprof.
You should time your code with another timer in addition to gprof to see the difference.
My example: running the LULESH CORAL benchmark on a two-NUMA-node Intel Sandy Bridge machine (8 cores + 8 cores) with size -s 50 and 20 iterations (-i 20), compiled with gcc 6.3.0 and -O3, I get:
With 1 thread running: ~3.7 without -pg and ~3.8 with it, but according to the gprof analysis the code ran for only 3.5.
With 16 threads running: ~0.6 without -pg and ~0.8 with it, but according to the gprof analysis the code ran for ~4.5 ...
The times with and without -pg above were measured with gettimeofday, outside the parallel region (at the start and end of the main function).
Therefore, if you had measured your application's time the same way, you would probably have seen the same speedup with and without -pg. It is just the gprof measurement that is wrong for parallel code, at least for the OpenMP version of LULESH.
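For reference, here is a minimal sketch of the kind of independent wall-clock measurement described above; runParallelWork is a hypothetical stand-in for the program's threaded section (OpenMP here, as in the LULESH example), bracketed by two gettimeofday calls in main:

    #include <sys/time.h>
    #include <cstdio>

    // Hypothetical stand-in for the program's parallel work
    // (e.g. the threaded ray-tracing loop).
    static double runParallelWork() {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < 100000000L; ++i)
            sum += i * 1e-9;
        return sum;
    }

    int main() {
        timeval start, stop;

        gettimeofday(&start, NULL);          // wall clock before the parallel work
        double result = runParallelWork();
        gettimeofday(&stop, NULL);           // wall clock after the parallel work

        double elapsed = (stop.tv_sec - start.tv_sec)
                       + (stop.tv_usec - start.tv_usec) * 1e-6;
        printf("result=%f wall time=%.3f s\n", result, elapsed);
        return 0;
    }

Building this with and without -pg (and with -fopenmp) and comparing the printed wall time against gprof's totals makes it easy to see whether gprof is misreporting the threaded runtime.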