CUDA: Nsight VS2010 profile __device__ function - visual-c++

I would like to know how to profile a __device__ function which is inside a __global__ function with Nsight 2.2 on visual studio 2010. I need to know which function is consuming a lot of resources and time. I have CUDA 5.0 on CC 2.0.

Nsight Visual Studio Edition 3.0 CUDA Profiler introduces source correlated experiments. The Profile CUDA Activity supports the following source level experiments:
Instruction Count - Collects instructions executed, thread instructions executed, active thread histogram, predicated thread histogram for every user instruction in the kernel. Information on syscalls (printf) is not collected.
Divergent Branch - Collects branch taken, branch not taken, and divergence count for flow control instructions.
Memory Transactions - Collects transaction counts, ideal transaction count, and requested bytes for global, local, and shared memory instructions.
This information is collected per SASS instruction. If the kernel is compiled with -lineinfo (--generate-line-info), the information can be rolled up to PTX and to the high-level source code. Since this data is rolled up from SASS, some statistics may not be intuitive with respect to the high-level source. For example, a branch statistic may show 100% not taken when you expected 100% taken. If you look at the SASS code, you may see that the compiler reversed the conditional.
Please also note that on optimized builds the compiler is sometimes unable to maintain line table information.
At this time, hardware performance counters and timing are only available at the kernel level.
Device code timing can be done using clock() and clock64(), as mentioned in the comments. This is a very advanced technique that requires both the ability to understand SASS and the ability to interpret the results with respect to the SM warp schedulers.
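As a rough illustration of the clock64() approach, here is a minimal sketch; the kernel body, the placeholder work, and the launch configuration are hypothetical, and the raw per-SM cycle counts need the careful interpretation described above. Compiling with -lineinfo also helps correlate any hot spots back to source.

__global__ void timed_kernel(float *data, long long *cycles)   // hypothetical kernel
{
    long long start = clock64();                // read the SM cycle counter

    // ... call the __device__ function you want to time ...
    data[threadIdx.x] *= 2.0f;                  // placeholder work

    long long stop = clock64();
    if (threadIdx.x == 0)                       // record one raw cycle count per block
        cycles[blockIdx.x] = stop - start;
}

The host can then copy the cycles array back and compare per-block counts; dividing by the SM clock rate gives an approximate time.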

Related

How does Spark do bytecode to machine code instructions run time conversion?

After reading some articles about Whole Stage Code Generation, I understand that Spark does bytecode optimizations to convert a query plan to an optimized execution plan.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-whole-stage-codegen.html
My next question is: even after these bytecode-level optimizations, the conversion of those bytecode instructions to machine code instructions could still be a bottleneck, because it is done by the JIT alone during the runtime of the process, and for that optimization to take place the JIT has to have enough runs.
So does Spark do anything related to dynamic/runtime conversion of the optimized bytecode (which is the outcome of whole-stage code gen) to machine code, or does it rely on the JIT to convert those bytecode instructions to machine code instructions? Because if it relies on the JIT, then there are certain uncertainties involved.
spark does bytecode optimizations to convert a query plan to an optimized execution plan.
Spark SQL does not do bytecode optimizations.
Spark SQL simply uses the CollapseCodegenStages physical preparation rule and eventually converts a query plan into single-method Java source code (which Janino compiles to bytecode).
So does spark do anything related to dynamic/runtime conversion of optimized bytecode
No.
Speaking of the JIT, WholeStageCodegenExec checks whether the whole-stage-generated code is "too long", i.e. above the spark.sql.codegen.hugeMethodLimit Spark SQL internal property (8000 by default, which is the value of HugeMethodLimit in the OpenJDK JVM settings).
The maximum bytecode size of a single compiled Java function generated by whole-stage codegen. When the compiled function exceeds this threshold, the whole-stage codegen is deactivated for this subtree of the current query plan. The default value is 8000 and this is a limit in the OpenJDK JVM implementation.
There are not that many physical operators that support CodegenSupport, so reviewing their doConsume and doProduce methods should reveal whether, if at all, the JIT might not kick in.

Writing CUDA program for more than one GPU

I have more than one GPU and want to execute my kernels on them. Is there an API or software that can schedule/manage GPU resources dynamically, utilizing the resources of all available GPUs for the program?
Ideally it would be a utility that periodically reports the available resources, so my program can launch that many threads to the GPUs.
Secondly, I am using Windows + Visual Studio for my development. I have read that CUDA is supported on Linux. What changes do I need to make to my program?
I have more than one GPU and want to execute my kernels on them. Is there an API or software that can schedule/manage GPU resources dynamically?
For arbitrary kernels that you write, there is no API that I am aware of (certainly no CUDA API) that "automatically" makes use of multiple GPUs. Today's multi-GPU aware programs often use a strategy like this:
detect how many GPUs are available
partition the data set into chunks based on the number of GPUs available
successively transfer the chunks to each GPU, and launch the computation kernel on each GPU, switching GPUs using cudaSetDevice().
A program that follows the above approach, approximately, is the cuda simpleMultiGPU sample code. Once you have worked out the methodology for 2 GPUs, it's not much additional effort to go to 4 or 8 GPUs. This of course assumes your work is already separable and the data/algorithm partitioning work is "done".
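A minimal sketch of that pattern follows; it assumes a hypothetical kernel myKernel, a host input array h_in of N floats that divides evenly across the GPUs, and it omits error checking and pinned memory, so it is a simplification of what simpleMultiGPU does:

int nGpus = 0;
cudaGetDeviceCount(&nGpus);                       // 1. detect how many GPUs are available

size_t chunk = N / nGpus;                         // 2. partition the data (assumes N % nGpus == 0)

float *d_in[8];                                   // device pointers, one per GPU (assumes nGpus <= 8)
for (int dev = 0; dev < nGpus; ++dev) {           // 3. transfer a chunk and launch on each GPU
    cudaSetDevice(dev);                           // switch the current GPU
    cudaMalloc(&d_in[dev], chunk * sizeof(float));
    cudaMemcpyAsync(d_in[dev], h_in + dev * chunk, chunk * sizeof(float),
                    cudaMemcpyHostToDevice);
    myKernel<<<(chunk + 255) / 256, 256>>>(d_in[dev], chunk);   // hypothetical kernel
}

for (int dev = 0; dev < nGpus; ++dev) {           // wait for every GPU to finish
    cudaSetDevice(dev);
    cudaDeviceSynchronize();
}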
I think this is an area of active research in many places, so if you do a google search you may turn up papers like this one or this one. Whether these are of interest to you will probably depend on your exact needs.
There are some new developments with CUDA libraries available with CUDA 6 that can perform certain specific operations (e.g. BLAS, FFT) "automatically" using multiple GPUs. To investigate this further, review the relevant CUBLAS XT documentation and CUFFT XT multi-GPU documentation and sample code. As far as I know, at the current time, these operations are limited to 2 GPUs for automatic work distribution. And these allow for automatic distribution of specific workloads (BLAS, FFT) not arbitrary kernels.
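As an illustration of the library approach, a minimal CUBLAS-XT sketch might look like the following; the matrix size n and the host arrays A, B, C are placeholders, and the exact device-selection behaviour is described in the CUBLAS XT documentation linked above:

#include <cublasXt.h>

cublasXtHandle_t handle;
cublasXtCreate(&handle);

int devices[2] = {0, 1};                          // ask the library to spread the GEMM over two GPUs
cublasXtDeviceSelect(handle, 2, devices);

float alpha = 1.0f, beta = 0.0f;
// A, B and C are ordinary host pointers; the library tiles the matrices
// and distributes the work across the selected GPUs.
cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, A, n, B, n, &beta, C, n);

cublasXtDestroy(handle);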
Secondly, I am using Windows + Visual Studio for my development. I have read that CUDA is supported on Linux. What changes do I need to make to my program?
With the exception of the OGL/DX interop APIs, CUDA is mostly orthogonal to the choice of Windows or Linux as a platform. The typical IDEs are different (Windows: Nsight Visual Studio Edition, Linux: Nsight Eclipse Edition), but your code changes will mostly consist of ordinary porting differences between Windows and Linux. If you want to get started with Linux, follow the getting started document.

linux - How to determine time taken by each function in C Program

I want to check the time taken by each function and the system calls made by each function in my project. My code runs in user space as well as kernel space, so I need the time taken in both. I am interested in performance in terms of CPU time and disk IO. Should I use a profiler tool? If yes, which one is preferable? Or what other options do I have?
Please help, thanks.
For kernel-level profiling, the time taken by some instructions or functions can be measured in clock ticks. To find out how many clock ticks a given task actually used, you can read the CPU's timestamp counter from kernel code, for example:
#include <asm/msr.h>   /* provides rdtscl(); x86-specific, reads the low 32 bits of the TSC */

unsigned long ini, end;

rdtscl(ini);
/* ... your code ... */
rdtscl(end);
printk(KERN_INFO "time lapse in cpu clock ticks: %lu\n", end - ini);
For more details see http://www.xml.com/ldd/chapter/book/ch06.html
And if your code takes longer, you can also use jiffies effectively.
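A minimal jiffies-based sketch (kernel code; only useful when the measured work spans at least a few timer ticks) could look like this:

#include <linux/jiffies.h>

unsigned long t0 = jiffies;                        /* tick count at the start */
/* ... your code ... */
printk(KERN_INFO "elapsed: %u ms\n", jiffies_to_msecs(jiffies - t0));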
For user-space profiling you can use various timing functions which give the time in nanosecond resolution, or oprofile (http://oprofile.sourceforge.net/about/); also refer to this: Timer function to provide time in nano seconds using C++.
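For instance, a minimal user-space sketch with clock_gettime (CLOCK_MONOTONIC, nanosecond resolution; the timed section is a placeholder):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... the function you want to time ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long long ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                 + (t1.tv_nsec - t0.tv_nsec);
    printf("elapsed: %lld ns\n", ns);
    return 0;
}

On older glibc versions you may need to link with -lrt for clock_gettime.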
For kernel-space function tracing and profiling (which includes a call-graph format and the time taken by individual functions), consider using the Ftrace framework.
Specifically for function profiling (within the kernel), enable the CONFIG_FUNCTION_PROFILER kernel config: under Kernel Hacking / Tracing / Kernel function profiler.
Its help text:
CONFIG_FUNCTION_PROFILER:
This option enables the kernel function profiler. A file is created
in debugfs called function_profile_enabled which defaults to zero.
When a 1 is echoed into this file profiling begins, and when a
zero is entered, profiling stops. A "functions" file is created in
the trace_stats directory; this file shows the list of functions that
have been hit and their counters.
Some resources:
Documentation/trace/ftrace.txt
Secrets of the Ftrace function tracer
Using ftrace to Identify the Process Calling a Kernel Function
Well, I only develop in user space, so I don't know how much this will help you with disk IO or kernel-space profiling, but I have profiled a lot with oprofile.
I haven't used it in a while, so I cannot give you a step-by-step guide, but you may find more information here:
http://oprofile.sourceforge.net/doc/results.html
Usually this helped me find my problems.
You may have to play a bit with the opreport output to get the results you want.

Does ARM sit idle while NEON is doing its operations?

This might look similar to: ARM and NEON can work in parallel?, but it's not the same; I have some other issue (maybe a problem with my understanding):
In the protocol stack, the checksum computation is done on the GPP; I am now handing that task over to NEON as part of a function.
Here is the checksum function that I have written as a part of NEON, posted on Stack Overflow: Checksum code implementation for Neon in Intrinsics
Now, suppose this function is called from Linux:
ip_csum() {
    ...
    csum = do_csum();    /* function call from the ARM side */
    ...
}

do_csum() {
    ...
    /* NEON-optimised code */
    ...
    return csum;         /* the final checksum goes back to ip_csum / Linux / ARM */
}
In this case, what happens to the ARM core while NEON is doing the calculations? Does the ARM sit idle, or does it move on to other operations?
As you can see, do_csum is called and we are waiting on that result (or that is what it looks like).
NOTE:
Speaking in terms of the Cortex-A8.
do_csum, as you can see from the link, is coded with intrinsics.
Compilation uses the GNU toolchain.
It would be good if you also cover multi-threading or any other concept that comes into the picture when these operations interact.
Questions:
Does ARM sit idle while NEON is doing its operations? ( in this particular case)
Or does it shelve the current ip_csum-related code and take up another process/thread till NEON is done? (I really don't know what happens here.)
If it is sitting idle, how can we make the ARM work on something else till NEON is done?
(Image from TI Wiki Cortex A8)
The ARM (or rather the integer pipeline) does not sit idle while NEON instructions are processing. In the Cortex-A8, the NEON unit is at the "end" of the processor pipeline: instructions flow through the pipeline, and if they are ARM instructions they are executed at the "beginning" of the pipeline, while NEON instructions are executed at the end. Every clock pushes the instructions down the pipeline.
Here are some hints on how to read the diagram above:
Every cycle, if possible, the processor fetches an instruction pair (two instructions).
Fetching is pipelined so it takes 3 cycles for the instructions to propagate into the decode unit.
It takes 5 cycles (D0-D4) for the instruction to be decoded. Again, this is all pipelined, so it affects latency but not throughput. More instructions keep flowing through the pipeline where possible.
Now we reach the execute/load store portion. NEON instructions flow through this stage (but they do that while other instructions are possibly executing).
We get to the NEON portion: if the instruction fetched 13 cycles ago was a NEON instruction, it is now decoded and executed in the NEON pipeline.
While this is happening, integer instructions that followed that instruction can execute at the same time in the integer pipeline.
The pipeline is a fairly complex beast, some instructions are multi-cycle, some have dependencies and will stall if those dependencies are not met. Other events such as branches will flush the pipeline.
If you are executing a sequence that is 100% NEON instructions (which is pretty rare, since there are usually some ARM registers involved, control flow, etc.), then there is some period where the integer pipeline isn't doing anything useful. Most code will have the two executing concurrently for at least some of the time, while cleverly engineered code can maximize performance with the right instruction mix.
This online tool Cycle Counter for Cortex A8 is great for analyzing the performance of your assembly code and gives information about what is executing in what units and what is stalling.
In the Application Level Programmers' Model, you can't really distinguish between the ARM and NEON units.
While NEON is a separate hardware unit (available as an option on Cortex-A series processors), it is the ARM core that drives it in a tight fashion. It is not a separate DSP that you can communicate with asynchronously.
You can write better code by fully utilizing the pipelines of both units, but this is not the same as having a separate core.
The NEON unit is there because it can do some operations (SIMD) much faster than the ARM unit, at a low clock frequency.
This is like having a friend who is good at math: whenever you have a hard question, you can ask him. While waiting for the answer you can do some small things, like deciding "if the answer is this, I will do this; otherwise I will do that", but if you depend on that answer to go on, you need to wait for him before going further. You could calculate the answer yourself, but asking is much faster, even including the communication time between the two of you, compared to doing all the math yourself. I think you can even extend the analogy: you also need to buy lunch for that friend (energy consumption), but in many cases it is worth it.
Anyone who says the ARM core can do other things while the NEON unit is working on its stuff is talking about instruction-level parallelism, not anything like task-level parallelism.
The ARM is not "idle" while NEON operations are executed; it controls them.
To fully use the power of both units, one can carefully plan an interleaved sequence of operations:
loop:
    subs     r0, r0, r1              @ ARM operation (update the loop counter)
    vadd.i16 q0, q0, q1              @ NEON operation
    ldr      r3, [r1, r2, lsl #2]    @ ARM operation (load into r3 so the counter in r0 is preserved)
    vld1.32  {d0}, [r1]!             @ NEON operation using an ARM register
    bne      loop                    @ ARM operation controlling the flow of both units
The ARM Cortex-A8 can execute up to 2 instructions in each clock cycle. If both of them are independent NEON operations, there is no use putting an ARM instruction in between. On the other hand, if one knows that the latency of a VLD (load) is large, one can place many ARM instructions between the load and the first use of the loaded value. But in each case the combined usage must be planned in advance and interleaved.

Why does my code run slower with multiple threads than with a single thread when it is compiled for profiling (-pg)?

I'm writing a ray tracer.
Recently, I added threading to the program to exploit the additional cores on my i5 Quad Core.
In a weird turn of events the debug version of the application is now running slower, but the optimized build is running faster than before I added threading.
I'm passing the "-g -pg" flags to gcc for the debug build and the "-O3" flag for the optimized build.
Host system: Ubuntu Linux 10.4 AMD64.
I know that debug symbols add significant overhead to the program, but the relative performance has always been maintained, i.e. a faster algorithm will always run faster in both debug and optimized builds.
Any idea why I'm seeing this behavior?
Debug version is compiled with "-g3 -pg". Optimized version with "-O3".
Optimized no threading: 0m4.864s
Optimized threading: 0m2.075s
Debug no threading: 0m30.351s
Debug threading: 0m39.860s
Debug threading after "strip": 0m39.767s
Debug no threading (no-pg): 0m10.428s
Debug threading (no-pg): 0m4.045s
This convinces me that "-g3" is not to blame for the odd performance delta, but that it's rather the "-pg" switch. It's likely that the "-pg" option adds some sort of locking mechanism to measure thread performance.
Since "-pg" is broken on threaded applications anyway, I'll just remove it.
What do you get without the -pg flag? That's not debugging symbols (which don't affect the code generation), that's for profiling (which does).
It's quite plausible that profiling in a multithreaded process requires additional locking which slows the multithreaded version down, even to the point of making it slower than the non-multithreaded version.
You are talking about two different things here: debug symbols and compiler optimization. If you use the strongest optimization settings the compiler has to offer, you do so at the cost of losing symbols that are useful in debugging.
Your application is not running slower due to debugging symbols; it's running slower because of less optimization done by the compiler.
Debugging symbols are not 'overhead' beyond the fact that they occupy more disk space. Code compiled at maximum optimization (-O3) should not be adding debug symbols. That's a flag you would set when you have no need for said symbols.
If you need debugging symbols, you gain them at the expense of losing compiler optimization. But once again, this is not 'overhead'; it's just the absence of compiler optimization.
Is the profile code inserting instrumentation calls in enough functions to hurt you?
If you single-step at the assembly language level, you'll find out pretty quick.
Multithreaded code execution time is not always measured as expected by gprof.
You should time your code with another timer in addition to gprof to see the difference.
My example: running the LULESH CORAL benchmark on a 2-NUMA-node Intel Sandy Bridge (8 cores + 8 cores) with size -s 50 and 20 iterations (-i), compiled with gcc 6.3.0 at -O3, I get:
With 1 thread running: ~3.7 without -pg and ~3.8 with it, but according to the gprof analysis the code ran for only 3.5.
With 16 threads running: ~0.6 without -pg and ~0.8 with it, but according to the gprof analysis the code ran for ~4.5 ...
Those times were measured with gettimeofday, outside the parallel region (at the start and end of the main function).
Therefore, if you had measured your application's time the same way, you would probably have seen the same speedup with and without -pg. It is just the gprof measurement that is wrong for parallel code; at least it is in the LULESH OpenMP version.
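To make that kind of comparison yourself, a minimal wall-clock sketch with gettimeofday around the threaded region (the worker launch is a placeholder) would be:

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    /* ... launch the worker threads and join them (placeholder) ... */
    gettimeofday(&t1, NULL);

    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("wall-clock time: %.3f s\n", seconds);   /* compare against gprof's figure */
    return 0;
}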
