Galaxy S6 Renderscript performance - samsung-mobile

I have some problems with Renderscript. I ran some tests to evaluate the performance of Renderscript GPU compute, using the ImageProcessing Renderscript benchmark (android/platform/frameworks/rs/java/tests/ImageProcessing). I also used "adb shell setprop debug.rs.default-CPU-driver 1" to force the scripts to run on the CPU.
The execution times obtained are:
Test                              GPU            CPU
Levels Vec3 Relaxed               13.594595ms    13.413333ms
Levels Vec4 Relaxed               14.4ms         14.027778ms
Levels Vec3 Full                  14.594203ms    15.0ms
Levels Vec4 Full                  15.227273ms    15.242424ms
Blur radius 25                    388.0ms        379.66666ms
Intrinsic Blur radius 25          52.842106ms    52.1ms
Greyscale                         13.302631ms    13.493333ms
Grain                             136.25ms       137.5ms
Fisheye Full                      57.61111ms     59.235294ms
Fisheye Relaxed                   59.764706ms    57.055557ms
Fisheye Approximate Full          54.473682ms    58.555557ms
Fisheye Approx Relaxed            58.555557ms    55.833332ms
Vignette Full                     28.885714ms    27.86111ms
Vignette Relaxed                  29.028572ms    28.166666ms
Vignette Approximate Full         22.288889ms    21.680851ms
Vignette Approx Relaxed           21.553192ms    21.76087ms
Group Test (emulated)             6.4166665ms    6.429487ms
Group Test (native)               6.335443ms     6.3757963ms
Convolve 3x3                      38.653847ms    39.423077ms
Intrinsics Convolve 3x3           4.2777777ms    4.3608694ms
...
The CPU and GPU executions show no difference. It looks like the S6 always chooses the CPU instead of the GPU. I ran the same tests on other devices, and there the GPU execution is noticeably faster than the CPU.
Is the ARM Mali Renderscript GPU driver included in the S6 firmware?

Timing issues while synthesizing a design in Verilog

I am working on a decoder module based on BCH codes. The design is to be implemented on a Virtex-7 FPGA. I have basically three blocks: a syndrome computation block, an error locator finder, and an error locator solver. The syndrome computation block is working fine on the FPGA at a 225 MHz clock. I am now working on the error locator finder block, and it is giving me some timing issues. The problem is essentially this:
1) I have a module that consists of just a case statement with 1024 entries. The failing path contains this module; if I comment it out, the design meets timing. In the implemented design this module is placed too far away, and because of this I am getting a huge net/wire delay. Is there a way for me to have the case-based module placed closer to the rest of my design?
Any help will be appreciated. The net delay accounts for at least 60 percent of the total delay. This is unacceptable for the problem I am trying to solve, as this decoder needs to run at 200 MHz or above.
In the previous Xilinx FPGA tool suite, ISE, you had the ability to vary the Placer Cost Table (PCT), which results in different placement of the logic cells and hence varying timing results. The PCT could be iterated over in different implementation runs (using SmartXplorer), which were repeated until a PCT was found that met timing.
Xilinx dropped this strategy due to its ineffectiveness on large FPGAs (like your Virtex-7 device). Instead, you have several predefined implementation strategies, which can also be run in parallel. Just open the Implementation Settings, try different strategies, and see if one works.
If not, you have to optimize your design at the HDL level. In general, pipelining is a good strategy, but it depends strongly on your code. You generally have to break up large combinational constructs, and your case statement with 1024 entries is exactly such a candidate.
Edit: See Xilinx UG904 to get an overview and a brief description of the different implementation strategies.

Can we use `shuffle()` instruction for reg-to-reg data-exchange between items (threads) in WaveFront?

As is known, a WaveFront (AMD OpenCL) is very similar to a warp (CUDA): http://research.cs.wisc.edu/multifacet/papers/isca14-channels.pdf
GPGPU languages, like OpenCL™ and CUDA, are called SIMT because they
map the programmer's view of a thread to a SIMD lane. Threads
executing on the same SIMD unit in lockstep are called a wavefront
(warp in CUDA).
It is also known that AMD suggests performing reduction (summing numbers) through local memory, and recommends vector types to accelerate the reduction: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/01/AMD_OpenCL_Tutorial_SAAHPC2010.pdf
But are there any optimized register-to-register data-exchange instructions between items (threads) in a WaveFront,
such as int __shfl_down(int var, unsigned int delta, int width=warpSize); in WARP (CUDA): https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
or such as __m128i _mm_shuffle_epi8(__m128i a, __m128i b); SIMD-lanes on x86_64: https://software.intel.com/en-us/node/524215
Such a shuffle instruction can, for example, reduce (sum) 8 elements from 8 threads/lanes in 3 steps, without any synchronization and without using any cache/local/shared memory (which has ~3 cycles of latency per access).
That is, each thread sends its value directly into a register of another thread: https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
In OpenCL we only have the instruction gentypen shuffle( gentypem x, ugentypen mask ), which works only on vector types such as float16/uint16 within each item (thread), not between items (threads) of a WaveFront: https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/shuffle.html
Can we use something like shuffle() for register-to-register data exchange between items (threads) in a WaveFront that is faster than data exchange via local memory?
Does AMD OpenCL provide instructions for register-to-register data exchange within a WaveFront, such as the intra-warp (CUDA) instructions __any(), __all(), __ballot(), __shfl()? http://on-demand.gputechconf.com/gtc/2015/presentation/S5151-Elmar-Westphal.pdf
Warp vote functions:
__any(predicate) returns non-zero if any of the predicates for the
threads in the warp returns non-zero
__all(predicate) returns non-zero if all of the predicates for the
threads in the warp returns non-zero
__ballot(predicate) returns a bit-mask with the respective bits
of threads set where predicate returns non-zero
__shfl(value, thread) returns value from the requested thread
(but only if this thread also performed a __shfl()-operation)
CONCLUSION:
As is known, OpenCL 2.0 has sub-groups with a SIMD execution model akin to WaveFronts: Does the official OpenCL 2.2 standard support the WaveFront?
For sub-groups there are (page 160): http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf
int sub_group_all(int predicate), the same as CUDA __all(predicate)
int sub_group_any(int predicate); the same as CUDA __any(predicate)
But OpenCL has no functions similar to:
CUDA __ballot(predicate)
CUDA __shfl(value, thread)
There are only the Intel-specified built-in shuffle functions in Version 4, August 28, 2016, Final Draft OpenCL Extension #35: intel_sub_group_shuffle, intel_sub_group_shuffle_down, intel_sub_group_shuffle_up, intel_sub_group_shuffle_xor: https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.txt
OpenCL also has functions that are usually implemented with shuffles, but this set does not cover everything that shuffle functions can express:
<gentype> sub_group_broadcast( <gentype> x, uint sub_group_local_id );
<gentype> sub_group_reduce_<op>( <gentype> x );
<gentype> sub_group_scan_exclusive_<op>( <gentype> x );
<gentype> sub_group_scan_inclusive_<op>( <gentype> x );
Summary:
Shuffle functions remain the more flexible option and guarantee the fastest possible communication between threads, with direct register-to-register data exchange.
The functions sub_group_broadcast/_reduce/_scan do not guarantee direct register-to-register exchange, and these sub-group functions are less flexible.
There is
gentype work_group_reduce<op> ( gentype x)
for version >= 2.0,
but its definition doesn't say anything about using local memory or registers. It simply reduces each work-item's x value to a single sum. This function must be encountered by all work-items of the work-group, so it is not a wavefront-level approach. Also, the order of floating-point operations is not guaranteed.
Maybe some vendors implement it with registers while others use local memory. Nvidia does it with registers, I assume. But an older mainstream AMD GPU has a local memory bandwidth of 3.7 TB/s, which is still a good amount. (Edit: it's not 22 TB/s.) For 2k cores, this means nearly 1.5 bytes per cycle per core, or much faster per cache line.
For a 100% register version (if it doesn't spill to global memory), you can reduce the number of threads and do a vectorized reduction within each thread, without communicating with the others, if the number of elements is just 8 or 16. Such as:
v.s0123 += v.s4567
v.s01 += v.s23
v.s0 += v.s1
which should compile to something similar to __m128i _mm_shuffle_epi8 plus its sum version on a CPU, while non-scalar GPU implementations will use the same SIMD hardware to do these 3 operations.
Also, using these vector types tends to produce efficient memory transactions even for global and local memory, not just registers.
A SIMD unit works on only a single wavefront at a time, but a wavefront may be processed by multiple SIMD units, so this vector operation does not imply that a whole wavefront is being used. The whole wavefront may even be computing the first elements of all vectors in one cycle. For a CPU, though, the most logical option is for SIMD to compute work-items one by one (AVX, SSE) rather than in parallel across their same-indexed elements.
If the main work-group doesn't fit one's requirements, there are child kernels to spawn, with dynamic-width kernels for this kind of operation. A child kernel works on another group, called a sub-group, concurrently. This is done within a device-side queue and requires OpenCL version 2.0 at minimum.
Look for "device-side enqueue" in http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf
The AMD APP SDK supports sub-groups.

Beaglebone Black; Wrong SPI Frequency

I'm new to programming the BeagleBone Black and to Linux in general, so I'm trying to figure out what's happening when I set up an SPI connection. I'm running Linux beaglebone 3.8.13-bone47.
I have set up an SPI connection using a Device Tree Overlay, and I'm now running spidev_test.c to test it. For the application I'm making I need a fairly specific frequency, but when I run spidev_test and measure the frequency of the bits shifted out, I don't get the expected frequency.
I'm sending an SPI packet containing 0xAA, and in spidev_test I've set "spi_ioc_transfer.speed_hz" to 4000000 (4 MHz). But I'm measuring a data transfer frequency of 2.98 MHz. I see the same result with other speeds as well; deviations are usually around 25-33%.
How come the measured speed doesn't match the assigned speed?
How is the speed assigned in "speed_hz" defined?
How precise should I expect the frequency to be?
Thank you :)
Actually, if you look closely on the DSO you can see that each clock cycle takes approximately 312.5 ns, which makes the clock frequency 3.2 MHz.
As for the variation between the expected and actual speed:
In the microcontrollers I've worked with, all the peripherals including SPI derive their clock from the master clock supplied to the MCU (in your case, MPU). The master frequency divided by some prescaler gives the frequency for peripheral operation, and each peripheral then applies its own prescaler to control the baud rate.
So if, in your case, the master frequency is not what you expect, that could lead to the behaviour described above.
So you have two options:
1. Correct the MPU core frequency.
2. Use trial and error to find the value to give the SPI test program so that you get the desired frequency.

Measuring time: differences among gettimeofday, TSC and clock ticks

I am doing some performance profiling for part of my program, and I tried to measure the execution time with the following four methods. Interestingly, they show different results, and I don't fully understand their differences. My CPU is an Intel(R) Core(TM) i7-4770. The system is Ubuntu 14.04. Thanks in advance for any explanation.
Method 1:
Use the gettimeofday() function, result is in seconds
Method 2:
Use the rdtsc instruction similar to https://stackoverflow.com/a/14019158/3721062
Method 3 and 4 exploits Intel's Performance Counter Monitor (PCM) API
Method 3:
Use PCM's
uint64 getCycles(const CounterStateType & before, const CounterStateType &after)
Its description (I don't quite understand):
Computes the number core clock cycles when signal on a specific core is running (not halted)
Returns number of used cycles (halted cycles are not counted). The counter does not advance in the following conditions:
an ACPI C-state is other than C0 for normal operation
HLT
STPCLK+ pin is asserted
being throttled by TM1
during the frequency switching phase of a performance state transition
The performance counter for this event counts across performance state transitions using different core clock frequencies
Method 4:
Use PCM's
uint64 getInvariantTSC (const CounterStateType & before, const CounterStateType & after)
Its description:
Computes number of invariant time stamp counter ticks.
This counter counts irrespectively of C-, P- or T-states
Two samples runs generate result as follows:
(Method 1 is in seconds. Methods 2~4 are divided by a (same) number to show a per-item cost).
0.016489 0.533603 0.588103 4.15136
0.020374 0.659265 0.730308 5.15672
Some observations:
The ratio of Method 1 to Method 2 is very consistent, while the others are not, i.e., 0.016489/0.533603 = 0.020374/0.659265. Assuming gettimeofday() is sufficiently accurate, the rdtsc method exhibits the "invariant" property. (Yes, I read on the Internet that the current generation of Intel CPUs has this feature for rdtsc.)
Method 3 reports higher numbers than Method 2. I guess it's somehow different from the TSC, but what is it?
Method 4 is the most confusing one. It reports numbers an order of magnitude larger than Methods 2 and 3. Shouldn't it also be a kind of cycle count? Especially since it carries the "invariant" name.
gettimeofday() is not designed for measuring time intervals. Don't use it for that purpose.
If you need wall-clock time intervals, use the POSIX monotonic clock. If you need CPU time spent by a particular process or thread, use the POSIX process-time or thread-time clocks. See man clock_gettime.
The PCM API is great for fine-tuned performance measurement when you know exactly what you are doing, which generally means obtaining a variety of separate memory, core, cache, and low-power performance figures. Don't start messing with it unless you are sure there is an exact service you need from it that you can't get from clock_gettime.

GPU Pixel and Texel write speed

Many of the embedded/mobile GPUs provide access to performance registers called Pixel Write Speed and Texel Write Speed. Could you explain how those terms can be interpreted and defined from the actual GPU hardware point of view?
I would assume the difference between pixel and texel is pretty clear to you. Anyway, just to make this answer a little bit more "universal":
A pixel is the fundamental unit of screen space.
A texel, or texture element (also texture pixel) is the fundamental unit of texture space.
Textures are represented by arrays of texels, just as pictures are
represented by arrays of pixels. When texturing a 3D surface (a
process known as texture mapping) the renderer maps texels to
appropriate pixels in the output picture.
BTW, it is more common to use fill rate instead of write speed and you can easily find all required information, since this terminology is quite old and widely-used.
Answering your question
All fill-rate numbers (whatever definition is used) are expressed in
Mpixels/sec or Mtexels/sec.
Well the original idea behind fill-rate was the number of finished
pixels written to the frame buffer. This fits with the definition of
Theoretical Peak fill-rate. So in the good old days it made sense to
express that number in Mpixels.
However with the second generation of 3D accelerators a new feature
was added. This feature allows you to render to an off screen surface
and to use that as a texture in the next frame. So the values written
to the buffer are not necessarily on screen pixels anymore, they might
be texels of a texture. This process allows several cool special
effects, imagine rendering a room, now you store this picture of a
room as a texture. Now you don't show this picture of the room but you
use the picture as a texture for a mirror or even a reflection map.
Another reason to use MTexels is that games are starting to use
several layers of multi-texture effects, this means that a on-screen
pixel is constructed from various sub-pixels that end up being blended
together to form the final pixel. So it makes more sense to express
the fill-rate in terms of these sub-results, and you could refer to
them as texels.
Read the whole article - Fill Rate Explained
Additional details can be found here - Texture Fill Rate
Update
Texture Fill Rate = (number of TMUs, texture mapping units) x (Core Clock)
The number of textured pixels the card can render to the
screen every second.
It is obvious that the card with more TMUs will be faster at processing texture information.
The performance registers/counters Pixel Write Speed and Texel Write Speed maintain counts of pixel and texel operations processed/written. I will explain the peak (maximum possible) fill rates.
Pixel Rate
A picture element is a physical point in a raster image, smallest
element of display device screen.
Pixel rate is the maximum number of pixels the GPU could possibly write to local memory in one second, measured in millions of pixels per second. The actual pixel output rate also depends on quite a few other factors, most notably memory bandwidth: the lower the memory bandwidth, the lower the ability to reach the maximum fill rate.
The pixel rate is calculated by multiplying the number of ROPs (Raster Operations Pipelines, also known as Render Output Units) by the core clock speed.
Render Output Units : The pixel pipelines take pixel and texel information and process it, via specific matrix and vector operations, into a final pixel or depth value. The ROPs perform the transactions between the relevant buffers in the local memory.
Importance: the higher the pixel rate, the higher the screen resolution the GPU can handle.
Texel Rate
A texture element is the fundamental unit of texture space (a tile of
3D object surface).
Texel rate is the maximum number of texture map elements (texels) that can be processed per second, measured in millions of texels per second.
It is calculated by multiplying the total number of texture units by the core clock speed of the chip.
Texture Mapping Units : Textures need to be addressed and filtered. This job is done by TMUs that work in conjunction with pixel and vertex shader units. It is the TMU's job to apply texture operations to pixels.
Importance: the higher the texel rate, the faster the GPU processes texture information and the more fluently demanding games render.
Example: not an NVIDIA fan, but here are the specs for the GTX 680 (could not find much for embedded GPUs):
Model Geforce GTX 680
Memory 2048 MB
Core Speed 1006 MHz
Shader Speed 1006 MHz
Memory Speed 1502 MHz (6008 MHz effective)
Unified Shaders 1536
Texture Mapping Units 128
Render Output Units 32
Bandwidth 192256 MB/sec
Texel Rate 128768 Mtexels/sec
Pixel Rate 32192 Mpixels/sec
