OpenCL: What if i have more tasks then available work items?

OpenCL: What if i have more tasks then available work items? - multithreading

Let's make an example:
i want vector dot product made concurrently (it's not my case, this is only an example) so i have 2 large input vectors and a large output vector with the same size. the work items aviable are less then the sizes of these vectors. How can i make this dot product in opencl if the work items are less then the size of the vectors? Is this possible? Or i have just to make some tricks?
Something like:
for(i = 0; i < n; i++){
output[i] = input1[i]*input2[i];
}
with n > available work items

If by "available work items" you mean you're running into the maximum given by CL_DEVICE_MAX_WORK_ITEM_SIZES, you can always enqueue your kernel multiple times for different ranges of the array.
Depending on your actual workload, it may be more sensible to make each work item perform more work though. In the simplest case, you can use the SIMD types such as float4, float8, float16, etc. and operate on large chunks like that in one go. As always though, there is no replacement for trying different approaches and measuring the performance of each.

Divide and conquer data. If you keep workgroup size as an integer divident of global work size, then you can have N workgroup launches perhaps k of them at once per kernel launch. So you should just launch N/k kernels each with k*workgroup_size workitems and proper addressing of buffers inside kernels.
When you have per-workgroup partial sums of partial dot products(with multiple in-group reduction steps), you can simply sum them on CPU or on whichever device that data is going to.

Related

How can I use linux perf and interpret its output to understand CPU cache misses?

I am trying to measure the number of times memory references miss any CPU cache and need to fetch a cache line from memory. I have a very simple program that loads 100 million 4-byte integers into an array and then scans it or probes it randomly. I measure time, and then use perf to report various cache-related events: LLC-load, LLC-load-misses, LLC-store, LLC-store-misses. I am using Pop OS 18.10 (a variant of Ubuntu 18.10).
I run the program three ways:
1) Just load the array (100m integers).
2) Load the array and scan in physical order.
3) Load the array and read 100m random array locations.
#3 is 40x slower than #2, which is not surprising.
I am having some trouble both knowing what perf events to examine, and how to interpret the results:
I discovered the LLC-* events by googling, but they are not mentioned by "perf list".
I subtract the counts of events of the load-only run (#1) from the load-and-scan runs (#2, #3). The numbers are generally lower from the physical scan (#2) compared to the random access (#3). But from reading the documentation, and looking at the numbers, I don't really understand what the various events represent.
Does perf count events or does it sample them? If it's a true count, then I really can't make sense of the numbers I'm seeing. (E.g. the number of LLC-load-misses events doesn't match the number of cache line transfers that should be needed.)

Metal - Threads and ThreadGroups

I am learning Metal right now and trying to understand the lines below:
let threadGroupCount = MTLSizeMake(8, 8, 1) ///line 1
let threadGroups = MTLSizeMake(drawable.texture.width / threadGroupCount.width, drawable.texture.height / threadGroupCount.height, 1) ///line 2
command_encoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadGroupCount) ///line 3
for the line 1, What is the 3 integers represent? My guess is to assign the number of threads to be used in the process but which is which?
What is the different between line 1 and 'line 2'? My guess again is the different between threads and thread groups. But I am not sure what is the fundamental difference and when to use what.

When dispatching a grid of work items to a compute kernel, it is your responsibility to divide up the grid into subsets called threadgroups, each of which has a total number of threads (width * height * depth) that is less than the maxTotalThreadsPerThreadgroup of the corresponding compute pipeline state.
The threadsPerThreadgroup size indicates the "shape" of each subset of the grid (i.e. the number of threads in each grid dimension). The threadgroupsPerGrid parameter indicates how many threadgroups make up the entire grid. As in your code, it is often the dimensions of a texture divided by the dimensions of your threadgroup size you've chosen.
One performance note: each compute pipeline state has a threadExecutionWidth value that indicates how many threads of a threadgroup will be scheduled and executed together by the GPU. The optimal threadgroup size will thus always be a multiple of threadExecutionWidth. During development, it's perfectly acceptable to just dispatch a small square grid as you're currently doing.

The first line gives you the number of threads per group (in this case two-dimensional 8x8), while the second line gives you the number of groups per grid. Then the dispatchThreadgroups(_:threadsPerThreadgroup:) function on the third line uses these two numbers. The number of groups can be omitted in which case it defaults to using one group.

Is there a way to accelerate matrix plots?

ggpairs(), like its grandparent scatterplotMatrix(), is terribly slow as the number of pairs grows. That's fair; the number of permutations of pairs grows factorially.
What isn't fair is that I have to watch the other cores on my machine sit idle while one cranks away at 100% load.
Is there a way to parallelize large matrix plots?
Here is some sample data for benchmarking.
num.vars <- 100
num.rows <- 50000
require(GGally)
require(data.table)
tmp <- data.table(replicate(num.vars, runif(num.rows)),
class = as.factor(sample(0:1,size=num.rows, replace=TRUE)))
system.time({
tmp.plot <- ggpairs(data=tmp, diag=list(continuous="density"), columns=1:num.vars,
colour="class", axisLabels="show")
print(tmp.plot)})
Interestingly enough, my initial benchmarks excluding the print() statement ran at tolerable speeds (21 minutes for the above). The print statement, when added, caused what appear to be segfaults on my machine. (Hard to say at the moment because the R session is simply killed by the OS).
Is the problem in memory, or is this something that could be parallelized? (At least the plot generation part seems amenable to parallelization.)

Drawing ggpairs plots is single threaded because the bulk of the work inside GGally:::print.ggpairs happens inside two for loops (somewhere around line 50, depending upon how you count lines):
for (rowPos in 1:numCol) {
for (columnPos in 1:numCol) {
It may be possible to replace these with calls to plyr::l_ply (or similar) which has a .parallel argument. I have no idea if the graphics devices will cope OK with several cores trying to simultaneous draw things on them though. My gut feeling is that getting parallel plotting to work robustly may be non-trivial, but it could also be a fun project.

Why is string manipulation more expensive?

I've heard this so many times, that I have taken it for granted. But thinking back on it, can someone help me realize why string manipulation, say comparison etc, is more expensive than say an integer, or some other primitive?

8bit example:
1 bit can be 1 or 0. With 2 bits you can represent 0, 1, 2, and 3. And so on.
With a byte you have 2^8 possibilities, from 0 to 255.
In a string a single letter is stored in a byte, so "Hello world" is 11 bytes.
If I want to do 100 + 100, 100 is stored in 1 byte of memory, I need only two bytes to sum two numbers. The result will need again 1 byte.
Now let's try with strings, "100" + "100", this is 3 bytes plus 3 bytes and the result, "100100" needs 6 bytes to be stored.
This is over-simplified, but more or less it works in this way.

The int data type in C# was carefully selected to be a good match with processor design. Which can store an int in a cpu register, a storage location that's an easy factor of 3 faster than memory. And a single cpu instruction to compare values of type int. The CMP instruction runs in less than a single cpu cycle, a fraction of a nano-second.
That doesn't work nearly as well for a string, it is a variable length data type and every single char in the string must be compared to test for equality. So it is automatically proportionally slower by the size of the string. Furthermore, string comparison is afflicted by culture dependent comparison rules. The kind that make "ss" and "ß" equal in German and "Aa" and "Å" equal in Danish. Nothing subtle to deal with, taken care of by highly optimized table-driven code inside the CLR. It can't beat CMP.

I've always thought it was because of the immutability of strings. That is, every time you make a change to the string, it requires allocating memory for a whole new string (rather than modifying the original in place).
Probably a woefully naive understanding but perhaps someone else can expound further.

There are several things to consider when looking at the "cost" of manipulating strings.
There is the cost in terms of memory usage, there is the cost in terms of CPU cycles used, and there is a cost associated with the complexity of the code involved.
Integer manipulation (Add, Subtract, Multipy, Divide, Compare) is most often done by the CPU at the hardware level, in few (or even 1) instruction. When the manipulation is done, the answer fits back in the same size chunk of memory.
Strings are stored in blocks of memory, which have to be manipulated a byte or word at a time. Comparing two 100 character long strings may require 100 separate comparison operations.
Any manipulation that makes a string longer will require, either moving the string to a bigger block of memory, or moving other stuff around in memory to allow growing the existing block.
Any manipulation that leaves the string the same, or smaller, could be done in place, if the language allows for it. If not, then again, a new block of memory has to be allocated and contents moved.

How to search for Possibilities to parallelize?

I have some serial code that I have started to parallelize using Intel's TBB. My first aim was to parallelize almost all the for loops in the code (I have even parallelized for within for loop)and right now having done that I get some speedup.I am looking for more places/ideas/options to parallelize...I know this might sound a bit vague without having much reference to the problem but I am looking for generic ideas here which I can explore in my code.
Overview of algo( the following algo is run over all levels of the image starting with shortest and increasing width and height by 2 each time till you reach actual height and width).
For all image pairs starting with the smallest pair
For height = 2 to image_height - 2
Create a 5 by image_width ROI of both left and right images.
For width = 2 to image_width - 2
Create a 5 by 5 window of the left ROI centered around width and find best match in the right ROI using NCC
Create a 5 by 5 window of the right ROI centered around width and find best match in the left ROI using NCC
Disparity = current_width - best match
The edge pixels that did not receive a disparity gets the disparity of its neighbors
For height = 0 to image_height
For width = 0 to image_width
Check smoothness, uniqueness and order constraints*(parallelized separately)
For height = 0 to image_height
For width = 0 to image_width
For disparity that failed constraints, use the average disparity of
neighbors that passed the constraints
Normalize all disparity and output to screen

Just for some perspective, it may not always be worthwhile to parallelize something.
Just because you have a for loop where each iteration can be done independently of each other, doesn't always mean you should.
TBB has some overhead for starting those parallel_for loops, so unless you're looping a large number of times, you probably shouldn't parallelize it.
But, if each loop is extremely expensive (Like in CirrusFlyer's example) then feel free to parallelize it.
More specifically, look for times where the overhead of the parallel computation is small relative to the cost of having it parallelized.
Also, be careful about doing nested parallel_for loops, as this can get expensive. You may want to just stick with paralellizing the outer for loop.

The silly answer is anything that is time consuming or iterative. I use Microsoft's .NET v4.0 Task Parallel Library and one of the interesting things about their setup is its "expressed parallelism." An interesting term to describe "attempted parallelism." Though, your coding statements may say "use the TPL here" if the host platform doesn't have the necessary cores it will simply invoke the old fashion serial code in its place.
I have begun to use the TPL on all my projects. Any place there are loops especially (this requires that I design my classes and methods such that there are no dependencies between the loop iterations). But any place that might have been just good old fashion multithreaded code I look to see if it's something I can place on different cores now.
My favorite so far has been an application I have that downloads ~7,800 different URL's to analyze the contents of the pages, and if it finds information that it's looking for does some additional processing .... this used to take between 26 - 29 minutes to complete. My Dell T7500 workstation with dual quad core Xeon 3GHz processors, with 24GB of RAM, and Windows 7 Ultimate 64-bit edition now crunches the entire thing in about 5 minutes. A huge difference for me.
I also have a publish / subscribe communication engine that I have been refactoring to take advantage of TPL (especially on "push" data from the Server to Clients ... you may have 10,000 client computers who have stated their interest in specific things, that once that event occurs, I need to push data to all of them). I don't have this done yet but I'm REALLY LOOKING FORWARD to seeing the results on this one.
Food for thought ...

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string