Saturday, 16 August 2014

What was that NDRange thing?

Digging a little deeper and pulling apart the OpenCL parts of the Mandelbrot Example

(This blog post refers to version 1.6.0 of the Brown Deer OpenCL libraries on the Parallella. If you are using newer versions there could be significant differences.)

I thought I'd stop and go back over the Mandelbrot example and make sure I understood how the thing worked. 

There were two things that seemed odd. Firstly, there was only one fork command (or rather a forka command) and I thought there would have been 16 and secondly, what was that clndrange_init1D command doing?

Turns out that the two are intimately linked. The ndRange controls the number of calls to your kernel and the fork kicks off the process. If you use forka you can pass in additional arguments. 

The nd part is short for n-Dimensional so the clndrange_init1D seems to suggest that the space created in the malloc statement as a 1-Dimensional space. Therefore, clndrange_init2D and clndrange_init3D would somehow create 2D and 3D spaces - or that is what I thought. It turns out that things are not exactly as clever as they might first appear.

Let's start from the begining. OpenCL has the standard call:

clEnqueueNDRangeKernel(cl_command_queue queue,
cl_kernel kernel,
cl_uint work_dims
const size_t *global_work_offset,
const size_t *global_work_size
const size_t *local_work_size,
cl_uint num_events, 
const cl_event *wait_list, 
cl_event *event);


Where:
  • work_dims is the number of dimensions (1, 2 or 3)
  • global_work_offset is the global ID offsets in each dimension
  • global_work_size is the number of work-items in each dimension
  • local_work_size is the number of work-items in a work-group, in each dimension

The last three are pointers to arrays of integers with for example global_work_offset[0] referring to the 1st dimension, global_work_offset[1] referring to the 2nd dimension etc.
(Note: this is the "official" version of what these are supposed to do)


The Brown Deer stdcl library replaces this one call with a choice of:

clndrange_init1d(gtoff0, gtsz0, ltsz0) OR
clndrange_init2d(gtoff0, gtsz0, ltsz0, gtoff1, gtsz1, ltsz1) OR
clndrange_init3d(gtoff0, gtsz0, ltsz0, gtoff1, gtsz1, ltsz1, gtoff2, gtsz2, ltsz2)

followed by:

clfork(context, devnum, kernel, ndrange, flags ) OR, if you want to pass additional arguments:
clforka(context, devnum, kernel, ndrange, flags [, arg0, ..., argN ]) 

where 

gtoff0 is the global_work_offset for dimension zero
gtsz0 is the global_work_size for dimension zero and
ltsz0 is the local_work_size for dimension zero etc.

So clndrange_init?d is a convenient way of setting those variables when is then passed into the fork command via the clndrange_t* variable.

So, what does it actually do?


clndrange_init1D


I started with a 1 dimensional experiment. After much digging and experimentation here's the DIRT!

global_work_offset does NOTHING, Seriously! Don't believe me... check out the khronos documentation here: http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html
global_work_size is the number of times your kernel will be called (one call in opencl talk is one work-item) and
local_work_size is the number returned to you in your kernel when you call get_local_size(0)

That's it! No fancy partitioning of the data space. Nothing. It took me quite a while to realise that it was that simple. 

There are some quirks to this call. The combination of global_work_size and local_work_size are a bit sensitive. I have not tried every combination available but you often get a segmentation fault due to changes in both the global_work_size and local_work_size. For global_work_size less than 16 it seems to have problems if it and the local_work_size are not equal. Greater than 16 you could get an error "Exceeded maximum thread block size" but apart from getting an error, it seems to still work sometimes. It seems to work if global_work_size is a multiple of 16. If things are not right you can encounter Segmentation faults and Invalid Instruction faults or you may find that your kernel does not seem to be called at all. (I have used 16 here because I'm using an Epiphany-16. I should be referring to the number of cores on your accellerator.)

Let's have a look what the mandelbrot example did.

First, it allocated some space in the standard accelerator context stdacc with:

cl_uchar* pixels = (cl_uchar*) clmalloc(stdacc, width * height * 3, 0);

(The *3 is because it creates an RGB bit map which needs 3 bytes per pixel.)

After grabbing the compiled kernel from the context with clsym(...), it calls:

clndrange_t ndr = clndrange_init1d(0, height, 16);

so it wants is kernel to be called height times. The 16 is actually ignored because the kernel never calls get_local_size(0).

Finally it calls:

clforka(stdacc, 0, krn, &ndr, CL_EVENT_WAIT, iterations, width, startx, starty, dx_over_width, dy_over_height, pixels);

where it passes in the width of each line as an argument along with the data structure, pixels. The kernel then generates one horizontal line of the final picture using the call get_global_id(0) to determine which line it is on. (A slightly cleaner way would have been to pass width as the third argument in the clndrange_init1d() and then to call get_local_size(0)).

So much for clndrange_init1D. The mandelbrot set is a good example of a problem where the calculation of each point is independent of the next. If you have such a problem then this is a simple model that would be sufficient.

Next I tried a 2 dimensional version.


clndrange_init2D


Let me first say that I found clndrange_init2D to be a little temperamental. It seems that some combinations of values result in the kernel not being called at all. What's more, once it is not working, regressing to the previous state did not result it in working again. It was very frustrating. Therefore, everything written below must be read in with the knowledge that I gave up without actually understanding what was going on.

The 2D call seems (see previous paragraph) to call the kernel gtsz0 multiplied by gtsz1 times. The local size values ltsz0 and ltsz1 are merely returned by get_local_size(0|1).

clndrange_init2d( gtoff0, gtsz0, ltsz0, gtoff1, gtsz1, ltsz1)

gtoff0: NULL! Nothing to do here
gtsz0: the number of calls for the first dimension 
ltsz0: the value returned by calling get_local_size(0)
gtoff1: NULL, as you might expect
gtsz1: the number of calls for the second dimension (one for each gtsz0 call)
ltsz1: the number returning by calling get_local_size(1)

I didn't go into clndrange_init3D but I'd lay money on it working in the same was as clndrange_init2D.


nD Example 


I took the mandelbrot example and removed all the fancy mathematics. I allocated two chunks of global memory and wrote a 1D kernel and a 2D kernel that just initialised the space to a given value.

The critical bits are:

int  bytesPerCore = 16; // how many bytes we want each core to process
int workItems = 32;     // the total number workItems (threads sent to the Epiphany)

wrkArea1D = (cl_uchar*) clmalloc(stdacc, workItems * bytesPerCore, 0);
wrkArea2D = (cl_uchar*) clmalloc(stdacc, workItems * bytesPerCore, 0);

clndrange_t ndr1D = clndrange_init1d(NULL, workItems, bytesPerCore); 
clndrange_t ndr2D = clndrange_init2d(NULL, workItems, bytesPerCore/4, NULL, workItems, bytesPerCore/4);

and then in a common function call:

clforka(stdacc, 0, krn, ndr, CL_EVENT_WAIT, 1, rows, cols, wrkArea);

where 
krn is either "k_init1D" or "k_init2D"
ndr is either ndr1D or ndr2D and
wrkArea is either wrkArea1D or wrkArea2D

The kernel k_init1D works in the same way as the mandelbrot example processing 16 bytes of data in each of the 32 calls using get_global_id(0) to figure out where it should be.

The k_init2D kernel breaks the data set of the same size into 4x4 "tiles" so that adjacent data could be processed at the same time. I thought that I could cast the global data into a 2 dimensional array but that didn't work so I had to do all of the offset arithmetic in code. While this is not difficult it did make the 2D kernel considerably longer and given that speed is of the essence I would suggest that the only reason to do this would be if the algorithm overall would work better in 2D than in 1D (or that you dig unnecessary complexity).

In the resultant data sets I include the value of get_global_id(0) to show which call processed which chunk of data. The 1D data has the global id in the first column and the result of get_local_size(0) in the second column. The first five lines are:

0       16      1       1       1       1       1       1       1       1       1       1       1       1       1       1
1       16      1       1       1       1       1       1       1       1       1       1       1       1       1       1
2       16      1       1       1       1       1       1       1       1       1       1       1       1       1       1
3       16      1       1       1       1       1       1       1       1       1       1       1       1       1       1
4       16      1       1       1       1       1       1       1       1       1       1       1       1       1       1

The 2D data set includes the result of get_global_id(1) which shows it jumping around between 28 and 31. This is the value returned by get_global_id(1) as at the last call to the kernel. The top left value is the value returned by get_global_id(0) and the next value is value returned by get_global_id(1). The results from the first four tiles (1 to 3) are:

0       28     1       1       1       31     1       1       2       31     1       1       3       29     1       1
1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1

The 2D call was also a lot more sensitive than the 1D. As I mentioned above some combinations of values in the 2D version work and some don't. I suspect that the total number of calls must be a multiple of the number of cores available. So, before writing a big kernel with the assumption of clndrange_init2D working, think about how you want to process your data and check that a simple version works first and 

The nD example code


The code for the example is a little long to list here so grab it from github. From you eclipse workspace directory execute:

git clone https://github.com/nickoppen/nD.git

load it into eclipse and push, pull, tweak and generally rip it up till you are an ndrange EXPERT. And, if I've gotten anything wrong, please let me know.


Up Next


I don't know about you but I'm getting a bit tired of waiting around for Eclipse. AndyC posted a procedure on getting Code::Blocks working on the Parallella (http://forums.parallella.org/viewtopic.php?f=13&t=1658). 

I've installed Code::Blocks and it does perform better than Eclipse. Andy's procedure is designed for the standard SDK and I got into a mess when I tried to compile an OpenCL program on it so I'm going to have to do some tweaking. If I make some progress I'll write a post about it.