
Tuesday, 19 April 2016

Getting COPRTHR and MPI running

In this post I get to grips with COPRTHR with the new MPI library installed and solve the puzzle of why direct calls to the Epiphany SDK did not work. There were some tricks to getting COPRTHR working, but some unexpected benefits as well.


Introduction


Working towards my goal of a general purpose infrastructure to wrap around my neural network simulator, I did some investigation into the Message Passing Interface (MPI) and direct memory access (DMA) memory transfers. The plan is to use MPI to provide the signalling between the host and epiphany and between epiphany cores. The DMA part comes in as a way to parallelise the processing so that you are transferring the next data set into local memory while you are processing the last set. DMA is also a lot quicker for larger data sets than individual reads.

To get MPI working I decided to bite the bullet and get into COPRTHR. While this was a little confusing at first, once I'd gotten everything wired up properly it all made sense. If you understand stdcl the overall structure is the same and, so far, I have not found anything that you can do in stdcl that you cannot do in COPRTHR.

I also want to use MPI because it allows asynchronous data transfer between cores. While direct memory writes are fast, the core being written to plays no part in the process and thus other coordination methods are required to ensure that the data has arrived before it is used. It also requires that the core doing the writing writes to the correct location. This is fine if all cores have the same code base but if they do not there needs to be signaling between sender and receiver to ensure that the sender is not writing over the wrong data (or program code!). MPI provides the abstraction needed to ensure an orderly, arms-length transfer between cores.

There was an unexpected benefit in that COPRTHR has a call that allows you to allocate a chunk of memory from within your kernel. While you still have the option to dynamically change your program and recompile on the fly as I described in a previous post, there is a call that gives you access to unused memory on the epiphany which, if used carefully, can store data for local processing.


Preliminaries


The programs developed for this blog post were written on the 2015.01.30 image using COPRTHR version 1.6.2. The 2016.03 image was released recently so I'll install that soon and update anything that fails (which I don't expect). Please also be aware that version 2.0 of COPRTHR is in the works, so check your version before proceeding.

You can have a look at the example code, or to clone the source files execute:

git clone https://github.com/nickoppen/coprthr_basic.git


Installing MPI


MPI has been around for a number of years; it is a set of calls, implemented on many different systems, that allows message passing between those systems. Brown Deer recently brought out a COPRTHR extension that implements those calls for message passing between epiphany cores. While it is an extension to version 1.6.2, I expect that COPRTHR version 2 will have MPI built in with greater functionality (although no details have been released as yet).

Get the COPRTHR mpi extension from here:


http://www.browndeertechnology.com/code/bdt-libcoprthr_mpi-preview.tgz

Just download it, un-compress it somewhere out of the way and run the install script as root: 

sudo ./install.sh


Using eSDK calls in an OpenCL program


You can actually use eSDK calls directly from your OpenCL kernel code. However, I found an inconsistency between the layout of the eSDK and the hard-wired compile scripts used by the clcc compiler.

To test whether this inconsistency exists on your system, fetch the latest version of the parallella-examples from github, which includes a couple of MPI examples:

cd
git fetch parallella-examples
cd parallella-examples/mpi-nbody
make

If this compiles without errors then you are fine.

If you get errors such as:

undefined reference to `e_dma_copy(void*, void*, unsigned long)
undefined reference to `e_mutex_unlock(unsigned int, unsigned int, int*) and
undefined reference to `e_mutex_unlock(unsigned int, unsigned int, int*)

try the following:

ls /opt/adapteva/esdk/tools/e-gnu/epiphany-elf/sys-include

If this gives you "No such file or directory", run the following (note: this is a single command without any line breaks):

sudo ln -s /opt/adapteva/esdk/tools/e-gnu/epiphany-elf/include /opt/adapteva/esdk/tools/e-gnu/epiphany-elf/sys-include

Then run:

make clean
make

again from the mpi-nbody directory to check if that fixed it. 

What if that does not work?


This was the problem with my installation. The way I discovered it (thanks to jar) was by using the cldebug tool:

cldebug -t ./temp -- clcc --coprthr-cc -mtarget=e32 -D__link_mpi__ --dump-bin -I/opt/adapteva/esdk/tools/e-gnu/epiphany-elf/include -I ./ -DECORE=16 -DNPARTICLE=2048 -DSTEPS=32 -DSIZE=1 -DMPI_BUF_SIZE=1024 -DCOPRTHR_MPI_COMPAT mpi_tfunc.c

This looks very complicated but it really just calls the clcc command on the kernel file. The output looks even more horrendous but it contains the output of all the steps that the Brown Deer pre-processor goes through to produce a working executable.

If there is still something missing in your compile, sift through all of this text and look for something that is wrong. Given that I was missing a reference, I focused on the include files and directories that appear after the -I and -i switches. I found that it was including sys-include, which didn't exist on my system. Instead, the .h files (e_dma.h and e_mutex.h) were in a directory called include. To fix it with minimum disruption, I created a soft link (ln -s) to the existing include directory and called it sys-include. Thus sys-include became an alias for include and the compiler was happy.


Compiling


Below are the changes you need to make in Code::Blocks to get a program to compile using COPRTHR. If you have not used Code::Blocks, I suggest that you have a read through my post about setting it up from scratch. In this post I'm only going to cover the changes that you need to make to the stdcl setup described there. If you are an emacs/make fan, I've included a Makefile along with the example code.


Compiler Options


Kernel Compilation

The first two changes are critical for correct compilation of your kernel code. Go to the Compiler Settings menu (Settings | Compiler) and select the compiler you have set up for compiling with clcc (mine is called BD OpenCL).

Under the Compiler Settings tab choose the Other options tab and type the switches shown below:


Then under the #defines tab define the following two definitions:




Host Application Compilation

Go back to the gcc compiler settings and change the Linker settings tab to include the COPRTHR link libraries (the m library is the standard C maths library, libm; it is in all the examples so I included it for good measure):




Then under Search directories make sure that the compiler can find your included header files:


And that the linker can find the COPRTHR libraries:





Project Properties


As always in a new project, you need to define a build target but in this case the output attributes are not important:



This is the tricky bit. The compiler, with the switches defined above, generates a binary file in the Execution working directory (your project directory by default). This file has the rather verbose extension bin.3.e32 tacked onto the end of your kernel file name. Forget the .o file generated by the compiler; it is the bin.3.e32 file that you need at execution time. If your host application is being generated and run from your bin/Debug or bin/Release directory, you must copy the bin.3.e32 file to the same place.

To do this use the Post-build steps with a copy command and while you are at it, make the file readable by everyone. (Note: my kernel file is called pfunc.c. Replace this with your kernel file name.)
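For a Debug build and my kernel file pfunc.c, the post-build commands amount to something like the following (treat the exact paths as an illustration and adjust the file name and target directory for your own project):

cp pfunc.cbin.3.e32 bin/Debug/
chmod a+r bin/Debug/pfunc.cbin.3.e32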


Also, while you are in this box and after saving your new settings, pop up to the root project settings (coprthr_basic in this example) and switch off All Warnings (-Wall). It will just be annoying.

Then, because we are not interested in the .o or the linker output, switch off linking for the kernel file or files.




Using COPRTHR Host-side


The overall structure of a COPRTHR program is the same as an stdcl program as described in my previous post. The steps you need to go through are also the same but with different calls.


Loading and Compiling


Stdcl has the command clopen for opening the epiphany and loading pre-compiled or JIT-compiled kernels, and clsopen for a kernel stored as an internal string.

COPRTHR has two commands:

int coprthr_dopen( const char* path, int flags); to open the epiphany (which should be matched with a coprthr_dclose()) where path is the #defined value COPRTHR_DEVICE_E32 and flags is COPRTHR_O_THREAD. The integer returned is the device descriptor that is needed in a number of subsequent calls.

and at least one of the following:

coprthr_program_t coprthr_cc_read_bin( const char* path, int flags ); to open a pre-compiled object file (our bin.3.e32 file), where path is the file location of the binary and flags is zero. The return value is a handle to the compiled file.

or

coprthr_program_t coprthr_cc( const char* src, size_t len, const char* opt, char** log ); compiles a string (src) using the compile options (opt), returning a handle to the program. The log parameter is a pointer to the compiler output. Pass in a NULL pointer and a suitable amount of space will be allocated (which must then be freed).


e.g.

int dd = coprthr_dopen(COPRTHR_DEVICE_E32, COPRTHR_O_THREAD);
printf("dd=%d\n",dd);
if (dd<0)
{
    printf("device open failed\n");
    exit(0);
}

coprthr_program_t prg;
prg = coprthr_cc_read_bin("./pfunc.cbin.3.e32", 0);
printf("prg=%p \n",prg);
if (!(prg))
{
    printf("file pfunc.cbin.3.e32 not found\n");
    coprthr_dclose(dd);
    exit(0);
}

There is no call to compile source code in a file so read it into a string and compile it from there.
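A minimal sketch of that approach, borrowing the -mtarget=e32 option from the clcc command earlier in this post. The header names and the option string are assumptions, so check them against your installation:

#include <stdio.h>
#include <stdlib.h>
#include "coprthr.h"
#include "coprthr_cc.h"   /* assumed header for coprthr_cc(); check your install */

coprthr_program_t compile_kernel_file(const char* path)
{
    FILE* f = fopen(path, "rb");
    if (!f) return 0;
    fseek(f, 0, SEEK_END);
    long len = ftell(f);
    fseek(f, 0, SEEK_SET);

    char* src = (char*)malloc(len + 1);     /* read the whole file into a string */
    fread(src, 1, len, f);
    src[len] = '\0';
    fclose(f);

    char* log = 0;                          /* coprthr_cc allocates the log; free it when done */
    coprthr_program_t prg = coprthr_cc(src, len, "-mtarget=e32", &log);
    if (log) { printf("%s\n", log); free(log); }

    free(src);
    return prg;
}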

Allocating Shared Memory


Similar to the stdcl call clmalloc, COPRTHR has the call:

coprthr_mem_t coprthr_dmalloc(int dd, size_t size, int flags); where a memory space of size is allocated on the device (dd) returning a handle of type coprthr_mem_t. The argument flags is not used.

I call it a handle rather than a pointer because it is really not working memory. If you try and use it as working memory you will be bitterly disappointed. The only way to write to and read from it is with the calls described in the next section.

Shared memory can be resized using the coprthr_drealloc(dd, size, flags) call and should be freed using coprthr_dfree(int dd, coprthr_mem_t mem) when you are done.

e.g.

coprthr_mem_t p_data_mem = coprthr_dmalloc(dd, WIDTH*sizeof(float), 0);

Writing to and Reading From Shared Memory


In stdcl you would declare a variable of type cl_T * where T is a compatible simple type (e.g. int or float), call clmalloc(), initialise it in your host application and then call clmsync to transfer the data to shared memory.
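Roughly, that stdcl pattern looks like this (for comparison only; WIDTH is the same constant used in the COPRTHR examples below):

cl_float* buffer = (cl_float*)clmalloc(stdacc, WIDTH * sizeof(cl_float), 0);
for (int i = 0; i < WIDTH; i++) buffer[i] = -1 * i;          /* initialise on the host */
clmsync(stdacc, 0, buffer, CL_MEM_DEVICE | CL_EVENT_NOWAIT); /* push it to shared memory */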

COPRTHR does things a little differently. The handle returned by coprthr_dmalloc refers to memory in shared space only. To initialise it you need to declare, allocate and write to memory in your host program and then call:


coprthr_event_t coprthr_dwrite(int dd, coprthr_mem_t mem, size_t offset, void* ptr, size_t len, int flags); where dd is your device descriptor, mem is your memory handle, offset is how far into the shared memory you want to start writing, ptr is a pointer to your host storage the contents of which you want written into shared memory, len is the length (in bytes) of the data you want written and flags is one of COPRTHR_E_NOWAIT or COPRTHR_E_WAIT. The return value is an event handle which can be used in the call coprthr_dwaitevent(dd, event).

The contents of your host memory will be written to shared memory. In a similar way, once your kernel has finished its computation and written its results back, you call:

coprthr_event_t coprthr_dread(int dd, coprthr_mem_t mem, size_t offset, void* ptr, size_t len, int flags); where the arguments and return values are the same except of course the data gets read from shared memory into your host memory.

There is also coprthr_dcopy(...) which copies data from one device's shared memory into another, but that is not relevant on the Parallella given that there is only one device.

e.g.

float host_buffer[WIDTH];
for (i=0; i < WIDTH; i++) host_buffer[i] = -1 * i;

/// write data to shared DRAM
coprthr_mem_t p_data_mem = coprthr_dmalloc(dd, WIDTH*sizeof(float), 0);
coprthr_dwrite(dd, p_data_mem, 0, host_buffer, WIDTH*sizeof(float), COPRTHR_E_WAIT);

followed later by:

coprthr_dread(dd, p_data_mem, 0, host_buffer, WIDTH*sizeof(float),COPRTHR_E_WAIT);

Retrieving a Kernel


To call a kernel you need to retrieve a link to it from the program file you just read in or compiled. Similar to the stdcl clsym call, COPRTHR has:

coprthr_sym_t coprthr_getsym( coprthr_program_t program, const char* symbol); where program is the returned program handle and symbol is the name of the kernel. The return value is the handle to the kernel.

e.g.

coprthr_sym_t thr = coprthr_getsym(prg,"p_func_thread_mpi");
if (thr == (coprthr_sym_t)0xffffffff)
{
    printf("kernel p_func_thread_mpi not found\n");
    coprthr_dclose(dd);
    exit(0);
}


Calling a Kernel


There are three ways to call a kernel or kernels. The MPI library uses the call:

coprthr_mpiexec(int dd, int thrCount, coprthr_sym_t thr, void * args, int size, int flags); where dd is the device descriptor, thrCount is the number of threads you want to launch, thr is the handle to the kernel you are calling, args is a type-less pointer to a structure containing the kernel's arguments (see next paragraph), size is the size (in bytes) of the structure you are passing and flags is presumably one of COPRTHR_E_NOWAIT or COPRTHR_E_WAIT (I have not seen it used). I also presume that the call returns a variable of type coprthr_event_t.

To use the args variable in this call you need to typedef a structure that contains the variables you want to pass to your kernel. Do this in a separate include file and include it in both the host file and kernel file. Declare a variable of this type in your host program in host memory. The simple data types (int, float etc) can be written to the structure but the arrays need to be allocated in shared memory and initialised as described above. The structure is then passed to the kernel as a type-less pointer which can then be cast back to the structure type inside your kernel code.
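For example, the shared include file behind the e.g. below might contain something like this (the field names are mine; the pointer member is filled on the host with the value returned by coprthr_memptr and can be used as an ordinary pointer inside the kernel):

typedef struct {
    float n;        /* simple values travel by value inside the structure */
    int width;
    void* p_mem;    /* set to coprthr_memptr(p_data_mem, 0) on the host */
} my_args_t;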

e.g.

my_args_t args = {
.n = 2.0,
.width = WIDTH,
.p_mem = coprthr_memptr(p_data_mem, 0),
};

coprthr_mpiexec(dd, ECORES, thr, &args, sizeof(args), 0);



The alternative ways are the COPRTHR calls that were available before the MPI library appeared. These are:

coprthr_event_t coprthr_dexec( int dd, coprthr_kernel_t krn, unsigned int nargs, void** args, unsigned int nthr, void* reserved, int flags ); where dd is the device descriptor, krn is the handle to the kernel, nargs is the number of elements in the args variable (see next paragraph), args is a pointer to an array of pointers to the values to be passed to the kernel, nthr is the number of times the kernel should be executed, reserved is not used and flags is one of COPRTHR_E_NOWAIT or COPRTHR_E_WAIT. The return value is an event that can be used in the call coprthr_dwaitevent(dd, event).

and:

coprthr_event_t coprthr_dnexec( int dd, unsigned int nkrn, coprthr_kernel_t v_krn[], unsigned int v_nargs[], void** v_args[], unsigned int v_nthr[], void* v_reserved[], int flags); which allows a number of different kernels to be executed at once. Each array is a collection of the same arguments as in the dexec call.
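I have not used coprthr_dnexec myself, but going by the signature above a call launching two different kernels would look roughly like this (krn1, krn2 and the per-kernel argument arrays are placeholders):

coprthr_kernel_t v_krn[] = { krn1, krn2 };
unsigned int v_nargs[]   = { 3, 1 };             /* number of arguments for each kernel */
void** v_args[]          = { p_args1, p_args2 }; /* one argument array per kernel */
unsigned int v_nthr[]    = { ECORES, ECORES };   /* threads to launch for each kernel */
void* v_reserved[]       = { 0, 0 };

coprthr_dnexec(dd, 2, v_krn, v_nargs, v_args, v_nthr, v_reserved, COPRTHR_E_WAIT);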

The big difference between mpiexec and dexec is how the arguments are passed. The dexec call only accepts arrays and those arrays have to be in shared memory. This means passing singleton data via one-element arrays, which is a bit clumsy. But remember, if you use mpiexec to call the kernel, you don't have to use any MPI.

e.g.

float n[] = { 2.0 };
int width[] = { WIDTH };
coprthr_mem_t p_arg_n = coprthr_dmalloc(dd, sizeof(float), 0);
coprthr_mem_t p_arg_width = coprthr_dmalloc(dd, sizeof(int), 0);
coprthr_dwrite(dd, p_arg_n,0,&n, sizeof(float), COPRTHR_E_WAIT);
coprthr_dwrite(dd, p_arg_width, 0, &width, sizeof(int),   COPRTHR_E_WAIT);
void * p_args[] = { &p_arg_n, &p_arg_width, &p_data_mem };

coprthr_dexec(dd, thr, 3, p_args, ECORES, 0, COPRTHR_E_WAIT);


Changes to Kernel Code


Finally, now that you have called your kernel, you need to retrieve your arguments. Remember that shared memory is slow, so if you are going to use a value more than once it is better to make a local copy.

For kernels called with coprthr_mpiexec you need to cast the single kernel argument to your defined structure:


__kernel void p_func_thread_mpi( void * p_args )
{
my_args_t* pargs = (my_args_t*)p_args;
...
}


With coprthr_dexec you need to de-reference the singleton arguments:


__kernel void p_func_thread_std( float * p_arg_n,  int* p_arg_width, void * p_mem )
{
float n = p_arg_n[0];
int cnt = p_arg_width[0] / ECORES;
float * g_pmem = (float *)p_mem;  /// to be used in dma copy
...
}

Reserving Local Memory


In stdcl there is no way to dynamically allocate space other than with source code modification and dynamic compilation. COPRTHR has two calls that allow you to use free space in local memory:

void * coprthr_tls_sbrk(int size); which returns the contents of the system variable containing the address of the first byte of free memory and sets that system variable to point size bytes further along

and:

coprthr_tls_brk(void * addr); which sets the system variable containing the address of the first byte of free memory to addr.

These calls should be used as follows (if you are into neatness):

/// store the starting position of free space
void * memfree = coprthr_tls_sbrk(0);

/// grab a chunk of free space for local storage
float * localSpace = (float*)coprthr_tls_sbrk(cnt * sizeof(float));

/// do something with the space


/// reset the free space pointer to where it was initially
coprthr_tls_brk(memfree);

WARNING: The value returned by coprthr_tls_sbrk is an actual memory address. Remember, this is bare metal programming with no comfy operating system preventing you from writing data where you shouldn't.


Copying using DMA


Once you have a reference to shared memory (called global memory in OpenCL terms) and allocated some space, you need to copy it into memory local to the core. The eSDK documentation shows the basic synchronous copy command:

int e_dma_copy(void *dst, void *src, size_t bytes); which copies bytes of data from src to dst returning 0 if successful. This is a synchronous copy using the E_DMA_1 channel.

In our example the src is the pointer to global memory and the dst is the memory address returned by coprthr_tls_sbrk. You'd better make sure that bytes in e_dma_copy is the same as size in coprthr_tls_sbrk.

e.g.

e_dma_copy(localSpace, g_pmem + (cnt * rank), cnt*sizeof(float));

Asynchronous use of the DMA channels requires the use of:


int e_dma_start(e_dma_desc_t *descriptor, e_dma_id_t chan);
int e_dma_busy(e_dma_id_t chan);
void e_dma_wait(e_dma_id_t chan);
void e_dma_set_desc(e_dma_id_t chan, unsigned config, e_dma_desc_t *next_desc, unsigned stride_i_src, unsigned stride_i_dst, unsigned count_i, unsigned count_o, unsigned stride_o_src, unsigned stride_o_dst, void *addr_src, void *addr_dst, e_dma_desc_t *descriptor);

Which I will cover in a future post.

Final Thoughts


This post is more a how-to on using COPRTHR and a toe-in-the-water look at MPI and DMA copying. While I'm convinced that the combination will lead to better structured and faster kernels, I'm not sure by how much and at what cost in terms of program complexity.

My next step will be to compare MPI with direct memory writes along the lines of my data passing experiment. Until then, please let me know if I've missed anything.

Sunday, 1 February 2015

OpenCL on the Parallella - Structure, Communication and Debugging for Noobs

In this post, I'll cover the basic program structure of a simple OpenCL program, inter-core communication, dynamic space allocation and debugging when there is no debugger.

I'm writing on a Parallella-16 (server version) using the Brown Deer stdcl library (version 1.6) which is using the eSDK (version 5.13.09.10) from the 14.04 Ubuntu build. My example code can be cloned from github with the command:

git clone -b blog9Files https://github.com/nickoppen/nnP.git

Programming in OpenCL


The first thing to remember about OpenCL is that it was devised to provide a "standard" language for offloading numerically intensive tasks from the CPU to a GPU. The Epiphany accelerator is similar to a GPU in that it is a remote device with no operating system, and its memory is not shared with the CPU.

This means that you have to transmit the program and data to the accelerator, execute the program and retrieve the results. Once the program has halted, you have to start the whole process again. The data and code you had there previously will have disappeared.

Also, there are no basic operating system functions like threads or interrupts. You can program them yourself if you want but you have to start from scratch (or use something like freeRTOS). This is what's known as "bare metal" programming.

So, if this is what you are up for, let's start.

Basic Structure


If you've read my post on NDRanges you will be familiar with the calls clndrange_init1D and forka. These are a shorthand way of distributing a kernel to a specific number of cores where you want those cores to carry out exactly the same task on data you supply. This is the classic Single Instruction Multiple Data (SIMD) execution plan that suits some problems very well. Matrix multiplication, which lies at the heart of the feed-forward back-propagation neural network, is one such problem.

The following diagram and code illustrate how this works for the Parallella.



In the above diagram, I've included all the calls you need to get your program and data to and from the accelerator. Here are the critical bits on the host side:

  • You must use the OpenCL types for all variables that you will pass. Here I've used cl_T to indicate OpenCL types. They start with cl_ followed by one of the basic types (int, float etc.). This ensures that the types you use on the host side will be the same width as on the accelerator.
  • Pointer variables (e.g. arrays) must be allocated to an OpenCL context (the standard context for the accelerator is stdacc) using clmalloc. Note that argument 2 is the size of the space allocated in bytes, so multiply your array size by the size of the contained type.
  • Use the clopen command that suits your chosen compilation strategy. The first one is for kernels that have been linked at compile time (see my previous blog post on Code::Blocks). The second is for the JIT compiler with the cl code in a file and the third (clsopen) is for the JIT compiler with the kernel in a string (for more discussion see the section on Space Allocation below).
  • The clmsync using the CL_MEM_DEVICE indicates that the data is outbound from the host to the accelerator. CL_MEM_HOST indicates inbound data.
  • For a discussion of cl_ndrange and forka see my previous blog post titled "What was that NDRange thing?". Note that all arguments are passed by value, therefore it is not necessary to call clmsync for simple variable types like cl_int and cl_float.
  • All calls that enqueue an activity to the accelerator have the flag CL_EVENT_NO_WAIT. This indicates that the host should execute the command and continue without waiting for the accelerator to complete the task. The cl_wait command ensures that they are all finished before continuing. (The cl_flush is there because the example code from Brown Deer uses one. The clmesg that appears during execution indicates that cl_flush has not been implemented. I think it is supposed to complete all "enqueuements".)
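Putting the items in that list together, the host side looks roughly like this. It is a sketch from memory of the stdcl calls (I've written clsym, clforka and clwait as they appear in the Brown Deer examples), so double-check names, flags and argument order against your stdcl headers; the kernel and array names are placeholders:

#include <stdcl.h>

int n = 1024;
cl_float* arr_in  = (cl_float*)clmalloc(stdacc, n * sizeof(cl_float), 0);
cl_float* arr_out = (cl_float*)clmalloc(stdacc, n * sizeof(cl_float), 0);
for (int i = 0; i < n; i++) arr_in[i] = (cl_float)i;

void* h = clopen(stdacc, "my_kernel.cl", CLLD_NOW);              /* JIT compile from file */
cl_kernel krn = clsym(stdacc, h, "my_kernel", 0);                /* get a handle to the kernel */

clmsync(stdacc, 0, arr_in, CL_MEM_DEVICE | CL_EVENT_NOWAIT);     /* outbound data */
clndrange_t ndr = clndrange_init1d(0, 16, 16);                   /* one thread per core */
clforka(stdacc, 0, krn, &ndr, CL_EVENT_NOWAIT, arr_in, arr_out, n);
clmsync(stdacc, 0, arr_out, CL_MEM_HOST | CL_EVENT_NOWAIT);      /* inbound results */
clwait(stdacc, 0, CL_ALL_EVENT);                                 /* wait for everything to finish */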
On the Epiphany side you are (almost) back in the wonderful world of C-language programming. I say almost because there are a few additional things you have to think about and a few restrictions (especially if you have no debugger).

The arguments to the kernel are passed by value; therefore the arrays (arr_in and arr_out) are pointers into the Epiphany's global memory (I'm not sure where this memory actually is, but it is remote from the core). I think it is better to treat these arguments like Unix named pipes rather than variables. You can use them directly but the communication overhead in doing so is going to be large. Therefore, I declared local arrays and copied the remote data to them before starting the computation. Once this is complete, the results are copied back to remote memory before the kernel exits.

Note that I have used a simple for loop to copy the data from global to local memory. The examples supplied by Brown Deer all use the memcopy extension. I have not been able to find out how OpenCL extensions are used on the Parallella and so I've opted for a simple alternative that works in the same way which I'll replace as soon as I'm able.
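A matching kernel-side sketch of that pattern (MAX_N is a compile-time size you have to choose yourself, which is exactly the problem the Space Allocation section below deals with; the doubling loop just stands in for the real computation):

#define MAX_N 1024

__kernel void my_kernel(__global float* arr_in, __global float* arr_out, int n)
{
    __private float local_in[MAX_N];
    int i;
    int gid = get_global_id(0);
    int cnt = n / get_global_size(0);          /* this core's share of the data */

    for (i = 0; i < cnt; i++)                  /* copy global -> local with a simple loop */
        local_in[i] = arr_in[gid * cnt + i];

    for (i = 0; i < cnt; i++)                  /* do the real work on the local copy */
        local_in[i] = local_in[i] * 2.0f;

    for (i = 0; i < cnt; i++)                  /* copy the results back to global memory */
        arr_out[gid * cnt + i] = local_in[i];
}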

Note also that the local arrays are declared statically with a fixed size at compile time. There was a post on the forums where djm asked about dynamic allocation and the response from dar (David Ritchie at Brown Deer) was not very promising (see: here). Therefore I've opted again for a simple, obvious approach using static arrays, the size of which I manipulate on the fly (see Space Allocation below).

Once you are in your kernel, the need to put cl_ on the front of variables disappears. I've put the symbol __private in front of the arrays but I think that is redundant. With no memory equivalent to OpenCL __local memory (memory shared between groups of cores), anything declared within the kernel is automatically private.


Space Allocation


Static array sizes are a problem if you have to decide how large they are before you know what data they will have to accommodate. If you use the offline compiler (as described in my Code::Blocks post) this is what you have to do. However, with a Just In Time compiler you can calculate the size required and generate code (or compiler directives) when the host code is running and alter the kernel source code before it is compiled.

Calling the JIT compiler can be done in two ways. The simpler way is using the call:

void * handle = clopen(stdacc, strKernelSourceFile, CLLD_NOW);

This reads the source file named in argument 2 and compiles it straight away. The source file is just like any other.

You can also use:

void * handle = clsopen(stdacc, strKernelCode, CLLD_NOW);

In this call, argument 2 is a string containing the actual code. The advantage of this call is that there are no files to be read and your kernel code is compiled into your executable. It is still not exactly secure (anyone can use a text editor to open the executable file and look for readable text) but at least you don't need to deliver two files when you deploy your system. The downside is that the code needs new line characters to make it humanly readable. You can deliver your kernel in one enormous line and the compiler will not complain but that is not really practical for humans. To use a new line character in a string you need to use the \ character at the end of each line. I decided to use the first method just to avoid having to maintain the \ characters. 
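For what it's worth, the string form with the \ continuations looks something like this (a trivial placeholder kernel, just to show the shape):

const char* strKernelCode = "\
__kernel void scale(__global float* a)\n\
{\n\
    int gid = get_global_id(0);\n\
    a[gid] = a[gid] * 2.0f;\n\
}\n";

void* handle = clsopen(stdacc, strKernelCode, CLLD_NOW);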

I think the solution is to develop using the first strategy and then write a quick pre-processing step that generates the string from the source code for the Release version. This would either add the \ before each new line character or strip out the new line characters altogether (along with the // comments that need a new line to terminate).


Example


In my feed forward example I use a dynamically created include file filled with #defined values to contain all the characteristics of the neural network that do not change. To generate them I open, write to and close a file called cldefs.inc.

At the top of the file:
#define PATHTOCLDEFSFILE "/home/linaro/Work/nnP/cldefs.inc"

Inside the class, I calculate the values I want to define (see the function setNetworkTopology) and then call:

void writeDefsFile()
{
    fstream * pFile;
    cl_int i;

    if (checkExists(PATHTOCLDEFSFILE, true))
    {
        pFile = new fstream();
        pFile->open(PATHTOCLDEFSFILE, ios::out);
        (*pFile) << "#define CORECOUNT 16\n";
        (*pFile) << "#define LAYERCOUNT " << layerCount << "\n#define OUTPUTLAYER " << (layerCount - 1) << "\n";
        (*pFile) << "#define MAXWEIGHTTOLAYER " << maxWeightToLayer << "\n";
        (*pFile) << "#define LARGESTDERIVEDLAYER " << largestDerivedLayer << "\n";
        (*pFile) << "#define LARGESTINPUTLAYER " << largestInputLayer << "\n";
        (*pFile) << "#define INITWIDTHARRAY {";
        for (i=0; i<layerCount-1; i++)
            (*pFile) << (*layers)[i].nodeCount << ",";
        (*pFile) << (*layers)[i].nodeCount << "}\n";
        pFile->close();
        delete pFile;
    }
    else
        throw format_Error(ENN_ERR_CL_DEFS_NOT_FOUND);
}

Which produces the file:

#define CORECOUNT 16
#define LAYERCOUNT 4
#define OUTPUTLAYER 3
#define MAXWEIGHTTOLAYER 493
#define LARGESTDERIVEDLAYER 29
#define LARGESTINPUTLAYER 29
#define INITWIDTHARRAY {8,17,29,12}

And finally, in the cl file the #include and an example to remind me:

#include "/home/linaro/Work/nnP/cldefs.inc"
/// cldefs.inc contains #defines for all static variables
/// example contents of cldefs.inc
///#define CORECOUNT 16
///#define LAYERCOUNT 4
///#define OUTPUTLAYER 3          
///#define MAXWEIGHTTOLAYER 1024
///#define LARGESTDERIVEDLAYER 32
///#define LARGESTINPUTLAYER 32   
///#define INITWIDTHARRAY {32,32,16,16}

And later the declarations:

    /// local storage
    __private int   widths[] = INITWIDTHARRAY;
    __private float wgt[MAXWEIGHTTOLAYER];
    __private float biases[LARGESTDERIVEDLAYER];
    __private float in[LARGESTINPUTLAYER];
    __private float derived[LARGESTDERIVEDLAYER];

Works a treat!


Intercore Communication


If you've read my very first post on the design of the neural network simulator, you will have seen that each intermediate value calculated by one core needs to be distributed to all of the other cores for use in the next round. To do this I use this little gem which was brought to my attention by djm:

#define LOCAL_MEM_ADDRESS_BASE(gid) (((32 + ((gid) / 4)) << 26) | ((8 + ((gid) % 4)) << 20))

This calculates the memory base address of a core given the global id (int gid = get_global_id(0)). It is defined as a macro to avoid the overhead of a function call.

To use this you then need to provide the offset of the structure into which you want to write your value:

*(float *)(LOCAL_MEM_ADDRESS_BASE(gid_next) + ((unsigned int) derived) + (index * sizeof(float))) = value_being_passed;

The components of this snippet are:

  • LOCAL_MEM_ADDRESS_BASE calculates the beginning of the memory range
  • ((unsigned int) derived) is the address (pointer to) the derived value array which is being cast to an unsigned int for the addition
  • (index * sizeof(float)) is the index of the array into which we want to write the value_being_passed. Don't forget, this is all byte arithmetic so you need to multiply the index by the size of the contained type to land up in the right place.
One BIG note here: this will only work if the code on the core that is being written to is the same as on the core doing the writing. The location of the remote array is based on the address on the local core. If the core that is being written to has allocated the array in a different place you'll be writing somewhere you don't expect.

Example


Here is an example from my feed forward kernel:

gid_next = (gid == (CORECOUNT - 1)) ? 0 : gid + 1;
while (gid_next != gid)
{
    for (n=localFirstNode; n < localLastNode; n++)
       *(float *)(LOCAL_MEM_ADDRESS_BASE(gid_next) + ((unsigned int) derived) + (n*sizeof(float))) = derived[n];
    gid_next = (gid_next == CORECOUNT - 1) ? 0 : gid_next + 1;
    barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

//  debug - watch the derived values arriving at core 0 from the other nodes
//  if (gid == 0)
//      for (i=0; i<curLayerWidth; i++)
//          debug[d++] = derived[i];

}

Each core iterates through the values it has derived and writes them into the derived array of each other core.

Note the barrier at the end of each iteration through the nodes. This is to keep all cores in sync while they go about messing with each other's memory. This is handy during debugging, but barriers should be used sparingly given that they stall all cores until they have all reached the barrier, which means that the slowest core determines the execution time of all cores up to that point. In this case I don't believe they are necessary because I don't need anything else to be ready before I do the next write. I'll check that out when I focus on performance rather than correctness.

I've left some debugging code in this snippet. I'll cover that below.


A Few Notes About Performance and Efficiency


While I have not spent too much time on performance or efficiency, there were a few things that occurred to me while I was writing this kernel.

One thing was obvious enough to be put in from the beginning. I pass all the weights needed to calculate the output for the whole network into one big array called weights. This array dominates the space requirement for the calculation. For this reason, I only copy the weights needed to calculate the nodes assigned to each core into a local array called wgt. In this way not only is the calculation spread out over all cores, but the space requirement is as well. (NB now that I come to write this I notice that I've actually allocated enough space on each core for all the weights between the biggest two layers - 16 times as much as they actually need. I'll fix that up in the next version.)

There is also something that I can improve on the host side. The structure that I've presented above is that the host (ARM cores) provides the input data to the accelerator, fires off the kernel and then waits until it is done. With neural networks and many other applications, the kernel is called multiple times with new data. Therefore, it would be more efficient for the host to leave the accelerator running while it collects the next input set, and only wait once the next input set is ready to be sent. Thus, the execution time would be the slower of the host or the accelerator rather than the sum of the two.

The inter-core communication method that I've described above also leaves a lot to be desired. Each core writes to each other core's memory regardless of how far away it is in the mesh. By my calculation, if each core wrote one value to each other core, that would generate 640 inter-core write operations. It's hard to determine how long this would take given that they all happen in parallel and there would be clashes, but I'm sure there is a better way. Two possibilities come to mind:

  • Use the base address of each core to determine where it is in the mesh and then only write to its nearest neighbour in a predefined topology, starting with the value the core calculated itself and then passing on values received from its neighbours.
  • Use the Epiphany's broadcast mechanism.
I've got no idea how to do either of these just yet so they will have to be the subject of another post.



Debugging



I mentioned in my last post that I did not have any luck getting the Epiphany debugger going with OpenCL code. This is still the case.

Because I was mainly working with floats, I declared an array of floats on the host and made it 2048 items wide. I passed this to the kernel and gave myself an index int d = 0;. As the kernel executed I wrote values directly into this __global array, incrementing d on each assignment. I could write from one core, as in the debug example above, or, if I wanted say 18 values from each core, I reinitialised d to 18 times the global id so that each core wrote to a different part of the array. Then, on the host side, I pulled the values back just like my output values and wrote them to standard output separated by commas. I then copied the output to a file and loaded it into a spreadsheet as a csv file. I had to make sense of it with a lot of copying and pasting, but at least I could see what was going on. Primitive, but it's the best I came up with and it did get the job done. I've left a few examples in the source on github.
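In outline, the debug-array trick looks something like this (the names and the 2048 size are mine, as described above; it is a sketch rather than code lifted from the repository):

/* host side: allocate the debug buffer and pass it to the kernel as an extra
   __global float* argument, then pull it back and dump it as csv */
cl_float* debug = (cl_float*)clmalloc(stdacc, 2048 * sizeof(cl_float), 0);
/* ... clmsync it out, run forka with debug as the last argument, clmsync it back, clwait ... */
for (int i = 0; i < 2048; i++) printf("%f,", debug[i]);

/* kernel side: give each core its own slice and write intermediate values into it */
int d = 18 * get_global_id(0);      /* e.g. 18 values per core */
debug[d++] = derived[0];            /* any intermediate value you want to inspect */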

The process I used was:

  • Check the input using the debugger on the host side. If you don't start from a known position you'll have no hope of understanding what's going on in the kernel. 
  • Check the input on the accelerator side. Copy the input from the input arrays directly to the debug array and make sure that the data is arriving as you expect
  • Follow the execution through writing the intermediate results to the debug array.
  • Keep the execution constrained using barriers so that you know where the cores are up to when you write to the debug array
A few tips that I picked up along the way:

  • If you get a Segmentation Fault there is a problem on the host side. It could show up during a clmsync but it is a host problem. I found that it was usually that I had not allocated enough space in an array.
  • If there is a problem on the accelerator side the kernel will not return. Kill the waiting host program, scratch your head for a while and try again.
  • Initialise your arrays so that you can tell the difference between a calculated value that is wrong and a memory location that has not been written to. When you are dealing with lots of floats, the rubbish that is in the array before it is used looks very much like wrong answers.


Up Next...


I hope this summary of what I've learnt to date on a "serious" working kernel is of assistance to you. If you want to know more about the neural network side, I'll annotate the code on github, which will hopefully give you a better idea of how it works.

From here I'm going to move onto the simplest implementation of the back propagation algorithm that I can write. Given that the feed forward algorithm is the first part of the training sequence, followed by back propagation, I'll have to figure out how best to get these two to work together. Along with that I'll figure out how the performance benchmarking works so that, as I work on increasing the performance, I'll be able to tell if my efforts have paid off.

As always, please let me know if I've missed anything or if you have any questions...