Sunday, 6 December 2015

Getting It Done - What I learnt from finishing the Neural Network Algorithm

Sometimes, implementation causes rethinks. In this post I'll review the design decisions I've made getting the back propagation algorithm working, hopefully shedding some light on the practicalities of implementing an algorithm using the Parallella architecture that may be helpful to others.

Before reading this post, I highly recommend watching the presentation by Anish Varghese from the ANU. They did some tricky stuff in machine code to gain a performance boost that is not available in straight OpenCL but their comments on the relative speed of inter-core communication and ARM-Epiphany communication is relevant regardless of language. 

Writing this post I'm using Brown Deer OpenCL (version 1.6.2) and the Ubuntu 14.04 image from 30 Jan 2015.

To have a look at the code referred to in this post look here or execute the following shell command:

git clone -b blog12Files https://github.com/nickoppen/nnP.git


Introduction


Since the last posting, I have written the code that trains my neural network - a process called back propagation. The first thing I had to do was to restructure the feed forward kernel so that it did not discard the intermediate values that are needed in the backward pass. However, it was the management of the complexity that proved the biggest challenge. To manage the complexity I made two design decisions one of which worked out and one that didn't.

Restructure of Feed Forward


My primary goal was to get the algorithm working correctly with speed and efficiency in second place. I had already written the first part of the process (called the feed forward pass) and I'd made some design decisions that were suitable for that process in isolation but proved poor choices for back propagation. 

A neural network is made up of a number of layers (>=3). Each layer takes an input vector and produces an output vector which then become the input to the next layer. When only running the network forward the intermediate results can be discarded and initially that is what I did.

However, when I came to write the back propagation process, I needed these values. I had to change the feed forward algorithm to keep them. While I think I came up with a better solution in the end, the basic lesson, "start with the end in mind" aka "Top down design" still applies.


Complexity - Sometimes you can make it worse!


Back propagation is a tricky little algorithm. Adjusting the weights requires calculating the error in the output and for the intermediate layers that is a sum of the errors on the outgoing weights. In a single threaded implementation you just take one (intermediate) node at a time and iterate through the outgoing weights which are all available in shared memory, calculate the error and then iterate through the incoming weights and calculate the adjustment. While this is a little simplified, the key point is that all of the data is available at all times. 

In the Epiphany architecture each processor has its own memory and so how the task is split up and how data is shared between processors are critical design decisions and come with their own set of problems.

Decision 1 - What goes where


In my first blog post I discussed carving up the task. It was a very "thinking about" sort of post. It had to be, I hadn't gotten the Parallella yet. For the purposes of this post, I'll take a simpler approach. EVERY CORE GETS 1/16th OF THE JOB (or as close to it as possible).

Complexity 1: Indexes

That does not sound complicated but even in a simple, input -> process -> output where you are processing data in arrays you start to get a proliferation of indexes. For example:



You have one for the global input, one for the local copy of the chunk processed by the individual core from which you produce a chunk of output which you then have to copy back to global memory before exiting. If the dimensions of the input and output are different this means that you are juggling four indexes.

A neural network is a little more complicated still. They are multi-dimensional with the size of each dimension (i.e. each layer) changing as you pass through the data structure.

Complexity Reducing Idea No. 1: Keep all the arrays the same size. Therefore the data structure private to the core is the same size as the global arrays. You have to make is as big as the biggest layer and then recalculate where each core starts and stops as you pass through the data. The data flow would look like this (highly simplified):



Clearly there is wasted space in the memory local to all cores (where space is at a premium) but traversing all arrays needs only one index.

Initially I had the global input and the local copy of the input aligned and the local copy of the node biases aligned and the intermediate values and the output aligned. That worked while I was only running the network forward. Then I got to do the back propagation and remembered that I needed to keep the intermediate values generated by the hidden layers. Then I decided that it would be a good idea to keep a copy of the node biases (and weights) so that they could be updated locally and the training could be run multiple times. Then I decided that it would be more efficient for the input to be copied onto the front of the array containing the output for each layer then it could be treated in the same way as the output from layer zero.

Idea No.1 turned out to be not very bright at all. Now I have virtually no alignment and a lot more wasted space that I bargained for - none of the benefit and a much higher cost. The other thing about back propagation is that you are referring to values from the current layer, the next layer and the previous layer in order to adjust the weights so you end up with lots of indexes anyway. This mess is still a part of the files for this post. I'll clean it up for the next version.

The only solution as I see it is to minimise the use of private memory and use descriptive names for the indexes. Coming up with good names is tedious and difficult but in a complex program it will make a huge difference. You will also end up with a lot of "parallel" indexes in for loops, thus: 

local_index = global_start - core_offset;   // set the relative start positions
for (global_array_iterator = global_start; global_array_iterator < global_end; global_array_iterator++)
{
     local_array[local_index] = genius(global_array[global_array_iterator]);
     local_index++;    // manually advance one iterator
}

Also, using the overall strategy of copy-in, process, copy-out will mean that mixing the local (_private) memory indexes and the global memory indexes are not mixed too much.

I store all of the indexes in an array of structures so that I don't have to work it out each time. This meant that there was an array of indexes that needed its own index but that came for free because it was the same as the layer that I was working on (i.e. the outermost for loop index).

Decision 2: Passing Intermediate Values


So, each core is processing 1/16th of the data. In my neural network simulator I decided that the variable length data is passed in via a couple of arrays and that the space for them is compiled into the kernel using JIT compilation as I described in a previous post. The method I used to decide which core processed which section of the data was as follows:

#define CORECOUNT 16

int gid = get_global_id(0);
int allCores = dataWidth / CORECOUNT;  /// all cores get this many
int leftOver = dataWidth % CORECOUNT;  /// the remainder are distributed one per core starting from core0 until all are allocated

int start = ((gid * allCores) + min(gid, leftOver)); 
int end = start + allCores + ((gid < leftOver) ? 1 : 0);

This works well for each core to determine where to start processing in the global data structure. What it does not tell you is which core is processing an arbitrary data point. 

If you have read my hand-waving description of the back propagation process you will recall that the "back" part refers to the backwards flow of data to previous layers in order to determine the error of those layers (let's call this data "weight deltas"). 

I've arranged my weights to be contiguous for the node that receives their contribution in the forward pass. This allows the forward pass to calculate the node value by traversing a contiguous sequence of weight values. In the backwards pass the weight delta of each weight has to be returned to its "origin node". To pass it directly back to the core that "owns" that node I needed a similar simple formula.

Complexity 2: Inter-core communication

With processing distributed over a number of cores, figuring out what core is responsible for which datum can be difficult.

Complexity Reducing Idea No. 2: Use global memory as a scratch pad to communicate between the cores.

The good thing about global memory is that it is big (32Mb). The bad thing is that it is slow. The ugly thing is that all cores have equal access to all of it and therefore you'd better be careful about race conditions.

To return the weight delta to it's "origin node", I calculate it when updating "destination node" and write it to the global array called g_weightDeltas. Then, when I come to calculate the error of the "origin node" I read it from g_weightDeltas adding them as I go. This is not quite as simple as it sounds. When calculating the error of each "origin node" you need a weight delta from all nodes in the subsequent layer. I decided to organise g_weightDeltas in the same way is the weights themselves for consistency. This means that writing them happens in the same order as calculating them but reading them requires some index gymnastics to pick the right value from the array, thus:

outputError += g_weightDeltas[nextLayer_globalWgtZero + (w * curLayerWidth) + layerNodeIterator];

where:
outputError is the accumulated error for the node
nextLayer_globalWgtZero is the index of first weight connecting the current layer with the next layer (don't forget that I'm organising the weight deltas in the same way as the weights)
curLayerWidth is the number of nodes in the current layer (therefore the number of weights for each node in the next layer)
w is the current weight
layerNodeIterator is the current node

Idea No. 2 did work out. For a little index gymnastics I saved a bit of tricky programming (aka laborious debugging) and I now have a way of checking the inter-core communication when I come to do it (leaving the global array there for now I can check the directly passed values to the global array values). The global array is easy to debug because I can call clmsync on return and check it's values. 

For this stage of the project, I'm not really concerned about speed but I will have to address this in the next version. I avoid race conditions in a macro way by ensuring that every layer has updated itself and the global array before the global array is read by using a barrier command. 

There is one little issue to remember when using global memory for internal purposes. It needs to be declared and allocated on the host. You don't need to do anything with it on the host but it cannot be allocated in a kernel.


Looking Forward


Having (yet again) learnt to design before coding, I've had a think about how I want the overall system to look. 

In my value passing experiment, I noticed that the overhead of launching a kernel is huge. Therefore the best way of getting the most performance out of the Epiphany is to launch the kernel once and then keep feeding it with data until there is not more data left.

This suits back propagation well because training requires multiple iterations through the training set which could be too big for global memory. Also, loading it into global memory before launching the kernel would introduce a serial component to the process that could be done by the host in parallel with the training that is happening on the Epiphany (and that's what we are all about after all).

Multiple executions of the feed forward algorithm also makes sense in certain applications. If you are using the neural net to recognise features in a video stream you will probably want to pass frames to the network as they arrive. Waiting for the kernel to launch for each frame would probably cause frame to be dropped while the host waits.

So, my target system looks like this:





It looks a bit complicated but the key thing to get is that the host is reading the next lot of input data into global memory while the Epiphany is processing. It then uses some sort of signalling mechanism to indicate that the data is ready. When the Epiphany is done, it signals back to the result reader thread to pull back the results from global memory.

To simplify the diagram, I've assumed that the main host thread is always going to be slower than the Epiphany and therefore never has to wait. While this might be the case in some applications, it may also have to wait to ensure that it does not get too far ahead and starts over-writing unprocessed data.


Up Next...


The next job is to figure out how to do the signalling. I'm going to have a look at the MPI libraries now included in the Brown Deer system. I think that they will do the job. 

After that, I'll tighten up local memory and figure out how to pass the weight deltas back to their origin node in an efficient manner.

As always, please comment, especially if I've missed something or if you have a suggestion. You can comment either here on Blogger, on the Parallella forum or on reddit.