## Intro

Over a couple of blogs I'll break down the process of getting the best performance out of a parallel architecture for processing neural networks.A few points before I begin:

First let me say that I am only going to look at Feed Forward - Back Propagation networks in the first instance. I might get into other types of neural networks with other training methods at a later time but for now I'm keeping things simple. This is also not a rigorous academic work. I'm only going to cover neural network theory in as much detail as I need in order to illustrate my program design.

Second, I must point out that this is an experiment. The method I develop here might not be the most efficient and if you believe you have a better way, please let me know. If I find a better way I'll update the post.

Third, I'm focussing on the Adapteva Parallella architecture. Whatever I write might work on other platforms but no promises from me. If you think it will work elsewhere please give it a try and let me know.

## Carving Up the Task

To get the best performance it is important to carve up the task so that the available hardware is as busy is it can be. Let's begin by looking at the task.#### Feed Forward - Back Propagation Networks

This is the basic shape of a feed forward - back propagation network:

Diagram 1 - Basic Structure

An input vector |x| (x(1) to x(n)) are fed into the network and produce an output vector |z| (z(1) to z(o)) via an intermediate or "hidden" vector |y| (y(1) to y(m)). What happens to transform inputs to outputs will be explained in the next paragraph. In the diagram, each circle is referred to as a Node and the nodes for each vector |x|, |y| and |z| are collectively known as a layer. Each node in the input layer is connected by a link (a.k.a. an arc) to every node in the hidden layer and similarly, each hidden node is linked to every output node.

The transition between input nodes and hidden nodes is shown in the following diagram:

Diagram 2 - Calculation of a single node (y(p))

This example a single node in the hidden layer is calculated (y(p)). To calculate y(p), each input value is multiplied by a weight (w(1,p) to w(n,p) in this diagram). The result is summed, added to bias value b(p) and passed through the Activation Function. The activation function is usually a non-linear function such as sigmoid or tanh.

This is the Feed Forward component of the behaviour. Back Propagation refers to the process by which weights and threshold values are adjusted in order for the network to "recognise" an input pattern and produce the output pattern. To use back propagation you need a large number of matched input and output vectors. The input vectors are presented to the network the resultant outputs produced by the feed forward pass are compared to the desired output in the matched set. The error for each output node is calculated and the weights between the hidden nodes and output nodes as well as the threshold value are then adjusted to reduce the magnitude of the error. Once the weights of the hidden to output layer links have been adjusted, the weights of the input to hidden layers are adjusted along with the threshold values on the hidden nodes. Thus the error is propagated backwards through the network. This is termed "training the network".

In the feed forward step weight multiplication must happen for each link between every input value and each hidden node and every hidden node and each output node. This is the bulk of the computation that must be optimised. The summation and Activation Function also represent significant work.

In the back propagation step each weight and threshold value must adjusted in proportion to it's contribution to the error. The adjustment is also "dampened" by a factor known as the learning rate (0..1) thus the weight adjustment is the multiplication of three values. Therefore, training (a feed forward pass followed by a back propagation pass) represents significantly more work than a feed forward pass alone.

#### Organising the Feed Forward Pass

Splitting apart Diagram 1 the complete task now looks like this:

Diagram 3 - Diagram 1 rearranged

I have rearranged Diagram 1 to highlight the regularity of the computation. To calculate a derived value (i.e. hidden nodes and output nodes) the same same pattern must be repeated. To again redraw the task in a more programatic way:

Diagram 4 - Node values and weights as arrays

Diagram 4 now has all the computation steps to calculate a derived value (hidden nodes in this case). The input vector and weight vectors are shown as array elements that need to be multiplied together to form the input to the summation and subsequently the activation function.

This is now the sort of problem that we can break down into work items for Parallella cores.

## Let's Get Practical

#### One Node per Core

Probably the most obvious way to split up the task is to assign one node to each core. This has the advantage of being simple to understand and therefore program. The big disadvantage is that it is unlikely that optimal performance will be achieved because it is unlikely that the number of nodes is evenly divisible by the number of cores.

#### Round Robin

It would also be possible to assign input * weight calculations from each link to each core in a Round Robin manner. This would ensure that all cores got the same amount of work but collecting all of the weighted inputs together for the summation step would be a communication headache.

#### Partitions

The best way I have come up with to split up the task is to partitions based on the number of cores. Assume that there are six input nodes and 16 hidden nodes. To split the task evenly between cores, core J, somewhere in the middle of the array would get the following task:

In Example 2 above there is only one job to do so starting at the lowest index and working through to the highest would be the most obvious option.

Example 1 however would be best done in reverse, i.e. the partial sum that is passed to the next core should be calculated first and passed so that it is ready when that core needs it. In this case the execution flow would be as follows:

0. Pass the input values and partitioned weight vectors (w(x,y) and w(y,z) to each core

1. Execute the input * weight calculations and any partial sum that must be passed to the next core

2. Pass the partial sum

3. Execute the input * weight / summation / activation function for the nodes that are wholly the task of the core

5. Pass the derived node value to the neighbouring cores to update their local memory in preparation for the next layer

6. Execute the input * weight calculation for any node where a partial sum is expected from an upstream core

7. Wait for the partial sum (it should already be there from step 2)

8. Add the incoming partial sum to the sum of the locally calculated values

9. Calculate the activation function for the node

10. Pass the derived node value to the neighbouring cores

11. Repeat steps 1-9 for the second layer ignoring step 5 (no need to update neighbouring cores)

12. Pass the final output values back to the host.

Diagram 5 - Core Assignment Example 1

Assigning the weight calculations evenly means that each core gets 10 to do. With 6 input values that means that there will be an overlap of 4 on each core.

If a core completes a whole set of link multiplications (for yq in this example) then it has all the information it needs to do the summation and activation function for that core.

If it starts but does not finish the job then it passes the partial sum onto the next core (core K in this example) which then finishes the sum and applies the activation function (for yr).

If it completes the job but does not start it then it must wait for the previous core to pass the partial sum before completing the sum and applying the activation function (for yp in this example).

The only other case to consider is when a core does not get to start or complete a whole derived node's worth of input * weight calculations.

Diagram 6 - Core Assignment Example 2

In this case, the core must wait for the incoming partial sum before passing it on to the next core. This would be a common occurrence on the Epiphany-64.

### Moving from the Hidden Layer to the Output Layer

The partitioning strategy assumes that all inputs to the calculation (x1..6 in the example) are know at the time the process starts. This can be set up to be that way initially when calculating the hidden nodes. All inputs and weights needed by each core can be sent to it during the set up phase.

At the end of the calculation of the hidden node values however, each core only has those hidden values that it has calculated itself. Therefore an efficient method must be devised to transmit all calculated values to all cores.

The most obvious method would be to write all hidden node values back to global memory and then each core can read the ones that it does not have (or read all of them if it is quicker).

My approach would be to try something different. Why not use the on-chip write network to transmit the hidden node values to the left and right as they are calculated. In this way, the transmission can be started before all of the values are known and each core can start the second phase calculation when it has a full set of values.

It might be best to initially pass the hidden node values in one direction (e.g. to a core with a higher index see next section) and then when the core has received all of its upstream values then to pass in the other direction. This would mean that the on-chip write network would be writing in one direction at a time and therefore hopefully would avoid collisions.

It might be best to initially pass the hidden node values in one direction (e.g. to a core with a higher index see next section) and then when the core has received all of its upstream values then to pass in the other direction. This would mean that the on-chip write network would be writing in one direction at a time and therefore hopefully would avoid collisions.

I'm not sure which strategy will work best. The on-chip write network would be pretty busy for a period. However, if each core can determine when it has a full set of values this strategy prevents the need for coordination make sure that all values have been posted to global memory before the cores are updated. Also, the off-chip write network is 8 times slower and would also get pretty congested when all the cores were updating.

### Epiphany Internal Network Configuration

The partitioning strategy as outlined above has a linear relationship between the cores that reflects the linear organisation of the data. I propose that the most efficient flow of data between cores is as shown:

Diagram 7 - Data Pathway on a Epiphany-16

## Execution Flow

I think that it is worth having a think about how the execution will happen and how the inter-core communication will fit into it.In Example 2 above there is only one job to do so starting at the lowest index and working through to the highest would be the most obvious option.

Example 1 however would be best done in reverse, i.e. the partial sum that is passed to the next core should be calculated first and passed so that it is ready when that core needs it. In this case the execution flow would be as follows:

0. Pass the input values and partitioned weight vectors (w(x,y) and w(y,z) to each core

1. Execute the input * weight calculations and any partial sum that must be passed to the next core

2. Pass the partial sum

3. Execute the input * weight / summation / activation function for the nodes that are wholly the task of the core

5. Pass the derived node value to the neighbouring cores to update their local memory in preparation for the next layer

6. Execute the input * weight calculation for any node where a partial sum is expected from an upstream core

7. Wait for the partial sum (it should already be there from step 2)

8. Add the incoming partial sum to the sum of the locally calculated values

9. Calculate the activation function for the node

10. Pass the derived node value to the neighbouring cores

11. Repeat steps 1-9 for the second layer ignoring step 5 (no need to update neighbouring cores)

12. Pass the final output values back to the host.

## Next Up... Training

Now that I've covered the feed forward process, I'll spend some time thinking about training. Again, it is the transition from the output layer back to the hidden layer that seems to be the tricky bit.

Please post any comments if you have spotted any problems with my process or if you have any other ideas.