
Sunday, 6 December 2015

Getting It Done - What I learnt from finishing the Neural Network Algorithm

Sometimes, implementation causes rethinks. In this post I'll review the design decisions I made while getting the back propagation algorithm working, hopefully shedding some light on the practicalities of implementing an algorithm on the Parallella architecture that may be helpful to others.

Before reading this post, I highly recommend watching the presentation by Anish Varghese from the ANU. He and his colleagues did some tricky stuff in machine code to gain a performance boost that is not available in straight OpenCL, but their comments on the relative speed of inter-core communication and ARM-Epiphany communication are relevant regardless of language.

Writing this post I'm using Brown Deer OpenCL (version 1.6.2) and the Ubuntu 14.04 image from 30 Jan 2015.

To have a look at the code referred to in this post look here or execute the following shell command:

git clone -b blog12Files https://github.com/nickoppen/nnP.git


Introduction


Since the last posting, I have written the code that trains my neural network - a process called back propagation. The first thing I had to do was to restructure the feed forward kernel so that it did not discard the intermediate values that are needed in the backward pass. However, it was managing the complexity that proved the biggest challenge. To keep it under control I made two design decisions, one of which worked out and one that didn't.

Restructure of Feed Forward


My primary goal was to get the algorithm working correctly with speed and efficiency in second place. I had already written the first part of the process (called the feed forward pass) and I'd made some design decisions that were suitable for that process in isolation but proved poor choices for back propagation. 

A neural network is made up of a number of layers (>=3). Each layer takes an input vector and produces an output vector, which then becomes the input to the next layer. When only running the network forward, the intermediate results can be discarded and initially that is what I did.

However, when I came to write the back propagation process, I needed these values. I had to change the feed forward algorithm to keep them. While I think I came up with a better solution in the end, the basic lesson, "start with the end in mind" aka "Top down design" still applies.
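
To make the change concrete, here is a minimal, single threaded sketch of a forward pass that keeps every layer's output instead of throwing it away. The names and layout are invented for the post rather than lifted from my kernel; nodeValues[0] holds the input, which is the "output of layer zero" idea I come back to below.

#include <math.h>

/* nodeValues[layer] holds the output vector of that layer; nothing is discarded,
   so the backward pass can read the intermediate values later. */
void feedForward(const float *input, const int *layerWidth, int layerCount,
                 float **weights, float **biases, float **nodeValues)
{
    int layer, node, w;

    for (node = 0; node < layerWidth[0]; node++)
        nodeValues[0][node] = input[node];            /* the input is the "output" of layer 0 */

    for (layer = 1; layer < layerCount; layer++)
        for (node = 0; node < layerWidth[layer]; node++)
        {
            float sum = biases[layer][node];
            for (w = 0; w < layerWidth[layer - 1]; w++)
                sum += nodeValues[layer - 1][w] * weights[layer][(node * layerWidth[layer - 1]) + w];
            nodeValues[layer][node] = tanhf(sum);     /* keep it - the backward pass needs it */
        }
}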


Complexity - Sometimes you can make it worse!


Back propagation is a tricky little algorithm. Adjusting the weights requires calculating the error at each node, and for the intermediate layers that error is the sum of the errors flowing back over the outgoing weights. In a single threaded implementation you just take one (intermediate) node at a time, iterate through the outgoing weights (which are all available in shared memory) to calculate the error, and then iterate through the incoming weights to calculate the adjustment. While this is a little simplified, the key point is that all of the data is available at all times.

In the Epiphany architecture each processor has its own memory and so how the task is split up and how data is shared between processors are critical design decisions and come with their own set of problems.

Decision 1 - What goes where


In my first blog post I discussed carving up the task. It was a very "thinking about" sort of post. It had to be; I hadn't gotten the Parallella yet. For the purposes of this post, I'll take a simpler approach. EVERY CORE GETS 1/16th OF THE JOB (or as close to it as possible).

Complexity 1: Indexes

That does not sound complicated, but even in a simple input -> process -> output kernel where you are processing data in arrays, you start to get a proliferation of indexes. For example:



You have one index for the global input and one for the local copy of the chunk processed by the individual core, from which you produce a chunk of output that you then have to copy back to global memory before exiting. If the dimensions of the input and output are different, you are juggling four indexes.
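
To illustrate (a generic sketch, not my kernel - localIn, localOut and process() are made-up names):

/* Copy-in, process, copy-out with different input and output sizes:
   four indexes before the real work even starts. */
int g_in, l_in;       /* global and local indexes over the input  */
int g_out, l_out;     /* global and local indexes over the output */

for (g_in = inStart, l_in = 0; g_in < inEnd; g_in++, l_in++)
    localIn[l_in] = globalIn[g_in];                    /* copy my chunk in         */

process(localIn, localOut);                            /* do the actual work       */

for (g_out = outStart, l_out = 0; g_out < outEnd; g_out++, l_out++)
    globalOut[g_out] = localOut[l_out];                /* copy my results back out */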

A neural network is a little more complicated still. They are multi-dimensional with the size of each dimension (i.e. each layer) changing as you pass through the data structure.

Complexity Reducing Idea No. 1: Keep all the arrays the same size, so that each data structure private to the core is the same size as the global arrays. You have to make each one as big as the biggest layer and then recalculate where each core starts and stops as you pass through the data. The data flow would look like this (highly simplified):



Clearly there is wasted space in the memory local to all cores (where space is at a premium) but traversing all arrays needs only one index.

Initially I had the global input and the local copy of the input aligned, the local copy of the node biases aligned, and the intermediate values and the output aligned. That worked while I was only running the network forward. Then I got to the back propagation and remembered that I needed to keep the intermediate values generated by the hidden layers. Then I decided that it would be a good idea to keep a local copy of the node biases (and weights) so that they could be updated locally and the training could be run multiple times. Then I decided that it would be more efficient for the input to be copied onto the front of the array containing the output of each layer, so that it could be treated in the same way as the output of layer zero.

Idea No.1 turned out to be not very bright at all. Now I have virtually no alignment and a lot more wasted space than I bargained for - none of the benefit and a much higher cost. The other thing about back propagation is that you are referring to values from the current layer, the next layer and the previous layer in order to adjust the weights, so you end up with lots of indexes anyway. This mess is still a part of the files for this post. I'll clean it up for the next version.

The only solution as I see it is to minimise the use of private memory and use descriptive names for the indexes. Coming up with good names is tedious and difficult but in a complex program it will make a huge difference. You will also end up with a lot of "parallel" indexes in for loops, thus: 

local_index = global_start - core_offset;   // set the relative start positions
for (global_array_iterator = global_start; global_array_iterator < global_end; global_array_iterator++)
{
     local_array[local_index] = genius(global_array[global_array_iterator]);
     local_index++;    // manually advance one iterator
}

Also, using the overall strategy of copy-in, process, copy-out means that the local (__private) memory indexes and the global memory indexes do not get mixed up too much.

I store all of the indexes in an array of structures so that I don't have to work them out each time. This meant that there was an array of indexes that needed its own index, but that came for free because it was the same as the layer I was working on (i.e. the outermost for loop index).
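
Something along these lines (a sketch only - my actual structure has different names and a few more fields):

/* Per-layer bookkeeping, worked out once before the main loops.
   The array is indexed by the same layer counter as the outermost for loop. */
typedef struct
{
    int globalNodeZero;    /* where this layer's nodes start in the global arrays         */
    int globalWgtZero;     /* where this layer's incoming weights start                   */
    int coreFirstNode;     /* first node in this layer that this core is responsible for  */
    int coreLastNode;      /* one past the last node this core is responsible for         */
} layerIndexes;

layerIndexes idx[MAX_LAYERS];   /* MAX_LAYERS stands in for however many layers the kernel is built for */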

Decision 2: Passing Intermediate Values


So, each core is processing 1/16th of the data. In my neural network simulator I decided that the variable length data is passed in via a couple of arrays and that the space for them is compiled into the kernel using JIT compilation as I described in a previous post. The method I used to decide which core processed which section of the data was as follows:

#define CORECOUNT 16

int gid = get_global_id(0);
int allCores = dataWidth / CORECOUNT;  /// all cores get this many
int leftOver = dataWidth % CORECOUNT;  /// the remainder are distributed one per core starting from core0 until all are allocated

int start = ((gid * allCores) + min(gid, leftOver)); 
int end = start + allCores + ((gid < leftOver) ? 1 : 0);

This works well for each core to determine where to start processing in the global data structure. What it does not tell you is which core is processing an arbitrary data point. 
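
For what it's worth, the mapping can be inverted with a little arithmetic. This is a sketch built on the same allCores and leftOver values as above; I haven't used it in the kernel yet, so treat it as untested:

/* Given a global index i, return the gid of the core that owns it.
   Cores 0..leftOver-1 own (allCores + 1) items each; the rest own allCores. */
int owningCore(int i, int allCores, int leftOver)
{
    int boundary = leftOver * (allCores + 1);   /* first index owned by a core with no extra item */

    if (i < boundary)
        return i / (allCores + 1);

    return leftOver + ((i - boundary) / allCores);
}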

If you have read my hand-waving description of the back propagation process you will recall that the "back" part refers to the backwards flow of data to previous layers in order to determine the error of those layers (let's call this data "weight deltas"). 

I've arranged my weights to be contiguous for the node that receives their contribution in the forward pass. This allows the forward pass to calculate the node value by traversing a contiguous sequence of weight values. In the backwards pass the weight delta of each weight has to be returned to its "origin node". To pass it directly back to the core that "owns" that node I needed a similar simple formula.

Complexity 2: Inter-core communication

With processing distributed over a number of cores, figuring out what core is responsible for which datum can be difficult.

Complexity Reducing Idea No. 2: Use global memory as a scratch pad to communicate between the cores.

The good thing about global memory is that it is big (32MB). The bad thing is that it is slow. The ugly thing is that all cores have equal access to all of it and therefore you'd better be careful about race conditions.

To return the weight delta to its "origin node", I calculate it when updating the "destination node" and write it to the global array called g_weightDeltas. Then, when I come to calculate the error of the "origin node", I read it from g_weightDeltas, adding the values as I go. This is not quite as simple as it sounds. When calculating the error of each "origin node" you need a weight delta from every node in the subsequent layer. I decided to organise g_weightDeltas in the same way as the weights themselves for consistency. This means that writing them happens in the same order as calculating them, but reading them requires some index gymnastics to pick the right value from the array, thus:

outputError += g_weightDeltas[nextLayer_globalWgtZero + (w * curLayerWidth) + layerNodeIterator];

where:

  • outputError is the accumulated error for the node
  • nextLayer_globalWgtZero is the index of the first weight connecting the current layer with the next layer (don't forget that I'm organising the weight deltas in the same way as the weights)
  • curLayerWidth is the number of nodes in the current layer (and therefore the number of weights for each node in the next layer)
  • w is the current outgoing weight of the node (one per node in the next layer)
  • layerNodeIterator is the current node
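
Putting the write and the read side by side makes the gymnastics a little clearer. This is a tidied-up sketch rather than a copy of the kernel; d, s, weightFromStoD, deltaOfNodeD and nextLayerWidth are names invented for the post:

/* Forward order: while updating destination node d in the next layer, leave one
   delta behind for each incoming weight (s is the origin node in the current layer). */
g_weightDeltas[nextLayer_globalWgtZero + (d * curLayerWidth) + s] = weightFromStoD * deltaOfNodeD;

/* Backward order: while calculating the error of origin node layerNodeIterator,
   gather the deltas left behind by every destination node in the next layer. */
outputError = 0.0f;
for (w = 0; w < nextLayerWidth; w++)
    outputError += g_weightDeltas[nextLayer_globalWgtZero + (w * curLayerWidth) + layerNodeIterator];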

Idea No. 2 did work out. For a little index gymnastics I saved a bit of tricky programming (aka laborious debugging) and I now have a way of checking the inter-core communication when I come to implement it (by leaving the global array there for now, I can check the directly passed values against the global array values). The global array is easy to debug because I can call clmsync on return and check its values.

For this stage of the project I'm not really concerned about speed, but I will have to address it in the next version. I avoid race conditions in a coarse way by using a barrier to ensure that every core has updated the global array for a layer before the global array is read.

There is one little issue to remember when using global memory for internal purposes: it needs to be declared and allocated on the host. You don't need to do anything with it on the host, but it cannot be allocated from within a kernel.
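
For reference, the host side amounts to a couple of stdcl calls, roughly like this. This is a sketch from memory rather than working code - check the Brown Deer documentation for the exact flags, and note that totalWeightCount, krn and the argument index are placeholders:

/* Allocate the scratch buffer on the host and hand it to the kernel.
   The host never needs to fill it in - it just has to exist. */
cl_float * g_weightDeltas = (cl_float *)clmalloc(stdacc, totalWeightCount * sizeof(cl_float), 0);

clarg_set_global(stdacc, krn, 7, g_weightDeltas);   /* 7 is just an example argument index */

/* ... clfork the kernel as usual ... */

/* Only needed while debugging: pull the scratch values back to the host for a look. */
clmsync(stdacc, 0, g_weightDeltas, CL_MEM_HOST | CL_EVENT_NOWAIT);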


Looking Forward


Having (yet again) learnt to design before coding, I've had a think about how I want the overall system to look. 

In my value passing experiment, I noticed that the overhead of launching a kernel is huge. Therefore the best way of getting the most performance out of the Epiphany is to launch the kernel once and then keep feeding it with data until there is no more data left.

This suits back propagation well because training requires multiple iterations through the training set, which could be too big for global memory. Also, loading it all into global memory before launching the kernel would introduce a serial component to the process that could otherwise be done by the host in parallel with the training happening on the Epiphany (and that's what we are all about, after all).

Multiple executions of the feed forward algorithm also make sense in certain applications. If you are using the neural net to recognise features in a video stream you will probably want to pass frames to the network as they arrive. Waiting for the kernel to launch for each frame would probably cause frames to be dropped while the host waits.

So, my target system looks like this:





It looks a bit complicated but the key thing to grasp is that the host is reading the next lot of input data into global memory while the Epiphany is processing. It then uses some sort of signalling mechanism to indicate that the data is ready. When the Epiphany is done, it signals the result reader thread to pull the results back from global memory.

To simplify the diagram, I've assumed that the main host thread is always going to be slower than the Epiphany and therefore never has to wait. While this might be the case in some applications, it may also have to wait to ensure that it does not get too far ahead and start overwriting unprocessed data.


Up Next...


The next job is to figure out how to do the signalling. I'm going to have a look at the MPI libraries now included in the Brown Deer system. I think that they will do the job. 

After that, I'll tighten up local memory and figure out how to pass the weight deltas back to their origin node in an efficient manner.

As always, please comment, especially if I've missed something or if you have a suggestion. You can comment either here on Blogger, on the Parallella forum or on reddit.

Tuesday, 1 October 2013

Training in Parallel


Training - the hard bit

Feeding forward is only part of the story. Any useful, real-world application with a significant number of inputs will need to be trained in an automatic fashion using a significant number of example data sets. Reed and Marks[1], the text that I mainly use, quotes Hinton[2] who says that a typical network with w weights requires O(w) training patterns and O(w) weight updates for good generalisation. On a serial machine this would take O(w^3) time, which would only be reduced by a factor of w on parallel hardware, giving a training time of O(w^2). Clearly, an efficient implementation of the training algorithm is required.

How training works

Typically, before training commences, the weights on the links and the bias values on the nodes are set to random values. When you feed your input pattern into the untrained network and calculate the output pattern, it will not bear any resemblance to the desired output. The weights and biases will have to be adjusted so that the calculated output matches the desired output (within predefined limits). The adjustment of each weight and bias is proportional to the amount of error it contributed to the final result.

For an output node the calculation is straightforward. The error is the desired value minus the calculated value. This error is then multiplied by the first derivative of the activation function, giving the delta for the bias and incoming links. This delta is used to "nudge" the incoming weights and the output node bias in the direction that would result in a calculated output closer to the desired pattern.

The magnitude of the nudge is determined by a value called the "learning rate". This is a value between 0 and 1 which is a multiplicand when updating the weights and bias values.

The error for the hidden layer is a little more difficult. Each weight between the hidden and output layer has contributed a little to the error of the output nodes. The error of a hidden node is the sum of these contributions (specifically, the weight times the output node delta). Again, a delta for the hidden node is calculated from this summed error and the first derivative of the activation function of the node.
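
Ignoring the parallel layout for a moment, the two delta calculations come down to something like this. It is a single threaded sketch with made-up names; actDerivative stands in for the first derivative of whatever activation function you are using:

/* Output layer: the error is simply desired minus calculated. */
for (j = 0; j < outputCount; j++)
    outputDelta[j] = (target[j] - output[j]) * actDerivative(output[j]);

/* Hidden layer: the error is the sum of (outgoing weight * downstream delta). */
for (i = 0; i < hiddenCount; i++)
{
    float err = 0.0f;
    for (j = 0; j < outputCount; j++)
        err += weightHiddenToOutput[(j * hiddenCount) + i] * outputDelta[j];
    hiddenDelta[i] = err * actDerivative(hidden[i]);
}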

Once the error on each hidden and output node has been calculated the incoming weights and node biases must be updated. Reed and Marks describe two variations called Batch Update and Online Update.

Training Variation 1 - Batch Update

In Batch Update the whole training set is passed through the network. The deltas for every node for every training set are calculated. The "total error" for the training set is the average of each node's deltas. Once this has happened the weights and biases are updated once.


Training Variation 2 - Online Update

In Online Update, each training pattern, (i.e. one set of inputs and the matching output) is run through the network, the deltas are calculated and the weights are updated straight away.
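
The difference is really just when the nudge is applied. A sketch for the incoming weights of a single node (the names are invented; nodeDelta and inValue come from the delta calculation described above):

/* Online update: nudge straight away, once per training pattern. */
for (w = 0; w < inCount; w++)
    weight[w] += learningRate * nodeDelta * inValue[w];
bias += learningRate * nodeDelta;

/* Batch update: accumulate the nudges instead and apply them once
   after the whole training set has been run through. */
for (w = 0; w < inCount; w++)
    accumulatedNudge[w] += learningRate * nodeDelta * inValue[w];

/* ... after the final pattern of the set ... */
for (w = 0; w < inCount; w++)
    weight[w] += accumulatedNudge[w] / patternCount;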

Practical Considerations

When to Stop

The idea with training is that the network is updated so that the discrepancy between the generated output and the desired output is small while maintaining the generality of the solution. This is a balancing act. Clearly you want the network to produce an output that is a recognisable pattern (e.g. if an output is ON then it is greater than 0.8 and if it is OFF then it is less than -0.8). However, if you train the network too much it will eventually get to the stage that it only recognises the training set that you used.

Recognising the situation where you have achieved the desired output levels can be done using a technique such as the Sum of the Squared Error (SSE).
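
A sketch of that check (names invented):

/* Sum of the squared error over one pass of the training set. Training can stop
   (or at least the test set can be run) once sse falls below a chosen threshold. */
float sse = 0.0f;
for (p = 0; p < patternCount; p++)
    for (j = 0; j < outputCount; j++)
    {
        float err = target[p][j] - output[p][j];
        sse += err * err;
    }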

Overtraining is not so easy to detect. In practice, you keep a known set of inputs with their corresponding outputs aside as a test set. When you think that the network is getting close to a general solution you run the test set (without doing any updates) and see if it produces as good a result as the training set. If it does then great! Press Save. If the test set results are significantly worse than the training set results then you might have gone too far.

Once you have a well generalised network you may want to continue training to see if you can improve the result, but if the test set results start to diverge then you should back up to your last known general solution.

For the purposes of this design we need to keep in mind the calculation of the SSE and the ability to run a test set and not update afterwards. The maintenance and use of the test set will be left to the host application or as a manual process.

How do you know if you have the best network?

You can think of the weights in a neural network like a large multi-dimensional surface. A good solution represents a low point on this surface where all of the important weights are at a minimum. 



Diagram 8 - 2D Representation of a solution space


It is possible for your network to get stuck in a local minimum that does not represent a good solution. In this case no amount of training will make it better. There may also be a number of good solutions, one of which is the best.

The only way of finding the best solution is to train the network many times from different starting points. The starting point in training is the set of random numbers that represents the initial weights. Therefore our system must have the ability to "unlearn" everything (hopefully after the user has pressed Save) and start again using a new set of random numbers.

Keeping the Cores Busy

To get the best performance out of the available hardware we should also consider how best to use all the features of the Epiphany architecture.

Clearly the feed forward pass and error calculation (and the weight update in online mode) are going to keep the cores busy for a significant time and I presume that this will be the processing bottleneck. Therefore keeping the cores busy and reducing their waiting time will be the key to optimum performance.

The off-chip network is connected to local memory via the DMA controller. To keep the cores "fed" with data, we should try to arrange the host process to send the next training batch to the core's local memory while it is still working on the current one. This should allow the next batch to commence as soon as it has finished. 

Where we left off...

At the end of the feed forward pass (assuming that the host has been diligent and passed the target output values while the cores were busy) our local memory would look like this:



Diagram 9 - Local memory at the end of the Feed Forward Pass



In this diagram:

  • red indicates values that have been passed to the core by the host (t(u) and t(v) are the target values for z(u) and z(v))
  • blue indicates values that have been calculated by a core (z(u), z(v), y(p) and y(q) have been calculated on Core J while y(1).. y(p-1) have been passed from upstream cores and y(q+1) .. y(N) from downstream cores)
  • purple indicates values that could either be calculated or passed (i.e. the weights)

Training Stage 1: Calculate the Output Error and Delta

Calculating the error and associated delta for an output node is trivial. The host can determine which core will calculate each final output value and send the target values to it.

Training Stage 2: Calculate the Hidden Error and Delta

The hidden node error is a little more difficult. The problem is that in my current model, the outbound weights from each hidden node are distributed across all of the cores. A few possible solutions come to mind:

1. Swap space for speed

In the example, Core J can only calculate the "hidden" error that is associated with Output(zu) and Output(zv) because it only has the links between Hidden(yp), Hidden(yq) and Output(zu), Output(zv). It actually wants to calculate the whole error attributed to Hidden(yp) and Hidden(yq). To do this it would have to have a copy of all the weights between its hidden nodes (yp and yq) and all the output node deltas.

This is possible if each core had a copy of its own outbound weights and we could distribute the output deltas by using the same mechanism we used with the hidden layer values.


Diagram 10 - Space for Speed compromise


Clearly this strategy requires each core to have two sets of Hidden-Output links: the inbound links for its output nodes and the outbound links for its hidden nodes. When training in batch mode the weights don't change from one training set to the next, so the two sets of weights start synchronised and remain so until the update pass.

The additional work to constantly update two sets of weights required for on-line mode suggests that this strategy would only be contemplated for batch mode.

2. Calculate and distribute

A less memory intensive method would be to calculate the weight * delta for all weights and deltas available to the core and to pass those to the core that needs them.

This would mean that the data flowing around would use the fast on-chip write network to its fullest extent. The value calculated by each core would only have to be sent to the core that needs it, so the path would be determined by the Epiphany's "x then y" rule. The largest number of hops would be 6 (on an Epiphany-16), which would be, for example, between Core 1 and Core 13 at opposite corners as described in Diagram 7.

Once the hidden deltas have been calculated by the core that owns the hidden node it is at least in the right place. That core can either accumulate it for a later batch update or it can be used to update the node's inbound weights straight away in online mode.

3. Let the ARM cores look after it

Clearly neither of these is a great solution. There is another possibility, however. The host CPUs (i.e. the ARM cores) also have a full set of data. Up until now we have only required them to keep sending data to the cores and not do any computation. There are two of them and both have considerable resources available in terms of memory and number crunching power.

If the output value for each output core or the output delta is passed back to the host, then it could work out the remaining deltas while it is waiting for the next training pattern to be processed. The decision about what to pass back to the ARM cores would be based on how long the host takes to do its part of the job. The host's main task is still to keep the work up to the Epiphany.

Again, batch mode would be fine with this. The ARM cores would accumulate the deltas and, when the training set was done, send them to each core which could then update the weights. This would introduce a short pause for the Epiphany cores while the final output and hidden deltas and total batch errors are calculated and then sent to each core for the update.

Online mode... again... would have a problem. If the weights are to be updated after every training example then the Epiphany cores would be waiting around for the ARM cores to calculate and send the updates. This does not seem to be a good solution.

Training Stage 3: Update Weights and Biases

Once the output and hidden node errors have been calculated, each bias and weight needs to be nudged towards a better solution. Given that each core has a copy of each incoming weight and (after we figure out the best way of determining the hidden layer error) each will have the error of its own nodes, the update of the weights is straightforward. Each weight is nudged by delta * learningRate * the value arriving over that weight, and each bias by delta * learningRate.

In online mode this would happen straight away and after the update the delta could be discarded.

In batch mode the error would have to be accumulated somewhere, either by each core or by the host CPU and when the training set was complete the final "total batch error" could be calculated. The accumulated errors would then be used to update the weights.

Up Next...

While we wait for our Parallellas to arrive I thought I'd pull apart my Windows version and get a half decent interface together. I'll start on some documentation for that but it won't mean much until the guts have been written.

Also, I'll look around for some decent, well known test data. I want to be able to run a couple of big problems through it and test out the scalability of the solution. If anyone knows of some good public test data please send me a link.

[1] Reed, R.D. and Marks, R. J. "Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks", MIT Press 1999.
[2] Hinton, G.E. Connectionist learning procedures. Artificial Intelligence 40(1): 143-150. 1989.

Monday, 2 September 2013

Parallel Neural Networks - Carving up the task



Intro

Over a couple of blogs I'll break down the process of getting the best performance out of a parallel architecture for processing neural networks. 

A few points before I begin:

First let me say that I am only going to look at Feed Forward - Back Propagation networks in the first instance. I might get into other types of neural networks with other training methods at a later time but for now I'm keeping things simple. This is also not a rigorous academic work. I'm only going to cover neural network theory in as much detail as I need in order to illustrate my program design.

Second, I must point out that this is an experiment. The method I develop here might not be the most efficient and if you believe you have a better way, please let me know. If I find a better way I'll update the post.

Third, I'm focussing on the Adapteva Parallella architecture. Whatever I write might work on other platforms but no promises from me. If you think it will work elsewhere please give it a try and let me know.


Carving Up the Task

To get the best performance it is important to carve up the task so that the available hardware is as busy as it can be. Let's begin by looking at the task.

Feed Forward - Back Propagation Networks

This is the basic shape of a feed forward - back propagation network:


Diagram 1 - Basic Structure

An input vector |x| (x(1) to x(n)) is fed into the network and produces an output vector |z| (z(1) to z(o)) via an intermediate or "hidden" vector |y| (y(1) to y(m)). What happens to transform inputs to outputs will be explained in the next paragraph. In the diagram, each circle is referred to as a Node and the nodes for each vector |x|, |y| and |z| are collectively known as a layer. Each node in the input layer is connected by a link (a.k.a. an arc) to every node in the hidden layer and similarly, each hidden node is linked to every output node.

The transition between input nodes and hidden nodes is shown in the following diagram:



Diagram 2 - Calculation of a single node (y(p))

In this example a single node in the hidden layer is calculated (y(p)). To calculate y(p), each input value is multiplied by a weight (w(1,p) to w(n,p) in this diagram). The results are summed, added to the bias value b(p) and passed through the Activation Function. The activation function is usually a non-linear function such as sigmoid or tanh.
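
In code the calculation of y(p) is just the following (a sketch that follows the diagram, with tanh standing in for the activation function and 0-based arrays standing in for the 1-based labels):

/* y(p) = activation( b(p) + sum over i of x(i) * w(i,p) ) */
float sum = b_p;                  /* start with the bias value b(p)       */
for (i = 0; i < n; i++)
    sum += x[i] * w[i];           /* w[] holds w(1,p) .. w(n,p)           */
y_p = tanhf(sum);                 /* pass the sum through the activation  */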

This is the Feed Forward component of the behaviour. Back Propagation refers to the process by which weights and threshold values are adjusted in order for the network to "recognise" an input pattern and produce the output pattern. To use back propagation you need a large number of matched input and output vectors. The input vectors are presented to the network and the resultant outputs produced by the feed forward pass are compared to the desired outputs in the matched set. The error for each output node is calculated and the weights between the hidden nodes and output nodes, as well as the threshold values, are then adjusted to reduce the magnitude of the error. Once the weights of the hidden to output layer links have been adjusted, the weights of the input to hidden layers are adjusted along with the threshold values on the hidden nodes. Thus the error is propagated backwards through the network. This is termed "training the network".

In the feed forward step a weight multiplication must happen for each link: between every input value and each hidden node, and between every hidden node and each output node. This is the bulk of the computation that must be optimised. The summation and Activation Function also represent significant work.

In the back propagation step each weight and threshold value must be adjusted in proportion to its contribution to the error. The adjustment is also "dampened" by a factor known as the learning rate (0..1), thus the weight adjustment is the multiplication of three values. Therefore, training (a feed forward pass followed by a back propagation pass) represents significantly more work than a feed forward pass alone.

Organising the Feed Forward Pass

Splitting apart Diagram 1 the complete task now looks like this:


Diagram 3 - Diagram 1 rearranged

I have rearranged Diagram 1 to highlight the regularity of the computation. To calculate a derived value (i.e. hidden nodes and output nodes) the same pattern must be repeated. To again redraw the task in a more programmatic way:


Diagram 4 - Node values and weights as arrays

Diagram 4 now has all the computation steps to calculate a derived value (hidden nodes in this case). The input vector and weight vectors are shown as array elements that need to be multiplied together to form the input to the summation and subsequently the activation function. 

This is now the sort of problem that we can break down into work items for Parallella cores.

Let's Get Practical

One Node per Core

Probably the most obvious way to split up the task is to assign one node to each core. This has the advantage of being simple to understand and therefore program. The big disadvantage is that it is unlikely that optimal performance will be achieved because it is unlikely that the number of nodes is evenly divisible by the number of cores.

Round Robin

It would also be possible to assign input * weight calculations from each link to each core in a Round Robin manner. This would ensure that all cores got the same amount of work but collecting all of the weighted inputs together for the summation step would be a communication headache.

Partitions

The best way I have come up with to split up the task is to partition it based on the number of cores. Assume that there are six input nodes and 16 hidden nodes. To split the task evenly between cores, core J, somewhere in the middle of the array, would get the following task:



Diagram 5 - Core Assignment Example 1

Assigning the weight calculations evenly means that each core gets 10 to do. With 6 input values that means that there will be an overlap of 4 on each core. 

If a core completes a whole set of link multiplications (for yq in this example) then it has all the information it needs to do the summation and activation function for that node.

If it starts but does not finish the job then it passes the partial sum onto the next core (core K in this example) which then finishes the sum and applies the activation function (for yr).

If it completes the job but does not start it then it must wait for the previous core to pass the partial sum before completing the sum and applying the activation function (for yp in this example).

The only other case to consider is when a core does not get to start or complete a whole derived node's worth of input * weight calculations. 


Diagram 6 - Core Assignment Example 2


In this case, the core must wait for the incoming partial sum before passing it on to the next core. This would be a common occurrence on the Epiphany-64. 
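
A sketch of how a core could work out which situation it is in, assuming each derived node has nodeWeights incoming links and the core has been assigned the weight range start (inclusive) to end (exclusive). These names are invented for the illustration:

/* Which derived nodes does this core touch, and are the ones at the edges
   of its weight range whole or partial? */
int firstNode       = start / nodeWeights;
int lastNode        = (end - 1) / nodeWeights;
int waitsForPartial = (start % nodeWeights) != 0;   /* first node was started by the previous core */
int passesPartial   = (end % nodeWeights) != 0;     /* last node is finished by the next core      */

if (waitsForPartial && passesPartial && (firstNode == lastNode))
{
    /* The Diagram 6 case: this core neither starts nor finishes a node, so it
       just adds its products to the incoming partial sum and passes it on. */
}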

Moving from the Hidden Layer to the Output Layer

The partitioning strategy assumes that all inputs to the calculation (x1..6 in the example) are known at the time the process starts. This can be arranged initially when calculating the hidden nodes: all inputs and weights needed by each core can be sent to it during the set up phase.

At the end of the calculation of the hidden node values however, each core only has those hidden values that it has calculated itself. Therefore an efficient method must be devised to transmit all calculated values to all cores.

The most obvious method would be to write all hidden node values back to global memory and then each core can read the ones that it does not have (or read all of them if it is quicker).

My approach would be to try something different. Why not use the on-chip write network to transmit the hidden node values to the left and right as they are calculated? In this way, the transmission can be started before all of the values are known and each core can start the second phase calculation when it has a full set of values.

It might be best to initially pass the hidden node values in one direction (e.g. to a core with a higher index, see the next section) and then, when the core has received all of its upstream values, to pass them in the other direction. This would mean that the on-chip write network would be writing in one direction at a time and therefore would hopefully avoid collisions.

I'm not sure which strategy will work best. The on-chip write network would be pretty busy for a period. However, if each core can determine when it has a full set of values, this strategy removes the need for coordination to make sure that all values have been posted to global memory before the cores are updated. Also, the off-chip write network is 8 times slower and would get pretty congested when all the cores were updating.

Epiphany Internal Network Configuration

The partitioning strategy as outlined above has a linear relationship between the cores that reflects the linear organisation of the data. I propose that the most efficient flow of data between cores is as shown:

Diagram 7 - Data Pathway on a Epiphany-16 

Execution Flow

I think that it is worth having a think about how the execution will happen and how the inter-core communication will fit into it.

In Example 2 above there is only one job to do so starting at the lowest index and working through to the highest would be the most obvious option.

Example 1 however would be best done in reverse, i.e. the partial sum that is passed to the next core should be calculated first and passed so that it is ready when that core needs it. In this case the execution flow would be as follows:

0. Pass the input values and partitioned weight vectors (w(x,y) and w(y,z)) to each core
1. Execute the input * weight calculations and any partial sum that must be passed to the next core
2. Pass the partial sum
3. Execute the input * weight / summation / activation function for the nodes that are wholly the task of the core
4. Pass the derived node value to the neighbouring cores to update their local memory in preparation for the next layer
5. Execute the input * weight calculation for any node where a partial sum is expected from an upstream core
6. Wait for the partial sum (it should already be there from step 2)
7. Add the incoming partial sum to the sum of the locally calculated values
8. Calculate the activation function for the node
9. Pass the derived node value to the neighbouring cores
10. Repeat steps 1-8 for the second layer, ignoring step 4 (no need to update neighbouring cores)
11. Pass the final output values back to the host.


Next Up... Training

Now that I've covered the feed forward process, I'll spend some time thinking about training. Again, it is the transition from the output layer back to the hidden layer that seems to be the tricky bit. 

Please post any comments if you have spotted any problems with my process or if you have any other ideas.