
Friday, 27 March 2015

Intercore Communication and Profiling (sort of...)

In this post I will describe three ways of passing data between cores and provide some empirical comparisons using a rudimentary method of profiling.


In my last post I described a method of passing data from one core to the others using a broadcast strategy. Each core calculated the base address of every other core and wrote to a data structure at a known address. At the time, I believed that this could be improved and made a little more programmer friendly. To test out some alternative data passing methods, I designed a little experiment and used a rudimentary profiling strategy. I did find a significantly better alternative, but not where I expected.

Writing this post I'm using Brown Deer OpenCL (version 1.6) and the original Ubuntu 14.04 image.


To get my sample code execute:

git clone -b blog10files https://github.com/nickoppen/passing.git


Profiling in OpenCL


OpenCL provides the cl_event structure: if the execution queue is created with the CL_QUEUE_PROFILING_ENABLE flag set, an event can be attached to each command and timing information retrieved later using the clGetEventProfilingInfo call. This seems straightforward, but unfortunately it is not currently supported on the Parallella (according to the Parallella Quick Start Guide).

This leaves us with the timing support provided by C++. The <ctime> header provides time() and clock(). The time() call returns the number of seconds since 1 January 1970, which is far too coarse to measure the execution of a small kernel. The clock() call returns the number of processor clock ticks, which gives much finer intervals and, as long as all of the profiling is done on the same machine, is accurate enough to get a feel for comparative execution times.

Surrounding the forka call with time stamps seems simple enough, but remember that the command queue is asynchronous, so any calls prior to the forka, and the forka itself, must be made with the CL_EVENT_WAIT flag so that they have completed by the time each time stamp is taken.

#include <ctime>

clock_t tstart, tend;

/* make sure the input data has arrived on the device before starting the clock */
clmsync(stdacc, 0, debug, CL_MEM_DEVICE|CL_EVENT_WAIT);

tstart = clock();
/* CL_EVENT_WAIT makes the call block until the kernel has finished */
clforka(stdacc, 0, krn, &ndr, CL_EVENT_WAIT, n, 1, debug);
tend = clock();
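To make the numbers easier to read, the raw tick counts can be converted to seconds with CLOCKS_PER_SEC from the same header. Continuing the snippet above (printf needs <cstdio>):

/* CLOCKS_PER_SEC converts clock() ticks into seconds */
printf("forka: %ld ticks (%.6f s)\n",
       (long)(tend - tstart),
       (double)(tend - tstart) / CLOCKS_PER_SEC);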

The main problem with this approach is that it measures everything that the forka does. Have a look at Adam Taylor's host code for a simple "Hello, World!" application: there is a lot of setting-up code, and with stdcl much of it is buried inside the forka call. This is a little clumsy but will have to do for now.


Intercore Value Passing


Core ID Abstraction


In my last post I used the little gem (from djm):

#define LOCAL_MEM_ADDRESS_BASE(gid) (((32 + ((gid) / 4)) << 26) | ((8 + ((gid) % 4)) << 20))

to calculate the base address of the core and then used:

*(float *)(LOCAL_MEM_ADDRESS_BASE(gid_next) + ((unsigned int) derived) + (index * sizeof(float))) = value_being_passed;

Now, I love bit shifting as much as the next programmer, but I shouldn't really have to work out what every base address is. If I know my global id ( gid = get_global_id(0) ) I can figure out where the executing core is in the mesh and its relationship with every other core, because the base addresses don't change.

My second love is hexadecimal memory locations, but they really only increase your nerd status when your wife or girlfriend looks over your shoulder and wonders what on earth you are doing late into the night. Seriously though, does 0x8080 really mean anything to you? And what is its relationship with 0x84b0? How many hops are there between the two?

The eSDK has a way of partitioning the cores based on their location in the grid. The stdcl library does not support this but it is still a good idea. Instead of 0x8080 we could say core00 and then it is obvious that there are 5 hops to core31 (formerly known as 0x84b0).

That suggests to me that the first snippet above (although pure poetry) can be abstracted away into a nice, convenient and, most importantly, efficient include file that looks like this (a generated version is sketched after the list):

#define core00 0x80800000
#define core10 0x80900000
#define core20 0x80a00000

...
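If you would rather generate the table than type it, the same macro can fill an array indexed by global id. A minimal sketch (the array name coreID is mine; it mirrors the array of core IDs used by the broadcast kernel later on):

#define LOCAL_MEM_ADDRESS_BASE(gid) (((32 + ((gid) / 4)) << 26) | ((8 + ((gid) % 4)) << 20))
#define CORECOUNT 16

unsigned int coreID[CORECOUNT];   /* base address of every core, indexed by global id */

void initCoreIDs(void)
{
    for (int gid = 0; gid < CORECOUNT; gid++)
        coreID[gid] = LOCAL_MEM_ADDRESS_BASE(gid);   /* gid 0 -> 0x80800000 (core00), gid 1 -> 0x80900000 (core10), ... */
}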


Topology Abstraction


The Epiphany is a rectangular mesh with each core connected to its nearest neighbours. This is great, but what if your application is better suited to a ring structure, where each core only has to communicate with the two cores on either side of it in the ring? For such an application it would be convenient to refer to the next core as NEXT and the previous core as PREV. In this case, the ring topology would look like this:







If data only needed to flow around in one direction, then you would only need to refer to NEXT.

Similarly, you could define a row topology:




Or, if you really did need two dimensional processing, a mesh topology where you only had to refer to NORTH, SOUTH, EAST and WEST would be really handy:




In the row and mesh topologies, the ends (e.g. RIGHT of core00) get assigned 0x0.

I have implemented these three topologies and used the ring in my sample code. The relevant code is:

unsigned int NEXT, PREV, ringIndex, gidOrder[CORECOUNT];
...
initRing(&NEXT, &PREV, &ringIndex, gidOrder);


The initRing call initialises all of these variables (a sketch of one possible implementation follows the list):
  • NEXT: the next core base address in the ring
  • PREV: the previous core base address in the ring
  • ringIndex: the position of the executing core within the ring (core00 is assigned 0 and core01 is 15)
  • gidOrder: an array with the global ids in the order in which they come in the ring (gidOrder[ringIndex] == get_global_id(0))
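The real initRing is in the repository. Purely as an illustration, a ring can be built from a fixed Hamiltonian cycle through the 4x4 mesh, something like the sketch below (the particular cycle and the function name are my assumptions, not necessarily what initRing actually does; gid would come from get_global_id(0)):

#define LOCAL_MEM_ADDRESS_BASE(gid) (((32 + ((gid) / 4)) << 26) | ((8 + ((gid) % 4)) << 20))
#define CORECOUNT 16

/* one possible ring: a Hamiltonian cycle through the 4x4 mesh, listed as global ids;
   the position in this array is the ring index, the entry is the gid at that position */
static const unsigned int ringOrder[CORECOUNT] =
    { 0, 1, 2, 3, 7, 11, 15, 14, 13, 12, 8, 9, 10, 6, 5, 4 };

void initRingSketch(unsigned int gid, unsigned int *NEXT, unsigned int *PREV,
                    unsigned int *ringIndex, unsigned int gidOrder[CORECOUNT])
{
    unsigned int i;
    for (i = 0; i < CORECOUNT; i++)
    {
        gidOrder[i] = ringOrder[i];        /* gids in ring order */
        if (ringOrder[i] == gid)
            *ringIndex = i;                /* where this core sits in the ring */
    }
    *NEXT = LOCAL_MEM_ADDRESS_BASE(gidOrder[(*ringIndex + 1) % CORECOUNT]);
    *PREV = LOCAL_MEM_ADDRESS_BASE(gidOrder[(*ringIndex + CORECOUNT - 1) % CORECOUNT]);
}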


Such definitions abstract away most of the need to refer to coreXX and mean that, once set up, no core has to execute an if... else... to figure out where it is in the mesh. The only exception is with the row and mesh topologies, where the code has to check that it is not on an edge with no core further down the chain.

Abstracting the Assignment


One final abstraction is to clean up the assignment. To this end I've defined:

#define NEIGHBOUR_LOC(CORE, STRUCTURE, INDEX, SIZEOFTYPE) (CORE + ((unsigned int)STRUCTURE) + (INDEX * SIZEOFTYPE))
#define NEIGHBOR_LOC(CORE, STRUCTURE, INDEX, SIZEOFTYPE) (CORE + ((unsigned int)STRUCTURE) + (INDEX * SIZEOFTYPE))

(The only difference between the two is the spelling of neighbour/neighbor.) 

Which is used (for an int assignment):

*(int*)NEIGHBOUR_LOC(NEXT, vLocal, i, sizeof(int)) = vLocal[i];

I still find this a bit clumsy but better than the original.
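Putting the macro and the ring together, pushing a core's n local results to the next core in the ring might look like this (a sketch only; vLocal and n stand in for whatever the kernel actually passes):

/* write this core's n local results into the matching slots of the next core's copy of vLocal;
   NEXT comes from initRing() and the index range is whatever slice this core owns */
for (int i = 0; i < n; i++)
    *(int *)NEIGHBOUR_LOC(NEXT, vLocal, i, sizeof(int)) = vLocal[i];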

The Passing Experiment


The whole point of this exercise was to see if I could find a better alternative to the core-to-core broadcast method I implemented in the feed-forward pass of my neural network.

(For those who don't want to wade through my first post, the feed-forward pass is a series of matrix multiplication steps where the task is split between all cores. Each step produces another matrix, which then becomes an input to the next step. Every core needs all of the values produced in the previous step, so every core must pass its results to every other core.)

The Original Method - Broadcast


The original method iterated through every remote core and wrote directly into its memory. The code was simple enough, but my "back of the envelope" calculation estimated that if every core sent just one value, there would be 640 intercore hops in the whole process.

I did improve on the original a little: I only calculated each remote core's base address once, rather than every time I sent a value (which is obviously a waste of time).

Alternative Zero - Broadcast No Wait


I was also pretty sure that I didn't need so many calls to barrier() given that each core "owned" a chunk of the storage array on all cores and it alone wrote to it. So my first change was to implement the same broadcast strategy but with no barriers.

I wanted to write the best kernel I could, so I generated an array of all core IDs (core00... core33) and iterated through the array, sending all values to each core except when the destination core ID was the same as the local core ID (no point in sending the values to itself).
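In outline, the no-wait broadcast does something like the sketch below (my own naming: coreID is the table of base addresses built earlier, vLocal holds this core's results and vRemote is the local copy of the shared storage array, whose address is only used to compute the remote location; 32-bit Epiphany addressing assumed; the real kernel is in the repository):

#define LOCAL_MEM_ADDRESS_BASE(gid) (((32 + ((gid) / 4)) << 26) | ((8 + ((gid) % 4)) << 20))
#define NEIGHBOUR_LOC(CORE, STRUCTURE, INDEX, SIZEOFTYPE) (CORE + ((unsigned int)STRUCTURE) + (INDEX * SIZEOFTYPE))
#define CORECOUNT 16

/* broadcast this core's n results to every other core, with no barrier() calls;
   each core owns its own slice of vRemote on every core, so nobody else writes to it */
void broadcastNoWait(unsigned int gid, const unsigned int coreID[CORECOUNT],
                     const float *vLocal, float *vRemote, int n)
{
    unsigned int me = LOCAL_MEM_ADDRESS_BASE(gid);
    for (int c = 0; c < CORECOUNT; c++)
    {
        if (coreID[c] == me)
            continue;                              /* no point in sending values to itself */
        for (int i = 0; i < n; i++)                /* write into this core's slice of the remote array */
            *(float *)NEIGHBOUR_LOC(coreID[c], vRemote, (gid * n) + i, sizeof(float)) = vLocal[i];
    }
}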

Then I came up with two alternative strategies to test using my ring topology described above.

Alternative One - Unicast


Unicast uses the ring topology and passes its local values to the next core. Then it passes the values it has just received from the previous core in the ring on to the next core, and so on, until all values have flowed around the ring.

This would minimise the number of clashes but it does not use very much of the available bandwidth. Hence:


Alternative Two - Multicast


I read somewhere that the Epiphany has two channels on the cMesh (the network used for intercore write messages). Multicast (which should really be called Bicast) again uses the ring topology but passes values in two directions, thereby using both channels. Like Unicast, it first sends its own values (but in both directions) and then sends the values it has just received from its neighbours: values received from PREV are passed on to NEXT and values received from NEXT are passed on to PREV.


The Results


To see some sort of sensitivity I called each method 16 times, incrementing an argument (n) from 1 to 16 with each call. Every kernel passed n values to its neighbours. I expected to see some difference in execution time based on each kernel's algorithm, and some sort of trend line as n increased. However, my initial experiment did not show any difference between the methods or with the volume of data passed at all. The chart looked like this:





The Y-axis is the number of clock ticks per call and the X-axis has the passing method (Broadcast, Broadcast No Wait, Multicast and Unicast), in ascending order of the number of values passed.

This was a bit disappointing at first. Taking into account a bit of random scatter, all the methods looked the same and there was no discernible trend line based on n. The average for all methods was just over 119,000 ticks. I thought I should take a larger sample.

At first, I surrounded everything in a loop and ran it 100 times. This made hardly any difference at all. The average nudged up a tiny bit but the overall picture looked the same.

As I mentioned above, the timing method I used measured the whole of the forka call. Because there was so little change between 1 and 100 iterations, I can only assume that the 119,000-odd clock ticks were all overhead!

Clearly, to see any sort of difference I'd have to dramatically increase the workload. So... 100,000 iterations...




Finally, a data set that tells a story! The averages (which include the 119,000 tick loading time) were:


  • Broadcast: 2,545,798 ticks
  • Broadcast No Wait: 302,535 ticks
  • Multicast: 1,521,429 ticks
  • Unicast: 2,623,446 ticks
There was also a roughly steady increase with the amount of data being sent.


But hang on... what story does it tell?


I was completely amazed by this result. Getting rid of the calls to barrier() decreased the execution time by almost 90%! Sure the algorithm was a little better but not that much!

Similarly, Multicast, with almost half the calls to barrier(), ran in almost half the time.

Here I was thinking that the number of hops was the dominant time consumer!

Conclusions


From this experiment, I propose the following conclusions:


0. Don't Wait... Minimise the use of barrier()


Coordinating cores using calls to barrier() is expensive.

While I thought my fancy topologies would speed things up by reducing the number of hops, the algorithm needed the cores to coordinate. In a "pass-it-along" scenario, each core has to make sure that the cores around it have successfully delivered their value(s) before it can pass it (or them) along.

Broadcast, using point-to-point communication, didn't need coordination, so the extra transmission cost was insignificant.

1. Give your cores a lot of work to do. 


The thing that slows parallel architectures down in general, and distributed memory architectures especially, is communication overhead. This is usually interpreted as "the cores need to pass information between each other and wait for results". However, when working with a remote device, communication overhead also includes the time the main processor needs to send the accelerator the kernel and the data to work on, as well as the intercore communication.

So, when designing your system, cut off as big a chunk of work as you can and get the little cores to do as much as possible - they are quite powerful... they can handle it.


Final thought...


While my topologies didn't deliver any significant gain in this case, I think that they would still be useful if the topology suited the algorithm. If I come across one, I'll write another post.


Up Next


I'm going to upgrade my system to the new image. I believe that the reboot is more reliable and you don't need to run your programs as root. Not a big deal, but I still forget to log in using su every now and then. I'll also revisit my old posts and update them if need be.

Then I'll keep going on my neural network simulator... Stay tuned.

Friday, 20 December 2013

Sorting of Spatial Data for Meshed Multi-Core Processors


Many algorithms that operate on two dimensional data sets are well suited to multi-core processors such as the Parallella. Raw two dimensional data generally arrives as an unsorted "point cloud" in which neither the x nor the y values are ordered. To gain the most efficient use of the processor, the points must be sorted so that each core can work on a subset of the space while minimising the need for inter-core communication. It therefore makes sense to start the process with an efficient two dimensional sort that leaves the data evenly distributed amongst the cores in clusters of "near neighbours".
The target of this and subsequent blogs will be a design for Delaunay Triangulation. The target architecture is the Parallella board. I believe that the Epiphany III and Epiphany IV chips on the Parallella are well suited to spatial processing given that their cores are organised in a mesh, i.e. each core is connected to its nearest neighbours to the north, south, east and west via multiple high speed buses. Thus, if close data points are clustered on each core and neighbouring clusters are on neighbouring cores, the distance to be travelled when the clusters are finally matched up will be minimised.

Goals of the Algorithm

The goals of the sorting algorithm are:
  • The x value of each point in a cluster is less than those in the clusters on the cores to the east and greater than those on the cores to the west.
  • The y value of each point in a cluster is less than those in the clusters on the cores to the north and greater than those in the clusters on the cores to the south.
  • The points are evenly distributed between cores.
In addition, we should take care that no unnecessary steps are introduced so that we end up with the most efficient process overall.

Distributing the Points

Given that the point cloud is assumed to be random and we want the same number of points per core, a round robin distribution seems the best way to start. This can be done as the data is being read in, which should give the cores enough time to complete at least some of the initial sort in parallel with the data load phase.
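As a sketch, the round robin assignment is just a modulo as the points stream in (Point and coreOf are names of mine, not from any real loader):

typedef struct { float x, y; } Point;

/* deal the unsorted cloud out to cores like a deck of cards;
   coreOf[i] records which core point i is sent to */
void distributeRoundRobin(int nPoints, int nCores, int *coreOf)
{
    for (int i = 0; i < nPoints; i++)
        coreOf[i] = i % nCores;
}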


Initial Sort

Let's not reinvent the wheel here. I think that a quick sort of the x values into one index and the y values into another index is the way to go here.
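As a sketch of what the per-core initial sort could look like, using the standard C library qsort and two index arrays (the Point layout is an assumption of mine):

#include <stdlib.h>

typedef struct { float x, y; } Point;

static const Point *sortBase;            /* points being indexed, set before each qsort */

static int cmpByX(const void *a, const void *b)
{
    float d = sortBase[*(const int *)a].x - sortBase[*(const int *)b].x;
    return (d > 0) - (d < 0);
}

static int cmpByY(const void *a, const void *b)
{
    float d = sortBase[*(const int *)a].y - sortBase[*(const int *)b].y;
    return (d > 0) - (d < 0);
}

/* build xIndex and yIndex so that points[xIndex[0..n-1]] is ordered by x
   and points[yIndex[0..n-1]] is ordered by y, without moving the points */
void buildIndexes(const Point *points, int n, int *xIndex, int *yIndex)
{
    for (int i = 0; i < n; i++) { xIndex[i] = i; yIndex[i] = i; }
    sortBase = points;
    qsort(xIndex, n, sizeof(int), cmpByX);
    qsort(yIndex, n, sizeof(int), cmpByY);
}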


Swap Sort

After the initial sort, each core will have a sorted group of points that could be from anywhere in the point cloud. The purpose of the swap sort is to push each point towards the neighbouring core where it is with its nearest neighbours, or at least closer to them. I'm using a push, or write-only, method of communication between cores because the write message is a lot quicker than the read message. The cores must swap points (i.e. receive one point for every point sent) in order to preserve the even distribution of points.
The cores start the process by passing their largest x and y values to the cores to the east and north respectively via the c-mesh (let's call this the MyLargestX message). If the lowest x value of the core to the east is smaller than the incoming largest x value then these two cores must initiate a swap. This can be done in response to the MyLargestX message (let's call the response MySmallestXY). Simultaneously, the cores can be swapping along the y axis with a MyLargestY message. The swap is then completed with a MyLargestXY call from the initial core passing both the x and y values to the point's new "home".
If the MyLargestX message contains a value that is smaller than the smallest value on the receiving core then the receiving core does not initiate a swap. If, at a later time, that core receives a point that is smaller than the value it was sent from the core to the south or west, then the swap can be initiated.
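To make the protocol a little more concrete, the decision on the receiving core could be as simple as comparing the incoming value against its local minimum. A sketch (the data layout follows the index arrays above and is my assumption):

typedef struct { float x, y; } Point;

/* called when a MyLargestX message arrives from the core to the west;
   return 1 if this core should reply with MySmallestXY and start a swap */
int shouldSwapX(float incomingLargestX, const Point *points, const int *xIndex)
{
    float mySmallestX = points[xIndex[0]].x;     /* xIndex is sorted ascending */
    return incomingLargestX > mySmallestX;       /* the two points are on the wrong cores */
}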


End Conditions

The end conditions cannot be determined by any one core on behalf of all cores because each core only knows its own values and the largest values of the cores to the south and west. Therefore, each core must indicate to the controlling application that it has received a MyLargestX or MyLargestY that did not cause it to initiate a swap. This is only a tentative state because a core that is currently "happy" (i.e. has values that are larger than those of the cores to the south and west) may later receive a point from the north or east that it must pass on. Therefore the controlling application must only terminate the process when all cores are "happy".


Potential Problems

Sub-Optimal Swapping Pathways

Because each core can only make decisions based on the data that is currently in its local memory, there may be cases where swaps initiated early in the process are later undone because smaller values arrive from cores further to the north. Right now I can't see how this can be avoided.
Similarly, a swap to the west may be undone at a later stage after some swaps to the north. This could be avoided by swapping based on the x values (east-west) first and, when x is sorted, sorting on y (north-south). This would also free up some memory on each core, given that only one sort index would be needed at a time (or indeed the points could be moved around in memory, removing the need for an index at all).


Skewed Data Sets

This kind of sorting will end up with the point cloud spread out into a square. This is fine if the original data set is also roughly square, or at least regularly distributed over a fairly regular shape.




Diagram 1 – Near neighbours stay near.

This distribution probably will not work so well if the point cloud is not regular. Points that are not true near neighbours may end up on the same core and points that are near neighbours may end up on distant cores.



Diagram 2 – Distant points that are not near neighbours get squashed onto the same core(s)



Diagram 3 – Near neighbours get separated onto distant cores



In this case the user must be aware of the limitations of the sort and the algorithm that uses the sorted data must be robust enough to handle this situation.


Next: Using the sorted data

My next entry will be a description of how to triangulate the sorted points.
Don't hold your breath - the Parallella boards are being shipped and when I get mine, I'll be writing code, not English.


Tuesday, 1 October 2013

Training in Parallel


Training - the hard bit

Feeding forward is only part of the story. Any useful, real-world application with a significant number of inputs will need to be trained automatically using a significant number of example data sets. Reed and Marks[1], the text that I mainly use, quote Hinton[2], who says that a typical network with w weights requires O(w) training patterns and O(w) weight updates for good generalisation. On a serial machine this would require O(w^3) time, which would only be reduced by a factor of w on parallel hardware, giving a training time of O(w^2). Clearly, an efficient implementation of the training algorithm is required.

How training works

Typically, before training commences, the weights on the links and the bias values on the nodes are set to random values. When you feed your input pattern into the untrained network and calculate the output pattern, it will not bear any resemblance to the desired output. The weights and biases have to be adjusted so that the calculated output matches the desired output (within predefined limits). The adjustment of each weight and bias is proportional to the amount of error it contributed to the final result.

For an output node the calculation is straightforward. The error is the desired value minus the calculated value. This error is then multiplied by the first derivative of the activation function, giving the delta for the bias and incoming links. This delta is used to "nudge" the incoming weights and the output node's bias in the direction that will result in a calculated output closer to the desired pattern.

The magnitude of the nudge is determined by a value called the "learning rate". This is a value between 0 and 1 which is used as a multiplier when updating the weights and bias values.

The error for the hidden layer is a little more difficult. Each weight between the hidden and output layers has contributed a little to the error of the output nodes. The error of a hidden node is the sum of these contributions (specifically, each outgoing weight times the corresponding output node delta). The delta for the hidden node is then this summed error multiplied by the first derivative of the node's activation function.
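Restating the last few paragraphs as code, here is the delta arithmetic for a single output node and a single hidden node, assuming a tanh activation (whose first derivative is 1 - output^2); the names are mine:

/* delta for output node k: (desired - calculated) scaled by the slope of the activation */
float outputDelta(float target, float output)
{
    return (target - output) * (1.0f - output * output);   /* tanh'(net) = 1 - tanh(net)^2 */
}

/* delta for hidden node j: sum each outgoing weight times the delta of the output node
   it feeds, then scale by the slope of the hidden node's own activation */
float hiddenDelta(float hiddenOutput, const float *wOut, const float *outDeltas, int nOutputs)
{
    float error = 0.0f;
    for (int k = 0; k < nOutputs; k++)
        error += wOut[k] * outDeltas[k];
    return error * (1.0f - hiddenOutput * hiddenOutput);
}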

Once the error on each hidden and output node has been calculated the incoming weights and node biases must be updated. Reed and Marks describe two variations called Batch Update and Online Update.

Training Variation 1 - Batch Update

In Batch Update, the whole training set is passed through the network and the deltas for every node for every training pattern are calculated. The "total error" for the training set is the average of each node's deltas. Only then are the weights and biases updated, once for the whole set.


Training Variation 2 - Online Update

In Online Update, each training pattern (i.e. one set of inputs and the matching outputs) is run through the network, the deltas are calculated and the weights are updated straight away.

Practical Considerations

When to Stop

The idea with training is that the network is updated so that the discrepancy between the generated output and the desired output is small, while maintaining the generality of the solution. This is a balancing act. Clearly you want the network to produce an output that is a recognisable pattern (e.g. if an output is ON then it is greater than 0.8 and if it is OFF then it is less than -0.8). However, if you train the network too much it will eventually get to the stage where it only recognises the training set that you used.

Recognising the situation where you have achieved the desired output levels can be done using a technique such as the Sum of the Squared Error (SSE).

Recognising over-training is not so easy. In practice, you keep a known set of inputs with their corresponding outputs aside as a test set. When you think that the network is getting close to a general solution, you run the test set (without doing any updates) and see whether its output is as good as the training set's. If it is then great! Press Save. If the test set results are significantly worse than the training set then you might have gone too far.

Once you have a well generalised network you may want to continue training to see if you can improve the result, but if the test set results start to diverge then you should back up to your last known general solution.

For the purposes of this design we need to keep in mind the calculation of the SSE and the ability to run a test set and not update afterwards. The maintenance and use of the test set will be left to the host application or as a manual process.

How do you know if you have the best network?

You can think of the weights in a neural network like a large multi-dimensional surface. A good solution represents a low point on this surface where all of the important weights are at a minimum. 



Diagram 8 - 2D Representation of a solution space


It is possible for your network to get stuck in a local minimum that does not represent a good solution. In that case no amount of training will make it better. There may also be a number of good solutions, one of which is the best.

The only way of finding the best solution is to train the network many times from different starting points. The starting point in training is the set of random numbers that represent the initial weights. Therefore our system must have the ability to "unlearn" everything (hopefully after the user has pressed Save) and start again using a new set of random numbers.

Keeping the Cores Busy

To get the best performance out of the available hardware we should also consider how best to use all the features of the epiphany architecture.

Clearly the feed forward pass and error calculation (and the weight update in online mode) are going to keep the cores busy for a significant time and I presume that this will be the processing bottleneck. Therefore keeping the cores busy and reducing waiting time will be the key to optimum performance.

The off-chip network is connected to local memory via the DMA controller. To keep the cores "fed" with data, we should try to arrange for the host process to send the next training batch to each core's local memory while it is still working on the current one. This should allow the next batch to commence as soon as the current one has finished.

Where we left off...

At the end of the feed forward pass (assuming that the host has been diligent and passed the target output values while the cores were busy) our local memory would look like this:



Diagram 9 - Local memory at the end of the Feed Forward Pass



In this diagram:

  • red indicates values that have been passed to the core by the host (t(u) and t(v) are the target values for z(u) and z(v))
  • blue indicates values that have been calculated by a core (z(u), z(v), y(p) and y(q) have been calculated on Core J while y(1).. y(p-1) have been passed from upstream cores and y(q+1) .. y(N) from downstream cores)
  • purple indicates values that could either be calculated or passed (i.e. the weights)

Training Stage 1: Calculate the Output Error and Delta

Calculating the error and associated delta for an output node is trivial. The host can determine which core will calculate each final output value and send the target values to it.

Training Stage 2: Calculate the Hidden Error and Delta

The hidden node error is a little more difficult. The problem is that in my current model, the outbound weights from each hidden node are distributed across all of the cores. A few possible solutions come to mind:

1. Swap space for speed

In the example, Core J can only calculate the part of the "hidden" error that is associated with Output(zu) and Output(zv), because it only has the links between Hidden(yp), Hidden(yq) and Output(zu), Output(zv). What it actually wants is the whole error attributed to Hidden(yp) and Hidden(yq). To do this it would need a copy of all the weights between its hidden nodes (yp and yq) and every output node, plus all of the output node deltas.

This is possible if each core has a copy of its own outbound weights and we distribute the output deltas using the same mechanism we used for the hidden layer values.


Diagram 10 - Space for Speed compromise


Clearly this strategy requires each core to have two sets of Hidden-Output links: the inbound links for its output nodes and the outbound links for its hidden nodes. When training in batch mode the weights don't change from one training pattern to the next, so the two sets of weights start synchronised and remain so until the update pass.

The additional work of constantly updating two sets of weights in online mode suggests that this strategy should only be contemplated for batch mode.

2. Calculate and distribute

A less memory intensive method would be to calculate weight * delta for all the weights and deltas available to the core and to pass those to the core that needs them.

This would mean that the data flowing around would use the fast on-chip write network to its fullest extent. The value calculated by each core would only have to be sent to the one core that needs it, so the path would be determined by the Epiphany's "x then y" routing rule. The largest number of hops would be 6 (on an Epiphany-16), for example between Core 1 and Core 13 at opposite corners as described in Diagram 7.

Once the hidden delta has been calculated by the core that owns the hidden node, it is at least in the right place. That core can either accumulate it for a later batch update or use it to update the node's inbound weights straight away in online mode.

3. Let the ARM cores look after it

Clearly neither of these is a great solution. There is another possibility, however. The host CPUs (i.e. the ARM cores) also have a full set of data. Up until now we have only required them to keep sending data to the cores and not to do any computation. There are two of them and both have considerable resources available in terms of memory and number crunching power.

If the output value or the output delta for each output node is passed back to the host, then it could work out the remaining deltas while it is waiting for the next training pattern to be processed. The decision about what to pass back to the ARM cores would be based on how long the host takes to do its part of the job; the host's main task is still to keep the work up to the Epiphany.

Again, batch mode would be fine with this. The ARM cores would accumulate the deltas and, when the training set was done, send them to each core, which could then update its weights. This would introduce a short pause for the Epiphany cores while the final output and hidden deltas and the total batch error are calculated and sent to each core for the update.

Online mode... again... would have a problem. If the weights are to be updated after every training example then the Epiphany cores would be waiting around for the ARM cores to calculate and send the updates. This does not seem to be a good solution.

Training Stage 3: Update Weights and Biases

Once the output and hidden node errors have been calculated, each bias and weight needs to be nudged towards a better solution. Given that each core has a copy of each incoming weight and (after we figure out the best way of determining the hidden layer error) each will have the deltas for its own nodes, the update of the weights is straightforward. Each weight is incremented by the value that came in over the link times the node's delta times the learning rate, and each bias by the delta times the learning rate.
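In code, the nudge for one node's incoming weights and bias is just those factors multiplied together and added on. A sketch (names mine; in batch mode delta would be the accumulated batch delta instead of a single pattern's delta):

/* update the weights and bias feeding node j;
   input[i] is the value that arrived over link i during the feed forward pass */
void updateNode(float *w, float *bias, const float *input, int nInputs,
                float delta, float learningRate)
{
    for (int i = 0; i < nInputs; i++)
        w[i] += learningRate * delta * input[i];   /* nudge in proportion to the link's contribution */
    *bias += learningRate * delta;                 /* the bias behaves like a weight on a constant input of 1 */
}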

In online mode this would happen straight away and after the update the delta could be discarded.

In batch mode the errors would have to be accumulated somewhere, either by each core or by the host CPU, and when the training set was complete the final "total batch error" could be calculated. The accumulated errors would then be used to update the weights.

Up Next...

While we wait for our Parallellas to arrive I thought I'd pull apart my Windows version and get a half decent interface together. I'll start on some documentation for that but it won't mean much until the guts have been written.

Also, I'll look around for some decent, well known test data. I want to be able to run a couple of big problems through it and test out the scalability of the solution. If anyone knows of some good public test data please send me a link.

[1] Reed, R.D. and Marks, R. J. "Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks", MIT Press 1999.
[2] Hinton, G.E. Connectionist learning procedures. Artificial Intelligence 40(1): 143-150. 1989.

Monday, 2 September 2013

Parallel Neural Networks - Carving up the task



Intro

Over a couple of blogs I'll break down the process of getting the best performance out of a parallel architecture for processing neural networks. 

A few points before I begin:

First let me say that I am only going to look at Feed Forward - Back Propagation networks in the first instance. I might get into other types of neural networks with other training methods at a later time but for now I'm keeping things simple. This is also not a rigorous academic work. I'm only going to cover neural network theory in as much detail as I need in order to illustrate my program design.

Second, I must point out that this is an experiment. The method I develop here might not be the most efficient and if you believe you have a better way, please let me know. If I find a better way I'll update the post.

Third, I'm focussing on the Adapteva Parallella architecture. Whatever I write might work on other platforms but no promises from me. If you think it will work elsewhere please give it a try and let me know.


Carving Up the Task

To get the best performance it is important to carve up the task so that the available hardware is as busy as it can be. Let's begin by looking at the task.

Feed Forward - Back Propagation Networks

This is the basic shape of a feed forward - back propagation network:


Diagram 1 - Basic Structure

An input vector |x| (x(1) to x(n)) is fed into the network and produces an output vector |z| (z(1) to z(o)) via an intermediate or "hidden" vector |y| (y(1) to y(m)). How the inputs are transformed into outputs is explained in the next paragraph. In the diagram, each circle is referred to as a Node and the nodes for each vector |x|, |y| and |z| are collectively known as a layer. Each node in the input layer is connected by a link (a.k.a. an arc) to every node in the hidden layer and, similarly, each hidden node is linked to every output node.

The transition between input nodes and hidden nodes is shown in the following diagram:



Diagram 2 - Calculation of a single node (y(p))

In this example a single node in the hidden layer, y(p), is calculated. To calculate y(p), each input value is multiplied by a weight (w(1,p) to w(n,p) in this diagram). The results are summed, added to the bias value b(p) and passed through the Activation Function. The activation function is usually a non-linear function such as sigmoid or tanh.
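In code form, the calculation of a single hidden node value looks something like this (a sketch assuming tanh as the activation function; the names are mine):

#include <math.h>

/* calculate hidden node y(p): weighted sum of all inputs, plus the bias,
   passed through the activation function */
float feedForwardNode(const float *x, const float *w, int n, float bias)
{
    float sum = bias;
    for (int i = 0; i < n; i++)
        sum += x[i] * w[i];          /* w[i] is the weight on the link from x(i) to y(p) */
    return tanhf(sum);               /* non-linear activation */
}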

This is the Feed Forward component of the behaviour. Back Propagation refers to the process by which weights and threshold values are adjusted so that the network "recognises" an input pattern and produces the desired output pattern. To use back propagation you need a large number of matched input and output vectors. The input vectors are presented to the network and the resultant outputs produced by the feed forward pass are compared to the desired outputs in the matched set. The error for each output node is calculated and the weights between the hidden nodes and output nodes, as well as the threshold values, are then adjusted to reduce the magnitude of the error. Once the weights of the hidden-to-output links have been adjusted, the weights of the input-to-hidden links are adjusted along with the threshold values on the hidden nodes. Thus the error is propagated backwards through the network. This is termed "training the network".

In the feed forward step a weight multiplication must happen for each link, i.e. between every input value and each hidden node, and between every hidden node and each output node. This is the bulk of the computation that must be optimised. The summation and the Activation Function also represent significant work.

In the back propagation step each weight and threshold value must be adjusted in proportion to its contribution to the error. The adjustment is also "dampened" by a factor known as the learning rate (0..1), so the weight adjustment is the multiplication of three values. Therefore training (a feed forward pass followed by a back propagation pass) represents significantly more work than a feed forward pass alone.

Organising the Feed Forward Pass

Splitting apart Diagram 1 the complete task now looks like this:


Diagram 3 - Diagram 1 rearranged

I have rearranged Diagram 1 to highlight the regularity of the computation. To calculate a derived value (i.e. a hidden node or an output node) the same pattern must be repeated. To redraw the task again in a more programmatic way:


Diagram 4 - Node values and weights as arrays

Diagram 4 now shows all the computation steps needed to calculate a derived value (hidden nodes in this case). The input vector and weight vectors are shown as array elements that need to be multiplied together to form the input to the summation and subsequently the activation function.

This is now the sort of problem that we can break down into work items for Parallella cores.

Let's Get Practical

One Node per Core

Probably the most obvious way to split up the task is to assign one node to each core. This has the advantage of being simple to understand and therefore to program. The big disadvantage is that optimal performance is unlikely to be achieved, because it is unlikely that the number of nodes is evenly divisible by the number of cores.

Round Robin

It would also be possible to assign the input * weight calculations from each link to each core in a Round Robin manner. This would ensure that all cores get the same amount of work, but collecting all of the weighted inputs together for the summation step would be a communication headache.

Partitions

The best way I have come up with to split up the task is to partition the links based on the number of cores. Assume that there are six input nodes and 16 hidden nodes. To split the task evenly between cores, core J, somewhere in the middle of the array, would get the following task:



Diagram 5 - Core Assignment Example 1

Assigning the weight calculations evenly means that each core gets 10 to do. With 6 input values that means that there will be an overlap of 4 on each core. 

If a core completes a whole set of link multiplications (for yq in this example) then it has all the information it needs to do the summation and activation function for that node.

If it starts but does not finish the job then it passes the partial sum onto the next core (core K in this example) which then finishes the sum and applies the activation function (for yr).

If it completes the job but does not start it then it must wait for the previous core to pass the partial sum before completing the sum and applying the activation function (for yp in this example).

The only other case to consider is when a core does not get to start or complete a whole derived node's worth of input * weight calculations. 


Diagram 6 - Core Assignment Example 2


In this case, the core must wait for the incoming partial sum before passing it on to the next core. This would be a common occurrence on the Epiphany-64. 

Moving from the Hidden Layer to the Output Layer

The partitioning strategy assumes that all inputs to the calculation (x1..6 in the example) are known at the time the process starts. This can be arranged when calculating the hidden nodes: all the inputs and weights needed by each core can be sent to it during the set up phase.

At the end of the calculation of the hidden node values however, each core only has those hidden values that it has calculated itself. Therefore an efficient method must be devised to transmit all calculated values to all cores.

The most obvious method would be to write all hidden node values back to global memory and then each core can read the ones that it does not have (or read all of them if it is quicker).

My approach would be to try something different. Why not use the on-chip write network to transmit the hidden node values to the left and right as they are calculated? In this way, the transmission can be started before all of the values are known and each core can start the second phase calculation as soon as it has a full set of values.

It might be best to initially pass the hidden node values in one direction (e.g. to a core with a higher index; see the next section) and then, when the core has received all of its upstream values, to pass in the other direction. This would mean that the on-chip write network would be writing in one direction at a time and so hopefully would avoid collisions.

I'm not sure which strategy will work best. The on-chip write network would be pretty busy for a period. However, if each core can determine when it has a full set of values, this strategy removes the need for coordination to make sure that all values have been posted to global memory before the cores are updated. Also, the off-chip write network is 8 times slower and would get pretty congested while all the cores were updating.

Epiphany Internal Network Configuration

The partitioning strategy as outlined above has a linear relationship between the cores that reflects the linear organisation of the data. I propose that the most efficient flow of data between cores is as shown:

Diagram 7 - Data Pathway on an Epiphany-16

Execution Flow

I think that it is worth having a think about how the execution will happen and how the inter-core communication will fit into it.

In Example 2 above there is only one job to do so starting at the lowest index and working through to the highest would be the most obvious option.

Example 1, however, would be best done in reverse, i.e. the partial sum that is to be passed to the next core should be calculated first and passed on so that it is ready when that core needs it. In this case the execution flow would be as follows:

0. Pass the input values and partitioned weight vectors (w(x,y) and w(y,z)) to each core
1. Execute the input * weight calculations and form any partial sum that must be passed to the next core
2. Pass the partial sum
3. Execute the input * weight / summation / activation function for the nodes that are wholly the task of the core
4. Pass the derived node value to the neighbouring cores to update their local memory in preparation for the next layer
5. Execute the input * weight calculation for any node where a partial sum is expected from an upstream core
6. Wait for the partial sum (it should already be there from step 2)
7. Add the incoming partial sum to the sum of the locally calculated values
8. Calculate the activation function for the node
9. Pass the derived node value to the neighbouring cores
10. Repeat steps 1-8 for the second layer, ignoring step 4 (no need to update neighbouring cores)
11. Pass the final output values back to the host.


Next Up... Training

Now that I've covered the feed forward process, I'll spend some time thinking about training. Again, it is the transition from the output layer back to the hidden layer that seems to be the tricky bit. 

Please post any comments if you have spotted any problems with my process or if you have any other ideas.