Tuesday, 1 October 2013

Training in Parallel

Training - the hard bit

Feeding forward is only part of the story. Any useful, real-world application with a significant number of inputs will need to be trained in an automatic fashion using a significant number of example data sets. Reed and Marks[1], the text that I mainly use, quotes Hinton[2] who says that a typical network with w weights requires requires O(w) training patterns and O(w) weight updates for good generalisation. On a serial machine this would require O(w^3) that would only be reduce by a factor of w on parallel hardware giving a training time of O(w^2). Clearly, an efficient implementation of the training algorithm is required.

How training works

Typically, before training commences, the weights on the links and bias values on the nodes are set to random values. When you feed your input pattern into the untrained network and calculated the output pattern, it will not bear any resemblance to the desired output. The weights and biases will have to be adjusted so that the calculated output matches the desired output (within predefined limits). The adjustment of each weight and bias is proportional to the amount of error it contributed to the final result.

For an output node the calculation is straight forward. The error is the desired value minus the calculated value. This error is then multiplied by the first derivative of the activation function giving the delta for the bias and incoming links. This delta is is used to "nudge" the incoming weights and output node bias in the direction that would result in a calculated output closer to the desired pattern. 

The magnitude of the nudge is determined by a value called the "learning rate". This is a value between 0 and 1 which is a multiplicand when updating the weights and bias values.

The error for the hidden layer is a little more difficult. Each weight between the hidden and output layer has contributed a little to the error of the output nodes. The error of a hidden node is the sum of these contributions (specifically, the weight times the output node delta). Again a delta for the hidden node is calculated summing the error on each weight and the first derivative of the activation function of the node.

Once the error on each hidden and output node has been calculated the incoming weights and node biases must be updated. Reed and Marks describe two variations called Batch Update and Online Update.

Training Variation 1 - Batch Update

In Batch Update the whole training set is passed through the network. The deltas for every node for every training set are calculated. The "total error" for the training set is the average of each node's deltas. Once this has happened the weights and biases are updated once.

Training Variation 2 - Online Update

In Online Update, each training pattern, (i.e. one set of inputs and the matching output) is run through the network, the deltas are calculated and the weights are updated straight away.

Practical Considerations

When to Stop

The idea with training is that the network is updated so that the discrepancy between the generated output and desired output is small while maintaining the generality of the solution. This is a balancing act. Clearly you want the network to produce an output that is a recognisable pattern (e.g. if an output is ON then it is greater that 0.8 and if it is OFF then it is less that -0.8). However, if you train the network too much it will eventually get to the stage that in only recognises the training set that you used. 

Recognising the situation where you have achieved the desired output levels can be done using a technique such as the Sum of the Squared Error (SSE).

Over training is not so easy. In practice, you keep a known set of inputs with their corresponding outputs aside as a test set. When you think that the network is getting close to a general solution you would run the test set (without doing any updates) and see if its output is as good a result as the training set. If it does then great! Press Save. If the test set results are significantly worse than the training test then you might have gone too far.

Once you have a well generalised network you may want continue training to see if you can improve the result but if the test set results start to diverge then you should back up to your last known general solution.

For the purposes of this design we need to keep in mind the calculation of the SSE and the ability to run a test set and not update afterwards. The maintenance and use of the test set will be left to the host application or as a manual process.

How do you know if you have the best network?

You can think of the weights in a neural network like a large multi-dimensional surface. A good solution represents a low point on this surface where all of the important weights are at a minimum. 

Diagram 8 - 2D Representation of a solution space

It is possible that your network to get stuck in a local minimum that does not represent a good solution. In this case no amount of training will make it better. There also may be a number of good solutions, one of which is the best. 

The only way of finding the best solution is to train the network many times from different starting points. The starting point in training is the set of random number that represent the initial weights. Therefore our system must have the ability to "unlearn" everything (hopefully after the user has pressed Save) and start again using a new set of random numbers.

Keeping the Cores Busy

To get the best performance out of the available hardware we should also consider how best to use all the features of the epiphany architecture.

Clearly the feed forward pass and error calculation (and weight update in online mode) are going to keep the core busy for a significant time and I presume that this task will be the processing bottleneck. Therefore keeping the cores busy, reducing the waiting time will be the key to optimum performance.

The off-chip network is connected to local memory via the DMA controller. To keep the cores "fed" with data, we should try to arrange the host process to send the next training batch to the core's local memory while it is still working on the current one. This should allow the next batch to commence as soon as it has finished. 

Where we left off...

At the end of the feed forward pass (assuming that the host has been diligent and passed the target output values while the cores were busy) our local memory would look like this:

Diagram 9 - Local memory at the end of the Feed Forward Pass

In this diagram:

  • red indicates values that have been passed to the core by the host (t(u) and t(v) are the target values for z(u) and z(v))
  • blue indicates values that have been calculated by a core (z(u), z(v), y(p) and y(q) have been calculated on Core J while y(1).. y(p-1) have been passed from upstream cores and y(q+1) .. y(N) from downstream cores)
  • purple indicates values that could either be calculated or passed (i.e. the weights)

Training Stage 1: Calculate the Output Error and Delta

Calculating the error and associated delta for an output node is trivial. The host can determine which core will calculate each final output value and send the target values to it.

Training Stage 2: Calculate the Hidden Error and Delta

The hidden node error is a little more difficult. The problem is that in my current model, the outbound weights from each hidden node are distributed across all of the cores. A few possible solutions come to mind:

1. Swap space for speed

In the example, Core J can only calculate the "hidden" error that is associated with Output(zu) and Output(zv) because it only has the links between Hidden(yp), Hidden(yq) and Output(zu) and Output(zv). It actually wants to calculate the whole error attributed to Hidden(yp) and Hidden(uq). To do this it would have to have a copy of all the weights between it's hidden nodes (yp and yq) and all the output node deltas. 

This is possible if each core had a copy of its own outbound weights and we could distribute the output deltas by using the same mechanism we used with the hidden layer values.

Diagram 10 - Space for Speed compromise

Clearly this strategy requires each core to have two sets of Hidden-Output links, the inbound links for its output nodes and the outbound links for it's hidden nodes. When training in batch mode the weights don't change from one training set to the next so the two sets of weights start synchronised and remain so until the update pass. 

The additional work to constantly update two sets of weights required for on-line mode suggests that this strategy would only be contemplated for batch mode.

2. Calculate and distribute

A less memory intensive method would be to calculate the weight * delta for all weights and deltas available to the core and to pass the those to the core that needs them. 

This would mean that data flowing around would use the fast the on-chip write network to its fullest extent. The value calculated by each core would only have to be sent to the core that needs it therefore the path would be determined by the epiphany's "x then y" rule. The largest number of hops would be 6 (on an epiphany-16) which would be for example between Core 1 and Core 13 at opposite corners as described in Diagram 7.

Once the hidden deltas have been calculated by the core that owns the hidden node it is at least in the right place. That core can either accumulate it for a later batch update or it can be used to update the node's inbound weights straight away in online mode.

3. Let the ARM cores look after it

Clearly neither of these a great solutions. There is another possibility however. The host CPUs (i.e. the ARM cores) also have a full set of data. Up until now we have only required them to keep the sending the data to the cores and not do any computation. There are two of them and both have considerable resources available in terms of memory and number crunching power.

If the output value for each output core or the output delta is passed back to the host, then it could work out remaining deltas while it is waiting for the next training pattern to be processed. The decision what to pass back to the ARM core would be based on how long the host takes to do its part of the job. The host's main task is still to keep the work up to the epiphany.

Again, batch mode would be fine with this. The ARM cores would accumulate the deltas and when the training set was done, send the deltas to each core which could then update the weights. This would introduce a short pause for the epiphany cores while the final output and hidden deltas and total batch errors are calculated and the sent to each core for the update.

Online mode... again... would have a problem. If the weights are to be updated every training example then the epiphany cores would be waiting around for the ARM cores to calculate and send the updates. This does not seem to be a good solution.

Training Stage 3: Update Weights and Biases

Once the output and hidden node errors have been calculated then each bias and weight needs to be nudged towards a better solution. Given that each core has a copy of each incoming weight and (after we figure out the best way of determining the hidden layer error) each will have the error of its own nodes then the update of the weights is straight forward. Each weight is assigned weight (i.e. itself) * delta * learningRate.

In online mode this would happen straight away and after the update the delta could be discarded.

In batch mode the error would have to be accumulated somewhere, either by each core or by the host CPU and when the training set was complete the final "total batch error" could be calculated. The accumulated errors would then be used to update the weights.

Up Next...

While we wait for our Parallellas to arrive I though I'd pull apart my Windows version and get a half decent interface together. I'll start on some documentation for that but it won't mean much until the guts have been written.

Also, I'll look around for some decent well known test data. I want to be about to get a couple of big problems to run through it and test out the scalability of the solution. If anyone knows of some good public test data please send me a link.

[1] Reed, R.D. and Marks, R. J. "Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks", MIT Press 1999.
[2] Hinton, G.E. Connectionist learning procedures. Artificial Intelligence 40(1): 143-150. 1989.