Parallella Ideas, Designs and Executions: Intercore Communication and Profiling (sort of...)

In this post I will describe three ways of passing data between cores and provide some empirical comparisons using a rudimentary method of profiling.

In my last post I described a method of passing data from one core to the others using a broadcast strategy. Each core calculated the base address of each other core and wrote to a data structure at a known address. At the time I wrote my last post, I believed that this is could be improved and made a little more programmer friendly. To test out some alternative data passing methods, I designed a little experiment and used a rudimentary profiling strategy. I did find a significantly better alternative, but not where I expected.

Writing this post I'm using Brown Deer OpenCL (version 1.6) and the original Ubuntu 14.04 image.

To get my sample code execute:

git clone -b blog10files https://github.com/nickoppen/passing.git

Profiling in OpenCL

OpenCL provides the cl_event structure to pass onto the execution queue, created with the CL_QUEUE_PROFILING_ENABLED flag set to allow timing information to be retrieved using the clGetEventProfilingInfo call. This seems pretty straight forward but unfortunately this is currently not supported (according to the Parallella Quick Start Guide).

This leaves us with the timing support provided by C++. The library <ctime> provides the calls time(time_t & time) and clock_t clock = clock(). The time() call returns the number of seconds since 1 Jan 1970 which is too coarse to measure the execution of a small kernel. The clock() call returns the number of system clock ticks which is much smaller time intervals and, if all of the profiling is done on the same machine, will be accurate enough to get a feel for comparative execution times.

Surrounding the forka call time stamps seems simple enough but one must remember that the command queue is asynchronous so any calls prior to the forka and the forka itself must be called with the flag CL_EVENT_WAIT.

#include <ctime>

clock_t tstart, tend;

clmsync(stdacc, 0, debug, CL_MEM_DEVICE|CL_EVENT_WAIT);
tstart = clock();
clforka(stdacc, 0, krn, &ndr, CL_EVENT_WAIT, n, 1, debug);
tend = clock();

The main problem with this approach is that it measures everything that the forka does. Have a look at Adam Taylor's host code for a simple "Hello, World!" application. There is a lot of setting up code, which is buried within the forka. This is a little clumsy but will have to do for now.

Intercore Value Passing

Core ID Abstraction

In my last post I used the little gem (from djm):

#define LOCAL_MEM_ADDRESS_BASE(gid) (((32 + ((gid) / 4)) << 26) | ((8 + ((gid) % 4)) << 20))

to calculate the base address of the core and then used:

*(float *)(LOCAL_MEM_ADDRESS_BASE(gid_next) + ((unsigned int) derived) + (index * sizeof(float))) = value_being_passed;

Now, I love bit shifting as much as the next programmer but I don't really have to work out what every base address is. If I know my global id ( gid = get_global_id() ) I can figure out where the executing core is in the mesh and its relationship with all other cores because the base addresses don't change.

My second love is hexadecimal memory locations but they really increase your nerd status when your wife or girlfriend looks over your shoulder and wonders what on earth you are doing late into the night. Seriously though, does 0x8080 really mean anything to you? And what is it's relationship with 0x84b0. How many hops are there between the two?

The eSDK has a way of partitioning the cores based on their location in the grid. The stdcl library does not support this but it is still a good idea. Instead of 0x8080 we could say core00 and then it is obvious that there are 5 hops to core31 (formerly known as 0x84b0).

That suggests to me that the first snippet above (although pure poetry) can be abstracted away into some nice, convenient and most importantly, efficient include file that looks like this:

#define core00 0x80800000
#define core10 0x80900000
#define core20 0x80a00000

...

Topology Abstraction

The Epiphany is a rectangular mesh with each core connected to its nearest neighbours. This is great, but what if your application is better suited to a ring structure where it only had to communicate with two cores on each side of it in the ring. For this application, it would be convenient to relate to the next core as NEXT and the previous core as PREV. In this case, the ring topology would look like this:

If data only needed to flow around in one direction, then you would only need to refer to NEXT.

Similarly, you could define a row topology:

Or, if you really did need 3 Dimensional processing, a mesh topology where you only had to refer to NORTH, SOUTH, EAST and WEST would be really handy:

In the row and grid topology, the ends (e.g. RIGHT of core00) get assigned 0x0.

I have implemented these three topologies and used the ring in my sample code. The relevant code is:

unsigned int NEXT, PREV, ringIndex, gidOrder[CORECOUNT];

...

initRing(&NEXT, &PREV, &ringIndex, gidOrder);

The initRing call initialises all of the variables which are:

NEXT: the next core base address in the ring
PREV: the previous core base address in the ring
ringIndex: the position of the executing core within the ring (core00 is assigned 0 and core01 is 15)
gidOrder: an array with the global_ids in the order in which they come in the ring (gidOrder[ringIndex] == get_global_id(0)

Such definitions would even abstract away most of the need to refer to coreXX and would mean that, once set up, no core would have to execute an if... else... to figure out where it is in the mesh. The only exception would be using the row and mesh topologies, the code would have to check that it is not on the edge with no core further down the chain.

Abstracting the Assignment

One final abstraction is to clean up the assignment. To this end I've defined:

#define NEIGHBOUR_LOC(CORE, STRUCTURE, INDEX, SIZEOFTYPE) (CORE + ((unsigned int)STRUCTURE) + (INDEX * SIZEOFTYPE))

#define NEIGHBOR_LOC(CORE, STRUCTURE, INDEX, SIZEOFTYPE) (CORE + ((unsigned int)STRUCTURE) + (INDEX * SIZEOFTYPE))

(The only difference between the two is the spelling of neighbour/neighbor.)

Which is used (for an int assignment):

*(int*)NEIGHBOUR_LOC(NEXT, vLocal, i, sizeof(int)) = vLocal[i];

I still find this a bit clumsy but better than the original.

The Passing Experiment

The whole point of this exercise was to see if I could find a better alternative to the core-to-core broadcast method I implemented in my feed-forward pass of my neural network.

(For those who don't want to wade through my first post, the feed-forward pass is a series of matrix multiplication steps where the task is split between all cores. Each step produces another matrix which then is one input of the next step. Every core needs all of the values produced in the previous step, therefore every core must pass its results to every other core.)

The Original Method - Broadcast

The original method iterated through every remote core and wrote directly into it's memory. The code was simple enough but my "back of the envelope" calculation estimated that if every core just sent one value, there would be 640 intercore hops in the whole process.

I did improve on the original a little. I only calculated the remote core's base ID once (rather than every time I sent a value which is a obviously a waste of time).

Alternative Zero - Broadcast No Wait

I was also pretty sure that I didn't need so many calls to barrier() given that each core "owned" a chunk of the storage array on all cores and it alone wrote to it. So my first change was to implement the same broadcast strategy but with no barriers.

I wanted to write the best kernel I could so I generated an array of all core IDs (core00... core33) and iterated through the array sending all values to it except when the destination core ID was the same as the local core ID (no point in sending the value to itself).

Then I came up with two alternative strategies to test using my ring topology described above.

Alternative One - Unicast

Unicast uses the ring topology and passes it's local values to the next core. Then, it passes the values it has just received from the previous core in the ring onto the next core and so on until all values have flowed around the ring.

This would minimise the number of clashes but it does not use very much of the available bandwidth. Hence:

Alternative Two - Multicast

I read somewhere that the Epiphany has two channels on the cMesh (used for intercore write messages). Multicast (which should really be called Bicast) again uses the ring topology but passes values in two directions, thereby using both channels. Like Unicast, it first sends its own values (but in both directions) and then sends the values it has just received from its neighbours - values received from PREV are passed onto NEXT and values received from NEXT are passed onto PREV).

The Results

To see some sort of sensitivity I called each method 16 times. Each call I incremented an argument (n) from 1 to 16. All kernels passed n values to its neighbours. I expected to see some difference in execution time based on the kernels algorithm and some sort of trend line as n increased. However, my initial experiment did not show any difference between the methods or the volume of data passed at all. The chart looked like this:

The Y-axis is the number of clock ticks per call and the X-axis has the passing method (broardcast, broadcast No Wait, multicast and unicast), in ascending order of the number of values passed.

This was a bit disappointing at first. Taking into account a bit of random scatter, all the methods looked the same and there was no discernible trend line based on n. The average for all methods was just over 119,000 ticks. I thought I should take a larger sample.

At first, I surrounded everything in a loop and ran it 100 times. This made hardly any difference at all. The average nudged up a time bit but the overall picture looked the same.

As I mentioned above, the timing method I used measured the whole of the forka call. Because there was so little change between 1 and 100 iterations, I can only assume that the 119,000 odd clock ticks were all overhead!

Clearly, to see any sort of difference I'd have to dramatically increase the workload. So... 100,000 iterations...

Finally, a data set that tells a story! The averages (which include the 119,000 tick loading time) were:

Broadcast: 2,545,798 ticks
Broadcast No Wait: 302,535 ticks
Multicast: 1,521,429 ticks
Unicast: 2,623,446 ticks

And a roughly steady increase with the amount of data being sent.

But hang on... what story does it tell?

I was completely amazed by this result. Getting rid of the calls to barrier() decreased the execution time by almost 90%! Sure the algorithm was a little better but not that much!

Similarly, Multicast, with almost half the calls to barrier(), ran in almost half the time.

Here I was thinking that the number of hops was the dominant time consumer!

Conclusions

From this experiment, I propose the following conclusions:

0. Don't Wait... Minimise the use of barrier()

Coordinating cores using calls to barrier() is expensive.

While I thought my fancy topologies would speed things up by reducing the number of hops, the algorithm needed the cores to coordinate. In a "pass-it-along" scenario, each core had to make sure that the cores around it had successfully delivered their value(s) before it could pass it (or them) along.

Broadcast, using point-to-point communication didn't need coordination so the extra transmission cost was insignificant.

1. Give your cores a lot of work to do.

The thing that slows parallel architectures down in general and especially with distributed memory architectures is communication overhead. This is usually interpreted as, "The cores need to pass information between each other and wait for results". However, when working with a remote device, communication overhead includes the amount of time the main processor needs to send the accelerator the kernel and the data to work on as well as the intercore communication.

So, when designing your system, cut off as big a chunk of work you can and get the little cores to do as much as possible - they are quite powerful... they can handle it.

Final thought...

While my topologies didn't deliver any significant gain in this case, I think that they would still be useful if the topology suited the algorithm. If I come across one, I'll write another post.

Up Next

I'm going to upgrade my system to the new image. I believe that the reboot is more reliable and you don't need to run you programs as root. Not a big deal but I still forget to login using su every now and then. I'll also revisit my old posts to update them if need be.

Then I'll keep going on my neural network simulator... Stay tuned.

Parallella Ideas, Designs and Executions

Friday, 27 March 2015

Intercore Communication and Profiling (sort of...)