Friday, 20 December 2013

Sorting of Spatial Data for Meshed Multi-Core Processors

Sorting of Spatial Data for Meshed Multi-Core Processors

Many algorithms that operate on two dimensional data sets are well suited to multi-core processors such as the Parallella. The nature of raw two dimensional data is that it generally arrives as an unsorted “point cloud” where neither the x or y axis are sorted. In order to gain the most efficient use of the processor, the points must be sorted so that each core can work on a subset of the space while minimising the need for inter-core communication. Therefore it makes sense to start the process with an efficient, two dimensional sort where the data ends up evenly distributed amongst the cores in clusters of “near neighbours”.
The target of this and subsequent blogs will be a design for Delaunay Triangulation. The target architecture is the Parallella board. I believe that the Epiphany III and Epiphany IV chips on the Parallella is well suited to spatial processing given that its cores are organised in a mesh, i.e. each core is connected to it's nearest neighbours to the north, south, east and west via multiply high speed buses. Thus, if close data points are clustered on each core and neighbouring clusters are on neighbouring cores, the distance to be travelled when the clusters are finally matched up will be minimised.

Goals of the Algorithm

The goals of the sorting algorithm are:
  • The x value of each point in a cluster are less than those in the clusters on the cores to the east and greater than those on the cores to the west.
  • The y value of each point in a cluster are less than those in the clusters on the cores to the north and greater than those in the clusters on the cores to the south.
  • The points are evenly distributed between cores
In addition we should take care that no unnecessary steps are introduced so that we end up with the most efficient process over all.

Distributing the Points

Given that the point cloud is assumed to be random and we want to have the same number of points per core then a round robin distribution seems the best way to start. The can be done as the data is being read in which should give the cores enough time to complete at least some of the initial sort in parallel with the data load phase.


Initial Sort

Let's not reinvent the wheel here. I think that a quick sort of the x values into one index and the y values into another index is the way to go here.


Swap Sort

After the initial sort, each core will have an sorted group of points that could be from anywhere in the point cloud. The purpose of the swap sort is to push each point to the neighbouring cores where the point is with it's nearest neighbours, or at least closer to them. I'm using a push or write-only method of communication between cores because the write message is a lot quicker than the read message. The cores must swap points (i.e receive one point for every point sent) in order to preserve the even distribution of points.
The cores must start the process by passing their larges x and y values to the cores to the east and north respectively via the c-mesh (let's call this the MyLargestX message). If the lowest x value of the core to the east is smaller than the highest x value then the these two cores must initiate a swap. This can be done in response to the MyLargestX message (let's call the response MySmallesXY). Simultaneously, the cores can be swapping along the y axis with a MyLargestY message. The swap is then completed with a MyLargestXY call from the initial core passing both the x and y values to the points new “home”.
If the MyLargestX message contains an argument that is smaller than the smallest value on the receiving core then the receiving core does not initiate the swap. If, at a later time that core receives a point that is smaller than that received from the core to the south or west then the swap can be initiated.


End Conditions

The end conditions cannot be determined by any one core on behalf of all cores because each core only knows it's own values and the largest values of the cores to the south and west. Therefore, each core must indicate to the controlling application that it is has received a MyLargestX or MyLargestY that does not cause it to initiate a swap. This is only a tentative state because a core that is currently “happy” (i.e. has values that are larger than the cores to the south and west) may receive a point from the north or east that it must pass on. Therefore the controlling application must only terminate the process when all cores are “happy”.


Potential Problems

Sub-Optimal Swapping Pathways

Because each core can only make decisions based on the data that is currently in it's local memory there may be some cases that swaps are initiated early in the process are then undone due to some smaller values arriving from cores further to the north. Right now I can't see how this can be avoided.
Similarly, a swap to the west may be undone at a later stage after some swaps to the north. This could be avoided by swapping based on the x values (north-south) first and, when x is sorted, sorting on y (east-west). This would also free up some memory on each core given that there would only be one sort index needed at a time (or indeed the points could be moved around in memory removing the need for an index at all).


Skewed Data Sets

This kind of sorting will end up with the point cloud spread out into a square. This is fine if the original data set is also roughly square or at least regularly distributed into a regular sort of shape.




Diagram 1 – Near neighbours stay near.

This distribution probably will not work so well if the point cloud is not regular. Points that are not true near neighbours may end up on the same core and points that are near neighbours may end up on distant cores.



Diagram 2 – Distant points that are not near neighbours get squashed onto the same core(s)



Diagram 3 – Near neighbours get separated onto distant cores



In this case the user must be aware of the limitations of the sort and the algorithm that uses the sorted data must be robust enough to handle this situation.


Next: Using the sorted data

My next entry will a description of how to Triangulate the sorted points.
Don't hold your breath – the Parallella boards are being shipped and when I get mine, I'll be writing code, not english.


Tuesday, 3 December 2013

Something to play with while we wait.

I've just posted a light weight C++ library that implements a three layer feed-forward back propagation neural network simulator on github. The repository is here: https://github.com/nickoppen/eNNpi.git.

I developed this version on the pi using the C++ standard library. This is an update of a version that I wrote a couple of years ago and is a very C++ oriented solution. There are classes for links, nodes, layers using templates, multiple inheritance and pretty much every C++ feature I know.

My intention is to use this as a base version and from which I'll develop a version for the parallella when it arrives. The next thing I'll work on is a matrix based implementation along the lines of the design outlined in the two previous posts on this blog.

If you want to download it and have a play please do. I hope my documentation is clear enough to get you started. 

One caveat though. I found github really confusing. If you are not familiar with it, downloading the files and not using git locally is probably the best way to go. Also, please don't post git or github questions - the answer will be "I have NO idea".

Please comment if you find any problems either on this post or on the repository wiki.

Thanks and have fun.

PS I just managed to get the Shuttle Statlog data uploaded. There are four training sets and one test set, the original data and transformed data in an open office spreadsheet and two scripts, one for create/train/save and one for test.