Friends of Friends, an astronomer made clustering algorithm
The friends of friends (FoF) algorithm was first used by Huchra and Geller in
1982 to find groups of associated galaxies based upon their physical proximity.
More commonly nowadays it is used to find dark matter “halos” (a system of
gravitationally interacting particles that are stable, that won’t expand or
collapse anymore) within a simulation of dark matter particles.
The principle is simple: two particles (or galaxies, or any discrete point in
space) are grouped togther if their separation is less than some threshold,
generally called the linking length. Since the distance between any two
particles in a FoF group has a minimum value given by the linking length, this
also acts as a density threshold.
By definition, the FoF groups cannot intersect, therefore a particle can be
assigned uniquely to only one FoF group (for a single given linking length). In
graph theory-speak the algorithm produces a set of connected components of an
undirected graph. At the completion of the algorithm all particles are
effectively “grouped”, but many may be in groups of size 1. There is also no
dependence of the results on the initialisation (for a given linking length)
since two particles will always be in the same group if their separation is
below the threshold no matter where the algorithm began processing the data.
The runtime of the algorithm will depend not just on the implementation but also
on the structure of the data. In the limiting case where all particles are
within one linking length of each other the run time will be \( O(N) \). In the
limiting case where no particles are within one linking length of any other
particle, the run time (without optimisation) will be \( O(N^2) \); with
optimisation I guess this would reduce to \( O(N\log N) \) via a divide and conquer
approach.
Below is my basic implementation of the core part of this algorithm, generalised
to work in n dimensional space.
In [120]:
Simulate some data
Now let’s make some data to play with. Two groups, with a Gaussian distribution
of points one centered at \( (x,y)=(4,5) \) and the other at \( (x,y)=(1,2) \).
Let’s also add some sparse background of points, distributed uniformly.
In [121]:
-1.32884916474 < x < 5.44355794973 0.270660525779 < y < 6.44284078467
Visualise the data
In [122]:
Run FoF!
In [123]:
Found 499 groups from 1500 particles
Time taken 9.47667312622 s
Results
Lets pull out the two largest groups and see if they correspond to the original
groups
In [124]:
The largest group has 236 members
The second largest group has 160 members
The third largest group has 48 members
We can see from this that it is likely the FoF algorithm found the two groups.
In [125]:
Change the linking length
Now let’s re-run with a larger linking length
In [126]:
Found 121 groups from 1500 particles
Time taken 3.08288502693 s
The largest group has 491 members
The second largest group has 380 members
The third largest group has 46 members
The larger linking length allows more points to be included in the two largest
groups. The points appear to definitely include erroneous background points that
cause the center of each group to become biased.
K-means
For fun, let’s see what the k-means clustering algorithm does.
In [127]:
As well as having to know a priori the number of clusters we are searching for,
k-means also has to assign all points into one of the pre-defined clusters,
therefore this produces a significant bias in the calculated cluster centers.