I want to test how random forests deal with noisy data that has the same
information content as less noisy data.
The test for this will be to see how well the random forest regressor can
predict the mean of the Gaussian that the data was drawn from.
The whole data set will be drawn from a set of Gaussians with different means
(\( \mu \)) but identical standard deviations (\( \sigma \)), and the following three
data sets will be created:

1. The data is drawn from each Gaussian ONE time. This will be termed the
high noise data set, where each data point has error=\( \sigma \).
2. The data is drawn from each Gaussian \( n_{obs} \) times and then averaged to
produce one data point for each Gaussian. This will be termed the low noise
data set, where each data point has error=\( \sigma/\sqrt{n_{obs}} \) (see the short
derivation after this list).
3. The data is drawn from each Gaussian \( n_{obs} \) times but not averaged,
just kept. This data set will therefore be \( n_{obs} \) times larger than data
sets 1 and 2. This will be termed the high noise-many data set, where each data
point has error=\( \sigma \), but the data set contains more information due to the
larger number of "observations" per Gaussian.
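The low noise error above is just the standard error of the mean: averaging \( n_{obs} \) independent draws divides the variance by \( n_{obs} \),

\[
\mathrm{Var}\!\left(\frac{1}{n_{obs}}\sum_{i=1}^{n_{obs}} x_i\right)
= \frac{1}{n_{obs}^2}\sum_{i=1}^{n_{obs}}\mathrm{Var}(x_i)
= \frac{\sigma^2}{n_{obs}},
\]

so each averaged data point has standard deviation \( \sigma/\sqrt{n_{obs}} \).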
The data sets will be drawn from Gaussians whose means are evenly spaced between
\( \mu_{min} \) and \( \mu_{max} \). These means will
be fixed over all realisations of the data sets. The spacing between \( \mu \)'s
will be set equal to \( \sigma \), so there is significant overlap between data drawn
from adjacent Gaussians, making it non-trivial for a single data point to predict
the mean of the Gaussian it was drawn from.
To illustrate how this might work, I plot an example of 4 Gaussians spaced by
\( \sigma=0.5 \) between \( \mu_{min}=10 \) and \( \mu_{max}=12 \).
In [50]:
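A minimal sketch of code that could produce this plot (assuming numpy, scipy, and matplotlib; the variable names are illustrative, not the original cell):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Means spaced by sigma starting at mu_min (gives [10. 10.5 11. 11.5])
mu_min, mu_max, sigma = 10.0, 12.0, 0.5
mus = np.arange(mu_min, mu_max, sigma)

# Plot the probability density of each Gaussian to show the overlap
x = np.linspace(mu_min - 3 * sigma, mu_max + 3 * sigma, 500)
for mu in mus:
    plt.plot(x, norm.pdf(x, loc=mu, scale=sigma), label=r"$\mu=%.1f$" % mu)
plt.xlabel("x")
plt.ylabel("probability density")
plt.legend()
plt.show()
```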
Now to illustrate the drawing of the different data sets, I’ll draw data points
from the 4 Gaussians plotted above, with only 5 "observations" for data sets 2
and 3.
In [51]:
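A sketch of how the drawing could be done (assuming numpy; the seed, names, and use of `default_rng` are illustrative), producing output in the format shown below:

```python
import numpy as np

# Reusing mus and sigma from the cell above; redefined here for completeness.
sigma = 0.5
mus = np.arange(10.0, 12.0, sigma)
n_obs = 5
rng = np.random.default_rng()

print("There are %d Gaussians" % len(mus))
print("The Gaussian mu's:")
print(mus)

# 1. high noise: one draw per Gaussian
high_noise = rng.normal(mus, sigma).reshape(-1, 1)
print("The high noise data:")
print(high_noise)

# 3. high noise-many: n_obs draws per Gaussian, kept separate
high_noise_many = rng.normal(mus[:, None], sigma, size=(len(mus), n_obs))

# 2. low noise: the average of the n_obs draws for each Gaussian
low_noise = high_noise_many.mean(axis=1, keepdims=True)
print("The low noise data:")
print(low_noise)
print("The high noise-many data:")
print(high_noise_many)

print("Taking the mean should retrieve the 'low noise' data:")
print(high_noise_many.mean(axis=1))
```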
There are 4 Gaussians
The Gaussian mu's:
[ 10. 10.5 11. 11.5]
The high noise data:
[[ 9.3879663 ]
[ 10.61671429]
[ 10.94476015]
[ 11.60133208]]
The low noise data:
[[ 10.342395 ]
[ 10.15318336]
[ 10.89082949]
[ 11.50473817]]
The high noise-many data:
[[ 10.83025991 10.03411825 9.58813345 11.07049289 10.18897049]
[ 10.05478904 10.54668392 9.93534111 9.68767715 10.54142557]
[ 11.80355274 10.96345827 10.06984289 10.85218006 10.7651135 ]
[ 11.63320311 11.59479556 11.21628922 11.73834815 11.34105479]]
Taking the mean should retrieve the 'low noise' data:
[ 10.342395 10.15318336 10.89082949 11.50473817]
Probably (depending on the random realisation you get!) you can see that the
"high noise" data set is noisier than the "low noise" data set, and that the
columns of the "high noise-many" data set look more like 5 different versions of
the "high noise" data set than the "low noise" data set.
Anyway, now I'm going to repeat the above, but with new (more illustrative)
parameters for the Gaussians, and generate a larger data set. Then I'll perform
random forest regression on the generated data to try to predict the mean of the
Gaussian each point was drawn from. For each realisation of the data sets, the rms precision with which the random forest regressor recovers the means of the Gaussians will be recorded.
Depending on the value of nreal, this may take a few minutes to run.
In [52]:
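A sketch of what this experiment loop might look like (assuming scikit-learn's `RandomForestRegressor`; `n_gauss=9000` and `nreal=30` match the printed output, while `sigma`, the train/test split, and the number of trees are illustrative choices, not the original parameters):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

n_gauss, n_obs, nreal = 9000, 5, 30   # n_gauss and nreal match the printed output
sigma = 0.5                           # illustrative; spacing between means equals sigma
mus = 10.0 + sigma * np.arange(n_gauss)

rng = np.random.default_rng()
rms = {"high noise": [], "low noise": [], "high noise-many": []}

for i in range(nreal):
    if i % 10 == 0:
        print("On realisation %d of %d" % (i + 1, nreal))

    # Draw the three data sets for this realisation.
    many = rng.normal(mus[:, None], sigma, size=(n_gauss, n_obs))
    datasets = {
        "high noise": (rng.normal(mus, sigma)[:, None], mus),
        "low noise": (many.mean(axis=1, keepdims=True), mus),
        "high noise-many": (many.reshape(-1, 1), np.repeat(mus, n_obs)),
    }

    # Train on half of each data set, record the rms error on the other half.
    for name, (X, y) in datasets.items():
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5)
        pred = RandomForestRegressor(n_estimators=100).fit(X_tr, y_tr).predict(X_te)
        rms[name].append(np.sqrt(np.mean((pred - y_te) ** 2)))

print("Results!")
for name, vals in rms.items():
    plt.hist(vals, bins=10, alpha=0.5, label=name)
plt.xlabel("rms precision")
plt.ylabel("number of realisations")
plt.legend()
plt.show()
```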
There are now 9000 Gaussians
On realisation 1 of 30
On realisation 11 of 30
On realisation 21 of 30
Results!
The plot above shows, for each data set, a histogram of the rms precision of the random forest regressor in predicting the mean of each Gaussian.
The random forest run on the "high noise-many" data set turns out to be just as
accurate as the one run on the "low noise" data set (and of course both are more
accurate than the one run on the "high noise" data set), as we expected and hoped
from our knowledge of the statistics.