Statistics - Maple Help

 Statistics

Maple contains a wealth of functionality for doing statistics, much of it in the Statistics package. Maple 17 has a new algorithm for fitting data in an underdetermined system for use in predictive models, as well as a new robust measure of dispersion that is more suitable for asymmetric distributions than the median absolute deviation.

Fitting Data for Use in a Predictive Model

Maple 17 has a new algorithm for fitting data in an underdetermined system for use in predictive models.

Consider a survey of large-cap Energy sector stocks in the S&P 500, where you know a lot of information about a few listings. Assume in this case that most of the data, when adjusted against general market trends, ends up being useless for predicting future markets, but some properties do end up being strong market indicators. You do not, however, know in advance which variables to keep and which to discard.

Let's set up an example using random data to model this situation.

 > $\mathrm{restart};$
 > $\mathrm{with}\left(\mathrm{Statistics}\right):$
 > $\mathrm{NumStocks}:=41:$
 > $\mathrm{NumProperties}:=100:$
 > $\mathrm{HiddenIndicators}:=\mathrm{LinearAlgebra}:-\mathrm{RandomMatrix}\left(\mathrm{NumStocks},10,\mathrm{datatype}={\mathrm{float}}_{8}\right):$
 > $\mathrm{Data}:=\mathrm{LinearAlgebra}:-\mathrm{RandomMatrix}\left(\mathrm{NumStocks},\mathrm{NumProperties},\mathrm{datatype}={\mathrm{float}}_{8}\right):$
 > ${\mathrm{Data}}_{..,1..5}:={\mathrm{HiddenIndicators}}_{..,1..5}:$
 > $\mathrm{Profit}:=\mathrm{Vector}\left(\mathrm{NumStocks},i→\mathrm{add}\left(\frac{{\mathrm{HiddenIndicators}}_{i,j}}{j},j=1..10\right)\right):$

At this point we have generated the known data, and have profit numbers for that data. The following chart takes one sample parameter, property 10, and plots it against the profit.  It is not obvious what the correlation is.

 > $\mathrm{BarChart}\left(\left[{\mathrm{Data}}_{..,10},\mathrm{Profit}\right]\right)$

We want to build a model that fits the profit against all 100 properties. A standard least-squares fit is used first, for comparison.

 > $\mathrm{StandardLS}:=\mathrm{LinearAlgebra}:-\mathrm{LeastSquares}\left(\mathrm{Data},\mathrm{Profit}\right):$

Because there are more variables than observations, we'll always get a perfect fit with 100% correlation to the existing data.

 > $\mathrm{ProfitEstimatesUsingStandardLS}:=\mathrm{Data}.\mathrm{StandardLS}:$
 > $\mathrm{Correlation}\left(\mathrm{Profit},\mathrm{ProfitEstimatesUsingStandardLS}\right)$
 ${1.00000000000000}$ (1)

We don't really want a model that will perfectly fit today's data. We want a model that will detect which parameters are likely to actually correlate well with the profit observed and can be used in predicting new observations. We want to avoid overfitting.

 > $\mathrm{plots}[\mathrm{display}]\left(\left\{\mathrm{plot}\left(x,x=-150..150\right),\mathrm{ScatterPlot}\left(\mathrm{Profit},\mathrm{ProfitEstimatesUsingStandardLS}\right)\right\}\right)$
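The overfitting effect above is not specific to Maple. A minimal sketch in Python with NumPy (hypothetical data, standing in for the Maple session) shows the same thing: a minimum-norm least-squares fit with more parameters than observations reproduces the training data exactly, yet does much worse on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_props = 41, 100

# 100 candidate properties, but only the first five drive the profit.
data = rng.normal(size=(n_obs, n_props))
weights = 1.0 / np.arange(1, 6)
profit = data[:, :5] @ weights

# With more unknowns (100) than equations (41), the minimum-norm
# least-squares solution reproduces the training data exactly.
coeffs, *_ = np.linalg.lstsq(data, profit, rcond=None)
train_corr = np.corrcoef(profit, data @ coeffs)[0, 1]   # essentially 1.0

# The same coefficients generalize poorly to new data drawn the same way.
new_data = rng.normal(size=(n_obs, n_props))
new_profit = new_data[:, :5] @ weights
test_corr = np.corrcoef(new_profit, new_data @ coeffs)[0, 1]
```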

Let's set up a new model using predictive least-squares.

 > $\mathrm{SignificantProperties},\mathrm{PredictLS}:=\mathrm{Statistics}:-\mathrm{PredictiveLeastSquares}\left(\mathrm{Data},\mathrm{Profit}\right):$
 > $\mathrm{SignificantProperties}$
 $\left[{1}{,}{2}{,}{3}{,}{4}{,}{7}{,}{8}\right]$ (2)

Note that the PredictiveLeastSquares command, new in Maple 17, returns two things: a list indicating the columns of the data that it decided were most relevant for use in prediction, as well as the corresponding vector of coefficients (as with regular least-squares).

 > $\mathrm{ProfitEstimatesUsingPredictLS}:={\mathrm{Data}}_{..,\mathrm{SignificantProperties}}.\mathrm{PredictLS}:$
 > $\mathrm{Correlation}\left(\mathrm{Profit},\mathrm{ProfitEstimatesUsingPredictLS}\right)$
 ${0.947961956927974}$ (3)

The correlation with the current profit data is not perfect (as expected) but still very good.

 > $\mathrm{plots}[\mathrm{display}]\left(\left\{\mathrm{plot}\left(x,x=-150..150\right),\mathrm{ScatterPlot}\left(\mathrm{Profit},\mathrm{ProfitEstimatesUsingPredictLS}\right)\right\}\right)$

The following graph shows the standard least-squares estimates (red), predictive least-squares estimates (green), and actual profit numbers (black) on the same plot.

 > $\mathrm{perm},\mathrm{sortedProfit}:=\mathrm{sort}\left(\mathrm{Profit},\mathrm{output}=\left['\mathrm{permutation}','\mathrm{sorted}'\right]\right):$
 > $\mathrm{pts}:=\left[\mathrm{seq}\left(i,i=1..\mathrm{NumStocks}\right)\right]:$
 > $\mathrm{plots}\left[\mathrm{display}\right]\left(\left\{\mathrm{ScatterPlot}\left(\mathrm{pts},\mathrm{sortedProfit}\right),\mathrm{ScatterPlot}\left(\mathrm{pts},{\mathrm{ProfitEstimatesUsingPredictLS}}_{\mathrm{perm}},\mathrm{color}=\mathrm{green},\mathrm{symbolsize}=15\right),\mathrm{ScatterPlot}\left(\mathrm{pts},{\mathrm{ProfitEstimatesUsingStandardLS}}_{\mathrm{perm}},\mathrm{color}=\mathrm{red},\mathrm{symbolsize}=15\right)\right\}\right)\phantom{\rule[-0.0ex]{0.0em}{0.0ex}}$

These commands also illustrate another new feature of Maple 17: the output option for sort. It can be used to obtain the reordering applied to the input (in this case, $\mathrm{Profit}$) in order to sort it. This reordering (in this case, $\mathrm{perm}$) can then, for example, be applied to other lists or Vectors, as is the case here. There are more examples on the help page for sort.
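The `output = permutation` idea has direct analogues elsewhere. For instance, in Python with NumPy (a hypothetical illustration, not part of the Maple session), `argsort` returns the sorting permutation, which can then be applied to other arrays:

```python
import numpy as np

profit = np.array([3.0, 1.0, 2.0])
estimates = np.array([30.0, 10.0, 20.0])

perm = np.argsort(profit)      # the permutation that sorts profit
sorted_profit = profit[perm]
aligned = estimates[perm]      # the same reordering applied to another vector
```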

Now we will generate some further data, based again on the same hidden indicators, only some of which actually appear in the data that we know about.

 > $\mathrm{NewHiddenIndicators}:=\mathrm{LinearAlgebra}:-\mathrm{RandomMatrix}\left(\mathrm{NumStocks},10,\mathrm{datatype}={\mathrm{float}}_{8}\right):$
 > $\mathrm{NewData}:=\mathrm{LinearAlgebra}:-\mathrm{RandomMatrix}\left(\mathrm{NumStocks},\mathrm{NumProperties},\mathrm{datatype}={\mathrm{float}}_{8}\right):$
 > ${\mathrm{NewData}}_{..,1..5}:={\mathrm{NewHiddenIndicators}}_{..,1..5}:$
 > $\mathrm{NewProfit}:=\mathrm{Vector}\left(\mathrm{NumStocks},i→\mathrm{add}\left(\frac{{\mathrm{NewHiddenIndicators}}_{i,j}}{j},j=1..10\right)\right):$

Let's first use the standard least-squares model against the new data.

 > $\mathrm{NewProfitEstimatesUsingStandardLS}:=\mathrm{NewData}.\mathrm{StandardLS}:$
 > $\mathrm{Correlation}\left(\mathrm{NewProfit},\mathrm{NewProfitEstimatesUsingStandardLS}\right)$
 ${0.710550589496143}$ (4)

It is clear that the model suffers from overfitting against spurious parameters in the original data set.

 > $\mathrm{NewProfitEstimatesUsingPredictLS}:={\mathrm{NewData}}_{..,\mathrm{SignificantProperties}}.\mathrm{PredictLS}:$
 > $\mathrm{Correlation}\left(\mathrm{NewProfit},\mathrm{NewProfitEstimatesUsingPredictLS}\right)$
 ${0.948260572403935}$ (5)

The predictive model achieves a much better correlation on the new data.

 > $\mathrm{perm2},\mathrm{sortedProfit2}:=\mathrm{sort}\left(\mathrm{NewProfit},\mathrm{output}=\left['\mathrm{permutation}','\mathrm{sorted}'\right]\right):$
 > $\mathrm{pts}:=\left[\mathrm{seq}\left(i,i=1..\mathrm{NumStocks}\right)\right]:$
 > $\mathrm{plots}[\mathrm{display}]\left(\left\{\mathrm{ScatterPlot}\left(\mathrm{pts},\mathrm{sortedProfit2}\right),\mathrm{ScatterPlot}\left(\mathrm{pts},{\mathrm{NewProfitEstimatesUsingPredictLS}}_{\mathrm{perm2}},\mathrm{color}=\mathrm{green},\mathrm{symbolsize}=15\right),\mathrm{ScatterPlot}\left(\mathrm{pts},{\mathrm{NewProfitEstimatesUsingStandardLS}}_{\mathrm{perm2}},\mathrm{color}=\mathrm{red},\mathrm{symbolsize}=15\right)\right\}\right)$

It is clear that the predictive least-squares data (green) more closely matches the actual points in black, whereas the standard least-squares data (red), which generally correlates, is much more scattered.
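This help page does not document the algorithm behind PredictiveLeastSquares, but the general idea of keeping only columns that improve held-out prediction can be sketched in Python with NumPy as greedy forward selection scored by cross-validation. All names here are hypothetical, and this is one of several possible selection strategies, not the Maple implementation:

```python
import numpy as np

def forward_select(data, y, k_folds=5, seed=1):
    """Greedy forward selection scored by cross-validated squared error."""
    n, p = data.shape
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k_folds)

    def cv_error(cols):
        # total held-out squared error of a least-squares fit on `cols`
        err = 0.0
        for fold in folds:
            train = np.setdiff1d(np.arange(n), fold)
            c, *_ = np.linalg.lstsq(data[np.ix_(train, cols)], y[train], rcond=None)
            err += np.sum((data[np.ix_(fold, cols)] @ c - y[fold]) ** 2)
        return err

    selected, best = [], np.inf
    while len(selected) < p:
        scores = {j: cv_error(selected + [j]) for j in range(p) if j not in selected}
        j, score = min(scores.items(), key=lambda kv: kv[1])
        if score > best - 1e-12:      # stop when no real improvement remains
            break
        selected, best = sorted(selected + [j]), score
    coeffs, *_ = np.linalg.lstsq(data[:, selected], y, rcond=None)
    return selected, coeffs

# Synthetic data in the spirit of the example: only columns 0..4 matter.
rng = np.random.default_rng(0)
data = rng.normal(size=(41, 100))
weights = 1.0 / np.arange(1, 6)
profit = data[:, :5] @ weights
selected, coeffs = forward_select(data, profit)

# Compare generalization against the full least-squares fit on fresh data.
full, *_ = np.linalg.lstsq(data, profit, rcond=None)
new_data = rng.normal(size=(41, 100))
new_profit = new_data[:, :5] @ weights
r_selected = np.corrcoef(new_profit, new_data[:, selected] @ coeffs)[0, 1]
r_full = np.corrcoef(new_profit, new_data @ full)[0, 1]
```

On this synthetic problem the selected model carries over to new data far better than the full fit, mirroring the Maple results above.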

Robust Measures of Dispersion

A measure of dispersion is a statistic of a data set that describes the variability or spread of that data set. Two well-known examples are the standard deviation and the interquartile range. Maple 17 introduces a new measure of dispersion called $\mathrm{Sn}$, originally proposed by Rousseeuw and Croux [1].

Let us investigate how measures of dispersion behave when noise is added to a data set. Specifically, we will have an original data set $X$ of, say, $n$ data points, and a perturbed data set $Y$ in which a certain fraction $r$ of the data points (that is, $rn$ points) are changed dramatically. We investigate at what value of $r$ the statistic becomes meaningless.

 > $X:=\mathrm{Sample}\left(\mathrm{Normal}\left(0,1\right),1000\right)$
 ${X}{:=}\left[\begin{array}{c}{\mathrm{1 .. 1000}}{{\mathrm{Vector}}}_{{\mathrm{row}}}\\ {\mathrm{Data Type:}}{{\mathrm{float}}}_{{8}}\\ {\mathrm{Storage:}}{\mathrm{rectangular}}\\ {\mathrm{Order:}}{\mathrm{Fortran_order}}\end{array}\right]$ (6)
 > $\mathrm{StandardDeviation}\left(X\right)$
 ${1.04421364916374}$ (7)
 > $Y:=\mathrm{copy}\left(X\right)$
 ${Y}{:=}\left[\begin{array}{c}{\mathrm{1 .. 1000}}{{\mathrm{Vector}}}_{{\mathrm{row}}}\\ {\mathrm{Data Type:}}{{\mathrm{float}}}_{{8}}\\ {\mathrm{Storage:}}{\mathrm{rectangular}}\\ {\mathrm{Order:}}{\mathrm{Fortran_order}}\end{array}\right]$ (8)
 > ${Y}_{1}:={10}^{100}$
 ${{Y}}_{{1}}{:=}{10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000}$ (9)
 > $\mathrm{StandardDeviation}\left(Y\right)$
 ${3.16227766016838}{}{{10}}^{{98}}$ (10)

For the standard deviation, we see that changing only one data point can massively change the standard deviation. In other words, there is no positive fraction $r$ of the data points that we can change while keeping the standard deviation bounded. We say that the breakdown point of the standard deviation is 0.
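The same experiment is easy to reproduce outside Maple. A small Python/NumPy sketch (with its own random data, so the numbers differ from the session above):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, 1000)

y = x.copy()
y[0] = 1e100                    # corrupt a single data point

# One corrupted value drags the sample standard deviation to the
# order of 1e100 / sqrt(n): the breakdown point is 0.
sd_clean = np.std(x, ddof=1)    # close to 1
sd_noisy = np.std(y, ddof=1)    # on the order of 3e98
```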

The interquartile range behaves differently. Changing a single data point doesn't change the interquartile range of $Y$ very much; in fact, we can change up to a quarter of the data points while staying within an order of magnitude of the interquartile range of $X$. As soon as we have changed 250 out of the 1000 data points, though, the interquartile range also goes through the roof.

 > $\mathrm{InterquartileRange}\left(X\right)$
 ${1.38339270130420}$ (11)
 > $Y:=\mathrm{copy}\left(X\right)$
 ${Y}{:=}\left[\begin{array}{c}{\mathrm{1 .. 1000}}{{\mathrm{Vector}}}_{{\mathrm{row}}}\\ {\mathrm{Data Type:}}{{\mathrm{float}}}_{{8}}\\ {\mathrm{Storage:}}{\mathrm{rectangular}}\\ {\mathrm{Order:}}{\mathrm{Fortran_order}}\end{array}\right]$ (12)
 > ${Y}_{1..249}:={10}^{100}:$
 > $\mathrm{InterquartileRange}\left(Y\right)$
 ${4.05253073684634}$ (13)
 > ${Y}_{250}:={10}^{100}:$
 > $\mathrm{InterquartileRange}\left(Y\right)$
 ${5.83333333333371}{}{{10}}^{{99}}$ (14)

This suggests that the breakdown point of the interquartile range is $\frac{1}{4}$: changing strictly fewer than $\frac{1}{4}$ of the points cannot make the interquartile range unbounded. This is indeed correct. We say that the interquartile range is more robust than the standard deviation.
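A Python/NumPy sketch of the same breakdown behavior (hypothetical data; NumPy's default quantile interpolation differs slightly from Maple's, so exact values will not match):

```python
import numpy as np

def iqr(v):
    # interquartile range: distance between the 25th and 75th percentiles
    q1, q3 = np.percentile(v, [25, 75])
    return q3 - q1

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, 1000)

y = x.copy()
y[:249] = 1e100          # corrupt just under a quarter of the points
iqr_ok = iqr(y)          # still a moderate value

y[249] = 1e100           # now exactly a quarter is corrupted
iqr_broken = iqr(y)      # jumps to the order of 1e99
```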

The breakdown point for any statistic can never be more than $\frac{1}{2}$: if we change over half of the data points in the set, then there's no way to decide what the "correct" data is, and what the "changed" data is. So are there dispersion statistics that reach this maximal breakdown point?

Yes, there are. A relatively well-known one is the median absolute deviation from the median, available in Maple as MedianDeviation. As the name says, it is obtained by computing the absolute difference between every data point and the median of the data set, and taking the median of these values.
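The description above translates directly into a few lines of Python/NumPy (a hypothetical sketch, not Maple's implementation):

```python
import numpy as np

def median_abs_deviation(v):
    # median of the absolute deviations from the median
    return np.median(np.abs(v - np.median(v)))

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, 1000)

y = x.copy()
y[:499] = 1e100                       # corrupt just under half the points
mad_clean = median_abs_deviation(x)   # about 0.6745 for standard normal data
mad_noisy = median_abs_deviation(y)   # still bounded: breakdown point 1/2
```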

 > $\mathrm{MedianDeviation}\left(X\right)$
 ${0.683845406516676}$ (15)
 > $Y:=\mathrm{copy}\left(X\right):$
 > ${Y}_{1..499}:={10}^{100}:$
 > $\mathrm{MedianDeviation}\left(Y\right)$
 ${6.52664152292415}$ (16)
 > ${Y}_{500}:={10}^{100}:$

Once half of the data points have been changed, however, the median absolute deviation from the median also becomes unbounded.

The median absolute deviation from the median is a very useful robust estimator, but it also has some disadvantages, explained in the paper [1] by Rousseeuw and Croux. One of their objections is that it doesn't deal with asymmetric distributions very well; another is that, while it is very robust against extreme changes in some points, it needs relatively many data points to "converge" to the proper value for a distribution in the absence of disturbance. In the statistics literature, this is phrased as saying that the median absolute deviation from the median is not very efficient. These authors propose two alternative statistics that also have a breakdown point of $\frac{1}{2}$ but higher efficiency, called $\mathrm{Sn}$ and $\mathrm{Qn}$. Maple 17 implements $\mathrm{Sn}$ as the new command RousseeuwCrouxSn.

 > $\mathrm{RousseeuwCrouxSn}\left(X\right)$
 ${0.868120336664191}$ (17)
 > $Y:=\mathrm{copy}\left(X\right):$
 > ${Y}_{1..499}:={10}^{100}:$
 > $\mathrm{RousseeuwCrouxSn}\left(Y\right)$
 ${7.17500133596595}$ (18)
 > ${Y}_{500}:={10}^{100}:$
 > $\mathrm{RousseeuwCrouxSn}\left(Y\right)$
 ${1.00000000000000}{}{{10}}^{{100}}$ (19)
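For reference, $\mathrm{Sn}$ itself is simple to state: $\mathrm{Sn} = c \cdot \operatorname{lowmed}_i \operatorname{highmed}_j \left|x_i - x_j\right|$ with $c = 1.1926$. Below is a naive $O(n^2)$ Python/NumPy sketch; Rousseeuw and Croux give an $O(n \log n)$ algorithm, Maple's implementation is not shown here, and small-sample correction factors are omitted.

```python
import numpy as np

def rousseeuw_croux_sn(v):
    """Naive O(n^2) Sn: c * lowmed_i( highmed_j |v_i - v_j| )."""
    v = np.asarray(v, dtype=float)
    n = len(v)
    diffs = np.abs(v[:, None] - v[None, :])
    # high median: the (floor(n/2) + 1)-th order statistic of each row
    high_med = np.sort(diffs, axis=1)[:, n // 2]
    # low median: the floor((n+1)/2)-th order statistic over the rows
    low_med = np.sort(high_med)[(n + 1) // 2 - 1]
    return 1.1926 * low_med

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, 1000)
sn_clean = rousseeuw_croux_sn(x)     # close to 1 for standard normal data

y = x.copy()
y[:499] = 1e100                      # just under half corrupted: still bounded
sn_noisy = rousseeuw_croux_sn(y)

y[499] = 1e100                       # exactly half corrupted: blows up
sn_broken = rousseeuw_croux_sn(y)
```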

We will show how all of these statistics deviate from their true value for beta-distributed data samples at sample sizes from 10 to 10000 and with fractions between $0$ and $\frac{1}{2}$ of the data replaced by the value $5$. In particular, given the sample size and the fraction $r$, we replace the highest $100r$ percent of the data by $5$, then divide the value obtained for the changed sample by the true value for the distribution, thus obtaining a number that should be $1$ for an ideal statistic. We then repeat this $100$ times and take the average squared difference from $1$. This is the number shown in the plot below for each of the four measures of dispersion discussed above.

 > $\mathrm{functions}:=\left[\mathrm{StandardDeviation},\mathrm{InterquartileRange},\mathrm{MedianDeviation},\mathrm{RousseeuwCrouxSn}\right]:$
 > $X:=\mathrm{Sample}\left(\mathrm{BetaDistribution}\left(1,2\right),{10}^{6}\right):$
 > $\mathrm{true_values}:=\mathrm{map}\left(f→f\left(X\right),\mathrm{functions}\right)$
 ${\mathrm{true_values}}{:=}\left[{0.235596218498922}{,}{0.364943301288678}{,}{0.176447401282199}{,}{0.212276454518238}\right]$ (20)
 > $\mathrm{sample_sizes}:=\left[10,30,100,300,1000,3000,10000\right]:$
 > $\mathrm{nss}:=\mathrm{numelems}\left(\mathrm{sample_sizes}\right):$
 > $\mathrm{results}:=\mathrm{Array}\left(1..4,1..\mathrm{nss},0..10,1..100\right)$
 ${\mathrm{results}}{:=}\left[\begin{array}{c}{\mathrm{1..4 x 1..7 x 0..10 x 1..100}}{\mathrm{Array}}\\ {\mathrm{Data Type:}}{\mathrm{anything}}\\ {\mathrm{Storage:}}{\mathrm{rectangular}}\\ {\mathrm{Order:}}{\mathrm{Fortran_order}}\end{array}\right]$ (21)
 > $\mathrm{rr}:=\mathrm{Array}\left(1..4,1..\mathrm{nss},0..10\right):$
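The worksheet elides the loops that fill these arrays; the experiment can be sketched at reduced scale in Python/NumPy. All helper names here are hypothetical, plain medians are used as an approximation to $\mathrm{Sn}$'s low/high medians, and the reference sample and replication counts are smaller than in the Maple run:

```python
import numpy as np

rng = np.random.default_rng(3)

def iqr(v):
    q1, q3 = np.percentile(v, [25, 75])
    return q3 - q1

def mad(v):
    return np.median(np.abs(v - np.median(v)))

def sn(v):
    # plain medians used as an approximation to Sn's low/high medians
    diffs = np.abs(v[:, None] - v[None, :])
    return 1.1926 * np.median(np.median(diffs, axis=1))

statistics = {
    "std": lambda v: np.std(v, ddof=1),
    "iqr": iqr,
    "mad": mad,
    "sn": sn,
}

# "True" values estimated from one clean Beta(1, 2) reference sample.
reference = rng.beta(1.0, 2.0, 2000)
true_value = {name: f(reference) for name, f in statistics.items()}

def mean_sq_distortion(name, n, r, reps=20):
    """Average squared deviation of stat(perturbed sample) / true from 1."""
    errs = []
    for _ in range(reps):
        v = np.sort(rng.beta(1.0, 2.0, n))
        k = int(round(r * n))
        if k:
            v[-k:] = 5.0          # replace the highest 100r percent by 5
        errs.append((statistics[name](v) / true_value[name] - 1.0) ** 2)
    return float(np.mean(errs))

# At r = 0.3, the robust statistics stay near their true values,
# while the standard deviation and interquartile range do not.
out = {name: mean_sq_distortion(name, 1000, 0.3) for name in statistics}
```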

The colors are red for the standard deviation, green for the interquartile range, blue for the median absolute deviation from the median, and yellow for Rousseeuw and Croux's $\mathrm{Sn}$. Lower numbers are shown higher in the graph, and are better. We see that in the case where there is no distortion ($r=0$), the standard deviation has the lowest distortion. However, as soon as there is any distortion, it is immediately too inaccurate to be useful for any purpose. For $r<0.25$, the interquartile range (green) does reasonably well, but greater values of $r$ make it, too, unusable. The median absolute deviation from the median (blue) and $\mathrm{Sn}$ (yellow) are pretty close in most cases, with yellow coming out on top slightly more often than blue. We see that $\mathrm{Sn}$ is a good choice for a robust statistic of dispersion.

References

 Rousseeuw, Peter J., and Croux, Christophe. "Alternatives to the Median Absolute Deviation." Journal of the American Statistical Association, 88(424), 1993, pp. 1273-1283.