Statistics with DataFrames - Maple Help

Statistics with DataFrames

A DataFrame is one of the basic data structures in Maple. A data frame is a collection of variables, known as DataSeries, which are displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length; however, each variable can have a different type, such as integer, float, string, name, truefalse, etc., which makes data frames ideal for storing heterogeneous data.

When printed, data frames resemble matrices in that they are displayed as a rectangular grid, but a key difference is that the first row holds the column (variable) names and the first column holds the row (individual) names. These row and column names are treated as header meta-information and are not part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numeric index.
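For example, header-based and numeric indexing can be compared on a small DataFrame (the names df, count, and label below are hypothetical, chosen only for illustration):

 > df := DataFrame( < <1, 2, 3> | <"x", "y", "z"> >, columns = [count, label], rows = [r1, r2, r3] ):
 > df[r2, count];   # access the entry in row r2, column count, by name
 > df[2, 1];        # the same entry, accessed by numeric index

Both calls should return the same value, since the header names and the numeric indices refer to the same cell.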

This example page shows a number of common statistical operations with data frames. For more information, including a list of Statistics commands that can be run on data frames, see DataFrames in Statistics.

Getting Started

To begin, the Statistics package is loaded.

 > with(Statistics):

A new DataFrame can be created based on randomly generated data. The following generates a DataFrame with 3 DataSeries. The first two columns are generated by randomly sampling from a Uniform(0,1) distribution. The third DataSeries also has randomly generated data, which is sampled from 4 distinct levels: 0, 1, 2, 3.

 > data := DataFrame( < Sample(Uniform(0, 1), [20, 2]) | LinearAlgebra:-RandomVector(20, generator = rand(0 .. 3) ) >, columns = [a,b,c] );
 ${\mathrm{DataFrame}}{}\left({{\mathrm{_rtable}}}_{{36893628804577345396}}{,}{\mathrm{rows}}{=}\left[{1}{,}{2}{,}{3}{,}{4}{,}{5}{,}{6}{,}{7}{,}{8}{,}{9}{,}{10}{,}{11}{,}{12}{,}{13}{,}{14}{,}{15}{,}{16}{,}{17}{,}{18}{,}{19}{,}{20}\right]{,}{\mathrm{columns}}{=}\left[{a}{,}{b}{,}{c}\right]\right)$ (1)
DataFrames and the Context Panel

The Maple programming language provides many commands that are useful for exploring DataFrames. The Context Panel provides easy access to a selection of these commands, displaying context-specific commands that can be applied to DataFrames or DataSeries. The DataFrame context-sensitive options include many commands that can be applied to an entire DataFrame as well as to a single DataSeries in a DataFrame.

The second section of the DataFrame menu in the Context Panel includes commands for conversions, operations, queries, and visualization of DataFrames and DataSeries. The third section includes further commands relating to statistics and data analysis, including data analysis, data manipulation, properties and quantities, and summary and tabulation.

A useful feature of the context-sensitive options is the ability to quickly filter the DataFrame by value or to select the columns to which operations are applied. This can be beneficial when dealing with heterogeneous data that includes non-numeric DataSeries. Many commands in Statistics assume an entirely numeric DataFrame; selectively removing the non-numeric data makes it possible to use these routines directly on the given DataFrame.

Computing Summary Statistics

One of the most fundamental statistical tasks is to generate summary statistics, such as the mean or standard deviation, for a given data set. The DataSummary command returns the mean, standard deviation, skewness and kurtosis, as well as the minimum and maximum values, and the cumulative weight. When all of the elements in the DataSeries have weight = 1, the cumulative weight corresponds to the total number of observations in the DataSeries.

 > DataSummary(data, summarize = embed):

                         a                        b                        c
  mean                   0.642033244375975753     0.492561671718578453     1.64999999999999969
  standard deviation     0.329931930346065327     0.334466425079465335     1.26802789276975481
  skewness              -0.535097906770228615    -0.223319664925291123    -0.259820705584549505
  kurtosis               1.62899290839617339      1.41288760741544195      1.38747927386886905
  minimum                0.0975404049994095246    0.0318328463774206760    0.
  maximum                0.970592781760615697     0.950222048838354927     3.
  cumulative weight     20.                      20.                      20.

Individual summary statistics can also be returned. For example, the variance of each of the DataSeries in the DataFrame is:

 > Variance(data);
 ${\mathrm{DataSeries}}{}\left(\left[\begin{array}{ccc}0.1088550786618809& 0.1118677895054376& 1.6078947368421048\end{array}\right]{,}{\mathrm{labels}}{=}\left[{a}{,}{b}{,}{c}\right]{,}{\mathrm{datatype}}{=}{{\mathrm{float}}}_{{8}}\right)$ (2)

It is important to note that requesting any individual summary statistic returns a DataSeries, in which each entry corresponds to a column of the DataFrame. In this case, column "a" has a variance of approximately 0.109, column "b" a variance of approximately 0.112, and column "c" a variance of approximately 1.61.
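A statistic can also be computed for a single column by indexing into the DataFrame first; the following is a sketch of the same computation restricted to one DataSeries (using column selection by header name, as described above):

 > Variance(data[a]);

This should return the scalar variance of column "a" alone, rather than a DataSeries of variances.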

Aggregate Statistics

The defined DataFrame has a third column with 4 distinct levels: 0, 1, 2, and 3. The distinct levels in a DataSeries can easily be seen by collapsing the DataSeries into a set:

 > convert(data[3], set);
 $\left\{{0}{,}{1}{,}{2}{,}{3}\right\}$ (3)
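The number of observations at each level can also be counted with the Tally command (a sketch; here the DataSeries is first converted to a list):

 > Tally(convert(data[3], list));

This produces a list of equations of the form level = count, one for each distinct level in column "c".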

When a DataFrame contains a DataSeries with a fixed number of distinct levels, it can be useful to compute aggregate statistics based on those levels. Since column "c" of the example DataFrame has 4 distinct levels, aggregate statistics can be computed using the Aggregate command, which returns the value of a given statistic for each of the levels.

The following shows the mean for each of the levels in column "c":

 > Aggregate(data, 3);
 ${\mathrm{DataFrame}}{}\left(\left[\begin{array}{ccc}0.789176927985601& 0.5715716569968473& 0\\ 0.5312144700993431& 0.4644634091949476& 1\\ 0.6209018764709715& 0.6098719860151968& 2\\ 0.562666428150338& 0.34907382056065794& 3\end{array}\right]{,}{\mathrm{rows}}{=}\left[{1}{,}{2}{,}{3}{,}{4}\right]{,}{\mathrm{columns}}{=}\left[{a}{,}{b}{,}{c}\right]\right)$ (4)

The default summary statistic for the Aggregate command is the mean. By specifying a function, any summary statistic can be returned for the levels in column "c":

 > Aggregate(data, 3, function = Median);
 ${\mathrm{DataFrame}}{}\left(\left[\begin{array}{ccc}0.9354413457866585& 0.7394678592524246& 0\\ 0.5312144700993431& 0.4644634091949476& 1\\ 0.6323592462254095& 0.6787351548577735& 2\\ 0.5468815192049838& 0.31709948006086053& 3\end{array}\right]{,}{\mathrm{rows}}{=}\left[{1}{,}{2}{,}{3}{,}{4}\right]{,}{\mathrm{columns}}{=}\left[{a}{,}{b}{,}{c}\right]\right)$ (5)

The tally option can also be specified in order to count the number of observations in each of the levels:

 > Aggregate(data, 3, function = Range, tally);
 ${\mathrm{DataFrame}}{}\left(\left[\begin{array}{cccc}0.8287064431334004& 0.9183892024609343& 0& 6\\ 0.867348130199867& 0.5865534427667717& 1& 2\\ 0.6786687293758972& 0.4662094831640262& 2& 5\\ 0.788748708895561& 0.8134176272945876& 3& 7\end{array}\right]{,}{\mathrm{rows}}{=}\left[{1}{,}{2}{,}{3}{,}{4}\right]{,}{\mathrm{columns}}{=}\left[{a}{,}{b}{,}{c}{,}{\mathrm{Tally}}\right]\right)$ (6)

Data Manipulation

The Statistics package has many commands for manipulating statistical data, including data selection, scaling, and more. For example, the Scale command can be used to center and scale numeric DataSeries:

 > data[[1, 2]] := Scale(data[[1, 2]]);
 ${\mathrm{DataFrame}}{}\left({{\mathrm{_rtable}}}_{{36893628804560513732}}{,}{\mathrm{rows}}{=}\left[{1}{,}{2}{,}{3}{,}{4}{,}{5}{,}{6}{,}{7}{,}{8}{,}{9}{,}{10}{,}{11}{,}{12}{,}{13}{,}{14}{,}{15}{,}{16}{,}{17}{,}{18}{,}{19}{,}{20}\right]{,}{\mathrm{columns}}{=}\left[{a}{,}{b}\right]\right)$ (7)
 > data;
 ${\mathrm{DataFrame}}{}\left({{\mathrm{_rtable}}}_{{36893628804560505060}}{,}{\mathrm{rows}}{=}\left[{1}{,}{2}{,}{3}{,}{4}{,}{5}{,}{6}{,}{7}{,}{8}{,}{9}{,}{10}{,}{11}{,}{12}{,}{13}{,}{14}{,}{15}{,}{16}{,}{17}{,}{18}{,}{19}{,}{20}\right]{,}{\mathrm{columns}}{=}\left[{a}{,}{b}{,}{c}\right]\right)$ (8)
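After scaling, the first two columns should have mean approximately 0 and standard deviation approximately 1 (up to floating-point roundoff), which can be checked directly (a sketch, again selecting columns by header name):

 > Mean(data[a]), StandardDeviation(data[a]);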

Data Analysis

There are many commands in Maple for analyzing data, including commands for regression analysis, principal component analysis, analysis of variance and more.

Principal Component Analysis (PCA) aims to identify patterns in data by reducing the dimensionality of multivariate data to a few key explanatory variables called principal components. Performing a principal component analysis on our data returns a record, which can be queried for the principal components, transformation matrix, singular values and more.

 > pca := PCA(data, summarize = true);
 summary:
 Values    Proportion of variance    St. Deviation
 1.7667    0.4897                    1.3292
 1.5158    0.4201                    1.2312
 0.3254    0.0902                    0.5704
 ${\mathrm{Record\left(values = \left(Vector\left(3, \left\{\left(1\right) = 1.7666797533205745, \left(2\right) = 1.5158476209991987, \left(3\right) = .3253673625223314\right\}, datatype = float\left[8\right]\right)\right), varianceproportion = \left(Vector\left(3, \left\{\left(1\right) = .4896705370253964, \left(2\right) = .4201474077167729, \left(3\right) = 0.9018205525783075e-1\right\}, datatype = float\left[8\right]\right)\right), stdev = \left(Vector\left(3, \left\{\left(1\right) = HFloat\left(1.3291650587194106\right), \left(2\right) = HFloat\left(1.2311976368557562\right), \left(3\right) = HFloat\left(0.5704098197983021\right)\right\}\right)\right), rotation = \left(module DataFrame \left(\right) description "two-dimensional rich data container"; local columns, rows, data, binder; option object\left(BaseDataObject\right); end module\right), principalcomponents = \left(module DataFrame \left(\right) description "two-dimensional rich data container"; local columns, rows, data, binder; option object\left(BaseDataObject\right); end module\right)\right)}}$ (9)
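The individual fields of the returned record can be accessed by name, using the field names visible in the record output, such as varianceproportion and principalcomponents:

 > pca:-varianceproportion;
 > pca:-principalcomponents;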

 > Biplot(pca, pointlabels = true, points = false);

Visualizing DataFrames

Many statistical visualizations support DataFrames and DataSeries:

 > BoxPlot(data);
 > AreaChart(data);

The colorscheme option is useful when a given column has a fixed number of distinct levels. In the following, the points are colored based on the 4 levels {0, 1, 2, 3} found in column "c".

 > ScatterPlot(data[.., 1], data[.., 2], symbolsize = 20, symbol = solidbox, colorscheme = ["valuesplit", data[.., 3], [0 = "Red", 1 = "Blue", 2 = "Green", 3 = "Purple"]]);

Sampling a DataFrame

A typical way of sampling a dataset stored in a Matrix is to create a new empirical distribution based on the dataset and sample it directly. The same process can be used for numeric DataSeries:

 > Sample(EmpiricalDistribution(data[1]), 10);
 $\left[\begin{array}{cccccccccc}6.0& 14.0& 14.0& 4.0& 3.0& 10.0& 20.0& 7.0& 12.0& 5.0\end{array}\right]$ (10)

This method, however, is only valid for numeric DataSeries. For a DataSeries that may contain non-numeric data, the RandomTools[Generate] command can be used to sample the DataSeries instead. This does, however, require converting the DataSeries to a list:

 > RandomTools:-Generate('list'('choose'(convert(data[3], list)), 10));
 $\left[{0}{,}{1}{,}{0}{,}{0}{,}{1}{,}{3}{,}{3}{,}{0}{,}{3}{,}{3}\right]$ (11)
 > RandomTools:-Generate('list'('choose'(convert(DataSeries(["m", "a", "p", "l", "e"]), list)), 10));
 $\left[{"e"}{,}{"e"}{,}{"l"}{,}{"p"}{,}{"p"}{,}{"e"}{,}{"p"}{,}{"p"}{,}{"e"}{,}{"p"}\right]$ (12)

More examples

There are several examples for working with DataFrames and DataSeries:

 • DataFrame Guide : Examples of working with DataFrames
 • Iris Data : Examples of summarizing data, computing aggregate statistics, and principal component analysis
 • Subsets of DataFrames : Examples of indexing and filtering columns and rows of a DataFrame