The "Iris" dataset is available in the datasets subdirectory of Maple's data directory. By default, the Import command returns a DataFrame object when importing CSV files.
> IrisData := Import("datasets/iris.csv", base = datadir);
The Import command displays a summary of the first 8 rows of the dataset as well as the row and column labels. This dataframe contains four columns of floating-point data and one column of strings for the plant "Species".
The Describe command prints a brief description of the structure of the imported data:
> Describe( IrisData );
IrisData :: DataFrame: 150 observations for 5 variables
Sepal Length: Type: anything Min: 4.300000 Max: 7.900000
Sepal Width: Type: anything Min: 2.000000 Max: 4.400000
Petal Length: Type: anything Min: 1.000000 Max: 6.900000
Petal Width: Type: anything Min: 0.100000 Max: 2.500000
Species: Type: anything Tally: ["versicolor" = 50, "virginica" = 50, "setosa" = 50]
From the dataframe, you can see that the column labels are:
> CLabels := ColumnLabels( IrisData );
The DataSummary command shows summary statistics for the numeric columns of the dataset:
> interface(displayprecision=4):
> DataSummary( IrisData[ CLabels[1 .. 4] ], summarize = embed ):
(Embedded table: the statistics mean, standarddeviation, skewness, kurtosis, minimum, maximum, and cumulativeweight for each of the columns Sepal Length, Sepal Width, Petal Length, and Petal Width.)
To summarize the column of strings, you can list the distinct elements by collapsing the column into a set:
> convert( IrisData[ Species ], set );
Note that DataSummary returns a summary for all rows of the dataframe. The Aggregate command can be used to give aggregate statistics for the three distinct levels (factors) found in the "Species" column. By default, the Aggregate command returns the mean for each factor:
> Aggregate( IrisData, Species );
Aggregate can return any summary statistic and tally up the number of observations for each factor level:
> Aggregate( IrisData, Species, function = StandardDeviation, tally );
In order to visually detect patterns between variables, the variables can be plotted against one another using the GridPlot command. Note that for the upper triangle in the grid of plots, the colorscheme option is passed to plots:-pointplot using the valuesplit option. The valuesplit option splits the "Species" column into three levels and colors points accordingly.
> GridPlot(IrisData[ CLabels[1 .. 4] ],
      upper = [plots:-pointplot, colorscheme = ["valuesplit", IrisData[Species]], symbol = solidcircle, symbolsize = 20],
      lower = '(x) -> Statistics:-PieChart([" " = abs(x), " " = 1 - abs(x)], color = ["CornflowerBlue", "WhiteSmoke"], title = evalf[3](x), size = [100, 100])',
      correlation = [false, true, false], width = 600, widthmode = pixels);
In the above grid of plots, the lower triangle contains a series of pie charts that indicate the value of the correlation between the corresponding columns. This type of plot is also known as a correlogram. From it, you can see that the "Petal Length" and "Petal Width" columns are highly correlated.
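To check the correlogram reading numerically, the correlation between the two petal columns can also be computed directly. The following is a minimal sketch, not part of the original worksheet. It assumes that IrisData and CLabels are defined as above, that the third and fourth labels are the petal columns (the order shown in the Describe output), and that Statistics:-Correlation accepts plain lists as data samples. The names PetalLength and PetalWidth are introduced here for illustration only.

```maple
# Extract the two petal columns as plain lists
# (mirrors the convert( ..., set ) call used earlier for the Species column).
PetalLength := convert( IrisData[ CLabels[3] ], list ):
PetalWidth := convert( IrisData[ CLabels[4] ], list ):

# Sample correlation coefficient between the two columns.
Statistics:-Correlation( PetalLength, PetalWidth );
```

A result close to 1 would agree with the nearly filled pie chart shown for this pair of columns.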