A Guide to Data Frames
A DataFrame is one of the basic data structures in Maple. Data frames are a list of variables, known as DataSeries, which are displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length, however, each variable can have a different type, such as integer, float, string, name, truefalse, etc., which makes data frames an ideal storage device for heterogeneous data.
When printed, Data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. These row and columns are treated as header meta-information and are not a part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered index.
The following example page shows a number of common tasks when working with data frames.
Creating a new DataFrame
Describing a DataFrame
Indexing entries in data frames
Changing values in a DataFrame
Details on DataFrame structure
How to change column names
Adding a new column or row
Removing a column or row
Selecting or Removing values that match a criteria
Dealing with missing values
Sorting A DataFrame
Filtering data in a DataFrame
What does with do?
Reshaping data frames
Applying a function to columns of a DataFrame
Changing the datatype of a column
Converting a DataFrame to other data types
Creating an empty DataFrame
Importing data into a DataFrame
Plotting data from a DataFrame
Data frames are rich containers for information that can store multiple types of data. For example, the following vectors contain information on three berries: Raspberries, Grapes and Strawberries respectively.
genus := < "Rubus", "Vitis", "Fragaria" >:
Nutrition: Energy per 100g in kJ
energy := < 220, 288, 136 >:
Nutrition: Carbohydrates per 100g in g
carbohydrates := < 11.94, 18.1, 7.68 >:
Tons produced worldwide in 2011
total_tons := < 543421, 58500118, 4594539 >:
Top producing country in 2011
top_producer := < Russia, China, USA >:
A DataFrame can combine all of these different types of data into one structure.
DF := DataFrame( < energy | carbohydrates | total_tons | top_producer | genus > );
It can be observed that in the above DataFrame, the row and column names are equivalent to the index value. The row and column names can be specified using the rows and columns options respectively. It is also useful to specify the types of data in each DataSeries using the datatypes option.
DF := DataFrame( < energy | carbohydrates | total_tons | top_producer | genus >,
columns = [ Energy, Carbohydrates, `Total Tons`, `Top Producer`, Genus ],
rows = [ Raspberry, Grape, Strawberry ],
datatypes = [ integer, float, integer, anything, string ] );
DF≔EnergyCarbohydratesTotal TonsTop ProducerGenusRaspberry22011.9400000000000543421RussiaRubusGrape28818.100000000000058500118ChinaVitisStrawberry1367.680000000000004594539USAFragaria
Note: It is important to remember that any vectors included in the DataFrame should have the same length. If the vectors are of different length, an error is returned.
The Describe command prints a description of the structure of the DataFrame including the number of observations ( rows ), the number of variables ( columns ), as well as the type of each column (if specified). In addition, for numeric columns, the minimum and maximum values are displayed, and for truefalse, string or name columns, the distinct levels are given.
Describe( DF );
DF :: DataFrame: 3 observations for 5 variables
Energy: Type: integer Min: 136.000000 Max: 288.000000
Carbohydrates: Type: float Min: 7.680000 Max: 18.100000
Total Tons: Type: integer Min: 543421.000000 Max: 58500118.000000
Top Producer: Type: anything Tally: [China = 1, USA = 1, Russia = 1]
Genus: Type: string Tally: ["Rubus" = 1, "Vitis" = 1, "Fragaria" = 1]
Unlike matrices, data frames can be indexed using the column or row names. For example, to view the energy in kJ for raspberries:
Note that above, the index follows the convention [ row, column ] and a single value is retrieved. If a DataFrame is indexed by a single argument, this indexes by [ column ] and the respective column is returned. For example, to retrieve values for all berries for the Energy variable:
To return an entire row, for example, the entire row for Raspberries as a DataSeries:
Energy220Carbohydrates11.9400000000000Total Tons543421Top ProducerRussiaGenusRubus
For more examples, see Indexing a DataFrame or the Subsets of DataFrames example worksheet.
Changing values in an existing DataFrame can be done in a similar way to other data structures in Maple; simply assign to the index position that you want to change. For example, to change the value for Carbohydrates for Grape:
DF[Grape, Carbohydrates] := 20:
EnergyCarbohydratesTotal TonsTop ProducerGenusRaspberry22011.9400000000000543421RussiaRubusGrape28820.58500118ChinaVitisStrawberry1367.680000000000004594539USAFragaria
Before moving on, we can set this back to its original value:
DF[2,2] := 18.1:
EnergyCarbohydratesTotal TonsTop ProducerGenusRaspberry22011.9400000000000543421RussiaRubusGrape28818.100000000000058500118ChinaVitisStrawberry1367.680000000000004594539USAFragaria
There are many useful commands for querying information about the structure of data frames. For example, numelems returned the total number of elements in the DataFrame:
numelems( DF );
The upperbound command returns the dimensions of the DataFrame.
upperbound( DF );
To get the number of rows in the DataFrame, put 1 as the second argument to upperbound:
upperbound( DF, 1 );
To get the number of columns in the DataFrame, put 2 as the second argument to upperbound:
upperbound( DF, 2 );
The Datatypes command returns a list of the data type of each of the columns of data:
Datatypes( DF );
Data frames can have both row and column names. In the above example, these were added to the DataFrame using the columns and rows options. To retrieve the current list of column names, the ColumnLabels command can be used:
ColumnLabels( DF );
Energy,Carbohydrates,Total Tons,Top Producer,Genus
The SubsColumnLabel command returns a DataFrame with a modified column label:
DF := SubsColumnLabel( DF, 4, Country );
The Append command appends another column or row to an existing DataFrame. The arguments for the Append command specify the DataFrame to append to, the data to append, and optionally, the orientation and the name for the new data (which can be a DataFrame or DataSeries). The following adds a new column to the DataFrame corresponding to truefalse values for if the given berry is a botanical berry or not:
DF := Append( DF, < false, true, false >, label = `Botanical Berry` );
DF≔EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseGrape28818.100000000000058500118ChinaVitistrueStrawberry1367.680000000000004594539USAFragariafalse
If we have a new row of information on blackberries such as:
NewRow := DataSeries( < 180 | 9.61 | Mexico | "Rubus" | false >,
labels = [ Energy, Carbohydrates, Country, Genus, `Botanical Berry` ] );
This row can be added to the existing DataFrame using the Append command.
DF := Append( DF, NewRow, mode = row, label = Blackberry );
DF≔EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseGrape28818.100000000000058500118ChinaVitistrueStrawberry1367.680000000000004594539USAFragariafalseBlackberry1809.61undefinedMexicoRubusfalse
Note: The value of the Total tons produced in 2011 is unknown for this DataSeries. When this is appended to the original DataFrame, since there is no value for Total Tons, the value undefined is added where the data is unknown. For more on missing values, see the Dealing with missing values section.
The Remove command removes a selected column or row from the DataFrame. This command is very useful when doing further analysis using commands that require the arguments to be purely numeric, such as visualizations in Statistics, etc.
For example, to remove the non-numeric Genus, Country and Botanical Berry columns:
Remove( DF, [ Genus, Country, `Botanical Berry` ] );
To remove a row, use the mode option:
Remove( DF, Grape, mode = row );
EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseStrawberry1367.680000000000004594539USAFragariafalseBlackberry1809.61undefinedMexicoRubusfalse
The remove, select, and selectremove commands are useful when removing (or selecting) rows that contain values that match a given criteria.
For example, the selectremove command returns two results - one that matches the given criteria and one that does not. With the given criteria that the rows contain even values for Total Tons:
selectremove( x->type(x,even), DF, `Total Tons`);
EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryGrape28818.100000000000058500118ChinaVitistrue,EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseStrawberry1367.680000000000004594539USAFragariafalseBlackberry1809.61undefinedMexicoRubusfalse
Some datasets may have missing values. In the DataFrame for blackberries, it can be observed that the value for Total Tons is undefined:
EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseGrape28818.100000000000058500118ChinaVitistrueStrawberry1367.680000000000004594539USAFragariafalseBlackberry1809.61undefinedMexicoRubusfalse
There are several commands for dealing with missing values in data frames including DropMissing and FillMissing.
FillMissing can be used to fill missing values with another value. For example, here we can fill the undefined value in the DataFrame with 0:
FillMissing( DF, 0 );
EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseGrape28818.100000000000058500118ChinaVitistrueStrawberry1367.680000000000004594539USAFragariafalseBlackberry1809.610MexicoRubusfalse
Note: In order to permanently change the value for any missing values, the result of the FillMissing command must be assigned to the DataFrame.
Duplicate rows (or duplicate entries in a given column) do occur in many datasets. Say one was to mistakenly add another row to our DataFrame that contains the same information about blackberries, but the row is added under a misspelled label:
DF := Append( DF, DataFrame( << 180 | 9.61 | Mexico | "Rubus" | false >>,
columns = [ Energy, Carbohydrates, Country, Genus, `Botanical Berry` ],
rows = [ Blackbery ] ) );
DF≔EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseGrape28818.100000000000058500118ChinaVitistrueStrawberry1367.680000000000004594539USAFragariafalseBlackberry1809.61undefinedMexicoRubusfalseBlackbery1809.61undefinedMexicoRubusfalse
By inspecting the DataFrame above, it is easy to see the duplicate entry, but duplicates may not be easy to find by inspection when dealing with larger DataFrames.
The AreDuplicate command returns a DataSeries of truefalse values that indicate if a row has a duplicate or not.
AreDuplicate( DF );
This result can be used to index a DataFrame to show just the duplicate rows:
DF[ AreDuplicate( DF ) ];
EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryBlackberry1809.61undefinedMexicoRubusfalseBlackbery1809.61undefinedMexicoRubusfalse
The AreUnique command can be used to show the unique rows in a DataFrame. If a DataFrame is indexed by the results, the duplicate rows are removed. By default, the AreUnique command marks the first instance of a row as unique and any subsequent instance of a matching row as duplicate.
DF[ AreUnique( DF ) ];
The keep option controls if the first, last, or none of the duplicates is marked as unique.
DF[ AreUnique( DF, keep = last ) ];
EnergyCarbohydratesTotal TonsCountryGenusBotanical BerryRaspberry22011.9400000000000543421RussiaRubusfalseGrape28818.100000000000058500118ChinaVitistrueStrawberry1367.680000000000004594539USAFragariafalseBlackbery1809.61undefinedMexicoRubusfalse
DF := DF[ AreUnique( DF, keep = first ) ];
In the How to change column names section, the ColumnLabels command was used to return the list of variable names. The ColumnLabels command also returns the current order of the columns:
Energy,Carbohydrates,Total Tons,Country,Genus,Botanical Berry
To reorder columns in a DataFrame, use the DataFrame index notation to specify the new order for the columns and reassign this to the existing DataFrame variable.
DF := DF[ [ Genus, Carbohydrates, Energy, Country, `Total Tons`, `Botanical Berry` ] ];
DF≔GenusCarbohydratesEnergyCountryTotal TonsBotanical BerryRaspberryRubus11.9400000000000220Russia543421falseGrapeVitis18.1000000000000288China58500118trueStrawberryFragaria7.68000000000000136USA4594539falseBlackberryRubus9.61180Mexicoundefinedfalse
Note: It is not recommended that DataFrames with strictly numeric indexed column labels are reordered. This may lead to a conflict between the column label and the index position.
In the Reordering columns section, the ColumnLabels command was used to return the current order of the columns. The RowLabels command returns the current order of the rows:
RowLabels( DF );
To reorder rows in a DataFrame, use the DataFrame index notation to specify the new order for the rows and reassign this to the existing DataFrame variable.
DF := DF[ [ Blackberry, Grape, Raspberry, Strawberry ], .. ];
DF≔GenusCarbohydratesEnergyCountryTotal TonsBotanical BerryBlackberryRubus9.61180MexicoundefinedfalseGrapeVitis18.1000000000000288China58500118trueRaspberryRubus11.9400000000000220Russia543421falseStrawberryFragaria7.68000000000000136USA4594539false
Note: It is not recommended that DataFrames with strictly numeric indexed row labels are reordered. This may lead to a conflict between the row label and the index position.
One of the most common tasks when working with data is to order the data by ascending or descending numeric value, alphabetical order, or using some other ordering. The sort command orders a DataFrame by values in a selected column (DataSeries).
For example, to sort the DataFrame in order of ascending carbohydrate value:
sort( DF, Carbohydrates );
GenusCarbohydratesEnergyCountryTotal TonsBotanical BerryStrawberryFragaria7.68000000000000136USA4594539falseBlackberryRubus9.61180MexicoundefinedfalseRaspberryRubus11.9400000000000220Russia543421falseGrapeVitis18.1000000000000288China58500118true
To sort the DataFrame by descending energy value:
sort( DF, Energy, `>` );
GenusCarbohydratesEnergyCountryTotal TonsBotanical BerryGrapeVitis18.1000000000000288China58500118trueRaspberryRubus11.9400000000000220Russia543421falseBlackberryRubus9.61180MexicoundefinedfalseStrawberryFragaria7.68000000000000136USA4594539false
It is also possible to sort using a DataSeries containing strings:
sort( DF, Country );
GenusCarbohydratesEnergyCountryTotal TonsBotanical BerryGrapeVitis18.1000000000000288China58500118trueBlackberryRubus9.61180MexicoundefinedfalseRaspberryRubus11.9400000000000220Russia543421falseStrawberryFragaria7.68000000000000136USA4594539false
A very useful aspect of data frames is that they can be queried for subsets of the DataFrame that match a given query. Queries on data frames return a truth table whose entries are either true, false or FAIL depending on if a given element meets a given criteria.
For example, to return all of the berries that have greater than 10 g of carbohydrates per 100g:
DF[ Carbohydrates ] >~ 10;
The truth table is a DataSeries that can be used to index the main DataFrame. If the DataFrame is indexed by a DataSeries of type truefalse, it returns a filtered DataFrame:
DF[ DF[ Carbohydrates ] >~ 10 ];
GenusCarbohydratesEnergyCountryTotal TonsBotanical BerryGrapeVitis18.1000000000000288China58500118trueRaspberryRubus11.9400000000000220Russia543421false
It is possible to use the add command to tally the values of any column of a DataFrame, including the truth table generated using the above query:
add( DF[ Carbohydrates ] >~ 10 );
This means that for the 4 individuals in the DataFrame, 2 out of 4 have greater then 10 g of carbohydrates per 100 g and 2 out of 4 have less than or equal to 10g of carbohydrates.
More advanced queries can be created by combining several queries into one statement using logical operators such as and, or, etc.
For example, which of the berries have less than 10 g of carbohydrates and greater than 150 kJ of energy?
DF[ DF[ Carbohydrates ] <~ 10 and DF[ Energy ] >~ 150 ];
GenusCarbohydratesEnergyCountryTotal TonsBotanical BerryBlackberryRubus9.61180Mexicoundefinedfalse
For more examples, see the Subsets of DataFrames example page.
The with command creates newly assigned variables that correspond to each of the column names.
with( DF );
Genus,Carbohydrates,Energy,Country,Total Tons,Botanical Berry
These variables are each a DataSeries that has the same row labels as the parent DataFrame.
Binding labels is useful for simplifying the query syntax:
Energy >~ 200;
The unwith command unassigns the column names:
unwith( DF );
The Transpose command returns the transpose of a DataFrame. This turns the variable columns into individual rows and the individual rows into variable columns.
Transpose( DF );
BlackberryGrapeRaspberryStrawberryGenusRubusVitisRubusFragariaCarbohydrates9.6118.100000000000011.94000000000007.68000000000000Energy180288220136CountryMexicoChinaRussiaUSATotal Tonsundefined585001185434214594539Botanical Berryfalsetruefalsefalse
Many top-level commands as well as commands in the Statistics package can be applied directly to data frames, however many commands do not work with data frames. For many functions, there simply is not a uniform application that can be sensibly applied to all the DataSeries in a DataFrame. In cases where one wants to apply a function to only a single DataSeries of a DataFrame, the ~ (element-wise) operator can be useful.
For example, say we wanted to round the values in the Carbohydrates column to the nearest integer value, the round command can be applied in the following way:
To change the values for this column in place, reassign the Carbohydrates column:
DF[Carbohydrates] := round~(DF[Carbohydrates]);
It may also be useful to apply a custom function to a DataSeries. For example, the following converts the values in the Energy column to Joules from KJoules.
(x -> x * 1000)~(DF[Energy]);
The following converts all the country names to strings:
(x -> convert( x, string) )~(DF[Country]);
The datatypes option of the DataFrame constructor specifies the datatypes for each column of the DataFrame. The SubsDatatype command changes the value of the specified datatype and attempts to coerce the data contained in the column to that new datatype.
SubsDatatype( DF, Energy, float );
GenusCarbohydratesEnergyCountryTotal TonsBotanical BerryBlackberryRubus10180.MexicoundefinedfalseGrapeVitis18288.China58500118trueRaspberryRubus12220.Russia543421falseStrawberryFragaria8136.USA4594539false
It is important to note that this is not done in-place; in order to have a permanent change, reassignment is required.
DF := SubsDatatype( DF, Energy, float );
DF≔GenusCarbohydratesEnergyCountryTotal TonsBotanical BerryBlackberryRubus10180.MexicoundefinedfalseGrapeVitis18288.China58500118trueRaspberryRubus12220.Russia543421falseStrawberryFragaria8136.USA4594539false
It may be beneficial from time to time to convert all columns or some columns of a DataFrame to other data types in order to interact with other commands in the Maple language. The convert command can convert a DataFrame to a Matrix, table, Array or a nested list (by supplying the option nested to a conversion to list).
For example, let us convert the numeric columns of the DataFrame to a Matrix:
convert( DF[ [ Carbohydrates, Energy, `Total Tons` ] ], Matrix );
The following converts the columns with strings or names data into a nested list:
convert( DF[ [ Genus, Country ] ], list, nested );
Data frames can also be converted to Arrays:
convert( DF[ `Botanical Berry` ], Array );
In most of these cases, the conversion is lossy; the column names and row names are discarded. The only case that attempts to keep the meta-data is the conversion to table, where each table index is the row name for the entry value:
DFtable := convert( DF[ Genus ], table );
indices( DFtable );
entries( DFtable );
In order to create an empty DataFrame, supply an empty list of lists as the first argument:
NewDF := DataFrame( [], columns = [Column1] );
Observations can then be added to the DataFrame using the Append command. For more details, see the Adding a new column or row section.
NewDF := Append( NewDF, DataFrame( <<1>>, columns = [Column1] ) );
NewDF := Append( NewDF, , label = Column2 );
By default, the Import command returns data frames when importing many tabular file formats such as .csv, .tsv, .xls, or .xlsx files.
PassengerData:=Import( "datasets/air_passengers.csv", base=datadir );
You can use the dataplot command or numerous Statistics visualization commands to plot a DataFrame.
For additional visualization examples, see the Iris Data and Statistics with DataFrames example worksheets.
There are several examples for working with DataFrames and DataSeries:
Indexing a DataFrame : Examples of selecting, rearranging, and reassigning entries from a DataFrame object
Iris Data : Examples of summarizing data, computing aggregate statistics, and principal component analysis
Subsets of DataFrames : Examples of indexing and filtering columns and rows of a DataFrame
Statistics with DataFrames : Examples of using commands from Statistics on DataFrames
Download Help Document