Restructuring Data#
Restructuring and aggregation are two basic methods for extracting statistical information from data. We start with groupwise aggregation and then discuss several forms of restructuring without and with additional aggregation.
import pandas as pd
Grouping#
Grouping is the first step in the so-called split-apply-combine procedure in data processing. Data is split into groups by some criterion, then some function is applied to each group, finally results get (re-)combinded. Typical functions in the apply step are sum or mean (more general: aggregation) or any type of transform or filtering functions (drop groups containing nan items, for instance).
This chapter follows the structure of the Pandas user guide, but leaves out sections on very specific details. Feel free to have a look at those details later on.
Splitting into Groups and Basic Usage#
Grouping is done by calling the groupby method of a series or data frame. It takes a column label or a list of column labels as argument and returns a SeriesGroupBy or DataFrameGroupBy object. The returned object represents a kind of list of groups, each group being a small series or data frame. All rows in a group have identical values in the columns used for grouping.
The ...GroupBy object offers several methods for working with the determined groups. Iterating over such objects is possible, too.
Grouping by one column and subsequent aggregation yields an index with values from the column used for grouping:
df = pd.DataFrame({'age': [2, 3, 3, 2, 4, 5, 5, 5],
'score': [2.3, 4.5, 3.4, 2.0, 5.4, 7.2, 2.8, 3.9]})
display(df)
g = df.groupby('age')
for name, group in g:
print('age:', name)
display(group)
df_means = g.mean()
df_means
| age | score | |
|---|---|---|
| 0 | 2 | 2.3 |
| 1 | 3 | 4.5 |
| 2 | 3 | 3.4 |
| 3 | 2 | 2.0 |
| 4 | 4 | 5.4 |
| 5 | 5 | 7.2 |
| 6 | 5 | 2.8 |
| 7 | 5 | 3.9 |
age: 2
| age | score | |
|---|---|---|
| 0 | 2 | 2.3 |
| 3 | 2 | 2.0 |
age: 3
| age | score | |
|---|---|---|
| 1 | 3 | 4.5 |
| 2 | 3 | 3.4 |
age: 4
| age | score | |
|---|---|---|
| 4 | 4 | 5.4 |
age: 5
| age | score | |
|---|---|---|
| 5 | 5 | 7.2 |
| 6 | 5 | 2.8 |
| 7 | 5 | 3.9 |
| score | |
|---|---|
| age | |
| 2 | 2.150000 |
| 3 | 3.950000 |
| 4 | 5.400000 |
| 5 | 4.633333 |
Grouping by two columns and subsequent aggregation yields a multi-level index:
df = pd.DataFrame({'age': [2, 3, 3, 2, 4, 5, 5, 5],
'answer': ['yes', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no'],
'score': [2.3, 4.5, 3.4, 2.0, 5.4, 7.2, 2.8, 3.9]})
display(df)
g = df.groupby(['age', 'answer'])
for name, group in g:
print('age:', name[0])
print('answer:', name[1])
display(group)
df_means = g.mean()
display(df_means)
| age | answer | score | |
|---|---|---|---|
| 0 | 2 | yes | 2.3 |
| 1 | 3 | no | 4.5 |
| 2 | 3 | no | 3.4 |
| 3 | 2 | no | 2.0 |
| 4 | 4 | no | 5.4 |
| 5 | 5 | yes | 7.2 |
| 6 | 5 | yes | 2.8 |
| 7 | 5 | no | 3.9 |
age: 2
answer: no
| age | answer | score | |
|---|---|---|---|
| 3 | 2 | no | 2.0 |
age: 2
answer: yes
| age | answer | score | |
|---|---|---|---|
| 0 | 2 | yes | 2.3 |
age: 3
answer: no
| age | answer | score | |
|---|---|---|---|
| 1 | 3 | no | 4.5 |
| 2 | 3 | no | 3.4 |
age: 4
answer: no
| age | answer | score | |
|---|---|---|---|
| 4 | 4 | no | 5.4 |
age: 5
answer: no
| age | answer | score | |
|---|---|---|---|
| 7 | 5 | no | 3.9 |
age: 5
answer: yes
| age | answer | score | |
|---|---|---|---|
| 5 | 5 | yes | 7.2 |
| 6 | 5 | yes | 2.8 |
| score | ||
|---|---|---|
| age | answer | |
| 2 | no | 2.00 |
| yes | 2.30 | |
| 3 | no | 3.95 |
| 4 | no | 5.40 |
| 5 | no | 3.90 |
| yes | 5.00 |
Grouping by levels of a multi-level index is possible by providing the level argument to groupby.
With get_group we have access to single groups:
g.get_group((5, 'yes'))
| age | answer | score | |
|---|---|---|---|
| 5 | 5 | yes | 7.2 |
| 6 | 5 | yes | 2.8 |
DataFrameGroupBy objects allow for column indexing:
df = pd.DataFrame({'age': [2, 3, 3, 2, 4, 5, 5, 5],
'answer': ['yes', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no'],
'score': [2.3, 4.5, 3.4, 2.0, 5.4, 7.2, 2.8, 3.9]})
display(df)
g = df.groupby('age')
g['answer'].get_group(5)
| age | answer | score | |
|---|---|---|---|
| 0 | 2 | yes | 2.3 |
| 1 | 3 | no | 4.5 |
| 2 | 3 | no | 3.4 |
| 3 | 2 | no | 2.0 |
| 4 | 4 | no | 5.4 |
| 5 | 5 | yes | 7.2 |
| 6 | 5 | yes | 2.8 |
| 7 | 5 | no | 3.9 |
5 yes
6 yes
7 no
Name: answer, dtype: object
Aggregation#
To apply a function to each column of each group use aggregate. It takes a function or a list of functions as argument. Providing a dictionary of column: function pairs allows for column specific functions.
import numpy as np
df = pd.DataFrame({'age': [2, 3, 3, 2, 4, 5, 5, 5],
'answer': ['yes', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no'],
'score': [2.3, 4.5, 3.4, 2.0, 5.4, 7.2, 2.8, 3.9]})
display(df)
g = df.groupby('age')
display(g.aggregate(np.min))
display(g.aggregate([np.min, np.max]))
display(g.aggregate({'answer': np.min, 'score': np.mean}))
| age | answer | score | |
|---|---|---|---|
| 0 | 2 | yes | 2.3 |
| 1 | 3 | no | 4.5 |
| 2 | 3 | no | 3.4 |
| 3 | 2 | no | 2.0 |
| 4 | 4 | no | 5.4 |
| 5 | 5 | yes | 7.2 |
| 6 | 5 | yes | 2.8 |
| 7 | 5 | no | 3.9 |
| answer | score | |
|---|---|---|
| age | ||
| 2 | no | 2.0 |
| 3 | no | 3.4 |
| 4 | no | 5.4 |
| 5 | no | 2.8 |
| answer | score | |||
|---|---|---|---|---|
| amin | amax | amin | amax | |
| age | ||||
| 2 | no | yes | 2.0 | 2.3 |
| 3 | no | no | 3.4 | 4.5 |
| 4 | no | no | 5.4 | 5.4 |
| 5 | no | yes | 2.8 | 7.2 |
| answer | score | |
|---|---|---|
| age | ||
| 2 | no | 2.150000 |
| 3 | no | 3.950000 |
| 4 | no | 5.400000 |
| 5 | no | 4.633333 |
With size we get group sizes.
g.size()
age
2 2
3 2
4 1
5 3
dtype: int64
Many aggregation functions are directly accessible from the ...GroupBy object. Examples are ...GroupBy.sum and ...GroupBy.mean. See Computations / descriptive stats for a complete list.
Transformation#
The transform method allows to transform rows groupwise resulting in a data frame with same shape as the original one.
df = pd.DataFrame({'age': [2, 3, 3, 2, 4, 5, 5, 5],
'answer': ['yes', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no'],
'score': [2.3, 4.5, 3.4, 2.0, 5.4, 7.2, 2.8, 3.9]})
display(df)
g = df.groupby('age')
# substract the groups mean score in each age group
df['score'] = g['score'].transform(lambda score: score - score.mean())
df
| age | answer | score | |
|---|---|---|---|
| 0 | 2 | yes | 2.3 |
| 1 | 3 | no | 4.5 |
| 2 | 3 | no | 3.4 |
| 3 | 2 | no | 2.0 |
| 4 | 4 | no | 5.4 |
| 5 | 5 | yes | 7.2 |
| 6 | 5 | yes | 2.8 |
| 7 | 5 | no | 3.9 |
| age | answer | score | |
|---|---|---|---|
| 0 | 2 | yes | 0.150000 |
| 1 | 3 | no | 0.550000 |
| 2 | 3 | no | -0.550000 |
| 3 | 2 | no | -0.150000 |
| 4 | 4 | no | 0.000000 |
| 5 | 5 | yes | 2.566667 |
| 6 | 5 | yes | -1.833333 |
| 7 | 5 | no | -0.733333 |
Filtering#
To remove groups use filter method. It takes a function as argument and returns a data frame with rows belonging to removed groups removed. The passed function gets the group (series or data frame) and has to return True (keep group) or False (remove group).
df = pd.DataFrame({'age': [2, 3, 3, 2, 4, 5, 5, 5],
'answer': ['yes', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no'],
'score': [2.3, 4.5, 3.4, 2.0, 5.4, 7.2, 2.8, 3.9]})
display(df)
g = df.groupby('age')
g.filter(lambda dfg: dfg['score'].mean() > 4)
| age | answer | score | |
|---|---|---|---|
| 0 | 2 | yes | 2.3 |
| 1 | 3 | no | 4.5 |
| 2 | 3 | no | 3.4 |
| 3 | 2 | no | 2.0 |
| 4 | 4 | no | 5.4 |
| 5 | 5 | yes | 7.2 |
| 6 | 5 | yes | 2.8 |
| 7 | 5 | no | 3.9 |
| age | answer | score | |
|---|---|---|---|
| 4 | 4 | no | 5.4 |
| 5 | 5 | yes | 7.2 |
| 6 | 5 | yes | 2.8 |
| 7 | 5 | no | 3.9 |
Restructuring Without Aggregation#
There are three basic techniques for restructuring data in a data frame:
pivot(interprets two specified columns as row and column index)stack/unstack(move (level of) column index to (level of) row index and vice versa)melt(create new column from some column labels)
Details and graphical illustrations of these technique may be found in Pandas’ user guide (first three sections).
Restructuring With Aggregation#
Pandas supports pivot tables via pivot_table function. Pivot tables are almost the same as pivoting with pivot but allow for multiple values per data cell, which then are aggregated to one value.
Details may be found in Pandas’ user guide.
Similar functionality is provided by crosstab. See Pandas user guide, too.