# My favorite pandas pattern

Date: March 1, 2018

I work with a lot of transactional timeseries data that includes categories. I often want to create timeseries plots with each category as its own line. This is the method that I use almost daily to achieve that result. Typically the data that I am working with changes very slowly, and trends happen over years, not days or weeks. Plotting daily or weekly data tends to be noisy and hide the trend. I use this pattern because it works well with my data and is easy to explain to my stakeholders.

```python
import pandas as pd
import numpy as np
%matplotlib inline
```

## Let's fake some data

Here I am trying to simulate a subset of a large transactional data set. This could be something like sales data, production data, or hourly billing: anything that has a date, a category, and a value. Since we generated this data we know that it is clean. I am still going to assume that it contains some nulls and an irregular date range (a sketch below shows one way to simulate that).

```python
n = 365 * 5
cols = {'level_0': 'date',
        'level_1': 'item',
        0: 'qty'}

data = (pd.DataFrame(np.random.randint(0, 10, size=(n, 4)),
                     columns=['paper', 'pencils', 'note cards', 'markers'],
                     index=pd.date_range('1/1/2017', periods=n, freq='D'),
                     )
        .stack()
        .to_frame()
        .reset_index()
        .rename(columns=cols))
data.head()
```

```
        date        item  qty
0 2017-01-01       paper    1
1 2017-01-01     pencils    4
2 2017-01-01  note cards    5
3 2017-01-01     markers    9
4 2017-01-02       paper    3
```
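The frame we just built is actually complete and regular, so nothing above exercises that assumption. Here is a minimal sketch of how you could rough the data up to match; the 10% and 5% rates are arbitrary choices for illustration, not part of the pattern.

```python
# Rough up the clean fake data so it looks more like real
# transactions: drop ~10% of the rows (irregular dates) and
# blank out ~5% of the remaining quantities (nulls).
messy = data.sample(frac=0.9, random_state=42).sort_index()
messy['qty'] = messy['qty'].astype('float')  # int columns can't hold NaN
null_mask = np.random.rand(len(messy)) < 0.05
messy.loc[null_mask, 'qty'] = np.nan
messy.isna().sum()
```

The groupby, resample, and fillna steps in the pattern below are exactly what absorb this kind of mess.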
## The pattern

Here is what the chain does:

1. Group by date and item; this takes care of duplicate entries with the same time stamp.
2. Select the value I want to sum on.
3. `unstack` the item index level into columns.
4. Resample the data by month. I could easily use any of the [available rules](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases).
5. Fill any missing months with 0, since there wasn't a transaction during that month.
6. Apply a 12-month rolling window to get the annual sum. I find that this grounds the chart in values my stakeholders are used to seeing on a regular basis and reduces the need for them to recalculate in their heads.
7. Drop the nulls created by the rolling window in the first 11 rows.

```python
plot_data = (data
             .groupby(['date', 'item'])
             .sum()
             ['qty']
             .unstack()
             .resample('M')
             .sum()
             .fillna(0)
             .rolling(12)
             .sum()
             .dropna()
             )
plot_data.head()
```

```
item        markers  note cards   paper  pencils
date
2017-12-31   1543.0      1739.0  1613.0   1657.0
2018-01-31   1572.0      1744.0  1635.0   1635.0
2018-02-28   1563.0      1717.0  1645.0   1645.0
2018-03-31   1596.0      1703.0  1629.0   1600.0
2018-04-30   1557.0      1693.0  1648.0   1581.0
```

```python
plot_data.plot(title='Rolling annual sum of Categorical Random Data');
```

## For the Visual Learners

### Groupby

```python
plot_data = (data
             .groupby(['date', 'item'])
             .sum()
             )
plot_data.head()
```

```
                       qty
date       item
2017-01-01 markers       9
           note cards    5
           paper         1
           pencils       4
2017-01-02 markers       4
```

### Select Values

In this case I chose to select the column here to avoid working with the multiple column levels that would otherwise be created in the `unstack()` step.

```python
plot_data = plot_data['qty']
plot_data.head()
```

```
date        item
2017-01-01  markers       9
            note cards    5
            paper         1
            pencils       4
2017-01-02  markers       4
Name: qty, dtype: int32
```

### unstack

Transform the last level of the index (`item`) into columns.

```python
plot_data = plot_data.unstack()
plot_data.head()
```

```
item        markers  note cards  paper  pencils
date
2017-01-01        9           5      1        4
2017-01-02        4           2      3        7
2017-01-03        9           5      2        3
2017-01-04        2           0      0        5
2017-01-05        0           1      6        2
```

### resample

This step is important for irregular data: it gets the observations into regular intervals, one row per month here.

```python
plot_data = plot_data.resample('M').sum()
plot_data.head()
```

```
item        markers  note cards  paper  pencils
date
2017-01-31      145         128    117      146
2017-02-28      136         140    133      135
2017-03-31      112         145    125      163
2017-04-30      143         148    112      147
2017-05-31       86         134    139      141
```
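Like I said above, any of the available offset aliases will drop in here. As a quick sketch (rebuilding the daily frame first, since `plot_data` is already monthly at this point in the walkthrough):

```python
# The resample rule controls the bucket size: 'W' is week-end
# and 'Q' is quarter-end in pandas' offset aliases.
daily = data.groupby(['date', 'item']).sum()['qty'].unstack()
weekly = daily.resample('W').sum()
quarterly = daily.resample('Q').sum()
quarterly.head()
```

A longer rule means fewer, smoother points; pick one that matches how often your stakeholders think about the numbers.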
### rolling

I like to use rolling because it gets the data into annual numbers and reduces noise. I have found that most of my datasets have patterns and trends that play out over more than a year; that is just the industry I am in. Play with the resample and rolling rules to fit the needs of your own data.

```python
plot_data = plot_data.rolling(12).sum()
plot_data.head(20)
```

```
item        markers  note cards   paper  pencils
date
2017-01-31      NaN         NaN     NaN      NaN
2017-02-28      NaN         NaN     NaN      NaN
2017-03-31      NaN         NaN     NaN      NaN
2017-04-30      NaN         NaN     NaN      NaN
2017-05-31      NaN         NaN     NaN      NaN
2017-06-30      NaN         NaN     NaN      NaN
2017-07-31      NaN         NaN     NaN      NaN
2017-08-31      NaN         NaN     NaN      NaN
2017-09-30      NaN         NaN     NaN      NaN
2017-10-31      NaN         NaN     NaN      NaN
2017-11-30      NaN         NaN     NaN      NaN
2017-12-31   1543.0      1739.0  1613.0   1657.0
2018-01-31   1572.0      1744.0  1635.0   1635.0
2018-02-28   1563.0      1717.0  1645.0   1645.0
2018-03-31   1596.0      1703.0  1629.0   1600.0
2018-04-30   1557.0      1693.0  1648.0   1581.0
2018-05-31   1624.0      1674.0  1632.0   1592.0
2018-06-30   1582.0      1645.0  1657.0   1593.0
2018-07-31   1662.0      1654.0  1680.0   1613.0
2018-08-31   1654.0      1617.0  1650.0   1616.0
```

### dropna

Get rid of the first 11 null rows that the rolling window created.

```python
plot_data = plot_data.dropna()
plot_data.head(10)
```

```
item        markers  note cards   paper  pencils
date
2017-12-31   1543.0      1739.0  1613.0   1657.0
2018-01-31   1572.0      1744.0  1635.0   1635.0
2018-02-28   1563.0      1717.0  1645.0   1645.0
2018-03-31   1596.0      1703.0  1629.0   1600.0
2018-04-30   1557.0      1693.0  1648.0   1581.0
2018-05-31   1624.0      1674.0  1632.0   1592.0
2018-06-30   1582.0      1645.0  1657.0   1593.0
2018-07-31   1662.0      1654.0  1680.0   1613.0
2018-08-31   1654.0      1617.0  1650.0   1616.0
2018-09-30   1669.0      1648.0  1638.0   1634.0
```
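If you end up using this chain as often as I do, it is worth wrapping in a small helper. This is just a sketch; the function name and defaults are placeholders, so adjust them to your data:

```python
def rolling_category_sum(df, date_col='date', cat_col='item',
                         value_col='qty', rule='M', window=12):
    """The groupby/unstack/resample/rolling chain from above,
    with the resample rule and window size as parameters."""
    return (df
            .groupby([date_col, cat_col])
            .sum()
            [value_col]
            .unstack()
            .resample(rule)
            .sum()
            .fillna(0)
            .rolling(window)
            .sum()
            .dropna()
            )

plot_data = rolling_category_sum(data)
plot_data.plot(title='Rolling annual sum of Categorical Random Data');
```

Keeping `rule` and `window` as keyword arguments puts the "play with the rules" advice one argument away when the trend in your data runs on a different clock.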