Jürgen Hermann

Posted on Mar 17, 2020 • Originally published at jhermann.github.io on Mar 17, 2020

Using R-style Data Pipelines in Notebooks

#python #datascience #jupyter #pandas

Mutate Pandas data frames simply and elegantly with data pipelines.

Overview

This post shows how mutating data frames can be written more elegantly (and thus understood more easily) by using data pipelines. R users know this concept from the dplyr package, and Python offers a similar one named dfply.

Setting the Stage

We start off with some global definitions…

import numpy as np
import pandas as pd

The sample data (about OS package deployments) is read into the raw_data dataframe from a CSV file.

raw_data = pd.read_csv("../assets/data/cmdb-packages.csv", sep=',')

print('♯ of Records: {}\n'.format(len(raw_data)))

for name in raw_data.columns[1:]:
    if not name.startswith('Last '):
        print(name, '=', list(sorted(set(raw_data[name].fillna('')))))

raw_data.head(3).transpose()

♯ of Records: 146

Distribution = ['Debian 8.11', 'Debian 8.6', 'Debian 8.9', 'jessie']
Architecture = ['amd64']
Environment = ['', 'Canary', 'DEV', 'LIVE', 'QA']
Team = ['Automation', 'Big Data', 'Email', 'Ops App1', 'Ops Linux', 'Persistence', 'Platform']
Installed version = ['41.15-2(amd64)', '42.28-2(amd64)', '42.44-1(amd64)', '45.11-1(amd64)', '48.33-1(amd64)']

	0	1	2
CMDB_Id	274656589	153062618	282201163
Distribution	jessie	jessie	jessie
Architecture	amd64	amd64	amd64
Environment	DEV	DEV	LIVE
Team	Ops App1	Ops App1	Ops App1
Last seen	2019-02-18 11:43	2019-02-18 11:56	2019-02-18 12:04
Last modified	2019-02-18 11:43	2019-02-18 11:56	2019-02-18 12:04
Installed version	42.28-2(amd64)	42.28-2(amd64)	48.33-1(amd64)

The map_distro() helper function is used in the following sections to clean up the Distribution column.

def map_distro(name):
    """Helper to create canonical OS names."""
    return (name.split('.', 1)[0]
        .replace('Debian 7', 'wheezy')
        .replace('Debian 8', 'jessie')
        .replace('Debian 9', 'stretch')
        .replace('Debian 10', 'buster')
        .replace('squeeze', 'Squeeze [6]')
        .replace('wheezy', 'Wheezy [7]')
        .replace('jessie', 'Jessie [8]')
        .replace('stretch', 'Stretch [9]')
        .replace('buster', 'Buster [10]')
    )

Data Cleaning With Pandas

This code cleans up the imported data using the Pandas API.

To get sensible version statistics, we split off the auxiliary information in the version column (anything after -), leaving just the upstream part of the version string. The environment classifier is also cleaned up a little, and distributions are mapped to a canonical set of names. Some unused columns are dropped.

Finally, a subset of unique version samples is selected.

data = raw_data
data = data.assign(Version=data['Installed version'].str.split('-', 1, expand=True)[0])
data = data.assign(Environment=data.Environment.fillna('UNDEFINED').str.upper())
data = data.assign(Distribution=data.Distribution.apply(map_distro))
data = data.drop(columns=['CMDB_Id', 'Last seen', 'Last modified', 'Installed version'])
data = data.drop_duplicates(subset='Version', keep='first')

data.transpose()

	0	2	26	45	62
Distribution	Jessie [8]	Jessie [8]	Jessie [8]	Jessie [8]	Jessie [8]
Architecture	amd64	amd64	amd64	amd64	amd64
Environment	DEV	LIVE	LIVE	UNDEFINED	DEV
Team	Ops App1	Ops App1	Platform	Email	Platform
Version	42.28	48.33	41.15	45.11	42.44

Data Cleaning With Pipelines

This does the exact same processing as the code above, but is arguably more readable and maintained more easily:

It has less boilerplate, and makes the use of pipelined processing transparent.
Each step clearly states what it does to the data.
When steps are copied into other pipelines, the X placeholder ensures you use the data of this pipeline (the code is more DRY).

from dfply import *

piped = (raw_data
    >> mutate(Version=X['Installed version'].str.split('-', 1, expand=True)[0])
    >> mutate(Environment=X.Environment.fillna('UNDEFINED').str.upper())
    >> mutate(Distribution=X.Distribution.apply(map_distro))
    >> drop(X.CMDB_Id, X['Last seen'], X['Last modified'], X['Installed version'])
    >> distinct(X.Version)
)

piped.transpose()

The result is identical to the pure Pandas code, as expected.

	0	2	26	45	62
Distribution	Jessie [8]	Jessie [8]	Jessie [8]	Jessie [8]	Jessie [8]
Architecture	amd64	amd64	amd64	amd64	amd64
Environment	DEV	LIVE	LIVE	UNDEFINED	DEV
Team	Ops App1	Ops App1	Platform	Email	Platform
Version	42.28	48.33	41.15	45.11	42.44