This post is co-authored with David Katz
Background
While working on a health-related classification project, our team encountered a very large sparse matrix, due to the vast number of health/lab tests that were available. After a meeting with the business domain experts, we understood that our initial data preprocessing for removing missing data (NaN) was faulty.
In this post, we would like to share the pitfall that we experienced and our process for identifying features with missing values in classification problems.
The Pitfall
We had over 2K features across 40K patients. We knew that most of the features had a significant amount of NaNs, so we used common methods - such as scikit-learn's VarianceThreshold and caret's near zero variance function - to remove features with high NaN values.
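For reference, here is a minimal sketch of the scikit-learn side of such a filter (not our exact pipeline; the toy frame and the median fill are assumptions made for the sketch):
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy features: "flat" is constant, "lab_a" varies.
X = pd.DataFrame({"flat": [1, 1, 1, 1], "lab_a": [0.3, np.nan, 0.9, 1.2]})

# Fill NaNs before fitting; depending on the scikit-learn version,
# VarianceThreshold may not accept NaNs directly.
selector = VarianceThreshold(threshold=0.0)  # drop zero-variance features
selector.fit(X.fillna(X.median()))
print(X.columns[selector.get_support()])  # Index(['lab_a'], dtype='object')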
We were left with fewer than 70 features and ran our base model to see if our classifier could predict better than random. After displaying the feature importances from our CatBoost model, some concerns were raised regarding some of the features.
So we went back and did some homework...
While re-analyzing the remaining features, we saw that although they had passed our initial NaN tests, we had not checked the distribution of the NaNs across our classes, i.e. some features had a significant amount of NaNs concentrated in a specific class rather than evenly distributed across our group classifications.
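To make the issue concrete, here is a minimal, hypothetical illustration (toy data, not from our project) of a per-class NaN check with pandas:
import numpy as np
import pandas as pd

# Hypothetical toy data: "lab_x" is 50% missing overall, but its NaNs are
# concentrated in class "b".
toy = pd.DataFrame(
    {
        "label": ["a"] * 4 + ["b"] * 4,
        "lab_x": [1.0, 2.0, np.nan, 3.0, np.nan, np.nan, np.nan, 4.0],
    }
)
print(toy["lab_x"].isna().groupby(toy["label"]).mean())
# a    0.25
# b    0.75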
Data sample
To demonstrate this process let's look at an example dataset - the Horse Colic dataset.
This dataset includes the outcome/survival of horses diagnosed with colic disease based upon their past medical histories.
import pandas as pd
import numpy as np
from itertools import combinations
pd.options.display.float_format = "{:,.2f}".format
names = "surgery,Age,Hospital Number,rectal temperature,pulse,respiratory rate,temperature of extremities,peripheral pulse,mucous membranes,capillary refill time,pain,peristalsis,abdominal distension,nasogastric tube,nasogastric reflux,nasogastric reflux PH,rectal examination,abdomen,packed cell volume,total protein,abdominocentesis appearance,abdomcentesis total protein,outcome,surgical lesion,type of lesion1,type of lesion2,type of lesion3,cp_data"
names = names.replace(" ", "_").split(",")
file_path = (
    "https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv"
)
df = pd.read_csv(file_path, names=names)
label_col = "outcome"
def preprocess_df(df):
    df.columns = [c.strip() for c in df.columns]
    df = df.replace("?", np.nan).replace("nan", np.nan)
    df = df[~(df["outcome"].isna())].copy()  # clean up label column
    df.index.name = "ID"
    df[label_col] = (
        df[label_col]
        .astype(str)
        .replace({"1": "lived", "2": "died", "3": "euthanized"})
    )
    return df
df = preprocess_df(df)
print(f"df.shape: {df.shape}")
df.head(2)
df.shape: (299, 28)
surgery | Age | Hospital_Number | rectal_temperature | pulse | respiratory_rate | temperature_of_extremities | peripheral_pulse | mucous_membranes | capillary_refill_time | ... | packed_cell_volume | total_protein | abdominocentesis_appearance | abdomcentesis_total_protein | outcome | surgical_lesion | type_of_lesion1 | type_of_lesion2 | type_of_lesion3 | cp_data | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ID | |||||||||||||||||||||
0 | 2 | 1 | 530101 | 38.50 | 66 | 28 | 3 | 3 | NaN | 2 | ... | 45.00 | 8.40 | NaN | NaN | died | 2 | 11300 | 0 | 0 | 2 |
1 | 1 | 1 | 534817 | 39.2 | 88 | 20 | NaN | NaN | 4 | 1 | ... | 50 | 85 | 2 | 2 | euthanized | 2 | 2208 | 0 | 0 | 2 |
2 rows × 28 columns
First, we will check how many features have NaN values.
# There are 28 features in this dataset
def number_of_features_with_NaN(df):
    _s_na = df.isna().sum()
    return len(_s_na[_s_na > 0])
print(f"The number features with NaN values are {number_of_features_with_NaN(df)}")
The number features with NaN values are 19
Near Zero Variance
Simulating our initial workflow, we will remove features with NaN values using our implementation of the near zero variance function from R's caret library (with its default values).
def near_zero_variance(df, cols=None, frq_cut=95 / 5, unique_cut=10):
    # Mimics caret's nearZeroVar: flag columns whose most frequent value
    # dominates the second most frequent one and that have few unique values.
    if not cols:
        cols = df.columns
    drop_cols = []
    for col in cols:
        val_count = list(
            df[col].value_counts(dropna=False, normalize=True).to_dict().items()
        )
        if len(val_count) == 1:
            drop_cols.append(col)
            continue
        lunique = len(val_count)
        percent_unique = 100 * lunique / len(df[col])
        freq_ratio = val_count[0][1] / (val_count[1][1] + 1e-5)
        if (freq_ratio > frq_cut) & (percent_unique <= unique_cut):
            drop_cols.append(col)
    return df[[c for c in df.columns if c not in drop_cols]]
df_nzr = near_zero_variance(df)
print(
    f"After processing the dataset via `near_zero_variance` we are left with {number_of_features_with_NaN(df_nzr)} features with NaN values.\n"
)
After processing the dataset via `near_zero_variance` we are left with 18 features with NaN values.
Deeper NaN Analysis
Since we are only interested in understanding the missing values in the dataset, we can view how many NaN values are in the various features.
Let's now plot the remaining features relative to the percentage of NaNs in each feature.
def plot_percent_nan_in_features(df):
    # For each threshold, count how many features have a NaN fraction at or
    # above that threshold.
    nan = df.isna().sum() / len(df)
    feature_nan_threshold = {
        round(threshold, 2): len(df.columns) - len(nan[nan < threshold])
        for threshold in np.arange(0, 1.01, 0.05)
    }
    _df = pd.DataFrame.from_dict(
        feature_nan_threshold, orient="index", columns=["num_of_features"]
    )
    _df.plot(
        kind="bar",
        ylabel="Number of Features",
        xlabel="Percentage of NaNs in feature",
        figsize=(10, 5),
        grid=True,
    )
plot_percent_nan_in_features(df_nzr)
From the above plot, we can see that 2 features have more than 40% NaNs and 6 features have more than 25% NaNs.
Some of the features clearly have a very high percentage of NaN values.
Let's now remove these problematic features.
For this example we will set a threshold of 35%.
threshold_max_na = 0.35
def drop_features_above_threshold_max_na(df, threshold_max_na=threshold_max_na):
    nan = df.isna().sum() / len(df)
    nan_threshold = nan[nan > threshold_max_na]
    df = df.drop(columns=nan_threshold.index).copy()
    return df
df_nzr_threshold = drop_features_above_threshold_max_na(df_nzr)
print(
    f"After drop_features_above_threshold_max_na we are left with {number_of_features_with_NaN(df_nzr_threshold)} features with NaNs"
)
After drop_features_above_threshold_max_na we are left with 14 features with NaNs
We may assume that we have removed the problematic features and can now impute our NaN data and run our pipeline/model.
But before we do so, let's look a bit more closely at our classification.
NaN Distribution Among the Classifier Labels
Looking at the classifier label feature outcome, we can see the following distribution:
def create_value_counts_df(df, s2=pd.Series(dtype=float), col=None, check_na=False):
    # Summarise either a column's value counts or the per-feature NaN counts,
    # as absolute numbers and percentages.
    if col:
        s1 = df[col].value_counts(dropna=False)
    if check_na:
        s1 = df.isna().sum()
        s2 = df.isna().sum() / len(df)
    if s2.empty:
        s2 = df[col].value_counts(normalize=True, dropna=False)
    df = pd.concat([s1, s2], axis=1)
    df.columns = ["num", "percent"]
    return df.sort_values(by="percent", ascending=False)
df_labels_counts = create_value_counts_df(df, col=label_col)
df_labels_counts
num | percent | |
---|---|---|
lived | 178 | 0.60 |
died | 77 | 0.26 |
euthanized | 44 | 0.15 |
We can see that the distribution of the classes is uneven.
Our class distribution is approximately 60%, 26% and 15% between the lived, died and euthanized classes (respectively).
But how are our NaNs distributed?
What is the distribution of NaNs in each feature with relation to our classification field?
def get_na_per_label(df, label_col, labels=df_labels_counts.index):
    # For every feature, count the NaNs overall and per class, then convert
    # the per-class counts to percentages of that class's size.
    cols = [c for c in df.columns if c != label_col]
    sum_na = [
        [df[col].isna().sum()]
        + [
            df[df[label_col] == label][col].isna().sum()
            for label in df_labels_counts.index.tolist()
        ]
        for col in cols
    ]
    df_sum_na = pd.DataFrame(
        columns=[f"{col}_sum_na" for col in ["all"] + df_labels_counts.index.tolist()],
        data=sum_na,
        index=cols,
    )
    df_sum_na["all_percentage_na"] = df.isna().sum() / len(df)
    for label in labels:
        df_sum_na[f"{label}_percentage_na"] = (
            df_sum_na[f"{label}_sum_na"] / df_labels_counts.loc[label, "num"]
        )
    return df_sum_na
def get_na_cols(df):
    # Columns that contain at least one NaN, plus the label column.
    _s_na = df.isna().sum()
    na_cols = _s_na[_s_na > 0].index
    na_cols = list(na_cols) + [label_col]
    return na_cols
na_cols = get_na_cols(df_nzr_threshold)
df_sum_na = get_na_per_label(df_nzr_threshold[na_cols], label_col)
df_sum_na
all_sum_na | lived_sum_na | died_sum_na | euthanized_sum_na | all_percentage_na | lived_percentage_na | died_percentage_na | euthanized_percentage_na | |
---|---|---|---|---|---|---|---|---|
rectal_temperature | 60 | 26 | 24 | 10 | 0.20 | 0.15 | 0.31 | 0.23 |
pulse | 24 | 12 | 11 | 1 | 0.08 | 0.07 | 0.14 | 0.02 |
respiratory_rate | 58 | 31 | 19 | 8 | 0.19 | 0.17 | 0.25 | 0.18 |
temperature_of_extremities | 56 | 32 | 13 | 11 | 0.19 | 0.18 | 0.17 | 0.25 |
peripheral_pulse | 69 | 39 | 18 | 12 | 0.23 | 0.22 | 0.23 | 0.27 |
mucous_membranes | 47 | 28 | 11 | 8 | 0.16 | 0.16 | 0.14 | 0.18 |
capillary_refill_time | 32 | 19 | 10 | 3 | 0.11 | 0.11 | 0.13 | 0.07 |
pain | 55 | 34 | 12 | 9 | 0.18 | 0.19 | 0.16 | 0.20 |
peristalsis | 44 | 22 | 15 | 7 | 0.15 | 0.12 | 0.19 | 0.16 |
abdominal_distension | 56 | 31 | 14 | 11 | 0.19 | 0.17 | 0.18 | 0.25 |
nasogastric_tube | 104 | 62 | 25 | 17 | 0.35 | 0.35 | 0.32 | 0.39 |
rectal_examination | 102 | 56 | 26 | 20 | 0.34 | 0.31 | 0.34 | 0.45 |
packed_cell_volume | 29 | 13 | 8 | 8 | 0.10 | 0.07 | 0.10 | 0.18 |
total_protein | 33 | 13 | 12 | 8 | 0.11 | 0.07 | 0.16 | 0.18 |
We can see that the distributions of the NaNs across the classification field are not even, e.g. the rectal_temperature feature has roughly twice the percentage of NaNs in the died class (31%) as in the lived class (15%).
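We can read this ratio directly off df_sum_na, for example:
# Ratio of NaN percentages between the died and lived classes for rectal_temperature.
ratio = (
    df_sum_na.loc["rectal_temperature", "died_percentage_na"]
    / df_sum_na.loc["rectal_temperature", "lived_percentage_na"]
)
print(f"died/lived NaN ratio for rectal_temperature: {ratio:.2f}")  # roughly 2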
Let's assume that we don't want to remove any feature that has less than 15% NaNs, regardless of how the NaNs are distributed across the classification field.
threshold_min_na = 0.15
mask_threshold_min_na = df_sum_na["all_percentage_na"] > threshold_min_na
df_sum_min_na = df_sum_na[mask_threshold_min_na].copy()
df_sum_min_na
all_sum_na | lived_sum_na | died_sum_na | euthanized_sum_na | all_percentage_na | lived_percentage_na | died_percentage_na | euthanized_percentage_na | |
---|---|---|---|---|---|---|---|---|
rectal_temperature | 60 | 26 | 24 | 10 | 0.20 | 0.15 | 0.31 | 0.23 |
respiratory_rate | 58 | 31 | 19 | 8 | 0.19 | 0.17 | 0.25 | 0.18 |
temperature_of_extremities | 56 | 32 | 13 | 11 | 0.19 | 0.18 | 0.17 | 0.25 |
peripheral_pulse | 69 | 39 | 18 | 12 | 0.23 | 0.22 | 0.23 | 0.27 |
mucous_membranes | 47 | 28 | 11 | 8 | 0.16 | 0.16 | 0.14 | 0.18 |
pain | 55 | 34 | 12 | 9 | 0.18 | 0.19 | 0.16 | 0.20 |
abdominal_distension | 56 | 31 | 14 | 11 | 0.19 | 0.17 | 0.18 | 0.25 |
nasogastric_tube | 104 | 62 | 25 | 17 | 0.35 | 0.35 | 0.32 | 0.39 |
rectal_examination | 102 | 56 | 26 | 20 | 0.34 | 0.31 | 0.34 | 0.45 |
Now we can further analyse our data. Let's look at the ratios of the NaN percentages between the classes.
def create_ratio_between_classes(df, label_classes):
    # For every pair of classes, compute the ratio of their NaN percentages.
    for label_a, label_b in combinations(label_classes, 2):
        df[f"ratio_percentage_{label_a}_{label_b}"] = (
            df[f"{label_a}_percentage_na"] / df[f"{label_b}_percentage_na"]
        )
    col_ratio = [col for col in df.columns if "ratio_percentage" in col]
    return df[col_ratio]
create_ratio_between_classes(df_sum_min_na, df_labels_counts.index).plot(
    figsize=(12, 5), rot=45, grid=True
);
Ratios close to 1 indicate that two classes have a similar percentage of NaNs, whereas ratios further from 1 indicate that the NaNs are unevenly distributed across the classification field.
We can set a lower and an upper threshold for filtering out the problematic features: when all of a feature's ratios fall between these limits we keep the feature; when any ratio falls outside them we assume that the NaNs are unevenly distributed and the feature should be removed.
def get_features_outside_threshold(
    df,
    lt_ratio_threshold=0.7,
    gt_ratio_threshold=1.3,
):
    # Flag a feature when any of its class-to-class NaN ratios falls outside
    # the [lt_ratio_threshold, gt_ratio_threshold] band.
    features_to_drop = []
    for col in [col for col in df.columns if "ratio" in col]:
        mask_ratio_threshold = ~(
            df[col].between(lt_ratio_threshold, gt_ratio_threshold)
        )
        features_to_drop.extend(mask_ratio_threshold[mask_ratio_threshold].index)
    features_to_drop = list(set(features_to_drop))
    return features_to_drop
features_to_drop = get_features_outside_threshold(df_sum_min_na)
features_to_drop
['rectal_examination',
'abdominal_distension',
'rectal_temperature',
'temperature_of_extremities',
'respiratory_rate']
df_for_model = df_nzr_threshold.drop(columns=features_to_drop)
print(
    f"After removing the features from the get_features_outside_threshold function we are left with {number_of_features_with_NaN(df_for_model)} features with `NaN`s that we are going to impute"
)
After removing the features from the get_features_outside_threshold function we are left with 9 features with `NaN`s that we are going to impute.
Summary
This post describes the issues we encountered while analysing NaNs for feature selection.
Simple filtering methods do not always perform as expected, and additional care should be taken when working with sparse matrices.
We can analyse the NaNs within features at multiple levels (a rough consolidation is sketched below):
- At the global level - i.e. the total amount of NaNs within a feature (both for removing and for keeping features)
- At the label/classification level - i.e. the relative distribution of NaNs per class.
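As a rough consolidation, reusing the helper functions and the globals (df_labels_counts, label_col) defined above and the same thresholds used in this post, the whole filtering step could be wrapped as:
def filter_nan_features(df, label_col, max_na=0.35, min_na=0.15,
                        lt_ratio=0.7, gt_ratio=1.3):
    # Global level: drop features whose overall NaN percentage is too high.
    df = drop_features_above_threshold_max_na(df, threshold_max_na=max_na)
    # Label level: among the features with enough NaNs to matter, drop those
    # whose per-class NaN ratios fall outside the allowed band.
    df_sum_na = get_na_per_label(df[get_na_cols(df)], label_col)
    df_sum_min_na = df_sum_na[df_sum_na["all_percentage_na"] > min_na].copy()
    ratios = create_ratio_between_classes(df_sum_min_na, df_labels_counts.index)
    to_drop = get_features_outside_threshold(ratios, lt_ratio, gt_ratio)
    return df.drop(columns=to_drop)
df_for_model = filter_nan_features(df_nzr, label_col)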
Finally, we recommend trying out the missingno package for the graphical analysis of NaN values, for example:
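# A minimal missingno example (assumes `pip install missingno`):
import missingno as msno

msno.matrix(df)   # nullity matrix: where the NaNs fall, row by row
msno.bar(df)      # per-feature NaN counts
msno.heatmap(df)  # nullity correlation between features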
The notebook for this post is at this gist link