Some ML concepts I tend to regularly forget and have to keep refreshing.
Binary Classifiers
We have data $(x_i, y_i)$ where the inputs $x_i$ are (vectors of) numbers, and the outputs $y_i$ are yes/no labels (encoded as 1 or 0).
Metrics
Once you've trained a model you'll make predictions $\hat{y}_i$. Depending on the pair $(y_i, \hat{y}_i)$ you might have: a true positive, a true negative, a false positive, or a false negative.
Accuracy
Accuracy measures how many we got right over how many samples there are. Let $N = TP + TN + FP + FN$ be the total number of samples we are evaluating. Then accuracy is

$$\text{accuracy} = \frac{TP + TN}{N}.$$
This metric is terrible when the data is strongly imbalanced. For example, let's say we are training a search engine with lots of different documents. So $x$ is a pair (search query, result) and $y$ is the outcome good/bad. It's not unreasonable to expect our data to mostly be (query, result) pairs labeled "bad". So we could just have a model that always assigns $\hat{y} = 0$ and achieve super high accuracy.
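As a quick sanity check, here's a minimal sketch of that failure mode (the numbers are made up, assuming numpy is available): a model that always predicts 0 on a 99:1 imbalanced dataset still gets 99% accuracy.

```python
import numpy as np

# Hypothetical imbalanced dataset: 990 "bad" pairs (y=0), 10 "good" pairs (y=1).
y_true = np.array([0] * 990 + [1] * 10)

# A useless "model" that labels everything as bad.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_true == y_pred)
print(f"accuracy = {accuracy:.3f}")  # 0.990 -- looks great, yet the model is useless
```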
Precision and Recall
High precision means few false positives: precision is $P = \frac{TP}{TP + FP}$, the fraction of predicted positives that are actually positive.
High recall means few false negatives: recall is $R = \frac{TP}{TP + FN}$, the fraction of actual positives that we catch.
For example, in a two-stage search engine the first stage should have high recall (to make sure all the relevant documents are there, at the cost of throwing in some irrelevant ones), while the second stage should have high precision.
F1
Ideally we want high F1, which combines both precision and recall as their harmonic mean:

$$F_1 = \frac{2 P R}{P + R}.$$
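Here's a small sketch computing all three metrics straight from the confusion-matrix counts (plain Python; the counts at the end are made up):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall and F1 from confusion-matrix counts (TN is not needed)."""
    precision = tp / (tp + fp)  # few false positives -> high precision
    recall = tp / (tp + fn)     # few false negatives -> high recall
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Made-up counts for a model on the imbalanced example above:
# it finds 8 of the 10 positives but also flags 10 negatives.
print(precision_recall_f1(tp=8, fp=10, fn=2))
# -> (0.444..., 0.8, 0.571...): decent recall, mediocre precision.
```

Note that the always-0 model from the previous example has $TP = FP = 0$, so its precision is an undefined $0/0$: one more sign that accuracy alone was hiding something.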
The Threshold
Another issue is that typically our model doesn't really have a binary output, but rather a "probability" (or score) $p \in [0, 1]$. So then it's up to us to decide a cutoff $t$: if $p < t$ then it's a 0, otherwise it's a 1. How do we decide a cutoff? And is there a cutoff-independent way of evaluating our model?
- The cutoff can be chosen depending on whether you favor precision or recall more.
- Models can be evaluated in a cutoff-independent way by the ROC-AUC, or by looking at the whole precision-recall curve (see the sketch below).
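A sketch of both options, assuming scikit-learn is available (the labels and scores below are made up): `roc_auc_score` gives a single cutoff-free number, while `precision_recall_curve` shows the precision/recall trade-off at every possible cutoff.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve

# Hypothetical labels and model scores for 8 samples.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.5, 0.6, 0.8, 0.9])

# Cutoff-independent summary of how well the scores rank positives above negatives.
print("ROC-AUC:", roc_auc_score(y_true, scores))

# Precision and recall at every candidate cutoff: pick the threshold
# that matches the trade-off you care about.
precision, recall, thresholds = precision_recall_curve(y_true, scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```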
Logistic Regression
Possibly the second most basic binary classifier, after nearest neighbors. You have data $(x_i, y_i)$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$. You apply a linear map to $x$ to produce a scalar, i.e. you take the dot product with some other vector $w \in \mathbb{R}^d$. We then hit this with a sigmoid, to get a number between 0 and 1. The sigmoid is

$$\sigma(z) = \frac{1}{1 + e^{-z}},$$

so that the model outputs

$$\hat{p}(x) = \sigma(w \cdot x).$$
What loss function is appropriate here? We should interpret $\hat{p}(x)$ as being the probability of $y$ being 1, in some sense.
More formally, if we treat our inputs and outputs as random variables, the idea is that $P(Y = 1 \mid X = x) = \sigma(w \cdot x)$. Since $Y$ can only take 0, 1 values it must be a Bernoulli random variable, whose likelihood is

$$P(Y = y \mid X = x) = p^y (1 - p)^{1 - y},$$

where $p = \sigma(w \cdot x)$. Following the maximum likelihood principle, given our data the best $w$ is the one maximizing the likelihood, which is (assuming as is customary that our samples are iid)

$$\prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{1 - y_i}, \quad \text{where } p_i = \sigma(w \cdot x_i),$$
and since log is monotone, this is equivalent to maximizing the log-likelihood, i.e. (flipping the sign and averaging, which doesn't move the optimum) to minimizing

$$-\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \Big],$$

which is the so-called binary cross-entropy loss.
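To make the derivation concrete, here's a minimal from-scratch sketch (the toy data and learning rate are made up; only numpy is assumed) that fits $w$ by gradient descent on the binary cross-entropy loss, whose gradient works out to $\frac{1}{N} \sum_i (p_i - y_i)\, x_i$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two Gaussian blobs in R^2, labeled 0 and 1.
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)  # the weight vector we are fitting
lr = 0.1         # made-up learning rate

for _ in range(500):
    p = sigmoid(X @ w)             # p_i = sigma(w . x_i)
    grad = X.T @ (p - y) / len(y)  # gradient of the mean BCE w.r.t. w
    w -= lr * grad

p = sigmoid(X @ w)
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(f"BCE loss = {bce:.3f}, accuracy = {np.mean((p > 0.5) == y):.3f}")
```

In practice you'd reach for something like `sklearn.linear_model.LogisticRegression`, which fits the same model (plus an intercept and regularization) with a proper solver.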