
John Calabrese

Binary Classifiers

Some ML concepts I regularly forget and have to refresh.

Binary Classifiers

We have data $(x_i, y_i)$ where the inputs $x_i$ are numbers and the outputs $y_i$ are yes/no (encoded as 1 or 0).

Metrics

Once you've trained a model you'll make predictions $x_i \mapsto \hat{y}_i$. Depending on the pair $(\hat{y}_i, y_i)$ you might have a true positive, a true negative, a false positive, or a false negative.
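As a quick sanity check, here's a minimal NumPy sketch that counts the four outcomes from arrays of labels and predictions (the arrays are made up for illustration):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # ground-truth labels
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])  # model predictions

tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted 1, actually 1
tn = np.sum((y_pred == 0) & (y_true == 0))  # predicted 0, actually 0
fp = np.sum((y_pred == 1) & (y_true == 0))  # predicted 1, actually 0
fn = np.sum((y_pred == 0) & (y_true == 1))  # predicted 0, actually 1

print(tp, tn, fp, fn)  # 3 3 1 1
```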

Accuracy

Accuracy measures how many predictions we got right out of the total number of samples. Let $N = \text{TP} + \text{TN} + \text{FP} + \text{FN}$ be the total number of samples we are evaluating. Then accuracy is

$$\text{Accuracy} = \frac{\text{TP}+\text{TN}}{N}$$

This metric is terrible when the data is strongly imbalanced. For example, say we are training a search engine over lots of different documents, so $x$ is a (search query, result) pair and $y$ is the good/bad outcome. It's not unreasonable to expect our data to consist mostly of pairs labeled "bad". A model that always predicts $x \mapsto 0$ would then achieve very high accuracy.
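A minimal sketch of that failure mode (synthetic data, roughly 1% positives): the constant "always predict 0" model scores about 99% accuracy while being completely useless.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)  # ~1% of labels are positive
y_pred = np.zeros_like(y_true)                    # the "always predict 0" model

print(np.mean(y_pred == y_true))  # ~0.99, despite zero true positives
```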

Precision and Recall

High precision means few false positives.

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$

High recall means few false negatives.

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

For example, in a two-stage search engine, the first stage should have high recall (to make sure all the relevant documents make it through, at the cost of letting in some irrelevant ones), and the second stage should have high precision.

F1

Ideally we want a high F1 score, which combines precision and recall into a single number (their harmonic mean).

$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
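Reusing the counts from the earlier sketch, here's a minimal helper computing all three metrics (guarding the denominators against zero):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1(tp=3, fp=1, fn=1))  # (0.75, 0.75, 0.75)
```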

The Threshold

Another issue is that typically our model $f$ doesn't really have a binary output, but rather a "probability" (or score) $f(x_i) \in [0,1]$. So it's up to us to decide on a cutoff $t$: if $f(x_i) > t$ we predict 1, otherwise 0. How do we decide on a cutoff? And is there a cutoff-independent way of evaluating our model?

  • The cutoff can be chosen depending on whether you favor precision or recall; see the sketch below.
  • Models can be evaluated independently of the cutoff via the ROC-AUC, or by looking at the precision-recall curve.
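As a rough illustration of both points, you can sweep the cutoff over a model's scores and trace out precision/recall pairs; the precision-recall curve is exactly this, computed at every possible cutoff. (The scores below are made up.)

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])  # hypothetical model scores
y_true = np.array([1,   1,   0,   1,   0,   1,   0,   0])

for t in (0.25, 0.5, 0.75):
    y_pred = (scores > t).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    print(f"t={t:.2f}  precision={tp / (tp + fp):.2f}  recall={tp / (tp + fn):.2f}")
```

Raising the cutoff trades recall for precision, which is exactly the choice in the first bullet.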

Logistic Regression

Possibly the second most basic binary classifier, after nearest neighbors. You have data $x_i \in \mathbb{R}^d$, $y_i \in \{0,1\}$. You apply a linear map to $x$ to produce a scalar, i.e. you take the dot product with some weight vector: $f(x, w) = w \cdot x$. We then hit this with a sigmoid to get a number between 0 and 1. The sigmoid is

$$\sigma(t) = \frac{e^t}{1 + e^t}$$

so that $\hat{y} = \sigma(f(x, w)) \in [0,1]$.
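In code the forward pass is just a dot product and a sigmoid; a minimal sketch with arbitrary weights:

```python
import numpy as np

def sigmoid(t):
    # 1 / (1 + e^-t) is the same function as e^t / (1 + e^t)
    return 1.0 / (1.0 + np.exp(-t))

w = np.array([0.5, -1.0, 2.0])  # hypothetical learned weights
x = np.array([1.0, 0.3, 0.8])   # a single input

y_hat = sigmoid(np.dot(w, x))   # a score in (0, 1); ~0.86 for these numbers
print(y_hat)
```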

What loss function is appropriate here? We should interpret $\hat{y}$ as the probability of the label being 1, in some sense.
More formally, if we treat our inputs and outputs as random variables, the idea is that $\sigma(f(X, w)) = E[Y \mid X]$. Since $Y$ only takes the values 0 and 1, it must be a Bernoulli random variable, whose likelihood is

$$p^Y (1-p)^{1-Y}$$

where $p = E[Y \mid X]$. Following the maximum likelihood principle, given our data $(x_i, y_i)$ the best $w$ is the one maximizing the likelihood, which (assuming as is customary that our samples are iid) is

$$w^* = \argmax_w \prod_i \hat{y_i}^{y_i} (1 - \hat{y_i})^{1-y_i}$$

and since log is monotone (and negating the objective turns a max into a min), this is equivalent to

$$w^* = \argmin_w \sum_i -y_i \log(\hat{y_i}) - (1-y_i) \log(1-\hat{y_i})$$

which is the so-called binary cross-entropy loss.
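To make the loss concrete, here's a minimal end-to-end sketch: synthetic data, plain gradient descent on the mean binary cross-entropy, no bias term or regularization.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))      # 200 samples, 2 features
w_true = np.array([2.0, -3.0])     # the weights we hope to recover
y = (rng.random(200) < 1 / (1 + np.exp(-X @ w_true))).astype(float)  # Bernoulli labels

w = np.zeros(2)
lr = 0.1
for _ in range(500):
    y_hat = 1 / (1 + np.exp(-X @ w))       # sigmoid(f(x, w)) for every sample
    grad = X.T @ (y_hat - y) / len(y)      # gradient of the mean BCE w.r.t. w
    w -= lr * grad

bce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(w, bce)  # w should land near w_true
```

The gradient has the pleasant closed form $X^\top(\hat{y} - y)/n$, which falls out of differentiating the loss above through the sigmoid.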
