Some ML concepts I tend to regularly forget and have to keep refreshing.
Binary Classifiers
We have data $(x_i, y_i)$ where the inputs $x_i$ are (vectors of) numbers, and the outputs $y_i$ are yes/no labels (encoded as 1 or 0).
Metrics
Once you've trained a model you'll make predictions $\hat{y}_i$. Depending on the pair $(y_i, \hat{y}_i)$ you might have: a true positive, a true negative, a false positive, or a false negative.
Accuracy
Accuracy measures how many we got right over how many samples there are. Let $N = TP + TN + FP + FN$ be the total number of samples we are evaluating. Then accuracy is

$$\text{accuracy} = \frac{TP + TN}{N}.$$
This metric is terrible when the data is strongly imbalanced. For example, let's say we are training a search engine with lots of different documents. So $x$ is a pair (search query, result) and $y$ is the outcome good/bad. It's not unreasonable to expect our data to mostly be (query, result) pairs labeled "bad". So we could just have a model that always assigns $\hat{y} = 0$ and achieve super high accuracy.
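As a quick sanity check, here's a minimal sketch of that failure mode (the numbers are made up, assuming numpy is available): a model that always predicts 0 on a 99:1 imbalanced dataset still gets 99% accuracy.

```python
import numpy as np

# Hypothetical imbalanced dataset: 990 "bad" pairs (y=0), 10 "good" pairs (y=1).
y_true = np.array([0] * 990 + [1] * 10)

# A useless "model" that labels everything as bad.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_true == y_pred)
print(f"accuracy = {accuracy:.3f}")  # 0.990 -- looks great, yet the model is useless
```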
Precision and Recall
High precision means few false positives: precision is $P = \frac{TP}{TP + FP}$, the fraction of predicted positives that are actually positive.
High recall means few false negatives: recall is $R = \frac{TP}{TP + FN}$, the fraction of actual positives that we catch.
For example, in a two-stage search engine the first stage should have high recall (to make sure all the relevant documents are there, at the cost of throwing in some irrelevant ones), while the second stage should have high precision.
F1
Ideally we want high F1, which combines both precision and recall as their harmonic mean:

$$F_1 = \frac{2 P R}{P + R}.$$
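Here's a small sketch computing all three metrics straight from the confusion-matrix counts (plain Python; the counts at the end are made up):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall and F1 from confusion-matrix counts (TN is not needed)."""
    precision = tp / (tp + fp)  # few false positives -> high precision
    recall = tp / (tp + fn)     # few false negatives -> high recall
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Made-up counts for a model on the imbalanced example above:
# it finds 8 of the 10 positives but also flags 10 negatives.
print(precision_recall_f1(tp=8, fp=10, fn=2))
# -> (0.444..., 0.8, 0.571...): decent recall, mediocre precision.
```

Note that the always-0 model from the previous example has $TP = FP = 0$, so its precision is an undefined $0/0$: one more sign that accuracy alone was hiding something.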
The Threshold
Another issue is that typically our model doesn't really have a binary output, but rather a "probability" (or score) $p \in [0, 1]$. So then it's up to us to decide a cutoff $t$: if $p < t$ then it's a 0, otherwise it's a 1. How do we decide a cutoff? And is there a cutoff-independent way of evaluating our model?
- The cutoff can be chosen depending on whether you favor precision or recall more.
- Models can be evaluated in a cutoff-independent way by the ROC-AUC, or by looking at the whole precision-recall curve (see the sketch below).
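A sketch of both options, assuming scikit-learn is available (the labels and scores below are made up): `roc_auc_score` gives a single cutoff-free number, while `precision_recall_curve` shows the precision/recall trade-off at every possible cutoff.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve

# Hypothetical labels and model scores for 8 samples.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.5, 0.6, 0.8, 0.9])

# Cutoff-independent summary of how well the scores rank positives above negatives.
print("ROC-AUC:", roc_auc_score(y_true, scores))

# Precision and recall at every candidate cutoff: pick the threshold
# that matches the trade-off you care about.
precision, recall, thresholds = precision_recall_curve(y_true, scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```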
Logistic Regression
Possibly the second most basic binary classifier, after nearest neighbors. You have data $(x_i, y_i)$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$. You apply a linear map to $x$ to produce a scalar, i.e. you take the dot product with some other vector $w \in \mathbb{R}^d$. We then hit this with a sigmoid, to get a number between 0 and 1. The sigmoid is

$$\sigma(z) = \frac{1}{1 + e^{-z}},$$

so that the model outputs

$$\hat{p}(x) = \sigma(w \cdot x).$$
What loss function is appropriate here? We should interpret $\hat{p}(x)$ as being the probability of $y$ being 1, in some sense.
More formally, if we treat our inputs and outputs as random variables, the idea is that $P(Y = 1 \mid X = x) = \sigma(w \cdot x)$. Since $Y$ can only take 0, 1 values it must be a Bernoulli random variable, whose likelihood is

$$P(Y = y \mid X = x) = p^y (1 - p)^{1 - y},$$

where $p = \sigma(w \cdot x)$. Following the maximum likelihood principle, given our data the best $w$ is the one maximizing the likelihood, which is (assuming as is customary that our samples are iid)

$$\prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{1 - y_i}, \quad \text{where } p_i = \sigma(w \cdot x_i),$$
and since log is monotone, this is equivalent to maximizing the log-likelihood, i.e. (flipping the sign and averaging, which doesn't move the optimum) to minimizing

$$-\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \Big],$$

which is the so-called binary cross-entropy loss.
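To make the derivation concrete, here's a minimal from-scratch sketch (the toy data and learning rate are made up; only numpy is assumed) that fits $w$ by gradient descent on the binary cross-entropy loss, whose gradient works out to $\frac{1}{N} \sum_i (p_i - y_i)\, x_i$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two Gaussian blobs in R^2, labeled 0 and 1.
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)  # the weight vector we are fitting
lr = 0.1         # made-up learning rate

for _ in range(500):
    p = sigmoid(X @ w)             # p_i = sigma(w . x_i)
    grad = X.T @ (p - y) / len(y)  # gradient of the mean BCE w.r.t. w
    w -= lr * grad

p = sigmoid(X @ w)
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(f"BCE loss = {bce:.3f}, accuracy = {np.mean((p > 0.5) == y):.3f}")
```

In practice you'd reach for something like `sklearn.linear_model.LogisticRegression`, which fits the same model (plus an intercept and regularization) with a proper solver.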