hmeasure.net

Welcome to H-measure.net

This website is dedicated to the seemingly simple yet deeply challenging problem of measuring classification performance.

What is a classifier?

What is classification performance?

Classifier technology attempts to map objects (or more precisely, descriptions of objects) to labels. For example, we can describe a loan applicant in terms of several attributes, such as their annual income, their home ownership status, and their age, and apply a classifier to this description to decide whether to give a loan or not. Classification has driven much of the recent wave of machine learning success stories.
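As a minimal sketch of the loan example, a classifier can be viewed as a function from a description to a label. The `Applicant` fields, the weights in `score` and the 0.5 threshold below are invented for illustration; a real classifier would learn its scoring rule from data.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    """Description of a loan applicant (hypothetical attribute names)."""
    annual_income: float
    owns_home: bool
    age: int


def score(a: Applicant) -> float:
    """Hand-written score in [0, 1]; higher means more creditworthy.

    The weights are arbitrary assumptions, not a fitted model.
    """
    s = 0.5 * min(a.annual_income / 100_000, 1.0)
    s += 0.3 if a.owns_home else 0.0
    s += 0.2 if a.age >= 25 else 0.0
    return s


def classify(a: Applicant, threshold: float = 0.5) -> str:
    """Map a description to a label by thresholding the score."""
    return "approve" if score(a) >= threshold else "decline"


print(classify(Applicant(annual_income=80_000, owns_home=True, age=40)))
print(classify(Applicant(annual_income=20_000, owns_home=False, age=22)))
```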

What's wrong with error rates?

The most commonly used measure of classification performance is the error rate: the ratio of incorrectly labelled examples to the total number of examples. Though useful, this metric conceals the fact that most classifiers have a tuning parameter that lets the user trade false positives against false negatives (that is, sensitivity against specificity), depending on the use case. Understanding how a classifier performs across this spectrum of possible use cases is key.
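The following sketch illustrates the point with made-up scores and labels: the same classifier, used at two different thresholds, has the same error rate but very different sensitivity and specificity. The data and the function names are assumptions chosen for illustration only.

```python
def confusion_counts(scores, labels, threshold):
    """Count true/false positives and negatives at a given threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summarise(scores, labels, threshold):
    """Error rate, sensitivity and specificity at one threshold."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    return {
        "error_rate": (fp + fn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # share of positives caught
        "specificity": tn / (tn + fp),  # share of negatives cleared
    }


# Toy data: classifier scores and true labels (1 = positive class).
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

for threshold in (0.35, 0.65):
    print(threshold, summarise(scores, labels, threshold))
```

On this toy data both thresholds misclassify one example out of six, yet the lower threshold catches every positive while the higher one clears every negative.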

If an algorithm classifies ten images as either pictures of cats or of dogs, it is straightforward to report how many it got right. In real life, however, things are more nuanced. One might want to estimate accuracy on future examples, or to attach a higher cost to one type of mistake (as is the case, for example, when diagnosing serious conditions). These and other factors have motivated a rich field of study.
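To make the role of unequal costs concrete, here is a small sketch comparing two hypothetical diagnostic classifiers that make the same number of mistakes but of different kinds. The counts and the 10:1 cost ratio are invented for illustration.

```python
def average_cost(false_positives, false_negatives, n,
                 cost_fp=1.0, cost_fn=1.0):
    """Average misclassification cost per example.

    With cost_fp == cost_fn == 1 this reduces to the plain error rate.
    """
    return (cost_fp * false_positives + cost_fn * false_negatives) / n


# Two hypothetical classifiers evaluated on the same 100 patients.
a = {"false_positives": 10, "false_negatives": 2}   # cautious: few missed cases
b = {"false_positives": 2, "false_negatives": 10}   # lax: many missed cases

# Equal costs: both have the same error rate.
print(average_cost(n=100, **a), average_cost(n=100, **b))

# Assume a missed diagnosis is ten times as costly as a false alarm.
print(average_cost(n=100, cost_fn=10.0, **a),
      average_cost(n=100, cost_fn=10.0, **b))
```

Under equal costs the two classifiers are indistinguishable; once a missed diagnosis is assumed to cost more, the cautious classifier is clearly preferable.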

What is the H-measure?

The H-measure is a measure of classification performance proposed by D. J. Hand in this paper. It addresses the problem of capturing performance across multiple potential scenarios. It also proposes a sensible criterion for the coherence of performance metrics, which the H-measure satisfies but which, surprisingly, several popular alternatives do not, notably the Area Under the Curve (AUC) and its variants, such as the Gini coefficient.
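For readers unfamiliar with the alternatives named above, the sketch below computes the AUC and the Gini coefficient on toy data. It does not implement the H-measure. It uses the standard rank interpretation of the AUC (the probability that a randomly chosen positive scores higher than a randomly chosen negative, counting ties as one half) and the relation Gini = 2 * AUC - 1. The data and function names are assumptions for illustration.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def gini(scores, labels):
    """Gini coefficient, a linear rescaling of the AUC."""
    return 2.0 * auc(scores, labels) - 1.0


# Toy data: classifier scores and true labels (1 = positive class).
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

print(auc(scores, labels), gini(scores, labels))
```

Because the Gini coefficient is a fixed linear function of the AUC, the two always rank classifiers identically, which is why any concern about the AUC carries over to the Gini coefficient.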

Why does it matter?

A large number of real-world decisions are being made using classification technology every day: credit card transaction fraud screening, mortgage applications, medical diagnoses and cyber security.
A large number of businesses, large and small, already rely on classification technology to battle fraud, ensure customer retention, improve customer satisfaction via recommender systems and so on.
A large number of academic resources are spent on producing classifiers that outperform alternatives on the basis of certain widely accepted measures of performance, including metrics whose coherence is now in doubt, such as the AUC.

If you can't measure it reliably, you can't improve it.

Get involved

Fork us or contribute on https://github.com/canagnos/mcp

Join our mailing list or our LinkedIn group