Fairness and accuracy are often described as being in conflict with each other. Why is that? And is this always the case?

1 Introduction

The fairness-utility trade-off is an important concept in the algorithmic fairness literature. It states that enforcing some notion of fairness usually costs accuracy (or ‘utility’). Whether it does depends on the fairness metric used and, more importantly, on the dataset that you have.

We consider the two most popular (and simplest) fairness notions, demographic parity and equality of opportunity. We can see that in the case of a perfectly balanced dataset, there is no trade-off between demographic parity and accuracy. (Or rather there doesn't fundamentally have to be a trade-off. Of course, if you have an imperfect classifier, then a trade-off can still exist.) For equality of opportunity, on the other hand, given any dataset and a perfect classifier, no trade-off exists. So in this case, it is the imperfection of the classifier that induces a trade-off.

Below we take a look at all the different settings one by one and determine whether there is a trade-off. Before we get there, let’s define some terminology.

Terminology

Sensitive attribute: a label, usually denoted by $s$, which corresponds to a protected characteristic which we do not want to base decisions on.

Class label: the prediction target, usually denoted by $y$.

Prediction: output of the classifier, usually denoted by $\hat{y}$.

Ideal classifier: we call a classifier ideal if it perfectly predicts all the test labels. This means it has perfect utility. It is often not possible to have such a classifier because the data is underspecified.

Random classifier: a classifier which predicts all possible classes with equal probability completely at random.

Majority classifier: a classifier which always predicts the class label that occurred most often in the training set.

2 Fairness definitions

Demographic parity (DP)

We start with the simplest fairness definition: demographic parity (DP) or statistical parity. In terms of a sensitive attribute $s$ and predictions $\hat{y}$ for a class label $y$, it is defined as:

$$P(\hat{y}=1|s=0)=P(\hat{y}=1|s=1)$$

That is, the probability of a positive prediction ought to be the same for both demographic groups.

A random classifier will always satisfy DP regardless of the dataset, because its probability of predicting $\hat{y}=1$ does not depend on the data at all, and so cannot depend on $s$ either.

The majority classifier also always satisfies DP for the same reason: its predictions do not depend on anything. $P(\hat{y}=1|s=s')$ will either be 0 or 1 regardless of $s'$.

Of course, these two classifiers have rather poor accuracy, so it’s maybe not surprising that they are able to achieve perfect fairness. For other, higher utility classifiers, the outcome strongly depends on the dataset, as we shall see.
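To make the definition concrete, here is a minimal sketch of how the DP gap could be measured on a set of predictions. The function names (`positive_rate`, `dp_gap`) and the toy data are made up for illustration and assume binary predictions and a binary sensitive attribute.

```python
def positive_rate(y_pred, s, group):
    """Estimate P(y_hat = 1 | s = group) from paired lists."""
    preds = [p for p, g in zip(y_pred, s) if g == group]
    return sum(preds) / len(preds)


def dp_gap(y_pred, s):
    """Absolute difference in positive prediction rates between s=0 and s=1."""
    return abs(positive_rate(y_pred, s, 0) - positive_rate(y_pred, s, 1))


# Toy data (hypothetical): group 0 has 4 samples, group 1 has 6.
s = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0, 0, 0]

gap = dp_gap(y_pred, s)  # |1/4 - 3/6| = 0.25

# A classifier that always predicts the same label has a gap of exactly 0.
constant_gap = dp_gap([0] * len(s), s)
```

A gap of 0 corresponds to exact demographic parity; in practice one would usually accept a small non-zero gap.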

Equality of Opportunity (EOpp)

The next definition of fairness is slightly less intuitive. Equality of Opportunity (EOpp) is about balancing the True Positive Rates (TPRs) of a decision model:

$$P(\hat{y}=1|s=0, y=1) = P(\hat{y}=1|s=1, y=1)$$

This can also be seen as balancing the accuracy on the subset where $y=1$.

The ideal classifier and the majority classifier always satisfy this. In the case of the former, all TPRs are 1, and with the latter, the TPRs are either all 0 or all 1. The case of the random classifier is a bit more interesting. Its predictions are independent of both $s$ and $y$, so its expected TPR is the same in every group, even if one group has more samples with $y=1$ than the others. In expectation, then, the random classifier also satisfies EOpp. On a finite sample, however, the empirical TPRs can still differ by chance, especially for groups with only a few positive samples.
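The statements about the ideal and the constant-prediction classifiers can be checked with a small sketch. The helper names (`tpr`, `eopp_gap`) and the toy labels are assumptions for illustration only.

```python
def tpr(y_true, y_pred, s, group):
    """Estimate P(y_hat = 1 | s = group, y = 1) from paired lists."""
    preds = [p for t, p, g in zip(y_true, y_pred, s) if g == group and t == 1]
    return sum(preds) / len(preds)


def eopp_gap(y_true, y_pred, s):
    """Absolute TPR difference between groups s=0 and s=1."""
    return abs(tpr(y_true, y_pred, s, 0) - tpr(y_true, y_pred, s, 1))


# Toy labels (hypothetical): the majority label is 0 (6 of 10 samples).
s      = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_true = [1, 0, 0, 0, 1, 1, 1, 0, 0, 0]

ideal_gap = eopp_gap(y_true, y_true, s)               # ideal: y_hat == y, TPRs are 1
always_zero_gap = eopp_gap(y_true, [0] * len(s), s)   # majority label here, TPRs are 0
always_one_gap = eopp_gap(y_true, [1] * len(s), s)    # constant 1, TPRs are 1
```

All three gaps are zero, even though the two groups have different base rates in these labels.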

Equalized Odds (EOdds)

Equalized Odds is a generalization of Equality of Opportunity where the rates are not only equal for $y=1$, but for all values of $y$:

$$P(\hat{y}=y'|s=0, y=y') = P(\hat{y}=y'|s=1, y=y')\quad\forall y'$$

This can again be interpreted as balancing the accuracies in the subsets corresponding to $y=y'$.
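As a sketch of how EOdds is stricter than EOpp, the snippet below measures the per-class rates for both groups on made-up predictions. The names `class_rate` and `eodds_gap` are hypothetical, and taking the maximum gap over classes is just one possible way to summarise the violation.

```python
def class_rate(y_true, y_pred, s, group, label):
    """Estimate P(y_hat = label | s = group, y = label) from paired lists."""
    hits = [p == label for t, p, g in zip(y_true, y_pred, s)
            if g == group and t == label]
    return sum(hits) / len(hits)


def eodds_gap(y_true, y_pred, s, labels=(0, 1)):
    """Largest gap between groups s=0 and s=1 over all class labels."""
    return max(
        abs(class_rate(y_true, y_pred, s, 0, label)
            - class_rate(y_true, y_pred, s, 1, label))
        for label in labels
    )


# Toy data (hypothetical): one false positive in group 0, none in group 1.
s      = [0, 0, 0, 0, 1, 1, 1, 1]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

# TPRs match (EOpp holds), but the rates for y=0 differ (EOdds is violated).
tpr_gap = abs(class_rate(y_true, y_pred, s, 0, 1)
              - class_rate(y_true, y_pred, s, 1, 1))
gap = eodds_gap(y_true, y_pred, s)
```

Here `tpr_gap` is 0 while `gap` is 0.5, so these predictions satisfy EOpp but not EOdds.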

3 Analysis of different settings

4-way balanced dataset

In this dataset, all 4 “quadrants” corresponding to $(s=0,y=0)$, $(s=0,y=1)$, $(s=1,y=0)$ and $(s=1,y=1)$ have the same size:

$$\begin{aligned} 0.25 &=P(s=0,y=0) = P(s=0,y=1)\\ &= P(s=1,y=0)= P(s=1,y=1) \end{aligned}$$

/images/blog/balanced.svg
4-way balanced dataset.

We thus have $P(y=1|s=0)=P(y=1|s=1)$ in the dataset (i.e. the same equation as in the definition of demographic parity, except that the $\hat{y}$’s are replaced by $y$’s). It makes sense then that if we apply an ideal classifier to this, it will satisfy demographic parity.

This means there is no trade-off here: a classifier exists which has perfect utility and perfect fairness.

It’s worth noting that an ideal classifier, on this dataset, achieves demographic parity, perfect accuracy, and EOpp/EOdds.

Group-balanced dataset

We are now starting to relax some of the constraints of the 4-way balance. In group-balanced datasets, the 4 quadrants are not equal, but all groups occur at the same rate and they each have the same outcomes. For example, the dataset is 50% female and 50% male, and in both groups 20% of individuals have a positive outcome.

/images/blog/balanced2.svg
Group balanced dataset.

So, the following equation holds for all $y'\in\{0, 1\}$:

$$0.5 = P(s=0|y=y') = P(s=1|y=y')$$

This still implies demographic parity on the dataset labels. Since each group makes up half of every class, each group also makes up half of the whole dataset, so $P(s=0)=P(s=1)$, and therefore:

$$\begin{aligned} P(y=1|s=0)&=\frac{P(y=1,s=0)}{P(s=0)}=\frac{P(s=0|y=1)P(y=1)}{P(s=0)}\\ &=\frac{P(s=1|y=1)P(y=1)}{P(s=1)}=P(y=1|s=1) \end{aligned}$$

This dataset still provides most of the benefits of the 4-way balanced dataset. An ideal classifier still has perfect utility and perfect fairness. Non-ideal classifiers usually also don’t have too much trouble with this. The dataset emphasises both demographic groups equally, so even a non-ideal classifier is unlikely to pick up a bias here.
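The 50%/50% example with a 20% positive rate can be checked numerically. The sketch below builds the joint distribution $P(s, y)$ for that example and recomputes the conditionals; the helper names are hypothetical.

```python
# Joint distribution P(s, y) for the example in the text:
# each group is 50% of the data and 20% of each group has y = 1.
joint = {
    (0, 0): 0.5 * 0.8, (0, 1): 0.5 * 0.2,
    (1, 0): 0.5 * 0.8, (1, 1): 0.5 * 0.2,
}


def p_s_given_y(joint, s, y):
    """P(s | y): share of group s among samples with label y."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return joint[(s, y)] / p_y


def p_y_given_s(joint, y, s):
    """P(y | s): rate of label y within group s."""
    p_s = sum(p for (ss, _), p in joint.items() if ss == s)
    return joint[(s, y)] / p_s


# Group balance: both groups make up half of each class.
group_shares = [p_s_given_y(joint, s, y) for s in (0, 1) for y in (0, 1)]

# Demographic parity on the labels: the same positive rate in both groups.
rate_0 = p_y_given_s(joint, 1, 0)
rate_1 = p_y_given_s(joint, 1, 1)
```

All four entries of `group_shares` come out as 0.5, and both positive rates equal 0.2 up to floating-point error.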

Outcome-balanced dataset

Next, we are relaxing the requirement of equal-sized groups, while keeping the requirement that outcomes are the same in all groups. For example, the dataset could be 35% male and 65% female. However, the outcome percentage has to be exactly the same in both: for example 20% of both groups have a positive outcome.

Outcome-balanced dataset.

In this dataset we still naturally have demographic parity (DP), because within a group, the probability of a certain outcome is the same as in other groups:

$$P(y=1 | s=0)=P(y=1 | s=1)$$

An ideal classifier trained on this dataset will still maintain all of the benefits of being trained on the 4-way balanced dataset.

One danger, however, is that the classifier will spend less “effort” on the underrepresented group, so that its predictions for that group become worse. This is in contrast to the group-balanced dataset above, where groups occur at equal rates.

By its definition, the ideal classifier will not have this problem, but if we loosen this definition, we get two new classifiers. The first is a “best possible” classifier, which has 100% test accuracy on samples that are drawn i.i.d. from the training distribution. The second is a “realistic” classifier, which generalizes reasonably well but achieves <100% test accuracy.

Whilst the best possible classifier will still be ok in what’s been described so far, the realistic classifier is susceptible to this kind of cutting corners.

This is also true for Equality of Opportunity. While EOpp can in theory be achieved on any dataset, algorithms struggle with it on imbalanced datasets, because they spend more effort on the majority group.

However, if the only problem is the imbalance of the groups (and data quality is not a problem), then the effects can be relatively easily mitigated by re-weighting. This makes it so that ERM (empirical risk minimization) puts equal weight on all groups and will work for DP and EOpp alike.
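One simple way to re-weight is sketched below: every sample gets a weight inversely proportional to the size of its group, so that each group contributes the same total weight to the training loss. This is only one possible scheme, and the function name `group_balancing_weights` is hypothetical.

```python
from collections import Counter


def group_balancing_weights(s):
    """Per-sample weights such that every group has the same total weight.

    With n samples and k groups, a sample from a group of size n_g gets
    weight n / (k * n_g), so the weights still sum to n overall.
    """
    counts = Counter(s)
    n, k = len(s), len(counts)
    return [n / (k * counts[g]) for g in s]


# Group sizes from the example in the text: 35% vs. 65% of 100 samples.
s = [0] * 35 + [1] * 65
weights = group_balancing_weights(s)

total_0 = sum(w for w, g in zip(weights, s) if g == 0)
total_1 = sum(w for w, g in zip(weights, s) if g == 1)
```

Both totals come out at 50, i.e. half of the overall weight each. Such weights would then be used as per-sample weights in a weighted training loss.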

Imbalanced dataset

/images/blog/balanced4.svg
Imbalanced dataset.

Finally, in an imbalanced dataset, there are no constraints.

An ideal classifier will achieve maximum accuracy and will therefore satisfy Equality of Opportunity, since its TPR $P(\hat{y}=1|s, y=1)$ is 1 in every group. In general, however, it can no longer satisfy demographic parity. This is simply due to the fact that the labels themselves don’t satisfy demographic parity and so perfect predictions won’t either.
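A small numeric sketch of this situation follows. The quadrant counts are made up for illustration; the only thing that matters is that the two groups have different base rates.

```python
# Hypothetical counts per (s, y) quadrant of an imbalanced dataset.
counts = {(0, 0): 80, (0, 1): 20, (1, 0): 150, (1, 1): 150}


def base_rate(counts, s):
    """P(y = 1 | s) in the labels."""
    return counts[(s, 1)] / (counts[(s, 0)] + counts[(s, 1)])


# An ideal classifier predicts y_hat = y. Its positive prediction rate in
# each group therefore equals that group's base rate ...
dp_gap_ideal = abs(base_rate(counts, 0) - base_rate(counts, 1))  # |0.2 - 0.5|

# ... while its TPR is 1 in every group, because every positive is found.
tpr_ideal = {0: 1.0, 1: 1.0}
eopp_gap_ideal = abs(tpr_ideal[0] - tpr_ideal[1])
```

With these counts the ideal classifier has a DP gap of 0.3 but an EOpp gap of 0.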

Non-ideal classifiers may still achieve EOpp, but as with the outcome-balanced dataset, they will most likely struggle. And imbalanced datasets actually have an additional source of trouble with regard to Equality of Opportunity: often the minority group has quite imbalanced classes. So, if, for example, $s=0$ is the minority group, then there might only be 10% positive labels ($y=1$) for $s=0$ and 90% negative labels ($y=0$). This is another opportunity for the prediction algorithm to cut corners: if it always predicts $\hat{y}=0$ for $s=0$, it will still achieve 90% accuracy on that group! So we end up with a TPR of 0 for $s=0$ (i.e. $P(\hat{y}=1|s=0,y=1)=0$) while the TPR for $s=1$ might still be close to 1.
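The arithmetic behind this corner-cutting can be reproduced in a few lines. The group size of 100 is an arbitrary choice for illustration; only the 10%/90% split comes from the example above.

```python
# Minority group s=0: 10% positive labels, 90% negative labels.
y_true_s0 = [1] * 10 + [0] * 90

# The "corner-cutting" rule: always predict 0 for members of s=0.
y_pred_s0 = [0] * len(y_true_s0)

# Accuracy within the group.
accuracy_s0 = sum(t == p for t, p in zip(y_true_s0, y_pred_s0)) / len(y_true_s0)

# TPR within the group: fraction of true positives that were predicted positive.
preds_on_positives = [p for t, p in zip(y_true_s0, y_pred_s0) if t == 1]
tpr_s0 = sum(preds_on_positives) / len(preds_on_positives)
```

The accuracy on the group is 0.9 while the TPR is 0, so a metric that only looks at accuracy would not reveal the problem.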

Misleading training set labels

Up to now we have always assumed that the training and test sets have the same distribution, but sometimes this is not the case. When the two differ, it can happen that accuracy and fairness are aligned: a classifier with better fairness also has better accuracy.

For example, you might train on the “Imbalanced dataset” above, but when you deploy the model, the population looks more like the “Group-balanced dataset”.

/images/blog/balanced_test.svg
Misleading training set labels.

Why might this setup occur? There could be sampling bias: the labeled training data was collected from a non-representative sample of the population on which the model will be deployed. There can also be label bias: the labeled training data contains samples that were mis-labeled in a way that correlates with membership of a sensitive group.

Is this a realistic problem? Well, yes. If the problem you’re trying to solve subscribes to the “We’re all equal” worldview then the deployment setting should generally conform to a balanced outcome.

Demographic parity. Enforcing demographic parity will then improve test accuracy, because it just so happens that the test set satisfies demographic parity, and enforcing balanced predictions makes those predictions closer to the test distribution. This is of course not guaranteed to work: demographic parity can be enforced in many ways. In particular, the random classifier and the majority classifier are very unlikely to get high test accuracy. This is because they just enforce DP blindly without considering where it makes sense to move the decision boundaries. Still, classifiers that also take accuracy into account can perform well in this scenario. In essence, we're introducing knowledge of the deployment setting as an additional inductive bias which says the model should maintain demographic parity, and this improves the accuracy.

Equality of opportunity. As long as we evaluate with respect to the true labels from the deployment setting, equality of opportunity is compatible with perfect accuracy.

Poor data quality for some of the groups

Sometimes the problem is just the quality of the data. In this case, the labels might even be correct, but the quality of the features differs dramatically between groups. So, for example, the features for group $s=0$ might just not be good enough to make decent predictions. In this case, the classifier will often fall back on the base rate: if it knows that approximately 40% of samples have a positive label in $s=0$, then it will just assign a positive label to a more-or-less random 40% of the test data.
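To see what falling back on the base rate costs, the sketch below computes the expected accuracy and TPR of a classifier whose predictions within the group are positive with probability equal to the base rate and independent of the true label. The independence assumption and the function name `base_rate_guessing` are simplifications introduced here for illustration.

```python
def base_rate_guessing(base_rate):
    """Expected accuracy and TPR when y_hat ~ Bernoulli(base_rate), independent of y."""
    # Correct if both are positive or both are negative.
    accuracy = base_rate * base_rate + (1 - base_rate) * (1 - base_rate)
    # Under independence, P(y_hat = 1 | y = 1) = P(y_hat = 1).
    tpr = base_rate
    return accuracy, tpr


# Group s=0 from the text: about 40% of samples have a positive label.
accuracy, tpr = base_rate_guessing(0.4)  # accuracy 0.52, TPR 0.4
```

So for this group the classifier is only slightly better than a coin flip in accuracy and finds just 40% of the positives, although its rate of positive predictions still matches the group's base rate.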

Demographic parity doesn’t necessarily suffer in this scenario, because it only constrains the rate of positive predictions in each group, not how accurate those predictions are.

For equality of opportunity, however, this can induce a kind of trade-off between accuracy and fairness: the only way to bring the TPRs of the other groups close to that of $s=0$ (the group with the bad features) is for the algorithm to do equally badly on the other groups, leading to abysmal overall performance.

4 Conclusion

As we saw, there are a considerable number of cases where no utility-fairness trade-off occurs. If the dataset is balanced in specific ways, fairness and utility are perfectly compatible. If only the training set is biased but the test set isn’t, then we even see that improving demographic parity also increases accuracy!