Explainable AI with PDP (Partial Dependence Plot)

In this article, we will go through what a PDP is and how we can use it to explain machine learning models.

If you hate theory and want to play with the code, here is a Google Colab notebook for you.

If you are interested in how Partial Dependence Plots work, read the entire story.

Introduction

Understanding the functional relationship between the predictor variables and the predicted outcome is difficult when using a black-box model. Partial Dependence Plots (PDPs) were introduced by Friedman in 2001, who was facing this challenge while trying to understand the gradient boosting machine he was working on. Usually, it is easy to calculate the importance of a variable but tough to understand its individual impact on the predicted outcome. PDPs help solve this problem by showing how the model's prediction changes, on average, as a function of that variable.

What is PDP?

PDPs visualize the marginal effect of a predictor variable on the predicted outcome by plotting the average model output at different levels of that predictor. This gives an idea of the effect the predictor has on the prediction on average.

Consider a class of students who have just taken their exams. Each individual’s cumulative grade depends on how they perform in each subject. The academic advisor for this class wants to see how the average grade of the class is affected by one subject, ‘ExAI’, where the performance of the students has been mixed. The advisor calculates the average of the students’ grades at each score level in ExAI and plots the result. This gives an idea of how the average class grade changes across the scores achieved in ExAI.

This is exactly how Partial Dependence Plots work: the contribution of a selected predictor variable to the outcome is found by calculating its average marginal effect, which averages out the effect of the other variables in the model. These values are plotted on a chart, which also shows the direction in which the variable affects the outcome.

As discussed earlier, PDPs visualize the average marginal effect of a predictor on the predicted outcome. To make this precise, let S be a subset of the p predictor variables and let C be its complement, the set of remaining predictors.

$S \subset \{X_1, X_2, \ldots, X_p\}$

S is the set of variables whose partial dependence is to be calculated, while the other variables, in set C, are kept at their observed values. The functional relationship can be written as $f(X) = f(X_S, X_C)$. The partial dependence function of $f(X)$ on the predictors $X_S$ is then

$$f_S(X_S) = E_{X_C}\left[f(X_S, X_C)\right] = \int f(X_S, X_C)\, dP(X_C)$$

Since the true values of $f$ and $dP(X_C)$ are not known, this is estimated by

$$\bar{f}_S(X_S) = \frac{1}{N} \sum_{i=1}^{N} \hat{f}(X_S, X_{Ci})$$

where $\hat{f}$ is the fitted model and $\{X_{C1}, X_{C2}, \ldots, X_{CN}\}$ are the values of $X_C$ in the dataset. Evaluating the partial dependence function at various levels of $X_S$ gives a set of ordered pairs. Friedman proposed plotting lines joining these coordinates, which results in the Partial Dependence Plot. The plot therefore shows only the marginal dependence between the selected variables and the predicted outcome.
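The estimate described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `partial_dependence_1d`, `toy_model` and the three-row `data` list are hypothetical names and values made up for this example, and the toy model is a simple additive function rather than a real black box.

```python
from statistics import mean


def partial_dependence_1d(predict, rows, feature, grid):
    """Estimate the partial dependence of `predict` on a single feature.

    predict : callable that takes a list of feature values and returns a number
    rows    : the dataset, a list of observations (each a list of feature values)
    feature : index of the feature in S
    grid    : the levels of X_S at which the function is evaluated
    """
    curve = []
    for value in grid:
        preds = []
        for row in rows:
            modified = list(row)             # copy so the dataset is not mutated
            modified[feature] = value        # force X_S to the grid value
            preds.append(predict(modified))  # X_C keeps its observed values
        curve.append((value, mean(preds)))   # average over the N rows
    return curve


# Hypothetical toy model: additive in both features.
def toy_model(x):
    return 2.0 * x[0] + x[1]


data = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
pdp = partial_dependence_1d(toy_model, data, feature=0, grid=[0.0, 1.0, 2.0])
print(pdp)  # [(0.0, 3.0), (1.0, 5.0), (2.0, 7.0)]
```

Because the toy model is additive, the resulting curve is a straight line with slope 2, shifted by the average of the second feature (3.0). Joining the returned pairs with lines gives the plot.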

Pros

  • PDPs are simple, easy to understand and can be explained to a non-technical person without any difficulties
  • They can be used to compare models to decide which model works best for a use case
  • They are intuitive and easy to implement

Cons

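  • By default, PDPs assume that the features are uncorrelated; when they are correlated, the averaging includes feature combinations that are unlikely to occur in the data
  • They can only plot the averaged marginal dependence function and cannot show the effect on individual points, which can be a huge problem when opposite effects cancel out: if a feature raises the prediction for half of the dataset and lowers it by the same amount for the other half, the averaged curve is flat
  • They also assume that there are no interactions between the plotted variable and the other variables, which is highly unlikely in the real world
  • Though interactions can be plotted, they are limited to second order (two features at a time)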
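The cancellation problem in the second point can be seen in a tiny example. The `model` function and the four rows below are made up purely for illustration: the effect of the first feature flips sign depending on the second, so the per-row curves differ while their average is flat.

```python
from statistics import mean


def model(x):
    # Hypothetical black box: the sign of the effect of x[0] depends on x[1].
    return x[0] if x[1] > 0 else -x[0]


# Made-up dataset: half of the rows have x[1] = 1, the other half x[1] = -1.
rows = [[0.5, 1], [0.2, 1], [0.9, -1], [0.4, -1]]
grid = [0.0, 0.5, 1.0]

# One curve per row: vary feature 0 over the grid, keep the row's own x[1].
per_row_curves = [[model([v, row[1]]) for v in grid] for row in rows]

# The PDP is the pointwise average of those per-row curves.
pdp = [mean(curve[i] for curve in per_row_curves) for i in range(len(grid))]

print(per_row_curves[0])  # rising curve for a row with x[1] = 1
print(per_row_curves[2])  # falling curve for a row with x[1] = -1
print(pdp)                # the average hides both effects
```

Looking at the per-row curves next to the averaged one is a simple way to check whether the average is hiding opposite effects.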

Implementation

The implementation is done on a Gradient Boosting Machine (GBM).

What is GBM?

  • Gradient Boosting Machine is a machine learning algorithm that forms an ensemble of weak prediction models, typically decision trees
  • It constructs a forward stage-wise additive model by implementing gradient descent in function space
  • Also known as MART (Multiple Additive Regression Trees) and GBRT (Gradient Boosted Regression Trees)
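To make the "forward stage-wise additive model" idea concrete, here is a minimal sketch of boosting for squared-error regression on a single feature, using decision stumps as the weak learners. It is an assumption-laden toy, not how scikit-learn implements it: `fit_stump`, `gradient_boost` and the data are hypothetical, and the classifier used later in this article optimizes a classification loss rather than squared error.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_stages=20, learning_rate=0.5):
    """Forward stage-wise boosting: each stage adds a stump fitted to the residuals."""
    base = mean(ys)  # stage 0: a constant prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_stages):
        # For squared error, the negative gradient of the loss is the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Made-up data, roughly y = x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
model = gradient_boost(xs, ys)

mse_base = mean((y - mean(ys)) ** 2 for y in ys)
mse_boosted = mean((y - model(x)) ** 2 for x, y in zip(xs, ys))
print(mse_base, mse_boosted)
```

Each stage takes a small step (scaled by `learning_rate`) in the direction that reduces the training loss, which is what "gradient descent in function space" refers to.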

Dataset:

Pima Indians Diabetes; Target: Outcome

The model is trained using the GradientBoostingClassifier class from the scikit-learn library.
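A rough sketch of this step is shown below. It is not the code from the notebook: to stay self-contained it generates a synthetic stand-in for the dataset (three made-up columns named after Pregnancies, Glucose and BMI, with an invented labelling rule), so the numbers it produces say nothing about real patients. It then uses scikit-learn's `partial_dependence` function to compute the averaged curve for one feature.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

# ASSUMPTION: synthetic stand-in data, NOT the real Pima Indians Diabetes table.
# Replace X and y with the real feature columns and the Outcome column.
rng = np.random.default_rng(0)
feature_names = ["Pregnancies", "Glucose", "BMI"]  # subset of columns, for brevity
X = np.column_stack([
    rng.integers(0, 15, size=300),
    rng.normal(120, 30, size=300),
    rng.normal(32, 7, size=300),
])
# Invented rule for illustration only: the label depends on Glucose and BMI plus noise.
score = (X[:, 1] - 120) / 30 + (X[:, 2] - 32) / 7 + rng.normal(0, 1, size=300)
y = (score > 0).astype(int)

# Train the gradient boosting classifier.
model = GradientBoostingClassifier(random_state=0)
model.fit(X, y)

# Averaged partial dependence of the model output on Glucose (column index 1),
# evaluated at 20 grid points between the 5th and 95th percentiles (the default range).
result = partial_dependence(model, X, features=[1], kind="average", grid_resolution=20)
glucose_curve = result["average"][0]
print(glucose_curve.shape)  # (20,)
```

If matplotlib is installed, `sklearn.inspection.PartialDependenceDisplay.from_estimator(model, X, features=[0, 1, 2])` draws one such curve per feature.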

Visualizations

PDP for every feature

The above plot shows how the model output changes as each feature value varies. Some key points for interpretation from the above plots:

  • As Pregnancies increase, the person’s chances of becoming diabetic go up
  • The higher the Glucose, the higher the chances of a person becoming diabetic
  • BMI of more than 25 increases an individual's chances of becoming diabetic

3-D PDPs

These plots show the combined effect of two features on the change in output. As seen above, a reduction in both Insulin and DiabetesPedigreeFunction results in a negative change in the prediction of a person being diabetic (moving toward a non-diabetic prediction).

PDP interact plot

The plot below shows the predicted output (the value inside each square) for every combination of values of the features Insulin and DiabetesPedigreeFunction (values given by the scales).
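The numbers behind a two-feature plot like this can be computed with scikit-learn's `partial_dependence` by passing both feature indices. The sketch below is an assumption-heavy illustration, not the notebook's code: the two columns are synthetic stand-ins for Insulin and DiabetesPedigreeFunction, and the labelling rule is invented.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

# ASSUMPTION: synthetic stand-ins for the two columns, not the real dataset.
rng = np.random.default_rng(0)
insulin = rng.normal(80, 40, size=300).clip(min=0)
pedigree = rng.uniform(0.1, 2.0, size=300)
X = np.column_stack([insulin, pedigree])
# Invented rule for illustration only.
y = (insulin / 80 + pedigree + rng.normal(0, 0.5, size=300) > 2).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Two-way partial dependence: the model output averaged over the data
# for every combination of 10 Insulin levels and 10 pedigree levels.
result = partial_dependence(model, X, features=[0, 1], kind="average", grid_resolution=10)
grid = result["average"][0]  # grid[i, j]: i-th Insulin level, j-th pedigree level
print(grid.shape)  # (10, 10)
```

Each cell of `grid` corresponds to one square of the interaction plot; the same array can be drawn as a heatmap or as a 3-D surface.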

There are other visualizations in the notebook that you can experiment with to learn more (link here).

References

  1. Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. “The Elements of Statistical Learning”.
  2. Molnar, Christoph. “Interpretable Machine Learning: A Guide for Making Black Box Models Explainable”, 2019. https://christophm.github.io/interpretable-ml-book/.

Thanks to Kartik Kumar and Varun Raj for their contribution to the blog
