Shell: Evaluating the performance of machine learning models used in the energy sector

Case study from Shell.

This project leverages deep learning to perform a computer vision task: semantic segmentation in a specialised application domain. The project had about 15 deep-learning (DL) models in active deployment. The DL models are applied in a cascaded fashion to generate predictions, which then feed into a series of downstream tasks to produce the final output, which in turn is the input to a manual interpretation task. Hence, AI assurance through model performance evaluation is critical to ensure robust and explainable AI outcomes. Three types of model evaluation tests were designed and implemented in the DL inference pipeline: regression tests, integration tests and statistical tests.


The regression and integration tests form the backbone of the test suite and provide model interpretability against a set of test data. During model development they provide a baseline for judging whether model performance is improving or degrading, conditional on the model training data and parameters. During the model deployment phase these tests also provide an early indication of concept drift.
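The baseline comparison described above can be sketched as follows. This is a minimal illustration, not the project's actual test suite: the function name `run_regression_test`, the metric names, the metric values and the tolerance are all assumptions, and every metric is assumed to be "higher is better".

```python
from dataclasses import dataclass


@dataclass
class RegressionResult:
    metric: str
    baseline: float
    current: float
    passed: bool


def run_regression_test(baseline, current, tolerance=0.01):
    """Compare current metrics with a stored baseline (higher is better).

    A metric passes if it has not dropped by more than `tolerance`
    (absolute) below its baseline value.
    """
    results = []
    for name, base_value in sorted(baseline.items()):
        if name not in current:
            raise KeyError(f"metric {name!r} missing from current run")
        cur_value = current[name]
        passed = cur_value >= base_value - tolerance
        results.append(RegressionResult(name, base_value, cur_value, passed))
    return results


# Hypothetical metric values, for illustration only.
baseline_metrics = {"iou": 0.80, "recall": 0.90}
candidate_metrics = {"iou": 0.81, "recall": 0.85}

results = run_regression_test(baseline_metrics, candidate_metrics, tolerance=0.02)
failed = [r.metric for r in results if not r.passed]
print("degraded metrics:", failed)
```

Run on a fixed test set after each retraining, a check of this kind shows whether a change to the training data or parameters improved or degraded the model. Run periodically in deployment against freshly labelled data, a sustained failure is one possible signal of concept drift.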

Statistical tests are designed to predict model performance given the statistics of the test data, and so provide a mechanism to detect data drift once models are deployed. They also indicate how robust DL model performance is to statistical variations in the test data.
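One common way to check the statistics of incoming data against a reference set is a two-sample Kolmogorov-Smirnov test. The sketch below is an assumption about how such a check could look, not a description of the statistical tests used in the project. It takes one scalar summary per input (for example a mean intensity, which is a hypothetical choice) and uses the standard large-sample critical value, where the coefficient 1.36 corresponds to a significance level of roughly 0.05.

```python
import bisect
import math


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    for x in ref + cur:
        # Fraction of each sample that is <= x.
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def drift_detected(reference, current, coeff=1.36):
    """Flag drift when the KS statistic exceeds the large-sample critical value."""
    n, m = len(reference), len(current)
    critical = coeff * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


# Hypothetical per-input summary statistics, for illustration only.
reference_stats = list(range(100))
shifted_stats = [x + 50 for x in reference_stats]

print(drift_detected(reference_stats, reference_stats))  # identical samples
print(drift_detected(reference_stats, shifted_stats))    # shifted distribution
```

In practice a library routine such as SciPy's `ks_2samp` would normally replace the hand-written statistic. The pure-Python version is used here only to keep the example self-contained.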

The output of this AI assurance technique is communicated to AI developers and product owners so that they can monitor potential deviation from expected DL model performance. If performance does deviate, these teams can operationalize appropriate mitigation measures.

The output is also communicated to frontline users and business stakeholders so that they can maintain a high degree of trust in the outcomes of the DL models.

AI developers are responsible for designing and running the model evaluation tests to strengthen the performance testing. Product owners are responsible for leveraging these tests as a first line of defence before new model deployments. The project team works together to adapt the tests to tackle data and concept drift during deployment.

In this project, the predictions of the DL models ultimately generate the inputs for a manual interpretation task. This task is complicated, time-consuming and effort-intensive, so it is crucial that the starting point (in this case the DL model predictions) is of high quality in terms of accuracy and detection coverage, with very low noise. Furthermore, the outcome of the manual interpretation feeds into a high-impact decision-making process.
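For a binary segmentation mask, the three quality criteria above can be approximated with standard pixel-level metrics. The mapping used here is an assumption made for illustration: "detection coverage" is read as recall and "noise" as the false positive rate. The source does not state which metrics the project used, and the masks below are hypothetical.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN and TN pixels for two binary masks of equal shape."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(pred, truth):
    """Pixel accuracy, recall, false positive rate and IoU for binary masks."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    total = tp + fp + fn + tn
    return {
        "pixel_accuracy": (tp + tn) / total if total else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "iou": tp / (tp + fp + fn) if (tp + fp + fn) else 0.0,
    }


# Hypothetical 2x4 masks: 1 marks a pixel belonging to the target class.
truth_mask = [[1, 1, 0, 0],
              [1, 1, 0, 0]]
pred_mask = [[1, 1, 1, 0],
             [1, 0, 0, 0]]

metrics = segmentation_metrics(pred_mask, truth_mask)
print(metrics)
```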

The quality and robustness of the DL models' predictions are thus of paramount importance. The most important measure of the DL models' prediction performance is human-in-the-loop quality control. However, to automate performance testing as a first line of defence, the model evaluation test suite technique was adopted. Data version control and the creation of implicit ML experiment pipelines were introduced mainly to ensure that the models could be reproduced end to end (data, code and model performance) within an acceptable margin of error.
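A reproducibility check of this kind can be sketched as below. It assumes that each run records a hash of its data manifest, a hash of its configuration and its evaluation metrics. The record layout, the function names and the margin are hypothetical, and the source does not name the data version control tool the project used.

```python
import hashlib
import json
import math


def fingerprint(obj):
    """Stable SHA-256 digest of a JSON-serialisable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_reproduced(original, rerun, margin=0.01):
    """True if the inputs match exactly and every metric agrees within `margin`."""
    if original["data_hash"] != rerun["data_hash"]:
        return False
    if original["config_hash"] != rerun["config_hash"]:
        return False
    if original["metrics"].keys() != rerun["metrics"].keys():
        return False
    return all(
        math.isclose(original["metrics"][k], rerun["metrics"][k], abs_tol=margin)
        for k in original["metrics"]
    )


# Hypothetical run records, for illustration only.
config = {"learning_rate": 0.001, "epochs": 20}
data_manifest = ["train_000.npy", "train_001.npy"]

original_run = {
    "data_hash": fingerprint(data_manifest),
    "config_hash": fingerprint(config),
    "metrics": {"iou": 0.800},
}
rerun = {
    "data_hash": fingerprint(data_manifest),
    "config_hash": fingerprint(config),
    "metrics": {"iou": 0.804},
}

print(is_reproduced(original_run, rerun, margin=0.01))
```

Exact equality is required for the data and configuration, but only approximate agreement for the metrics, because DL training is rarely bit-for-bit deterministic.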

First line of defence, automated DL performance testing for QA

Test for model robustness and better interpretability of DL model performance.

Robust explanation of DL model performance for AI developers and end users

Build trust in DL models and workflows with user community

Enables model monitoring by establishing a mechanism to detect concept drift.

MLOps hooks for enabling CI/CD during model deployment.

A large number of DL models with very different tasks: detection, classification, noise reduction.

Complexity and variability of the problem being addressed by DL make designing KPIs difficult.

Lack of high-quality, representative data that could be used to design the model evaluation.

Lack of clear metrics/thresholds to design regression, integration, and statistical tests.

Lack of a stable model evaluation library.

For more information about other techniques visit the OECD Catalogue of Tools and Metrics: https://oecd.ai/en/catalogue/overview

For more information on relevant standards visit the AI Standards Hub: https://aistandardshub.org/