PLS for prediction of pectin yield from carbohydrate microarray data

This notebook exemplifies the PLS modeling approach utilizing carbohydrate microarray data. The data used for this notebook was derived from Baum et al. [1]. Carbohydrate microarrays are typically equipped with multiple monoclonal antibodies and carbohydrate binding modules (CBM), which will specifically bind to unique structural polysaccharide domains. Pectin is such a polysaccharide. It is typically extracted from citrus peel and used as a gelling agent in jams etc. In the following we will establish Partial Least Squares (PLS) prediction models for pectin yield from the underlying carbohydrate binding pattern of various extracted samples. Or said in a different way we want to find a PLS model, which we could use on a newly measured carbydrate microarray row (= pectin extract sample) to predict its final pectin yield.

Import dependencies and load data

If you encounter errors in this section you might need to install/upgrade one or several of the following packages.

[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mbpls.data.get_data import load_CarbohydrateMicroarrays_Data
[2]:
data = load_CarbohydrateMicroarrays_Data()
Following dataset were loaded as Pandas Dataframes:
 dict_keys(['extraction2', 'extraction3', 'extraction1'])
[3]:
extraction1 = data['extraction1']
extraction2 = data['extraction2']
extraction3 = data['extraction3']

Plot carbohydrate microarray data

During the study by Baum et al. [1] three extractions of pectin were carried out. These are shown as heatmaps below. The reference values \(\textbf{y}\) (final pectin yield) are given as row labels. The extracted samples were measured against 30 monoclonal antibodies/CBMs (columns). Lighter color in the individual cells resemble high degree of interaction between antibody/CBM with the extracted pectin.

[4]:
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(18,12))
sns.heatmap(extraction1, ax=ax[0][0], annot=True, cbar=False)
sns.heatmap(extraction2, ax=ax[0][1], annot=True, cbar=False)
sns.heatmap(extraction3, ax=ax[1][0], annot=True, cbar=False)
ax[1][1].axis('off')

ax[0][0].set_title('Extraction1', fontsize=18)
ax[0][1].set_title('Extraction2', fontsize=18)
ax[1][0].set_title('Extraction3', fontsize=18)
ax[0][0].set_xlabel('monoclonal Antibody / CBM', fontsize=16)
ax[0][1].set_xlabel('monoclonal Antibody / CBM', fontsize=16)
ax[1][0].set_xlabel('monoclonal Antibody / CBM', fontsize=16)
ax[0][0].set_ylabel('Pectin yield [g]', fontsize=16)
ax[0][1].set_ylabel('Pectin yield [g]', fontsize=16)
ax[1][0].set_ylabel('Pectin yield [g]', fontsize=16)
plt.tight_layout()
../../_images/examples_real_world_applications_Carbohydrate_Microarray_PLS_7_0.png

Concatenate data

To perform PLS modeling all three extractions are concatenated into a single matrix \(\textbf{X}\) for PLS modeling. Pectin yields resemble reference values \(\textbf{y}\) for supervised learning.

[5]:
extractions_all = pd.concat((extraction1, extraction2, extraction3),axis=0)
yield_all = extractions_all.index
[6]:
plt.figure(figsize=(9,12))
sns.heatmap(extractions_all, annot=True, cbar=False)
plt.title('all Extractions concatenated', fontsize=18)
plt.xlabel('monoclonal Antibody / CBM', fontsize=16)
plt.ylabel('Pectin yield [g]', fontsize=16);
../../_images/examples_real_world_applications_Carbohydrate_Microarray_PLS_10_0.png

Perform Leave-One-Out Cross Validation for parameter tuning

To find the number of latent variables (LV) we perform LOO-CV. It is recommended to perform 5-fold CV to avoid overfitting. However, in this example the number of samples are quite limited, hence the choice for LOO-CV.

[7]:
from mbpls.mbpls import MBPLS
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error

MSEs = []
for lv in range(15):
    mbpls = MBPLS(n_components=lv+1)
    prediction = cross_val_predict(mbpls, extractions_all, yield_all, cv=len(extractions_all))
    prediction = pd.DataFrame(prediction)
    MSEs.append(mean_squared_error(prediction, yield_all))

plt.plot(np.arange(1,16), MSEs)
plt.xlabel('number of LVs', fontsize=16)
plt.ylabel('LOO-CV MSE', fontsize =16)
plt.title('Find the right number of LVs', fontsize =18);
../../_images/examples_real_world_applications_Carbohydrate_Microarray_PLS_12_0.png

Finally, we show the calibration for a model with 5 LVs. The actual MSE minimum indicates 7 LVs. However, we want to prevent from overfitting and choose only 5 LVs (increased bias).

[8]:
mbpls = MBPLS(n_components=5)
prediction = cross_val_predict(mbpls, extractions_all, yield_all,
                               cv=len(extractions_all))
prediction = pd.DataFrame(prediction)
yield_all = pd.DataFrame(np.array(yield_all))
prediction = pd.concat((prediction, yield_all), axis=1)
prediction.columns=['predicted yield [g]', 'known yield [g]']
prediction.plot.scatter(x='known yield [g]', y='predicted yield [g]')
plt.plot([prediction.min().min(), prediction.max().max()],
         [prediction.min().min(), prediction.max().max()], color='red')
plt.ylabel(prediction.columns[0], fontsize=16)
plt.xlabel(prediction.columns[1], fontsize=16)
plt.title('MSE = {:.4f} $g^2$'.format(mean_squared_error(prediction['known yield [g]'],
                                        prediction['predicted yield [g]'])), fontsize=18);
../../_images/examples_real_world_applications_Carbohydrate_Microarray_PLS_14_0.png

The calibration above indicates that carbohydrate microarrays can be used to predict pectin yield. The mean square error (MSE) is 0.0013 \(g²\)

Let’s look at the regression vector \(\boldsymbol{\beta}\)

To calculate the regression vector \(\boldsymbol{\beta}\) we first need to calculate a global model using our 5 LVs (we found these in the cross validation section).

[9]:
import seaborn as sns
mbpls = MBPLS(n_components=5)
mbpls.fit_transform(extractions_all, yield_all)
plt.figure(figsize=(18,5))
sns.barplot(x=extraction1.columns, y=mbpls.beta_[:,0])
plt.title('regression vector $\\beta$', fontsize=18)
plt.xlabel('antibody/CBM importance', fontsize=16)
plt.ylabel('amplitude', fontsize=16)

plt.figure(figsize=(18,12))
sns.heatmap(extractions_all, annot=True, cbar=False)
plt.title('all Extractions concatenated', fontsize=18)
plt.xlabel('monoclonal Antibody / CBM', fontsize=16)
plt.ylabel('Pectin yield [g]', fontsize=16);
../../_images/examples_real_world_applications_Carbohydrate_Microarray_PLS_17_0.png
../../_images/examples_real_world_applications_Carbohydrate_Microarray_PLS_17_1.png

From the regression vector \(\boldsymbol{\beta}\) (see above) one can understand which antibodies and/or CBMs are important for prediction of prectin yield. Additionally, the signs of the individual coefficents indicate positive or negative correlations, respectively.

References

[1] Baum, Andreas, et al. “Prediction of pectin yield and quality by FTIR and carbohydrate microarray analysis.” Food and Bioprocess Technology 10.1 (2017): 143-154.