Modelling/Analysis

In this chapter we will explore how to model/analyze our data using the packages statsmodels for general statistical modelling, lifelines for time to event modelling, and scikit-learn for prediction modelling and machine learning. We will also be using polars for data management, matplotlib for plotting, and numpy for some simple functions.

import numpy as np
import polars as pl
import statsmodels.formula.api as smf
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.datasets import load_rossi, load_waltons
from lifelines.plotting import add_at_risk_counts
from matplotlib import pyplot as plt
from matplotlib.ticker import PercentFormatter
from polars import col
from sklearn.datasets import load_iris
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import Bunch

Unfortunately, statsmodels returns its results as pandas objects (pandas being the older data frame library), but we can easily extract the coefficients and confidence intervals into a Polars DataFrame with a simple convenience function.

def extract_coefs(fit, exp=False):
    # Coefficient estimates as a Polars DataFrame
    coefs = pl.from_pandas(fit.params, include_index=True).rename(
        {"index": "coef", "0": "estimate"}
    )
    # Matching confidence intervals
    ci = pl.from_pandas(fit.conf_int(), include_index=True).rename(
        {"None": "coef", "0": "ci_lower", "1": "ci_upper"}
    )
    # Join on the coefficient name; optionally exponentiate everything but the name
    return coefs.join(ci, on="coef", how="inner", validate="1:1").with_columns(
        pl.exclude("coef").exp() if exp else []
    )

We will start with a simple linear regression, using both ordinary least squares (OLS) and maximum likelihood (ML) to fit the model (if you haven't learnt what these are yet, don't worry; just follow along with the code and save the statistical understanding for later).

First we need to import and prepare our data. Here we will look at how lab values correlate with each other. The following will give us all MCH (mean corpuscular haemoglobin) and MCV (mean corpuscular volume) lab values that were registered on the same date.

lab_events = pl.read_csv("../../data/hosp/labevents.csv.gz", try_parse_dates=True)
lab_items = pl.read_csv("../../data/hosp/d_labitems.csv.gz")
lab = lab_events.join(lab_items, on="itemid", how="left", validate="m:m")
lab_values = (
    lab.sort("charttime")
    .select(
        "subject_id",
        date=col("charttime").dt.date(),
        lab="label",
        value="valuenum",
    )
    .drop_nulls()
    .unique(["subject_id", "date", "lab"], keep="first")
    .pivot("lab", index=["subject_id", "date"])
)
linear_data = lab_values.select("MCH", "MCV").drop_nulls()
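As a quick sanity check (an aside, not in the original analysis), we can summarize the two columns with the describe method before plotting.

# Quick summary statistics to sanity-check the extracted lab values
linear_data.describe()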

Now that we have the data in the right format, let’s look at the scatter plot of MCH and MCV to see if there is any obvious association between the variables.

fig, ax = plt.subplots()
ax.scatter(*linear_data, alpha=0.1)
ax.grid(color="black", alpha=0.1)
ax.set_axisbelow(True)
ax.set_xlabel("MCH")
ax.set_ylabel("MCV")
fig.savefig("../../assets/img/modelling-analysis/figure-z3zH7o.svg")

It looks like there is an association between MCH and MCV. Let's quantify it by assuming a linear association, fitting the model using OLS, and looking at a summary of the fit and the coefficients.

fit_ols = smf.ols("MCV ~ MCH", data=linear_data).fit()
fit_ols.summary()

We can also extract the coefficients with confidence intervals using our convenience function.

extract_coefs(fit_ols)

We can also get the estimate and confidence interval for a specific linear combination of the coefficients, where we choose the values of the variables (in this case MCH = 30).

est_lin_comb = fit_ols.t_test([1, 30])
est_lin_comb
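As a sanity check (not part of the original analysis), the point estimate from this linear combination can be reproduced by hand: the contrast vector [1, 30] picks out the intercept plus 30 times the MCH coefficient.

# Manual check: should match the estimate from t_test([1, 30])
fit_ols.params["Intercept"] + 30 * fit_ols.params["MCH"]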

Finally we can inspect the predicted values for each observation in the original data and compare them with the actual measured value.

data_predict = pl.concat(
    [
        linear_data,
        pl.from_pandas(fit_ols.predict(linear_data)).rename("MCV_predict").to_frame(),
    ],
    how="horizontal",
)
data_predict
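To summarize how far the predictions fall from the measured values (a small aside using plain Polars arithmetic), we can compute the root-mean-square error.

# Root-mean-square error of the predicted against the observed MCV values
residuals = data_predict["MCV"] - data_predict["MCV_predict"]
(residuals**2).mean() ** 0.5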

Now that we have a fitted model, we can plot the regression line in the figure to get a visual indication of how well the model fits the trend of the data.

x = np.linspace(20, 38, 100)
y = fit_ols.params["Intercept"] + fit_ols.params["MCH"] * x
ax.plot(x, y, color="red")
fig.savefig("../../assets/img/modelling-analysis/figure-YoccW3.svg")

It looks pretty good! But maybe the MCV values curve slightly upward at higher MCH values, so a better model could be a second degree polynomial (including MCH to the power of two).

fit_ols_2 = smf.ols("MCV ~ MCH + I(MCH ** 2)", data=linear_data.to_pandas()).fit()
fit_ols_2.summary()

Again extracting the coefficients.

extract_coefs(fit_ols_2)

And plotting the resulting regression line in our scatter plot.

x_2 = np.linspace(20, 38, 100)
y_2 = (
    fit_ols_2.params["Intercept"]
    + fit_ols_2.params["MCH"] * x_2
    + fit_ols_2.params["I(MCH ** 2)"] * x_2**2
)
ax.plot(x_2, y_2, color="orange")
fig.savefig("../../assets/img/modelling-analysis/figure-dOPYMg.svg")

Maybe a little closer to the data!
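One informal way to compare the two fits (an aside; more on model selection in your statistics courses) is the Akaike information criterion reported by statsmodels, which trades off goodness of fit against model complexity.

# Compare AIC of the linear and quadratic fits; a lower value suggests
# a better fit-complexity trade-off
fit_ols.aic, fit_ols_2.aic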

For the two previous models, we used OLS as the fitting method. This works well for linear regression, but a more general approach is maximum likelihood estimation. In statsmodels we can do this with the glm function (GLM, Generalized Linear Model). It can fit many types of models, but if we don't specify a distribution or link function it defaults to a simple linear regression.

fit_ml = smf.glm("MCV ~ MCH", data=linear_data).fit()
fit_ml.summary()
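For illustration (an aside), the defaults can be spelled out explicitly; this should give an identical fit, since glm defaults to a Gaussian family, whose default link is the identity.

import statsmodels.api as sm

# The defaults made explicit: a Gaussian family with an identity link,
# which makes the GLM equivalent to ordinary linear regression
fit_ml_explicit = smf.glm("MCV ~ MCH", data=linear_data, family=sm.families.Gaussian()).fit()
fit_ml_explicit.params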

We can see that we get the same results as with the OLS method, which is what we should expect. Let’s extract the coefficients.

extract_coefs(fit_ml)

And plot the regression line.

x_3 = np.linspace(20, 38, 100)
y_3 = fit_ml.params["Intercept"] + fit_ml.params["MCH"] * x_3
ax.plot(x_3, y_3, color="purple")
fig.savefig("../../assets/img/modelling-analysis/figure-4PtVY4.svg")

As expected, the regression line completely overlays the previous one from the OLS method.

When we have a binary outcome, logistic regression can be a better modelling approach. Let’s look at survival during ICU stays.

patients = pl.read_csv("../../data/hosp/patients.csv.gz", try_parse_dates=True)
icu_stays = pl.read_csv("../../data/icu/icustays.csv.gz", try_parse_dates=True)
icu_death = icu_stays.join(
    patients, on="subject_id", how="inner", validate="m:1"
).with_columns(
    # Died between ICU admission and discharge (a null date of death means
    # the patient survived)
    died_in_icu=col("dod")
    .is_between("intime", "outtime", closed="both")
    .fill_null(False)
    .cast(int),
    # Age at admission from the anchor age plus the year offset
    age=col("anchor_age").add(col("intime").dt.year().sub(col("anchor_year"))),
)
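Before modelling, it is worth checking how common the outcome is (a quick aside).

# Proportion of ICU stays during which the patient died
icu_death["died_in_icu"].mean()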

We can fit a logistic regression model using logit from statsmodels.

fit_logit = smf.logit("died_in_icu ~ age + gender", data=icu_death.to_pandas()).fit()
fit_logit.summary()

And finally extract the coefficients, this time exponentiated so that they correspond to odds ratios (more on this in your statistics courses).

extract_coefs(fit_logit, exp=True)
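As an aside beyond the original text, the fitted model can also turn covariate values into predicted probabilities; the 70-year-old below is a made-up example, not a patient from the data.

import pandas as pd

# Predicted probability of dying in the ICU for a hypothetical 70-year-old
# patient of each gender (illustrative values only)
fit_logit.predict(pd.DataFrame({"age": [70, 70], "gender": ["F", "M"]}))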

When we have a time-to-event outcome, we can approximate the survival function with the Kaplan-Meier estimator and fit a model to the data using Cox regression. For time-to-event analysis in Python, we will use the lifelines package.

# Load example data and convert to a Polars DataFrame
waltons = pl.from_pandas(load_waltons())
control = waltons.filter(col("group") == "control")
mir = waltons.filter(col("group") == "miR-137")
kmf_control = KaplanMeierFitter()
kmf_mir = KaplanMeierFitter()
fig, ax = plt.subplots(constrained_layout=True)
fit_control = kmf_control.fit(control["T"], control["E"], label="Control")
kmf_control.plot_survival_function(ax=ax)
fit_mir = kmf_mir.fit(mir["T"], mir["E"], label="miR-137")
kmf_mir.plot_survival_function(ax=ax)
add_at_risk_counts(fit_control, fit_mir, ax=ax)
# Figure formatting
ax.grid(alpha=0.1)
ax.set_axisbelow(True)
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1))
ax.set_xlabel("Time")
ax.set_ylabel("Survival")
fig.savefig("../../assets/img/modelling-analysis/figure-RmuJSA.svg")
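As a small follow-up (not in the original example), the fitted Kaplan-Meier objects also expose summary quantities such as the median survival time.

# Median survival time per group, i.e. the time at which the estimated
# survival function drops below 50%
kmf_control.median_survival_time_, kmf_mir.median_survival_time_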

# Load example data. Unfortunately, lifelines is not compatible with Polars
rossi = load_rossi()
cph = CoxPHFitter()
cph.fit(
    rossi,
    duration_col="week",
    event_col="arrest",
    formula="fin + wexp + age * prio",
)
cph.print_summary()
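For downstream use (a brief aside), the fitted Cox model also exposes the hazard ratios, i.e. the exponentiated coefficients, directly.

# Hazard ratios: the exponentiated Cox coefficients
cph.hazard_ratios_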

# Example from the scikit-learn documentation:
# https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
# Load example data in scikit-learn format
iris: Bunch = load_iris(as_frame=True) # type: ignore
X = iris.data[["sepal length (cm)", "sepal width (cm)"]]
y = iris.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# Processing pipeline
clf = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=11)),
    ]
)
# Plot the decision boundaries
fig, ax = plt.subplots()
clf.fit(X_train, y_train)
DecisionBoundaryDisplay.from_estimator(
    clf,
    X_test,
    response_method="predict",
    plot_method="pcolormesh",
    xlabel=iris.feature_names[0],
    ylabel=iris.feature_names[1],
    shading="auto",
    alpha=0.5,
    ax=ax,
)
scatter = ax.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, edgecolors="k")
ax.legend(scatter.legend_elements()[0], iris.target_names, title="Classes")
fig.savefig("../../assets/img/modelling-analysis/figure-yOAmiH.svg")
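To get a rough sense of the classifier's performance (an addition beyond the scikit-learn example), we can score the fitted pipeline on the held-out test set.

# Mean accuracy of the pipeline on the held-out test data
clf.score(X_test, y_test)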