Modelling/Analysis
In this chapter we will explore how to model/analyze our data using the packages statsmodels
for general statistical modelling, lifelines
for time-to-event modelling, and scikit-learn
for prediction modelling and machine learning.
We will also be using polars
for data management, matplotlib
for plotting, and numpy
for some simple functions.
import numpy as np
import polars as pl
import statsmodels.formula.api as smf
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.datasets import load_rossi, load_waltons
from lifelines.plotting import add_at_risk_counts
from matplotlib import pyplot as plt
from matplotlib.ticker import PercentFormatter
from polars import col
from sklearn.datasets import load_iris
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import Bunch
Unfortunately, statsmodels
returns its result as a pandas
DataFrame (the older data frame library), but we can easily extract the coefficients and confidence intervals using a simple convenience function.
def extract_coefs(fit, exp=False):
    coefs = pl.from_pandas(fit.params, include_index=True).rename(
        {"index": "coef", "0": "estimate"}
    )
    ci = pl.from_pandas(fit.conf_int(), include_index=True).rename(
        {"None": "coef", "0": "ci_lower", "1": "ci_upper"}
    )

    return coefs.join(ci, on="coef", how="inner", validate="1:1").with_columns(
        pl.exclude("coef").exp() if exp else []
    )
Linear Regression
We will start with a simple linear regression, both using ordinary least squares (OLS) and maximum likelihood (ML) to fit the model (if you haven’t learnt what this is yet, don’t worry, just follow along with the code and save the statistical understanding for later).
First we need to import and prepare our data. Here we will look at how lab values correlate with each other. The following will give us all MCH (mean corpuscular haemoglobin) and MCV (mean corpuscular volume) lab values that were registered on the same date.
lab_events = pl.read_csv("../../data/hosp/labevents.csv.gz", try_parse_dates=True)
lab_items = pl.read_csv("../../data/hosp/d_labitems.csv.gz")
lab = lab_events.join(lab_items, on="itemid", how="left", validate="m:m")
lab_values = (
    lab.sort("charttime")
    .select(
        "subject_id",
        date=col("charttime").dt.date(),
        lab="label",
        value="valuenum",
    )
    .drop_nulls()
    .unique(["subject_id", "date", "lab"], keep="first")
    .pivot("lab", index=["subject_id", "date"])
)
linear_data = lab_values.select("MCH", "MCV").drop_nulls()
Now that we have the data in the right format, let’s look at the scatter plot of MCH and MCV to see if there is any obvious association between the variables.
fig, ax = plt.subplots()
ax.scatter(*linear_data, alpha=0.1)
ax.grid(color="black", alpha=0.1)
ax.set_axisbelow(True)
ax.set_xlabel("MCH")
ax.set_ylabel("MCV")
fig.savefig("../../assets/img/modelling-analysis/figure-z3zH7o.svg")
It looks like there is an association between MCH and MCV. Let’s quantify it by assuming a linear association, fitting the model using OLS, and looking at a summary of the fit and the coefficients.
fit_ols = smf.ols("MCV ~ MCH", data=linear_data).fit()
fit_ols.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                    MCV   R-squared:                       0.734
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     6168.
Date:                Wed, 01 Oct 2025   Prob (F-statistic):               0.00
Time:                        20:23:03   Log-Likelihood:                -6296.9
No. Observations:                2236   AIC:                         1.260e+04
Df Residuals:                    2234   BIC:                         1.261e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     20.9119      0.905     23.102      0.000      19.137      22.687
MCH            2.3731      0.030     78.534      0.000       2.314       2.432
==============================================================================
Omnibus:                       35.806   Durbin-Watson:                   2.008
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               40.626
Skew:                           0.257   Prob(JB):                     1.51e-09
Kurtosis:                       3.413   Cond. No.                         317.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We can also extract the coefficients with confidence intervals using our convenience function.
extract_coefs(fit_ols)
┌───────────┬───────────┬───────────┬───────────┐
│ coef      ┆ estimate  ┆ ci_lower  ┆ ci_upper  │
│ ---       ┆ ---       ┆ ---       ┆ ---       │
│ str       ┆ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ Intercept ┆ 20.911902 ┆ 19.136749 ┆ 22.687056 │
│ MCH       ┆ 2.37314   ┆ 2.313882  ┆ 2.432399  │
└───────────┴───────────┴───────────┴───────────┘
We can also get the estimate and confidence interval for a specific linear combination of the coefficients, where we choose the values of the variables (in this case, 1 for the intercept and 30 for MCH).
est_lin_comb = fit_ols.t_test([1, 30])
est_lin_comb
                             Test for Constraints
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0            92.1061      0.086   1074.401      0.000      91.938      92.274
==============================================================================
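The linear combination [1, 30] corresponds to the intercept plus 30 times the MCH coefficient, so we can verify the estimate by hand (a quick check added here, not part of the original example).

# Reproduce the t_test estimate directly from the fitted coefficients
manual_estimate = fit_ols.params["Intercept"] + 30 * fit_ols.params["MCH"]
manual_estimate  # approximately 92.11, matching the estimate above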
Finally we can inspect the predicted values for each observation in the original data and compare them with the actual measured value.
data_predict = pl.concat(
    [
        linear_data,
        pl.from_pandas(fit_ols.predict(linear_data)).rename("MCV_predict").to_frame(),
    ],
    how="horizontal",
)
data_predict
┌──────┬───────┬─────────────┐
│ MCH  ┆ MCV   ┆ MCV_predict │
│ ---  ┆ ---   ┆ ---         │
│ f64  ┆ f64   ┆ f64         │
╞══════╪═══════╪═════════════╡
│ 20.8 ┆ 74.0  ┆ 70.273223   │
│ 29.5 ┆ 91.0  ┆ 90.919545   │
│ 26.8 ┆ 87.0  ┆ 84.512065   │
│ 31.0 ┆ 93.0  ┆ 94.479255   │
│ 35.4 ┆ 103.0 ┆ 104.921073  │
│ …    ┆ …     ┆ …           │
│ 29.3 ┆ 93.0  ┆ 90.444916   │
│ 30.0 ┆ 96.0  ┆ 92.106115   │
│ 31.8 ┆ 93.0  ┆ 96.377767   │
│ 30.9 ┆ 92.0  ┆ 94.241941   │
│ 24.3 ┆ 87.0  ┆ 78.579214   │
└──────┴───────┴─────────────┘
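To summarize how close the predictions are to the measured values overall, we could also compute a couple of common error metrics (a small sketch, not part of the original example).

# Mean absolute error and root-mean-square error of the predictions
prediction_errors = data_predict.select(
    mae=(col("MCV") - col("MCV_predict")).abs().mean(),
    rmse=((col("MCV") - col("MCV_predict")) ** 2).mean().sqrt(),
)
prediction_errors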
Now that we have a fitted model, we can plot the regression line in the figure to get a visual indication of how well the model fits the trend of the data.
x = np.linspace(20, 38, 100)
y = fit_ols.params["Intercept"] + fit_ols.params["MCH"] * x
ax.plot(x, y, color="red")
fig.savefig("../../assets/img/modelling-analysis/figure-YoccW3.svg")
It looks pretty good! But maybe the MCV values curve slightly upward at higher MCH values, so a better model could be a second degree polynomial (including MCH to the power of two).
fit_ols_2 = smf.ols("MCV ~ MCH + I(MCH ** 2)", data=linear_data.to_pandas()).fit()
fit_ols_2.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                    MCV   R-squared:                       0.743
Model:                            OLS   Adj. R-squared:                  0.743
Method:                 Least Squares   F-statistic:                     3227.
Date:                Wed, 01 Oct 2025   Prob (F-statistic):               0.00
Time:                        20:25:58   Log-Likelihood:                -6259.1
No. Observations:                2236   AIC:                         1.252e+04
Df Residuals:                    2233   BIC:                         1.254e+04
Df Model:                           2
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      71.7086      5.859     12.240      0.000      60.220      83.197
MCH            -1.1290      0.400     -2.820      0.005      -1.914      -0.344
I(MCH ** 2)     0.0598      0.007      8.772      0.000       0.046       0.073
===============================================================================
Omnibus:                       24.402   Durbin-Watson:                   2.024
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               29.443
Skew:                           0.177   Prob(JB):                     4.04e-07
Kurtosis:                       3.436   Cond. No.                     6.37e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.37e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Again extracting the coefficients.
extract_coefs(fit_ols_2)
┌─────────────┬───────────┬───────────┬───────────┐
│ coef        ┆ estimate  ┆ ci_lower  ┆ ci_upper  │
│ ---         ┆ ---       ┆ ---       ┆ ---       │
│ str         ┆ f64       ┆ f64       ┆ f64       │
╞═════════════╪═══════════╪═══════════╪═══════════╡
│ Intercept   ┆ 71.708557 ┆ 60.219768 ┆ 83.197345 │
│ MCH         ┆ -1.128983 ┆ -1.914034 ┆ -0.343932 │
│ I(MCH ** 2) ┆ 0.059779  ┆ 0.046416  ┆ 0.073142  │
└─────────────┴───────────┴───────────┴───────────┘
And plotting the resulting regression line in our scatter plot.
x_2 = np.linspace(20, 38, 100)
y_2 = (
    fit_ols_2.params["Intercept"]
    + fit_ols_2.params["MCH"] * x_2
    + fit_ols_2.params["I(MCH ** 2)"] * x_2**2
)
ax.plot(x_2, y_2, color="orange")
fig.savefig("../../assets/img/modelling-analysis/figure-dOPYMg.svg")
Maybe a little closer to the data!
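One way to put a number on that impression (a sketch, not part of the original example) is to compare the two fits with the Akaike information criterion (AIC), where a lower value indicates a better trade-off between fit and model complexity.

# Compare model fit: lower AIC is better after penalizing the extra parameter
print(f"Linear:    AIC = {fit_ols.aic:.0f}")
print(f"Quadratic: AIC = {fit_ols_2.aic:.0f}")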
For the two previous models, we used OLS as the fitting method.
This works well in the case of linear regression, but a more general approach is to use maximum likelihood estimation.
In statsmodels
we can do this by using the glm
method (GLM, Generalized Linear Model).
This can fit many types of models, but without specifying a distribution or link function it will default to a Gaussian family with an identity link, i.e. a simple linear regression.
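Spelled out explicitly, that default corresponds to a call roughly like the following (a sketch; it uses statsmodels.api, which is not imported above and is not needed for the actual fit below).

# Hypothetical explicit version of the default: Gaussian family, identity link
import statsmodels.api as sm

fit_ml_explicit = smf.glm(
    "MCV ~ MCH", data=linear_data, family=sm.families.Gaussian()
).fit()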
fit_ml = smf.glm("MCV ~ MCH", data=linear_data).fit()
fit_ml.summary()
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                    MCV   No. Observations:                 2236
Model:                            GLM   Df Residuals:                     2234
Model Family:                Gaussian   Df Model:                            1
Link Function:               Identity   Scale:                          16.368
Method:                          IRLS   Log-Likelihood:                -6296.9
Date:                Wed, 01 Oct 2025   Deviance:                       36567.
Time:                        20:27:08   Pearson chi2:                 3.66e+04
No. Iterations:                     3   Pseudo R-squ. (CS):             0.9366
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     20.9119      0.905     23.102      0.000      19.138      22.686
MCH            2.3731      0.030     78.534      0.000       2.314       2.432
==============================================================================
We can see that we get the same results as with the OLS method, which is what we should expect. Let’s extract the coefficients.
extract_coefs(fit_ml)
┌───────────┬───────────┬──────────┬───────────┐
│ coef      ┆ estimate  ┆ ci_lower ┆ ci_upper  │
│ ---       ┆ ---       ┆ ---      ┆ ---       │
│ str       ┆ f64       ┆ f64      ┆ f64       │
╞═══════════╪═══════════╪══════════╪═══════════╡
│ Intercept ┆ 20.911902 ┆ 19.13771 ┆ 22.686095 │
│ MCH       ┆ 2.37314   ┆ 2.313914 ┆ 2.432367  │
└───────────┴───────────┴──────────┴───────────┘
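As a quick numerical check (added here, not part of the original text), we can confirm that the ML and OLS point estimates agree.

# The two fitting methods give the same coefficients (up to numerical precision)
np.allclose(fit_ml.params, fit_ols.params)  # True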
And plot the regression line.
x_3 = np.linspace(20, 38, 100)
y_3 = fit_ml.params["Intercept"] + fit_ml.params["MCH"] * x_3
ax.plot(x_3, y_3, color="purple")
fig.savefig("../../assets/img/modelling-analysis/figure-4PtVY4.svg")
As expected, the regression line completely overlays the previous one from the OLS method.
Logistic Regression
When we have a binary outcome, logistic regression can be a better modelling approach. Let’s look at survival during ICU stays.
patients = pl.read_csv("../../data/hosp/patients.csv.gz", try_parse_dates=True)
icu_stays = pl.read_csv("../../data/icu/icustays.csv.gz", try_parse_dates=True)
icu_death = icu_stays.join(
    patients, on="subject_id", how="inner", validate="m:1"
).with_columns(
    died_in_icu=col("dod")
    .is_between("intime", "outtime", closed="both")
    .fill_null(False)
    .cast(int),
    age=col("anchor_age").add(col("intime").dt.year().sub(col("anchor_year"))),
)
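Before fitting the model, it can be worth checking how common the outcome is (a quick sanity check added here, not part of the original example).

# How many ICU stays ended in death vs. not?
icu_death["died_in_icu"].value_counts()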
We can fit a logistic regression model using logit
from statsmodels
.
fit_logit = smf.logit("died_in_icu ~ age + gender", data=icu_death.to_pandas()).fit()
fit_logit.summary()
                           Logit Regression Results
==============================================================================
Dep. Variable:            died_in_icu   No. Observations:                  140
Model:                          Logit   Df Residuals:                      137
Method:                           MLE   Df Model:                            2
Date:                Wed, 01 Oct 2025   Pseudo R-squ.:                 0.05498
Time:                        20:28:14   Log-Likelihood:                -34.044
converged:                       True   LL-Null:                       -36.025
Covariance Type:            nonrobust   LLR p-value:                    0.1380
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      -5.6636      1.826     -3.101      0.002      -9.243      -2.084
gender[T.M]     0.6971      0.720      0.968      0.333      -0.714       2.108
age             0.0399      0.024      1.644      0.100      -0.008       0.087
===============================================================================
And finally extract the coefficients, this time exponentiated to correspond to odds ratios (more on this in your statistics courses).
extract_coefs(fit_logit, exp=True)
┌─────────────┬──────────┬──────────┬──────────┐
│ coef        ┆ estimate ┆ ci_lower ┆ ci_upper │
│ ---         ┆ ---      ┆ ---      ┆ ---      │
│ str         ┆ f64      ┆ f64      ┆ f64      │
╞═════════════╪══════════╪══════════╪══════════╡
│ Intercept   ┆ 0.00347  ┆ 0.000097 ┆ 0.124402 │
│ gender[T.M] ┆ 2.008007 ┆ 0.489607 ┆ 8.23536  │
│ age         ┆ 1.040713 ┆ 0.992356 ┆ 1.091427 │
└─────────────┴──────────┴──────────┴──────────┘
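We can also use the fitted model to predict the probability of the outcome for a chosen covariate pattern, for example a hypothetical 80-year-old male (a sketch added here, not part of the original example; the pandas import is only needed for this snippet).

import pandas as pd

# Hypothetical new observation: 80-year-old male
new_patient = pd.DataFrame({"age": [80], "gender": ["M"]})
fit_logit.predict(new_patient)  # predicted probability of dying in the ICU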
Time-to-Event Analysis
When we have a time-to-event outcome, we can approximate the survival function with the Kaplan-Meier estimator and fit a model to the data using Cox regression.
For time-to-event analysis in Python, we will use the lifelines
package.
Kaplan-Meier
# Load example data and convert to a Polars DataFrame
waltons = pl.from_pandas(load_waltons())
control = waltons.filter(col("group") == "control")
mir = waltons.filter(col("group") == "miR-137")
kmf_control = KaplanMeierFitter()
kmf_mir = KaplanMeierFitter()
fig, ax = plt.subplots(constrained_layout=True)
fit_control = kmf_control.fit(control["T"], control["E"], label="Control")
kmf_control.plot_survival_function(ax=ax)
fit_mir = kmf_mir.fit(mir["T"], mir["E"], label="miR-137")
kmf_mir.plot_survival_function(ax=ax)
add_at_risk_counts(fit_control, fit_mir, ax=ax)
# Figure formatting
ax.grid(alpha=0.1)
ax.set_axisbelow(True)
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1))
ax.set_xlabel("Time")
ax.set_ylabel("Survival")
fig.savefig("../../assets/img/modelling-analysis/figure-RmuJSA.svg")
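If we want a formal comparison of the two survival curves (a possible follow-up, not part of the original example), lifelines also provides a log-rank test.

from lifelines.statistics import logrank_test

# Compare survival between the control and miR-137 groups
logrank_result = logrank_test(control["T"], mir["T"], control["E"], mir["E"])
logrank_result.print_summary()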
Cox Regression
# Load example data. Unfortunately, lifelines is not compatible with Polars
rossi = load_rossi()
cph = CoxPHFitter()
cph.fit(
    rossi,
    duration_col="week",
    event_col="arrest",
    formula="fin + wexp + age * prio",
)
cph.print_summary()
             duration col = 'week'
                event col = 'arrest'
      baseline estimation = breslow
   number of observations = 432
number of events observed = 114
   partial log-likelihood = -659.39
         time fit was run = 2025-10-01 19:03:00 UTC

---
            coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%
covariate
fin        -0.33       0.72      0.19           -0.70            0.04                 0.49                 1.05
wexp       -0.24       0.79      0.21           -0.65            0.17                 0.52                 1.19
age        -0.03       0.97      0.03           -0.09            0.03                 0.92                 1.03
prio        0.31       1.36      0.17           -0.03            0.64                 0.97                 1.90
age:prio   -0.01       0.99      0.01           -0.02            0.01                 0.98                 1.01

           cmp to      z     p   -log2(p)
covariate
fin          0.00  -1.73  0.08       3.57
wexp         0.00  -1.14  0.26       1.97
age          0.00  -0.93  0.35       1.51
prio         0.00   1.80  0.07       3.80
age:prio     0.00  -1.28  0.20       2.32
---
Concordance = 0.64
Partial AIC = 1328.77
log-likelihood ratio test = 31.99 on 5 df
-log2(p) of ll-ratio test = 17.35
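A couple of possible next steps (a sketch, not part of the original example): lifelines can test the proportional hazards assumption and plot the estimated coefficients with their confidence intervals.

# Check the proportional hazards assumption for the fitted model
cph.check_assumptions(rossi, p_value_threshold=0.05)

# Plot the coefficient estimates with 95% confidence intervals
fig, ax = plt.subplots()
cph.plot(ax=ax)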
Prediction Modelling
# Example from the scikit-learn documentation:
# https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
# Load example data in scikit-learn format
iris: Bunch = load_iris(as_frame=True)  # type: ignore
X = iris.data[["sepal length (cm)", "sepal width (cm)"]]
y = iris.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# Processing pipeline
clf = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=11)),
    ]
)
# Plot the decision boundaries
fig, ax = plt.subplots()
clf.fit(X_train, y_train)
DecisionBoundaryDisplay.from_estimator(
    clf,
    X_test,
    response_method="predict",
    plot_method="pcolormesh",
    xlabel=iris.feature_names[0],
    ylabel=iris.feature_names[1],
    shading="auto",
    alpha=0.5,
    ax=ax,
)
scatter = ax.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, edgecolors="k")
ax.legend(scatter.legend_elements()[0], iris.target_names, title="Classes")
fig.savefig("../../assets/img/modelling-analysis/figure-yOAmiH.svg")
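Finally, as a rough indication of predictive performance (added here, not part of the original example), we can score the fitted pipeline on the held-out test data.

# Mean accuracy on the test split
clf.score(X_test, y_test)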