Modelling/Analysis
In this chapter we will explore how to model/analyze our data using the packages statsmodels
for general statistical modelling, lifelines
for time-to-event modelling, and scikit-learn
for prediction modelling and machine learning.
We will also be using polars
for data management, matplotlib
for plotting, and numpy
for some simple functions.
import numpy as np
import polars as pl
import statsmodels.formula.api as smf
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.datasets import load_rossi, load_waltons
from lifelines.plotting import add_at_risk_counts
from matplotlib import pyplot as plt
from matplotlib.ticker import PercentFormatter
from polars import col
from sklearn.datasets import load_iris
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import Bunch
Unfortunately, statsmodels
returns its result as a pandas
DataFrame (the older data frame library), but we can easily extract the coefficients and confidence intervals using a simple convenience function.
def extract_coefs(fit, exp=False):
    coefs = pl.from_pandas(fit.params, include_index=True).rename(
        {"index": "coef", "0": "estimate"}
    )
    ci = pl.from_pandas(fit.conf_int(), include_index=True).rename(
        {"None": "coef", "0": "ci_lower", "1": "ci_upper"}
    )

    return coefs.join(ci, on="coef", how="inner", validate="1:1").with_columns(
        pl.exclude("coef").exp() if exp else []
    )
Linear Regression
We will start with a simple linear regression, both using ordinary least squares (OLS) and maximum likelihood (ML) to fit the model (if you haven’t learnt what this is yet, don’t worry, just follow along with the code and save the statistical understanding for later).
First we need to import and prepare our data. Here we will look at how lab values correlate with each other. The following will give us all MCH (mean corpuscular haemoglobin) and MCV (mean corpuscular volume) lab values that were registered on the same date.
lab_events = pl.read_csv("../../data/hosp/labevents.csv.gz", try_parse_dates=True)
lab_items = pl.read_csv("../../data/hosp/d_labitems.csv.gz")
lab = lab_events.join(lab_items, on="itemid", how="left", validate="m:m")
lab_values = (
    lab.sort("charttime")
    .select(
        "subject_id",
        date=col("charttime").dt.date(),
        lab="label",
        value="valuenum",
    )
    .drop_nulls()
    .unique(["subject_id", "date", "lab"], keep="first")
    .pivot("lab", index=["subject_id", "date"])
)
linear_data = lab_values.select("MCH", "MCV").drop_nulls()
Now that we have the data in the right format, let’s look at the scatter plot of MCH and MCV to see if there is any obvious association between the variables.
fig, ax = plt.subplots()
ax.scatter(*linear_data, alpha=0.1)
ax.grid(color="black", alpha=0.1)
ax.set_axisbelow(True)
ax.set_xlabel("MCH")
ax.set_ylabel("MCV")
fig.savefig("../../assets/img/modelling-analysis/figure-z3zH7o.svg")
It looks like there is an association between MCH and MCV. Let’s quantify it by assuming a linear association, fitting the model using OLS, and looking at a summary of the fit and the coefficients.
fit_ols = smf.ols("MCV ~ MCH", data=linear_data).fit()
fit_ols.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                    MCV   R-squared:                       0.734
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     6168.
Date:                Wed, 01 Oct 2025   Prob (F-statistic):               0.00
Time:                        20:23:03   Log-Likelihood:                -6296.9
No. Observations:                2236   AIC:                         1.260e+04
Df Residuals:                    2234   BIC:                         1.261e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     20.9119      0.905     23.102      0.000      19.137      22.687
MCH            2.3731      0.030     78.534      0.000       2.314       2.432
==============================================================================
Omnibus:                       35.806   Durbin-Watson:                   2.008
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               40.626
Skew:                           0.257   Prob(JB):                     1.51e-09
Kurtosis:                       3.413   Cond. No.                         317.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We can also extract the coefficients with confidence intervals using our convenience function.
extract_coefs(fit_ols)
┌───────────┬───────────┬───────────┬───────────┐
│ coef      ┆ estimate  ┆ ci_lower  ┆ ci_upper  │
│ ---       ┆ ---       ┆ ---       ┆ ---       │
│ str       ┆ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ Intercept ┆ 20.911902 ┆ 19.136749 ┆ 22.687056 │
│ MCH       ┆ 2.37314   ┆ 2.313882  ┆ 2.432399  │
└───────────┴───────────┴───────────┴───────────┘
We can also get the estimate and confidence interval for a specific linear combination of the coefficients, where we choose the values of the variables (in this case, 1 for the intercept and 30 for MCH).
est_lin_comb = fit_ols.t_test([1, 30])
est_lin_comb
                             Test for Constraints
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0            92.1061      0.086   1074.401      0.000      91.938      92.274
==============================================================================
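The linear combination [1, 30] corresponds to the intercept plus 30 times the MCH coefficient, so we can verify the estimate by hand (a quick check added here, not part of the original example).

# Reproduce the t_test estimate directly from the fitted coefficients
manual_estimate = fit_ols.params["Intercept"] + 30 * fit_ols.params["MCH"]
manual_estimate  # approximately 92.11, matching the estimate above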
Finally we can inspect the predicted values for each observation in the original data and compare them with the actual measured value.
data_predict = pl.concat(
    [
        linear_data,
        pl.from_pandas(fit_ols.predict(linear_data)).rename("MCV_predict").to_frame(),
    ],
    how="horizontal",
)
data_predict
┌──────┬───────┬─────────────┐
│ MCH  ┆ MCV   ┆ MCV_predict │
│ ---  ┆ ---   ┆ ---         │
│ f64  ┆ f64   ┆ f64         │
╞══════╪═══════╪═════════════╡
│ 20.8 ┆ 74.0  ┆ 70.273223   │
│ 29.5 ┆ 91.0  ┆ 90.919545   │
│ 26.8 ┆ 87.0  ┆ 84.512065   │
│ 31.0 ┆ 93.0  ┆ 94.479255   │
│ 35.4 ┆ 103.0 ┆ 104.921073  │
│ …    ┆ …     ┆ …           │
│ 29.3 ┆ 93.0  ┆ 90.444916   │
│ 30.0 ┆ 96.0  ┆ 92.106115   │
│ 31.8 ┆ 93.0  ┆ 96.377767   │
│ 30.9 ┆ 92.0  ┆ 94.241941   │
│ 24.3 ┆ 87.0  ┆ 78.579214   │
└──────┴───────┴─────────────┘
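To summarize how close the predictions are to the measured values overall, we could also compute a couple of common error metrics (a small sketch, not part of the original example).

# Mean absolute error and root-mean-square error of the predictions
prediction_errors = data_predict.select(
    mae=(col("MCV") - col("MCV_predict")).abs().mean(),
    rmse=((col("MCV") - col("MCV_predict")) ** 2).mean().sqrt(),
)
prediction_errors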
Now that we have a fitted model, we can plot the regression line in the figure to get a visual indication of how well the model fits the trend of the data.
x = np.linspace(20, 38, 100)
y = fit_ols.params["Intercept"] + fit_ols.params["MCH"] * x
ax.plot(x, y, color="red")
fig.savefig("../../assets/img/modelling-analysis/figure-YoccW3.svg")
It looks pretty good! But maybe the MCV values curve slightly upward at higher MCH values, so a better model could be a second degree polynomial (including MCH to the power of two).
fit_ols_2 = smf.ols("MCV ~ MCH + I(MCH ** 2)", data=linear_data.to_pandas()).fit()
fit_ols_2.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                    MCV   R-squared:                       0.743
Model:                            OLS   Adj. R-squared:                  0.743
Method:                 Least Squares   F-statistic:                     3227.
Date:                Wed, 01 Oct 2025   Prob (F-statistic):               0.00
Time:                        20:25:58   Log-Likelihood:                -6259.1
No. Observations:                2236   AIC:                         1.252e+04
Df Residuals:                    2233   BIC:                         1.254e+04
Df Model:                           2
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      71.7086      5.859     12.240      0.000      60.220      83.197
MCH            -1.1290      0.400     -2.820      0.005      -1.914      -0.344
I(MCH ** 2)     0.0598      0.007      8.772      0.000       0.046       0.073
===============================================================================
Omnibus:                       24.402   Durbin-Watson:                   2.024
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               29.443
Skew:                           0.177   Prob(JB):                     4.04e-07
Kurtosis:                       3.436   Cond. No.                     6.37e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.37e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Again extracting the coefficients.
extract_coefs(fit_ols_2)
┌─────────────┬───────────┬───────────┬───────────┐
│ coef        ┆ estimate  ┆ ci_lower  ┆ ci_upper  │
│ ---         ┆ ---       ┆ ---       ┆ ---       │
│ str         ┆ f64       ┆ f64       ┆ f64       │
╞═════════════╪═══════════╪═══════════╪═══════════╡
│ Intercept   ┆ 71.708557 ┆ 60.219768 ┆ 83.197345 │
│ MCH         ┆ -1.128983 ┆ -1.914034 ┆ -0.343932 │
│ I(MCH ** 2) ┆ 0.059779  ┆ 0.046416  ┆ 0.073142  │
└─────────────┴───────────┴───────────┴───────────┘
And plotting the resulting regression line in our scatter plot.
x_2 = np.linspace(20, 38, 100)
y_2 = (
    fit_ols_2.params["Intercept"]
    + fit_ols_2.params["MCH"] * x_2
    + fit_ols_2.params["I(MCH ** 2)"] * x_2**2
)
ax.plot(x_2, y_2, color="orange")
fig.savefig("../../assets/img/modelling-analysis/figure-dOPYMg.svg")
Maybe a little closer to the data!
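One way to put a number on that impression (a sketch, not part of the original example) is to compare the two fits with the Akaike information criterion (AIC), where a lower value indicates a better trade-off between fit and model complexity.

# Compare model fit: lower AIC is better after penalizing the extra parameter
print(f"Linear:    AIC = {fit_ols.aic:.0f}")
print(f"Quadratic: AIC = {fit_ols_2.aic:.0f}")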
For the two previous models, we used OLS as the fitting method.
This works well in the case of linear regression, but a more general approach is to use maximum likelihood estimation.
In statsmodels
we can do this by using the glm
method (GLM, Generalized Linear Model).
This can fit many types of models, but without specifying a distribution or link function it will default to a Gaussian family with an identity link, i.e. a simple linear regression.
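Spelled out explicitly, that default corresponds to a call roughly like the following (a sketch; it uses statsmodels.api, which is not imported above and is not needed for the actual fit below).

# Hypothetical explicit version of the default: Gaussian family, identity link
import statsmodels.api as sm

fit_ml_explicit = smf.glm(
    "MCV ~ MCH", data=linear_data, family=sm.families.Gaussian()
).fit()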
fit_ml = smf.glm("MCV ~ MCH", data=linear_data).fit()
fit_ml.summary()
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                    MCV   No. Observations:                 2236
Model:                            GLM   Df Residuals:                     2234
Model Family:                Gaussian   Df Model:                            1
Link Function:               Identity   Scale:                          16.368
Method:                          IRLS   Log-Likelihood:                -6296.9
Date:                Wed, 01 Oct 2025   Deviance:                       36567.
Time:                        20:27:08   Pearson chi2:                 3.66e+04
No. Iterations:                     3   Pseudo R-squ. (CS):             0.9366
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     20.9119      0.905     23.102      0.000      19.138      22.686
MCH            2.3731      0.030     78.534      0.000       2.314       2.432
==============================================================================
We can see that we get the same results as with the OLS method, which is what we should expect. Let’s extract the coefficients.
extract_coefs(fit_ml)
┌───────────┬───────────┬──────────┬───────────┐
│ coef      ┆ estimate  ┆ ci_lower ┆ ci_upper  │
│ ---       ┆ ---       ┆ ---      ┆ ---       │
│ str       ┆ f64       ┆ f64      ┆ f64       │
╞═══════════╪═══════════╪══════════╪═══════════╡
│ Intercept ┆ 20.911902 ┆ 19.13771 ┆ 22.686095 │
│ MCH       ┆ 2.37314   ┆ 2.313914 ┆ 2.432367  │
└───────────┴───────────┴──────────┴───────────┘
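As a quick numerical check (added here, not part of the original text), we can confirm that the ML and OLS point estimates agree.

# The two fitting methods give the same coefficients (up to numerical precision)
np.allclose(fit_ml.params, fit_ols.params)  # True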
And plot the regression line.
x_3 = np.linspace(20, 38, 100)
y_3 = fit_ml.params["Intercept"] + fit_ml.params["MCH"] * x_3
ax.plot(x_3, y_3, color="purple")
fig.savefig("../../assets/img/modelling-analysis/figure-4PtVY4.svg")
As expected, the regression line completely overlays the previous one from the OLS method.
Logistic Regression
When we have a binary outcome, logistic regression can be a better modelling approach. Let’s look at survival during ICU stays.
patients = pl.read_csv("../../data/hosp/patients.csv.gz", try_parse_dates=True)
icu_stays = pl.read_csv("../../data/icu/icustays.csv.gz", try_parse_dates=True)
icu_death = icu_stays.join(
    patients, on="subject_id", how="inner", validate="m:1"
).with_columns(
    died_in_icu=col("dod")
    .is_between("intime", "outtime", closed="both")
    .fill_null(False)
    .cast(int),
    age=col("anchor_age").add(col("intime").dt.year().sub(col("anchor_year"))),
)
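Before fitting the model, it can be worth checking how common the outcome is (a quick sanity check added here, not part of the original example).

# How many ICU stays ended in death vs. not?
icu_death["died_in_icu"].value_counts()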
We can fit a logistic regression model using logit
from statsmodels
.
fit_logit = smf.logit("died_in_icu ~ age + gender", data=icu_death.to_pandas()).fit()
fit_logit.summary()
                           Logit Regression Results
==============================================================================
Dep. Variable:            died_in_icu   No. Observations:                  140
Model:                          Logit   Df Residuals:                      137
Method:                           MLE   Df Model:                            2
Date:                Wed, 01 Oct 2025   Pseudo R-squ.:                 0.05498
Time:                        20:28:14   Log-Likelihood:                -34.044
converged:                       True   LL-Null:                       -36.025
Covariance Type:            nonrobust   LLR p-value:                    0.1380
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      -5.6636      1.826     -3.101      0.002      -9.243      -2.084
gender[T.M]     0.6971      0.720      0.968      0.333      -0.714       2.108
age             0.0399      0.024      1.644      0.100      -0.008       0.087
===============================================================================
And finally extract the coefficients, this time exponentiated to correspond to odds ratios (more on this in your statistics courses).
extract_coefs(fit_logit, exp=True)
┌─────────────┬──────────┬──────────┬──────────┐
│ coef        ┆ estimate ┆ ci_lower ┆ ci_upper │
│ ---         ┆ ---      ┆ ---      ┆ ---      │
│ str         ┆ f64      ┆ f64      ┆ f64      │
╞═════════════╪══════════╪══════════╪══════════╡
│ Intercept   ┆ 0.00347  ┆ 0.000097 ┆ 0.124402 │
│ gender[T.M] ┆ 2.008007 ┆ 0.489607 ┆ 8.23536  │
│ age         ┆ 1.040713 ┆ 0.992356 ┆ 1.091427 │
└─────────────┴──────────┴──────────┴──────────┘
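We can also use the fitted model to predict the probability of the outcome for a chosen covariate pattern, for example a hypothetical 80-year-old male (a sketch added here, not part of the original example; the pandas import is only needed for this snippet).

import pandas as pd

# Hypothetical new observation: 80-year-old male
new_patient = pd.DataFrame({"age": [80], "gender": ["M"]})
fit_logit.predict(new_patient)  # predicted probability of dying in the ICU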
Time-to-Event Analysis
When we have a time-to-event outcome, we can approximate the survival function with the Kaplan-Meier estimator and fit a model to the data using Cox regression.
For time-to-event analysis in Python, we will use the lifelines
package.
Kaplan-Meier
# Load example data and convert to a Polars DataFrame
waltons = pl.from_pandas(load_waltons())
control = waltons.filter(col("group") == "control")
mir = waltons.filter(col("group") == "miR-137")
kmf_control = KaplanMeierFitter()
kmf_mir = KaplanMeierFitter()
fig, ax = plt.subplots(constrained_layout=True)
fit_control = kmf_control.fit(control["T"], control["E"], label="Control")
kmf_control.plot_survival_function(ax=ax)
fit_mir = kmf_mir.fit(mir["T"], mir["E"], label="miR-137")
kmf_mir.plot_survival_function(ax=ax)
add_at_risk_counts(fit_control, fit_mir, ax=ax)
# Figure formatting
ax.grid(alpha=0.1)
ax.set_axisbelow(True)
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1))
ax.set_xlabel("Time")
ax.set_ylabel("Survival")
fig.savefig("../../assets/img/modelling-analysis/figure-RmuJSA.svg")
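If we want a formal comparison of the two survival curves (a possible follow-up, not part of the original example), lifelines also provides a log-rank test.

from lifelines.statistics import logrank_test

# Compare survival between the control and miR-137 groups
logrank_result = logrank_test(control["T"], mir["T"], control["E"], mir["E"])
logrank_result.print_summary()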
Cox Regression
# Load example data. Unfortunately, lifelines is not compatible with Polars
rossi = load_rossi()
cph = CoxPHFitter()
cph.fit(
    rossi,
    duration_col="week",
    event_col="arrest",
    formula="fin + wexp + age * prio",
)
cph.print_summary()
             duration col = 'week'
                event col = 'arrest'
      baseline estimation = breslow
   number of observations = 432
number of events observed = 114
   partial log-likelihood = -659.39
         time fit was run = 2025-10-01 19:03:00 UTC

---
            coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%
covariate
fin        -0.33       0.72      0.19           -0.70            0.04                 0.49                 1.05
wexp       -0.24       0.79      0.21           -0.65            0.17                 0.52                 1.19
age        -0.03       0.97      0.03           -0.09            0.03                 0.92                 1.03
prio        0.31       1.36      0.17           -0.03            0.64                 0.97                 1.90
age:prio   -0.01       0.99      0.01           -0.02            0.01                 0.98                 1.01

           cmp to      z     p   -log2(p)
covariate
fin          0.00  -1.73  0.08       3.57
wexp         0.00  -1.14  0.26       1.97
age          0.00  -0.93  0.35       1.51
prio         0.00   1.80  0.07       3.80
age:prio     0.00  -1.28  0.20       2.32
---
Concordance = 0.64
Partial AIC = 1328.77
log-likelihood ratio test = 31.99 on 5 df
-log2(p) of ll-ratio test = 17.35
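A couple of possible next steps (a sketch, not part of the original example): lifelines can test the proportional hazards assumption and plot the estimated coefficients with their confidence intervals.

# Check the proportional hazards assumption for the fitted model
cph.check_assumptions(rossi, p_value_threshold=0.05)

# Plot the coefficient estimates with 95% confidence intervals
fig, ax = plt.subplots()
cph.plot(ax=ax)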
Prediction Modelling
# Example from the scikit-learn documentation:
# https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
# Load example data in scikit-learn format
iris: Bunch = load_iris(as_frame=True)  # type: ignore
X = iris.data[["sepal length (cm)", "sepal width (cm)"]]
y = iris.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# Processing pipeline
clf = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=11)),
    ]
)
# Plot the decision boundaries
fig, ax = plt.subplots()
clf.fit(X_train, y_train)
DecisionBoundaryDisplay.from_estimator(
    clf,
    X_test,
    response_method="predict",
    plot_method="pcolormesh",
    xlabel=iris.feature_names[0],
    ylabel=iris.feature_names[1],
    shading="auto",
    alpha=0.5,
    ax=ax,
)
scatter = ax.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, edgecolors="k")
ax.legend(scatter.legend_elements()[0], iris.target_names, title="Classes")
fig.savefig("../../assets/img/modelling-analysis/figure-yOAmiH.svg")
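Finally, as a rough indication of predictive performance (added here, not part of the original example), we can score the fitted pipeline on the held-out test data.

# Mean accuracy on the test split
clf.score(X_test, y_test)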