Modelling / Analysis
Try out the basics of modelling and analysis with statsmodels, lifelines, and scikit-learn.
Become a p-Hacker 👨‍💻
p-Hacking is the misuse of statistics to find patterns in data without accounting for the risk of false positives. This is of course not something that we should be doing in our research, so let's try it out now, while we're on a course and no harm can be done. Explore the lab events data and try to find interesting associations between lab values. First plot a scatter plot of the values, then fit a model (OLS or ML), and finally plot the regression line over the scatter plot. You can also try to model non-linear relationships by adding polynomial terms. Happy hunting!
# %%
# Import packages (copy this from the solution)

# %%
# Convenience function (copy this from the solution)

# %%
# Data management (copy this from the solution)

# %%
# Assess what lab values are the most common (copy this from the solution)

# %%
# Replace MCH and MCV with other lab values (copy this from the solution)

# %%
# --- START HERE ---
# Assess the association with a scatter plot, change alpha as needed

# %%
# Fit a linear regression using OLS and inspect the summary

# %%
# Extract the coefficients with confidence intervals

# %%
# Plot the regression line
# %%
# Import packages (copy this from the solution)
import numpy as np
import polars as pl
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt
from polars import col
# %%
# Convenience function (copy this from the solution)
def extract_coefs(fit, exp=False):
    coefs = pl.from_pandas(fit.params, include_index=True).rename(
        {"index": "coef", "0": "estimate"}
    )
    ci = pl.from_pandas(fit.conf_int(), include_index=True).rename(
        {"None": "coef", "0": "ci_lower", "1": "ci_upper"}
    )
    return coefs.join(ci, on="coef", how="inner", validate="1:1").with_columns(
        pl.exclude("coef").exp() if exp else []
    )
# %%
# Data management (copy this from the solution)
lab_events = pl.read_csv("../../../data/hosp/labevents.csv.gz", try_parse_dates=True)
lab_items = pl.read_csv("../../../data/hosp/d_labitems.csv.gz")

lab = lab_events.join(lab_items, on="itemid", how="left", validate="m:m")

lab_values = (
    lab.sort("charttime")
    .select(
        "subject_id",
        date=col("charttime").dt.date(),
        lab="label",
        value="valuenum",
    )
    .drop_nulls()
    .unique(["subject_id", "date", "lab"], keep="first")
    .pivot("lab", index=["subject_id", "date"])
)
# %%
# Assess what lab values are the most common (copy this from the solution)
(
    lab_values.select(pl.exclude("subject_id", "date").is_not_null().sum())
    .unpivot(variable_name="lab", value_name="observations")
    .sort("observations", descending=True)
)
# %%
# Replace MCH and MCV with other lab values (copy this from the solution)
data = lab_values.select("MCH", "MCV").drop_nulls()
# %%
# --- START HERE ---
# Assess the association with a scatter plot, change alpha as needed
fig, ax = plt.subplots()
ax.scatter(*data, alpha=0.1)
ax.set_xlabel(data.columns[0])
ax.set_ylabel(data.columns[1])

ax.grid(color="black", alpha=0.1)
ax.set_axisbelow(True)
# %%
# Fit a linear regression using OLS and inspect the summary
fit_ols = smf.ols("MCV ~ MCH", data=data).fit()
fit_ols.summary()
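The task description also allows fitting by maximum likelihood ("ML") instead of OLS. As a minimal sketch, not part of the original solution and assuming the same data frame works with the formula interface as in the OLS cell above, a Gaussian GLM in statsmodels is fitted by iteratively reweighted least squares (maximum likelihood) and should give essentially the same estimates:

# %%
# Optional sketch: fit the same model by maximum likelihood via a Gaussian GLM.
# Not part of the original solution; assumes `data` works with the formula API
# just as in the OLS cell above.
import statsmodels.api as sm

fit_ml = smf.glm("MCV ~ MCH", data=data, family=sm.families.Gaussian()).fit()
fit_ml.summary()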
# %%
# Extract the coefficients with confidence intervals
coefs = extract_coefs(fit_ols)
# %%
# Plot the regression line
x = np.linspace(20, 38, 100)
y = fit_ols.params["Intercept"] + fit_ols.params["MCH"] * x

ax.plot(x, y, color="red")
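If you want to follow the hint about modelling non-linear relationships with polynomial terms, the sketch below shows one way to do it. This is not part of the original solution, and it assumes the patsy-style I() transform works with the same data frame and that the x grid from the previous cell is still in scope:

# %%
# Optional sketch: add a quadratic term to model a non-linear relationship.
# Not part of the original solution; `I(MCH ** 2)` is a patsy-style transform
# evaluated from the same `data` frame used above.
fit_poly = smf.ols("MCV ~ MCH + I(MCH ** 2)", data=data).fit()
fit_poly.summary()

# %%
# Predict over the same x grid as the linear fit and draw the curve on the
# existing scatter plot for comparison.
y_poly = fit_poly.predict({"MCH": x})
ax.plot(x, y_poly, color="blue")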