Modelling / Analysis
Try out the basics of modelling and analysis with statsmodels, lifelines, and scikit-learn.
Become a p-Hacker 👨‍💻
p-Hacking is the misuse of statistics to find patterns in data without accounting for the risk of false positives. This is of course not something that we should be doing in our research, so let's try it out now, while we're on a course and no harm can be done. Explore the lab events data and try to find interesting associations between lab values. First plot a scatter plot of the values, then fit a model (OLS or ML), and finally plot the regression line over the scatter plot. You can also try to model non-linear relationships by adding polynomial terms. Happy hunting!
# %%
# Import packages (copy this from the solution)

# %%
# Convenience function (copy this from the solution)

# %%
# Data management (copy this from the solution)

# %%
# Assess what lab values are the most common (copy this from the solution)

# %%
# Replace MCH and MCV with other lab values (copy this from the solution)

# %%
# --- START HERE ---
# Assess the association with a scatter plot, change alpha as needed

# %%
# Fit a linear regression using OLS and inspect the summary

# %%
# Extract the coefficients with confidence intervals

# %%
# Plot the regression line
# %%
# Import packages (copy this from the solution)
import numpy as np
import polars as pl
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt
from polars import col
# %%
# Convenience function (copy this from the solution)
def extract_coefs(fit, exp=False):
    coefs = pl.from_pandas(fit.params, include_index=True).rename(
        {"index": "coef", "0": "estimate"}
    )
    ci = pl.from_pandas(fit.conf_int(), include_index=True).rename(
        {"None": "coef", "0": "ci_lower", "1": "ci_upper"}
    )
    return coefs.join(ci, on="coef", how="inner", validate="1:1").with_columns(
        pl.exclude("coef").exp() if exp else []
    )
# %%
# Data management (copy this from the solution)
lab_events = pl.read_csv("../../../data/hosp/labevents.csv.gz", try_parse_dates=True)
lab_items = pl.read_csv("../../../data/hosp/d_labitems.csv.gz")

lab = lab_events.join(lab_items, on="itemid", how="left", validate="m:m")

lab_values = (
    lab.sort("charttime")
    .select(
        "subject_id",
        date=col("charttime").dt.date(),
        lab="label",
        value="valuenum",
    )
    .drop_nulls()
    .unique(["subject_id", "date", "lab"], keep="first")
    .pivot("lab", index=["subject_id", "date"])
)
# %%
# Assess what lab values are the most common (copy this from the solution)
(
    lab_values.select(pl.exclude("subject_id", "date").is_not_null().sum())
    .unpivot(variable_name="lab", value_name="observations")
    .sort("observations", descending=True)
)
# %%
# Replace MCH and MCV with other lab values (copy this from the solution)
data = lab_values.select("MCH", "MCV").drop_nulls()
# %%
# --- START HERE ---
# Assess the association with a scatter plot, change alpha as needed
fig, ax = plt.subplots()
ax.scatter(*data, alpha=0.1)
ax.set_xlabel(data.columns[0])
ax.set_ylabel(data.columns[1])

ax.grid(color="black", alpha=0.1)
ax.set_axisbelow(True)
# %%
# Fit a linear regression using OLS and inspect the summary
fit_ols = smf.ols("MCV ~ MCH", data=data).fit()
fit_ols.summary()
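The task description also allows fitting by maximum likelihood ("ML") instead of OLS. As a minimal sketch, not part of the original solution and assuming the same data frame works with the formula interface as in the OLS cell above, a Gaussian GLM in statsmodels is fitted by iteratively reweighted least squares (maximum likelihood) and should give essentially the same estimates:

# %%
# Optional sketch: fit the same model by maximum likelihood via a Gaussian GLM.
# Not part of the original solution; assumes `data` works with the formula API
# just as in the OLS cell above.
import statsmodels.api as sm

fit_ml = smf.glm("MCV ~ MCH", data=data, family=sm.families.Gaussian()).fit()
fit_ml.summary()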
# %%
# Extract the coefficients with confidence intervals
coefs = extract_coefs(fit_ols)
# %%
# Plot the regression line
x = np.linspace(20, 38, 100)
y = fit_ols.params["Intercept"] + fit_ols.params["MCH"] * x

ax.plot(x, y, color="red")
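If you want to follow the hint about modelling non-linear relationships with polynomial terms, the sketch below shows one way to do it. This is not part of the original solution, and it assumes the patsy-style I() transform works with the same data frame and that the x grid from the previous cell is still in scope:

# %%
# Optional sketch: add a quadratic term to model a non-linear relationship.
# Not part of the original solution; `I(MCH ** 2)` is a patsy-style transform
# evaluated from the same `data` frame used above.
fit_poly = smf.ols("MCV ~ MCH + I(MCH ** 2)", data=data).fit()
fit_poly.summary()

# %%
# Predict over the same x grid as the linear fit and draw the curve on the
# existing scatter plot for comparison.
y_poly = fit_poly.predict({"MCH": x})
ax.plot(x, y_poly, color="blue")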