Course 5316: Fundamentals of using Python in Health Related Research ¶
Peter Alping, PhD ¶
Nicola Orsini, PhD ¶
Day 4 - Linear and logistic regression models ¶
What are you going to learn?
- specify a linear and logistic regression model
- conduct univariate and multivariable tests of hypotheses
- derive predicted outcomes and their changes for any linear combination
- present the estimated model in a graphical form
- model categorical predictors with dummy variables
- model quantitative predictors with splines
Package¶
statsmodels¶
statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from matplotlib.ticker import FormatStrFormatter
Motivating example: observational cohort study¶
df = pd.read_stata('http://www.stats4life.se/data/marathon.dta', convert_categoricals=False)
Conditional mean¶
Consider a linear regression model to conduct statistical inference on the mean outcome conditionally on the values of the predictor.
$E(Y|x) = \beta_0 + \beta_1x $
For example, let's conduct statistical inference on the mean serum sodium concentration depending on weight change.
$E(\texttt{na}|\texttt{wtdiff}) = \beta_0 + \beta_1\texttt{wtdiff}$
# A scatter plot of individual outcomes can be a good starting point
# rcParams must be set before the axes are created for the spine settings to take effect
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.top'] = False
plt.scatter(df.wtdiff, df.na, c='r', alpha=.2)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.show()
# Estimates of the model parameter using Ordinary Least Squares (OLS) method
model = smf.ols('na ~ wtdiff', data=df).fit()
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: na R-squared: 0.169 Model: OLS Adj. R-squared: 0.167 Method: Least Squares F-statistic: 91.86 Date: Thu, 02 Oct 2025 Prob (F-statistic): 6.21e-20 Time: 07:55:02 Log-Likelihood: -1317.2 No. Observations: 455 AIC: 2638. Df Residuals: 453 BIC: 2647. Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 139.5534 0.223 624.476 0.000 139.114 139.993 wtdiff -1.2183 0.127 -9.584 0.000 -1.468 -0.968 ============================================================================== Omnibus: 65.662 Durbin-Watson: 1.284 Prob(Omnibus): 0.000 Jarque-Bera (JB): 327.577 Skew: -0.491 Prob(JB): 7.37e-72 Kurtosis: 7.039 Cond. No. 2.04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Store the predicted mean outcome¶
# Add a column in the dataframe containing the predicted outcomes
df['fit'] = model.fittedvalues
# graph individual data points and the predicted mean outcome
plt.scatter(df.wtdiff, df.na, c='r', alpha=.2)
plt.plot(df.wtdiff, df.fit, c='b', alpha=.3 , linewidth=5)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.show()
# comparison of the predicted mean outcome vs observed outcome
subset = np.column_stack((df['wtdiff'], df['na'], df['fit']))
subset[10:30]
array([[ nan, 141. , nan], [ nan, 143. , nan], [ nan, 141. , nan], [ nan, 141. , nan], [ nan, 140. , nan], [ nan, 142. , nan], [ nan, 143. , nan], [ 0.68181818, 149. , 138.72278443], [ -0.90909091, 142. , 140.66096356], [ 1.13636364, 141. , 138.16901896], [ -0.68181818, 138. , 140.38408083], [ 0.22727273, 140. , 139.27654989], [ 0. , 142. , 139.55343263], [ -0.59659091, 149. , 140.2802498 ], [ -0.22727273, 140. , 139.83031537], [ -1.81818182, 147. , 141.76849449], [ 0. , 136. , 139.55343263], [ 0.45454546, 139. , 138.99966716], [ 0. , 143. , 139.55343263], [ 0.22727273, 142. , 139.27654989]])
Univariate Wald-type test¶
Consider testing a hypothesis involving only one regression coefficient
$H_0: \beta_1 = 0 $
A Wald-type statistic is computed from the estimated model and compared with a t-distribution. The result of this test is typically shown in the output of the estimated model.
hypotheses = 'wtdiff = 0'
t_test = model.t_test(hypotheses)
t_test
<class 'statsmodels.stats.contrast.ContrastResults'> Test for Constraints ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ c0 -1.2183 0.127 -9.584 0.000 -1.468 -0.968 ==============================================================================
# you can test any other value of the regression coefficient b1
hypotheses = 'wtdiff = -1.3'
t_test = model.t_test(hypotheses)
t_test
<class 'statsmodels.stats.contrast.ContrastResults'> Test for Constraints ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ c0 -1.2183 0.127 0.643 0.521 -1.468 -0.968 ==============================================================================
Access estimated parameters¶
# access estimated regression coefficients
model.params
Intercept 139.553433 wtdiff -1.218284 dtype: float64
# access estimated var/covariances of the regression coefficients
model.cov_params()
|           | Intercept | wtdiff   |
|---|---|---|
| Intercept | 0.04994   | 0.011140 |
| wtdiff    | 0.01114   | 0.016157 |
# access estimated std errors of the regression coefficients
np.sqrt(np.diag(model.cov_params()))
array([0.22347274, 0.12711121])
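These standard errors are also available directly from the fitted results as model.bse, so the manual computation above serves mainly as a check:
# standard errors of the coefficients as reported by statsmodels
model.bse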
Inference for any linear combinations of parameters and data¶
Suppose you want to conduct statistical inference on the predicted outcome (or change in predicted outcome) for a certain linear combination of the parameters and data $X\boldsymbol{\beta}$.
Let's consider a large-sample Wald-type test with reference to a standard normal distribution.
The $100(1-\alpha)$% confidence intervals can be obtained as follows
$ \begin{equation} \begin{gathered} X \boldsymbol{\hat \beta} \pm z_{\alpha / 2} \mathrm{diag} [ XV(\boldsymbol{\hat \beta})X' ]^{1/2} \end{gathered} \end{equation} $
Considering a two-sided test with Type I error $\alpha=0.05$, we have that
$z_{0.05/ 2} = Q_{.025}(Z)$
So, we could rewrite the above in the following way to highlight the fact that the degree of confidence is related to the specific quantiles of the sampling distribution being used.
$ \begin{equation} \begin{gathered} Q_p(X \boldsymbol{\hat \beta}) = X \boldsymbol{\hat \beta} + Q_p(Z) \sqrt{ \mathrm{diag}(XV(\boldsymbol{\hat \beta})X') } \end{gathered} \end{equation} $
where $V(\boldsymbol{\hat \beta})$ is the estimated variance/covariance matrix of $\boldsymbol{\hat \beta}$.
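As a quick check, the standard normal quantiles used for 95% confidence limits can be computed with scipy (already imported above):
# 2.5th and 97.5th percentiles of the standard normal (about -1.96 and 1.96)
stats.norm.ppf(.025), stats.norm.ppf(.975)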
In Stata this is done with the lincom command.
In R this is done with the ci.lin() function in the Epi package.
def lincom(X, b, V):
    # variance of the linear combination Xb is X V X'
    Var = np.dot(X, np.dot(V, X.T))
    if Var.ndim == 0:
        Se = np.sqrt(Var)
    else:
        Se = np.sqrt(np.diag(Var))
    # point estimate with 95% confidence limits
    return (np.matmul(X, b), np.matmul(X, b) + stats.norm.ppf(.025)*Se, np.matmul(X, b) + stats.norm.ppf(.975)*Se)
b = model.params
V = model.cov_params()
Example 1: Confidence interval for the mean outcome¶
Derive a 95% confidence interval of the mean serum sodium concentration for those runners who increased 3 kg.
$E(\texttt{na}|\texttt{wtdiff}=3) = \beta_0 + \beta_1 3$
lincom(np.array([1, 3]),b,V)
(135.898580556215, 134.89498068034243, 136.90218043208756)
def lincom2(X, b, V):
    # variance of the linear combination Xb is X V X'
    Var = np.dot(X, np.dot(V, X.T))
    if Var.ndim == 0:
        Se = np.sqrt(Var)
    else:
        Se = np.sqrt(np.diag(Var))
    # Wald z-statistic and two-sided p-value for H0: Xb = 0
    z = (np.matmul(X, b) - 0)/Se
    p = stats.norm.cdf(-abs(z))*2
    return (np.matmul(X, b), Se, z, p, np.matmul(X, b) + stats.norm.ppf(.025)*Se, np.matmul(X, b) + stats.norm.ppf(.975)*Se)
results = lincom2(np.array([0, 1]),b,V)
np.round(np.transpose(results),4)
array([-1.2183, 0.1271, -9.5844, 0. , -1.4674, -0.9692])
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: na R-squared: 0.169 Model: OLS Adj. R-squared: 0.167 Method: Least Squares F-statistic: 91.86 Date: Thu, 02 Oct 2025 Prob (F-statistic): 6.21e-20 Time: 07:55:02 Log-Likelihood: -1317.2 No. Observations: 455 AIC: 2638. Df Residuals: 453 BIC: 2647. Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 139.5534 0.223 624.476 0.000 139.114 139.993 wtdiff -1.2183 0.127 -9.584 0.000 -1.468 -0.968 ============================================================================== Omnibus: 65.662 Durbin-Watson: 1.284 Prob(Omnibus): 0.000 Jarque-Bera (JB): 327.577 Skew: -0.491 Prob(JB): 7.37e-72 Kurtosis: 7.039 Cond. No. 2.04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Example 2: Confidence interval for the difference in mean outcome¶
Derive a 95% confidence interval of the mean outcome difference comparing those who increased 3 kg vs those who decreased 1 kg.
$E(\texttt{na}|\texttt{wtdiff}=3) - E(\texttt{na}|\texttt{wtdiff}=-1) = \beta_1 (3-(-1)) = 4\beta_1$
lincom(np.array([0, 4]),b,V)
(-4.873136097422644, -5.869669663999436, -3.876602530845852)
Linear and quadratic function for the quantitative predictors¶
Consider a linear regression model to conduct statistical inference on the mean serum sodium concentration as a function of weight change (linear function) and BMI (quadratic function)
$E(\texttt{na}|\texttt{wtdiff}, \texttt{bmi}) = \beta_0 + \beta_1\texttt{wtdiff}+ \beta_2\texttt{bmi} + \beta_3\texttt{bmi}^2$
df['bmisq'] = df['bmi']**2
# Pearson correlation between bmi and bmi squared
valid = df[['bmi', 'bmisq']].dropna()
r, pval = stats.pearsonr(valid['bmi'], valid['bmisq'])
print("Correlation:", r, "p-value:", pval)
Correlation: 0.9965742291212139 p-value: 0.0
model = smf.ols('na ~ wtdiff + bmi + bmisq' , data=df).fit()
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: na R-squared: 0.186 Model: OLS Adj. R-squared: 0.180 Method: Least Squares F-statistic: 34.03 Date: Thu, 02 Oct 2025 Prob (F-statistic): 7.91e-20 Time: 07:55:02 Log-Likelihood: -1305.2 No. Observations: 452 AIC: 2618. Df Residuals: 448 BIC: 2635. Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 105.0147 11.218 9.361 0.000 82.968 127.061 wtdiff -1.1770 0.127 -9.247 0.000 -1.427 -0.927 bmi 2.8707 0.940 3.053 0.002 1.023 4.719 bmisq -0.0587 0.020 -3.006 0.003 -0.097 -0.020 ============================================================================== Omnibus: 60.670 Durbin-Watson: 1.308 Prob(Omnibus): 0.000 Jarque-Bera (JB): 330.172 Skew: -0.400 Prob(JB): 2.01e-72 Kurtosis: 7.110 Cond. No. 3.03e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 3.03e+04. This might indicate that there are strong multicollinearity or other numerical problems.
F test for a joint hypothesis¶
Since body mass index was modelled with a quadratic function (degree-2 polynomial), a test of the hypothesis of no adjusted association between body mass index and mean serum sodium concentration is obtained by testing whether the two regression coefficients are jointly equal to zero.
$H_0: \beta_2 = \beta_3 = 0 $
A Wald-type statistic is computed from the estimated model and compared with an F-distribution.
hypotheses = ['bmi = 0', 'bmisq = 0']
f_test = model.f_test(hypotheses)
f_test
<class 'statsmodels.stats.contrast.ContrastResults'> <F test: F=4.767628913599036, p=0.00893659800206336, df_denom=448, df_num=2>
Conditional probability¶
Consider a logistic regression model to conduct statistical inference on the outcome probability conditionally on the values of the predictor.
$P(Y=1|x) = \texttt{invlogit}(\beta_0 + \beta_1x) $
The logit of the outcome probability is related to a linear combination of regression coefficients and variables
$\texttt{logit}\left(P(Y=1|x)\right) = \beta_0 + \beta_1x $
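As a minimal sketch, the inverse logit maps any value of the linear predictor back to a probability between 0 and 1 (the same function is defined again later in this notebook):
# inverse logit: from log-odds to probability
def invlogit(xb):
    return np.exp(xb) / (1 + np.exp(xb))

# a linear predictor of 0 corresponds to a probability of 0.5
invlogit(0)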
Binary predictor¶
Conduct statistical inference on the odds ratio of hyponatremia associated with female sex.
$\texttt{logit}\left(P(\texttt{nas135}=1|\texttt{female})\right) = \beta_0 + \beta_1\texttt{female} $
Estimates of the regression coefficients are obtained with the maximum likelihood method.
model = smf.logit('nas135 ~ female ', data=df).fit()
Optimization terminated successfully. Current function value: 0.360585 Iterations 6
print(model.summary())
Logit Regression Results ============================================================================== Dep. Variable: nas135 No. Observations: 488 Model: Logit Df Residuals: 486 Method: MLE Df Model: 1 Date: Thu, 02 Oct 2025 Pseudo R-squ.: 0.05293 Time: 07:55:02 Log-Likelihood: -175.97 converged: True LL-Null: -185.80 Covariance Type: nonrobust LLR p-value: 9.204e-06 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -2.4749 0.208 -11.884 0.000 -2.883 -2.067 female 1.2260 0.280 4.386 0.000 0.678 1.774 ==============================================================================
Confidence interval for the Odds Ratio¶
Thanks to the equivariance property of the maximum likelihood estimator
- obtain estimates with confidence limits
- exponentiate and round them
b = model.params
V = model.cov_params()
np.round(np.exp(lincom(np.array([0, 1]), b, V)), 1)
array([3.4, 2. , 5.9])
params = model.params
conf = model.conf_int()
conf['OR'] = params
conf.columns = ['LB', 'UB', 'OR']
print(np.around(np.exp(conf[1:]),1))
LB UB OR female 2.0 5.9 3.4
Categorization of continuous predictors¶
df['bmic'] = pd.cut(df['bmi'],bins=[0,20,25,40],labels=['<20','20-25','>25'])
df['runtimehc'] = pd.cut(df['runtime']/60, bins=[0,3.5,4,8] , labels=['<3:30','3:30-4:00','>4:00'])
df['runtimehc'].value_counts(normalize=True)
runtimehc <3:30 0.404612 3:30-4:00 0.308176 >4:00 0.287212 Name: proportion, dtype: float64
df['bmic'].value_counts(normalize=True)
bmic 20-25 0.701075 >25 0.193548 <20 0.105376 Name: proportion, dtype: float64
Final multivariable logistic model with categorized predictors¶
Consider the estimates provided in Table 2 of the NEJM paper.
$\texttt{logit}\left(P(\texttt{nas135}=1|\texttt{wtdiff}, \texttt{runtime}, \texttt{bmi})\right) = \beta_0 + \beta_1I(\texttt{wtdiff}>0) + \beta_2 I(3.5<\texttt{runtime}<4) + \beta_3 I(\texttt{runtime}>4) + \beta_4 I(\texttt{bmi}<20) + \beta_5 I(\texttt{bmi}>25)$
Indicator (or dummy) variables are conveniently created with the C() notation, where Treatment() indicates the chosen reference level.
model = smf.logit('nas135 ~ I(wtdiff>0) + C(runtimehc,Treatment(0)) + C(bmic,Treatment(1)) ', data=df).fit()
print(model.summary())
Optimization terminated successfully. Current function value: 0.311873 Iterations 7 Logit Regression Results ============================================================================== Dep. Variable: nas135 No. Observations: 455 Model: Logit Df Residuals: 449 Method: MLE Df Model: 5 Date: Thu, 02 Oct 2025 Pseudo R-squ.: 0.1826 Time: 07:55:02 Log-Likelihood: -141.90 converged: True LL-Null: -173.61 Covariance Type: nonrobust LLR p-value: 2.395e-12 =========================================================================================================== coef std err z P>|z| [0.025 0.975] ----------------------------------------------------------------------------------------------------------- Intercept -3.8727 0.436 -8.889 0.000 -4.727 -3.019 I(wtdiff > 0)[T.True] 1.4578 0.310 4.698 0.000 0.850 2.066 C(runtimehc, Treatment(0))[T.3:30-4:00] 1.0549 0.475 2.222 0.026 0.124 1.985 C(runtimehc, Treatment(0))[T.>4:00] 1.8032 0.466 3.867 0.000 0.889 2.717 C(bmic, Treatment(1))[T.<20] 1.3910 0.410 3.390 0.001 0.587 2.195 C(bmic, Treatment(1))[T.>25] 0.0369 0.388 0.095 0.924 -0.723 0.797 ===========================================================================================================
params = model.params
conf = model.conf_int()
conf['OR'] = params
conf.columns = ['LB', 'UB', 'OR']
print(np.around(np.exp(conf[1:]),1))
LB UB OR I(wtdiff > 0)[T.True] 2.3 7.9 4.3 C(runtimehc, Treatment(0))[T.3:30-4:00] 1.1 7.3 2.9 C(runtimehc, Treatment(0))[T.>4:00] 2.4 15.1 6.1 C(bmic, Treatment(1))[T.<20] 1.8 9.0 4.0 C(bmic, Treatment(1))[T.>25] 0.5 2.2 1.0
All continuous predictors using linear and quadratic functions¶
This is the final logistic regression model where each continuous predictor is modelled as quantitative.
You need to assume the functional relationship between each continuous predictor and the logit of the outcome probability:
- linear function for weight change
- linear function for race duration
- quadratic function for BMI
$\texttt{logit}\left(P(\texttt{nas135}=1|\texttt{wtdiff}, \texttt{runtime}, \texttt{bmi})\right) = \beta_0 + \beta_1 \texttt{wtdiff} + \beta_2 \texttt{runtime} + \beta_3 \texttt{bmi} + \beta_4 \texttt{bmi}^2$
df['bmisq'] = df['bmi']**2
model = smf.logit('nas135~wtdiff+runtime+bmi+bmisq', data=df).fit()
print(model.summary())
Optimization terminated successfully. Current function value: 0.282840 Iterations 8 Logit Regression Results ============================================================================== Dep. Variable: nas135 No. Observations: 446 Model: Logit Df Residuals: 441 Method: MLE Df Model: 4 Date: Thu, 02 Oct 2025 Pseudo R-squ.: 0.2515 Time: 07:55:02 Log-Likelihood: -126.15 converged: True LL-Null: -168.53 Covariance Type: nonrobust LLR p-value: 1.708e-17 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 21.3638 7.482 2.855 0.004 6.700 36.028 wtdiff 0.7329 0.123 5.953 0.000 0.492 0.974 runtime 0.0161 0.004 3.687 0.000 0.008 0.025 bmi -2.2319 0.622 -3.586 0.000 -3.452 -1.012 bmisq 0.0449 0.013 3.472 0.001 0.020 0.070 ==============================================================================
b = model.params
V = model.cov_params()
params = model.params
conf = model.conf_int()
conf['OR'] = params
conf.columns = ['LB', 'UB', 'OR']
print(np.around(np.exp(conf[1:3]),2))
LB UB OR wtdiff 1.63 2.65 2.08 runtime 1.01 1.02 1.02
model.nobs
446
AIC and maximized log-likelihood¶
Akaike’s (1974) information criterion is defined as
$\mathrm{AIC} = -2 \ln L + 2k$
where $\ln L$ is the maximized log-likelihood of the model and $k$ is the number of parameters estimated.
model.aic
262.2931311594778
model.llf
-126.1465655797389
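As a check, the reported AIC can be reproduced from the maximized log-likelihood and the number of estimated parameters (here 5, including the intercept):
# AIC = -2*lnL + 2*k
-2*model.llf + 2*len(model.params)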
Modelling quantitative predictors using splines¶
A spline of a predictor $x$ is just a transformation of it depending on the location of a knot $k$ and the power to which the distance between $x$ and $k$ is raised.
Degree-0 spline
$I(x>k)(x-k)^0$
Degree-1 spline
$I(x>k)(x-k)^1$
Degree-2 spline
$I(x>k)(x-k)^2$
Degree-3 spline
$I(x>k)(x-k)^3$
The binary indicator variable $I(x>k)$ takes value 1 only when $x>k$ and 0 otherwise.
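A minimal numpy sketch of these transformations, using an illustrative knot at 0 (xg and kk are hypothetical names used only for this illustration):
# spline terms I(x>k)*(x-k)^d of degree d = 0, 1, 2, 3 for an illustrative knot k = 0
xg = np.linspace(-2, 2, 5)
kk = 0
for d in range(4):
    print('degree', d, ':', (xg > kk).astype(float)*(xg - kk)**d)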
Degree-1 Spline with linear regression¶
$ E(Y|x) = \beta_0 + \beta_1x + \beta_2 I(x>0)(x-0)^1$
if $x<0$
$ E(Y|x) = \beta_0 + \beta_1 x $
if $x \ge 0$
$ E(Y|x) = \beta_0 + (\beta_1 + \beta_2) x $
# Create the linear spline at the chosen knot
df.loc[ df.wtdiff>0, 'wc_s1'] = (df.wtdiff-0)
df.loc[ df.wtdiff<=0, 'wc_s1'] = 0
df[ ['wtdiff', 'wc_s1']]
|     | wtdiff    | wc_s1    |
|---|---|---|
| 0   | NaN       | NaN      |
| 1   | NaN       | NaN      |
| 2   | NaN       | NaN      |
| 3   | NaN       | NaN      |
| 4   | NaN       | NaN      |
| ... | ...       | ...      |
| 483 | -2.500000 | 0.000000 |
| 484 | 0.454545  | 0.454545 |
| 485 | 0.312500  | 0.312500 |
| 486 | 1.136364  | 1.136364 |
| 487 | -0.909091 | 0.000000 |

488 rows × 2 columns
model = smf.ols('na~ wtdiff + wc_s1', data=df).fit()
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: na R-squared: 0.186 Model: OLS Adj. R-squared: 0.182 Method: Least Squares F-statistic: 51.51 Date: Thu, 02 Oct 2025 Prob (F-statistic): 7.06e-21 Time: 07:55:03 Log-Likelihood: -1312.5 No. Observations: 455 AIC: 2631. Df Residuals: 452 BIC: 2643. Df Model: 2 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 140.3155 0.332 422.036 0.000 139.662 140.969 wtdiff -0.7795 0.190 -4.094 0.000 -1.154 -0.405 wc_s1 -1.3319 0.433 -3.073 0.002 -2.184 -0.480 ============================================================================== Omnibus: 66.281 Durbin-Watson: 1.324 Prob(Omnibus): 0.000 Jarque-Bera (JB): 326.527 Skew: -0.503 Prob(JB): 1.25e-71 Kurtosis: 7.026 Cond. No. 4.85 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Visualize piece-wise linear trend¶
df['fit'] = model.fittedvalues
to_plot = df[ ['wtdiff', 'fit'] ].sort_values(['wtdiff'])
plt.plot(to_plot.wtdiff, to_plot.fit, c='b', alpha=.5, linewidth=3)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.axvline(0, c='g', alpha=.2, linestyle='dashed', linewidth=2)
plt.show()
Define a function to generate restricted cubic splines¶
# Define a function to generate restricted cubic splines given number of knots
def rcs(x, nk):
    # knot locations placed at Harrell's recommended quantiles of x
    if (nk == 3):
        lk = [.1, .5, .9]
    if (nk == 4):
        lk = [.05, .35, .65, .95]
    if (nk == 5):
        lk = [.05, .275, .50, .725, .95]
    if (nk == 6):
        lk = [.05, .23, .41, .59, .77, .95]
    if (nk == 7):
        lk = [.025, .1833, .3417, .50, .6583, .8167, .975]
    knots = []
    for i in range(nk):
        knots.append(np.quantile(x, lk[i]))
    # truncated linear terms I(x > knot)*(x - knot) for each knot
    u = np.zeros(shape=(len(x), nk))
    for i in range(nk):
        u[:, i] = np.array(x > knots[i])*(x - knots[i])
    # first column is x itself; the others are the restricted cubic terms
    spline = np.zeros(shape=(len(x), nk - 1))
    spline[:, 0] = x
    for i in range(nk - 2):
        spline[:, i + 1] = (np.power(u[:, i], 3) -
            1/(knots[nk-1] - knots[nk-2])*
            (np.power(u[:, nk-2], 3)*(knots[nk-1] - knots[i]) -
             np.power(u[:, nk-1], 3)*(knots[nk-2] - knots[i]))) / (knots[nk-1] - knots[0])**2
    return spline
df = pd.read_stata('http://www.stats4life.se/data/marathon.dta', convert_categoricals=False)
df = df[['na', 'nas135', 'wtdiff', 'runtime', 'bmi']].dropna()
df['bmi2'] = np.power(df['bmi'],2)
y = df['na'].to_numpy()
# loop to compare AIC across spline models with increasing nr of knots
bmis = []
l_aic = []
nknots = [3,4,5,6,7]
for k in nknots:
print("Nr of knots = " , k)
X = rcs(df.bmi,k)
X = sm.add_constant(X)
smodel = sm.OLS(y,X).fit()
print(smodel.summary())
l_aic.append(smodel.aic)
fits = smodel.fittedvalues
# print(np.column_stack( [X[:,1], fits] ))
plt.scatter(X[:,1], fits, s=2, c='r', alpha=.4)
plt.title('Restricted Cubic Spline with {0} Knots'.format(k))
plt.show()
print('Spline model with %.0f knots AIC is %.0f' % (k, smodel.aic))
# lowest AIC is indicating a good compromise between fit and parameters
list(zip(nknots, l_aic))
Nr of knots = 3 OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.038 Model: OLS Adj. R-squared: 0.034 Method: Least Squares F-statistic: 8.754 Date: Thu, 02 Oct 2025 Prob (F-statistic): 0.000187 Time: 07:55:03 Log-Likelihood: -1325.3 No. Observations: 446 AIC: 2657. Df Residuals: 443 BIC: 2669. Df Model: 2 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 121.5136 4.555 26.674 0.000 112.561 130.467 x1 0.8859 0.213 4.161 0.000 0.467 1.304 x2 -1.1100 0.277 -4.001 0.000 -1.655 -0.565 ============================================================================== Omnibus: 70.144 Durbin-Watson: 1.181 Prob(Omnibus): 0.000 Jarque-Bera (JB): 316.526 Skew: -0.590 Prob(JB): 1.85e-69 Kurtosis: 6.955 Cond. No. 473. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Spline model with 3 knots AIC is 2657 Nr of knots = 4 OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.039 Model: OLS Adj. R-squared: 0.032 Method: Least Squares F-statistic: 5.957 Date: Thu, 02 Oct 2025 Prob (F-statistic): 0.000548 Time: 07:55:03 Log-Likelihood: -1325.1 No. Observations: 446 AIC: 2658. Df Residuals: 442 BIC: 2675. Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 117.1229 7.674 15.263 0.000 102.042 132.204 x1 1.1091 0.377 2.946 0.003 0.369 1.849 x2 -2.3612 1.570 -1.504 0.133 -5.446 0.724 x3 4.6141 5.000 0.923 0.357 -5.213 14.442 ============================================================================== Omnibus: 70.064 Durbin-Watson: 1.183 Prob(Omnibus): 0.000 Jarque-Bera (JB): 316.204 Skew: -0.589 Prob(JB): 2.17e-69 Kurtosis: 6.953 Cond. No. 928. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Spline model with 4 knots AIC is 2658 Nr of knots = 5 OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.039 Model: OLS Adj. R-squared: 0.031 Method: Least Squares F-statistic: 4.512 Date: Thu, 02 Oct 2025 Prob (F-statistic): 0.00139 Time: 07:55:03 Log-Likelihood: -1325.0 No. Observations: 446 AIC: 2660. Df Residuals: 441 BIC: 2680. Df Model: 4 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 113.8794 10.237 11.124 0.000 93.760 133.999 x1 1.2772 0.514 2.483 0.013 0.266 2.288 x2 -4.3913 4.339 -1.012 0.312 -12.919 4.136 x3 13.2230 21.448 0.617 0.538 -28.931 55.377 x4 -11.6890 26.956 -0.434 0.665 -64.667 41.289 ============================================================================== Omnibus: 70.538 Durbin-Watson: 1.186 Prob(Omnibus): 0.000 Jarque-Bera (JB): 322.180 Skew: -0.590 Prob(JB): 1.10e-70 Kurtosis: 6.993 Cond. No. 3.65e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 3.65e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Spline model with 5 knots AIC is 2660 Nr of knots = 6 OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.039 Model: OLS Adj. R-squared: 0.028 Method: Least Squares F-statistic: 3.576 Date: Thu, 02 Oct 2025 Prob (F-statistic): 0.00351 Time: 07:55:03 Log-Likelihood: -1325.0 No. Observations: 446 AIC: 2662. Df Residuals: 440 BIC: 2687. Df Model: 5 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 115.5313 11.903 9.706 0.000 92.138 138.925 x1 1.1914 0.605 1.969 0.050 0.002 2.380 x2 -3.2623 7.866 -0.415 0.679 -18.721 12.196 x3 2.4271 45.041 0.054 0.957 -86.095 90.949 x4 8.1119 77.386 0.105 0.917 -143.981 160.205 x5 -10.9087 61.681 -0.177 0.860 -132.135 110.318 ============================================================================== Omnibus: 70.170 Durbin-Watson: 1.186 Prob(Omnibus): 0.000 Jarque-Bera (JB): 318.165 Skew: -0.589 Prob(JB): 8.15e-70 Kurtosis: 6.967 Cond. No. 1.11e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.11e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Spline model with 6 knots AIC is 2662 Nr of knots = 7 OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.043 Model: OLS Adj. R-squared: 0.030 Method: Least Squares F-statistic: 3.301 Date: Thu, 02 Oct 2025 Prob (F-statistic): 0.00345 Time: 07:55:03 Log-Likelihood: -1324.1 No. Observations: 446 AIC: 2662. Df Residuals: 439 BIC: 2691. Df Model: 6 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 116.2837 9.241 12.584 0.000 98.122 134.445 x1 1.1481 0.461 2.491 0.013 0.242 2.054 x2 -121.1906 295.221 -0.411 0.682 -701.412 459.031 x3 11.8721 94.523 0.126 0.900 -173.901 197.645 x4 241.8662 337.937 0.716 0.475 -422.310 906.042 x5 -191.9364 189.090 -1.015 0.311 -563.570 179.697 x6 74.5418 66.352 1.123 0.262 -55.864 204.948 ============================================================================== Omnibus: 74.168 Durbin-Watson: 1.189 Prob(Omnibus): 0.000 Jarque-Bera (JB): 345.955 Skew: -0.623 Prob(JB): 7.53e-76 Kurtosis: 7.131 Cond. No. 5.07e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 5.07e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Spline model with 7 knots AIC is 2662
[(3, 2656.5587353722885), (4, 2658.1682651667147), (5, 2659.955934150113), (6, 2662.0784684105115), (7, 2662.1652085479254)]
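The number of knots minimizing the AIC can also be picked programmatically; given the list above, this returns 3:
# number of knots with the lowest AIC
nknots[int(np.argmin(l_aic))]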
bmis = rcs(df.bmi,3)
np.size(bmis, 0), np.size(bmis,1)
(446, 2)
# creating the design matrix
bmis = rcs(df.bmi,3)
y = df['na'].to_numpy()
X = bmis
X = sm.add_constant(X)
model1 = sm.OLS(y,X).fit()
print(model1.summary())
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.038 Model: OLS Adj. R-squared: 0.034 Method: Least Squares F-statistic: 8.754 Date: Thu, 02 Oct 2025 Prob (F-statistic): 0.000187 Time: 07:55:03 Log-Likelihood: -1325.3 No. Observations: 446 AIC: 2657. Df Residuals: 443 BIC: 2669. Df Model: 2 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 121.5136 4.555 26.674 0.000 112.561 130.467 x1 0.8859 0.213 4.161 0.000 0.467 1.304 x2 -1.1100 0.277 -4.001 0.000 -1.655 -0.565 ============================================================================== Omnibus: 70.144 Durbin-Watson: 1.181 Prob(Omnibus): 0.000 Jarque-Bera (JB): 316.526 Skew: -0.590 Prob(JB): 1.85e-69 Kurtosis: 6.955 Cond. No. 473. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
df
|     | na  | nas135 | wtdiff    | runtime | bmi       | bmi2       |
|---|---|---|---|---|---|---|
| 18  | 142 | 0 | -0.909091 | 195.0 | 19.503027 | 380.368077 |
| 19  | 141 | 0 | 1.136364  | 195.0 | 18.695052 | 349.504971 |
| 20  | 138 | 0 | -0.681818 | 207.0 | 20.157869 | 406.339666 |
| 21  | 140 | 0 | 0.227273  | 234.0 | 17.063750 | 291.171557 |
| 22  | 142 | 0 | 0.000000  | 233.0 | 19.428191 | 377.454619 |
| ... | ... | ... | ...     | ...   | ...       | ...        |
| 483 | 133 | 1 | -2.500000 | 308.0 | 29.188370 | 851.960947 |
| 484 | 128 | 1 | 0.454545  | 250.0 | 31.387685 | 985.186749 |
| 485 | 133 | 1 | 0.312500  | 283.0 | 28.311327 | 801.531230 |
| 486 | 131 | 1 | 1.136364  | 308.0 | 31.224404 | 974.963415 |
| 487 | 135 | 1 | -0.909091 | 276.0 | 28.292931 | 800.489939 |

446 rows × 6 columns
# add the cubic splines to the dataframe
df['bmis1'] = bmis[:,0]
df['bmis2'] = bmis[:,1]
# use the formula approach
model2 = smf.ols('na ~ bmis1 + bmis2', data=df).fit()
print(model2.summary())
OLS Regression Results ============================================================================== Dep. Variable: na R-squared: 0.038 Model: OLS Adj. R-squared: 0.034 Method: Least Squares F-statistic: 8.754 Date: Thu, 02 Oct 2025 Prob (F-statistic): 0.000187 Time: 07:55:03 Log-Likelihood: -1325.3 No. Observations: 446 AIC: 2657. Df Residuals: 443 BIC: 2669. Df Model: 2 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 121.5136 4.555 26.674 0.000 112.561 130.467 bmis1 0.8859 0.213 4.161 0.000 0.467 1.304 bmis2 -1.1100 0.277 -4.001 0.000 -1.655 -0.565 ============================================================================== Omnibus: 70.144 Durbin-Watson: 1.181 Prob(Omnibus): 0.000 Jarque-Bera (JB): 316.526 Skew: -0.590 Prob(JB): 1.85e-69 Kurtosis: 6.955 Cond. No. 473. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Testing regression coefficients of the splines¶
$H_0: \beta_1 = \beta_2 = 0 $
hypotheses = ['bmis1 = 0', 'bmis2 = 0']
f_test = model2.f_test(hypotheses)
f_test
<class 'statsmodels.stats.contrast.ContrastResults'> <F test: F=8.753689785498299, p=0.0001868614716924368, df_denom=443, df_num=2>
Graph the predicted mean from the cubic spline linear model¶
df['fits'] = model2.fittedvalues
# (x,y) pairs to be plotted need to be sorted first
to_plot = df[ ['bmis1', 'fits'] ].sort_values(['bmis1', 'fits'])
#plt.scatter(df.bmi, df.na, c='r', alpha=.2, s=3)
plt.plot(to_plot['bmis1'],to_plot['fits'], c='b', linewidth=2, alpha=.5)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.show()
Multivariable logistic regression model¶
model = smf.logit('nas135 ~ wtdiff + runtime', data=df).fit()
print(model.summary())
Optimization terminated successfully. Current function value: 0.298177 Iterations 7 Logit Regression Results ============================================================================== Dep. Variable: nas135 No. Observations: 446 Model: Logit Df Residuals: 443 Method: MLE Df Model: 2 Date: Thu, 02 Oct 2025 Pseudo R-squ.: 0.2109 Time: 07:55:03 Log-Likelihood: -132.99 converged: True LL-Null: -168.53 Covariance Type: nonrobust LLR p-value: 3.681e-16 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -5.7753 0.958 -6.029 0.000 -7.653 -3.898 wtdiff 0.6980 0.115 6.049 0.000 0.472 0.924 runtime 0.0162 0.004 4.268 0.000 0.009 0.024 ==============================================================================
Wald-type test for one parameter¶
Recall that under $H_0$, $Z^2$ follows a $\chi^2(1)$ distribution
hypothesis='wtdiff=0'
wald = model.wald_test(hypothesis, scalar=True)
print("Wald test statistic:", wald.statistic)
print("p-value:", wald.pvalue)
Wald test statistic: 36.5913183097515 p-value: 1.4567908936820186e-09
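Since the Wald statistic for a single parameter is just the squared z-statistic, it can be checked against the model summary:
# squared z-statistic for wtdiff, equal to the Wald chi-square statistic above
model.tvalues['wtdiff']**2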
Cubic spline logistic regression model¶
model = smf.logit('nas135 ~ bmis1 + bmis2 + wtdiff', data=df).fit()
print(model.summary())
Optimization terminated successfully. Current function value: 0.298471 Iterations 7 Logit Regression Results ============================================================================== Dep. Variable: nas135 No. Observations: 446 Model: Logit Df Residuals: 442 Method: MLE Df Model: 3 Date: Thu, 02 Oct 2025 Pseudo R-squ.: 0.2101 Time: 07:55:03 Log-Likelihood: -133.12 converged: True LL-Null: -168.53 Covariance Type: nonrobust LLR p-value: 2.857e-15 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 8.5283 2.673 3.191 0.001 3.289 13.767 bmis1 -0.5033 0.128 -3.929 0.000 -0.754 -0.252 bmis2 0.7309 0.170 4.312 0.000 0.399 1.063 wtdiff 0.7518 0.118 6.396 0.000 0.521 0.982 ==============================================================================
Wald-type test of two parameters¶
hypothesis = '(bmis1=0, bmis2=0)'
wald = model.wald_test(hypothesis, scalar=True)
print("Wald test statistic:", wald.statistic)
print("p-value:", wald.pvalue)
Wald test statistic: 18.638150365520445 p-value: 8.969682522687936e-05
Likelihood ratio test¶
An alternative to the Wald-type test, which is based on the estimated regression coefficients and their estimated variance/covariance, is the likelihood ratio test
$ \lambda = 2|\ell_R- \ell_U|$
where $\ell_R$ is the log-likelihood of the restricted model and $\ell_U$ is the log-likelihood of the unrestricted model. It is the absolute distance that matters.
The two models need to be nested and must be estimated on the same subjects, with a different number of parameters in the model.
The two models are nested if you can obtain the restricted model from the full model by putting constraints on the parameters you want to test.
Under the null hypothesis that the parameters being tested are equal to zero (assuming the restricted model is true), the likelihood ratio test follows a $\chi^2$ distribution with degrees of freedom equal to the difference in the number of parameters $(k_1 - k_0)$ of the two models being compared.
Under the hypothesis that the $(k_1 - k_0)$ parameters are all equal to zero, we have that $ \lambda \sim \chi^2(k_1 - k_0) $
A computed likelihood ratio statistic exceeding a top quantile, say 0.95, of the null sampling distribution is considered a strong indication of incompatibility between this sample of data and the hypothesis.
Alternatively, one can compute the p-value from the cumulative distribution function of the $\chi^2$.
# Model 1
m1 = smf.logit('nas135 ~ wtdiff + runtime + bmis1 + bmis2 ', data=df).fit()
l1 = m1.llf
# Model 0 - restricted model without spline terms for bmi
m0 = smf.logit('nas135 ~ wtdiff + runtime', data=df).fit()
l0 = m0.llf
Optimization terminated successfully. Current function value: 0.283066 Iterations 8 Optimization terminated successfully. Current function value: 0.298177 Iterations 7
lrt = 2*abs(l1-l0)
k1 = len(m1.params)
k0 = len(m0.params)
d_f = k1-k0
print("Likelihood ratio test %.3f (p = %.4f) " % (lrt, 1-stats.chi2.cdf(lrt,d_f)) )
Likelihood ratio test 13.479 (p = 0.0012)
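The survival function of the chi-square distribution gives the same p-value and is numerically more stable than 1 minus the CDF:
# upper-tail probability P(chi2(d_f) > lrt)
stats.chi2.sf(lrt, d_f)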
Adjusted Odds Ratio of Hyponatremia according to weight change using 0 kg as referent¶
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 446 entries, 18 to 487 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 na 446 non-null int16 1 nas135 446 non-null int8 2 wtdiff 446 non-null float64 3 runtime 446 non-null float64 4 bmi 446 non-null float64 5 bmi2 446 non-null float64 6 bmis1 446 non-null float64 7 bmis2 446 non-null float64 8 fits 446 non-null float64 dtypes: float64(7), int16(1), int8(1) memory usage: 29.2 KB
Final logistic regression model¶
$\texttt{logit}\left(P(\texttt{nas135}|\texttt{wtdiff}, \texttt{runtime}, \texttt{bmi})\right) = \beta_0 + \beta_1\texttt{wtdiff} + \beta_2\texttt{runtime} + \beta_3\texttt{bmi} + \beta_4\texttt{bmi}^2 $
model = smf.logit('nas135 ~ wtdiff + runtime + bmi + bmi2', data=df).fit()
print(model.summary())
Optimization terminated successfully. Current function value: 0.282840 Iterations 8 Logit Regression Results ============================================================================== Dep. Variable: nas135 No. Observations: 446 Model: Logit Df Residuals: 441 Method: MLE Df Model: 4 Date: Thu, 02 Oct 2025 Pseudo R-squ.: 0.2515 Time: 07:55:03 Log-Likelihood: -126.15 converged: True LL-Null: -168.53 Covariance Type: nonrobust LLR p-value: 1.708e-17 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 21.3638 7.482 2.855 0.004 6.700 36.028 wtdiff 0.7329 0.123 5.953 0.000 0.492 0.974 runtime 0.0161 0.004 3.687 0.000 0.008 0.025 bmi -2.2319 0.622 -3.586 0.000 -3.452 -1.012 bmi2 0.0449 0.013 3.472 0.001 0.020 0.070 ==============================================================================
Table of adjusted odds ratios¶
b = model.params
x = (-4, -3, -2, -1, 0, 1, 2, 3)
x_ref = 0
or_wtdiff = []
for i in range(8):
or_wtdiff.append(round(np.exp(b['wtdiff']*(x[i]-x_ref)), 2))
or_wtdiff
[0.05, 0.11, 0.23, 0.48, 1.0, 2.08, 4.33, 9.01]
Looping over a grid of predictor values¶
# Loop to tabulate adjusted odds ratios for a grid of values
for i in range(8):
print(x[i], round(np.exp(b['wtdiff']*(x[i]-x_ref)), 2))
-4 0.05 -3 0.11 -2 0.23 -1 0.48 0 1.0 1 2.08 2 4.33 3 9.01
V = model.cov_params()
V
|           | Intercept | wtdiff    | runtime   | bmi       | bmi2      |
|---|---|---|---|---|---|
| Intercept | 55.975540 | 0.158859  | -0.005729 | -4.614272 | 0.095350  |
| wtdiff    | 0.158859  | 0.015157  | 0.000023  | -0.014786 | 0.000313  |
| runtime   | -0.005729 | 0.000023  | 0.000019  | 0.000177  | -0.000006 |
| bmi       | -4.614272 | -0.014786 | 0.000177  | 0.387397  | -0.008019 |
| bmi2      | 0.095350  | 0.000313  | -0.000006 | -0.008019 | 0.000167  |
b.T
Intercept 21.363804 wtdiff 0.732879 runtime 0.016116 bmi -2.231851 bmi2 0.044906 dtype: float64
b['wtdiff']
0.7328793472053252
se = np.sqrt(np.diag(V))
b_wtdiff = b['wtdiff']
se_b_wtdiff = se[1]
print(b_wtdiff, se_b_wtdiff)
0.7328793472053252 0.12311509580038163
print("Var %.4f" % 1.23114808e-01)
Var 0.1231
Customize the output of contrasts¶
for i in range(8):
    print("wtdiff = ", x[i])
    # abs() keeps the lower limit below the upper limit, also for negative contrasts
    print("OR = %.2f CI = %.2f to %.2f " % (np.exp(b_wtdiff*(x[i]-0)),
          np.exp(b_wtdiff*(x[i]-0) - 1.96*se_b_wtdiff*abs(x[i]-0)),
          np.exp(b_wtdiff*(x[i]-0) + 1.96*se_b_wtdiff*abs(x[i]-0))))
wtdiff = -4 OR = 0.05 CI = 0.02 to 0.14 wtdiff = -3 OR = 0.11 CI = 0.05 to 0.23 wtdiff = -2 OR = 0.23 CI = 0.14 to 0.37 wtdiff = -1 OR = 0.48 CI = 0.38 to 0.61 wtdiff = 0 OR = 1.00 CI = 1.00 to 1.00 wtdiff = 1 OR = 2.08 CI = 1.63 to 2.65 wtdiff = 2 OR = 4.33 CI = 2.67 to 7.02 wtdiff = 3 OR = 9.01 CI = 4.37 to 18.59
Estimated odds ratio of hyponatremia conferred by 1 unit increase in weight change¶
$ OR_{wtdiff} = \exp(\hat \beta_1\times1) $
# log odds ratio and 95% CI
lincom(np.array([0,1,0,0,0]),b,V)
(0.7328793472053252, 0.4915781934833787, 0.9741805009272717)
# Using the equivariance property of ML estimate we can easily obtain OR (95% CI)
np.around(np.exp(lincom(np.array([0,1,0,0,0]),b,V)), 1)
array([2.1, 1.6, 2.6])
Estimated odds ratio of hyponatremia conferred by 2 unit increase in weight change¶
$ OR_{wtdiff} = \exp(\hat \beta_1\times 2) $
np.exp(lincom(np.array([0,2,0,0,0]),b,V))
array([4.33082777, 2.67287958, 7.01717701])
Estimated odds ratio of hyponatremia conferred by 30 min increase in race duration¶
$ OR_{runtime} = \exp(\hat \beta_2\times 30) $
np.round(np.exp(lincom(np.array([0,0,30,0,0]),b,V)),1)
array([1.6, 1.3, 2.1])
List comprehensions to produce tables of estimates¶
or_wtdiff = np.array([np.exp(lincom(np.array([0,(x-0),0,0,0]),b,V)) for x in np.arange(-4,3,1)] )
or_wtdiff
array([[0.05331608, 0.02030837, 0.13997201], [0.11095417, 0.05379679, 0.22883946], [0.23090274, 0.14250745, 0.37412834], [0.4805234 , 0.37750159, 0.61166031], [1. , 1. , 1. ], [2.0810641 , 1.63489436, 2.64899547], [4.33082777, 2.67287958, 7.01717701]])
or_runtime = np.array( [np.exp(lincom(np.array([0,0,(x-210),0,0]),b,V)) for x in np.arange(180,330,30)] )
or_runtime
array([[0.61663599, 0.47689288, 0.79732779], [1. , 1. , 1. ], [1.62170229, 1.25418933, 2.09690696], [2.62991833, 1.57299087, 4.39701881], [4.26494459, 1.97282836, 9.22013936]])
or_bmi = np.array([np.exp(lincom(np.array([0,0,0,(x-22.5),(x**2-506.25)]),b,V)) for x in np.arange(17.5,32.5,0.5)] )
or_bmi
array([[ 8.8295876 , 2.83287201, 27.52034575], [ 6.41898241, 2.42612167, 16.98321052], [ 4.77246888, 2.09990604, 10.84641826], [ 3.62886806, 1.83720915, 7.16776497], [ 2.82195709, 1.62508584, 4.90032072], [ 2.24429937, 1.45366227, 3.46495865], [ 1.82541797, 1.3154089 , 2.53316726], [ 1.51843082, 1.20461365, 1.91400135], [ 1.29175114, 1.11700515, 1.49383466], [ 1.12386423, 1.04949246, 1.20350633], [ 1. , 1. , 1. ], [ 0.90999143, 0.8560019 , 0.96738618], [ 0.84688758, 0.75382167, 0.95144329], [ 0.8060563 , 0.68178057, 0.95298514], [ 0.78461416, 0.63202642, 0.97404058], [ 0.78108457, 0.59919431, 1.01818909], [ 0.79522703, 0.57958228, 1.09110656], [ 0.82800958, 0.57064619, 1.20144473], [ 0.8817201 , 0.57069133, 1.36226065], [ 0.96023444, 0.57868102, 1.59336515], [ 1.06948566, 0.59411267, 1.92522332], [ 1.21821462, 0.61693521, 2.40551494], [ 1.41913531, 0.64749775, 3.11035059], [ 1.69073271, 0.6865272 , 4.16382207], [ 2.06004762, 0.73513502, 5.77281192], [ 2.56702867, 0.79485443, 8.29036854], [ 3.27141266, 0.8677102 , 12.33377327], [ 4.26374376, 0.95632541, 19.00975401], [ 5.68326643, 1.06407279, 30.35461268], [ 7.74740197, 1.19528217, 50.21595654]])
Use lincom to make inference on predicted probabilities¶
$\texttt{logit}\left(P(\texttt{nas135}=1|\texttt{wtdiff},\texttt{runtime}, \texttt{bmi})\right) = \beta_0 + \beta_1\texttt{wtdiff} + \beta_2\texttt{runtime}+ \beta_3\texttt{bmi}+\beta_4\texttt{bmi}^2$
$\texttt{logit}\left(P(\texttt{nas135}=1|\texttt{wtdiff}=-1,\texttt{runtime}=210, \texttt{bmi}=22.5)\right) = \beta_0 + \beta_1(-1) + \beta_2(210)+ \beta_3(22.5)+\beta_4(506.25)$
lincom(np.array([1,-1,210,22.5, 22.5**2]),b,V)
(-3.4678567099077813, -4.102390654917354, -2.8333227648982087)
def invlogit(xb):
return np.exp(xb) / (1 + np.exp(xb))
invlogit(lincom(np.array([1,-1,210,22.5, 22.5**2]),b,V))
array([0.03024077, 0.01626421, 0.05554981])
The probability of hyponatremia for those runners who lost 1 kg, completed the marathon in 3:30, and had a BMI equal to 22.5 kg/m^2 is estimated to be 3% (95% CI = 1.6 to 5.6%).
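The same predicted probability can also be obtained with model.predict on a one-row dataframe; a sketch, where the column names must match those used in the model formula (new is just an illustrative name):
# predicted probability for a runner with wtdiff = -1, runtime = 210 min, bmi = 22.5
new = pd.DataFrame({'wtdiff': [-1.0], 'runtime': [210.0], 'bmi': [22.5], 'bmi2': [22.5**2]})
model.predict(new)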
Plot statistical inference based on linear combinations¶
Graph partial predictions¶
The idea is to use the lincom function defined above to extract adjusted odds ratios.
xval = np.arange(-4,4,1)
ors = np.array([np.exp(lincom(np.array([0,(x-0),0,0,0]),b,V)) for x in xval] )
ors.shape
(8, 3)
ors
array([[ 0.05331608, 0.02030837, 0.13997201], [ 0.11095417, 0.05379679, 0.22883946], [ 0.23090274, 0.14250745, 0.37412834], [ 0.4805234 , 0.37750159, 0.61166031], [ 1. , 1. , 1. ], [ 2.0810641 , 1.63489436, 2.64899547], [ 4.33082777, 2.67287958, 7.01717701], [ 9.01273017, 4.36987577, 18.58847013]])
Figure 2 - Panel A) Weight change¶
# graph the adjusted odds ratios for weight change
xval = np.arange(-4,4,1)
ors = np.array([np.exp(lincom(np.array([0,(x-0),0,0,0]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "wtdiff"]
sample0 = df.loc[df["nas135"] == 0, "wtdiff"]
fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=1)
ax.plot(xval, ors[:,1], color='blue', alpha=0.2, linewidth=0)
ax.plot(xval, ors[:,2], color='blue', alpha=0.2, linewidth=0)
ax.fill_between(xval, ors[:,1], ors[:,2], color='blue', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%4.3f'))
ax.set_yticks([0.125, .250, .5, 1 , 2, 4, 8])
ax.text(-3, 4, r'$p$<0.001', fontsize=14)
ax.plot(sample0, [.07]*len(sample0), '|', color='k')
ax.plot(sample1, [12]*len(sample1), '|', color='k')
ax.set_xlim(-4,3.5) # most of the data
ax.set_ylim(.05,13) # most of the data
ax.vlines(0, .05, 1, color='black', linewidth=.5)
plt.xlabel('Weight Change (kg)')
plt.ylabel('Adjusted Odds Ratio')
plt.savefig("wtdiff_or.pdf")
plt.show()
Figure 2 - Panel B) Race duration¶
# graph the adjusted odds ratios for runtime
xval = np.arange(180,360,30)
ors = np.array([np.exp(lincom(np.array([0,0,(x-210),0,0]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "runtime"]
sample0 = df.loc[df["nas135"] == 0, "runtime"]
fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=4)
ax.plot(xval, ors[:,1], color='blue', alpha=0.2, linewidth=0)
ax.plot(xval, ors[:,2], color='blue', alpha=0.2, linewidth=0)
ax.fill_between(xval, ors[:,1], ors[:,2], color='blue', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%4.3f'))
ax.vlines(210, .05, 1, color='black', linewidth=.5)
ax.set_yticks([.5, 1 , 2, 4, 8])
ax.set_xticks([180, 210, 240, 270, 300, 330] )
ax.set_xticklabels(['3:00', '3:30', '4:00', '4:30', '5:00', '5:30'])
ax.plot(sample0, [.5]*len(sample0), '|', color='k')
ax.plot(sample1, [12]*len(sample1), '|', color='k')
ax.set_xlim(180, 360) # most of the data
ax.set_ylim(.4,13) # most of the data
ax.text(200, 9, r'$p$<0.001', fontsize=14)
plt.xlabel('Race Duration (min)')
plt.ylabel('Adjusted Odds Ratio')
plt.savefig("runtime_or.pdf")
plt.show()
Figure 2 - Panel C) Body Mass Index¶
# graph the adjusted odds ratios for BMI
xval = np.arange(17.5,32.5,0.5)
ors = np.array([np.exp(lincom(np.array([0,0,0,(x-22.5),(x**2-506.25)]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "bmi"]
sample0 = df.loc[df["nas135"] == 0, "bmi"]
fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_ylim(.4, 12) # most of the data
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=4)
ax.plot(xval, ors[:,1], color='grey', alpha=0.2, linewidth=0)
ax.plot(xval, ors[:,2], color='grey', alpha=0.2, linewidth=0)
ax.fill_between(xval, ors[:,1], ors[:,2], color='blue', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%3.2f'))
ax.set_yticks([.5, 1 , 2, 4, 8])
ax.set_xticks(np.arange(17.5,32.5, 2.5))
#ax.set_title('Non-Linear Modeling of BMI vs. Disease Risk', fontsize=18, pad=20)
ax.vlines(22.5, .05, 1, color='black', linewidth=.5)
plt.xlabel(r'Body Mass Index (kg/$m^2$)',fontsize=16)
plt.ylabel('Adjusted Odds Ratio', fontsize=16)
ax.plot(sample0, [.5]*len(sample0), '|', color='k')
ax.plot(sample1, [11]*len(sample1), '|', color='k')
ax.text(20, 7, r'$p$<0.001', fontsize=14)
plt.savefig("bmi_or.pdf")
plt.show()
# graph the adjusted odds ratios for BMI with a histogram
xval = np.arange(17.5,32.5,0.5)
ors = np.array([np.exp(lincom(np.array([0,0,0,(x-22.5),(x**2-506.25)]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "bmi"]
sample0 = df.loc[df["nas135"] == 0, "bmi"]
bmi = np.array(df[['bmi']])
fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_ylim(.4, 12) # most of the data
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=4)
ax.plot(xval, ors[:,1], color='blue', alpha=0.2, linewidth=0)
ax.plot(xval, ors[:,2], color='blue', alpha=0.2, linewidth=0)
ax.fill_between(xval, ors[:,1], ors[:,2], color='b', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%3.2f'))
ax.set_yticks([.5, 1 , 2, 4, 8])
ax.set_xticks(np.arange(17.5,32.5, 2.5))
ax.set_ylim(.4, 12) # most of the data
plt.xlabel(r'Body Mass Index (kg/$m^2$)',fontsize=16)
plt.ylabel('Adjusted Odds Ratio', fontsize=16)
#ax.set_title('Non-Linear Modeling of BMI vs. Disease Risk', fontsize=18, pad=20)
ax.plot(sample0, [.5]*len(sample0), '|', color='k')
ax.plot(sample1, [11]*len(sample1), '|', color='k')
ax.text(20, 7, r'$p$<0.001', fontsize=14)
ax.vlines(22.5, .05, 1, color='black', linewidth=.5)
ax2=ax.twinx()
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.hist(bmi, 15, color='red', alpha=0.1, density=True, rwidth=.9)
ax2.set_yticks([])
plt.savefig("bmi_or2.pdf")
plt.show()