Course 5316: Fundamentals of using Python in Health Related Research ¶
Peter Alping, PhD ¶
Orsini Nicola, PhD ¶
Day 4 - Linear and logistic regression models ¶
What are you going to learn?
- specify a linear and logistic regression model
- conduct a univariate and multivariate test of hypothesis
- derive predicted outcomes and their changes for any linear combination
- present the estimated model in a graphical form
- model categorical predictors with dummy variables
- model quantitative predictors with splines
Package¶
statsmodel¶
statsmodel is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from matplotlib.ticker import FormatStrFormatter
Motivating example: observational cohort study¶
df = pd.read_stata('http://www.stats4life.se/data/marathon.dta', convert_categoricals=False)
Conditional mean¶
Consider a linear regression model to conduct statistical inference on the mean outcome conditionally on the values of the predictor.
$E(Y|x) = \beta_0 + \beta_1x $
For example, let's conduct statistical inference on the mean serum sodium concentration depending on weight change.
$E(\texttt{na}|\texttt{wtdiff}) = \beta_0 + \beta_1\texttt{wtdiff}$
# A scatter plot of individual outcomes can be a good starting point
plt.scatter(df.wtdiff, df.na, c='r', alpha=.2)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.top'] = False
plt.show()
# Estimates of the model parameter using Ordinary Least Squares (OLS) method
model = smf.ols('na ~ wtdiff', data=df).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: na R-squared: 0.169
Model: OLS Adj. R-squared: 0.167
Method: Least Squares F-statistic: 91.86
Date: Thu, 02 Oct 2025 Prob (F-statistic): 6.21e-20
Time: 07:55:02 Log-Likelihood: -1317.2
No. Observations: 455 AIC: 2638.
Df Residuals: 453 BIC: 2647.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 139.5534 0.223 624.476 0.000 139.114 139.993
wtdiff -1.2183 0.127 -9.584 0.000 -1.468 -0.968
==============================================================================
Omnibus: 65.662 Durbin-Watson: 1.284
Prob(Omnibus): 0.000 Jarque-Bera (JB): 327.577
Skew: -0.491 Prob(JB): 7.37e-72
Kurtosis: 7.039 Cond. No. 2.04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Store the predicted mean outcome¶
# Add a column in the dataframe containing the predicted outcomes
df['fit'] = model.fittedvalues
# graph individual data points the the predicted mean outcome
plt.scatter(df.wtdiff, df.na, c='r', alpha=.2)
plt.plot(df.wtdiff, df.fit, c='b', alpha=.3 , linewidth=5)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.show()
# comparison of the predicted mean outcome vs observed outcome
subset = np.column_stack((df['wtdiff'], df['na'], df['fit']))
subset[10:30]
array([[ nan, 141. , nan],
[ nan, 143. , nan],
[ nan, 141. , nan],
[ nan, 141. , nan],
[ nan, 140. , nan],
[ nan, 142. , nan],
[ nan, 143. , nan],
[ 0.68181818, 149. , 138.72278443],
[ -0.90909091, 142. , 140.66096356],
[ 1.13636364, 141. , 138.16901896],
[ -0.68181818, 138. , 140.38408083],
[ 0.22727273, 140. , 139.27654989],
[ 0. , 142. , 139.55343263],
[ -0.59659091, 149. , 140.2802498 ],
[ -0.22727273, 140. , 139.83031537],
[ -1.81818182, 147. , 141.76849449],
[ 0. , 136. , 139.55343263],
[ 0.45454546, 139. , 138.99966716],
[ 0. , 143. , 139.55343263],
[ 0.22727273, 142. , 139.27654989]])
Univariate Wald-type test¶
Consider testing an hypothesis involving only one regression coefficient
$H_0: \beta_1 = 0 $
A Wald-type statistic is computed from the estimated model and compared with an F-distribution. The result of this test is shown typically in the output of the estimated model.
hypotheses = 'wtdiff = 0'
t_test = model.t_test(hypotheses)
t_test
<class 'statsmodels.stats.contrast.ContrastResults'>
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 -1.2183 0.127 -9.584 0.000 -1.468 -0.968
==============================================================================
# you can test any other value of the regression coefficient b1
hypotheses = 'wtdiff = -1.3'
t_test = model.t_test(hypotheses)
t_test
<class 'statsmodels.stats.contrast.ContrastResults'>
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 -1.2183 0.127 0.643 0.521 -1.468 -0.968
==============================================================================
Access estimated parameters¶
# access estimated regression coefficients
model.params
Intercept 139.553433 wtdiff -1.218284 dtype: float64
# access estimated var/covariances of the regression coefficients
model.cov_params()
| Intercept | wtdiff | |
|---|---|---|
| Intercept | 0.04994 | 0.011140 |
| wtdiff | 0.01114 | 0.016157 |
# access estimated std errors of the regression coefficients
np.sqrt(np.diag(model.cov_params()))
array([0.22347274, 0.12711121])
Inference for any linear combinations of parameters and data¶
Suppose you want to conduct statistical inference on the predicted outcome (or change in predicted outcome) for a certain linear combination of the parameters and data $X\boldsymbol{\beta}$.
Let's consider a large-sample Wald-type test with reference to a standard normal distribution.
The $100(1-\alpha)$% confidence intervals can be obtained as follows
$ \begin{equation} \begin{gathered} X \boldsymbol{\hat \beta} \pm z_{\alpha / 2} \mathrm{diag} [ XV(\boldsymbol{\hat \beta})X' ]^{1/2} \end{gathered} \end{equation} $
Considering that Type I error is $\alpha=0.05$ in a two-sided test, we have that
$z_{0.05/ 2} = Q_{.025}(Z)$
So, we could rewrite the above in the following way to highlight the fact that the degree of confidence is related to the specific quantiles of the sampling distribution being used.
$ \begin{equation} \begin{gathered} Q_p(X \boldsymbol{\hat \beta}) = X \boldsymbol{\hat \beta} + Q_p(Z) \sqrt{ \mathrm{diag}(XV(\boldsymbol{\hat \beta})X') } \end{gathered} \end{equation} $
where $V(\boldsymbol{\hat \beta})$ is the estimated variance/covariance matrix of the estimated of $\boldsymbol{\hat \beta}$.
In Stata this is done with lincom command
In R this is done with ci.lin() function in Epi library
def lincom(X,b,V):
Var = np.dot(X, np.dot(V,X.T))
if Var.ndim == 0:
Se = np.sqrt(Var)
else:
Se = np.sqrt(np.diag(Var))
return (np.matmul(X,b), np.matmul(X,b)+stats.norm.ppf(.025)*Se, np.matmul(X,b)+stats.norm.ppf(.975)*Se)
b = model.params
V = model.cov_params()
Example 1: Confidence interval for the mean outcome¶
Derive a 95% confidence interval of the mean serum sodium concentration for those runners who increased 3 kg.
$E(\texttt{na}|\texttt{wtdiff}=3) = \beta_0 + \beta_1 3$
lincom(np.array([1, 3]),b,V)
(135.898580556215, 134.89498068034243, 136.90218043208756)
def lincom2(X,b,V):
Var = np.dot(X, np.dot(V,X.T))
if Var.ndim == 0:
Se = np.sqrt(Var)
else:
Se = np.sqrt(np.diag(Var))
z = (np.matmul(X,b)-0)/Se
p = stats.norm.cdf( -abs(z))*2
return ( np.matmul(X,b), Se, z, p, np.matmul(X,b)+stats.norm.ppf(.025)*Se, np.matmul(X,b)+stats.norm.ppf(.975)*Se)
results = lincom2(np.array([0, 1]),b,V)
np.round(np.transpose(results),4)
array([-1.2183, 0.1271, -9.5844, 0. , -1.4674, -0.9692])
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: na R-squared: 0.169
Model: OLS Adj. R-squared: 0.167
Method: Least Squares F-statistic: 91.86
Date: Thu, 02 Oct 2025 Prob (F-statistic): 6.21e-20
Time: 07:55:02 Log-Likelihood: -1317.2
No. Observations: 455 AIC: 2638.
Df Residuals: 453 BIC: 2647.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 139.5534 0.223 624.476 0.000 139.114 139.993
wtdiff -1.2183 0.127 -9.584 0.000 -1.468 -0.968
==============================================================================
Omnibus: 65.662 Durbin-Watson: 1.284
Prob(Omnibus): 0.000 Jarque-Bera (JB): 327.577
Skew: -0.491 Prob(JB): 7.37e-72
Kurtosis: 7.039 Cond. No. 2.04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Example 2: Confidence interval for the difference in mean outcome¶
Derive a 95% confidence interval of the mean outcome difference comparing those who increased 3 kg vs those who decreased 1 kg.
$E(\texttt{na}|\texttt{wtdiff}=3) - E(\texttt{na}|\texttt{wtdiff}=-1) = \beta_1 (3- -1) = \beta_1 4$
lincom(np.array([0, 4]),b,V)
(-4.873136097422644, -5.869669663999436, -3.876602530845852)
Linear and quadratic function for the quantitative predictors¶
Consider a linear regression model to conduct statistical inference on the mean serum sodium concentration as function of weight change (linear function), and bmi (quadratic function)
$E(\texttt{na}|\texttt{wtdiff}, \texttt{bmi}) = \beta_0 + \beta_1\texttt{wtdiff}+ \beta_2\texttt{bmi} + \beta_3\texttt{bmi}^2$
df['bmisq'] = df['bmi']**2
# Pearson correlation between bmi and bmi squared
valid = df[['bmi', 'bmisq']].dropna()
r, pval = stats.pearsonr(valid['bmi'], valid['bmisq'])
print("Correlation:", r, "p-value:", pval)
Correlation: 0.9965742291212139 p-value: 0.0
model = smf.ols('na ~ wtdiff + bmi + bmisq' , data=df).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: na R-squared: 0.186
Model: OLS Adj. R-squared: 0.180
Method: Least Squares F-statistic: 34.03
Date: Thu, 02 Oct 2025 Prob (F-statistic): 7.91e-20
Time: 07:55:02 Log-Likelihood: -1305.2
No. Observations: 452 AIC: 2618.
Df Residuals: 448 BIC: 2635.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 105.0147 11.218 9.361 0.000 82.968 127.061
wtdiff -1.1770 0.127 -9.247 0.000 -1.427 -0.927
bmi 2.8707 0.940 3.053 0.002 1.023 4.719
bmisq -0.0587 0.020 -3.006 0.003 -0.097 -0.020
==============================================================================
Omnibus: 60.670 Durbin-Watson: 1.308
Prob(Omnibus): 0.000 Jarque-Bera (JB): 330.172
Skew: -0.400 Prob(JB): 2.01e-72
Kurtosis: 7.110 Cond. No. 3.03e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.03e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
F test for a joint hypothesis¶
Since body mass index was modelled with a quadratic function (degree-2 polynomial), then a test for the hypothesis of no adjusted association between body mass index and mean serum sodium concentration is obtained by testing the two regression coefficients equal to zero.
$H_0: \beta_2 = \beta_3 = 0 $
A Wald-type statistic is computed from the estimated model and compared with an F-distribution.
hypotheses = ['bmi = 0', 'bmisq = 0']
f_test = model.f_test(hypotheses)
f_test
<class 'statsmodels.stats.contrast.ContrastResults'> <F test: F=4.767628913599036, p=0.00893659800206336, df_denom=448, df_num=2>
Conditional probability¶
Consider a logistic regression model to conduct statistical inference on the outcome probability conditionally on the values of the predictor.
$P(Y=1|x) = \texttt{invlogit}(\beta_0 + \beta_1x) $
The logit of the outcome probability is related to a linear combination of regression coefficients and variables
$\texttt{logit}\left(P(Y=1|x)\right) = \beta_0 + \beta_1x $
Binary predictor¶
Conduct statistical inference on the odds ratio of hyponatremia associated with female sex.
$\texttt{logit}\left(P(\texttt{nas135}|\texttt{female})\right) = \beta_0 + \beta_1\texttt{female} $
Estimates of the regression coefficients are obtained with the maximum likelihood method.
model = smf.logit('nas135 ~ female ', data=df).fit()
Optimization terminated successfully.
Current function value: 0.360585
Iterations 6
print(model.summary())
Logit Regression Results
==============================================================================
Dep. Variable: nas135 No. Observations: 488
Model: Logit Df Residuals: 486
Method: MLE Df Model: 1
Date: Thu, 02 Oct 2025 Pseudo R-squ.: 0.05293
Time: 07:55:02 Log-Likelihood: -175.97
converged: True LL-Null: -185.80
Covariance Type: nonrobust LLR p-value: 9.204e-06
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -2.4749 0.208 -11.884 0.000 -2.883 -2.067
female 1.2260 0.280 4.386 0.000 0.678 1.774
==============================================================================
Confidence interval for the Odds Ratio¶
Thanks to the equivariance property of the maximum likelihood estimator
- obtain estimates with confidence limits
- exponentiate and round them
b = model.params
V = model.cov_params()
np.round(np.exp(lincom(np.array([0, 1]), b, V)), 1)
array([3.4, 2. , 5.9])
params = model.params
conf = model.conf_int()
conf['OR'] = params
conf.columns = ['LB', 'UB', 'OR']
print(np.around(np.exp(conf[1:]),1))
LB UB OR female 2.0 5.9 3.4
Categorization of continuous predictors¶
df['bmic'] = pd.cut(df['bmi'],bins=[0,20,25,40],labels=['<20','20-25','>25'])
df['runtimehc'] = pd.cut(df['runtime']/60, bins=[0,3.5,4,8] , labels=['<3:30','3:30-4:00','>4:00'])
df['runtimehc'].value_counts(normalize=True)
runtimehc <3:30 0.404612 3:30-4:00 0.308176 >4:00 0.287212 Name: proportion, dtype: float64
df['bmic'].value_counts(normalize=True)
bmic 20-25 0.701075 >25 0.193548 <20 0.105376 Name: proportion, dtype: float64
Final multivariable logistic model with categorized predictors¶
Consider the estimates provided in Table 2 of the NEJM paper.
$\texttt{logit}(P(\texttt{nas135})=1|\texttt{wtdiff}, \texttt{runtime}, \texttt{bmi}) = \beta_0 + \beta_1I(\texttt{wtdiff}>0) + \beta_2 I(3.5<\texttt{runtime}<4) + \beta_3 I(\texttt{runtime}>4) + \beta_4 I(\texttt{bmi}<20) + \beta_5 I(\texttt{bmi}>25)$
Indicator (or dummy) variables are conveniently created with C() notation where treatment is indicating the chosen reference level.
model = smf.logit('nas135 ~ I(wtdiff>0) + C(runtimehc,Treatment(0)) + C(bmic,Treatment(1)) ', data=df).fit()
print(model.summary())
Optimization terminated successfully.
Current function value: 0.311873
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: nas135 No. Observations: 455
Model: Logit Df Residuals: 449
Method: MLE Df Model: 5
Date: Thu, 02 Oct 2025 Pseudo R-squ.: 0.1826
Time: 07:55:02 Log-Likelihood: -141.90
converged: True LL-Null: -173.61
Covariance Type: nonrobust LLR p-value: 2.395e-12
===========================================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------
Intercept -3.8727 0.436 -8.889 0.000 -4.727 -3.019
I(wtdiff > 0)[T.True] 1.4578 0.310 4.698 0.000 0.850 2.066
C(runtimehc, Treatment(0))[T.3:30-4:00] 1.0549 0.475 2.222 0.026 0.124 1.985
C(runtimehc, Treatment(0))[T.>4:00] 1.8032 0.466 3.867 0.000 0.889 2.717
C(bmic, Treatment(1))[T.<20] 1.3910 0.410 3.390 0.001 0.587 2.195
C(bmic, Treatment(1))[T.>25] 0.0369 0.388 0.095 0.924 -0.723 0.797
===========================================================================================================
params = model.params
conf = model.conf_int()
conf['OR'] = params
conf.columns = ['LB', 'UB', 'OR']
print(np.around(np.exp(conf[1:]),1))
LB UB OR I(wtdiff > 0)[T.True] 2.3 7.9 4.3 C(runtimehc, Treatment(0))[T.3:30-4:00] 1.1 7.3 2.9 C(runtimehc, Treatment(0))[T.>4:00] 2.4 15.1 6.1 C(bmic, Treatment(1))[T.<20] 1.8 9.0 4.0 C(bmic, Treatment(1))[T.>25] 0.5 2.2 1.0
All continuous predictors using linear and quadratic functions¶
This is the final logistic regression model where each continuous preditor is modelled as quantitative
You need to assume the functional relationship between each continuous predicted and the logit of the outcome probability
- linear function for weight change
- linear function for race duration
- quadratic function for BMI
$\texttt{logit}(P(\texttt{nas135})=1|\texttt{wtdiff}, \texttt{runtime}, \texttt{bmi}) = \beta_0 + \beta_1 \texttt{wtdiff} + \beta_2 \texttt{runtime} + \beta_3 \texttt{bmi} + \beta_5 \texttt{bmi}^2$
df['bmisq'] = df['bmi']**2
model = smf.logit('nas135~wtdiff+runtime+bmi+bmisq', data=df).fit()
print(model.summary())
Optimization terminated successfully.
Current function value: 0.282840
Iterations 8
Logit Regression Results
==============================================================================
Dep. Variable: nas135 No. Observations: 446
Model: Logit Df Residuals: 441
Method: MLE Df Model: 4
Date: Thu, 02 Oct 2025 Pseudo R-squ.: 0.2515
Time: 07:55:02 Log-Likelihood: -126.15
converged: True LL-Null: -168.53
Covariance Type: nonrobust LLR p-value: 1.708e-17
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 21.3638 7.482 2.855 0.004 6.700 36.028
wtdiff 0.7329 0.123 5.953 0.000 0.492 0.974
runtime 0.0161 0.004 3.687 0.000 0.008 0.025
bmi -2.2319 0.622 -3.586 0.000 -3.452 -1.012
bmisq 0.0449 0.013 3.472 0.001 0.020 0.070
==============================================================================
b = model.params
V = model.cov_params
params = model.params
conf = model.conf_int()
conf['OR'] = params
conf.columns = ['LB', 'UB', 'OR']
print(np.around(np.exp(conf[1:3]),2))
LB UB OR wtdiff 1.63 2.65 2.08 runtime 1.01 1.02 1.02
model.nobs
446
AIC and maximized log-likelihood¶
Akaike’s (1974) information criterion is defined as
$AIC = -2 lnL + 2k$
where $lnL$ is the maximized log-likelihood of the model and $k$ is the number of parameters estimated.
model.aic
262.2931311594778
model.llf
-126.1465655797389
Modelling quantitative predictors using splines¶
A spline of a predictor $x$ is just a transformation of it depending on the location of a knot $k$ and the power to which the distance between $x$ and $k$ is raised.
Degree-0 spline
$I(x>k)(x-k)^0$
Degree-1 spline
$I(x>k)(x-k)^1$
Degree-2 spline
$I(x>k)(x-k)^2$
Degree-3 spline
$I(x>k)(x-k)^3$
The binary indicator variable $I(x>k)$ takes value 1 only when $x>k$ and 0 otherwise.
Degree-1 Spline with linear regression¶
$ E(Y|x) = \beta_0 + \beta_1x + \beta_2 I(x>0)(x-0)^1$
if $x<0$
$ E(Y|x) = \beta_0 + \beta_1 x $
if $x \ge 0$
$ E(Y|x) = \beta_0 + (\beta_1 + \beta_2) x $
# Create the linear spline at the chosen knot
df.loc[ df.wtdiff>0, 'wc_s1'] = (df.wtdiff-0)
df.loc[ df.wtdiff<=0, 'wc_s1'] = 0
df[ ['wtdiff', 'wc_s1']]
| wtdiff | wc_s1 | |
|---|---|---|
| 0 | NaN | NaN |
| 1 | NaN | NaN |
| 2 | NaN | NaN |
| 3 | NaN | NaN |
| 4 | NaN | NaN |
| ... | ... | ... |
| 483 | -2.500000 | 0.000000 |
| 484 | 0.454545 | 0.454545 |
| 485 | 0.312500 | 0.312500 |
| 486 | 1.136364 | 1.136364 |
| 487 | -0.909091 | 0.000000 |
488 rows × 2 columns
model = smf.ols('na~ wtdiff + wc_s1', data=df).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: na R-squared: 0.186
Model: OLS Adj. R-squared: 0.182
Method: Least Squares F-statistic: 51.51
Date: Thu, 02 Oct 2025 Prob (F-statistic): 7.06e-21
Time: 07:55:03 Log-Likelihood: -1312.5
No. Observations: 455 AIC: 2631.
Df Residuals: 452 BIC: 2643.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 140.3155 0.332 422.036 0.000 139.662 140.969
wtdiff -0.7795 0.190 -4.094 0.000 -1.154 -0.405
wc_s1 -1.3319 0.433 -3.073 0.002 -2.184 -0.480
==============================================================================
Omnibus: 66.281 Durbin-Watson: 1.324
Prob(Omnibus): 0.000 Jarque-Bera (JB): 326.527
Skew: -0.503 Prob(JB): 1.25e-71
Kurtosis: 7.026 Cond. No. 4.85
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Visualize piece-wise linear trend¶
df['fit'] = model.fittedvalues
to_plot = df[ ['wtdiff', 'fit'] ].sort_values(['wtdiff'])
plt.plot(to_plot.wtdiff, to_plot.fit, c='b', alpha=.5, linewidth=3)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.axvline(0, c='g', alpha=.2, linestyle='dashed', linewidth=2)
plt.show()
Define a function to generate restricted cubic splines¶
# Define a function to generate restricted cubic splines given number of knots
def rcs(x, nk):
if (nk == 3):
lk = [.1, .5, .9]
if (nk == 4):
lk = [.05, .35, .65, .95]
if (nk == 5):
lk = [.05 , .275, .50, .725, .95]
if (nk == 6):
lk = [.05 , .23, .41 , .59, .77 , .95]
if (nk == 7):
lk = [.25 , .1833 , .3417 , .50 , .6583 , .8167 , .975]
knots =[]
for i in range(nk):
knots.append(np.quantile(x, lk[i] ) )
u = np.zeros(shape=(len(x),nk) )
i=0
while i < nk:
u[:,i] = np.array(x>knots[i])*(x-knots[i])
i = i + 1
spline = np.zeros(shape=(len(x),(nk-1)))
spline[:,0] = x
i = 0
while i < (nk-2):
ip1 = i + 1
spline[:,ip1] = ( np.power(u[:,i],3) - 1/(knots[nk-1] - knots[nk-2])*
( np.power(u[:,(nk-2)],3)*(knots[nk-1] - knots[i])-np.power(u[:,(nk-1)],3)*(knots[nk-2] - knots[i]) )) / (knots[nk-1] - knots[0])**2
i=i+1
return spline
df = pd.read_stata('http://www.stats4life.se/data/marathon.dta', convert_categoricals=False)
df = df[['na', 'nas135', 'wtdiff', 'runtime', 'bmi']].dropna()
df['bmi2'] = np.power(df['bmi'],2)
y = df['na'].to_numpy()
# loop to compare AIC across spline models with increasing nr of knots
bmis = []
l_aic = []
nknots = [3,4,5,6,7]
for k in nknots:
print("Nr of knots = " , k)
X = rcs(df.bmi,k)
X = sm.add_constant(X)
smodel = sm.OLS(y,X).fit()
print(smodel.summary())
l_aic.append(smodel.aic)
fits = smodel.fittedvalues
# print(np.column_stack( [X[:,1], fits] ))
plt.scatter(X[:,1], fits, s=2, c='r', alpha=.4)
plt.title('Restricted Cubic Spline with {0} Knots'.format(k))
plt.show()
print('Spline model with %.0f knots AIC is %.0f' % (k, smodel.aic))
# lowest AIC is indicating a good compromise between fit and parameters
list(zip(nknots, l_aic))
Nr of knots = 3
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.038
Model: OLS Adj. R-squared: 0.034
Method: Least Squares F-statistic: 8.754
Date: Thu, 02 Oct 2025 Prob (F-statistic): 0.000187
Time: 07:55:03 Log-Likelihood: -1325.3
No. Observations: 446 AIC: 2657.
Df Residuals: 443 BIC: 2669.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 121.5136 4.555 26.674 0.000 112.561 130.467
x1 0.8859 0.213 4.161 0.000 0.467 1.304
x2 -1.1100 0.277 -4.001 0.000 -1.655 -0.565
==============================================================================
Omnibus: 70.144 Durbin-Watson: 1.181
Prob(Omnibus): 0.000 Jarque-Bera (JB): 316.526
Skew: -0.590 Prob(JB): 1.85e-69
Kurtosis: 6.955 Cond. No. 473.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Spline model with 3 knots AIC is 2657
Nr of knots = 4
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.039
Model: OLS Adj. R-squared: 0.032
Method: Least Squares F-statistic: 5.957
Date: Thu, 02 Oct 2025 Prob (F-statistic): 0.000548
Time: 07:55:03 Log-Likelihood: -1325.1
No. Observations: 446 AIC: 2658.
Df Residuals: 442 BIC: 2675.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 117.1229 7.674 15.263 0.000 102.042 132.204
x1 1.1091 0.377 2.946 0.003 0.369 1.849
x2 -2.3612 1.570 -1.504 0.133 -5.446 0.724
x3 4.6141 5.000 0.923 0.357 -5.213 14.442
==============================================================================
Omnibus: 70.064 Durbin-Watson: 1.183
Prob(Omnibus): 0.000 Jarque-Bera (JB): 316.204
Skew: -0.589 Prob(JB): 2.17e-69
Kurtosis: 6.953 Cond. No. 928.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Spline model with 4 knots AIC is 2658
Nr of knots = 5
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.039
Model: OLS Adj. R-squared: 0.031
Method: Least Squares F-statistic: 4.512
Date: Thu, 02 Oct 2025 Prob (F-statistic): 0.00139
Time: 07:55:03 Log-Likelihood: -1325.0
No. Observations: 446 AIC: 2660.
Df Residuals: 441 BIC: 2680.
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 113.8794 10.237 11.124 0.000 93.760 133.999
x1 1.2772 0.514 2.483 0.013 0.266 2.288
x2 -4.3913 4.339 -1.012 0.312 -12.919 4.136
x3 13.2230 21.448 0.617 0.538 -28.931 55.377
x4 -11.6890 26.956 -0.434 0.665 -64.667 41.289
==============================================================================
Omnibus: 70.538 Durbin-Watson: 1.186
Prob(Omnibus): 0.000 Jarque-Bera (JB): 322.180
Skew: -0.590 Prob(JB): 1.10e-70
Kurtosis: 6.993 Cond. No. 3.65e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.65e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Spline model with 5 knots AIC is 2660
Nr of knots = 6
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.039
Model: OLS Adj. R-squared: 0.028
Method: Least Squares F-statistic: 3.576
Date: Thu, 02 Oct 2025 Prob (F-statistic): 0.00351
Time: 07:55:03 Log-Likelihood: -1325.0
No. Observations: 446 AIC: 2662.
Df Residuals: 440 BIC: 2687.
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 115.5313 11.903 9.706 0.000 92.138 138.925
x1 1.1914 0.605 1.969 0.050 0.002 2.380
x2 -3.2623 7.866 -0.415 0.679 -18.721 12.196
x3 2.4271 45.041 0.054 0.957 -86.095 90.949
x4 8.1119 77.386 0.105 0.917 -143.981 160.205
x5 -10.9087 61.681 -0.177 0.860 -132.135 110.318
==============================================================================
Omnibus: 70.170 Durbin-Watson: 1.186
Prob(Omnibus): 0.000 Jarque-Bera (JB): 318.165
Skew: -0.589 Prob(JB): 8.15e-70
Kurtosis: 6.967 Cond. No. 1.11e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.11e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Spline model with 6 knots AIC is 2662
Nr of knots = 7
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.043
Model: OLS Adj. R-squared: 0.030
Method: Least Squares F-statistic: 3.301
Date: Thu, 02 Oct 2025 Prob (F-statistic): 0.00345
Time: 07:55:03 Log-Likelihood: -1324.1
No. Observations: 446 AIC: 2662.
Df Residuals: 439 BIC: 2691.
Df Model: 6
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 116.2837 9.241 12.584 0.000 98.122 134.445
x1 1.1481 0.461 2.491 0.013 0.242 2.054
x2 -121.1906 295.221 -0.411 0.682 -701.412 459.031
x3 11.8721 94.523 0.126 0.900 -173.901 197.645
x4 241.8662 337.937 0.716 0.475 -422.310 906.042
x5 -191.9364 189.090 -1.015 0.311 -563.570 179.697
x6 74.5418 66.352 1.123 0.262 -55.864 204.948
==============================================================================
Omnibus: 74.168 Durbin-Watson: 1.189
Prob(Omnibus): 0.000 Jarque-Bera (JB): 345.955
Skew: -0.623 Prob(JB): 7.53e-76
Kurtosis: 7.131 Cond. No. 5.07e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.07e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Spline model with 7 knots AIC is 2662
[(3, 2656.5587353722885), (4, 2658.1682651667147), (5, 2659.955934150113), (6, 2662.0784684105115), (7, 2662.1652085479254)]
bmis = rcs(df.bmi,3)
np.size(bmis, 0), np.size(bmis,1)
(446, 2)
# creating the design matrix
bmis = rcs(df.bmi,3)
y = df['na'].to_numpy()
X = bmis
X = sm.add_constant(X)
model1 = sm.OLS(y,X).fit()
print(model1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.038
Model: OLS Adj. R-squared: 0.034
Method: Least Squares F-statistic: 8.754
Date: Thu, 02 Oct 2025 Prob (F-statistic): 0.000187
Time: 07:55:03 Log-Likelihood: -1325.3
No. Observations: 446 AIC: 2657.
Df Residuals: 443 BIC: 2669.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 121.5136 4.555 26.674 0.000 112.561 130.467
x1 0.8859 0.213 4.161 0.000 0.467 1.304
x2 -1.1100 0.277 -4.001 0.000 -1.655 -0.565
==============================================================================
Omnibus: 70.144 Durbin-Watson: 1.181
Prob(Omnibus): 0.000 Jarque-Bera (JB): 316.526
Skew: -0.590 Prob(JB): 1.85e-69
Kurtosis: 6.955 Cond. No. 473.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
df
| na | nas135 | wtdiff | runtime | bmi | bmi2 | |
|---|---|---|---|---|---|---|
| 18 | 142 | 0 | -0.909091 | 195.0 | 19.503027 | 380.368077 |
| 19 | 141 | 0 | 1.136364 | 195.0 | 18.695052 | 349.504971 |
| 20 | 138 | 0 | -0.681818 | 207.0 | 20.157869 | 406.339666 |
| 21 | 140 | 0 | 0.227273 | 234.0 | 17.063750 | 291.171557 |
| 22 | 142 | 0 | 0.000000 | 233.0 | 19.428191 | 377.454619 |
| ... | ... | ... | ... | ... | ... | ... |
| 483 | 133 | 1 | -2.500000 | 308.0 | 29.188370 | 851.960947 |
| 484 | 128 | 1 | 0.454545 | 250.0 | 31.387685 | 985.186749 |
| 485 | 133 | 1 | 0.312500 | 283.0 | 28.311327 | 801.531230 |
| 486 | 131 | 1 | 1.136364 | 308.0 | 31.224404 | 974.963415 |
| 487 | 135 | 1 | -0.909091 | 276.0 | 28.292931 | 800.489939 |
446 rows × 6 columns
# add the cubic splines to the dataframe
df['bmis1'] = bmis[:,0]
df['bmis2'] = bmis[:,1]
# use the formula approach
model2 = smf.ols('na ~ bmis1 + bmis2', data=df).fit()
print(model2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: na R-squared: 0.038
Model: OLS Adj. R-squared: 0.034
Method: Least Squares F-statistic: 8.754
Date: Thu, 02 Oct 2025 Prob (F-statistic): 0.000187
Time: 07:55:03 Log-Likelihood: -1325.3
No. Observations: 446 AIC: 2657.
Df Residuals: 443 BIC: 2669.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 121.5136 4.555 26.674 0.000 112.561 130.467
bmis1 0.8859 0.213 4.161 0.000 0.467 1.304
bmis2 -1.1100 0.277 -4.001 0.000 -1.655 -0.565
==============================================================================
Omnibus: 70.144 Durbin-Watson: 1.181
Prob(Omnibus): 0.000 Jarque-Bera (JB): 316.526
Skew: -0.590 Prob(JB): 1.85e-69
Kurtosis: 6.955 Cond. No. 473.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Testing regression coefficients of the splines¶
$H_0: \beta_1 = \beta_2 = 0 $
hypotheses = ['bmis1 = 0', 'bmis2 = 0']
f_test = model2.f_test(hypotheses)
f_test
<class 'statsmodels.stats.contrast.ContrastResults'> <F test: F=8.753689785498299, p=0.0001868614716924368, df_denom=443, df_num=2>
Graph the predicted mean from the cubic spline linear model¶
df['fits'] = model2.fittedvalues
# (x,y) pairs to be plotted need to be sorted first
to_plot = df[ ['bmis1', 'fits'] ].sort_values(['bmis1', 'fits'])
#plt.scatter(df.bmi, df.na, c='r', alpha=.2, s=3)
plt.plot(to_plot['bmis1'],to_plot['fits'], c='b', linewidth=2, alpha=.5)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.show()
Multivariable logistic regression model¶
model = smf.logit('nas135 ~ wtdiff + runtime', data=df).fit()
print(model.summary())
Optimization terminated successfully.
Current function value: 0.298177
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: nas135 No. Observations: 446
Model: Logit Df Residuals: 443
Method: MLE Df Model: 2
Date: Thu, 02 Oct 2025 Pseudo R-squ.: 0.2109
Time: 07:55:03 Log-Likelihood: -132.99
converged: True LL-Null: -168.53
Covariance Type: nonrobust LLR p-value: 3.681e-16
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -5.7753 0.958 -6.029 0.000 -7.653 -3.898
wtdiff 0.6980 0.115 6.049 0.000 0.472 0.924
runtime 0.0162 0.004 4.268 0.000 0.009 0.024
==============================================================================
Wald-type test for one parameter¶
Recall that under $H_0$, the $Z^2$ follows a $\chi^2(1)$
hypothesis='wtdiff=0'
wald = model.wald_test(hypothesis, scalar=True)
print("Wald test statistic:", wald.statistic)
print("p-value:", wald.pvalue)
Wald test statistic: 36.5913183097515 p-value: 1.4567908936820186e-09
Cubic spline logistic regression model¶
model = smf.logit('nas135 ~ bmis1 + bmis2 + wtdiff', data=df).fit()
print(model.summary())
Optimization terminated successfully.
Current function value: 0.298471
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: nas135 No. Observations: 446
Model: Logit Df Residuals: 442
Method: MLE Df Model: 3
Date: Thu, 02 Oct 2025 Pseudo R-squ.: 0.2101
Time: 07:55:03 Log-Likelihood: -133.12
converged: True LL-Null: -168.53
Covariance Type: nonrobust LLR p-value: 2.857e-15
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 8.5283 2.673 3.191 0.001 3.289 13.767
bmis1 -0.5033 0.128 -3.929 0.000 -0.754 -0.252
bmis2 0.7309 0.170 4.312 0.000 0.399 1.063
wtdiff 0.7518 0.118 6.396 0.000 0.521 0.982
==============================================================================
Wald-type test of two parameters¶
hypothesis = '(bmis1=0, bmis2=0)'
wald = model.wald_test(hypothesis, scalar=True)
print("Wald test statistic:", wald.statistic)
print("p-value:", wald.pvalue)
Wald test statistic: 18.638150365520445 p-value: 8.969682522687936e-05
Likelihood ratio test¶
An alternative to the Wald-type test, based on the estimated regression coefficients and their estimated variance/covariance, is the likelihood ratio test
$ \lambda = 2|\ell_R- \ell_U|$
where $\ell_R$ is the log-likelihood of the restricted model and $\ell_R$ is the log-likelihood of the unrestricted model. It is the absolute distance that matters.
The two models need to be nested, that is, estimated on the same number of subjects but different number of parameters in the model.
The two models are nested if you can obtain the restricted model from the full model by putting constraints on the parameters you want to test.
Under the null hypothesis that the parameters being tested are equal to zero (assuming the restricted model is true), the likelihood ratio test follows a $\chi^2$ distribution with degrees of freedom equal to the difference in the number of parameters $(k_1 - k_0)$ of the two models being compared.
Under the hypothesis that the $(k_1 - k_0)$ parameters are all equal to zero, we have that $ \lambda \sim \chi^2(k_1 - k_0) $
A computed likelihood ratio statistic exceeding a top quantile, saying 0.95, of the null sampling distribution is considered a strong indication of incompatibility between this sample of data and the hypothesis.
Alternatively, one can compute the p-value from the cumulative distribution function of the $\chi^2$.
# Model 1
m1 = smf.logit('nas135 ~ wtdiff + runtime + bmis1 + bmis2 ', data=df).fit()
l1 = m1.llf
# Model 0 - restricted model without spline terms for bmi
m0 = smf.logit('nas135 ~ wtdiff + runtime', data=df).fit()
l0 = m0.llf
Optimization terminated successfully.
Current function value: 0.283066
Iterations 8
Optimization terminated successfully.
Current function value: 0.298177
Iterations 7
lrt = 2*abs(l1-l0)
k1 = len(m1.params)
k0 = len(m0.params)
d_f = k1-k0
print("Likelihood ratio test %.3f (p = %.4f) " % (lrt, 1-stats.chi2.cdf(lrt,d_f)) )
Likelihood ratio test 13.479 (p = 0.0012)
Adjusted Odds Ratio of Hyponatremia according to weight change using 0 kg as referent¶
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 446 entries, 18 to 487 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 na 446 non-null int16 1 nas135 446 non-null int8 2 wtdiff 446 non-null float64 3 runtime 446 non-null float64 4 bmi 446 non-null float64 5 bmi2 446 non-null float64 6 bmis1 446 non-null float64 7 bmis2 446 non-null float64 8 fits 446 non-null float64 dtypes: float64(7), int16(1), int8(1) memory usage: 29.2 KB
Final logistic regression model¶
$\texttt{logit}\left(P(\texttt{nas135}|\texttt{wtdiff}, \texttt{runtime}, \texttt{bmi})\right) = \beta_0 + \beta_1\texttt{wtdiff} + \beta_2\texttt{runtime} + \beta_3\texttt{bmi} + \beta_4\texttt{bmi}^2 $
model = smf.logit('nas135 ~ wtdiff + runtime + bmi + bmi2', data=df).fit()
print(model.summary())
Optimization terminated successfully.
Current function value: 0.282840
Iterations 8
Logit Regression Results
==============================================================================
Dep. Variable: nas135 No. Observations: 446
Model: Logit Df Residuals: 441
Method: MLE Df Model: 4
Date: Thu, 02 Oct 2025 Pseudo R-squ.: 0.2515
Time: 07:55:03 Log-Likelihood: -126.15
converged: True LL-Null: -168.53
Covariance Type: nonrobust LLR p-value: 1.708e-17
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 21.3638 7.482 2.855 0.004 6.700 36.028
wtdiff 0.7329 0.123 5.953 0.000 0.492 0.974
runtime 0.0161 0.004 3.687 0.000 0.008 0.025
bmi -2.2319 0.622 -3.586 0.000 -3.452 -1.012
bmi2 0.0449 0.013 3.472 0.001 0.020 0.070
==============================================================================
Table of adjusted odds ratios¶
b = model.params
x = (-4, -3, -2, -1, 0, 1, 2, 3)
x_ref = 0
or_wtdiff = []
for i in range(8):
or_wtdiff.append(round(np.exp(b['wtdiff']*(x[i]-x_ref)), 2))
or_wtdiff
[0.05, 0.11, 0.23, 0.48, 1.0, 2.08, 4.33, 9.01]
Looping over a grid of predictor values¶
# Loop to tabulate adjusted odds ratios for a grid of values
for i in range(8):
print(x[i], round(np.exp(b['wtdiff']*(x[i]-x_ref)), 2))
-4 0.05 -3 0.11 -2 0.23 -1 0.48 0 1.0 1 2.08 2 4.33 3 9.01
V = model.cov_params()
V
| Intercept | wtdiff | runtime | bmi | bmi2 | |
|---|---|---|---|---|---|
| Intercept | 55.975540 | 0.158859 | -0.005729 | -4.614272 | 0.095350 |
| wtdiff | 0.158859 | 0.015157 | 0.000023 | -0.014786 | 0.000313 |
| runtime | -0.005729 | 0.000023 | 0.000019 | 0.000177 | -0.000006 |
| bmi | -4.614272 | -0.014786 | 0.000177 | 0.387397 | -0.008019 |
| bmi2 | 0.095350 | 0.000313 | -0.000006 | -0.008019 | 0.000167 |
b.T
Intercept 21.363804 wtdiff 0.732879 runtime 0.016116 bmi -2.231851 bmi2 0.044906 dtype: float64
b['wtdiff']
0.7328793472053252
se = np.sqrt(np.diag(V))
b_wtdiff = b['wtdiff']
se_b_wtdiff = se[1]
print(b_wtdiff, se_b_wtdiff)
0.7328793472053252 0.12311509580038163
print("Var %.4f" % 1.23114808e-01)
Var 0.1231
Customize the output of contrasts¶
for i in range(8):
print("wtdiff = ", x[i])
print("OR = %.2f CI = %.2f to %.2f " % (np.exp(b_wtdiff*(x[i]-0)), np.exp(b_wtdiff*(x[i]-0)-1.96*se_b_wtdiff*(x[i]-0)), np.exp(b_wtdiff*(x[i]-0)+1.96*se_b_wtdiff*(x[i]-0)))
)
wtdiff = -4 OR = 0.05 CI = 0.14 to 0.02 wtdiff = -3 OR = 0.11 CI = 0.23 to 0.05 wtdiff = -2 OR = 0.23 CI = 0.37 to 0.14 wtdiff = -1 OR = 0.48 CI = 0.61 to 0.38 wtdiff = 0 OR = 1.00 CI = 1.00 to 1.00 wtdiff = 1 OR = 2.08 CI = 1.63 to 2.65 wtdiff = 2 OR = 4.33 CI = 2.67 to 7.02 wtdiff = 3 OR = 9.01 CI = 4.37 to 18.59
Estimated odds ratio of hyponatremia conferred by 1 unit increase in weight change¶
$ OR_{wtdiff} = \exp(\hat \beta_1\times1) $
# log odds ratio and 95% CI
lincom(np.array([0,1,0,0,0]),b,V)
(0.7328793472053252, 0.4915781934833787, 0.9741805009272717)
# Using the equivariance property of ML estimate we can easily obtain OR (95% CI)
np.around(np.exp(lincom(np.array([0,1,0,0,0]),b,V)), 1)
array([2.1, 1.6, 2.6])
Estimated odds ratio of hyponatremia conferred by 2 unit increase in weight change¶
$ OR_{wtdiff} = \exp(\hat \beta_1\times 2) $
np.exp(lincom(np.array([0,2,0,0,0]),b,V))
array([4.33082777, 2.67287958, 7.01717701])
Estimated odds ratio of hyponatremia conferred by 30 min increase in race duration¶
$ OR_{runtime} = \exp(\hat \beta_2\times 30) $
np.round(np.exp(lincom(np.array([0,0,30,0,0]),b,V)),1)
array([1.6, 1.3, 2.1])
List comprehensions to produce tables of estimates¶
or_wtdiff = np.array([np.exp(lincom(np.array([0,(x-0),0,0,0]),b,V)) for x in np.arange(-4,3,1)] )
or_wtdiff
array([[0.05331608, 0.02030837, 0.13997201],
[0.11095417, 0.05379679, 0.22883946],
[0.23090274, 0.14250745, 0.37412834],
[0.4805234 , 0.37750159, 0.61166031],
[1. , 1. , 1. ],
[2.0810641 , 1.63489436, 2.64899547],
[4.33082777, 2.67287958, 7.01717701]])
or_runtime = np.array( [np.exp(lincom(np.array([0,0,(x-210),0,0]),b,V)) for x in np.arange(180,330,30)] )
or_runtime
array([[0.61663599, 0.47689288, 0.79732779],
[1. , 1. , 1. ],
[1.62170229, 1.25418933, 2.09690696],
[2.62991833, 1.57299087, 4.39701881],
[4.26494459, 1.97282836, 9.22013936]])
or_bmi = np.array([np.exp(lincom(np.array([0,0,0,(x-22.5),(x**2-506.25)]),b,V)) for x in np.arange(17.5,32.5,0.5)] )
or_bmi
array([[ 8.8295876 , 2.83287201, 27.52034575],
[ 6.41898241, 2.42612167, 16.98321052],
[ 4.77246888, 2.09990604, 10.84641826],
[ 3.62886806, 1.83720915, 7.16776497],
[ 2.82195709, 1.62508584, 4.90032072],
[ 2.24429937, 1.45366227, 3.46495865],
[ 1.82541797, 1.3154089 , 2.53316726],
[ 1.51843082, 1.20461365, 1.91400135],
[ 1.29175114, 1.11700515, 1.49383466],
[ 1.12386423, 1.04949246, 1.20350633],
[ 1. , 1. , 1. ],
[ 0.90999143, 0.8560019 , 0.96738618],
[ 0.84688758, 0.75382167, 0.95144329],
[ 0.8060563 , 0.68178057, 0.95298514],
[ 0.78461416, 0.63202642, 0.97404058],
[ 0.78108457, 0.59919431, 1.01818909],
[ 0.79522703, 0.57958228, 1.09110656],
[ 0.82800958, 0.57064619, 1.20144473],
[ 0.8817201 , 0.57069133, 1.36226065],
[ 0.96023444, 0.57868102, 1.59336515],
[ 1.06948566, 0.59411267, 1.92522332],
[ 1.21821462, 0.61693521, 2.40551494],
[ 1.41913531, 0.64749775, 3.11035059],
[ 1.69073271, 0.6865272 , 4.16382207],
[ 2.06004762, 0.73513502, 5.77281192],
[ 2.56702867, 0.79485443, 8.29036854],
[ 3.27141266, 0.8677102 , 12.33377327],
[ 4.26374376, 0.95632541, 19.00975401],
[ 5.68326643, 1.06407279, 30.35461268],
[ 7.74740197, 1.19528217, 50.21595654]])
Use lincom to make inference on predicted probabilities¶
$\texttt{logit}\left(P(\texttt{nas135}|\texttt{wtdiff},\texttt{runtime}, \texttt{bmi}\right) = \beta_0 + \beta_1\texttt{wtdiff} + \beta_2\texttt{runtime}+ \beta_3\texttt{bmi}+\beta_4\texttt{bmi}^2$
$\texttt{logit}\left(P(\texttt{nas135}|\texttt{wtdiff}=-1,\texttt{runtime}=210, \texttt{bmi}=22.5\right) = \beta_0 + \beta_1(-1) + \beta_2(210)+ \beta_3(22.5)+\beta_4(506.25)$
lincom(np.array([1,-1,210,22.5, 22.5**2]),b,V)
(-3.4678567099077813, -4.102390654917354, -2.8333227648982087)
def invlogit(xb):
return np.exp(xb) / (1 + np.exp(xb))
invlogit(lincom(np.array([1,-1,210,22.5, 22.5**2]),b,V))
array([0.03024077, 0.01626421, 0.05554981])
The probability of hyponatremia for those runners who lost 1 kg, completed the marathon in 3:30, and had a BMI equal to 22.5 kg/m^2 is estimated to be 3% (95% CI = 1.6 to 5.6%).
Plot statistical inference based on linear combinations¶
Graph partial predictions¶
The idea is to use the lincom function defined above to extract adjusted odds ratios.
xval = np.arange(-4,4,1)
ors = np.array([np.exp(lincom(np.array([0,(x-0),0,0,0]),b,V)) for x in xval] )
ors.shape
(8, 3)
ors
array([[ 0.05331608, 0.02030837, 0.13997201],
[ 0.11095417, 0.05379679, 0.22883946],
[ 0.23090274, 0.14250745, 0.37412834],
[ 0.4805234 , 0.37750159, 0.61166031],
[ 1. , 1. , 1. ],
[ 2.0810641 , 1.63489436, 2.64899547],
[ 4.33082777, 2.67287958, 7.01717701],
[ 9.01273017, 4.36987577, 18.58847013]])
Figure 2 - Panel A) Weight change¶
# graph the adjusted odds ratios for weight change
xval = np.arange(-4,4,1)
ors = np.array([np.exp(lincom(np.array([0,(x-0),0,0,0]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "wtdiff"]
sample0 = df.loc[df["nas135"] == 0, "wtdiff"]
fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=1)
ax.plot(xval, ors[:,1], color='blue', alpha=0.2, linewidth=0)
ax.plot(xval, ors[:,2], color='blue', alpha=0.2, linewidth=0)
ax.fill_between(xval, ors[:,1], ors[:,2], color='blue', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%4.3f'))
ax.set_yticks([0.125, .250, .5, 1 , 2, 4, 8])
ax.text(-3, 4, r'$p$<0.001', fontsize=14)
ax.plot(sample0, [.07]*len(sample0), '|', color='k')
ax.plot(sample1, [12]*len(sample1), '|', color='k')
ax.set_xlim(-4,3.5) # most of the data
ax.set_ylim(.05,13) # most of the data
ax.vlines(0, .05, 1, color='black', linewidth=.5)
plt.xlabel('Weight Change (kg)')
plt.ylabel('Adjusted Odds Ratio')
plt.savefig("wtdiff_or.pdf")
plt.show()
Figure 2 - Panel B) Race duration¶
# graph the adjusted odds ratios for runtime
xval = np.arange(180,360,30)
ors = np.array([np.exp(lincom(np.array([0,0,(x-210),0,0]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "runtime"]
sample0 = df.loc[df["nas135"] == 0, "runtime"]
fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=4)
ax.plot(xval, ors[:,1], color='blue', alpha=0.2, linewidth=0)
ax.plot(xval, ors[:,2], color='blue', alpha=0.2, linewidth=0)
ax.fill_between(xval, ors[:,1], ors[:,2], color='blue', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%4.3f'))
ax.vlines(210, .05, 1, color='black', linewidth=.5)
ax.set_yticks([.5, 1 , 2, 4, 8])
ax.set_xticks([180, 210, 240, 270, 300, 330] )
ax.set_xticklabels(['3:00', '3:30', '4:00', '4:30', '5:00', '5:30'])
ax.plot(sample0, [.5]*len(sample0), '|', color='k')
ax.plot(sample1, [12]*len(sample1), '|', color='k')
ax.set_xlim(180, 360) # most of the data
ax.set_ylim(.4,13) # most of the data
ax.text(200, 9, r'$p$<0.001', fontsize=14)
plt.xlabel('Race Duration (min)')
plt.ylabel('Odds Ratio')
plt.savefig("runtime_or.pdf")
plt.show()
Figure 2 - Panel C) Body Mass Index¶
# graph the adjusted odds ratios for BMI
xval = np.arange(17.5,32.5,0.5)
ors = np.array([np.exp(lincom(np.array([0,0,0,(x-22.5),(x**2-506.25)]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "bmi"]
sample0 = df.loc[df["nas135"] == 0, "bmi"]
fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_ylim(.4, 12) # most of the data
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=4)
ax.plot(xval, ors[:,1], color='grey', alpha=0.2, linewidth=0)
ax.plot(xval, ors[:,2], color='grey', alpha=0.2, linewidth=0)
ax.fill_between(xval, ors[:,1], ors[:,2], color='blue', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%3.2f'))
ax.set_yticks([.5, 1 , 2, 4, 8])
ax.set_xticks(np.arange(17.5,32.5, 2.5))
#ax.set_title('Non-Linear Modeling of BMI vs. Disease Risk', fontsize=18, pad=20)
ax.vlines(22.5, .05, 1, color='black', linewidth=.5)
plt.xlabel(r'Body Mass Index (kg/$m^2$)',fontsize=16)
plt.ylabel('Adjusted Odds Ratio', fontsize=16)
ax.plot(sample0, [.5]*len(sample0), '|', color='k')
ax.plot(sample1, [11]*len(sample1), '|', color='k')
ax.text(20, 7, r'$p$<0.001', fontsize=14)
plt.savefig("bmi_or.pdf")
plt.show()
# graph the adjusted odds ratios for BMI with a histogram
xval = np.arange(17.5,32.5,0.5)
ors = np.array([np.exp(lincom(np.array([0,0,0,(x-22.5),(x**2-506.25)]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "bmi"]
sample0 = df.loc[df["nas135"] == 0, "bmi"]
bmi = np.array(df[['bmi']])
fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_ylim(.4, 12) # most of the data
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=4)
ax.plot(xval, ors[:,1], color='blue', alpha=0.2, linewidth=0)
ax.plot(xval, ors[:,2], color='blue', alpha=0.2, linewidth=0)
ax.fill_between(xval, ors[:,1], ors[:,2], color='b', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%3.2f'))
ax.set_yticks([.5, 1 , 2, 4, 8])
ax.set_xticks(np.arange(17.5,32.5, 2.5))
ax.set_ylim(.4, 12) # most of the data
plt.xlabel(r'Body Mass Index (kg/$m^2$)',fontsize=16)
plt.ylabel('Adjusted Odds Ratio', fontsize=16)
#ax.set_title('Non-Linear Modeling of BMI vs. Disease Risk', fontsize=18, pad=20)
ax.plot(sample0, [.5]*len(sample0), '|', color='k')
ax.plot(sample1, [11]*len(sample1), '|', color='k')
ax.text(20, 7, r'$p$<0.001', fontsize=14)
ax.vlines(22.5, .05, 1, color='black', linewidth=.5)
ax2=ax.twinx()
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.hist(bmi, 15, color='red', alpha=0.1, density=True, rwidth=.9)
ax2.set_yticks([])
plt.savefig("bmi_or2.pdf")
plt.show()