Course 5316: Fundamentals of using Python in Health Related Research ¶

Peter Alping, PhD ¶

Orsini Nicola, PhD ¶

Day 4 - Linear and logistic regression models ¶

What are you going to learn?

  • specify a linear and logistic regression model
  • conduct a univariate and multivariate test of hypothesis
  • derive predicted outcomes and their changes for any linear combination
  • present the estimated model in a graphical form
  • model categorical predictors with dummy variables
  • model quantitative predictors with splines

Package¶

statsmodel¶

statsmodel is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

https://www.statsmodels.org

In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

from matplotlib.ticker import FormatStrFormatter

Motivating example: observational cohort study¶

head_NEJM.png

In [2]:
df = pd.read_stata('http://www.stats4life.se/data/marathon.dta', convert_categoricals=False)

Conditional mean¶

Consider a linear regression model to conduct statistical inference on the mean outcome conditionally on the values of the predictor.

$E(Y|x) = \beta_0 + \beta_1x $

For example, let's conduct statistical inference on the mean serum sodium concentration depending on weight change.

$E(\texttt{na}|\texttt{wtdiff}) = \beta_0 + \beta_1\texttt{wtdiff}$

In [3]:
# A scatter plot of individual outcomes can be a good starting point 

plt.scatter(df.wtdiff, df.na, c='r', alpha=.2)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.top'] = False
plt.show()
No description has been provided for this image
In [4]:
# Estimates of the model parameter using Ordinary Least Squares (OLS) method 

model = smf.ols('na ~ wtdiff', data=df).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     na   R-squared:                       0.169
Model:                            OLS   Adj. R-squared:                  0.167
Method:                 Least Squares   F-statistic:                     91.86
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           6.21e-20
Time:                        07:55:02   Log-Likelihood:                -1317.2
No. Observations:                 455   AIC:                             2638.
Df Residuals:                     453   BIC:                             2647.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    139.5534      0.223    624.476      0.000     139.114     139.993
wtdiff        -1.2183      0.127     -9.584      0.000      -1.468      -0.968
==============================================================================
Omnibus:                       65.662   Durbin-Watson:                   1.284
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              327.577
Skew:                          -0.491   Prob(JB):                     7.37e-72
Kurtosis:                       7.039   Cond. No.                         2.04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Store the predicted mean outcome¶

In [5]:
# Add a column in the dataframe containing the predicted outcomes 

df['fit'] = model.fittedvalues
In [6]:
# graph individual data points the the predicted mean outcome

plt.scatter(df.wtdiff, df.na, c='r', alpha=.2)
plt.plot(df.wtdiff, df.fit, c='b', alpha=.3 , linewidth=5)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.show()
No description has been provided for this image
In [7]:
# comparison of the predicted mean outcome vs observed outcome 

subset = np.column_stack((df['wtdiff'], df['na'], df['fit']))
In [8]:
subset[10:30]
Out[8]:
array([[         nan, 141.        ,          nan],
       [         nan, 143.        ,          nan],
       [         nan, 141.        ,          nan],
       [         nan, 141.        ,          nan],
       [         nan, 140.        ,          nan],
       [         nan, 142.        ,          nan],
       [         nan, 143.        ,          nan],
       [  0.68181818, 149.        , 138.72278443],
       [ -0.90909091, 142.        , 140.66096356],
       [  1.13636364, 141.        , 138.16901896],
       [ -0.68181818, 138.        , 140.38408083],
       [  0.22727273, 140.        , 139.27654989],
       [  0.        , 142.        , 139.55343263],
       [ -0.59659091, 149.        , 140.2802498 ],
       [ -0.22727273, 140.        , 139.83031537],
       [ -1.81818182, 147.        , 141.76849449],
       [  0.        , 136.        , 139.55343263],
       [  0.45454546, 139.        , 138.99966716],
       [  0.        , 143.        , 139.55343263],
       [  0.22727273, 142.        , 139.27654989]])

Univariate Wald-type test¶

Consider testing an hypothesis involving only one regression coefficient

$H_0: \beta_1 = 0 $

A Wald-type statistic is computed from the estimated model and compared with an F-distribution. The result of this test is shown typically in the output of the estimated model.

In [9]:
hypotheses = 'wtdiff = 0'
t_test = model.t_test(hypotheses)
t_test
Out[9]:
<class 'statsmodels.stats.contrast.ContrastResults'>
                             Test for Constraints                             
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0            -1.2183      0.127     -9.584      0.000      -1.468      -0.968
==============================================================================
In [10]:
# you can test any other value of the regression coefficient b1 

hypotheses = 'wtdiff = -1.3'
t_test = model.t_test(hypotheses)
t_test
Out[10]:
<class 'statsmodels.stats.contrast.ContrastResults'>
                             Test for Constraints                             
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0            -1.2183      0.127      0.643      0.521      -1.468      -0.968
==============================================================================

Access estimated parameters¶

In [11]:
# access estimated regression coefficients

model.params
Out[11]:
Intercept    139.553433
wtdiff        -1.218284
dtype: float64
In [12]:
# access estimated var/covariances of the regression coefficients

model.cov_params()
Out[12]:
Intercept wtdiff
Intercept 0.04994 0.011140
wtdiff 0.01114 0.016157
In [13]:
# access estimated std errors of the regression coefficients

np.sqrt(np.diag(model.cov_params()))
Out[13]:
array([0.22347274, 0.12711121])

Inference for any linear combinations of parameters and data¶

Suppose you want to conduct statistical inference on the predicted outcome (or change in predicted outcome) for a certain linear combination of the parameters and data $X\boldsymbol{\beta}$.

Let's consider a large-sample Wald-type test with reference to a standard normal distribution.

The $100(1-\alpha)$% confidence intervals can be obtained as follows

$ \begin{equation} \begin{gathered} X \boldsymbol{\hat \beta} \pm z_{\alpha / 2} \mathrm{diag} [ XV(\boldsymbol{\hat \beta})X' ]^{1/2} \end{gathered} \end{equation} $

Considering that Type I error is $\alpha=0.05$ in a two-sided test, we have that

$z_{0.05/ 2} = Q_{.025}(Z)$

So, we could rewrite the above in the following way to highlight the fact that the degree of confidence is related to the specific quantiles of the sampling distribution being used.

$ \begin{equation} \begin{gathered} Q_p(X \boldsymbol{\hat \beta}) = X \boldsymbol{\hat \beta} + Q_p(Z) \sqrt{ \mathrm{diag}(XV(\boldsymbol{\hat \beta})X') } \end{gathered} \end{equation} $

where $V(\boldsymbol{\hat \beta})$ is the estimated variance/covariance matrix of the estimated of $\boldsymbol{\hat \beta}$.

In Stata this is done with lincom command

In R this is done with ci.lin() function in Epi library

In [14]:
def lincom(X,b,V):
    Var = np.dot(X, np.dot(V,X.T))
    if Var.ndim == 0:
        Se = np.sqrt(Var)
    else: 
        Se = np.sqrt(np.diag(Var))
    return (np.matmul(X,b), np.matmul(X,b)+stats.norm.ppf(.025)*Se, np.matmul(X,b)+stats.norm.ppf(.975)*Se)
In [15]:
b = model.params
V = model.cov_params()

Example 1: Confidence interval for the mean outcome¶

Derive a 95% confidence interval of the mean serum sodium concentration for those runners who increased 3 kg.

$E(\texttt{na}|\texttt{wtdiff}=3) = \beta_0 + \beta_1 3$

In [16]:
lincom(np.array([1, 3]),b,V)
Out[16]:
(135.898580556215, 134.89498068034243, 136.90218043208756)
In [17]:
def lincom2(X,b,V):
    Var = np.dot(X, np.dot(V,X.T))
    if Var.ndim == 0:
        Se = np.sqrt(Var)
    else: 
        Se = np.sqrt(np.diag(Var))
    z = (np.matmul(X,b)-0)/Se
    p = stats.norm.cdf( -abs(z))*2
    return ( np.matmul(X,b), Se, z,  p, np.matmul(X,b)+stats.norm.ppf(.025)*Se, np.matmul(X,b)+stats.norm.ppf(.975)*Se)
In [18]:
results = lincom2(np.array([0, 1]),b,V)
In [19]:
np.round(np.transpose(results),4)
Out[19]:
array([-1.2183,  0.1271, -9.5844,  0.    , -1.4674, -0.9692])
In [20]:
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     na   R-squared:                       0.169
Model:                            OLS   Adj. R-squared:                  0.167
Method:                 Least Squares   F-statistic:                     91.86
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           6.21e-20
Time:                        07:55:02   Log-Likelihood:                -1317.2
No. Observations:                 455   AIC:                             2638.
Df Residuals:                     453   BIC:                             2647.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    139.5534      0.223    624.476      0.000     139.114     139.993
wtdiff        -1.2183      0.127     -9.584      0.000      -1.468      -0.968
==============================================================================
Omnibus:                       65.662   Durbin-Watson:                   1.284
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              327.577
Skew:                          -0.491   Prob(JB):                     7.37e-72
Kurtosis:                       7.039   Cond. No.                         2.04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Example 2: Confidence interval for the difference in mean outcome¶

Derive a 95% confidence interval of the mean outcome difference comparing those who increased 3 kg vs those who decreased 1 kg.

$E(\texttt{na}|\texttt{wtdiff}=3) - E(\texttt{na}|\texttt{wtdiff}=-1) = \beta_1 (3- -1) = \beta_1 4$

In [21]:
lincom(np.array([0, 4]),b,V)
Out[21]:
(-4.873136097422644, -5.869669663999436, -3.876602530845852)

Linear and quadratic function for the quantitative predictors¶

Consider a linear regression model to conduct statistical inference on the mean serum sodium concentration as function of weight change (linear function), and bmi (quadratic function)

$E(\texttt{na}|\texttt{wtdiff}, \texttt{bmi}) = \beta_0 + \beta_1\texttt{wtdiff}+ \beta_2\texttt{bmi} + \beta_3\texttt{bmi}^2$

In [22]:
df['bmisq'] = df['bmi']**2
In [23]:
# Pearson correlation between bmi and bmi squared
valid = df[['bmi', 'bmisq']].dropna()
r, pval = stats.pearsonr(valid['bmi'], valid['bmisq'])
print("Correlation:", r, "p-value:", pval)
Correlation: 0.9965742291212139 p-value: 0.0
In [24]:
model = smf.ols('na ~ wtdiff + bmi + bmisq' , data=df).fit()
In [25]:
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     na   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.180
Method:                 Least Squares   F-statistic:                     34.03
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           7.91e-20
Time:                        07:55:02   Log-Likelihood:                -1305.2
No. Observations:                 452   AIC:                             2618.
Df Residuals:                     448   BIC:                             2635.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    105.0147     11.218      9.361      0.000      82.968     127.061
wtdiff        -1.1770      0.127     -9.247      0.000      -1.427      -0.927
bmi            2.8707      0.940      3.053      0.002       1.023       4.719
bmisq         -0.0587      0.020     -3.006      0.003      -0.097      -0.020
==============================================================================
Omnibus:                       60.670   Durbin-Watson:                   1.308
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              330.172
Skew:                          -0.400   Prob(JB):                     2.01e-72
Kurtosis:                       7.110   Cond. No.                     3.03e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.03e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

F test for a joint hypothesis¶

Since body mass index was modelled with a quadratic function (degree-2 polynomial), then a test for the hypothesis of no adjusted association between body mass index and mean serum sodium concentration is obtained by testing the two regression coefficients equal to zero.

$H_0: \beta_2 = \beta_3 = 0 $

A Wald-type statistic is computed from the estimated model and compared with an F-distribution.

In [26]:
hypotheses = ['bmi = 0',  'bmisq = 0']
f_test = model.f_test(hypotheses)
In [27]:
f_test
Out[27]:
<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=4.767628913599036, p=0.00893659800206336, df_denom=448, df_num=2>

Conditional probability¶

Consider a logistic regression model to conduct statistical inference on the outcome probability conditionally on the values of the predictor.

$P(Y=1|x) = \texttt{invlogit}(\beta_0 + \beta_1x) $

The logit of the outcome probability is related to a linear combination of regression coefficients and variables

$\texttt{logit}\left(P(Y=1|x)\right) = \beta_0 + \beta_1x $

Binary predictor¶

Conduct statistical inference on the odds ratio of hyponatremia associated with female sex.

$\texttt{logit}\left(P(\texttt{nas135}|\texttt{female})\right) = \beta_0 + \beta_1\texttt{female} $

Estimates of the regression coefficients are obtained with the maximum likelihood method.

In [28]:
model = smf.logit('nas135 ~ female ', data=df).fit()
Optimization terminated successfully.
         Current function value: 0.360585
         Iterations 6
In [29]:
print(model.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 nas135   No. Observations:                  488
Model:                          Logit   Df Residuals:                      486
Method:                           MLE   Df Model:                            1
Date:                Thu, 02 Oct 2025   Pseudo R-squ.:                 0.05293
Time:                        07:55:02   Log-Likelihood:                -175.97
converged:                       True   LL-Null:                       -185.80
Covariance Type:            nonrobust   LLR p-value:                 9.204e-06
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.4749      0.208    -11.884      0.000      -2.883      -2.067
female         1.2260      0.280      4.386      0.000       0.678       1.774
==============================================================================

Confidence interval for the Odds Ratio¶

Thanks to the equivariance property of the maximum likelihood estimator

  • obtain estimates with confidence limits
  • exponentiate and round them
In [30]:
b = model.params
V = model.cov_params()
np.round(np.exp(lincom(np.array([0, 1]), b,  V)), 1)
Out[30]:
array([3.4, 2. , 5.9])
In [31]:
params = model.params
conf = model.conf_int()
conf['OR'] = params
conf.columns = ['LB', 'UB', 'OR']
print(np.around(np.exp(conf[1:]),1))
         LB   UB   OR
female  2.0  5.9  3.4

Categorization of continuous predictors¶

In [32]:
df['bmic'] = pd.cut(df['bmi'],bins=[0,20,25,40],labels=['<20','20-25','>25'])
df['runtimehc'] = pd.cut(df['runtime']/60, bins=[0,3.5,4,8] , labels=['<3:30','3:30-4:00','>4:00'])
In [33]:
df['runtimehc'].value_counts(normalize=True)
Out[33]:
runtimehc
<3:30        0.404612
3:30-4:00    0.308176
>4:00        0.287212
Name: proportion, dtype: float64
In [34]:
df['bmic'].value_counts(normalize=True)
Out[34]:
bmic
20-25    0.701075
>25      0.193548
<20      0.105376
Name: proportion, dtype: float64

Final multivariable logistic model with categorized predictors¶

Consider the estimates provided in Table 2 of the NEJM paper.

$\texttt{logit}(P(\texttt{nas135})=1|\texttt{wtdiff}, \texttt{runtime}, \texttt{bmi}) = \beta_0 + \beta_1I(\texttt{wtdiff}>0) + \beta_2 I(3.5<\texttt{runtime}<4) + \beta_3 I(\texttt{runtime}>4) + \beta_4 I(\texttt{bmi}<20) + \beta_5 I(\texttt{bmi}>25)$

Indicator (or dummy) variables are conveniently created with C() notation where treatment is indicating the chosen reference level.

In [35]:
model = smf.logit('nas135 ~ I(wtdiff>0) + C(runtimehc,Treatment(0))  + C(bmic,Treatment(1)) ', data=df).fit()
print(model.summary())
Optimization terminated successfully.
         Current function value: 0.311873
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 nas135   No. Observations:                  455
Model:                          Logit   Df Residuals:                      449
Method:                           MLE   Df Model:                            5
Date:                Thu, 02 Oct 2025   Pseudo R-squ.:                  0.1826
Time:                        07:55:02   Log-Likelihood:                -141.90
converged:                       True   LL-Null:                       -173.61
Covariance Type:            nonrobust   LLR p-value:                 2.395e-12
===========================================================================================================
                                              coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------
Intercept                                  -3.8727      0.436     -8.889      0.000      -4.727      -3.019
I(wtdiff > 0)[T.True]                       1.4578      0.310      4.698      0.000       0.850       2.066
C(runtimehc, Treatment(0))[T.3:30-4:00]     1.0549      0.475      2.222      0.026       0.124       1.985
C(runtimehc, Treatment(0))[T.>4:00]         1.8032      0.466      3.867      0.000       0.889       2.717
C(bmic, Treatment(1))[T.<20]                1.3910      0.410      3.390      0.001       0.587       2.195
C(bmic, Treatment(1))[T.>25]                0.0369      0.388      0.095      0.924      -0.723       0.797
===========================================================================================================
In [36]:
params = model.params
conf = model.conf_int()
conf['OR'] = params
conf.columns = ['LB', 'UB', 'OR']
print(np.around(np.exp(conf[1:]),1))
                                          LB    UB   OR
I(wtdiff > 0)[T.True]                    2.3   7.9  4.3
C(runtimehc, Treatment(0))[T.3:30-4:00]  1.1   7.3  2.9
C(runtimehc, Treatment(0))[T.>4:00]      2.4  15.1  6.1
C(bmic, Treatment(1))[T.<20]             1.8   9.0  4.0
C(bmic, Treatment(1))[T.>25]             0.5   2.2  1.0

All continuous predictors using linear and quadratic functions¶

This is the final logistic regression model where each continuous preditor is modelled as quantitative

You need to assume the functional relationship between each continuous predicted and the logit of the outcome probability

  • linear function for weight change
  • linear function for race duration
  • quadratic function for BMI

$\texttt{logit}(P(\texttt{nas135})=1|\texttt{wtdiff}, \texttt{runtime}, \texttt{bmi}) = \beta_0 + \beta_1 \texttt{wtdiff} + \beta_2 \texttt{runtime} + \beta_3 \texttt{bmi} + \beta_5 \texttt{bmi}^2$

In [37]:
df['bmisq'] = df['bmi']**2
model = smf.logit('nas135~wtdiff+runtime+bmi+bmisq', data=df).fit()
print(model.summary())
Optimization terminated successfully.
         Current function value: 0.282840
         Iterations 8
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 nas135   No. Observations:                  446
Model:                          Logit   Df Residuals:                      441
Method:                           MLE   Df Model:                            4
Date:                Thu, 02 Oct 2025   Pseudo R-squ.:                  0.2515
Time:                        07:55:02   Log-Likelihood:                -126.15
converged:                       True   LL-Null:                       -168.53
Covariance Type:            nonrobust   LLR p-value:                 1.708e-17
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     21.3638      7.482      2.855      0.004       6.700      36.028
wtdiff         0.7329      0.123      5.953      0.000       0.492       0.974
runtime        0.0161      0.004      3.687      0.000       0.008       0.025
bmi           -2.2319      0.622     -3.586      0.000      -3.452      -1.012
bmisq          0.0449      0.013      3.472      0.001       0.020       0.070
==============================================================================
In [38]:
b = model.params
V = model.cov_params
In [39]:
params = model.params
conf = model.conf_int()
conf['OR'] = params
conf.columns = ['LB', 'UB', 'OR']
print(np.around(np.exp(conf[1:3]),2))
           LB    UB    OR
wtdiff   1.63  2.65  2.08
runtime  1.01  1.02  1.02
In [40]:
model.nobs
Out[40]:
446

AIC and maximized log-likelihood¶

Akaike’s (1974) information criterion is defined as

$AIC = -2 lnL + 2k$

where $lnL$ is the maximized log-likelihood of the model and $k$ is the number of parameters estimated.

In [41]:
model.aic
Out[41]:
262.2931311594778
In [42]:
model.llf
Out[42]:
-126.1465655797389

Modelling quantitative predictors using splines¶

A spline of a predictor $x$ is just a transformation of it depending on the location of a knot $k$ and the power to which the distance between $x$ and $k$ is raised.

Degree-0 spline

$I(x>k)(x-k)^0$

Degree-1 spline

$I(x>k)(x-k)^1$

Degree-2 spline

$I(x>k)(x-k)^2$

Degree-3 spline

$I(x>k)(x-k)^3$

The binary indicator variable $I(x>k)$ takes value 1 only when $x>k$ and 0 otherwise.

Degree-1 Spline with linear regression¶

$ E(Y|x) = \beta_0 + \beta_1x + \beta_2 I(x>0)(x-0)^1$

if $x<0$

$ E(Y|x) = \beta_0 + \beta_1 x $

if $x \ge 0$

$ E(Y|x) = \beta_0 + (\beta_1 + \beta_2) x $

In [43]:
# Create the linear spline at the chosen knot 
df.loc[ df.wtdiff>0, 'wc_s1'] = (df.wtdiff-0)
df.loc[ df.wtdiff<=0, 'wc_s1'] = 0
In [44]:
df[ ['wtdiff', 'wc_s1']]
Out[44]:
wtdiff wc_s1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
483 -2.500000 0.000000
484 0.454545 0.454545
485 0.312500 0.312500
486 1.136364 1.136364
487 -0.909091 0.000000

488 rows × 2 columns

In [45]:
model = smf.ols('na~ wtdiff + wc_s1', data=df).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     na   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.182
Method:                 Least Squares   F-statistic:                     51.51
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           7.06e-21
Time:                        07:55:03   Log-Likelihood:                -1312.5
No. Observations:                 455   AIC:                             2631.
Df Residuals:                     452   BIC:                             2643.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    140.3155      0.332    422.036      0.000     139.662     140.969
wtdiff        -0.7795      0.190     -4.094      0.000      -1.154      -0.405
wc_s1         -1.3319      0.433     -3.073      0.002      -2.184      -0.480
==============================================================================
Omnibus:                       66.281   Durbin-Watson:                   1.324
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              326.527
Skew:                          -0.503   Prob(JB):                     1.25e-71
Kurtosis:                       7.026   Cond. No.                         4.85
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Visualize piece-wise linear trend¶

In [46]:
df['fit'] = model.fittedvalues
to_plot = df[ ['wtdiff', 'fit'] ].sort_values(['wtdiff'])
plt.plot(to_plot.wtdiff, to_plot.fit, c='b', alpha=.5, linewidth=3)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.axvline(0, c='g', alpha=.2, linestyle='dashed', linewidth=2)
plt.show()
No description has been provided for this image

Define a function to generate restricted cubic splines¶

mkspline.png

knots_mkspline.png

In [47]:
# Define a function to generate restricted cubic splines given number of knots

def rcs(x, nk):
    if (nk == 3):
        lk = [.1, .5, .9]
    if (nk == 4):
        lk = [.05, .35, .65, .95]
    if (nk == 5):
        lk = [.05 , .275, .50, .725, .95]
    if (nk == 6):
        lk = [.05 , .23, .41 , .59, .77 , .95]
    if (nk == 7):
        lk = [.25 , .1833 , .3417 , .50 , .6583 , .8167 , .975]
    knots =[]
    for i in range(nk):
        knots.append(np.quantile(x, lk[i] ) )
    u = np.zeros(shape=(len(x),nk) )
    i=0
    while i < nk:
        u[:,i] = np.array(x>knots[i])*(x-knots[i])
        i = i + 1
    spline = np.zeros(shape=(len(x),(nk-1)))
    spline[:,0] = x
    i = 0 
    while i < (nk-2):
        ip1 = i + 1
        spline[:,ip1] = ( np.power(u[:,i],3) - 1/(knots[nk-1] - knots[nk-2])* 
        ( np.power(u[:,(nk-2)],3)*(knots[nk-1] - knots[i])-np.power(u[:,(nk-1)],3)*(knots[nk-2] - knots[i]) )) / (knots[nk-1] - knots[0])**2 
        i=i+1
    return spline
In [48]:
df = pd.read_stata('http://www.stats4life.se/data/marathon.dta', convert_categoricals=False)
df = df[['na', 'nas135', 'wtdiff', 'runtime', 'bmi']].dropna()
df['bmi2'] = np.power(df['bmi'],2)
y = df['na'].to_numpy()        
In [49]:
# loop to compare AIC across spline models with increasing nr of knots

bmis = []
l_aic = []
nknots = [3,4,5,6,7]
for k in nknots:
    print("Nr of knots = " , k)
    X = rcs(df.bmi,k)
    X = sm.add_constant(X)
    smodel = sm.OLS(y,X).fit()
    print(smodel.summary())
    l_aic.append(smodel.aic)
    fits = smodel.fittedvalues
    # print(np.column_stack( [X[:,1], fits] ))
    plt.scatter(X[:,1], fits, s=2, c='r', alpha=.4)
    plt.title('Restricted Cubic Spline with {0} Knots'.format(k))
    plt.show()
    print('Spline model with %.0f knots AIC is %.0f' % (k, smodel.aic))

# lowest AIC is indicating a good compromise between fit and parameters
list(zip(nknots, l_aic))
Nr of knots =  3
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.038
Model:                            OLS   Adj. R-squared:                  0.034
Method:                 Least Squares   F-statistic:                     8.754
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           0.000187
Time:                        07:55:03   Log-Likelihood:                -1325.3
No. Observations:                 446   AIC:                             2657.
Df Residuals:                     443   BIC:                             2669.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        121.5136      4.555     26.674      0.000     112.561     130.467
x1             0.8859      0.213      4.161      0.000       0.467       1.304
x2            -1.1100      0.277     -4.001      0.000      -1.655      -0.565
==============================================================================
Omnibus:                       70.144   Durbin-Watson:                   1.181
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              316.526
Skew:                          -0.590   Prob(JB):                     1.85e-69
Kurtosis:                       6.955   Cond. No.                         473.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
No description has been provided for this image
Spline model with 3 knots AIC is 2657
Nr of knots =  4
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.039
Model:                            OLS   Adj. R-squared:                  0.032
Method:                 Least Squares   F-statistic:                     5.957
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           0.000548
Time:                        07:55:03   Log-Likelihood:                -1325.1
No. Observations:                 446   AIC:                             2658.
Df Residuals:                     442   BIC:                             2675.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        117.1229      7.674     15.263      0.000     102.042     132.204
x1             1.1091      0.377      2.946      0.003       0.369       1.849
x2            -2.3612      1.570     -1.504      0.133      -5.446       0.724
x3             4.6141      5.000      0.923      0.357      -5.213      14.442
==============================================================================
Omnibus:                       70.064   Durbin-Watson:                   1.183
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              316.204
Skew:                          -0.589   Prob(JB):                     2.17e-69
Kurtosis:                       6.953   Cond. No.                         928.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
No description has been provided for this image
Spline model with 4 knots AIC is 2658
Nr of knots =  5
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.039
Model:                            OLS   Adj. R-squared:                  0.031
Method:                 Least Squares   F-statistic:                     4.512
Date:                Thu, 02 Oct 2025   Prob (F-statistic):            0.00139
Time:                        07:55:03   Log-Likelihood:                -1325.0
No. Observations:                 446   AIC:                             2660.
Df Residuals:                     441   BIC:                             2680.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        113.8794     10.237     11.124      0.000      93.760     133.999
x1             1.2772      0.514      2.483      0.013       0.266       2.288
x2            -4.3913      4.339     -1.012      0.312     -12.919       4.136
x3            13.2230     21.448      0.617      0.538     -28.931      55.377
x4           -11.6890     26.956     -0.434      0.665     -64.667      41.289
==============================================================================
Omnibus:                       70.538   Durbin-Watson:                   1.186
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              322.180
Skew:                          -0.590   Prob(JB):                     1.10e-70
Kurtosis:                       6.993   Cond. No.                     3.65e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.65e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
No description has been provided for this image
Spline model with 5 knots AIC is 2660
Nr of knots =  6
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.039
Model:                            OLS   Adj. R-squared:                  0.028
Method:                 Least Squares   F-statistic:                     3.576
Date:                Thu, 02 Oct 2025   Prob (F-statistic):            0.00351
Time:                        07:55:03   Log-Likelihood:                -1325.0
No. Observations:                 446   AIC:                             2662.
Df Residuals:                     440   BIC:                             2687.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        115.5313     11.903      9.706      0.000      92.138     138.925
x1             1.1914      0.605      1.969      0.050       0.002       2.380
x2            -3.2623      7.866     -0.415      0.679     -18.721      12.196
x3             2.4271     45.041      0.054      0.957     -86.095      90.949
x4             8.1119     77.386      0.105      0.917    -143.981     160.205
x5           -10.9087     61.681     -0.177      0.860    -132.135     110.318
==============================================================================
Omnibus:                       70.170   Durbin-Watson:                   1.186
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              318.165
Skew:                          -0.589   Prob(JB):                     8.15e-70
Kurtosis:                       6.967   Cond. No.                     1.11e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.11e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
No description has been provided for this image
Spline model with 6 knots AIC is 2662
Nr of knots =  7
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.043
Model:                            OLS   Adj. R-squared:                  0.030
Method:                 Least Squares   F-statistic:                     3.301
Date:                Thu, 02 Oct 2025   Prob (F-statistic):            0.00345
Time:                        07:55:03   Log-Likelihood:                -1324.1
No. Observations:                 446   AIC:                             2662.
Df Residuals:                     439   BIC:                             2691.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        116.2837      9.241     12.584      0.000      98.122     134.445
x1             1.1481      0.461      2.491      0.013       0.242       2.054
x2          -121.1906    295.221     -0.411      0.682    -701.412     459.031
x3            11.8721     94.523      0.126      0.900    -173.901     197.645
x4           241.8662    337.937      0.716      0.475    -422.310     906.042
x5          -191.9364    189.090     -1.015      0.311    -563.570     179.697
x6            74.5418     66.352      1.123      0.262     -55.864     204.948
==============================================================================
Omnibus:                       74.168   Durbin-Watson:                   1.189
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              345.955
Skew:                          -0.623   Prob(JB):                     7.53e-76
Kurtosis:                       7.131   Cond. No.                     5.07e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.07e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
No description has been provided for this image
Spline model with 7 knots AIC is 2662
Out[49]:
[(3, 2656.5587353722885),
 (4, 2658.1682651667147),
 (5, 2659.955934150113),
 (6, 2662.0784684105115),
 (7, 2662.1652085479254)]
In [50]:
bmis = rcs(df.bmi,3)
In [51]:
np.size(bmis, 0), np.size(bmis,1)
Out[51]:
(446, 2)
In [52]:
# creating the design matrix 

bmis = rcs(df.bmi,3)
y = df['na'].to_numpy()        
X = bmis
X = sm.add_constant(X)
model1 = sm.OLS(y,X).fit()
print(model1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.038
Model:                            OLS   Adj. R-squared:                  0.034
Method:                 Least Squares   F-statistic:                     8.754
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           0.000187
Time:                        07:55:03   Log-Likelihood:                -1325.3
No. Observations:                 446   AIC:                             2657.
Df Residuals:                     443   BIC:                             2669.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        121.5136      4.555     26.674      0.000     112.561     130.467
x1             0.8859      0.213      4.161      0.000       0.467       1.304
x2            -1.1100      0.277     -4.001      0.000      -1.655      -0.565
==============================================================================
Omnibus:                       70.144   Durbin-Watson:                   1.181
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              316.526
Skew:                          -0.590   Prob(JB):                     1.85e-69
Kurtosis:                       6.955   Cond. No.                         473.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [53]:
df
Out[53]:
na nas135 wtdiff runtime bmi bmi2
18 142 0 -0.909091 195.0 19.503027 380.368077
19 141 0 1.136364 195.0 18.695052 349.504971
20 138 0 -0.681818 207.0 20.157869 406.339666
21 140 0 0.227273 234.0 17.063750 291.171557
22 142 0 0.000000 233.0 19.428191 377.454619
... ... ... ... ... ... ...
483 133 1 -2.500000 308.0 29.188370 851.960947
484 128 1 0.454545 250.0 31.387685 985.186749
485 133 1 0.312500 283.0 28.311327 801.531230
486 131 1 1.136364 308.0 31.224404 974.963415
487 135 1 -0.909091 276.0 28.292931 800.489939

446 rows × 6 columns

In [54]:
# add the cubic splines to the dataframe
df['bmis1'] = bmis[:,0]
df['bmis2'] = bmis[:,1]
In [55]:
# use the formula approach 

model2 = smf.ols('na ~ bmis1 + bmis2', data=df).fit()
print(model2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     na   R-squared:                       0.038
Model:                            OLS   Adj. R-squared:                  0.034
Method:                 Least Squares   F-statistic:                     8.754
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           0.000187
Time:                        07:55:03   Log-Likelihood:                -1325.3
No. Observations:                 446   AIC:                             2657.
Df Residuals:                     443   BIC:                             2669.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    121.5136      4.555     26.674      0.000     112.561     130.467
bmis1          0.8859      0.213      4.161      0.000       0.467       1.304
bmis2         -1.1100      0.277     -4.001      0.000      -1.655      -0.565
==============================================================================
Omnibus:                       70.144   Durbin-Watson:                   1.181
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              316.526
Skew:                          -0.590   Prob(JB):                     1.85e-69
Kurtosis:                       6.955   Cond. No.                         473.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Testing regression coefficients of the splines¶

$H_0: \beta_1 = \beta_2 = 0 $

In [56]:
hypotheses = ['bmis1 = 0', 'bmis2 = 0']
f_test = model2.f_test(hypotheses)
In [57]:
f_test
Out[57]:
<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=8.753689785498299, p=0.0001868614716924368, df_denom=443, df_num=2>

Graph the predicted mean from the cubic spline linear model¶

In [58]:
df['fits'] = model2.fittedvalues
In [59]:
# (x,y) pairs to be plotted need to be sorted first 

to_plot = df[ ['bmis1', 'fits'] ].sort_values(['bmis1', 'fits'])
In [60]:
#plt.scatter(df.bmi, df.na, c='r', alpha=.2, s=3)
plt.plot(to_plot['bmis1'],to_plot['fits'], c='b', linewidth=2, alpha=.5)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.show()
No description has been provided for this image

Multivariable logistic regression model¶

In [61]:
model = smf.logit('nas135 ~ wtdiff + runtime', data=df).fit()
print(model.summary())
Optimization terminated successfully.
         Current function value: 0.298177
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 nas135   No. Observations:                  446
Model:                          Logit   Df Residuals:                      443
Method:                           MLE   Df Model:                            2
Date:                Thu, 02 Oct 2025   Pseudo R-squ.:                  0.2109
Time:                        07:55:03   Log-Likelihood:                -132.99
converged:                       True   LL-Null:                       -168.53
Covariance Type:            nonrobust   LLR p-value:                 3.681e-16
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -5.7753      0.958     -6.029      0.000      -7.653      -3.898
wtdiff         0.6980      0.115      6.049      0.000       0.472       0.924
runtime        0.0162      0.004      4.268      0.000       0.009       0.024
==============================================================================

Wald-type test for one parameter¶

Recall that under $H_0$, the $Z^2$ follows a $\chi^2(1)$

In [62]:
hypothesis='wtdiff=0'
wald = model.wald_test(hypothesis, scalar=True)
print("Wald test statistic:", wald.statistic)
print("p-value:", wald.pvalue)
Wald test statistic: 36.5913183097515
p-value: 1.4567908936820186e-09

Cubic spline logistic regression model¶

In [63]:
model = smf.logit('nas135 ~ bmis1 + bmis2 + wtdiff', data=df).fit()
print(model.summary())
Optimization terminated successfully.
         Current function value: 0.298471
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 nas135   No. Observations:                  446
Model:                          Logit   Df Residuals:                      442
Method:                           MLE   Df Model:                            3
Date:                Thu, 02 Oct 2025   Pseudo R-squ.:                  0.2101
Time:                        07:55:03   Log-Likelihood:                -133.12
converged:                       True   LL-Null:                       -168.53
Covariance Type:            nonrobust   LLR p-value:                 2.857e-15
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      8.5283      2.673      3.191      0.001       3.289      13.767
bmis1         -0.5033      0.128     -3.929      0.000      -0.754      -0.252
bmis2          0.7309      0.170      4.312      0.000       0.399       1.063
wtdiff         0.7518      0.118      6.396      0.000       0.521       0.982
==============================================================================

Wald-type test of two parameters¶

In [64]:
hypothesis = '(bmis1=0, bmis2=0)'
wald = model.wald_test(hypothesis, scalar=True)
print("Wald test statistic:", wald.statistic)
print("p-value:", wald.pvalue)
Wald test statistic: 18.638150365520445
p-value: 8.969682522687936e-05

Likelihood ratio test¶

An alternative to the Wald-type test, based on the estimated regression coefficients and their estimated variance/covariance, is the likelihood ratio test

$ \lambda = 2|\ell_R- \ell_U|$

where $\ell_R$ is the log-likelihood of the restricted model and $\ell_R$ is the log-likelihood of the unrestricted model. It is the absolute distance that matters.

The two models need to be nested, that is, estimated on the same number of subjects but different number of parameters in the model.

The two models are nested if you can obtain the restricted model from the full model by putting constraints on the parameters you want to test.

Under the null hypothesis that the parameters being tested are equal to zero (assuming the restricted model is true), the likelihood ratio test follows a $\chi^2$ distribution with degrees of freedom equal to the difference in the number of parameters $(k_1 - k_0)$ of the two models being compared.

Under the hypothesis that the $(k_1 - k_0)$ parameters are all equal to zero, we have that $ \lambda \sim \chi^2(k_1 - k_0) $

A computed likelihood ratio statistic exceeding a top quantile, saying 0.95, of the null sampling distribution is considered a strong indication of incompatibility between this sample of data and the hypothesis.

Alternatively, one can compute the p-value from the cumulative distribution function of the $\chi^2$.

In [65]:
# Model 1  

m1 = smf.logit('nas135 ~ wtdiff + runtime + bmis1 + bmis2 ', data=df).fit()
l1 = m1.llf

# Model 0 - restricted model without spline terms for bmi

m0 = smf.logit('nas135 ~ wtdiff + runtime', data=df).fit()
l0 = m0.llf
Optimization terminated successfully.
         Current function value: 0.283066
         Iterations 8
Optimization terminated successfully.
         Current function value: 0.298177
         Iterations 7
In [66]:
lrt = 2*abs(l1-l0)
k1 = len(m1.params)
k0 = len(m0.params)
d_f = k1-k0
print("Likelihood ratio test %.3f (p = %.4f) " % (lrt, 1-stats.chi2.cdf(lrt,d_f)) )
Likelihood ratio test 13.479 (p = 0.0012) 

Adjusted Odds Ratio of Hyponatremia according to weight change using 0 kg as referent¶

In [67]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 446 entries, 18 to 487
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   na       446 non-null    int16  
 1   nas135   446 non-null    int8   
 2   wtdiff   446 non-null    float64
 3   runtime  446 non-null    float64
 4   bmi      446 non-null    float64
 5   bmi2     446 non-null    float64
 6   bmis1    446 non-null    float64
 7   bmis2    446 non-null    float64
 8   fits     446 non-null    float64
dtypes: float64(7), int16(1), int8(1)
memory usage: 29.2 KB

Final logistic regression model¶

$\texttt{logit}\left(P(\texttt{nas135}|\texttt{wtdiff}, \texttt{runtime}, \texttt{bmi})\right) = \beta_0 + \beta_1\texttt{wtdiff} + \beta_2\texttt{runtime} + \beta_3\texttt{bmi} + \beta_4\texttt{bmi}^2 $

In [68]:
model = smf.logit('nas135 ~ wtdiff + runtime + bmi + bmi2', data=df).fit()
print(model.summary())
Optimization terminated successfully.
         Current function value: 0.282840
         Iterations 8
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 nas135   No. Observations:                  446
Model:                          Logit   Df Residuals:                      441
Method:                           MLE   Df Model:                            4
Date:                Thu, 02 Oct 2025   Pseudo R-squ.:                  0.2515
Time:                        07:55:03   Log-Likelihood:                -126.15
converged:                       True   LL-Null:                       -168.53
Covariance Type:            nonrobust   LLR p-value:                 1.708e-17
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     21.3638      7.482      2.855      0.004       6.700      36.028
wtdiff         0.7329      0.123      5.953      0.000       0.492       0.974
runtime        0.0161      0.004      3.687      0.000       0.008       0.025
bmi           -2.2319      0.622     -3.586      0.000      -3.452      -1.012
bmi2           0.0449      0.013      3.472      0.001       0.020       0.070
==============================================================================

Table of adjusted odds ratios¶

In [69]:
b = model.params
x = (-4, -3, -2, -1, 0, 1, 2, 3)
x_ref = 0
or_wtdiff = []
for i in range(8):
    or_wtdiff.append(round(np.exp(b['wtdiff']*(x[i]-x_ref)), 2))
In [70]:
or_wtdiff
Out[70]:
[0.05, 0.11, 0.23, 0.48, 1.0, 2.08, 4.33, 9.01]

Looping over a grid of predictor values¶

In [71]:
# Loop to tabulate adjusted odds ratios for a grid of values 
for i in range(8):
    print(x[i], round(np.exp(b['wtdiff']*(x[i]-x_ref)), 2))
-4 0.05
-3 0.11
-2 0.23
-1 0.48
0 1.0
1 2.08
2 4.33
3 9.01
In [72]:
V = model.cov_params()
In [73]:
V
Out[73]:
Intercept wtdiff runtime bmi bmi2
Intercept 55.975540 0.158859 -0.005729 -4.614272 0.095350
wtdiff 0.158859 0.015157 0.000023 -0.014786 0.000313
runtime -0.005729 0.000023 0.000019 0.000177 -0.000006
bmi -4.614272 -0.014786 0.000177 0.387397 -0.008019
bmi2 0.095350 0.000313 -0.000006 -0.008019 0.000167
In [74]:
b.T
Out[74]:
Intercept    21.363804
wtdiff        0.732879
runtime       0.016116
bmi          -2.231851
bmi2          0.044906
dtype: float64
In [75]:
b['wtdiff']
Out[75]:
0.7328793472053252
In [76]:
se = np.sqrt(np.diag(V))
In [77]:
b_wtdiff = b['wtdiff']
se_b_wtdiff = se[1]
print(b_wtdiff, se_b_wtdiff)
0.7328793472053252 0.12311509580038163
In [78]:
print("Var %.4f" % 1.23114808e-01)
Var 0.1231

Customize the output of contrasts¶

In [79]:
for i in range(8):
    print("wtdiff = ", x[i])
    print("OR = %.2f CI = %.2f to %.2f " % (np.exp(b_wtdiff*(x[i]-0)), np.exp(b_wtdiff*(x[i]-0)-1.96*se_b_wtdiff*(x[i]-0)), np.exp(b_wtdiff*(x[i]-0)+1.96*se_b_wtdiff*(x[i]-0)))
)
wtdiff =  -4
OR = 0.05 CI = 0.14 to 0.02 
wtdiff =  -3
OR = 0.11 CI = 0.23 to 0.05 
wtdiff =  -2
OR = 0.23 CI = 0.37 to 0.14 
wtdiff =  -1
OR = 0.48 CI = 0.61 to 0.38 
wtdiff =  0
OR = 1.00 CI = 1.00 to 1.00 
wtdiff =  1
OR = 2.08 CI = 1.63 to 2.65 
wtdiff =  2
OR = 4.33 CI = 2.67 to 7.02 
wtdiff =  3
OR = 9.01 CI = 4.37 to 18.59 

Estimated odds ratio of hyponatremia conferred by 1 unit increase in weight change¶

$ OR_{wtdiff} = \exp(\hat \beta_1\times1) $

In [80]:
# log odds ratio and 95% CI 
lincom(np.array([0,1,0,0,0]),b,V)
Out[80]:
(0.7328793472053252, 0.4915781934833787, 0.9741805009272717)
In [81]:
# Using the equivariance property of ML estimate we can easily obtain OR (95% CI)
np.around(np.exp(lincom(np.array([0,1,0,0,0]),b,V)), 1)
Out[81]:
array([2.1, 1.6, 2.6])

Estimated odds ratio of hyponatremia conferred by 2 unit increase in weight change¶

$ OR_{wtdiff} = \exp(\hat \beta_1\times 2) $

In [82]:
np.exp(lincom(np.array([0,2,0,0,0]),b,V))
Out[82]:
array([4.33082777, 2.67287958, 7.01717701])

Estimated odds ratio of hyponatremia conferred by 30 min increase in race duration¶

$ OR_{runtime} = \exp(\hat \beta_2\times 30) $

In [83]:
np.round(np.exp(lincom(np.array([0,0,30,0,0]),b,V)),1)
Out[83]:
array([1.6, 1.3, 2.1])

List comprehensions to produce tables of estimates¶

In [84]:
or_wtdiff = np.array([np.exp(lincom(np.array([0,(x-0),0,0,0]),b,V)) for x in np.arange(-4,3,1)] )
or_wtdiff 
Out[84]:
array([[0.05331608, 0.02030837, 0.13997201],
       [0.11095417, 0.05379679, 0.22883946],
       [0.23090274, 0.14250745, 0.37412834],
       [0.4805234 , 0.37750159, 0.61166031],
       [1.        , 1.        , 1.        ],
       [2.0810641 , 1.63489436, 2.64899547],
       [4.33082777, 2.67287958, 7.01717701]])
In [85]:
or_runtime = np.array( [np.exp(lincom(np.array([0,0,(x-210),0,0]),b,V)) for x in np.arange(180,330,30)] )
or_runtime
Out[85]:
array([[0.61663599, 0.47689288, 0.79732779],
       [1.        , 1.        , 1.        ],
       [1.62170229, 1.25418933, 2.09690696],
       [2.62991833, 1.57299087, 4.39701881],
       [4.26494459, 1.97282836, 9.22013936]])
In [86]:
or_bmi = np.array([np.exp(lincom(np.array([0,0,0,(x-22.5),(x**2-506.25)]),b,V)) for x in np.arange(17.5,32.5,0.5)] )
or_bmi
Out[86]:
array([[ 8.8295876 ,  2.83287201, 27.52034575],
       [ 6.41898241,  2.42612167, 16.98321052],
       [ 4.77246888,  2.09990604, 10.84641826],
       [ 3.62886806,  1.83720915,  7.16776497],
       [ 2.82195709,  1.62508584,  4.90032072],
       [ 2.24429937,  1.45366227,  3.46495865],
       [ 1.82541797,  1.3154089 ,  2.53316726],
       [ 1.51843082,  1.20461365,  1.91400135],
       [ 1.29175114,  1.11700515,  1.49383466],
       [ 1.12386423,  1.04949246,  1.20350633],
       [ 1.        ,  1.        ,  1.        ],
       [ 0.90999143,  0.8560019 ,  0.96738618],
       [ 0.84688758,  0.75382167,  0.95144329],
       [ 0.8060563 ,  0.68178057,  0.95298514],
       [ 0.78461416,  0.63202642,  0.97404058],
       [ 0.78108457,  0.59919431,  1.01818909],
       [ 0.79522703,  0.57958228,  1.09110656],
       [ 0.82800958,  0.57064619,  1.20144473],
       [ 0.8817201 ,  0.57069133,  1.36226065],
       [ 0.96023444,  0.57868102,  1.59336515],
       [ 1.06948566,  0.59411267,  1.92522332],
       [ 1.21821462,  0.61693521,  2.40551494],
       [ 1.41913531,  0.64749775,  3.11035059],
       [ 1.69073271,  0.6865272 ,  4.16382207],
       [ 2.06004762,  0.73513502,  5.77281192],
       [ 2.56702867,  0.79485443,  8.29036854],
       [ 3.27141266,  0.8677102 , 12.33377327],
       [ 4.26374376,  0.95632541, 19.00975401],
       [ 5.68326643,  1.06407279, 30.35461268],
       [ 7.74740197,  1.19528217, 50.21595654]])

Use lincom to make inference on predicted probabilities¶

$\texttt{logit}\left(P(\texttt{nas135}|\texttt{wtdiff},\texttt{runtime}, \texttt{bmi}\right) = \beta_0 + \beta_1\texttt{wtdiff} + \beta_2\texttt{runtime}+ \beta_3\texttt{bmi}+\beta_4\texttt{bmi}^2$

$\texttt{logit}\left(P(\texttt{nas135}|\texttt{wtdiff}=-1,\texttt{runtime}=210, \texttt{bmi}=22.5\right) = \beta_0 + \beta_1(-1) + \beta_2(210)+ \beta_3(22.5)+\beta_4(506.25)$

In [87]:
lincom(np.array([1,-1,210,22.5, 22.5**2]),b,V)
Out[87]:
(-3.4678567099077813, -4.102390654917354, -2.8333227648982087)
In [88]:
def invlogit(xb):
    return np.exp(xb) / (1 + np.exp(xb))
In [89]:
invlogit(lincom(np.array([1,-1,210,22.5, 22.5**2]),b,V))
Out[89]:
array([0.03024077, 0.01626421, 0.05554981])

The probability of hyponatremia for those runners who lost 1 kg, completed the marathon in 3:30, and had a BMI equal to 22.5 kg/m^2 is estimated to be 3% (95% CI = 1.6 to 5.6%).

Plot statistical inference based on linear combinations¶

Graph partial predictions¶

The idea is to use the lincom function defined above to extract adjusted odds ratios.

figure2a.png

In [90]:
xval = np.arange(-4,4,1)
ors = np.array([np.exp(lincom(np.array([0,(x-0),0,0,0]),b,V)) for x in xval] )
ors.shape
Out[90]:
(8, 3)
In [91]:
ors
Out[91]:
array([[ 0.05331608,  0.02030837,  0.13997201],
       [ 0.11095417,  0.05379679,  0.22883946],
       [ 0.23090274,  0.14250745,  0.37412834],
       [ 0.4805234 ,  0.37750159,  0.61166031],
       [ 1.        ,  1.        ,  1.        ],
       [ 2.0810641 ,  1.63489436,  2.64899547],
       [ 4.33082777,  2.67287958,  7.01717701],
       [ 9.01273017,  4.36987577, 18.58847013]])

Figure 2 - Panel A) Weight change¶

In [92]:
# graph the adjusted odds ratios for weight change

xval = np.arange(-4,4,1)
ors = np.array([np.exp(lincom(np.array([0,(x-0),0,0,0]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "wtdiff"]
sample0 = df.loc[df["nas135"] == 0, "wtdiff"]
fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=1)   
ax.plot(xval, ors[:,1], color='blue', alpha=0.2, linewidth=0)   
ax.plot(xval, ors[:,2], color='blue', alpha=0.2, linewidth=0)   
ax.fill_between(xval, ors[:,1], ors[:,2], color='blue', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%4.3f'))
ax.set_yticks([0.125, .250, .5, 1 , 2, 4, 8])
ax.text(-3, 4, r'$p$<0.001', fontsize=14)
ax.plot(sample0, [.07]*len(sample0), '|', color='k')
ax.plot(sample1, [12]*len(sample1), '|', color='k')
ax.set_xlim(-4,3.5)  # most of the data
ax.set_ylim(.05,13)  # most of the data
ax.vlines(0, .05, 1, color='black', linewidth=.5)
plt.xlabel('Weight Change (kg)')
plt.ylabel('Adjusted Odds Ratio')
plt.savefig("wtdiff_or.pdf")
plt.show()
No description has been provided for this image

Figure 2 - Panel B) Race duration¶

figure2b.png

In [93]:
# graph the adjusted odds ratios for runtime

xval = np.arange(180,360,30)
ors = np.array([np.exp(lincom(np.array([0,0,(x-210),0,0]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "runtime"]
sample0 = df.loc[df["nas135"] == 0, "runtime"]
fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=4)  
ax.plot(xval, ors[:,1], color='blue', alpha=0.2, linewidth=0)  
ax.plot(xval, ors[:,2], color='blue', alpha=0.2, linewidth=0)  
ax.fill_between(xval, ors[:,1], ors[:,2], color='blue', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%4.3f'))
ax.vlines(210, .05, 1, color='black', linewidth=.5)
ax.set_yticks([.5, 1 , 2, 4, 8])
ax.set_xticks([180, 210, 240, 270, 300, 330] )
ax.set_xticklabels(['3:00', '3:30', '4:00', '4:30', '5:00', '5:30'])
ax.plot(sample0, [.5]*len(sample0), '|', color='k')
ax.plot(sample1, [12]*len(sample1), '|', color='k')
ax.set_xlim(180, 360)  # most of the data
ax.set_ylim(.4,13)  # most of the data
ax.text(200, 9, r'$p$<0.001', fontsize=14)
plt.xlabel('Race Duration (min)')
plt.ylabel('Odds Ratio')
plt.savefig("runtime_or.pdf")
plt.show()
No description has been provided for this image

Figure 2 - Panel C) Body Mass Index¶

figure2c.png

In [94]:
# graph the adjusted odds ratios for BMI

xval = np.arange(17.5,32.5,0.5)
ors = np.array([np.exp(lincom(np.array([0,0,0,(x-22.5),(x**2-506.25)]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "bmi"]
sample0 = df.loc[df["nas135"] == 0, "bmi"]
fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_ylim(.4, 12)  # most of the data
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=4)  
ax.plot(xval, ors[:,1], color='grey', alpha=0.2, linewidth=0)  
ax.plot(xval, ors[:,2], color='grey', alpha=0.2, linewidth=0)  
ax.fill_between(xval, ors[:,1], ors[:,2], color='blue', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%3.2f'))
ax.set_yticks([.5, 1 , 2, 4, 8])
ax.set_xticks(np.arange(17.5,32.5, 2.5))
#ax.set_title('Non-Linear Modeling of BMI vs. Disease Risk', fontsize=18, pad=20)
ax.vlines(22.5, .05, 1, color='black', linewidth=.5)
plt.xlabel(r'Body Mass Index (kg/$m^2$)',fontsize=16)
plt.ylabel('Adjusted Odds Ratio', fontsize=16)
ax.plot(sample0, [.5]*len(sample0), '|', color='k')
ax.plot(sample1, [11]*len(sample1), '|', color='k')
ax.text(20, 7, r'$p$<0.001', fontsize=14)
plt.savefig("bmi_or.pdf")
plt.show()
No description has been provided for this image
In [95]:
# graph the adjusted odds ratios for BMI with a histogram 

xval = np.arange(17.5,32.5,0.5)
ors = np.array([np.exp(lincom(np.array([0,0,0,(x-22.5),(x**2-506.25)]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "bmi"]
sample0 = df.loc[df["nas135"] == 0, "bmi"]
bmi = np.array(df[['bmi']])

fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_ylim(.4, 12)  # most of the data
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=4)  
ax.plot(xval, ors[:,1], color='blue', alpha=0.2, linewidth=0)  
ax.plot(xval, ors[:,2], color='blue', alpha=0.2, linewidth=0)  
ax.fill_between(xval, ors[:,1], ors[:,2], color='b', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%3.2f'))
ax.set_yticks([.5, 1 , 2, 4, 8])
ax.set_xticks(np.arange(17.5,32.5, 2.5))
ax.set_ylim(.4, 12)  # most of the data

plt.xlabel(r'Body Mass Index (kg/$m^2$)',fontsize=16)
plt.ylabel('Adjusted Odds Ratio', fontsize=16)
#ax.set_title('Non-Linear Modeling of BMI vs. Disease Risk', fontsize=18, pad=20)
ax.plot(sample0, [.5]*len(sample0), '|', color='k')
ax.plot(sample1, [11]*len(sample1), '|', color='k')
ax.text(20, 7, r'$p$<0.001', fontsize=14)
ax.vlines(22.5, .05, 1, color='black', linewidth=.5)

ax2=ax.twinx()
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.hist(bmi, 15, color='red', alpha=0.1, density=True, rwidth=.9) 
ax2.set_yticks([])
plt.savefig("bmi_or2.pdf")
plt.show()
No description has been provided for this image