%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

from matplotlib.ticker import FormatStrFormatter

df = pd.read_stata('http://www.stats4life.se/data/marathon.dta', convert_categoricals=False)

# A scatter plot of individual outcomes can be a good starting point 

plt.scatter(df.wtdiff, df.na, c='r', alpha=.2)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.spines.top'] = False
plt.show()

# Estimates of the model parameter using Ordinary Least Squares (OLS) method 

model = smf.ols('na ~ wtdiff', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     na   R-squared:                       0.169
Model:                            OLS   Adj. R-squared:                  0.167
Method:                 Least Squares   F-statistic:                     91.86
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           6.21e-20
Time:                        07:55:02   Log-Likelihood:                -1317.2
No. Observations:                 455   AIC:                             2638.
Df Residuals:                     453   BIC:                             2647.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    139.5534      0.223    624.476      0.000     139.114     139.993
wtdiff        -1.2183      0.127     -9.584      0.000      -1.468      -0.968
==============================================================================
Omnibus:                       65.662   Durbin-Watson:                   1.284
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              327.577
Skew:                          -0.491   Prob(JB):                     7.37e-72
Kurtosis:                       7.039   Cond. No.                         2.04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

# Add a column in the dataframe containing the predicted outcomes 

df['fit'] = model.fittedvalues

# graph individual data points the the predicted mean outcome

plt.scatter(df.wtdiff, df.na, c='r', alpha=.2)
plt.plot(df.wtdiff, df.fit, c='b', alpha=.3 , linewidth=5)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.show()

# comparison of the predicted mean outcome vs observed outcome 

subset = np.column_stack((df['wtdiff'], df['na'], df['fit']))

subset[10:30]

array([[         nan, 141.        ,          nan],
       [         nan, 143.        ,          nan],
       [         nan, 141.        ,          nan],
       [         nan, 141.        ,          nan],
       [         nan, 140.        ,          nan],
       [         nan, 142.        ,          nan],
       [         nan, 143.        ,          nan],
       [  0.68181818, 149.        , 138.72278443],
       [ -0.90909091, 142.        , 140.66096356],
       [  1.13636364, 141.        , 138.16901896],
       [ -0.68181818, 138.        , 140.38408083],
       [  0.22727273, 140.        , 139.27654989],
       [  0.        , 142.        , 139.55343263],
       [ -0.59659091, 149.        , 140.2802498 ],
       [ -0.22727273, 140.        , 139.83031537],
       [ -1.81818182, 147.        , 141.76849449],
       [  0.        , 136.        , 139.55343263],
       [  0.45454546, 139.        , 138.99966716],
       [  0.        , 143.        , 139.55343263],
       [  0.22727273, 142.        , 139.27654989]])

hypotheses = 'wtdiff = 0'
t_test = model.t_test(hypotheses)
t_test

<class 'statsmodels.stats.contrast.ContrastResults'>
                             Test for Constraints                             
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0            -1.2183      0.127     -9.584      0.000      -1.468      -0.968
==============================================================================

# you can test any other value of the regression coefficient b1 

hypotheses = 'wtdiff = -1.3'
t_test = model.t_test(hypotheses)
t_test

<class 'statsmodels.stats.contrast.ContrastResults'>
                             Test for Constraints                             
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0            -1.2183      0.127      0.643      0.521      -1.468      -0.968
==============================================================================

# access estimated regression coefficients

model.params

Intercept    139.553433
wtdiff        -1.218284
dtype: float64

# access estimated var/covariances of the regression coefficients

model.cov_params()

# access estimated std errors of the regression coefficients

np.sqrt(np.diag(model.cov_params()))

array([0.22347274, 0.12711121])

def lincom(X,b,V):
    Var = np.dot(X, np.dot(V,X.T))
    if Var.ndim == 0:
        Se = np.sqrt(Var)
    else: 
        Se = np.sqrt(np.diag(Var))
    return (np.matmul(X,b), np.matmul(X,b)+stats.norm.ppf(.025)*Se, np.matmul(X,b)+stats.norm.ppf(.975)*Se)

b = model.params
V = model.cov_params()

lincom(np.array([1, 3]),b,V)

(135.898580556215, 134.89498068034243, 136.90218043208756)

def lincom2(X,b,V):
    Var = np.dot(X, np.dot(V,X.T))
    if Var.ndim == 0:
        Se = np.sqrt(Var)
    else: 
        Se = np.sqrt(np.diag(Var))
    z = (np.matmul(X,b)-0)/Se
    p = stats.norm.cdf( -abs(z))*2
    return ( np.matmul(X,b), Se, z,  p, np.matmul(X,b)+stats.norm.ppf(.025)*Se, np.matmul(X,b)+stats.norm.ppf(.975)*Se)

results = lincom2(np.array([0, 1]),b,V)

np.round(np.transpose(results),4)

array([-1.2183,  0.1271, -9.5844,  0.    , -1.4674, -0.9692])

print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     na   R-squared:                       0.169
Model:                            OLS   Adj. R-squared:                  0.167
Method:                 Least Squares   F-statistic:                     91.86
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           6.21e-20
Time:                        07:55:02   Log-Likelihood:                -1317.2
No. Observations:                 455   AIC:                             2638.
Df Residuals:                     453   BIC:                             2647.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    139.5534      0.223    624.476      0.000     139.114     139.993
wtdiff        -1.2183      0.127     -9.584      0.000      -1.468      -0.968
==============================================================================
Omnibus:                       65.662   Durbin-Watson:                   1.284
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              327.577
Skew:                          -0.491   Prob(JB):                     7.37e-72
Kurtosis:                       7.039   Cond. No.                         2.04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

lincom(np.array([0, 4]),b,V)

(-4.873136097422644, -5.869669663999436, -3.876602530845852)

df['bmisq'] = df['bmi']**2

# Pearson correlation between bmi and bmi squared
valid = df[['bmi', 'bmisq']].dropna()
r, pval = stats.pearsonr(valid['bmi'], valid['bmisq'])
print("Correlation:", r, "p-value:", pval)

Correlation: 0.9965742291212139 p-value: 0.0

model = smf.ols('na ~ wtdiff + bmi + bmisq' , data=df).fit()

print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     na   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.180
Method:                 Least Squares   F-statistic:                     34.03
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           7.91e-20
Time:                        07:55:02   Log-Likelihood:                -1305.2
No. Observations:                 452   AIC:                             2618.
Df Residuals:                     448   BIC:                             2635.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    105.0147     11.218      9.361      0.000      82.968     127.061
wtdiff        -1.1770      0.127     -9.247      0.000      -1.427      -0.927
bmi            2.8707      0.940      3.053      0.002       1.023       4.719
bmisq         -0.0587      0.020     -3.006      0.003      -0.097      -0.020
==============================================================================
Omnibus:                       60.670   Durbin-Watson:                   1.308
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              330.172
Skew:                          -0.400   Prob(JB):                     2.01e-72
Kurtosis:                       7.110   Cond. No.                     3.03e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.03e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

hypotheses = ['bmi = 0',  'bmisq = 0']
f_test = model.f_test(hypotheses)

f_test

<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=4.767628913599036, p=0.00893659800206336, df_denom=448, df_num=2>

model = smf.logit('nas135 ~ female ', data=df).fit()

Optimization terminated successfully.
         Current function value: 0.360585
         Iterations 6

print(model.summary())

                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 nas135   No. Observations:                  488
Model:                          Logit   Df Residuals:                      486
Method:                           MLE   Df Model:                            1
Date:                Thu, 02 Oct 2025   Pseudo R-squ.:                 0.05293
Time:                        07:55:02   Log-Likelihood:                -175.97
converged:                       True   LL-Null:                       -185.80
Covariance Type:            nonrobust   LLR p-value:                 9.204e-06
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.4749      0.208    -11.884      0.000      -2.883      -2.067
female         1.2260      0.280      4.386      0.000       0.678       1.774
==============================================================================

b = model.params
V = model.cov_params()
np.round(np.exp(lincom(np.array([0, 1]), b,  V)), 1)

array([3.4, 2. , 5.9])

params = model.params
conf = model.conf_int()
conf['OR'] = params
conf.columns = ['LB', 'UB', 'OR']
print(np.around(np.exp(conf[1:]),1))

         LB   UB   OR
female  2.0  5.9  3.4

df['bmic'] = pd.cut(df['bmi'],bins=[0,20,25,40],labels=['<20','20-25','>25'])
df['runtimehc'] = pd.cut(df['runtime']/60, bins=[0,3.5,4,8] , labels=['<3:30','3:30-4:00','>4:00'])

df['runtimehc'].value_counts(normalize=True)

runtimehc
<3:30        0.404612
3:30-4:00    0.308176
>4:00        0.287212
Name: proportion, dtype: float64

df['bmic'].value_counts(normalize=True)

bmic
20-25    0.701075
>25      0.193548
<20      0.105376
Name: proportion, dtype: float64

model = smf.logit('nas135 ~ I(wtdiff>0) + C(runtimehc,Treatment(0))  + C(bmic,Treatment(1)) ', data=df).fit()
print(model.summary())

Optimization terminated successfully.
         Current function value: 0.311873
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 nas135   No. Observations:                  455
Model:                          Logit   Df Residuals:                      449
Method:                           MLE   Df Model:                            5
Date:                Thu, 02 Oct 2025   Pseudo R-squ.:                  0.1826
Time:                        07:55:02   Log-Likelihood:                -141.90
converged:                       True   LL-Null:                       -173.61
Covariance Type:            nonrobust   LLR p-value:                 2.395e-12
===========================================================================================================
                                              coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------
Intercept                                  -3.8727      0.436     -8.889      0.000      -4.727      -3.019
I(wtdiff > 0)[T.True]                       1.4578      0.310      4.698      0.000       0.850       2.066
C(runtimehc, Treatment(0))[T.3:30-4:00]     1.0549      0.475      2.222      0.026       0.124       1.985
C(runtimehc, Treatment(0))[T.>4:00]         1.8032      0.466      3.867      0.000       0.889       2.717
C(bmic, Treatment(1))[T.<20]                1.3910      0.410      3.390      0.001       0.587       2.195
C(bmic, Treatment(1))[T.>25]                0.0369      0.388      0.095      0.924      -0.723       0.797
===========================================================================================================

params = model.params
conf = model.conf_int()
conf['OR'] = params
conf.columns = ['LB', 'UB', 'OR']
print(np.around(np.exp(conf[1:]),1))

                                          LB    UB   OR
I(wtdiff > 0)[T.True]                    2.3   7.9  4.3
C(runtimehc, Treatment(0))[T.3:30-4:00]  1.1   7.3  2.9
C(runtimehc, Treatment(0))[T.>4:00]      2.4  15.1  6.1
C(bmic, Treatment(1))[T.<20]             1.8   9.0  4.0
C(bmic, Treatment(1))[T.>25]             0.5   2.2  1.0

df['bmisq'] = df['bmi']**2
model = smf.logit('nas135~wtdiff+runtime+bmi+bmisq', data=df).fit()
print(model.summary())

Optimization terminated successfully.
         Current function value: 0.282840
         Iterations 8
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 nas135   No. Observations:                  446
Model:                          Logit   Df Residuals:                      441
Method:                           MLE   Df Model:                            4
Date:                Thu, 02 Oct 2025   Pseudo R-squ.:                  0.2515
Time:                        07:55:02   Log-Likelihood:                -126.15
converged:                       True   LL-Null:                       -168.53
Covariance Type:            nonrobust   LLR p-value:                 1.708e-17
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     21.3638      7.482      2.855      0.004       6.700      36.028
wtdiff         0.7329      0.123      5.953      0.000       0.492       0.974
runtime        0.0161      0.004      3.687      0.000       0.008       0.025
bmi           -2.2319      0.622     -3.586      0.000      -3.452      -1.012
bmisq          0.0449      0.013      3.472      0.001       0.020       0.070
==============================================================================

b = model.params
V = model.cov_params

params = model.params
conf = model.conf_int()
conf['OR'] = params
conf.columns = ['LB', 'UB', 'OR']
print(np.around(np.exp(conf[1:3]),2))

           LB    UB    OR
wtdiff   1.63  2.65  2.08
runtime  1.01  1.02  1.02

model.nobs

446

model.aic

262.2931311594778

model.llf

-126.1465655797389

# Create the linear spline at the chosen knot 
df.loc[ df.wtdiff>0, 'wc_s1'] = (df.wtdiff-0)
df.loc[ df.wtdiff<=0, 'wc_s1'] = 0

df[ ['wtdiff', 'wc_s1']]

model = smf.ols('na~ wtdiff + wc_s1', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     na   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.182
Method:                 Least Squares   F-statistic:                     51.51
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           7.06e-21
Time:                        07:55:03   Log-Likelihood:                -1312.5
No. Observations:                 455   AIC:                             2631.
Df Residuals:                     452   BIC:                             2643.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    140.3155      0.332    422.036      0.000     139.662     140.969
wtdiff        -0.7795      0.190     -4.094      0.000      -1.154      -0.405
wc_s1         -1.3319      0.433     -3.073      0.002      -2.184      -0.480
==============================================================================
Omnibus:                       66.281   Durbin-Watson:                   1.324
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              326.527
Skew:                          -0.503   Prob(JB):                     1.25e-71
Kurtosis:                       7.026   Cond. No.                         4.85
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

df['fit'] = model.fittedvalues
to_plot = df[ ['wtdiff', 'fit'] ].sort_values(['wtdiff'])
plt.plot(to_plot.wtdiff, to_plot.fit, c='b', alpha=.5, linewidth=3)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.axvline(0, c='g', alpha=.2, linestyle='dashed', linewidth=2)
plt.show()

# Define a function to generate restricted cubic splines given number of knots

def rcs(x, nk):
    if (nk == 3):
        lk = [.1, .5, .9]
    if (nk == 4):
        lk = [.05, .35, .65, .95]
    if (nk == 5):
        lk = [.05 , .275, .50, .725, .95]
    if (nk == 6):
        lk = [.05 , .23, .41 , .59, .77 , .95]
    if (nk == 7):
        lk = [.25 , .1833 , .3417 , .50 , .6583 , .8167 , .975]
    knots =[]
    for i in range(nk):
        knots.append(np.quantile(x, lk[i] ) )
    u = np.zeros(shape=(len(x),nk) )
    i=0
    while i < nk:
        u[:,i] = np.array(x>knots[i])*(x-knots[i])
        i = i + 1
    spline = np.zeros(shape=(len(x),(nk-1)))
    spline[:,0] = x
    i = 0 
    while i < (nk-2):
        ip1 = i + 1
        spline[:,ip1] = ( np.power(u[:,i],3) - 1/(knots[nk-1] - knots[nk-2])* 
        ( np.power(u[:,(nk-2)],3)*(knots[nk-1] - knots[i])-np.power(u[:,(nk-1)],3)*(knots[nk-2] - knots[i]) )) / (knots[nk-1] - knots[0])**2 
        i=i+1
    return spline

df = pd.read_stata('http://www.stats4life.se/data/marathon.dta', convert_categoricals=False)
df = df[['na', 'nas135', 'wtdiff', 'runtime', 'bmi']].dropna()
df['bmi2'] = np.power(df['bmi'],2)
y = df['na'].to_numpy()

# loop to compare AIC across spline models with increasing nr of knots

bmis = []
l_aic = []
nknots = [3,4,5,6,7]
for k in nknots:
    print("Nr of knots = " , k)
    X = rcs(df.bmi,k)
    X = sm.add_constant(X)
    smodel = sm.OLS(y,X).fit()
    print(smodel.summary())
    l_aic.append(smodel.aic)
    fits = smodel.fittedvalues
    # print(np.column_stack( [X[:,1], fits] ))
    plt.scatter(X[:,1], fits, s=2, c='r', alpha=.4)
    plt.title('Restricted Cubic Spline with {0} Knots'.format(k))
    plt.show()
    print('Spline model with %.0f knots AIC is %.0f' % (k, smodel.aic))

# lowest AIC is indicating a good compromise between fit and parameters
list(zip(nknots, l_aic))

Nr of knots =  3
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.038
Model:                            OLS   Adj. R-squared:                  0.034
Method:                 Least Squares   F-statistic:                     8.754
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           0.000187
Time:                        07:55:03   Log-Likelihood:                -1325.3
No. Observations:                 446   AIC:                             2657.
Df Residuals:                     443   BIC:                             2669.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        121.5136      4.555     26.674      0.000     112.561     130.467
x1             0.8859      0.213      4.161      0.000       0.467       1.304
x2            -1.1100      0.277     -4.001      0.000      -1.655      -0.565
==============================================================================
Omnibus:                       70.144   Durbin-Watson:                   1.181
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              316.526
Skew:                          -0.590   Prob(JB):                     1.85e-69
Kurtosis:                       6.955   Cond. No.                         473.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Spline model with 3 knots AIC is 2657
Nr of knots =  4
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.039
Model:                            OLS   Adj. R-squared:                  0.032
Method:                 Least Squares   F-statistic:                     5.957
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           0.000548
Time:                        07:55:03   Log-Likelihood:                -1325.1
No. Observations:                 446   AIC:                             2658.
Df Residuals:                     442   BIC:                             2675.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        117.1229      7.674     15.263      0.000     102.042     132.204
x1             1.1091      0.377      2.946      0.003       0.369       1.849
x2            -2.3612      1.570     -1.504      0.133      -5.446       0.724
x3             4.6141      5.000      0.923      0.357      -5.213      14.442
==============================================================================
Omnibus:                       70.064   Durbin-Watson:                   1.183
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              316.204
Skew:                          -0.589   Prob(JB):                     2.17e-69
Kurtosis:                       6.953   Cond. No.                         928.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Spline model with 4 knots AIC is 2658
Nr of knots =  5
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.039
Model:                            OLS   Adj. R-squared:                  0.031
Method:                 Least Squares   F-statistic:                     4.512
Date:                Thu, 02 Oct 2025   Prob (F-statistic):            0.00139
Time:                        07:55:03   Log-Likelihood:                -1325.0
No. Observations:                 446   AIC:                             2660.
Df Residuals:                     441   BIC:                             2680.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        113.8794     10.237     11.124      0.000      93.760     133.999
x1             1.2772      0.514      2.483      0.013       0.266       2.288
x2            -4.3913      4.339     -1.012      0.312     -12.919       4.136
x3            13.2230     21.448      0.617      0.538     -28.931      55.377
x4           -11.6890     26.956     -0.434      0.665     -64.667      41.289
==============================================================================
Omnibus:                       70.538   Durbin-Watson:                   1.186
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              322.180
Skew:                          -0.590   Prob(JB):                     1.10e-70
Kurtosis:                       6.993   Cond. No.                     3.65e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.65e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Spline model with 5 knots AIC is 2660
Nr of knots =  6
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.039
Model:                            OLS   Adj. R-squared:                  0.028
Method:                 Least Squares   F-statistic:                     3.576
Date:                Thu, 02 Oct 2025   Prob (F-statistic):            0.00351
Time:                        07:55:03   Log-Likelihood:                -1325.0
No. Observations:                 446   AIC:                             2662.
Df Residuals:                     440   BIC:                             2687.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        115.5313     11.903      9.706      0.000      92.138     138.925
x1             1.1914      0.605      1.969      0.050       0.002       2.380
x2            -3.2623      7.866     -0.415      0.679     -18.721      12.196
x3             2.4271     45.041      0.054      0.957     -86.095      90.949
x4             8.1119     77.386      0.105      0.917    -143.981     160.205
x5           -10.9087     61.681     -0.177      0.860    -132.135     110.318
==============================================================================
Omnibus:                       70.170   Durbin-Watson:                   1.186
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              318.165
Skew:                          -0.589   Prob(JB):                     8.15e-70
Kurtosis:                       6.967   Cond. No.                     1.11e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.11e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Spline model with 6 knots AIC is 2662
Nr of knots =  7
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.043
Model:                            OLS   Adj. R-squared:                  0.030
Method:                 Least Squares   F-statistic:                     3.301
Date:                Thu, 02 Oct 2025   Prob (F-statistic):            0.00345
Time:                        07:55:03   Log-Likelihood:                -1324.1
No. Observations:                 446   AIC:                             2662.
Df Residuals:                     439   BIC:                             2691.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        116.2837      9.241     12.584      0.000      98.122     134.445
x1             1.1481      0.461      2.491      0.013       0.242       2.054
x2          -121.1906    295.221     -0.411      0.682    -701.412     459.031
x3            11.8721     94.523      0.126      0.900    -173.901     197.645
x4           241.8662    337.937      0.716      0.475    -422.310     906.042
x5          -191.9364    189.090     -1.015      0.311    -563.570     179.697
x6            74.5418     66.352      1.123      0.262     -55.864     204.948
==============================================================================
Omnibus:                       74.168   Durbin-Watson:                   1.189
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              345.955
Skew:                          -0.623   Prob(JB):                     7.53e-76
Kurtosis:                       7.131   Cond. No.                     5.07e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.07e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Spline model with 7 knots AIC is 2662

[(3, 2656.5587353722885),
 (4, 2658.1682651667147),
 (5, 2659.955934150113),
 (6, 2662.0784684105115),
 (7, 2662.1652085479254)]

bmis = rcs(df.bmi,3)

np.size(bmis, 0), np.size(bmis,1)

(446, 2)

# creating the design matrix 

bmis = rcs(df.bmi,3)
y = df['na'].to_numpy()        
X = bmis
X = sm.add_constant(X)
model1 = sm.OLS(y,X).fit()
print(model1.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.038
Model:                            OLS   Adj. R-squared:                  0.034
Method:                 Least Squares   F-statistic:                     8.754
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           0.000187
Time:                        07:55:03   Log-Likelihood:                -1325.3
No. Observations:                 446   AIC:                             2657.
Df Residuals:                     443   BIC:                             2669.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        121.5136      4.555     26.674      0.000     112.561     130.467
x1             0.8859      0.213      4.161      0.000       0.467       1.304
x2            -1.1100      0.277     -4.001      0.000      -1.655      -0.565
==============================================================================
Omnibus:                       70.144   Durbin-Watson:                   1.181
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              316.526
Skew:                          -0.590   Prob(JB):                     1.85e-69
Kurtosis:                       6.955   Cond. No.                         473.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

df

# add the cubic splines to the dataframe
df['bmis1'] = bmis[:,0]
df['bmis2'] = bmis[:,1]

# use the formula approach 

model2 = smf.ols('na ~ bmis1 + bmis2', data=df).fit()
print(model2.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     na   R-squared:                       0.038
Model:                            OLS   Adj. R-squared:                  0.034
Method:                 Least Squares   F-statistic:                     8.754
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           0.000187
Time:                        07:55:03   Log-Likelihood:                -1325.3
No. Observations:                 446   AIC:                             2657.
Df Residuals:                     443   BIC:                             2669.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    121.5136      4.555     26.674      0.000     112.561     130.467
bmis1          0.8859      0.213      4.161      0.000       0.467       1.304
bmis2         -1.1100      0.277     -4.001      0.000      -1.655      -0.565
==============================================================================
Omnibus:                       70.144   Durbin-Watson:                   1.181
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              316.526
Skew:                          -0.590   Prob(JB):                     1.85e-69
Kurtosis:                       6.955   Cond. No.                         473.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

hypotheses = ['bmis1 = 0', 'bmis2 = 0']
f_test = model2.f_test(hypotheses)

f_test

<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=8.753689785498299, p=0.0001868614716924368, df_denom=443, df_num=2>

df['fits'] = model2.fittedvalues

# (x,y) pairs to be plotted need to be sorted first 

to_plot = df[ ['bmis1', 'fits'] ].sort_values(['bmis1', 'fits'])

#plt.scatter(df.bmi, df.na, c='r', alpha=.2, s=3)
plt.plot(to_plot['bmis1'],to_plot['fits'], c='b', linewidth=2, alpha=.5)
plt.ylabel("Serum sodium concentration (mmol/l)")
plt.xlabel("Weight Change (kg)")
plt.show()

model = smf.logit('nas135 ~ wtdiff + runtime', data=df).fit()
print(model.summary())

Optimization terminated successfully.
         Current function value: 0.298177
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 nas135   No. Observations:                  446
Model:                          Logit   Df Residuals:                      443
Method:                           MLE   Df Model:                            2
Date:                Thu, 02 Oct 2025   Pseudo R-squ.:                  0.2109
Time:                        07:55:03   Log-Likelihood:                -132.99
converged:                       True   LL-Null:                       -168.53
Covariance Type:            nonrobust   LLR p-value:                 3.681e-16
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -5.7753      0.958     -6.029      0.000      -7.653      -3.898
wtdiff         0.6980      0.115      6.049      0.000       0.472       0.924
runtime        0.0162      0.004      4.268      0.000       0.009       0.024
==============================================================================

hypothesis='wtdiff=0'
wald = model.wald_test(hypothesis, scalar=True)
print("Wald test statistic:", wald.statistic)
print("p-value:", wald.pvalue)

Wald test statistic: 36.5913183097515
p-value: 1.4567908936820186e-09

model = smf.logit('nas135 ~ bmis1 + bmis2 + wtdiff', data=df).fit()
print(model.summary())

Optimization terminated successfully.
         Current function value: 0.298471
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 nas135   No. Observations:                  446
Model:                          Logit   Df Residuals:                      442
Method:                           MLE   Df Model:                            3
Date:                Thu, 02 Oct 2025   Pseudo R-squ.:                  0.2101
Time:                        07:55:03   Log-Likelihood:                -133.12
converged:                       True   LL-Null:                       -168.53
Covariance Type:            nonrobust   LLR p-value:                 2.857e-15
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      8.5283      2.673      3.191      0.001       3.289      13.767
bmis1         -0.5033      0.128     -3.929      0.000      -0.754      -0.252
bmis2          0.7309      0.170      4.312      0.000       0.399       1.063
wtdiff         0.7518      0.118      6.396      0.000       0.521       0.982
==============================================================================

hypothesis = '(bmis1=0, bmis2=0)'
wald = model.wald_test(hypothesis, scalar=True)
print("Wald test statistic:", wald.statistic)
print("p-value:", wald.pvalue)

Wald test statistic: 18.638150365520445
p-value: 8.969682522687936e-05

# Model 1  

m1 = smf.logit('nas135 ~ wtdiff + runtime + bmis1 + bmis2 ', data=df).fit()
l1 = m1.llf

# Model 0 - restricted model without spline terms for bmi

m0 = smf.logit('nas135 ~ wtdiff + runtime', data=df).fit()
l0 = m0.llf

Optimization terminated successfully.
         Current function value: 0.283066
         Iterations 8
Optimization terminated successfully.
         Current function value: 0.298177
         Iterations 7

lrt = 2*abs(l1-l0)
k1 = len(m1.params)
k0 = len(m0.params)
d_f = k1-k0
print("Likelihood ratio test %.3f (p = %.4f) " % (lrt, 1-stats.chi2.cdf(lrt,d_f)) )

Likelihood ratio test 13.479 (p = 0.0012)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 446 entries, 18 to 487
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   na       446 non-null    int16  
 1   nas135   446 non-null    int8   
 2   wtdiff   446 non-null    float64
 3   runtime  446 non-null    float64
 4   bmi      446 non-null    float64
 5   bmi2     446 non-null    float64
 6   bmis1    446 non-null    float64
 7   bmis2    446 non-null    float64
 8   fits     446 non-null    float64
dtypes: float64(7), int16(1), int8(1)
memory usage: 29.2 KB

model = smf.logit('nas135 ~ wtdiff + runtime + bmi + bmi2', data=df).fit()
print(model.summary())

Optimization terminated successfully.
         Current function value: 0.282840
         Iterations 8
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 nas135   No. Observations:                  446
Model:                          Logit   Df Residuals:                      441
Method:                           MLE   Df Model:                            4
Date:                Thu, 02 Oct 2025   Pseudo R-squ.:                  0.2515
Time:                        07:55:03   Log-Likelihood:                -126.15
converged:                       True   LL-Null:                       -168.53
Covariance Type:            nonrobust   LLR p-value:                 1.708e-17
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     21.3638      7.482      2.855      0.004       6.700      36.028
wtdiff         0.7329      0.123      5.953      0.000       0.492       0.974
runtime        0.0161      0.004      3.687      0.000       0.008       0.025
bmi           -2.2319      0.622     -3.586      0.000      -3.452      -1.012
bmi2           0.0449      0.013      3.472      0.001       0.020       0.070
==============================================================================

b = model.params
x = (-4, -3, -2, -1, 0, 1, 2, 3)
x_ref = 0
or_wtdiff = []
for i in range(8):
    or_wtdiff.append(round(np.exp(b['wtdiff']*(x[i]-x_ref)), 2))

or_wtdiff

[0.05, 0.11, 0.23, 0.48, 1.0, 2.08, 4.33, 9.01]

# Loop to tabulate adjusted odds ratios for a grid of values 
for i in range(8):
    print(x[i], round(np.exp(b['wtdiff']*(x[i]-x_ref)), 2))

-4 0.05
-3 0.11
-2 0.23
-1 0.48
0 1.0
1 2.08
2 4.33
3 9.01

V = model.cov_params()

V

b.T

Intercept    21.363804
wtdiff        0.732879
runtime       0.016116
bmi          -2.231851
bmi2          0.044906
dtype: float64

b['wtdiff']

0.7328793472053252

se = np.sqrt(np.diag(V))

b_wtdiff = b['wtdiff']
se_b_wtdiff = se[1]
print(b_wtdiff, se_b_wtdiff)

0.7328793472053252 0.12311509580038163

print("Var %.4f" % 1.23114808e-01)

Var 0.1231

for i in range(8):
    print("wtdiff = ", x[i])
    print("OR = %.2f CI = %.2f to %.2f " % (np.exp(b_wtdiff*(x[i]-0)), np.exp(b_wtdiff*(x[i]-0)-1.96*se_b_wtdiff*(x[i]-0)), np.exp(b_wtdiff*(x[i]-0)+1.96*se_b_wtdiff*(x[i]-0)))
)

wtdiff =  -4
OR = 0.05 CI = 0.14 to 0.02 
wtdiff =  -3
OR = 0.11 CI = 0.23 to 0.05 
wtdiff =  -2
OR = 0.23 CI = 0.37 to 0.14 
wtdiff =  -1
OR = 0.48 CI = 0.61 to 0.38 
wtdiff =  0
OR = 1.00 CI = 1.00 to 1.00 
wtdiff =  1
OR = 2.08 CI = 1.63 to 2.65 
wtdiff =  2
OR = 4.33 CI = 2.67 to 7.02 
wtdiff =  3
OR = 9.01 CI = 4.37 to 18.59

# log odds ratio and 95% CI 
lincom(np.array([0,1,0,0,0]),b,V)

(0.7328793472053252, 0.4915781934833787, 0.9741805009272717)

# Using the equivariance property of ML estimate we can easily obtain OR (95% CI)
np.around(np.exp(lincom(np.array([0,1,0,0,0]),b,V)), 1)

array([2.1, 1.6, 2.6])

np.exp(lincom(np.array([0,2,0,0,0]),b,V))

array([4.33082777, 2.67287958, 7.01717701])

np.round(np.exp(lincom(np.array([0,0,30,0,0]),b,V)),1)

array([1.6, 1.3, 2.1])

or_wtdiff = np.array([np.exp(lincom(np.array([0,(x-0),0,0,0]),b,V)) for x in np.arange(-4,3,1)] )
or_wtdiff

array([[0.05331608, 0.02030837, 0.13997201],
       [0.11095417, 0.05379679, 0.22883946],
       [0.23090274, 0.14250745, 0.37412834],
       [0.4805234 , 0.37750159, 0.61166031],
       [1.        , 1.        , 1.        ],
       [2.0810641 , 1.63489436, 2.64899547],
       [4.33082777, 2.67287958, 7.01717701]])

or_runtime = np.array( [np.exp(lincom(np.array([0,0,(x-210),0,0]),b,V)) for x in np.arange(180,330,30)] )
or_runtime

array([[0.61663599, 0.47689288, 0.79732779],
       [1.        , 1.        , 1.        ],
       [1.62170229, 1.25418933, 2.09690696],
       [2.62991833, 1.57299087, 4.39701881],
       [4.26494459, 1.97282836, 9.22013936]])

or_bmi = np.array([np.exp(lincom(np.array([0,0,0,(x-22.5),(x**2-506.25)]),b,V)) for x in np.arange(17.5,32.5,0.5)] )
or_bmi

array([[ 8.8295876 ,  2.83287201, 27.52034575],
       [ 6.41898241,  2.42612167, 16.98321052],
       [ 4.77246888,  2.09990604, 10.84641826],
       [ 3.62886806,  1.83720915,  7.16776497],
       [ 2.82195709,  1.62508584,  4.90032072],
       [ 2.24429937,  1.45366227,  3.46495865],
       [ 1.82541797,  1.3154089 ,  2.53316726],
       [ 1.51843082,  1.20461365,  1.91400135],
       [ 1.29175114,  1.11700515,  1.49383466],
       [ 1.12386423,  1.04949246,  1.20350633],
       [ 1.        ,  1.        ,  1.        ],
       [ 0.90999143,  0.8560019 ,  0.96738618],
       [ 0.84688758,  0.75382167,  0.95144329],
       [ 0.8060563 ,  0.68178057,  0.95298514],
       [ 0.78461416,  0.63202642,  0.97404058],
       [ 0.78108457,  0.59919431,  1.01818909],
       [ 0.79522703,  0.57958228,  1.09110656],
       [ 0.82800958,  0.57064619,  1.20144473],
       [ 0.8817201 ,  0.57069133,  1.36226065],
       [ 0.96023444,  0.57868102,  1.59336515],
       [ 1.06948566,  0.59411267,  1.92522332],
       [ 1.21821462,  0.61693521,  2.40551494],
       [ 1.41913531,  0.64749775,  3.11035059],
       [ 1.69073271,  0.6865272 ,  4.16382207],
       [ 2.06004762,  0.73513502,  5.77281192],
       [ 2.56702867,  0.79485443,  8.29036854],
       [ 3.27141266,  0.8677102 , 12.33377327],
       [ 4.26374376,  0.95632541, 19.00975401],
       [ 5.68326643,  1.06407279, 30.35461268],
       [ 7.74740197,  1.19528217, 50.21595654]])

lincom(np.array([1,-1,210,22.5, 22.5**2]),b,V)

(-3.4678567099077813, -4.102390654917354, -2.8333227648982087)

def invlogit(xb):
    return np.exp(xb) / (1 + np.exp(xb))

invlogit(lincom(np.array([1,-1,210,22.5, 22.5**2]),b,V))

array([0.03024077, 0.01626421, 0.05554981])

xval = np.arange(-4,4,1)
ors = np.array([np.exp(lincom(np.array([0,(x-0),0,0,0]),b,V)) for x in xval] )
ors.shape

(8, 3)

ors

array([[ 0.05331608,  0.02030837,  0.13997201],
       [ 0.11095417,  0.05379679,  0.22883946],
       [ 0.23090274,  0.14250745,  0.37412834],
       [ 0.4805234 ,  0.37750159,  0.61166031],
       [ 1.        ,  1.        ,  1.        ],
       [ 2.0810641 ,  1.63489436,  2.64899547],
       [ 4.33082777,  2.67287958,  7.01717701],
       [ 9.01273017,  4.36987577, 18.58847013]])

# graph the adjusted odds ratios for weight change

xval = np.arange(-4,4,1)
ors = np.array([np.exp(lincom(np.array([0,(x-0),0,0,0]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "wtdiff"]
sample0 = df.loc[df["nas135"] == 0, "wtdiff"]
fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=1)   
ax.plot(xval, ors[:,1], color='blue', alpha=0.2, linewidth=0)   
ax.plot(xval, ors[:,2], color='blue', alpha=0.2, linewidth=0)   
ax.fill_between(xval, ors[:,1], ors[:,2], color='blue', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%4.3f'))
ax.set_yticks([0.125, .250, .5, 1 , 2, 4, 8])
ax.text(-3, 4, r'$p$<0.001', fontsize=14)
ax.plot(sample0, [.07]*len(sample0), '|', color='k')
ax.plot(sample1, [12]*len(sample1), '|', color='k')
ax.set_xlim(-4,3.5)  # most of the data
ax.set_ylim(.05,13)  # most of the data
ax.vlines(0, .05, 1, color='black', linewidth=.5)
plt.xlabel('Weight Change (kg)')
plt.ylabel('Adjusted Odds Ratio')
plt.savefig("wtdiff_or.pdf")
plt.show()

# graph the adjusted odds ratios for runtime

xval = np.arange(180,360,30)
ors = np.array([np.exp(lincom(np.array([0,0,(x-210),0,0]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "runtime"]
sample0 = df.loc[df["nas135"] == 0, "runtime"]
fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=4)  
ax.plot(xval, ors[:,1], color='blue', alpha=0.2, linewidth=0)  
ax.plot(xval, ors[:,2], color='blue', alpha=0.2, linewidth=0)  
ax.fill_between(xval, ors[:,1], ors[:,2], color='blue', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%4.3f'))
ax.vlines(210, .05, 1, color='black', linewidth=.5)
ax.set_yticks([.5, 1 , 2, 4, 8])
ax.set_xticks([180, 210, 240, 270, 300, 330] )
ax.set_xticklabels(['3:00', '3:30', '4:00', '4:30', '5:00', '5:30'])
ax.plot(sample0, [.5]*len(sample0), '|', color='k')
ax.plot(sample1, [12]*len(sample1), '|', color='k')
ax.set_xlim(180, 360)  # most of the data
ax.set_ylim(.4,13)  # most of the data
ax.text(200, 9, r'$p$<0.001', fontsize=14)
plt.xlabel('Race Duration (min)')
plt.ylabel('Odds Ratio')
plt.savefig("runtime_or.pdf")
plt.show()

# graph the adjusted odds ratios for BMI

xval = np.arange(17.5,32.5,0.5)
ors = np.array([np.exp(lincom(np.array([0,0,0,(x-22.5),(x**2-506.25)]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "bmi"]
sample0 = df.loc[df["nas135"] == 0, "bmi"]
fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_ylim(.4, 12)  # most of the data
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=4)  
ax.plot(xval, ors[:,1], color='grey', alpha=0.2, linewidth=0)  
ax.plot(xval, ors[:,2], color='grey', alpha=0.2, linewidth=0)  
ax.fill_between(xval, ors[:,1], ors[:,2], color='blue', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%3.2f'))
ax.set_yticks([.5, 1 , 2, 4, 8])
ax.set_xticks(np.arange(17.5,32.5, 2.5))
#ax.set_title('Non-Linear Modeling of BMI vs. Disease Risk', fontsize=18, pad=20)
ax.vlines(22.5, .05, 1, color='black', linewidth=.5)
plt.xlabel(r'Body Mass Index (kg/$m^2$)',fontsize=16)
plt.ylabel('Adjusted Odds Ratio', fontsize=16)
ax.plot(sample0, [.5]*len(sample0), '|', color='k')
ax.plot(sample1, [11]*len(sample1), '|', color='k')
ax.text(20, 7, r'$p$<0.001', fontsize=14)
plt.savefig("bmi_or.pdf")
plt.show()

# graph the adjusted odds ratios for BMI with a histogram 

xval = np.arange(17.5,32.5,0.5)
ors = np.array([np.exp(lincom(np.array([0,0,0,(x-22.5),(x**2-506.25)]),b,V)) for x in xval] )
sample1 = df.loc[df["nas135"] == 1, "bmi"]
sample0 = df.loc[df["nas135"] == 0, "bmi"]
bmi = np.array(df[['bmi']])

fig, ax = plt.subplots(figsize=(7, 5))
ax.set(yscale="log", xscale="linear")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_ylim(.4, 12)  # most of the data
ax.plot(xval, ors[:,0], color='blue', alpha=0.5, linewidth=4)  
ax.plot(xval, ors[:,1], color='blue', alpha=0.2, linewidth=0)  
ax.plot(xval, ors[:,2], color='blue', alpha=0.2, linewidth=0)  
ax.fill_between(xval, ors[:,1], ors[:,2], color='b', alpha=0.2)
ax.yaxis.set_major_formatter(FormatStrFormatter('%3.2f'))
ax.set_yticks([.5, 1 , 2, 4, 8])
ax.set_xticks(np.arange(17.5,32.5, 2.5))
ax.set_ylim(.4, 12)  # most of the data

plt.xlabel(r'Body Mass Index (kg/$m^2$)',fontsize=16)
plt.ylabel('Adjusted Odds Ratio', fontsize=16)
#ax.set_title('Non-Linear Modeling of BMI vs. Disease Risk', fontsize=18, pad=20)
ax.plot(sample0, [.5]*len(sample0), '|', color='k')
ax.plot(sample1, [11]*len(sample1), '|', color='k')
ax.text(20, 7, r'$p$<0.001', fontsize=14)
ax.vlines(22.5, .05, 1, color='black', linewidth=.5)

ax2=ax.twinx()
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.hist(bmi, 15, color='red', alpha=0.1, density=True, rwidth=.9) 
ax2.set_yticks([])
plt.savefig("bmi_or2.pdf")
plt.show()

	wtdiff	wc_s1
0	NaN	NaN
1	NaN	NaN
2	NaN	NaN
3	NaN	NaN
4	NaN	NaN
...	...	...
483	-2.500000	0.000000
484	0.454545	0.454545
485	0.312500	0.312500
486	1.136364	1.136364
487	-0.909091	0.000000

	na	nas135	wtdiff	runtime	bmi	bmi2
18	142	0	-0.909091	195.0	19.503027	380.368077
19	141	0	1.136364	195.0	18.695052	349.504971
20	138	0	-0.681818	207.0	20.157869	406.339666
21	140	0	0.227273	234.0	17.063750	291.171557
22	142	0	0.000000	233.0	19.428191	377.454619
...	...	...	...	...	...	...
483	133	1	-2.500000	308.0	29.188370	851.960947
484	128	1	0.454545	250.0	31.387685	985.186749
485	133	1	0.312500	283.0	28.311327	801.531230
486	131	1	1.136364	308.0	31.224404	974.963415
487	135	1	-0.909091	276.0	28.292931	800.489939

	Intercept	wtdiff	runtime	bmi	bmi2
Intercept	55.975540	0.158859	-0.005729	-4.614272	0.095350
wtdiff	0.158859	0.015157	0.000023	-0.014786	0.000313
runtime	-0.005729	0.000023	0.000019	0.000177	-0.000006
bmi	-4.614272	-0.014786	0.000177	0.387397	-0.008019
bmi2	0.095350	0.000313	-0.000006	-0.008019	0.000167

	Intercept	wtdiff
Intercept	0.04994	0.011140
wtdiff	0.01114	0.016157

Course 5316: Fundamentals of using Python in Health Related Research ¶

Peter Alping, PhD ¶

Orsini Nicola, PhD ¶

Day 4 - Linear and logistic regression models ¶

Package¶

statsmodel¶

Motivating example: observational cohort study¶

Conditional mean¶

Store the predicted mean outcome¶

Univariate Wald-type test¶

Access estimated parameters¶

Inference for any linear combinations of parameters and data¶

Example 1: Confidence interval for the mean outcome¶

Example 2: Confidence interval for the difference in mean outcome¶

Linear and quadratic function for the quantitative predictors¶

F test for a joint hypothesis¶

Conditional probability¶

Binary predictor¶

Confidence interval for the Odds Ratio¶

Categorization of continuous predictors¶

Final multivariable logistic model with categorized predictors¶

All continuous predictors using linear and quadratic functions¶

AIC and maximized log-likelihood¶

Modelling quantitative predictors using splines¶

Degree-1 Spline with linear regression¶

Visualize piece-wise linear trend¶

Define a function to generate restricted cubic splines¶

Testing regression coefficients of the splines¶

Graph the predicted mean from the cubic spline linear model¶

Multivariable logistic regression model¶

Wald-type test for one parameter¶

Cubic spline logistic regression model¶

Wald-type test of two parameters¶

Likelihood ratio test¶

Adjusted Odds Ratio of Hyponatremia according to weight change using 0 kg as referent¶

Final logistic regression model¶

Table of adjusted odds ratios¶

Looping over a grid of predictor values¶

Customize the output of contrasts¶

Estimated odds ratio of hyponatremia conferred by 1 unit increase in weight change¶

Estimated odds ratio of hyponatremia conferred by 2 unit increase in weight change¶

Estimated odds ratio of hyponatremia conferred by 30 min increase in race duration¶

List comprehensions to produce tables of estimates¶

Use lincom to make inference on predicted probabilities¶

Plot statistical inference based on linear combinations¶

Graph partial predictions¶

Figure 2 - Panel A) Weight change¶

Figure 2 - Panel B) Race duration¶

Figure 2 - Panel C) Body Mass Index¶