Leave-one-out cross-validation in Python

Published: August 11, 2023

Strategy

  • Look at the cars dataset that is shipped in base R
  • Fit models
    • for each of the 50 data points, fit the model on the other 49 observations
    • predict the response for the held-out point from that model
    • calculate the mean absolute error and the root mean squared error of these out-of-sample predictions (spelled out just below)
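
For reference, if y_i denotes the observed speed and ŷ_(i) the prediction for observation i from the model fitted without it, the two error measures are

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_{(i)} \right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2}

with n = 50 here.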

Implementation

Init

Chunk in R to set up my Python environment (can be omitted by users with a working Python setup)

library('reticulate')
use_condaenv('sbloggel', required = TRUE)

Chunk to load Python libraries and set up the data

import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
cars = pd.DataFrame(r['cars'])
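
For readers without an R session, the same dataset can also be fetched directly in Python. A minimal sketch using statsmodels' get_rdataset, which downloads the data from the Rdatasets repository and therefore needs network access:

from statsmodels.datasets import get_rdataset
# the 'cars' dataset from R's built-in 'datasets' package; columns: speed, dist
cars = get_rdataset('cars').data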

Show that fitting in R and Python yields identical results

fm = smf.ols(formula = 'speed ~ dist', data = cars).fit()
fm.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                  speed   R-squared:                       0.651
Model:                            OLS   Adj. R-squared:                  0.644
Method:                 Least Squares   F-statistic:                     89.57
Date:                Wed, 23 Aug 2023   Prob (F-statistic):           1.49e-12
Time:                        12:04:41   Log-Likelihood:                -127.39
No. Observations:                  50   AIC:                             258.8
Df Residuals:                      48   BIC:                             262.6
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      8.2839      0.874      9.474      0.000       6.526      10.042
dist           0.1656      0.017      9.464      0.000       0.130       0.201
==============================================================================
Omnibus:                        0.720   Durbin-Watson:                   1.195
Prob(Omnibus):                  0.698   Jarque-Bera (JB):                0.827
Skew:                          -0.207   Prob(JB):                        0.661
Kurtosis:                       2.526   Cond. No.                         98.0
==============================================================================


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The same fit in R:

fm <- lm(speed ~ dist, data = cars)
summary(fm)

Call:
lm(formula = speed ~ dist, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.5293 -2.1550  0.3615  2.4377  6.4179 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
dist         0.16557    0.01749   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.156 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
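
Rather than eyeballing the two summaries, the point estimates can also be compared directly; a quick check on the Python side:

# coefficient estimates from the Python fit
fm.params   # Intercept ≈ 8.28391, dist ≈ 0.16557, the same values coef(fm) returns in R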

Show that prediction in R and Python yields identical results

fm.predict(cars.iloc[0])
0    8.615041
dtype: float64

The same prediction in R:

predict(fm, cars[1, ])
       1 
8.615041 
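
Note that predict only needs the regressors that appear in the formula. The first observation has dist = 2, so passing just that value reproduces the prediction (8.28391 + 0.16557 * 2 ≈ 8.615):

# prediction from the regressor alone; the response column is not required
fm.predict(pd.DataFrame({'dist': [2]}))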

Perform leave-one-out cross-validation in Python

# prepare column for predictions:
cars['prediction'] = np.nan

# fit models:
for i in range(len(cars)):
  train = cars.drop(index = i)   # all observations except the i-th
  test = cars.iloc[[i]]          # the held-out observation as a one-row DataFrame
  fm = smf.ols(formula = 'speed ~ dist', data = train).fit()
  cars.loc[i, 'prediction'] = fm.predict(test).iloc[0]  # .loc avoids chained-assignment pitfalls

# calculate prediction error:
cars['error'] = cars['speed'] - cars['prediction']

# mean absolute error:
np.mean(np.abs(cars['error']))
2.6335683917489323
# root mean squared error:
np.sqrt(np.mean(np.square(cars['error'])))
3.244276599631193
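
For ordinary least squares the refitting loop can also be cross-checked in closed form: the leave-one-out prediction errors equal the PRESS residuals e_i / (1 - h_ii) of a single fit on all 50 points. A sketch using statsmodels' influence measures (fm_full is just a fresh full-data fit, since fm still holds the model from the last loop iteration):

# closed-form LOOCV errors for OLS via PRESS residuals
fm_full = smf.ols(formula = 'speed ~ dist', data = cars).fit()
loo_error = fm_full.get_influence().resid_press  # y_i minus the leave-one-out prediction
np.mean(np.abs(loo_error))                       # should reproduce the MAE above
np.sqrt(np.mean(np.square(loo_error)))           # and the RMSE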

Conclusion

  • Results are identical to those produced in R: Link
  • Also, the discrepancy with Friedemann’s results is now resolved