Leave-one-out cross-validation in Python

Published: August 11, 2023

Strategy

  • Look at the cars dataset that is shipped in base R
  • Fit models
    • for each of the 50 data points, fit the model on the other 49 observations
    • predict the response for the held-out point from that model
    • calculate the mean absolute error and the root mean squared error of these out-of-sample predictions (spelled out just below)
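
For reference, if y_i denotes the observed speed and ŷ_(i) the prediction for observation i from the model fitted without it, the two error measures are

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_{(i)} \right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2}

with n = 50 here.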

Implementation

Init

Chunk in R to set up my Python environment (can be omitted by users with a working Python setup)

library('reticulate')
use_condaenv('sbloggel', required = TRUE)

Chunk to load Python libraries and set up the data

import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
cars = pd.DataFrame(r['cars'])
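
For readers without an R session, the same dataset can also be fetched directly in Python. A minimal sketch using statsmodels' get_rdataset, which downloads the data from the Rdatasets repository and therefore needs network access:

from statsmodels.datasets import get_rdataset
# the 'cars' dataset from R's built-in 'datasets' package; columns: speed, dist
cars = get_rdataset('cars').data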

Show that fitting in R and Python yields identical results

fm = smf.ols(formula = 'speed ~ dist', data = cars).fit()
fm.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                  speed   R-squared:                       0.651
Model:                            OLS   Adj. R-squared:                  0.644
Method:                 Least Squares   F-statistic:                     89.57
Date:                Wed, 23 Aug 2023   Prob (F-statistic):           1.49e-12
Time:                        12:04:41   Log-Likelihood:                -127.39
No. Observations:                  50   AIC:                             258.8
Df Residuals:                      48   BIC:                             262.6
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      8.2839      0.874      9.474      0.000       6.526      10.042
dist           0.1656      0.017      9.464      0.000       0.130       0.201
==============================================================================
Omnibus:                        0.720   Durbin-Watson:                   1.195
Prob(Omnibus):                  0.698   Jarque-Bera (JB):                0.827
Skew:                          -0.207   Prob(JB):                        0.661
Kurtosis:                       2.526   Cond. No.                         98.0
==============================================================================


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The same fit in R:

fm <- lm(speed ~ dist, data = cars)
summary(fm)

Call:
lm(formula = speed ~ dist, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.5293 -2.1550  0.3615  2.4377  6.4179 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
dist         0.16557    0.01749   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.156 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
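
Rather than eyeballing the two summaries, the point estimates can also be compared directly; a quick check on the Python side:

# coefficient estimates from the Python fit
fm.params   # Intercept ≈ 8.28391, dist ≈ 0.16557, the same values coef(fm) returns in R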

Show that prediction in R and Python yields identical results

fm.predict(cars.iloc[0])
0    8.615041
dtype: float64

The same prediction in R:

predict(fm, cars[1, ])
       1 
8.615041 
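
Note that predict only needs the regressors that appear in the formula. The first observation has dist = 2, so passing just that value reproduces the prediction (8.28391 + 0.16557 * 2 ≈ 8.615):

# prediction from the regressor alone; the response column is not required
fm.predict(pd.DataFrame({'dist': [2]}))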

Perform leave-one-out cross-validation in Python

# prepare column for predictions:
cars['prediction'] = np.nan

# fit models:
for i in range(len(cars)):
  train = cars.drop(index = i)   # all observations except the i-th
  test = cars.iloc[[i]]          # the held-out observation as a one-row DataFrame
  fm = smf.ols(formula = 'speed ~ dist', data = train).fit()
  cars.loc[i, 'prediction'] = fm.predict(test).iloc[0]  # .loc avoids chained-assignment pitfalls

# calculate prediction error:
cars['error'] = cars['speed'] - cars['prediction']

# mean absolute error:
np.mean(np.abs(cars['error']))
2.6335683917489323
# root mean squared error:
np.sqrt(np.mean(np.square(cars['error'])))
3.244276599631193
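
For ordinary least squares the refitting loop can also be cross-checked in closed form: the leave-one-out prediction errors equal the PRESS residuals e_i / (1 - h_ii) of a single fit on all 50 points. A sketch using statsmodels' influence measures (fm_full is just a fresh full-data fit, since fm still holds the model from the last loop iteration):

# closed-form LOOCV errors for OLS via PRESS residuals
fm_full = smf.ols(formula = 'speed ~ dist', data = cars).fit()
loo_error = fm_full.get_influence().resid_press  # y_i minus the leave-one-out prediction
np.mean(np.abs(loo_error))                       # should reproduce the MAE above
np.sqrt(np.mean(np.square(loo_error)))           # and the RMSE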

Conclusion

  • Results are identical to those produced in R: Link
  • Also, the discrepancy with Friedemann’s results is now resolved