Leave-one-out cross-validation in Python
Strategy
- Look at the cars dataset that is shipped with base R
- Fit models
- For each model, leave out one data point
- Predict the response value for that data point from the model that was fitted without it
- Calculate the mean absolute error and the root mean squared error (see the formulas below)
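For reference, with observed speeds $y_i$, leave-one-out predictions $\hat{y}_i$ and $n = 50$ observations, the two error measures are

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$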
Implementation
Init
Chunk in R to set up my Python environment (can be omitted by users who already have Python ready to use):

library('reticulate')
use_condaenv('sbloggel', required = TRUE)
Chunk to load Python libraries and set up the data
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
cars = pd.DataFrame(r['cars'])
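The r['cars'] object is only available when this runs inside an R Markdown document via reticulate. As an alternative (not used in this post), the same dataset can be fetched from the Rdatasets repository through statsmodels, assuming an internet connection is available:

import statsmodels.api as sm

# alternative way to obtain the cars data without reticulate
cars = sm.datasets.get_rdataset('cars').data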
Show that fitting in R and Python yields identical results
fm = smf.ols(formula = 'speed ~ dist', data = cars).fit()
fm.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                  speed   R-squared:                       0.651
Model:                            OLS   Adj. R-squared:                  0.644
Method:                 Least Squares   F-statistic:                     89.57
Date:                Wed, 23 Aug 2023   Prob (F-statistic):           1.49e-12
Time:                        12:04:41   Log-Likelihood:                -127.39
No. Observations:                  50   AIC:                             258.8
Df Residuals:                      48   BIC:                             262.6
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      8.2839      0.874      9.474      0.000       6.526      10.042
dist           0.1656      0.017      9.464      0.000       0.130       0.201
==============================================================================
Omnibus:                        0.720   Durbin-Watson:                   1.195
Prob(Omnibus):                  0.698   Jarque-Bera (JB):                0.827
Skew:                          -0.207   Prob(JB):                        0.661
Kurtosis:                       2.526   Cond. No.                         98.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
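The coefficient estimates can also be extracted on their own, which makes the comparison with the R output below easier (a small addition to the original post):

# coefficient estimates of the statsmodels fit (compare with the R coefficients below)
fm.params   # expected: Intercept ≈ 8.2839, dist ≈ 0.1656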
fm <- lm(speed ~ dist, data = cars)
summary(fm)
Call:
lm(formula = speed ~ dist, data = cars)
Residuals:
Min 1Q Median 3Q Max
-7.5293 -2.1550 0.3615 2.4377 6.4179
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.28391 0.87438 9.474 1.44e-12 ***
dist 0.16557 0.01749 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.156 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Show that prediction in R and Python yields identical results
fm.predict(cars.iloc[0])
0 8.615041
dtype: float64
predict(fm, cars[1, ])
1
8.615041
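As a sanity check (not part of the original post), this value can be reproduced by hand from the fitted coefficients; the first observation of cars has dist = 2:

# intercept + slope * dist for the first observation (dist = 2)
8.28391 + 0.16557 * 2   # 8.61505, matching the predictions above up to rounding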
Perform leave-one-out cross-validation in Python
# prepare column for predictions:
cars['prediction'] = np.nan

# fit models:
for i in range(50):
    train = cars.drop(index = i)
    test = cars.iloc[i]
    fm = smf.ols(formula='speed ~ dist', data=train).fit()
    cars['prediction'].iloc[i] = fm.predict(test)

# calculate prediction error:
cars['error'] = cars['speed'] - cars['prediction']

# mean absolute error:
np.mean(np.abs(cars['error']))

2.6335683917489323

# root mean squared error:
np.sqrt(np.mean(np.square(cars['error'])))

3.244276599631193
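The same procedure can be written more compactly with scikit-learn, which ships a LeaveOneOut splitter. This is not used in the post and assumes scikit-learn is installed; it is just a sketch of the equivalent computation, reusing cars and np from above:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import LeaveOneOut, cross_val_predict

X = cars[['dist']]
y = cars['speed']

# one prediction per observation, each made by a model fitted without that observation
pred = cross_val_predict(LinearRegression(), X, y, cv = LeaveOneOut())

mean_absolute_error(y, pred)            # mean absolute error, should be ~2.63
np.sqrt(mean_squared_error(y, pred))    # root mean squared error, should be ~3.24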
Conclusion
- Results are identical to those produced in R: Link
- Also, the discrepancy with Friedemann’s results is now resolved