Cars dataset and tidymodels

R
Linear model vs neural nets
Published: August 8, 2023

Init

set.seed(1)
library('tidymodels')
── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
✔ broom        1.0.4     ✔ recipes      1.0.6
✔ dials        1.2.0     ✔ rsample      1.1.1
✔ dplyr        1.1.2     ✔ tibble       3.2.1
✔ ggplot2      3.4.2     ✔ tidyr        1.3.0
✔ infer        1.0.4     ✔ tune         1.1.1
✔ modeldata    1.1.0     ✔ workflows    1.1.3
✔ parsnip      1.1.0     ✔ workflowsets 1.0.1
✔ purrr        1.0.1     ✔ yardstick    1.2.0
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/
library('knitr')

Do v-fold cross-validation on the cars dataset using a linear model, a random forest, and a neural net

# Create data for resampling:
cars_folds <- vfold_cv(cars, v = 10, repeats = 1)

# Create model specifications:
lm_spec <- linear_reg()
rf_spec <- rand_forest() |> set_mode('regression')
nn_spec <- mlp(epochs = 500, hidden_units = 3, penalty = 0.01) |> 
  set_mode('regression')

# Create workflows:
all_workflows <- 
  workflow_set(
    preproc = list("formula" = speed ~ dist),
    models = list(lm = lm_spec, 
                  rf = rf_spec,
                  nn = nn_spec)
  )

# Run the workflows:
result <- all_workflows |> 
  workflow_map(resamples = cars_folds,
               control = control_resamples(save_pred = TRUE))
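To demystify what happens per resample, here is a hand-rolled single fold in base R. This is my own sketch of the idea (not the actual tidymodels internals): fit on the analysis rows, predict the held-out assessment rows, and score those predictions.

```r
set.seed(1)
# One fold of 10-fold CV on cars, done by hand: each row gets a fold id,
# fold 1 is held out for assessment, the rest is the analysis set.
fold_id  <- sample(rep(1:10, length.out = nrow(cars)))
assess   <- cars[fold_id == 1, ]   # held-out (assessment) rows
analysis <- cars[fold_id != 1, ]   # training (analysis) rows

fit  <- lm(speed ~ dist, data = analysis)    # fit on the analysis set only
pred <- predict(fit, newdata = assess)       # predict the held-out rows
rmse <- sqrt(mean((assess$speed - pred)^2))  # fold-level RMSE
rmse
```

As I understand it, `collect_metrics()` then averages the ten fold-level values into the `mean` column and reports their standard error as `std_err`.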

# Display summary statistics for all model types:
(metrics <- collect_metrics(result))
# A tibble: 6 × 9
  wflow_id   .config        preproc model .metric .estimator  mean     n std_err
  <chr>      <chr>          <chr>   <chr> <chr>   <chr>      <dbl> <int>   <dbl>
1 formula_lm Preprocessor1… formula line… rmse    standard   3.18     10  0.288 
2 formula_lm Preprocessor1… formula line… rsq     standard   0.656    10  0.0885
3 formula_rf Preprocessor1… formula rand… rmse    standard   3.14     10  0.340 
4 formula_rf Preprocessor1… formula rand… rsq     standard   0.621    10  0.0901
5 formula_nn Preprocessor1… formula mlp   rmse    standard   2.99     10  0.253 
6 formula_nn Preprocessor1… formula mlp   rsq     standard   0.643    10  0.0834
metrics |> pivot_wider(names_from = .metric, 
                       values_from = c(mean, std_err), 
                       id_cols = wflow_id) |> 
  kable(align = 'c')
|  wflow_id  | mean_rmse | mean_rsq  | std_err_rmse | std_err_rsq |
|:----------:|:---------:|:---------:|:------------:|:-----------:|
| formula_lm | 3.180583  | 0.6561715 |  0.2883717   |  0.0885333  |
| formula_rf | 3.142965  | 0.6212505 |  0.3399041   |  0.0901077  |
| formula_nn | 2.994578  | 0.6426283 |  0.2528375   |  0.0833891  |
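The summary columns can be reproduced from the unsummarized per-fold values, which is how I read `collect_metrics()`: `mean` is the average over the ten folds and `std_err` the standard error of those ten values. A sketch, assuming the `result` object from above:

```r
# summarize = FALSE keeps one row per fold instead of aggregating:
per_fold <- collect_metrics(result, summarize = FALSE) |>
  filter(wflow_id == "formula_lm", .metric == "rmse")

mean(per_fold$.estimate)           # should match the reported mean (~3.18)
sd(per_fold$.estimate) / sqrt(10)  # should match the reported std_err (~0.29)
```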
# Plot predictions against actual values for all predictions:
collect_predictions(result) |> 
  pivot_longer(cols = model) |> 
  ggplot(aes(speed, .pred, colour = value)) + 
  geom_point() + 
  geom_abline(slope = 1, intercept = 0) +
  labs(x = 'Real value of speed', y = 'Prediction', colour = 'Model')

Open Questions

  • I don’t exactly understand how tidymodels performs cross-validation or how the metrics are calculated. The tidymodels documentation has not been very informative to me so far (I probably need to read more of it).
  • How to collect other metrics, e.g. mean absolute error?
  • How to conveniently join the predictors to the prediction tibble?
  • How do I extract honest coefficients from the linear model?
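For what it's worth, here are sketches of how I would attempt the last three questions, using a single linear-model workflow for brevity. `metric_set()`, the `.row` column of `collect_predictions()`, and `extract_fit_parsnip()` are real yardstick/tune/workflows features; the variable names are my own, and the coefficient extraction shown is from a fit on the full data, which may or may not be what "honest" means here.

```r
library(tidymodels)
set.seed(1)

cars_folds <- vfold_cv(cars, v = 10)
lm_wf <- workflow(speed ~ dist, linear_reg())

# Other metrics: pass a yardstick metric set to fit_resamples()
# (workflow_map() forwards the same `metrics` argument):
lm_res <- fit_resamples(lm_wf, resamples = cars_folds,
                        metrics = metric_set(rmse, mae, rsq),
                        control = control_resamples(save_pred = TRUE))
collect_metrics(lm_res)   # now includes mae

# Joining predictors: collect_predictions() carries a .row index into
# the original data, so a left_join recovers the predictor columns:
preds <- collect_predictions(lm_res) |>
  left_join(cars |> mutate(.row = row_number()) |> select(.row, dist),
            by = ".row")

# Coefficients: fit the workflow on the full data and tidy the
# underlying parsnip/lm fit:
lm_wf |> fit(cars) |> extract_fit_parsnip() |> tidy()
```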