Cars dataset and tidymodels

R
Linear model vs neural nets
Published: August 8, 2023

Init

set.seed(1)
library('tidymodels')
── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
✔ broom        1.0.4     ✔ recipes      1.0.6
✔ dials        1.2.0     ✔ rsample      1.1.1
✔ dplyr        1.1.2     ✔ tibble       3.2.1
✔ ggplot2      3.4.2     ✔ tidyr        1.3.0
✔ infer        1.0.4     ✔ tune         1.1.1
✔ modeldata    1.1.0     ✔ workflows    1.1.3
✔ parsnip      1.1.0     ✔ workflowsets 1.0.1
✔ purrr        1.0.1     ✔ yardstick    1.2.0
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/
library('knitr')

Do v-fold cross-validation on the cars dataset using a linear model, a random forest, and a neural net

# Create data for resampling:
cars_folds <- vfold_cv(cars, v = 10, repeats = 1)

# Create model specifications:
lm_spec <- linear_reg()
rf_spec <- rand_forest() |> set_mode('regression')
nn_spec <- mlp(epochs = 500, hidden_units = 3, penalty = 0.01) |> 
  set_mode('regression')

# Create workflows:
all_workflows <- 
  workflow_set(
    preproc = list("formula" = speed ~ dist),
    models = list(lm = lm_spec, 
                  rf = rf_spec,
                  nn = nn_spec)
  )

# Run the workflows:
result <- all_workflows |> 
  workflow_map(resamples = cars_folds,
               control = control_resamples(save_pred = TRUE))
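To demystify what happens per resample, here is a hand-rolled single fold in base R. This is my own sketch of the idea (not the actual tidymodels internals): fit on the analysis rows, predict the held-out assessment rows, and score those predictions.

```r
set.seed(1)
# One fold of 10-fold CV on cars, done by hand: each row gets a fold id,
# fold 1 is held out for assessment, the rest is the analysis set.
fold_id  <- sample(rep(1:10, length.out = nrow(cars)))
assess   <- cars[fold_id == 1, ]   # held-out (assessment) rows
analysis <- cars[fold_id != 1, ]   # training (analysis) rows

fit  <- lm(speed ~ dist, data = analysis)    # fit on the analysis set only
pred <- predict(fit, newdata = assess)       # predict the held-out rows
rmse <- sqrt(mean((assess$speed - pred)^2))  # fold-level RMSE
rmse
```

As I understand it, `collect_metrics()` then averages the ten fold-level values into the `mean` column and reports their standard error as `std_err`.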

# Display summary statistics for all model types:
(metrics <- collect_metrics(result))
# A tibble: 6 × 9
  wflow_id   .config        preproc model .metric .estimator  mean     n std_err
  <chr>      <chr>          <chr>   <chr> <chr>   <chr>      <dbl> <int>   <dbl>
1 formula_lm Preprocessor1… formula line… rmse    standard   3.18     10  0.288 
2 formula_lm Preprocessor1… formula line… rsq     standard   0.656    10  0.0885
3 formula_rf Preprocessor1… formula rand… rmse    standard   3.14     10  0.340 
4 formula_rf Preprocessor1… formula rand… rsq     standard   0.621    10  0.0901
5 formula_nn Preprocessor1… formula mlp   rmse    standard   2.99     10  0.253 
6 formula_nn Preprocessor1… formula mlp   rsq     standard   0.643    10  0.0834
metrics |> pivot_wider(names_from = .metric, 
                       values_from = c(mean, std_err), 
                       id_cols = wflow_id) |> 
  kable(align = 'c')
|  wflow_id  | mean_rmse | mean_rsq  | std_err_rmse | std_err_rsq |
|:----------:|:---------:|:---------:|:------------:|:-----------:|
| formula_lm | 3.180583  | 0.6561715 |  0.2883717   |  0.0885333  |
| formula_rf | 3.142965  | 0.6212505 |  0.3399041   |  0.0901077  |
| formula_nn | 2.994578  | 0.6426283 |  0.2528375   |  0.0833891  |
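The summary columns can be reproduced from the unsummarized per-fold values, which is how I read `collect_metrics()`: `mean` is the average over the ten folds and `std_err` the standard error of those ten values. A sketch, assuming the `result` object from above:

```r
# summarize = FALSE keeps one row per fold instead of aggregating:
per_fold <- collect_metrics(result, summarize = FALSE) |>
  filter(wflow_id == "formula_lm", .metric == "rmse")

mean(per_fold$.estimate)           # should match the reported mean (~3.18)
sd(per_fold$.estimate) / sqrt(10)  # should match the reported std_err (~0.29)
```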
# Plot predictions against actual values for all predictions:
collect_predictions(result) |> 
  pivot_longer(cols = model) |> 
  ggplot(aes(speed, .pred, colour = value)) + 
  geom_point() + 
  geom_abline(slope = 1, intercept = 0) +
  labs(x = 'Real value of speed', y = 'Prediction', colour = 'Model')

Open Questions

  • I don’t exactly understand how tidymodels performs cross-validation or how the metrics are calculated. The tidymodels documentation has not been very informative to me so far (I probably need to read more of it).
  • How to collect other metrics, e.g. mean absolute error?
  • How to conveniently join the predictors to the prediction tibble?
  • How do I extract honest coefficients from the linear model?
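For what it's worth, here are sketches of how I would attempt the last three questions, using a single linear-model workflow for brevity. `metric_set()`, the `.row` column of `collect_predictions()`, and `extract_fit_parsnip()` are real yardstick/tune/workflows features; the variable names are my own, and the coefficient extraction shown is from a fit on the full data, which may or may not be what "honest" means here.

```r
library(tidymodels)
set.seed(1)

cars_folds <- vfold_cv(cars, v = 10)
lm_wf <- workflow(speed ~ dist, linear_reg())

# Other metrics: pass a yardstick metric set to fit_resamples()
# (workflow_map() forwards the same `metrics` argument):
lm_res <- fit_resamples(lm_wf, resamples = cars_folds,
                        metrics = metric_set(rmse, mae, rsq),
                        control = control_resamples(save_pred = TRUE))
collect_metrics(lm_res)   # now includes mae

# Joining predictors: collect_predictions() carries a .row index into
# the original data, so a left_join recovers the predictor columns:
preds <- collect_predictions(lm_res) |>
  left_join(cars |> mutate(.row = row_number()) |> select(.row, dist),
            by = ".row")

# Coefficients: fit the workflow on the full data and tidy the
# underlying parsnip/lm fit:
lm_wf |> fit(cars) |> extract_fit_parsnip() |> tidy()
```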