Prediction

Who will win the FIFA Women’s World Cup?

https://theanalyst.com/eu/2023/07/womens-world-cup-predictions-2023/

Prediction examples

  • Success of a transplant
  • Stock prices
  • Abundance of fish species
  • Diagnosis from medical images
  • Image recognition

(Walsh and Tardy 2022)

Prediction

We wish to predict what the response \(\hat{y}\) will be for new observations of covariates.

  • Which subset of variables is best
  • Which method is best at predicting the response
  • How to tune hyper-parameters (machine learning)
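
As a rough sketch in Python (scikit-learn assumed, data simulated), prediction means fitting a model to observed data and then producing \(\hat{y}\) for new covariate values:

```python
# Minimal prediction sketch on simulated data (scikit-learn assumed available).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                               # observed covariates
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=100)   # observed response

model = LinearRegression().fit(X, y)     # fit on the observed data

X_new = rng.normal(size=(5, 3))          # new observations of the covariates
y_hat = model.predict(X_new)             # predicted responses (y-hat)
print(y_hat)
```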

Inference

  • Hypothesis testing - test a specific hypothesis or claim, for example whether a variable is associated with the outcome.
  • Estimating the magnitude of the effect of a covariate, adjusted for other covariates.

Prediction vs Inference

There is some overlap between models for prediction and inference.

  • Some covariates may appear in both types of model
  • Some methods can be used for both goals, e.g. linear regression.

However, not all covariates may be used in both models.

Correlated covariates can be included in prediction models, but they are not suitable for inference because their individual effects cannot be estimated reliably.

Assumptions we need to make

Models for inference typically rely on various assumptions holding, e.g. linearity and equal variance, whereas methods for prediction can be non-parametric (or “black-box”).

Our interest in interpretation is different.

For prediction models the parameters themselves are not of interest, so they do not need to be interpretable.

Prediction vs Inference

  • How we approach a prediction problem is different from how we approach inference.

  • Prediction and inference should not be carried out together on the same dataset. Check out the seminar

Like Cats and Dogs – Why model selection and inference just can’t get along

  • If inference is your main goal, there are lots of seminars by Stats Central on this topic.

Prediction approach

The performance of a method is judged by its prediction accuracy on independent test data.

Randomly divide the data into three parts (training, validation, and test), accounting for any temporal, spatial, or other structure in the data; otherwise predictive performance will appear better than it really is (Roberts et al. 2017).

How should we choose the number of observations in each part? There is no definitive rule.
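
A minimal sketch of a random three-way split in Python (scikit-learn assumed; the 60/20/20 proportions are illustrative only, and a purely random split is only appropriate when there is no such structure):

```python
# Illustrative random 60/20/20 split into training, validation, and test sets.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))
y = rng.normal(size=500)

# First set aside the test set, then split the remainder into training/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% gives a 60/20/20 split overall.
```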

Training data set

The training set is used to fit the models/methods. There are many methods to choose from, including:

  • Linear regression
  • Generalized additive models
  • Lasso
  • Tree-based methods
  • Neural networks
  • Support vector machines
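
A sketch of fitting a few of these candidate methods to the same (simulated) training data, using their scikit-learn implementations:

```python
# Fit several candidate methods on the training set only (data simulated here).
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, 10))
y_train = X_train[:, 0] - 2 * X_train[:, 1] + rng.normal(size=200)

models = {
    "linear regression": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)   # each method sees the training data only
```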

Validation data set

  • The validation set is used to evaluate predictive performance on data independent of the training data.
  • This is important to help avoid over-fitting.

Warton (2022)

Validation data set

There are many different metrics used to measure the predictive error. Some common measures are:

  • Mean squared error
  • Mean absolute error
  • Misclassification error
  • Sensitivity and specificity
  • Area under ROC curve
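
A sketch of computing some of these metrics on (made-up) validation predictions with scikit-learn:

```python
# Illustrative computation of common validation metrics.
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             accuracy_score, roc_auc_score)

# Continuous response: compare validation predictions with observed values.
y_val = np.array([3.1, 0.5, 2.2, 4.0])
y_pred = np.array([2.8, 0.9, 2.0, 3.5])
print("MSE:", mean_squared_error(y_val, y_pred))
print("MAE:", mean_absolute_error(y_val, y_pred))

# Binary response: misclassification error and area under the ROC curve.
y_val_class = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.4, 0.8, 0.3])      # predicted probabilities
y_pred_class = (p_pred > 0.5).astype(int)
print("Misclassification:", 1 - accuracy_score(y_val_class, y_pred_class))
print("AUC:", roc_auc_score(y_val_class, p_pred))
```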

Validation data set

There are different ways of splitting the data into training and validation sets.

  • Holdout method
  • k-fold cross-validation
  • Leave-one-out cross-validation

5-fold cross-validation
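
A minimal sketch of 5-fold cross-validation in scikit-learn, where each fold serves once as the validation set:

```python
# 5-fold cross-validation: the data are split into 5 folds, each fold serving
# once as the validation set while the other 4 folds are used for training.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
y = X[:, 0] + rng.normal(size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Lasso(alpha=0.1), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print("Mean CV MSE:", -scores.mean())
```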

Word of caution

  • Cross-validation must be applied to the entire sequence of modelling steps (Hastie et al. 2009).
  • For example, if you want to reduce the number of predictors prior to modelling, you should not filter them using all of the samples; filter them using the training data only.
  • However, filtering on all samples is acceptable if it is done without looking at the response (e.g. removing predictors with near-zero variance).
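
One way to respect this in scikit-learn (a sketch, assuming a supervised filter such as SelectKBest) is to put the filtering step inside a pipeline, so that it is re-fitted within each training fold:

```python
# Supervised feature filtering placed *inside* the cross-validation loop,
# so the filter only ever sees the training folds.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 200))          # many candidate predictors
y = X[:, 0] + rng.normal(size=100)

pipe = make_pipeline(SelectKBest(f_regression, k=10), LinearRegression())
scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_squared_error")
# Filtering on all 100 samples *before* cross-validating would leak information
# about the response and make the CV error look optimistically small.
print("Mean CV MSE:", -scores.mean())
```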

Test data set

  • Data are set aside at the beginning of the analysis.
  • Should be looked at only once.
  • The test set is used for assessment of the prediction error of the final chosen model.
  • Publish the results from this data set.
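
A sketch of this final step (illustrative model and split; the test set is touched exactly once):

```python
# One-time assessment of the final chosen model on the held-out test set.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
X = rng.normal(size=(300, 10))
y = X[:, 0] + rng.normal(size=300)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

final_model = Lasso(alpha=0.1).fit(X_rest, y_rest)   # chosen via the validation step
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print("Reported test MSE:", test_mse)                # this is the number to publish
```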

Prediction workflow

Hosseini et al. (2020)

Some other methods

If the results from the test set are not good, you should not re-analyse the data.

Other methods for preventing overfitting are:

  • Nested cross-validation
  • Blind analysis
  • Pre-registration approach
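
As a sketch, nested cross-validation can be written in scikit-learn by wrapping a tuned model (the inner loop) inside an outer cross-validation loop:

```python
# Nested cross-validation: the inner loop tunes the hyper-parameter (the lasso
# penalty), the outer loop estimates the prediction error of the whole
# tuning-plus-fitting procedure.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(13)
X = rng.normal(size=(100, 20))
y = X[:, 0] - X[:, 1] + rng.normal(size=100)

inner = GridSearchCV(Lasso(), {"alpha": [0.01, 0.1, 1.0]}, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=5,
                               scoring="neg_mean_squared_error")
print("Nested CV MSE:", -outer_scores.mean())
```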

References

Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.
Hosseini, Mahan, Michael Powell, John Collins, Chloe Callahan-Flintoft, William Jones, Howard Bowman, and Brad Wyble. 2020. “I Tried a Bunch of Things: The Dangers of Unexpected Overfitting in Classification of Brain Data.” Neuroscience & Biobehavioral Reviews 119: 456–67.
Roberts, David R, Volker Bahn, Simone Ciuti, Mark S Boyce, Jane Elith, Gurutzeta Guillera-Arroita, Severin Hauenstein, et al. 2017. “Cross-Validation Strategies for Data with Temporal, Spatial, Hierarchical, or Phylogenetic Structure.” Ecography 40 (8): 913–29.
Walsh, Ricky, and Mickael Tardy. 2022. “A Comparison of Techniques for Class Imbalance in Deep Learning Classification of Breast Cancer.” Diagnostics 13 (1): 67.
Warton, David I. 2022. Eco-Stats: Data Analysis in Ecology: From t-Tests to Multivariate Abundances. Springer Nature.