Who will win the FIFA Women’s World Cup?
We wish to predict what the response \(\hat{y}\) will be for new observations of covariates.
There is some overlap between models for prediction and inference.
Correlated covariates can be used in prediction models, but they make inference about individual effects unreliable.
Models for inference typically rely on various assumptions holding, e.g. linearity and equal variance, whereas methods for prediction can be non-parametric (or “black-box”).
The parameters are not of interest for prediction models, so they don’t need to be interpretable.
How we approach a prediction problem differs from how we approach inference.
Prediction and inference should not be carried out together on the same dataset. Check out the seminar “Like Cats and Dogs – Why model selection and inference just can’t get along”.
The performance of a method is judged by how well it predicts on independent test data.
How should we choose the number of observations in each part? Not sure!
Randomly divide the data into three parts (training, validation, and test), accounting for temporal, spatial, or other structure in the data. Otherwise predictive performance will appear better than it really is (Roberts et al. 2017).
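One way to respect structure when splitting is to assign whole groups (e.g. sites or years) to a single part, so that related observations never straddle two parts. The sketch below is a minimal illustration using only the Python standard library; the function name `split_by_group`, the 60/20/20 fractions, and the example sites are all assumptions made for illustration, not a recommendation from the sources cited above.

```python
import random

def split_by_group(groups, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole groups to train/validation/test and return row indices.

    groups    : one group label per observation (hypothetical, e.g. a site ID)
    fractions : share of *groups* (not rows) in train, validation, test
    """
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)  # random order of groups

    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))

    # Map each group to exactly one part
    part_of = {}
    for i, g in enumerate(unique):
        if i < n_train:
            part_of[g] = "train"
        elif i < n_train + n_val:
            part_of[g] = "validation"
        else:
            part_of[g] = "test"

    # Collect the row indices belonging to each part
    parts = {"train": [], "validation": [], "test": []}
    for idx, g in enumerate(groups):
        parts[part_of[g]].append(idx)
    return parts

# Hypothetical example: 10 sites with 3 observations each
sites = [s for s in range(10) for _ in range(3)]
parts = split_by_group(sites)
```

Because the shuffle is applied to groups rather than rows, all observations from a given site end up in the same part. For temporal data, a similar idea would be to hold out the most recent block of time instead of shuffling.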
The training set is used to fit the models/methods. There are so many methods to choose from including:
Warton (2022)
There are many different metrics used to measure the predictive error. Some common measures are:
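As an illustration of how such metrics are computed, the sketch below implements two measures that are widely used for a continuous response: root mean squared error (RMSE) and mean absolute error (MAE). Choosing these two is an assumption made for the example; the observed and predicted values are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more heavily."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)

# Made-up observed and predicted responses
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]

print(f"RMSE = {rmse(y_true, y_pred):.3f}")
print(f"MAE  = {mae(y_true, y_pred):.3f}")
```

Both are measured on the scale of the response and should be computed on data that were not used to fit the model.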
There are different ways of splitting the data into training and validation sets.
Hosseini et al. (2020)
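One commonly used splitting scheme is k-fold cross-validation, where each observation is used for validation exactly once. The sketch below is a minimal standard-library illustration; the function name `k_fold_indices` and the choice of k = 5 are assumptions for the example, and it shuffles rows at random, so it ignores any temporal or spatial structure.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)

    # Deal the shuffled indices into k roughly equal folds
    folds = [idx[i::k] for i in range(k)]

    for i in range(k):
        val = folds[i]
        # Training set is every fold except the i-th
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val

# Hypothetical example: 10 observations, 5 folds
splits = list(k_fold_indices(10, k=5))
for train, val in splits:
    print("validation rows:", sorted(val))
```

With structured data, the same loop could be run over whole groups (sites, years) rather than individual rows, so that each fold holds out complete blocks.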
If the results from the test set are not good, you should not re-analyse the data: once the test set has influenced your modelling choices, it is no longer independent.
Other methods for preventing overfitting are: