Prediction

Who will win the FIFA Women’s World Cup?

https://theanalyst.com/eu/2023/07/womens-world-cup-predictions-2023/

Prediction examples

  • Success of a transplant
  • Stock prices
  • Abundance of fish species
  • Diagnosis from medical images
  • Image recognition

(Walsh and Tardy 2022)

Prediction

We wish to predict what the response \(\hat{y}\) will be for new observations of covariates.

  • Which subset of variables is best
  • Which method is best at predicting the response
  • How to tune hyper-parameters (machine learning)
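
As a rough sketch in Python (scikit-learn assumed, data simulated), prediction means fitting a model to observed data and then producing \(\hat{y}\) for new covariate values:

```python
# Minimal prediction sketch on simulated data (scikit-learn assumed available).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                               # observed covariates
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=100)   # observed response

model = LinearRegression().fit(X, y)     # fit on the observed data

X_new = rng.normal(size=(5, 3))          # new observations of the covariates
y_hat = model.predict(X_new)             # predicted responses (y-hat)
print(y_hat)
```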

Inference

  • Hypothesis testing - test a specific hypothesis or claim, for example whether a variable is associated with the outcome.
  • Estimating the magnitude of the effect of a covariate, adjusted for other covariates.

Prediction vs Inference

There is some overlap between models for prediction and inference.

  • Some covariates may appear in both types of model
  • Some methods can be used for both goals, e.g. linear regression.

However, not all covariates may be used in both models.

Correlated covariates can be included in prediction models, but they are not suitable for inference because their individual effects cannot be estimated reliably.

Assumptions we need to make

Models for inference typically rely on various assumptions holding, e.g. linearity and equal variance, whereas methods for prediction can be non-parametric (or “black-box”).

Our interest in interpretation is different.

For prediction models the parameters themselves are not of interest, so they do not need to be interpretable.

Prediction vs Inference

  • How we approach a prediction problem is different from how we approach inference.

  • Prediction and inference should not be carried out together on the same dataset. Check out the seminar

Like Cats and Dogs – Why model selection and inference just can’t get along

  • If inference is your main goal, there are lots of seminars by Stats Central on this topic.

Prediction approach

The performance of a method is judged by its prediction accuracy on independent test data.

Randomly divide the data into three parts (training, validation, and test), accounting for any temporal, spatial, or other structure in the data; otherwise predictive performance will appear better than it really is (Roberts et al. 2017).

How should we choose the number of observations in each part? There is no definitive rule.
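
A minimal sketch of a random three-way split in Python (scikit-learn assumed; the 60/20/20 proportions are illustrative only, and a purely random split is only appropriate when there is no such structure):

```python
# Illustrative random 60/20/20 split into training, validation, and test sets.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))
y = rng.normal(size=500)

# First set aside the test set, then split the remainder into training/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% gives a 60/20/20 split overall.
```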

Training data set

The training set is used to fit the models/methods. There are many methods to choose from, including:

  • Linear regression
  • Generalized additive models
  • Lasso
  • Tree-based methods
  • Neural networks
  • Support vector machines
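
A sketch of fitting a few of these candidate methods to the same (simulated) training data, using their scikit-learn implementations:

```python
# Fit several candidate methods on the training set only (data simulated here).
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, 10))
y_train = X_train[:, 0] - 2 * X_train[:, 1] + rng.normal(size=200)

models = {
    "linear regression": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)   # each method sees the training data only
```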

Validation data set

  • The validation set is used to evaluate predictive performance on data independent of the training data.
  • This is important to help avoid over-fitting.

Warton (2022)

Validation data set

There are many different metrics used to measure the predictive error. Some common measures are:

  • Mean squared error
  • Mean absolute error
  • Misclassification error
  • Sensitivity and specificity
  • Area under ROC curve
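
A sketch of computing some of these metrics on (made-up) validation predictions with scikit-learn:

```python
# Illustrative computation of common validation metrics.
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             accuracy_score, roc_auc_score)

# Continuous response: compare validation predictions with observed values.
y_val = np.array([3.1, 0.5, 2.2, 4.0])
y_pred = np.array([2.8, 0.9, 2.0, 3.5])
print("MSE:", mean_squared_error(y_val, y_pred))
print("MAE:", mean_absolute_error(y_val, y_pred))

# Binary response: misclassification error and area under the ROC curve.
y_val_class = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.4, 0.8, 0.3])      # predicted probabilities
y_pred_class = (p_pred > 0.5).astype(int)
print("Misclassification:", 1 - accuracy_score(y_val_class, y_pred_class))
print("AUC:", roc_auc_score(y_val_class, p_pred))
```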

Validation data set

There are different ways of splitting the data into training and validation sets.

  • Holdout method
  • k-fold cross-validation
  • Leave-one-out cross-validation

5-fold cross-validation
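
A minimal sketch of 5-fold cross-validation in scikit-learn, where each fold serves once as the validation set:

```python
# 5-fold cross-validation: the data are split into 5 folds, each fold serving
# once as the validation set while the other 4 folds are used for training.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
y = X[:, 0] + rng.normal(size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Lasso(alpha=0.1), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print("Mean CV MSE:", -scores.mean())
```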

Word of caution

  • Cross-validation must be applied to the entire sequence of modelling steps (Hastie et al. 2009).
  • For example, if you want to reduce the number of predictors prior to modelling, you should not filter them using all of the samples; filter them using the training data only.
  • However, filtering on all samples is acceptable if it is done without looking at the response (e.g. removing predictors with near-zero variance).
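
One way to respect this in scikit-learn (a sketch, assuming a supervised filter such as SelectKBest) is to put the filtering step inside a pipeline, so that it is re-fitted within each training fold:

```python
# Supervised feature filtering placed *inside* the cross-validation loop,
# so the filter only ever sees the training folds.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 200))          # many candidate predictors
y = X[:, 0] + rng.normal(size=100)

pipe = make_pipeline(SelectKBest(f_regression, k=10), LinearRegression())
scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_squared_error")
# Filtering on all 100 samples *before* cross-validating would leak information
# about the response and make the CV error look optimistically small.
print("Mean CV MSE:", -scores.mean())
```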

Test data set

  • Data are set aside at the beginning of the analysis.
  • Should be looked at only once.
  • The test set is used for assessment of the prediction error of the final chosen model.
  • Publish the results from this data set.
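
A sketch of this final step (illustrative model and split; the test set is touched exactly once):

```python
# One-time assessment of the final chosen model on the held-out test set.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
X = rng.normal(size=(300, 10))
y = X[:, 0] + rng.normal(size=300)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

final_model = Lasso(alpha=0.1).fit(X_rest, y_rest)   # chosen via the validation step
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print("Reported test MSE:", test_mse)                # this is the number to publish
```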

Prediction workflow

Hosseini et al. (2020)

Some other methods

If the results from the test set are not good, you should not re-analyse the data.

Other methods for preventing overfitting are:

  • Nested cross-validation
  • Blind analysis
  • Pre-registration approach
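
As a sketch, nested cross-validation can be written in scikit-learn by wrapping a tuned model (the inner loop) inside an outer cross-validation loop:

```python
# Nested cross-validation: the inner loop tunes the hyper-parameter (the lasso
# penalty), the outer loop estimates the prediction error of the whole
# tuning-plus-fitting procedure.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(13)
X = rng.normal(size=(100, 20))
y = X[:, 0] - X[:, 1] + rng.normal(size=100)

inner = GridSearchCV(Lasso(), {"alpha": [0.01, 0.1, 1.0]}, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=5,
                               scoring="neg_mean_squared_error")
print("Nested CV MSE:", -outer_scores.mean())
```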

References

Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.
Hosseini, Mahan, Michael Powell, John Collins, Chloe Callahan-Flintoft, William Jones, Howard Bowman, and Brad Wyble. 2020. “I Tried a Bunch of Things: The Dangers of Unexpected Overfitting in Classification of Brain Data.” Neuroscience & Biobehavioral Reviews 119: 456–67.
Roberts, David R, Volker Bahn, Simone Ciuti, Mark S Boyce, Jane Elith, Gurutzeta Guillera-Arroita, Severin Hauenstein, et al. 2017. “Cross-Validation Strategies for Data with Temporal, Spatial, Hierarchical, or Phylogenetic Structure.” Ecography 40 (8): 913–29.
Walsh, Ricky, and Mickael Tardy. 2022. “A Comparison of Techniques for Class Imbalance in Deep Learning Classification of Breast Cancer.” Diagnostics 13 (1): 67.
Warton, David I. 2022. Eco-Stats: Data Analysis in Ecology: From t-Tests to Multivariate Abundances. Springer Nature.