Siegert et al., "Detecting Improvements in Forecast Correlation Skill: Statistical Testing and Power Analysis"

Siegert, S., Bellprat, O., Ménégoz, M., Stephenson, D. B., & Doblas-Reyes, F. J. (2017). Detecting improvements in forecast correlation skill: Statistical testing and power analysis. Monthly Weather Review, 145(2), 437-450.

The skill of weather and climate forecast systems is often assessed by
calculating the correlation coefficient between past forecasts and their
verifying observations. Improvements in forecast skill can thus be quantified
by correlation differences. The uncertainty in the correlation difference needs
to be assessed to judge whether the observed difference constitutes a genuine
improvement, or is compatible with random sampling variations. A widely used
statistical test for correlation difference is known to be unsuitable, because
it assumes that the competing forecasting systems are independent. In this
paper, appropriate statistical methods are reviewed to assess correlation
differences when the competing forecasting systems are strongly correlated with
one another. The methods are used to compare correlation skill between seasonal
temperature forecasts that differ in initialization scheme and model
resolution. A simple power analysis framework is proposed to estimate the
probability of correctly detecting skill improvements, and to determine the
minimum number of samples required to reliably detect improvements. The
proposed statistical test has higher power to detect improvements than the
traditional test. The main examples suggest that sample sizes of climate
hindcasts should be increased to about 40 years to ensure sufficiently high
power. It is found that seasonal temperature forecasts are significantly
improved by using realistic land surface initial conditions.
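
To make the contrast between the two kinds of test concrete, here is a minimal sketch in Python. It assumes that the dependence-aware test is Williams' t-test for two correlations sharing one variable (in the form given by Steiger, 1980), and that the traditional test is the Fisher z-transform test for independent samples. The abstract names neither test, so both choices are assumptions, as are the function names and the numbers in the example.

```python
import math


def williams_t(r_oa, r_ob, r_ab, n):
    """Williams' t statistic for two correlations that share one variable.

    r_oa, r_ob: correlation of the observations with forecasts A and B.
    r_ab:       correlation between the two forecast systems.
    n:          number of forecast-observation pairs (e.g. hindcast years).
    Returns (t, degrees_of_freedom). Under the null hypothesis of equal
    skill, t is referred to a Student t distribution with n - 3 degrees
    of freedom.
    """
    # Determinant of the 3x3 correlation matrix of (obs, A, B).
    det = 1 - r_oa**2 - r_ob**2 - r_ab**2 + 2 * r_oa * r_ob * r_ab
    r_bar = 0.5 * (r_oa + r_ob)
    denom = 2 * (n - 1) / (n - 3) * det + r_bar**2 * (1 - r_ab) ** 3
    t = (r_oa - r_ob) * math.sqrt((n - 1) * (1 + r_ab) / denom)
    return t, n - 3


def fisher_z_independent(r_oa, r_ob, n):
    """Fisher z statistic that (wrongly, here) treats A and B as independent."""
    se = math.sqrt(2.0 / (n - 3))
    return (math.atanh(r_oa) - math.atanh(r_ob)) / se


# Hypothetical numbers, not taken from the paper: 40 hindcast years,
# two forecast systems that are strongly correlated with each other.
t_stat, dof = williams_t(0.6, 0.5, 0.9, 40)
z_stat = fisher_z_independent(0.6, 0.5, 40)
print(f"Williams t = {t_stat:.2f} (df = {dof}), independent-samples z = {z_stat:.2f}")
```

With these illustrative inputs the dependence-aware statistic is about 1.71 and the independent-samples statistic about 0.62. The same correlation difference therefore looks much less significant when the strong correlation between the two forecast systems is ignored, which is consistent with the abstract's statement that the traditional test has lower power.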