title: “psr Explained” authors: “Thomas McGinn”, “Alan Huebner”, and “Matthew Sisk” date: “04/04/2021” output: rmarkdown::html_vignette vignette: > % % % —

1. Motivation for Package

The psr package is intended for coaches and analysts in the field of sports performance science. It aims to allow users to efficiently calculate metrics to assess measurement reliability as well as individual change in athletic performance. For example, a strength and conditioning coach might be interested in assessing a group of athletes’ speed development via a sprint test. For this task, a team of n athletes may perform the 100-meter sprint test on three different occasions, yielding data such as the following:

Subject Trial 1 Trial 2 Trial 3
1 10.0 12.0 10.9
2 10.1 11.3 11.1
n 9.9 10.3 10.2

Then, the coach would be interested in one or more of the following tasks: 1. Assess the reliability, or stability of the measurements to verify that they are not contaminated by excessive error. 2. Determine which athletes showed a statistically significant improvement in sprint time compared to a baseline measurement. 3. Determine which athletes displayed a practically significant improvement in sprint time.

Specifically, psr can be used to assess measurement reliability via functions that compute the typical error (TE), coefficient of variation (CV), standard error of measurement (SEM), and intra-class correlation coefficient (ICC_long) functions. ICC_long() builds on the psych::ICC() function by taking data in long format as its input, saving the user the task of pivoting the data to wide format. Furthermore, ICC_long allows for the computation of ICC’s for multiple metrics at once. This vignette illustrates the use and interprets the output of each function via example data.

Several of the methods come from the literature of Hopkins (2000), and other well-known sources such as Atkinson & Nevill (1998).

First, the package is loaded:


2. Reliability of Measurement Instrument

The psr package includes four different functions for assessing reliability of measurements: TE(), CV(), SEM(), and ICC_long(). Each will be discussed in turn.

2.1. Typical Error: TE()

The Typical Error (TE) is the within-athlete variation of the measurements. It is a measure of how much one should “typically” expect the measurements of each athlete to vary from trial to trial, due to noise in the data. Such noise could be the result of biological variation in the athletes or measurement error on the part of the instrument being used to record the in the test that the athletes are completing.

If there are just two trials being analyzed, the TE is just the standard deviation of the difference scores of each athlete between the first trial and the second trial, divided by root 2 (Hopkins, 2000). If more than two trials are being studied, the TE is the residual standard error of a linear model with the metric as the response and the subject and trial as independent variable, entered into the model as factors (Hopkins, 2000).

The following (fictional) data set consists of three different athletes performing three trials of a certain task, and three different metrics on bio-mechanical measurements are recorded:

subject <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
trial <- c('Trial 1', 'Trial 2', 'Trial 3', 'Trial 1', 'Trial 2', 'Trial 3', 'Trial 1', 'Trial 2', 'Trial 3')
metric_1 <- c(250, 258, 252, 279, 270, 277, 218, 213, 218)
metric_2 <- c(10, 7, 10, 14, 18, 17, 11, 7, 8)
metric_3 <- c(1214, 1276, 1289, 1037, 1010, 1069, 1481, 1465, 1443)

The TE() function is called by specifying the rows containing the subjects, trials, and metrics:

TE(subject, trial, metric_1, metric_2, metric_3)
##  metric_1 metric_2 metric_3
##  4.690416 2.309401 34.77946

TE is expressed in terms of the units of the original metric. Thus, the TE of metric_3 is much larger than the TE’s of the other two metrics, but that is largely due to the fact that metric_3 is on a different scale than metric_1 and metric_2.

2.2. Coefficient of Variation: CV()

The Coefficient of Variation (CV) is the TE expressed as a percentage of the mean of the data. It is calculated by first computing the TE of each metric, and dividing this value by the mean of the metric, across all athletes and trials. There are inconsistent explanations of the CV in the performance science literature, and this package follows the interpretation found in Hopkins (2000), which denotes the CV as the typical percentage error. In other words, the TE is expressed in absolute terms, while the CV represents the same statistic expressed as a percentage of the mean of the data. The CV is a unit-less measure, which allows for easy comparison of metrics with different units and scales of measurement, whereas the interpretation of the TE is entirely dependent on unit and scale.

CV(subject, trial, metric_1, metric_2, metric_3)
##  metric_1 metric_2 metric_3
##  1.888758 20.37707 2.773974

Using the same data as in the section above, it is seen that the CV of metric_2 is actually much larger than either CV of the other two metrics. Hopkins notes that the CV for a test will typically be 1-5% (Hopkins, 2000). The CV’s of metric_1 and metric_3, respectively, fall within this range, while the CV of metric_2 does not.

2.3. Standard Error of Measurement: SEM()

The Standard Error of Measurement (SEM) is similar to the TE and CV, and is a way of quantifying the absolute reliability of a metric. It is the deviation around the true mean. Unlike the two previous measurements: SEM utilizes the ICC, which is another measure of reliability. The formula for the SEM is the between-subject standard deviation of each athlete’s scores multiplied by root (1 - ICC), as described in Atkinson & Nevill (1998) and de Vet et al. (2006), among others.

The SEM() function is called, using the same data as in the case of the TE() and CV():

SEM(subject, trial, metric_1, metric_2, metric_3, ICC = c(0.99, 0.93, 0.99))
##  metric_1 metric_2 metric_3
##  2.605283 1.090871 18.57181

As noted earlier, SEM() adds one additional argument to those in TE() and CV(). This addition is a list of the individual ICC values for the metrics, which must be entered in order. In the above example, the ICC values for metric_1, metric_2, and metric_3 are 0.92, 0.98, and 0.95, respectively.

Running the code, the familiar form of the output is displayed above. It shows the SEM for each metric, which is useful as a way of comparing the reliability of metrics. The SEM can be interpreted as the minimum change that can be considered real and not due to measurement error (Atkinson & Nevill, 1998).

This output mirrors that of the TE function shown earlier in that metric_2 is more reliable than metric_1, which is more reliable than metric_3. It differs from what the CV output indicates, however, which indicates that metric_2 is actually much less reliable than either metric_1 or metric_3. This pattern emerges because the CV is a unit-less measure, whereas both the TE and the SEM derive their interpretations (at least in part) from the units in which they are measured.

Another point that is important to emphasize is that for any metric, the confidence band that one can form around the observed score of an athlete by adding and subtracting the SEM only covers 68% of the variability, as the statistic should be interpreted in connection to the normal distribution (Atkinson & Nevill, 1998). This is a weaker threshold than that yielded by a typical MDC value, as will be made clear in the section that describes MDC().

2.4. Intra-class Correlation Coefficient: ICC_long()

The Intra-class Correlation Coefficient (ICC) is a measure of reliability for a set of measurements that ranges from 0 to 1, with values closer to 1 indicating a greater degree of reliability. There are six different forms of the ICC that are appropriate for different experimental designs, as detailed in Shrout & Fleiss (1979). The output of ICC_long() prints the ICC values of all six forms, for each metric that was entered as an argument to the function.

ICC_long() is simply a wrapper for psych::ICC(), but has two key differences: it can produce the ICC output for multiple metrics in a single call, and it takes data in long format as its inputs (Revelle, 2020), both of which are explained in detail below.

To illustrate ICC_long() at work, the same example as above can be employed. The function takes the same inputs as TE() and CV(), but yields output that takes a very different form:

ICC_long(subject, trial, metric_1, metric_2, metric_3)
## $metric_1
## Call: psych::ICC(x = df)
## Intraclass correlation coefficients 
##                          type  ICC   F df1 df2       p lower bound upper bound
## Single_raters_absolute   ICC1 0.98 167   2   6 5.5e-06        0.91           1
## Single_random_raters     ICC2 0.98 167   2   4 1.4e-04        0.91           1
## Single_fixed_raters      ICC3 0.98 167   2   4 1.4e-04        0.88           1
## Average_raters_absolute ICC1k 0.99 167   2   6 5.5e-06        0.97           1
## Average_random_raters   ICC2k 0.99 167   2   4 1.4e-04        0.97           1
## Average_fixed_raters    ICC3k 0.99 167   2   4 1.4e-04        0.96           1
##  Number of subjects = 3     Number of Judges =  3
## $metric_2
## Call: psych::ICC(x = df)
## Intraclass correlation coefficients 
##                          type  ICC  F df1 df2      p lower bound upper bound
## Single_raters_absolute   ICC1 0.82 14   2   6 0.0051        0.38        0.99
## Single_random_raters     ICC2 0.82 14   2   4 0.0147        0.38        0.99
## Single_fixed_raters      ICC3 0.82 14   2   4 0.0147        0.27        0.99
## Average_raters_absolute ICC1k 0.93 14   2   6 0.0051        0.64        1.00
## Average_random_raters   ICC2k 0.93 14   2   4 0.0147        0.64        1.00
## Average_fixed_raters    ICC3k 0.93 14   2   4 0.0147        0.52        1.00
##  Number of subjects = 3     Number of Judges =  3
## $metric_3
## Call: psych::ICC(x = df)
## Intraclass correlation coefficients 
##                          type  ICC   F df1 df2       p lower bound upper bound
## Single_raters_absolute   ICC1 0.98 143   2   6 8.7e-06        0.90           1
## Single_random_raters     ICC2 0.98 143   2   4 1.9e-04        0.90           1
## Single_fixed_raters      ICC3 0.98 143   2   4 1.9e-04        0.87           1
## Average_raters_absolute ICC1k 0.99 143   2   6 8.7e-06        0.96           1
## Average_random_raters   ICC2k 0.99 143   2   4 1.9e-04        0.96           1
## Average_fixed_raters    ICC3k 0.99 143   2   4 1.9e-04        0.95           1
##  Number of subjects = 3     Number of Judges =  3

The output above is simply a list of the ICC’s for each metric. Based on the ICC’s, metric_1 and metric_3 each exhibit nearly perfect reliability, while metric_2 is noticeably less reliable, which is the same conclusion that can be drawn from the CV’s of the metrics. The choice of which of the six ICC forms to use for the data entered is left to the user. Criteria for making this decision is discussed in Shrout & Fleiss (1979).

3. Setting Benchmarks for Athletes (What Constitutes Meaningful Change?)

The SWC() and MDC() functions allow practitioners to set benchmarks for athletes.

3.1. Smallest Worthwhile Change: SWC()

SWC is a measure of the smallest change an athlete would need to exhibit from one trial (or set of trials) to another trial (or set of trials) in order for it to be considered worthwhile (i.e. practically meaningful), rather than merely the result of random or measurement error. It is computed by multiplying the between-athlete standard deviation by an effect size specified by the user (the default effect size is 0.2), as explained in Bernards et al. (2017). Other studies also use the TE in its calculation of the SWC, but multiply it by 1.65 (or 1.96) * the square root of 2, which represents a substantially larger effect size than the one we use (Mann et al., 2016; Mann et al., 2014). This package uses the interpretation of the statistic found in Bernards et al. (2017), as it represents a way of examining change that is distinct from the MDC, which is also included in this package. This approach is found in other studies as well, including Swinton et al. (2018).

For the SWC function, the user can choose which method should be used for the computation of the between-athlete standard deviation involved in the formula. Does the user want to average each subject’s values and use the standard deviation of these averages (the default), or should only each subject’s maximum or minimum value be considered? This is the purpose of the group_by and summarize functions from the dplyr package that are used inside of the SWC function.

Computing the SWC for the same data as in the prior three functions looks like this:

SWC(subject, trial, metric_1, metric_2, metric_3, effect_size = 0.2, method = 'AVG')
##  metric_1  metric_2 metric_3
##  5.963221 0.8666667 42.44559

Thus, in order for an athlete in the data to improve by a worthwhile amount, he or she would need to improve by 5.96 units for metric_1, 0.87 units for metric_2, and 42.45 units for metric_3, respectively.

Another application of the SWC is comparing it to the TE as a method of determining whether or not a set of measurements is reliable. If the TE is greater than the SWC for a metric, then the instrument is considered to be unreliable, as one could not be certain that an improvement equal to the SWC is worthwhile change, or rather is just error (Conway, 2017). For the data example in the preceding sections, the SWC is greater than the TE for both metric_1 and metric_3, while for metric_2, its TE is greater than its SWC. This indicates that the measurement instrument being used in this data is reliable for measuring metric_1 and metric_3, but unreliable for measuring metric_2.

3.2. Minimal Detectable Change: MDC()

With the same goal in mind as the SWC of assessing individual change, the Minimal Detectable Change (MDC) is simply the SEM multiplied by root 2 and then multiplied by a critical value that corresponds to a confidence level that the user chooses. It is described in Riemann & Lininger (2018) and appears elsewhere in the literature as well (Beekhuizen et al., 2009; de Vet et al., 2006; Bernards et al., 2017).

This function is similar to SWC(), but has some important differences. The main such difference is that it includes the ICC in its calculation, and it is based on a critical value of the normal distribution. This critical value is based on how confident the user wants to be that the MDC value generated cannot be explained solely by chance or error. The default confidence level is 0.95, meaning that, if the MDC indicated by the function were to be achieved by an athlete in the data, one could 95% confident is a legitimate change in performance, based on the corresponding critical value from the standard normal distribution. In other words, such a change would be statistically significant at the 5% level.

Using this formula with the usual data, the resulting output is:

MDC(subject, trial, metric_1, metric_2, metric_3, ICC = c(0.99, 0.93, 0.99), confidence = 0.95)
##  metric_1 metric_2 metric_3
##  7.221344 3.023685 51.47747

Note, the MDC is larger than the SWC for each metric, which sheds light on the fact that the MDC represents a high threshold for improvement. If an athlete were to achieve a change equal to the MDC, with the default confidence level of 95%, one could be 95% confident that the change was not the result of error. One might not need to have that degree of certainty, and might prefer to set a benchmark for their athletes that is more achievable. Changing the confidence level from 95% to 80% lowers the MDC of each metric considerably, as shown in the code and output below:

MDC(subject, trial, metric_1, metric_2, metric_3, ICC = c(0.99, 0.93, 0.99), confidence = 0.80)
##  metric_1 metric_2 metric_3
##  4.721783 1.977081 33.65931

These MDC’s represent changes that, if achieved, a practitioner could be 80% confident is an actual improvement, rather than being due to random chance alone.

4. Standard Ten Scores: STEN()

Practitioners may wish to analyze how each athlete compares to his or her teammates or competitors in the same data. For this purpose, STEN() is part of psr, and it represents the Standard Ten (STEN) scores of the athletes for each metric. The STEN scores range from one to ten, with a mean of 5.5. This concept is often used in psychological and personality tests, but it can be applied to performance science as well (Glen, 2015). The function has the simplest form of any function in the package (similar to TE() and CV()), as it takes just the subject, trial, and metric vectors as inputs. It produces simple output that looks as follows for the same data that was used to demonstrate all of the previous functions:

STEN(subject, trial, metric_1, metric_2, metric_3)
##  subject   trial metric_1 metric_2 metric_3
##        1 Trial 1     5.63     4.85     5.07
##        1 Trial 2     6.24     3.40     5.74
##        1 Trial 3     5.78     4.85     5.88
##        2 Trial 1     7.85     6.79     3.17
##        2 Trial 2     7.16     8.73     2.87
##        2 Trial 3     7.70     8.25     3.51
##        3 Trial 1     3.17     5.34     7.95
##        3 Trial 2     2.79     3.40     7.77
##        3 Trial 3     3.17     3.88     7.54

This output lends insight into how well each athlete performed on each trial, compared to all of the measurements recorded in the data for the same metric. One can see that subject 1 recorded close to the mean of each metric for most of his or her measurements. In Trials 1 and 2, subject 2 recorded high measurements for all three metrics but low measurements in Trial 3, while the opposite was the case for subject 3. This function makes it easy to see how each athlete is performing relative to the other athletes in the data for each of his or her measurements, and how this performance differs across trials and metrics, respectively.

5. Error Examples

Finally, we illustrate potential pitfalls and resulting error messages produced by functions in the psr package. TE() is used to illustrate each potential error, but the functionality applies to all other functions as well.

5.1. Trials Not Labeled Uniformly Across Athletes

One potential error that could be made is the trials not being labeled in the same way across each athlete. To see an example of this, consider the following code:

subject <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
trial <- c('Trial 1', 'Trial 2', 'Trial 3', 'Trial 1', 'Trial 2', 'Trial 3', 'Trial 1', 'Trial 2', 'Trial 4')
metric_1 <- c(250, 258, 252, 279, 270, 277, 218, 213, 218)
metric_2 <- c(10, 7, 10, 14, 18, 17, 11, 7, 8)
metric_3 <- c(1214, 1276, 1289, 1037, 1010, 1069, 1481, 1465, 1443)
TE(subject, trial, metric_1, metric_2, metric_3)
## Each athlete has not recorded a measurement for each trial
##  metric_1 metric_2 metric_3
##  5.346338 2.522124 28.01835

The problem is that subject 3’s third measurement is labeled as Trial 4, rather than Trial 3, as is the case for the other two subjects. Subject 3 does not record a measurement for Trial 3, while neither Subject 1 nor Subject 2 record a measurement for Trial 4. This results in the above error message.

This message precisely explains the issue, which is that each athlete has not recorded a measurement for each trial. This message precedes the unhelpful R message that is produced when any function does not work properly, meaning that it is visible to the user directly under the code that had just been entered.

5.2. Duplicate Trials for an Athlete

Occasionally an error in processing the data may result in an athlete having multiple measurements for one trial. An example of this is shown below:

subject <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
trial <- c('Trial 1', 'Trial 2', 'Trial 3', 'Trial 1', 'Trial 2', 'Trial 3', 'Trial 1', 'Trial 2', 'Trial 2')
metric_1 <- c(250, 258, 252, 279, 270, 277, 218, 213, 218)
metric_2 <- c(10, 7, 10, 14, 18, 17, 11, 7, 8)
metric_3 <- c(1214, 1276, 1289, 1037, 1010, 1069, 1481, 1465, 1443)
TE(subject, trial, metric_1, metric_2, metric_3)
## One or more athletes have recorded multiple measurements for a single trial
##  metric_1 metric_2 metric_3
##  4.824417 2.192791 27.45496

As one can see, Subject 3 has recorded two measurements for Trial 2. One would be correct to note that this subject has also not recorded a measurement for Trial 3, hence Error 1 has also occurred in this code. If both errors occur, only the message for Error 2 is shown to the user, as it does in this example. The above message is shown and offers a clear and concise indicator of what has gone wrong in the code that the user had just entered.

5.3. Character Vector for Metric

In the following error example, the trials are all labeled correctly, but the first metric is entered as a character vector, when it needs to be a numeric vector, and the helpful error message is displayed:

# subject <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
# trial <- c('Trial 1', 'Trial 2', 'Trial 3', 'Trial 1', 'Trial 2', 'Trial 3', 'Trial 1', 'Trial 2', 'Trial 3')
# metric_1 <- c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yzz")
# metric_2 <- c(10, 7, 10, 14, 18, 17, 11, 7, 8)
# metric_3 <- c(1214, 1276, 1289, 1037, 1010, 1069, 1481, 1465, 1443)
# TE(subject, trial, metric_1, metric_2, metric_3)

When this code is run, the above helpful error message is displayed, directly below the code that the user had just entered. Yet again, the user knows exactly why the code did not run properly.

6. References

Atkinson, G., & Nevill, A. M. (1998). Statistical Methods For Assessing Measurement Error (Reliability) in Variables.

Beekhuizen, K., Davis, M. D., Kolber, M. J., Cheng, M. (2009). Test-Retest Reliability and Minimal Detectable Change of the Hexagon Agility Test. The Journal of Strength and Conditioning Research, 23(7), 2167-2171.

Bernards, J., Sato, K., Haff, G., & Bazyler, C. (2017). Current Research and Statistical Practices in Sport Science and a Need for Change. Sports, 5(4), 87.

Conway, B. (2017). Smallest Worthwhile Change. https://www.scienceforsport.com/smallest-worthwhile-change/

de Vet, H. C., Terwee, C. B., Ostelo, R.W., Beckerman, H., Knol D. L., Bouter, L. M. (2006). Minimal changes in health status questionnaires: distinction between minimally detectable change and minimally important change. Health and Quality of Life Outcomes, 4(54).

Ferger, K., & Büsch, D. (2018). Individual measurement of performance change in sports. Deutsche Zeitschrift Für Sportmedizin, 2018(02), 45-52.

Glen, S. (2015). Stephanie Glen. “STEN Score” From StatisticsHowTo.com: Elementary Statistics for the rest of us! https://www.statisticshowto.com/sten-score/

Hopkins, W. G. (2000). A new view of statistics. Internet Society for Sport Science: http://www.sportsci.org/resource/stats/.

Hopkins, W. G. (2005). Competitive Performance of Elite Track-and-Field Athletes: Variability and Smallest Worthwhile Enhancements. SportScience, 9, 17-20.

Hopkins, W. G. (2000). Measures of Reliability in Sports Medicine and Science. Sports Medicine, 30(5), 375-381.

Mann J. B., Ivey, P. A., Brechue, W. F., Mayhew, J. L. (2014). Reliability and Smallest Worthwhile Difference of the NFL-225 Test in NCAA Division I Football Players. Journal of Strength and Conditioning Research, 28(5), 1427-1432.

Mann J. B., Ivey, P. A., Mayhew, J. L., Schumacher, R. M., Brechue, W. F. (2016). Relationship Between Agility Tests and Short Sprints: Reliability and Smallest Worthwhile Difference in National Collegiate Athletic Association Division-I Football Players. Journal of Strength and Conditioning Research, 30(4), 893-900.

Revelle W (2020). psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois. R package version 2.0.12, https://CRAN.R-project.org/package=psych.

Riemann, B. L., & Lininger, M. R. (2018). Statistical Primer for Athletic Trainers: The Essentials of Understanding Measures of Reliability and Minimal Important Change. Journal of Athletic Training, 53(1), 98-103.

Paton, C. D., & Hopkins, W. G. (2005). Competitive Performance of Elite Olympic-Distance Triathletes: Reliability and Smallest Worthwhile Enhancement. SportScience, 9, 1-5.

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.

Swinton, P. A., Hemingway, B. S., Saunders, B., Gualano, B., Dolan, E. (2018). A Statistical Framework to Interpret Individual Response to Intervention: Paving the Way for Personalized Nutrition and Exercise Prescription. Frontiers in Nutrition. 5(41), 1-14.

Wickham, H., & Henry, L. (2020). tidyr: Tidy Messy Data. R package version 1.1.0. https://CRAN.R-project.org/package=tidyr