Another approach to agreement (useful when there are only two raters and the scale is continuous) is to calculate the differences between the observations of the two raters. The mean of these differences is called the bias, and the reference interval (mean ± 1.96 × standard deviation) is called the limits of agreement. The limits of agreement indicate how much random variation may be influencing the ratings.

The 15 × 15 matrix of inter-rater correlations among the 15 raters is presented in Table 1 and is used to recover the original intra-rater reliabilities according to the following procedure. From all 15 raters, we can create 455 triads [15!/(12! 3!)]. Each triad is a combination of three different raters, with each rater appearing in 91 (14 × 13/2) triads. Thus, for each rater, we can solve the system of equations (Eq. 4) 91 times and then average the results to obtain an estimate (and SD) of the intra-rater reliability.

We used R (version 3.6.1) to perform a power analysis. Rotondi and Donner developed a confidence interval approach for estimating sample size in reliability studies and implemented the method in kappaSize, an R package [9]. The functions of the package are limited to a maximum of five categories and six raters. To estimate the minimum number of images, we called the CI5Cats function of the kappaSize package (version 1.2).
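The arithmetic behind the bias, the limits of agreement and the triad counts can be illustrated with a minimal Python sketch. It is not the R code used in the study. The function names, the example ratings and the use of the sample standard deviation (n − 1 denominator) are assumptions made only for illustration, and solving Eq. 4 for each triad is not shown.

```python
from itertools import combinations
from statistics import mean, stdev


def limits_of_agreement(rater_a, rater_b):
    """Return (bias, lower, upper) for paired continuous ratings."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same items")
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    bias = mean(diffs)
    # Sample SD (n - 1 denominator): an assumption, the text does not specify.
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def triads_per_rater(n_raters):
    """List every combination of three raters and count each rater's triads."""
    triads = list(combinations(range(n_raters), 3))
    counts = {r: sum(r in t for t in triads) for r in range(n_raters)}
    return triads, counts


# Hypothetical ratings, made up for illustration only.
example_a = [4.0, 5.5, 6.0, 7.5, 8.0]
example_b = [4.5, 5.0, 6.5, 7.0, 8.5]
bias, lower, upper = limits_of_agreement(example_a, example_b)

# With 15 raters the enumeration reproduces the counts given in the text.
triads, counts = triads_per_rater(15)
n_triads = len(triads)
```

Enumerating the triads explicitly, rather than only evaluating the binomial coefficients, yields the list of rater combinations over which a per-rater estimate would be averaged.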

The arguments of the function are the expected kappa, the desired confidence interval, the relative frequency of each category, the number of raters and the type I error rate. The estimate decreases with a larger number of categories, a wider desired confidence interval, uneven relative frequencies, more raters and a higher type I error rate [9]. We set the expected kappa at 0.4 and the desired confidence interval at 0.2-0.6. According to McHugh [10], these values correspond to minimal to weak reliability. We assumed equal relative frequencies, because we chose the images by random sampling. We set the number of raters at two, since intra-rater reliability is measured between two sets of ratings.