Evaluator | Story | Human Mean | Std. Dev. | ISAAC score | z-score |
1 | MEN | 78.00 | 18.60 | 96.00 | 0.97 |
1 | ZOO | 84.40 | 10.06 | 100.00 | 1.55 |
1 | LYC | 81.20 | 15.67 | 100.00 | 1.20 |
2 | MEN | 72.92 | 10.06 | 83.33 | 1.04 |
2 | ZOO | 84.37 | 10.31 | 62.50 | -2.12 |
2 | LYC | 67.22 | 15.59 | 83.33 | 1.03 |
3 | MEN | 76.00 | 23.19 | 80.00 | 0.17 |
3 | ZOO | 85.00 | 12.69 | 70.00 | -1.18 |
3 | LYC | 70.83 | 14.83 | 66.67 | -0.28 |
4 | MEN | 75.84 | 17.07 | 100.00 | 1.42 |
4 | ZOO | 74.45 | 17.41 | 66.67 | -0.45 |
4 | LYC | 67.14 | 20.26 | 100.00 | 1.62 |
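The z-score column is consistent with standardizing ISAAC's score against the human mean and standard deviation for the same evaluator and story, that is, z = (ISAAC score - human mean) / std. dev. The sketch below assumes that formula, which is inferred from the tabulated values, and recomputes the column. The names `ROWS`, `z_score`, and `recomputed_z_scores` are illustrative only. Because the tabulated inputs are themselves rounded, a recomputed value can differ from the reported one by 0.01.

```python
# Rows of the table above:
# (evaluator, story, human mean, std. dev., ISAAC score, reported z-score)
ROWS = [
    (1, "MEN", 78.00, 18.60, 96.00, 0.97),
    (1, "ZOO", 84.40, 10.06, 100.00, 1.55),
    (1, "LYC", 81.20, 15.67, 100.00, 1.20),
    (2, "MEN", 72.92, 10.06, 83.33, 1.04),
    (2, "ZOO", 84.37, 10.31, 62.50, -2.12),
    (2, "LYC", 67.22, 15.59, 83.33, 1.03),
    (3, "MEN", 76.00, 23.19, 80.00, 0.17),
    (3, "ZOO", 85.00, 12.69, 70.00, -1.18),
    (3, "LYC", 70.83, 14.83, 66.67, -0.28),
    (4, "MEN", 75.84, 17.07, 100.00, 1.42),
    (4, "ZOO", 74.45, 17.41, 66.67, -0.45),
    (4, "LYC", 67.14, 20.26, 100.00, 1.62),
]


def z_score(score, mean, std_dev):
    """Standardize a score against a reference mean and standard deviation."""
    return (score - mean) / std_dev


def recomputed_z_scores(rows=ROWS):
    """Return (evaluator, story, recomputed z, reported z) for every row."""
    return [
        (evaluator, story, round(z_score(isaac, mean, sd), 2), reported)
        for evaluator, story, mean, sd, isaac, reported in rows
    ]


if __name__ == "__main__":
    for evaluator, story, z, reported in recomputed_z_scores():
        print(f"evaluator {evaluator} {story}: recomputed {z:+.2f}, reported {reported:+.2f}")
```

A positive z-score means ISAAC scored above the human mean for that evaluator and story. Under the assumed formula, only evaluator 2's ZOO score lies more than two standard deviations from the human mean.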
Next, an analysis of variance (ANOVA) was performed on the data to determine the sources of variance in the scores. Three factors were considered: the evaluators, the stories, and the agents (the humans and ISAAC). The ANOVA results are shown in Table 2. The agents are the most significant factor contributing to the variance of the scores (p < 0.01); the stories contribute to a lesser degree (p < 0.05); the evaluators were not a significant influence on the variance. This was the expected result. It can be interpreted to mean that the stories were fairly similar in difficulty and the evaluators were similar in their judgments; the observed differences were due mainly to differences among the agents participating in the study.
Source | DF | F |
evaluator | 3 | 2.33 |
story | 2 | 3.48* |
agent | 10 | 2.92** |
Error | 116 | |
Total | 131 | |
* p < 0.05
** p < 0.01
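The degrees of freedom in Table 2 are consistent with 4 evaluators, 3 stories, and 11 agents in a fully crossed layout (4 x 3 x 11 = 132 scores) analyzed with main effects only. That layout is an assumption inferred from the table rather than something the text states, and the helper name `main_effects_df` is hypothetical. A minimal sketch of the bookkeeping:

```python
def main_effects_df(levels):
    """Degrees of freedom for a main-effects-only ANOVA.

    `levels` maps each factor name to its number of levels. The design is
    ASSUMED to be fully crossed with one observation per cell.
    """
    n_observations = 1
    for count in levels.values():
        n_observations *= count

    # Each factor with k levels contributes k - 1 degrees of freedom.
    df = {name: count - 1 for name, count in levels.items()}
    total = n_observations - 1
    # Whatever the main effects do not account for is left to the error term.
    df["Error"] = total - sum(df.values())
    df["Total"] = total
    return df


if __name__ == "__main__":
    table = main_effects_df({"evaluator": 4, "story": 3, "agent": 11})
    for source, dof in table.items():
        print(f"{source:<10} {dof:>4}")
```

Under these assumptions the function reproduces the DF column of Table 2: 3, 2, and 10 for the three factors, 116 for error, and 131 in total.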
Although the ANOVA identified the agents as the main source of variance, we still need to determine how ISAAC relates to the other agents. First, consider the eleven agents. Their average scores, across all evaluators and all stories, are shown in Table 3.
Participant | Score |
1007 | 87.550 |
1008 | 84.838 |
1009 | 84.422 |
1011 | 84.042 |
1006 | 79.097 |
1001 | 78.683 |
1004 | 71.155 |
1002 | 70.977 |
1003 | 70.005 |
1010 | 69.475 |
1005 | 68.275 |
While there are eleven distinct scores in the table, it is likely that some are statistically indistinguishable from others. It would be useful to know which equivalence classes the eleven scores fall into; as ISAAC is subject 1011, it would then be possible to determine how well the model did in relation to the students. To determine the number and composition of the equivalence classes, a Duncan range analysis was performed on the data. This test separates a set of means into equivalence classes, such that means within a class do not differ significantly from one another. The test revealed that the data fall into two equivalence classes (at a significance level of 0.05): the five lowest scores form one class and the six highest form the other. Thus, ISAAC is statistically indistinguishable from the five humans in the higher-ability class.
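The reported grouping can be tabulated directly from Table 3. The sketch below does not perform the Duncan test itself, which would require the individual scores and the error mean square, neither of which is reproduced here. It only splits the ranked means at the boundary the text reports (five lowest versus six highest) and looks up ISAAC's rank and class. The names `SCORES`, `split_classes`, and `largest_gap` are illustrative.

```python
# Mean score per participant, copied from Table 3. Subject 1011 is ISAAC.
SCORES = {
    1007: 87.550, 1008: 84.838, 1009: 84.422, 1011: 84.042,
    1006: 79.097, 1001: 78.683, 1004: 71.155, 1002: 70.977,
    1003: 70.005, 1010: 69.475, 1005: 68.275,
}
ISAAC = 1011


def split_classes(scores, n_lower=5):
    """Rank participants by mean score and split off the n_lower lowest.

    The split point is taken from the text, not derived statistically.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:-n_lower], ranked[-n_lower:]


def largest_gap(scores):
    """Return (gap, upper id, lower id) for the widest gap between adjacent ranked means."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    gaps = [
        (scores[a] - scores[b], a, b)
        for a, b in zip(ranked, ranked[1:])
    ]
    return max(gaps)


if __name__ == "__main__":
    higher, lower = split_classes(SCORES)
    print("higher class:", higher)
    print("lower class: ", lower)
    print("ISAAC rank:", higher.index(ISAAC) + 1, "of", len(SCORES))
    gap, upper, lower_id = largest_gap(SCORES)
    print(f"largest adjacent gap: {gap:.3f} between {upper} and {lower_id}")
```

In this tabulation ISAAC ranks fourth of the eleven agents, and the reported class boundary coincides with the widest gap between adjacent means (roughly 7.5 points, between subjects 1001 and 1004).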