
Results

 The evaluation results are summarized in Figure 36 and Table 1. The boxplot (Figure 36) gives an overview of the human data: it shows the range of human performance for each story while concisely marking the outliers. The boxes represent the scores that fell between the first and third quartiles, and the asterisks represent outliers. Evaluators 1 and 4 are the English teachers, evaluator 2 is the computer science professor, and evaluator 3 is the research scientist. Table 1 shows how the ISAAC system compares in comprehension scores to the human data. For each evaluator and story, the human mean and standard deviation are presented. ISAAC's performance (a single score) is then presented, along with a z-score describing how far ISAAC's score fell from the human mean, in units of standard deviations. As the table shows, ISAAC scored above the human mean in eight of the twelve cases.

 

  
Figure 36: Human performance for each story per evaluator. Stories are Men Are Different (M), Zoo (Z), and Lycanthrope (L). Score represents the percentage correct on the reading evaluation tests devised by the four evaluators. The horizontal lines in the boxes represent the median score; the asterisks represent outliers in the data.
[Figure 36 image: boxplot.eps]

 

 
Table 1: ISAAC evaluation results
Evaluator  Story  Human mean  Std. dev.  ISAAC score  z-score
1          MEN         78.00      18.60        96.00     0.97
1          ZOO         84.40      10.06       100.00     1.55
1          LYC         81.20      15.67       100.00     1.20
2          MEN         72.92      10.06        83.33     1.04
2          ZOO         84.37      10.31        62.50    -2.12
2          LYC         67.22      15.59        83.33     1.03
3          MEN         76.00      23.19        80.00     0.17
3          ZOO         85.00      12.69        70.00    -1.18
3          LYC         70.83      14.83        66.67    -0.28
4          MEN         75.84      17.07       100.00     1.42
4          ZOO         74.45      17.41        66.67    -0.45
4          LYC         67.14      20.26       100.00     1.62
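
The z-scores in Table 1 follow from the other columns as z = (ISAAC score - human mean) / standard deviation. The following is a minimal Python sketch of that arithmetic, using the values printed in the table; the names ROWS, z_score, z_scores and count_above_mean are illustrative and not part of the original study. Because the inputs are already rounded to two decimals, a recomputed value can differ from the printed one in the last digit (evaluator 2, MEN comes out as 1.03 rather than 1.04).

```python
# Rows copied from Table 1:
# (evaluator, story, human mean, std. dev., ISAAC score)
ROWS = [
    (1, "MEN", 78.00, 18.60, 96.00),
    (1, "ZOO", 84.40, 10.06, 100.00),
    (1, "LYC", 81.20, 15.67, 100.00),
    (2, "MEN", 72.92, 10.06, 83.33),
    (2, "ZOO", 84.37, 10.31, 62.50),
    (2, "LYC", 67.22, 15.59, 83.33),
    (3, "MEN", 76.00, 23.19, 80.00),
    (3, "ZOO", 85.00, 12.69, 70.00),
    (3, "LYC", 70.83, 14.83, 66.67),
    (4, "MEN", 75.84, 17.07, 100.00),
    (4, "ZOO", 74.45, 17.41, 66.67),
    (4, "LYC", 67.14, 20.26, 100.00),
]


def z_score(score, mean, std_dev):
    """Distance of `score` from `mean`, in units of `std_dev`."""
    return (score - mean) / std_dev


def z_scores(rows=ROWS):
    """Recompute the z-score column, rounded to two decimals."""
    return [round(z_score(isaac, mean, sd), 2) for _, _, mean, sd, isaac in rows]


def count_above_mean(rows=ROWS):
    """Number of evaluator/story cases where ISAAC beat the human mean."""
    return sum(1 for _, _, mean, _, isaac in rows if isaac > mean)


if __name__ == "__main__":
    for (evaluator, story, *_), z in zip(ROWS, z_scores()):
        print(evaluator, story, z)
    print(count_above_mean(), "of", len(ROWS), "cases above the human mean")
```

Counting the rows with a positive z-score is the same as counting the cases in which ISAAC scored above the human mean.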

Next, an ANOVA was performed on the data to determine the source of variance in the scores. There were three candidate sources: the evaluators, the stories, and the agents (humans and ISAAC). The ANOVA results are shown in Table 2. The agents are the most significant factor contributing to the variance of the scores (p < 0.01); the stories contribute somewhat (p < 0.05); the evaluators were not a significant influence on the variance. This was the expected result. It can be interpreted to mean that the stories were fairly similar in difficulty and the evaluators were similar in their judgments; the observed differences were due mainly to differences among the agents participating in the study.

 

 
Table 2: Results of the ANOVA on performance data
Source      DF  F
evaluator    3  2.33
story        2  3.48*
agent       10  2.92**
Error      116
Total      131
*  p < 0.05
** p < 0.01
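
The degrees of freedom in Table 2 can be checked with a short calculation. The sketch below assumes a main-effects-only design (no interaction terms) with one score per agent, story and evaluator, giving 4 x 3 x 11 = 132 observations. Both assumptions are inferred from the DF column rather than stated in the text, and the function name main_effects_df is hypothetical.

```python
def main_effects_df(levels, n_observations):
    """Degrees of freedom for an ANOVA with main effects only.

    levels: mapping from factor name to its number of levels.
    n_observations: total number of scores (assumed, see lead-in).
    """
    df = {name: k - 1 for name, k in levels.items()}
    df["Total"] = n_observations - 1
    # Whatever is not explained by the main effects is left to the error term.
    df["Error"] = df["Total"] - sum(k - 1 for k in levels.values())
    return df


# Four evaluators, three stories, eleven agents (ten humans plus ISAAC).
LEVELS = {"evaluator": 4, "story": 3, "agent": 11}
N_OBS = 4 * 3 * 11  # assumed: one score per agent, story and evaluator

DF = main_effects_df(LEVELS, N_OBS)

if __name__ == "__main__":
    for source, value in DF.items():
        print(source, value)
```

Under these assumptions the computed values agree with the DF column of Table 2, including the 116 error degrees of freedom.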

Although the ANOVA identified the agents as the main source of variance, we still need to determine how ISAAC relates to the other agents. First, consider the eleven agents. The average scores for the agents, across all evaluators and all stories, are shown in Table 3.


 
Table 3: Scores for the eleven participants
Participant Score
1007 87.550
1008 84.838
1009 84.422
1011 84.042
1006 79.097
1001 78.683
1004 71.155
1002 70.977
1003 70.005
1010 69.475
1005 68.275

  While there are eleven distinct scores in the table, it is likely that some are statistically indistinguishable from others. It would be useful to know which equivalence classes the eleven scores fall into; since ISAAC is subject 1011, it would then be possible to determine how well the model did in relation to the human participants. To determine the number and composition of the equivalence classes, a Duncan multiple range analysis was performed on the data. This test separates a set of means into groups whose members do not differ significantly from one another. The test revealed that the data fall into two equivalence classes (at a significance level of 0.05): the lower five scores belong to one class and the upper six belong to the other. Thus, ISAAC is indistinguishable from the five humans in the higher-ability category.
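
The grouping can be illustrated with the values in Table 3. The sketch below is not Duncan's test, which needs the error mean square and studentized range critical values that are not reported here. As a stand-in, it simply splits the ranked means at the largest gap between neighbours. That heuristic happens to give the same six/five split, but it carries no statistical guarantee. The names SCORES, ISAAC_ID, ranked and split_at_largest_gap are illustrative.

```python
# Mean scores copied from Table 3, keyed by participant number.
SCORES = {
    1007: 87.550, 1008: 84.838, 1009: 84.422, 1011: 84.042,
    1006: 79.097, 1001: 78.683, 1004: 71.155, 1002: 70.977,
    1003: 70.005, 1010: 69.475, 1005: 68.275,
}
ISAAC_ID = 1011  # ISAAC is subject 1011


def ranked(scores):
    """Participants sorted from highest to lowest mean score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def split_at_largest_gap(scores):
    """Heuristic stand-in for the range test: cut the ranking at its widest gap."""
    order = ranked(scores)
    gaps = [order[i][1] - order[i + 1][1] for i in range(len(order) - 1)]
    cut = gaps.index(max(gaps)) + 1
    upper = [pid for pid, _ in order[:cut]]
    lower = [pid for pid, _ in order[cut:]]
    return upper, lower


UPPER, LOWER = split_at_largest_gap(SCORES)

if __name__ == "__main__":
    print("upper group:", UPPER)
    print("lower group:", LOWER)
    print("ISAAC in upper group:", ISAAC_ID in UPPER)
```

The widest gap falls between participants 1001 and 1004, which places ISAAC in the upper group together with five human participants.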


Kenneth Moorman
11/4/1997