
Baseline model performance

Although the ISAAC model appears to handle a range of example stories, each requiring various elements of understanding for comprehension to succeed, it is not yet clear what level of competence ISAAC can achieve. To determine this, some formal empirical evaluation must be undertaken. Traditional artificial intelligence reading systems were evaluated in a fashion analogous to how human readers would be evaluated: the programs would generate summaries of the stories they read, or they would answer questions that the researcher had designed to assess the level of comprehension achieved. However, something beyond this level is required to fully demonstrate my theory.

Initially, I believed it would be possible to appeal to the reading education literature and "pull out" a set of guidelines to use in developing a motivated evaluation of the reading capabilities of ISAAC. Unfortunately, the reading education literature leaves precise evaluation issues up to the individual teacher. While specific reading comprehension tests exist, they depend on a set of passages and questions that are provided to the instructor; only general guidelines are given for how a reading teacher should test comprehension of an arbitrary piece of text. Since my research uses science fiction stories with the specific goal of understanding novel concepts, these "pre-packaged" tests are inadequate. The problem, then, is twofold: I do not possess the experience necessary to produce accurate evaluation criteria for the set of stories, and the literature of the field where I would expect to find such expertise does not supply them either. However, I did have access to experienced reading educators.

The technique I eventually settled on was a modification of the classic Turing Test ([#!core:turing-test!#]). First, I developed ISAAC to a point that I felt was sufficient for it to read and comprehend the stories I was using. Then, I froze the development of the system at that level. I gave the stories to a group of reading experts, who provided me with a set of questions that they felt was sufficient for testing a person's comprehension of the material. I then had a group of humans read the stories and answer the questions. At the same time, I had ISAAC read the stories and answer the same questions. Finally, the human evaluators were given the answered questions and asked to grade them, without being told which answers came from human readers and which came from ISAAC. Examining the final scores provides evidence of how well ISAAC fares as a reader. By then analyzing the knowledge and processes ISAAC had to work with, I am able to substantiate my claims concerning the relationship of creativity, understanding, and reading. In particular, given an instantiation of the theory that reads and comprehends well, it is possible to determine the power of the theory by observing which processes are implemented and what knowledge is available to the model.
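The blind-grading step of this procedure can be sketched in code. The following is only an illustrative sketch, not the tooling actually used in the evaluation: the function names, the `sheet-N` identifiers, and the numeric score scale are all hypothetical. It shows how answer sheets might be shuffled and relabeled so that graders cannot tell human answers from ISAAC's, and how the grades might afterwards be matched back to their sources.

```python
import random
from statistics import mean


def blind_answer_sheets(sheets, seed=0):
    """Shuffle answer sheets and replace their source with an opaque ID.

    `sheets` is a list of (source, answers) pairs, where source is a
    label such as "human" or "ISAAC" (hypothetical representation).
    Returns the blinded sheets plus the key that maps IDs to sources.
    """
    rng = random.Random(seed)
    order = list(range(len(sheets)))
    rng.shuffle(order)

    blinded = []  # what the graders see: (sheet_id, answers)
    key = {}      # kept from the graders: sheet_id -> source
    for new_id, idx in enumerate(order):
        source, answers = sheets[idx]
        sheet_id = f"sheet-{new_id}"
        blinded.append((sheet_id, answers))
        key[sheet_id] = source
    return blinded, key


def mean_score_by_source(grades, key):
    """Unblind the grades and average them per source.

    `grades` maps sheet_id -> numeric score assigned by a grader.
    """
    by_source = {}
    for sheet_id, score in grades.items():
        by_source.setdefault(key[sheet_id], []).append(score)
    return {source: mean(scores) for source, scores in by_source.items()}
```

The key is held back until grading is complete, so the comparison of scores between human readers and ISAAC rests only on the content of the answers.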


Kenneth Moorman
11/4/1997