The structuring of discourse into segments larger than utterances both explains and is explained by a wide variety of linguistic phenomena. For example, cue phrases, that is, words and phrases such as ``now'' and ``that reminds me'', may be used to convey discourse segment structure explicitly. However, while the need to model the relation between discourse segments and linguistic features of utterances is almost universally acknowledged in the literature, there is little consensus on the nature of segments or on the criteria for recognizing or generating them from linguistic features.
This research concerns the development of methods for obtaining examples of classified discourse phenomena from subjects, and the use of machine learning and other methods to investigate the relationship between discourse segments and linguistic phenomena. For example, in the area of discourse segmentation, we found that when naive subjects segmented discourse using speaker intention as the segmentation criterion, the results were highly statistically significant. We then used the subjects' segmentations as a target for evaluating two sets of segmentation algorithms, using information retrieval metrics. One set of algorithms was based on inductive machine learning. Machine learning takes as input a set of preclassified examples, each coded with respect to a set of user-defined features, and outputs an algorithm that predicts the class of any input from its feature values. To date I have used machine learning to answer two particular questions in discourse analysis: 1) when does a given usage of a cue word signal discourse structure, and 2) when does a segment boundary occur between two contiguous utterances? Both quantitative and qualitative evaluations across a wide variety of experiments suggest that machine learning is an effective technique for automating the generation of algorithms for discourse analysis, as well as for providing further insight into the data. Furthermore, the performance of the learned algorithms is often superior to that of manually developed algorithms, and approaches the performance of human subjects.
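The learning and evaluation setup described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the feature names (`cue_word`, `pause`), the toy examples, and the simple one-feature rule learner, which stands in for whatever induction programs were actually used in this research. The sketch takes preclassified potential boundary sites coded with features, induces a predictive rule, and scores its predictions against target boundaries with precision and recall, two standard information retrieval metrics.

```python
from collections import Counter, defaultdict

# Hypothetical learner (a 1R-style rule inducer), not the method used in the research.
def learn_one_feature_rule(examples, features):
    """Pick the single feature whose value -> majority-class table makes the
    fewest errors on the preclassified training examples."""
    default = Counter(label for _, label in examples).most_common(1)[0][0]
    best = None
    for f in features:
        counts = defaultdict(Counter)
        for feats, label in examples:
            counts[feats[f]][label] += 1
        table = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(1 for feats, label in examples if table[feats[f]] != label)
        if best is None or errors < best[0]:
            best = (errors, f, table)
    _, f, table = best
    return f, table, default

def predict(rule, feats):
    """Apply the learned rule; fall back to the majority class for unseen values."""
    f, table, default = rule
    return table.get(feats[f], default)

def precision_recall(predicted, target):
    """Precision and recall of predicted boundaries against target boundaries."""
    tp = sum(1 for p, t in zip(predicted, target) if p and t)
    pred_pos = sum(1 for p in predicted if p)
    targ_pos = sum(1 for t in target if t)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / targ_pos if targ_pos else 0.0
    return precision, recall

# Toy data (invented): each potential boundary site is coded with two
# hypothetical features and labeled True if it is a segment boundary.
TRAIN = [
    ({"cue_word": True, "pause": "long"}, True),
    ({"cue_word": True, "pause": "short"}, True),
    ({"cue_word": False, "pause": "long"}, False),
    ({"cue_word": False, "pause": "short"}, False),
    ({"cue_word": False, "pause": "long"}, True),
    ({"cue_word": False, "pause": "short"}, False),
]
TEST = [
    ({"cue_word": True, "pause": "short"}, True),
    ({"cue_word": False, "pause": "long"}, True),
    ({"cue_word": True, "pause": "long"}, False),
    ({"cue_word": False, "pause": "short"}, False),
]

rule = learn_one_feature_rule(TRAIN, ["cue_word", "pause"])
predicted = [predict(rule, feats) for feats, _ in TEST]
target = [label for _, label in TEST]
precision, recall = precision_recall(predicted, target)
print(rule[0], precision, recall)
```

In a realistic setting the target labels would come from the subjects' segmentations rather than from invented data, and the learner would typically be a decision tree or rule induction program over a much richer feature set.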