Supervised categorization of habitual versus episodic sentences
Mathew, Thomas A.
Thesis (M.S.)--Georgetown University, 2009. Includes bibliographical references.
Natural language has commonly used sentence constructions that express a form of genericity: a general property that summarizes groups of particular episodes or behaviors. Such sentences are referred to as habitual sentences. These constructions serve a specific communicative function: they convey knowledge about the common or regular behavior that defines and characterizes the environment we live in. Episodic sentences, by contrast, express some degree of detail about irregular events in time and serve more of a reporting purpose in human communication. Because habitual and episodic sentences differ in linguistic function and in the nature of the information they communicate, there is value in a repeatable method, based on internal and possibly external sentence characteristics, that distinguishes between the two categories where applicable.
The primary goal of this research is category disambiguation in cases where the verbal predicate of a sentence is known to be used in both habitual and episodic contexts; sentences whose verbal predicate explicitly determines the category are excluded to avoid skewing the results. A secondary objective is to study statistically, and report, the individual and collective impact of lexical and syntactic features known to influence the categorization of a sentence as either habitual or episodic. I have focused on the influence on genericity of features such as tense, aspect, noun phrase features, temporal modifiers, specific adverb modifiers, and specific verb auxiliaries.
Another secondary objective is to build and evaluate a supervised, decision-tree-based machine learning classifier that disambiguates between habitual and episodic sentences, using selected syntactic and lexical features and a previously categorized set of sentences for training and evaluation. Using such features, I have created a classifier that achieves 86.3% precision in disambiguating habitual and episodic sentences, compared with a baseline of 73.1% precision obtained by blindly assigning every sentence to the more commonly occurring episodic category. To support these objectives, a representative corpus sample was hand-annotated to provide a human judgment of the appropriate category for each sentence.
The final results support the claim that a machine-based classifier trained on the features discussed can outperform the baseline model. Successful markup of habitual sentences has applications in building knowledge bases that describe general world behavior; successful markup of episodic sentences can find use in more sensitive information extraction systems.
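The comparison against a majority-category baseline can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy sentences, the feature names (`tense`, `has_frequency_adverb`), and the single hand-written rule are hypothetical and are not taken from the thesis, whose classifier is a trained decision tree over a hand-annotated corpus. The score computed here is the fraction of sentences labeled correctly, which is assumed, not confirmed, to correspond to the figures reported above.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the score of always guessing it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


def rule_predict(features):
    """Hypothetical one-rule stand-in for a decision tree (not the thesis model)."""
    return "habitual" if features.get("has_frequency_adverb") else "episodic"


def fraction_correct(predictions, gold):
    """Share of predictions that match the gold annotation."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Made-up toy data: (feature dict, hand-assigned category).
toy = [
    ({"tense": "present", "has_frequency_adverb": True}, "habitual"),
    ({"tense": "past", "has_frequency_adverb": False}, "episodic"),
    ({"tense": "past", "has_frequency_adverb": False}, "episodic"),
    ({"tense": "present", "has_frequency_adverb": False}, "habitual"),
    ({"tense": "past", "has_frequency_adverb": False}, "episodic"),
]

gold = [label for _, label in toy]
baseline_label, baseline_score = majority_baseline(gold)
rule_score = fraction_correct([rule_predict(f) for f, _ in toy], gold)

print(baseline_label, baseline_score, rule_score)
```

On this toy data the constant guess is the episodic category, mirroring the baseline described above, and any feature-based classifier is judged by how far it exceeds that score.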