Referenceless Evaluation of Natural Language Generation from Meaning Representations
Automatic evaluation of NLG usually involves comparison to human-authored references which are treated as the ground truth. However, these references often fail to adequately capture the range of valid output for an NLG task, and many validation studies have shown that reference-based metrics do not reliably correlate well with human judgments. Focusing on the generation of English text from Abstract Meaning Representation (AMR), I explore referenceless approaches to evaluation, including both human and automatic methods.First, I conduct a new human evaluation study comparing five different AMR-to-English generation systems. Human annotators give numerical judgments for fluency and adequacy, as well as broad error type annotations. I discuss the relative quality of these systems and how these results compare to those of automatic metrics, finding that while the metrics are mostly successful in ranking systems overall, collecting human judgments allows for more nuanced comparisons. I also perform a qualitative analysis of common errors made by these systems.Next, I explore the possibility of automatically evaluating AMR-to-English generation by comparing a parse of the generated sentence to the input. I find that the quality of AMR parsers substantially impacts the performance of this approach, and that even with a state-of-the-art parser, the resulting metric underperforms popular reference-based metrics. However, when automatic parses are manually edited for accuracy, this evaluation approach improves greatly, outperforming most fully-automatic metrics and approaching the quality of a state-of-the-art learned metric. These results indicate that fully-automatic parser-based metrics are likely to prove more reliable in the future as the state of the art in AMR parsing continues to improve.
MetadataShow full item record
Showing items related by title, author, creator and subject.
Referenceless Evaluation of Natural Language Generation from Meaning Representations Manning, Emma (Georgetown University, 2021)Automatic evaluation of NLG usually involves comparison to human-authored references which are treated as the ground truth. However, these references often fail to adequately capture the range of valid output for an NLG ...