Qualitative and Quantitative Analysis of Diversity in Cross-document Coreference Resolution Datasets


Abstract in English

Cross-document coreference resolution (CDCR) datasets, such as ECB+, contain manually annotated event-centric mentions of events and entities that form coreference chains with identity relations. ECB+ is a state-of-the-art CDCR dataset that focuses on the resolution of events and their descriptive attributes, i.e., actors, location, and date-time. NewsWCL50 is a dataset that annotates coreference chains of both events and entities with a strong variance in word choice and more loosely related coreference anaphora, e.g., bridging or near-identity relations. In this paper, we qualitatively and quantitatively compare the annotation schemes of ECB+ and NewsWCL50 along multiple criteria. We propose a phrasing diversity metric (PD) that compares lexical diversity within coreference chains at a more detailed level than previously proposed metrics, e.g., the number of unique lemmas. We discuss the different tasks that the two CDCR datasets pose, i.e., lexical disambiguation and lexical diversity challenges, and propose a direction for further CDCR evaluation.
