By Courtney Napoles, Aasish Pappu, Joel Tetreault
Comment threads following online news articles often range from vacuous to hateful. That said, good conversations do occur online, with people expressing different viewpoints and attempting to inform, convince, or better understand the other side, even if these conversations can get lost among the multitude of unconstructive comments. At Yahoo Research, our recent statistical experiments show that good conversations can be identified automatically, and ranking them at the top can cultivate a more civil and constructive atmosphere in online communities and potentially encourage participation from more users [1].
In an effort to foster more respectful online discussions and encourage more academic research on comments, we present the Yahoo News Annotated Comments Corpus (YNACC) via our data sharing program, Webscope. The corpus contains 522K comments from 140K comment threads posted in response to online news articles, with manual annotations for a subset of 2.4K comment threads and 9.2K comments. The annotations include six attributes of individual comments: sentiment, tone, agreement with other commenters, topic of the comment, intended audience, and persuasiveness. The annotations also include three attributes of threads: constructiveness, agreeability within the conversation, and type of conversation, i.e., flamewars vs. positive/respectful [2].
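To make the two annotation levels concrete, here is a minimal sketch of how one annotated thread could be represented in Python. It is an illustration only, not the released file format: the class names, field names, and example label values are all assumptions, and only the list of attributes comes from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CommentAnnotation:
    """The six comment-level attributes (field names and string types are assumptions)."""
    sentiment: str
    tone: str
    agreement: str        # agreement with other commenters
    topic: str            # topic of the comment
    audience: str         # intended audience
    persuasiveness: str


@dataclass
class ThreadAnnotation:
    """The three thread-level attributes, plus the annotated comments in the thread."""
    constructiveness: str
    agreeability: str
    conversation_type: str  # e.g. flamewar vs. positive/respectful
    comments: List[CommentAnnotation] = field(default_factory=list)


# Hypothetical label values, chosen only to show the shape of a record.
example = ThreadAnnotation(
    constructiveness="constructive",
    agreeability="mixed",
    conversation_type="positive/respectful",
    comments=[
        CommentAnnotation(
            sentiment="neutral",
            tone="informative",
            agreement="disagrees",
            topic="on-topic",
            audience="reply",
            persuasiveness="persuasive",
        )
    ],
)
```

Nesting the comments inside the thread record keeps the comment-level and thread-level labels together, which is convenient when studying how one level relates to the other.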
Annotated conversations in YNACC were used to create a predictive algorithm and train statistical models that automatically detect “good” conversations. We call these good conversations ERICs: Engaging, Respectful, and/or Informative Conversations, and they are characterized by:
- A respectful exchange of ideas, opinions, and/or information in response to a given topic or topics.
- Opinions expressed as an attempt to elicit a dialogue or persuade.
- Comments that seek to contribute some new information or perspective on the relevant topic.
ERICs have no single identifying attribute. A good conversation is determined by how many respectful, engaging, and persuasive comments are present. For instance, an exchange where participants are in total agreement throughout can be an ERIC, as can an exchange with a heated disagreement. Our algorithm ranks either of these types of exchanges higher than exchanges that are not ERICs. Many of the labels for the ERICs in our dataset are the result of a new coding scheme (annotation taxonomy) we developed, and they capture characteristics of online conversations not captured by traditional argumentation or dialogue features. Some of the labels we collected have been annotated in previous work [3,4], but this is the first time they are aggregated in a single corpus at the dialogue level.
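As a rough illustration of ranking threads by how many good comments they contain, the sketch below scores each thread by the fraction of its comments flagged as respectful, engaging, or persuasive, then sorts threads by that score. This is a toy heuristic and not the trained statistical models described in [1]: the function names, the boolean flags, and the scoring rule are all assumptions made for the example.

```python
from typing import Dict, List

# A comment is a dict of boolean flags; the flag names are hypothetical.
Comment = Dict[str, bool]

QUALITY_FLAGS = ("respectful", "engaging", "persuasive")


def eric_score(thread: List[Comment]) -> float:
    """Fraction of comments carrying at least one quality flag (0.0 for an empty thread)."""
    if not thread:
        return 0.0
    good = sum(1 for c in thread if any(c.get(f, False) for f in QUALITY_FLAGS))
    return good / len(thread)


def rank_threads(threads: Dict[str, List[Comment]]) -> List[str]:
    """Return thread ids ordered from highest to lowest score."""
    return sorted(threads, key=lambda tid: eric_score(threads[tid]), reverse=True)


threads = {
    "flamewar": [{}, {}, {}],
    "heated": [{"persuasive": True}, {"engaging": True}, {}],
    "agreeing": [{"respectful": True}, {"respectful": True, "engaging": True}],
}
ranking = rank_threads(threads)
print(ranking)
```

Note that the score ignores whether participants agree or disagree, so both a fully agreeing thread and a heated one can rank above a thread with no flagged comments, matching the point made above.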
Additionally, we collected annotations on 1K threads from the Internet Argument Corpus, representing another domain of online debates. Our corpus and annotation scheme are the first exploration of how characteristics of individual comments contribute to the dialogue-level classification of an exchange. We hope YNACC will facilitate research to understand ERICs and other aspects of dialogue in general.
The technical contributions of this dataset are described in two scientific papers:
[1] Courtney Napoles, Aasish Pappu, and Joel Tetreault. “Automatically Identifying Good Conversations Online (Yes, they do exist!)”. In Proceedings of ICWSM'17.
[2] Courtney Napoles, Joel Tetreault, Aasish Pappu, Enrica Rosato and Brian Provenzale. 2017. “Finding Good Conversations Online: The Yahoo News Annotated Comments Corpus”. In Proceedings of The 11th Linguistic Annotation Workshop (LAW-XI).
References:
[3] Rob Abbott, Brian Ecker, Pranav Anand, and Marilyn Walker. Internet Argument Corpus 2.0: An SQL schema for dialogic social media and the corpora to go with it. LREC 2016.
[4] Marilyn Walker, Jean Fox Tree, Pranav Anand, Rob Abbott, and Joseph King. A corpus for research on deliberation and debate. LREC 2012.