Resources | Aemami

Common-sense Reasoning Datasets

One of our research goals is to introduce new benchmarks for common-sense reasoning (CSR) that are larger, more difficult, and more realistic than their predecessors. As machines keep raising the bar on a variety of NLP tasks, it is important to keep challenging them with more varied and more demanding task scenarios. Below are a few of the datasets we have introduced, along with works in the literature that have explored them.

Knowref

Knowref60K

ADEPT

Knowref

Ali Emami, Paul Trichelair, Adam Trischler, Kaheer Suleman, Hannes Schulz, Jackie Chi Kit Cheung (2018)

Earlier common-sense benchmarks are limited in size, exhibit structural regularities, and vary in instance difficulty. We developed a new corpus-creation process that partially overcomes these limitations. Using it, we introduce KnowRef, a new benchmark for coreference resolution and natural language inference (NLI) that targets common-sense understanding and world knowledge.

Example Instance:

Wanda tries to apologize to Rose, but [she] refuses to accept. Who is [she]?

Wanda: Incorrect!

Rose: Correct!
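
As an illustration only, a pronoun disambiguation instance like the one above could be represented and scored as follows. This is a minimal sketch, not the released KnowRef format: the class name `PronounInstance`, its field names, and the `accuracy` helper are hypothetical, and the gold answer "Rose" is read off the example above rather than taken from the dataset files.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class PronounInstance:
    """Hypothetical container for one pronoun disambiguation problem."""
    text: str                    # sentence with the ambiguous pronoun in [brackets]
    candidates: Tuple[str, ...]  # candidate antecedents, in order of appearance
    answer: str                  # gold antecedent (assumed here, not from the data files)


def accuracy(instances: Sequence[PronounInstance],
             predict: Callable[[PronounInstance], str]) -> float:
    """Fraction of instances for which predict() returns the gold antecedent."""
    if not instances:
        return 0.0
    correct = sum(1 for inst in instances if predict(inst) == inst.answer)
    return correct / len(instances)


def first_candidate_baseline(inst: PronounInstance) -> str:
    """Trivial baseline: always choose the first-mentioned candidate."""
    return inst.candidates[0]


example = PronounInstance(
    text="Wanda tries to apologize to Rose, but [she] refuses to accept.",
    candidates=("Wanda", "Rose"),
    answer="Rose",
)
```

On this single instance the first-candidate baseline picks "Wanda" and so scores zero, which is the kind of surface heuristic a benchmark of this type is meant to defeat.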

Citations in the Literature

Sakaguchi, Keisuke, et al. "Winogrande: An adversarial winograd schema challenge at scale." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 05. 2020.
Wu, Wei, et al. "CorefQA: Coreference resolution as query-based span prediction." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
Dinan, Emily, et al. "Queens Are Powerful Too: Mitigating Gender Bias in Dialogue Generation." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
Zhang, Hongming, Yan Song, and Yangqiu Song. "Incorporating Context and External Knowledge for Pronoun Coreference Resolution." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Vol. 1. 2019.
Dinan, Emily, et al. "Multi-Dimensional Gender Bias Classification." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
Zhang, Hongming, et al. "Knowledge-aware Pronoun Coreference Resolution." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
Kocijan, Vid, et al. "WikiCREM: A Large Unsupervised Corpus for Coreference Resolution." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
Klein, Tassilo, and Moin Nabi. "Contrastive Self-Supervised Learning for Commonsense Reasoning." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
Wu, Wei, et al. "Coreference resolution as query-based span prediction." arXiv preprint arXiv:1911.01746 (2019).
Stylianou, Nikolaos, and Ioannis Vlahavas. "A neural entity coreference resolution review." Expert Systems with Applications 168 (2021): 114466.
Agarwal, Oshin, et al. "Entity-switched datasets: An approach to auditing the in-domain robustness of named entity recognition models." arXiv preprint arXiv:2004.04123 (2020).
Emami, Ali, et al. "An Analysis of Dataset Overlap on Winograd-Style Tasks." Proceedings of the 28th International Conference on Computational Linguistics. 2020.
Varkel, Yuval, and Amir Globerson. "Pre-training of Mention Representations in Coreference Models." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
Lata, Kusum, Pardeep Singh, and Kamlesh Dutta. "A comprehensive review on feature set used for anaphora resolution." Artificial Intelligence Review (2020): 1-90.
Kocijan, Vid, Oana-Maria Camburu, and Thomas Lukasiewicz. "The Gap on Gap: Tackling the Problem of Differing Data Distributions in Bias-Measuring Datasets." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 14. 2021.
Zhang, Hongming, Xinran Zhao, and Yangqiu Song. "A brief survey and comparative study of recent development of pronoun coreference resolution." arXiv preprint arXiv:2009.12721 (2020).
Isaak, Nicos, and Loizos Michael. "Experience and prediction: a metric of hardness for a novel litmus test." Journal of Logic and Computation (2021).
Atkinson, John, and Alex Escudero. "Evolutionary Discovery of Natural-Language Coreference Chains for Social Media Analysis." (2021).
Mosharraf, Turash. An approach to the Winograd Schema Challenge based on semantic classification of events and adjectives. Diss. Applied Sciences: School of Computing Science, 2019.
Spooner, Jordan, et al. "Using Answer Set Grammars to Learn Explanations for Winograd Schemas." (2020).
Shen, Ming, Pratyay Banerjee, and Chitta Baral. "Unsupervised Pronoun Resolution via Masked Noun-Phrase Prediction." arXiv preprint arXiv:2105.12392 (2021).

Knowref60k

Ali Emami, Adam Trischler, Kaheer Suleman, Jackie Chi Kit Cheung (2020)

Our recent research revealed that a large number of test instances in various common-sense reasoning tasks overlap considerably with the corpora on which state-of-the-art models are trained, and that a significant drop in classification accuracy occurs when we evaluate models on instances with minimal overlap. Based on these results, we develop the KnowRef-60K dataset, which consists of over 60k pronoun disambiguation problems scraped from web data. KnowRef-60K is the largest corpus to date for WSC-style common-sense reasoning and exhibits a significantly lower proportion of overlaps with current pretraining corpora.
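
To make the notion of overlap concrete, one simple way to quantify it is the share of an instance's word n-grams that also occur in a reference corpus. The sketch below is an assumption for illustration and is not necessarily the measure used in our analysis; the function names `ngrams` and `overlap_ratio`, the whitespace tokenization, and the default `n=3` are all hypothetical choices.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(instance_text: str, corpus_texts: Iterable[str], n: int = 3) -> float:
    """Fraction of the instance's n-grams that appear anywhere in the corpus."""
    instance_grams = ngrams(instance_text, n)
    if not instance_grams:
        return 0.0  # instance shorter than n tokens: no n-grams to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_texts:
        corpus_grams |= ngrams(doc, n)
    return len(instance_grams & corpus_grams) / len(instance_grams)
```

Test instances could then be split by a threshold on this ratio, and accuracy compared between the high-overlap and low-overlap groups.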

Example Instance:

Steven certainly manipulates Gregory, but [he] also has the best interest of the world at heart. Who is [he]?

Steven: Correct!

Gregory: Incorrect!

Citations in the Literature

Rogers, Anna. "Changing the World by Changing the Data." arXiv preprint arXiv:2105.13947 (2021).
Khashabi, Daniel, et al. "GooAQ: Open Question Answering with Diverse Answer Types." arXiv preprint arXiv:2104.08727 (2021).
Elazar, Yanai, et al. "Back to Square One: Bias Detection, Training and Commonsense Disentanglement in the Winograd Schema." arXiv preprint arXiv:2104.08161 (2021).

ADEPT

Ali Emami, Ian Porada, Alexandra Olteanu, Kaheer Suleman, Adam Trischler and Jackie Chi Kit Cheung (2021)

The previous two benchmarks correspond exclusively to the NLP task of coreference resolution. Another task that can be shown to require a significant degree of common-sense and background knowledge is judging the semantic plausibility of events, which can serve as the basis for a new series of difficult CSR benchmarks. To that end, we introduce the Adjective-Dependent Plausibility Task (ADEPT), a large-scale semantic plausibility task consisting of over 16 thousand sentences, each paired with a slightly modified version obtained by adding an adjective to a noun.

Example Instance:

Compared with the original statement "A story is for reading", please assess the plausibility of the following modified versions:

A solid story is for reading.

A written story is for reading.

A mistaken story is for reading.

Labels: Impossible, Less Likely, Equally Likely, More Likely, Necessarily True
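
The pairing of an original sentence with its adjective-modified versions, together with the five labels listed above, could be sketched as follows. This is an illustrative assumption rather than the released ADEPT format: `PlausibilityChange`, `AdjectivePair`, and `add_adjective` are hypothetical names, and labels are left unset because no gold labels are shown for these examples.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PlausibilityChange(Enum):
    """The five answer options shown in the example above."""
    IMPOSSIBLE = "Impossible"
    LESS_LIKELY = "Less Likely"
    EQUALLY_LIKELY = "Equally Likely"
    MORE_LIKELY = "More Likely"
    NECESSARILY_TRUE = "Necessarily True"


@dataclass
class AdjectivePair:
    """Hypothetical record: an original sentence and its modified version."""
    original: str
    modified: str
    label: Optional[PlausibilityChange] = None  # unset: no gold label given here


def add_adjective(sentence: str, noun: str, adjective: str) -> str:
    """Insert the adjective directly before the first occurrence of the noun.

    Naive on purpose: it matches whole whitespace-separated tokens and does
    not adjust articles (a/an). Raises ValueError if the noun is absent.
    """
    words = sentence.split()
    idx = words.index(noun)
    return " ".join(words[:idx] + [adjective] + words[idx:])


original = "A story is for reading."
pairs = [
    AdjectivePair(original, add_adjective(original, "story", adjective))
    for adjective in ("solid", "written", "mistaken")
]
```

Each entry in `pairs` holds one of the modified versions from the example, ready to be annotated with a `PlausibilityChange` value.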

Citations in the Literature

Lyu, Qing, et al. "Is 'My Favorite New Movie' My Favorite Movie? Probing the Understanding of Recursive Noun Phrases." Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.
Eichel, Annerose, Helena Schlipf, and Sabine Schulte im Walde. "Made of Steel? Learning Plausible Materials for Components in the Vehicle Repair Domain." Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023.