COLDataset
1.0.0
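The official repository of the paper: COLD: A Benchmark for Chinese Offensive Language Detection
Chinese offensive language detection dataset
Paper link: https://arxiv.org/abs/2201.06025
Detector: We have released a version of Roberta-Base-Cold on Hugging Face.
Our paper has been accepted to EMNLP 2022!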
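COLDataset contains 37,480 comments with binary offensive labels, covering a range of topics related to race, gender, and region. To gain further insight into the data types and characteristics, we annotated the test set at a fine-grained level with four categories: attacking individuals, attacking groups, anti-bias, and other non-offensive.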
Labels in train.csv and dev.csv:
Fine-grained labels in test.csv:
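To check how the labels are distributed in any of these files, a minimal Python sketch along the following lines may help. It uses only the standard library. The column name `label` and the helper name `count_labels` are assumptions made for illustration, not something this README specifies; inspect the header row of each CSV file and pass the actual column name.

```python
import csv
from collections import Counter


def count_labels(csv_path, label_column="label"):
    """Count how often each value appears in one column of a CSV file.

    `label_column` is a hypothetical default; pass the real header name
    used in train.csv, dev.csv, or test.csv.
    """
    # utf-8-sig tolerates an optional byte-order mark at the start of the file.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames is None or label_column not in reader.fieldnames:
            raise KeyError(
                f"column {label_column!r} not found; header is {reader.fieldnames}"
            )
        # Counter consumes the rows while the file is still open.
        return Counter(row[label_column] for row in reader)
```

For example, `count_labels("train.csv")` returns a `Counter` keyed by the label values as strings, which gives a quick check of class balance before training.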
If this paper and dataset are helpful, please cite our paper.
@inproceedings{deng2022cold,
title = "COLD: A Benchmark for Chinese Offensive Language Detection",
author = "Deng, Jiawen and Zhou, Jingyan and Sun, Hao and Mi, Fei and Huang, Minlie",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.796",
pages = "11580--11599"
}