ConfliBERT: A Pre-trained Language Model for Political Conflict and Violence (NAACL 2022)

This repository contains the essential code for the paper ConfliBERT: A Pre-trained Language Model for Political Conflict and Violence (NAACL 2022).

Prerequisites

The code was written in Python 3.6 on Linux, with CUDA 10.2. The required packages include:

 torch==1.7.1
 transformers==4.17.0
 numpy==1.19.2
 scikit-learn==0.24.2
 pandas==1.5.3
 simpletransformers

ConfliBERT Checkpoints

We provide four versions of ConfliBERT:

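ConfliBERT-scr-uncased: pre-trained from scratch with our own uncased vocabulary (preferred)
ConfliBERT-scr-cased: pre-trained from scratch with our own cased vocabulary
ConfliBERT-cont-uncased: continual pre-training with the original BERT uncased vocabulary
ConfliBERT-cont-cased: continual pre-training with the original BERT cased vocabulary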
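
The four variants differ along two axes (training regime and casing), so a checkpoint name can be assembled from those two choices. The helper below is a minimal sketch, not part of the repository: `conflibert_model_id` is a hypothetical name, and only the `snowood1/ConfliBERT-scr-uncased` ID is confirmed by this README. The other three IDs are assumed to follow the same naming pattern.

```python
# Hypothetical helper: builds a Hugging Face model ID for a ConfliBERT variant.
# Assumption: all four checkpoints follow the pattern of the one ID shown in
# this README ("snowood1/ConfliBERT-scr-uncased").
VALID_TRAINING = ("scr", "cont")    # from scratch / continual pre-training
VALID_CASING = ("uncased", "cased")


def conflibert_model_id(training: str = "scr",
                        casing: str = "uncased",
                        namespace: str = "snowood1") -> str:
    """Return '<namespace>/ConfliBERT-<training>-<casing>' after validation."""
    if training not in VALID_TRAINING:
        raise ValueError(f"training must be one of {VALID_TRAINING}, got {training!r}")
    if casing not in VALID_CASING:
        raise ValueError(f"casing must be one of {VALID_CASING}, got {casing!r}")
    return f"{namespace}/ConfliBERT-{training}-{casing}"
```

Called with no arguments, the helper returns the preferred scr-uncased checkpoint.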

You can import the four models above directly via the Huggingface API:

 from transformers import AutoTokenizer, AutoModelForMaskedLM
 tokenizer = AutoTokenizer.from_pretrained("snowood1/ConfliBERT-scr-uncased", use_auth_token=True)
 model = AutoModelForMaskedLM.from_pretrained("snowood1/ConfliBERT-scr-uncased", use_auth_token=True)
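
Once a masked-language model is loaded, a quick sanity check is to ask it to fill in a masked word. The sketch below uses the `fill-mask` pipeline from transformers. It assumes the checkpoint can be downloaded without extra authentication; if that is not the case, load the tokenizer and model as shown above and pass them to the pipeline instead. The function names and the `{mask}` placeholder convention are illustrative, not from the repository.

```python
def build_masked_sentence(template: str, mask_token: str) -> str:
    """Replace the '{mask}' placeholder with the tokenizer's mask token."""
    return template.replace("{mask}", mask_token)


def top_fill_mask(template: str,
                  model_name: str = "snowood1/ConfliBERT-scr-uncased",
                  top_k: int = 5):
    """Return (token, score) pairs predicted for the masked position."""
    # Imported lazily so build_masked_sentence works without transformers.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_name)
    sentence = build_masked_sentence(template, fill.tokenizer.mask_token)
    # With a single mask, the pipeline returns a list of prediction dicts.
    return [(p["token_str"], p["score"]) for p in fill(sentence, top_k=top_k)]
```

For example, `top_fill_mask("The rebels launched an {mask} on the capital.")` downloads the checkpoint on first use and returns the five highest-scoring candidate tokens.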

Evaluation

ConfliBERT is used in the same way as other BERT models in Huggingface.

We provide multiple examples using Simple Transformers. You can run:

 CUDA_VISIBLE_DEVICES=0 python finetune_data.py --dataset IndiaPoliceEvents_sents --report_per_epoch
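
For readers who want to see what such a fine-tuning run involves, here is a minimal sketch of binary sentence classification with Simple Transformers. It is not the repository's `finetune_data.py`: the function names, the single training epoch, and the CPU default are assumptions chosen for illustration, and the real script reads its settings from the config files.

```python
def to_training_rows(sentences, labels):
    """Pair each sentence with its integer label as [text, label] rows."""
    if len(sentences) != len(labels):
        raise ValueError("sentences and labels must have the same length")
    return [[s, int(l)] for s, l in zip(sentences, labels)]


def finetune_binary_classifier(train_rows, eval_rows,
                               model_name="snowood1/ConfliBERT-scr-uncased",
                               epochs=1, use_cuda=False):
    """Fine-tune a BERT-type checkpoint and return the evaluation metrics."""
    # Imported lazily so to_training_rows works without these packages.
    import pandas as pd
    from simpletransformers.classification import ClassificationModel

    # Simple Transformers expects the columns "text" and "labels".
    train_df = pd.DataFrame(train_rows, columns=["text", "labels"])
    eval_df = pd.DataFrame(eval_rows, columns=["text", "labels"])

    model = ClassificationModel(
        "bert", model_name, num_labels=2, use_cuda=use_cuda,
        args={"num_train_epochs": epochs, "overwrite_output_dir": True},
    )
    model.train_model(train_df)
    result, model_outputs, wrong_predictions = model.eval_model(eval_df)
    return result
```

Because ConfliBERT uses the BERT architecture, the model type passed to `ClassificationModel` is `"bert"`, and only the checkpoint name changes.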

See the Colab demo for an evaluation example.

Evaluation Datasets

Below is a summary of the publicly available datasets:

Dataset	Link
20Newsgroups	https://www.kaggle.com/crawford/20-newsgroups
BBCnews	https://www.kaggle.com/c/learn-ai-bbc/overview
EventStatusCorpus	https://catalog.ldc.upenn.edu/LDC2017T09
GlobalContention	https://github.com/emerging-welfare/glocongold/tree/master/sample
GlobalTerrorismDatabase	https://www.start.umd.edu/gtd/
GunViolenceDatabase	http://gun-violence.org/download/
IndiaPoliceEvents	https://github.com/slanglab/IndiaPoliceEvents
InsightCrime	https://figshare.com/s/73f02ab8423bb83048aa
MUC-4	https://github.com/xinyadu/grit_doc_event_entity/tree/master/data/muc
re3d	https://github.com/juand-r/entity-recognition-datasets/tree/master/data/re3d
SATP	https://github.com/javierosorio/SATP
CAMEO	https://dl.acm.org/doi/abs/10.1145/3514094.3534178

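To use your own datasets, the first step is to preprocess them into the required formats in ./data. For example,

IndiaPoliceEvents_sents is for classification tasks. The format is sentence + labels, separated by tabs.
re3d is for NER tasks in CoNLL format.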
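
The two formats above can be illustrated with small readers. This is a sketch under stated assumptions rather than the repository's loader: the classification file is assumed to hold one sentence followed by one or more tab-separated labels per line, and the CoNLL file is assumed to hold whitespace-separated columns with the token first and the tag last, with blank lines between sentences. Both function names are hypothetical.

```python
import csv
from typing import List, Tuple


def read_classification_tsv(path: str) -> List[Tuple[str, List[str]]]:
    """Read 'sentence<TAB>label[<TAB>label...]' lines into (sentence, labels)."""
    examples = []
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside sentences as literal text.
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if not row:
                continue  # skip blank lines
            if len(row) < 2:
                raise ValueError(f"expected sentence and label(s), got {row!r}")
            sentence, *labels = row
            examples.append((sentence, labels))
    return examples


def read_conll(path: str) -> List[List[Tuple[str, str]]]:
    """Read CoNLL-style data into sentences of (token, tag) pairs."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line closes the current sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            parts = line.split()
            current.append((parts[0], parts[-1]))  # token first, tag last
    if current:
        sentences.append(current)
    return sentences
```

Running either reader over a new dataset before training is a cheap way to catch malformed lines early.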

The second step is to create the corresponding config files in ./configs with the correct task from ["binary", "multiclass", "multilabel", "ner"].
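
The config schema itself is not shown in this README, so the following is only an illustration of validating the task name before writing a config file. The JSON layout and the keys `dataset` and `task` are hypothetical; check an existing file in ./configs for the actual fields.

```python
import json

# The four task names listed in this README.
ALLOWED_TASKS = ("binary", "multiclass", "multilabel", "ner")


def make_config(dataset: str, task: str, **extra) -> dict:
    """Build a config dict (hypothetical keys) after validating the task."""
    if task not in ALLOWED_TASKS:
        raise ValueError(f"task must be one of {ALLOWED_TASKS}, got {task!r}")
    return {"dataset": dataset, "task": task, **extra}


def save_config(config: dict, path: str) -> None:
    """Write the config as indented JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
```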

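Pre-training Corpus

We gathered a large corpus (33 GB) in the politics and conflict domain for pre-training ConfliBERT. The folder ./pretrain-corpora/Crawlers and Processes contains sample scripts used to generate the corpus used in this study. Due to copyright, we provide only a few samples in ./pretrain-corpora/Samples. These samples follow the "one sentence per line" format. See Section 2 and the Appendix of our paper for more details on the pre-training corpora.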
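
To make the "one sentence per line" format concrete, the sketch below converts raw text into that layout with a naive regular-expression splitter. It is not the authors' preprocessing pipeline: the boundary rule (sentence-ending punctuation followed by whitespace and an uppercase letter, digit, or quote) is an assumption that will split incorrectly on abbreviations such as "U.S. Army", and a dedicated sentence tokenizer would be a better choice for a real corpus.

```python
import re

# Naive boundary: ., !, or ? followed by whitespace and then an uppercase
# letter, a digit, or an opening quote.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9"\'])')


def to_one_sentence_per_line(text: str) -> str:
    """Collapse whitespace, split at naive boundaries, join with newlines."""
    text = " ".join(text.split())  # removes newlines and repeated spaces
    if not text:
        return ""
    sentences = [s.strip() for s in _BOUNDARY.split(text) if s.strip()]
    return "\n".join(sentences)
```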

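Pre-training Scripts

We follow the same pre-training script, run_mlm.py, from Huggingface (the original link). Below is an example using 8 GPUs. We provide our parameters in the Appendix. However, you should change the parameters according to your own devices: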

	export NGPU=8; nohup python -m torch.distributed.launch --master_port 12345 \
	--nproc_per_node=$NGPU run_mlm.py \
	--model_type bert \
	--config_name ./bert_base_cased \
	--tokenizer_name ./bert_base_cased \
	--output_dir ./bert_base_cased \
	--cache_dir ./cache_cased_128 \
	--use_fast_tokenizer \
	--overwrite_output_dir \
	--train_file YOUR_TRAIN_FILE \
	--validation_file YOUR_VALID_FILE \
	--max_seq_length 128 \
	--preprocessing_num_workers 4 \
	--dataloader_num_workers 2 \
	--do_train --do_eval \
	--learning_rate 5e-4 \
	--warmup_steps=10000 \
	--save_steps 1000 \
	--evaluation_strategy steps \
	--eval_steps 10000 \
	--prediction_loss_only \
	--save_total_limit 3 \
	--per_device_train_batch_size 64 --per_device_eval_batch_size 64 \
	--gradient_accumulation_steps 4 \
	--logging_steps=100 \
	--max_steps 100000 \
	--adam_beta1 0.9 --adam_beta2 0.98 --adam_epsilon 1e-6 \
	--fp16 True --weight_decay=0.01
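
When adapting the command to different hardware, it helps to know the effective batch size it implies, since that is the quantity to hold constant. The arithmetic below uses only the values from the command above; the function names are illustrative, and keeping the effective batch size fixed when changing the GPU count is a common practice rather than something this README prescribes.

```python
def effective_batch_size(n_gpus: int, per_device: int, grad_accum: int) -> int:
    """Sequences consumed per optimizer step across all devices."""
    return n_gpus * per_device * grad_accum


def grad_accum_for(target: int, n_gpus: int, per_device: int) -> int:
    """Accumulation steps needed to reach `target` sequences per step."""
    per_pass = n_gpus * per_device
    if target % per_pass != 0:
        raise ValueError(f"{target} is not a multiple of {per_pass}")
    return target // per_pass


# Values taken from the example command.
sequences_per_step = effective_batch_size(n_gpus=8, per_device=64, grad_accum=4)
tokens_per_step = sequences_per_step * 128        # at most, with max_seq_length 128
total_sequences = sequences_per_step * 100_000    # over max_steps

print(sequences_per_step)  # 2048
print(tokens_per_step)     # 262144
print(total_sequences)     # 204800000

# Example: on 2 GPUs with the same per-device batch size, 16 accumulation
# steps would give the same 2048 sequences per optimizer step.
print(grad_accum_for(2048, n_gpus=2, per_device=64))  # 16
```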

Citation

If you find this repository useful in your research, please consider citing:

 @inproceedings{hu2022ConfliBERT,
   title={ConfliBERT: A Pre-trained Language Model for Political Conflict and Violence},
   author={Hu, Yibo and Hosseini, MohammadSaleh and Parolin, Erick Skorupa and Osorio, Javier and Khan, Latifur and Brandt, Patrick and D’Orazio, Vito},
   booktitle={Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
   pages={5469--5482},
   year={2022}
 }
