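ConfliBERT: A Pre-trained Language Model for Political Conflict and Violence (NAACL 2022)

This repository contains the essential code for the paper ConfliBERT: A Pre-trained Language Model for Political Conflict and Violence (NAACL 2022).

Prerequisites

The code was written in Python 3.6 on Linux, with CUDA 10.2. The required packages include: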

torch==1.7.1
transformers==4.17.0
numpy==1.19.2
scikit-learn==0.24.2
pandas==1.5.3
simpletransformers
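
To confirm that an environment matches the pins above, a small check along the following lines may help. This is only a sketch: the helper names (parse_pins, installed_version, check_environment) are made up for illustration, and importlib.metadata requires Python 3.8 or later, so on Python 3.6 a different lookup function would have to be passed in.

```python
from importlib import metadata  # standard library from Python 3.8 onward

# The pins listed in the Prerequisites section.
PINS = """
torch==1.7.1
transformers==4.17.0
numpy==1.19.2
scikit-learn==0.24.2
pandas==1.5.3
simpletransformers
"""


def parse_pins(text):
    """Turn 'name==version' lines into {name: version or None}."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, version = line.partition("==")
        pins[name] = version or None  # None means "any version"
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_environment(pins, lookup=installed_version):
    """Return a list of human-readable mismatches (empty if all pins match)."""
    problems = []
    for name, wanted in pins.items():
        found = lookup(name)
        if found is None:
            problems.append(f"{name}: not installed")
        elif wanted is not None and found != wanted:
            problems.append(f"{name}: found {found}, expected {wanted}")
    return problems


if __name__ == "__main__":
    for problem in check_environment(parse_pins(PINS)):
        print(problem)
```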

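ConfliBERT Checkpoints

We provide four versions of ConfliBERT:

ConfliBERT-scr-uncased: pre-trained from scratch with our own uncased vocabulary (preferred)
ConfliBERT-scr-cased: pre-trained from scratch with our own cased vocabulary
ConfliBERT-cont-uncased: continual pre-training with the original BERT's uncased vocabulary
ConfliBERT-cont-cased: continual pre-training with the original BERT's cased vocabulary

You can import any of the four models directly through the Huggingface API: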


from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("snowood1/ConfliBERT-scr-uncased", use_auth_token=True)
model = AutoModelForMaskedLM.from_pretrained("snowood1/ConfliBERT-scr-uncased", use_auth_token=True)
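
Since the four checkpoint names differ only in initialization (scr or cont) and casing, the model ID can be built programmatically. The sketch below assumes all four checkpoints live under the same snowood1 namespace as the one in the snippet above; checkpoint_id and load_checkpoint are hypothetical helper names, and use_auth_token simply mirrors that snippet with the pinned transformers version.

```python
HUB_USER = "snowood1"             # namespace taken from the snippet above
INITS = ("scr", "cont")           # from scratch / continual pre-training
CASINGS = ("uncased", "cased")


def checkpoint_id(init="scr", casing="uncased"):
    """Build a model ID such as 'snowood1/ConfliBERT-scr-uncased'."""
    if init not in INITS:
        raise ValueError(f"init must be one of {INITS}, got {init!r}")
    if casing not in CASINGS:
        raise ValueError(f"casing must be one of {CASINGS}, got {casing!r}")
    return f"{HUB_USER}/ConfliBERT-{init}-{casing}"


def load_checkpoint(init="scr", casing="uncased", use_auth_token=True):
    """Load the tokenizer and masked-LM model for one ConfliBERT variant."""
    # Imported lazily so that checkpoint_id() works without transformers installed.
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    name = checkpoint_id(init, casing)
    tokenizer = AutoTokenizer.from_pretrained(name, use_auth_token=use_auth_token)
    model = AutoModelForMaskedLM.from_pretrained(name, use_auth_token=use_auth_token)
    return tokenizer, model
```

For example, load_checkpoint("cont", "cased") would request the continually pre-trained cased variant.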

Evaluation

ConfliBERT is used in the same way as any other BERT model in Huggingface.

We provide multiple examples built on Simple Transformers. You can run:

CUDA_VISIBLE_DEVICES=0 python finetune_data.py --dataset IndiaPoliceEvents_sents --report_per_epoch
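
For readers who want to call Simple Transformers directly from Python, a binary classification run might look roughly like the sketch below. It is not a reproduction of finetune_data.py: the function names, hyperparameter values, and output directory are illustrative assumptions, and only the Simple Transformers calls themselves (ClassificationModel, train_model, eval_model) come from that library.

```python
def make_args(epochs=3, batch_size=16, output_dir="outputs/"):
    """Simple Transformers accepts its training options as a plain dict."""
    return {
        "num_train_epochs": epochs,      # illustrative value, not from the paper
        "train_batch_size": batch_size,  # illustrative value, not from the paper
        "output_dir": output_dir,
        "overwrite_output_dir": True,
    }


def finetune_binary(train_rows, eval_rows,
                    model_name="snowood1/ConfliBERT-scr-uncased",
                    use_cuda=True):
    """Fine-tune on (sentence, label) pairs and return the evaluation metrics."""
    # Imported lazily so that make_args() works without the heavy dependencies.
    import pandas as pd
    from simpletransformers.classification import ClassificationModel

    # Simple Transformers expects the columns "text" and "labels".
    train_df = pd.DataFrame(train_rows, columns=["text", "labels"])
    eval_df = pd.DataFrame(eval_rows, columns=["text", "labels"])

    # ConfliBERT uses the BERT architecture, hence model_type "bert".
    model = ClassificationModel("bert", model_name, num_labels=2,
                                args=make_args(), use_cuda=use_cuda)
    model.train_model(train_df)
    result, model_outputs, wrong_predictions = model.eval_model(eval_df)
    return result
```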

A Colab demo with an evaluation example is also available.

Evaluation Datasets

Below is a summary of the publicly available datasets:

Dataset	Link
20Newsgroups	https://www.kaggle.com/crawford/20-newsgroups
BBCnews	https://www.kaggle.com/c/learn-ai-bbc/overview
EventStatusCorpus	https://catalog.ldc.upenn.edu/LDC2017T09
GlobalContention	https://github.com/emerging-welfare/glocongold/tree/master/sample
GlobalTerrorismDatabase	https://www.start.umd.edu/gtd/
Gun Violence Database	http://gun-violence.org/download/
IndiaPoliceEvents	https://github.com/slanglab/IndiaPoliceEvents
InsightCrime	https://figshare.com/s/73f02ab8423bb83048aa
MUC-4	https://github.com/xinyadu/grit_doc_event_entity/tree/master/data/muc
re3d	https://github.com/juand-r/entity-recognition-datasets/tree/master/data/re3d
SATP	https://github.com/javierosorio/SATP
CAMEO	https://dl.acm.org/doi/abs/10.1145/3514094.3534178

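To use your own datasets, the first step is to preprocess them into the required formats under ./data. For example:

IndiaPoliceEvents_sents is for classification tasks. The format is sentence + labels, separated by tabs.
re3d is for NER tasks in CoNLL format.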
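
The two formats above can be read with a few lines of plain Python. The sketch below rests on assumptions that are not spelled out here: that each classification line holds one sentence followed by one tab and the label, and that each CoNLL line holds a token first and its tag last, with blank lines between sentences. The function names and the sample sentences in the usage note are invented for illustration.

```python
def read_sentence_label_tsv(lines):
    """Parse 'sentence<TAB>label' lines into (sentence, label) tuples."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so tabs inside the sentence do not break parsing.
        sentence, label = line.rsplit("\t", 1)
        rows.append((sentence, label))  # labels are kept as strings
    return rows


def read_conll(lines):
    """Parse CoNLL-style lines into a list of (tokens, tags) per sentence."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: token
        tags.append(parts[-1])    # last column: tag
    if tokens:
        sentences.append((tokens, tags))  # file may not end with a blank line
    return sentences
```

With a file on disk, either function can be fed an open file handle, for example read_conll(open(path, encoding="utf-8")), where path is whatever file you placed under ./data.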

The second step is to create the corresponding config file in ./configs, setting the correct task from ["binary", "multiclass", "multilabel", "ner"].
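
A typo in the task name is easy to catch before training starts. The layout of the files in ./configs is not described here, so the sketch below assumes a JSON file with a hypothetical "task" key; only the four allowed task names come from the text above. Check an existing file in ./configs for the real field names before relying on it.

```python
import json

# The four task names listed above.
ALLOWED_TASKS = ("binary", "multiclass", "multilabel", "ner")


def load_config(text):
    """Parse a JSON config string and verify its (hypothetical) 'task' field."""
    config = json.loads(text)
    task = config.get("task")  # "task" is an assumed key name
    if task not in ALLOWED_TASKS:
        raise ValueError(f"task must be one of {ALLOWED_TASKS}, got {task!r}")
    return config
```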

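Pre-training Corpora

We gathered a large corpus (33 GB) in the politics and conflict domain for pre-training ConfliBERT. The folder ./pretrain-corpora/Crawlers and Processes contains sample scripts used to generate the corpus for this study. Because of copyright restrictions, we provide only a few samples in ./pretrain-corpora/Samples. These samples follow the "one sentence per line" format. For more details on the pre-training corpora, see Section 2 and the Appendix of our paper.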
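
To illustrate the "one sentence per line" format, the sketch below converts raw text with a deliberately naive rule: split after ".", "!" or "?" when followed by whitespace. This is an assumption made for illustration, not the authors' pipeline (their scripts are in the Crawlers and Processes folder), and it will mis-split abbreviations such as "U.S. forces".

```python
import re

# Naive boundary: whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def to_one_sentence_per_line(text):
    """Return the text with one sentence per line, dropping empty lines."""
    sentences = []
    for paragraph in text.splitlines():
        for sentence in _SENTENCE_END.split(paragraph.strip()):
            if sentence:
                sentences.append(sentence)
    return "\n".join(sentences)
```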

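Pre-training Scripts

We follow the same pre-training script, run_mlm.py, from Huggingface (original link). Below is an example using 8 GPUs. Our parameters are listed in the Appendix. However, you should adjust the parameters to suit your own hardware: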

	export NGPU=8; nohup python -m torch.distributed.launch --master_port 12345 \
	--nproc_per_node=$NGPU run_mlm.py \
	--model_type bert \
	--config_name ./bert_base_cased \
	--tokenizer_name ./bert_base_cased \
	--output_dir ./bert_base_cased \
	--cache_dir ./cache_cased_128 \
	--use_fast_tokenizer \
	--overwrite_output_dir \
	--train_file YOUR_TRAIN_FILE \
	--validation_file YOUR_VALID_FILE \
	--max_seq_length 128 \
	--preprocessing_num_workers 4 \
	--dataloader_num_workers 2 \
	--do_train --do_eval \
	--learning_rate 5e-4 \
	--warmup_steps=10000 \
	--save_steps 1000 \
	--evaluation_strategy steps \
	--eval_steps 10000 \
	--prediction_loss_only \
	--save_total_limit 3 \
	--per_device_train_batch_size 64 --per_device_eval_batch_size 64 \
	--gradient_accumulation_steps 4 \
	--logging_steps=100 \
	--max_steps 100000 \
	--adam_beta1 0.9 --adam_beta2 0.98 --adam_epsilon 1e-6 \
	--fp16 True --weight_decay=0.01
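
When adapting the command to different hardware, it helps to know the effective batch size it implies, so that fewer GPUs can be offset with more gradient accumulation. The arithmetic below uses only the flag values from the command above; the variable names are made up, and the token count is an upper bound because it assumes every sequence is filled to max_seq_length.

```python
# Values copied from the example command.
NGPU = 8
PER_DEVICE_TRAIN_BATCH_SIZE = 64
GRADIENT_ACCUMULATION_STEPS = 4
MAX_SEQ_LENGTH = 128
MAX_STEPS = 100000


def effective_batch_size(ngpu, per_device, accumulation):
    """Sequences consumed per optimizer update across all devices."""
    return ngpu * per_device * accumulation


sequences_per_step = effective_batch_size(
    NGPU, PER_DEVICE_TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS
)
max_tokens_per_step = sequences_per_step * MAX_SEQ_LENGTH
total_sequences = sequences_per_step * MAX_STEPS

print(sequences_per_step)   # 2048
print(max_tokens_per_step)  # 262144
print(total_sequences)      # 204800000
```

For instance, with 2 GPUs and the same per-device batch size, setting the accumulation steps to 16 would keep the same 2048 sequences per update.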

Citation

If you find this repository useful in your research, please consider citing:


@inproceedings{hu2022conflibert,
  title={ConfliBERT: A Pre-trained Language Model for Political Conflict and Violence},
  author={Hu, Yibo and Hosseini, MohammadSaleh and Parolin, Erick Skorupa and Osorio, Javier and Khan, Latifur and Brandt, Patrick and D’Orazio, Vito},
  booktitle={Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  pages={5469--5482},
  year={2022}
}
