ConfliBERT: A Pre-trained Language Model for Political Conflict and Violence (NAACL 2022)

This repository contains the code for the paper ConfliBERT: A Pre-trained Language Model for Political Conflict and Violence (NAACL 2022).

Prerequisites

The code is written in Python 3.6 on Linux. The CUDA version is 10.2. The required packages are:

torch==1.7.1
transformers==4.17.0
numpy==1.19.2
scikit-learn==0.24.2
pandas==1.5.3
simpletransformers
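
As a convenience, the pinned versions above can be compared against the active environment. The following is only a minimal sketch and not part of the repository: the helper names `parse_pins` and `check_environment` are hypothetical, and it assumes Python 3.8+ for `importlib.metadata` (under Python 3.6 the `importlib_metadata` backport would be needed instead).

```python
from importlib import metadata  # standard library from Python 3.8 onward

# Pins copied from the list above; simpletransformers is listed without a version.
REQUIREMENTS = """
torch==1.7.1
transformers==4.17.0
numpy==1.19.2
scikit-learn==0.24.2
pandas==1.5.3
simpletransformers
"""


def parse_pins(text):
    """Map each package name to its pinned version (None when unpinned)."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, sep, version = line.partition("==")
        pins[name.strip()] = version.strip() if sep else None
    return pins


def check_environment(pins):
    """Return (package, wanted, found) for every missing or mismatched package."""
    problems = []
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found is None or (wanted is not None and found != wanted):
            problems.append((name, wanted, found))
    return problems
```

Calling `check_environment(parse_pins(REQUIREMENTS))` returns an empty list when every package is installed at the pinned version.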

ConfliBERT Checkpoints

We provide four versions of ConfliBERT:

ConfliBERT-scr-uncased: pre-trained from scratch with our own uncased vocabulary (preferred)
ConfliBERT-scr-cased: pre-trained from scratch with our own cased vocabulary
ConfliBERT-cont-uncased: continual pre-training with the original BERT's uncased vocabulary
ConfliBERT-cont-cased: continual pre-training with the original BERT's cased vocabulary

You can import the four models above directly via the Huggingface API:

from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("snowood1/ConfliBERT-scr-uncased", use_auth_token=True)
model = AutoModelForMaskedLM.from_pretrained("snowood1/ConfliBERT-scr-uncased", use_auth_token=True)
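
To switch between the four checkpoints without retyping identifiers, a small helper can build the model name from its two options. This is a sketch under one assumption: that all four checkpoints sit under the same `snowood1/` namespace and follow the naming pattern of the example above. The function names `conflibert_id` and `load_conflibert` are hypothetical.

```python
HUB_NAMESPACE = "snowood1"  # taken from the loading example above

PRETRAINING = ("scr", "cont")   # from scratch / continual pre-training
CASING = ("uncased", "cased")


def conflibert_id(pretraining="scr", casing="uncased"):
    """Build a checkpoint identifier such as snowood1/ConfliBERT-scr-uncased."""
    if pretraining not in PRETRAINING:
        raise ValueError(f"pretraining must be one of {PRETRAINING}")
    if casing not in CASING:
        raise ValueError(f"casing must be one of {CASING}")
    return f"{HUB_NAMESPACE}/ConfliBERT-{pretraining}-{casing}"


# Assumed identifiers for all four versions listed above.
ALL_CHECKPOINTS = [conflibert_id(p, c) for p in PRETRAINING for c in CASING]


def load_conflibert(pretraining="scr", casing="uncased"):
    """Download tokenizer and masked-LM weights (needs transformers and network access)."""
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    name = conflibert_id(pretraining, casing)
    # use_auth_token=True mirrors the snippet above.
    tokenizer = AutoTokenizer.from_pretrained(name, use_auth_token=True)
    model = AutoModelForMaskedLM.from_pretrained(name, use_auth_token=True)
    return tokenizer, model
```

For instance, `load_conflibert("cont", "cased")` would request `snowood1/ConfliBERT-cont-cased`, provided that identifier exists on the hub.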

Evaluation

Usage of ConfliBERT is the same as for other BERT models in Huggingface.

We provide multiple examples using Simple Transformers. You can run:

 CUDA_VISIBLE_DEVICES=0 python finetune_data.py --dataset IndiaPoliceEvents_sents --report_per_epoch

See the Colab demo for an evaluation example.

Evaluation Datasets

Below is a summary of the publicly available datasets:

Dataset	Link
20Newsgroups	https://www.kaggle.com/crawford/20-newsgroups
BBCnews	https://www.kaggle.com/c/learn-ai-bbc/overview
EventStatusCorpus	https://catalog.ldc.upenn.edu/LDC2017T09
GlobalContention	https://github.com/emerging-welfare/glocongold/tree/master/sample
GlobalTerrorismDatabase	https://www.start.umd.edu/gtd/
Gun Violence Database	http://gun-violence.org/download/
IndiaPoliceEvents	https://github.com/slanglab/IndiaPoliceEvents
InsightCrime	https://figshare.com/s/73f02ab8423bb83048aa
MUC-4	https://github.com/xinyadu/grit_doc_event_entity/tree/master/data/muc
re3d	https://github.com/juand-r/entity-recognition-datasets/tree/master/data/re3d
SATP	https://github.com/javierosorio/SATP
CAMEO	https://dl.acm.org/doi/abs/10.1145/3514094.3534178

To use your own datasets, the first step is to preprocess them into the required formats in ./data. For example:

IndiaPoliceEvents_sents for classification tasks. The format is sentence + labels, separated by tabs.
re3d for NER tasks, in CoNLL format.
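
The two formats above can be illustrated with a pair of small readers. This is a sketch based on assumptions: each classification line holds the sentence followed by one or more tab-separated labels, and each CoNLL line holds whitespace-separated columns with the token first and the tag last, with blank lines separating sentences. The function names and sample rows are hypothetical and are not taken from the repository's data.

```python
import csv
import io


def read_classification_tsv(handle):
    """Yield (sentence, labels) from lines of the form sentence<TAB>label[<TAB>label...]."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or label-less lines
        yield row[0], row[1:]


def read_conll(handle):
    """Yield sentences as lists of (token, tag); blank lines end a sentence."""
    sentence = []
    for line in handle:
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        # Assumption: token in the first column, tag in the last column.
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


# Made-up rows, for illustration only.
tsv_sample = io.StringIO("Police arrested two men .\t1\nThe market reopened .\t0\n")
conll_sample = io.StringIO("John B-Person\nvisited O\nBaghdad B-Location\n\nHe O\nleft O\n")

classification_rows = list(read_classification_tsv(tsv_sample))
ner_sentences = list(read_conll(conll_sample))
```

When writing your own files, the same layout applies in reverse: one example per line for classification, one token per line for NER.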

The second step is to create the corresponding config file in ./configs with the correct task from ["binary", "multiclass", "multilabel", "ner"].
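
A quick check of the task name before launching a run can catch typos early. The sketch below assumes the config is a JSON object with a "task" key. That layout, the "dataset" key, and the function `validate_config` are hypothetical, so consult the files already in ./configs for the real structure.

```python
import json

# The four task names listed above.
VALID_TASKS = ("binary", "multiclass", "multilabel", "ner")


def validate_config(raw_json):
    """Parse a config string and confirm its task is one of the supported values."""
    config = json.loads(raw_json)
    task = config.get("task")
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    return config


# Hypothetical config content; the key names are assumptions.
example = '{"task": "binary", "dataset": "IndiaPoliceEvents_sents"}'
config = validate_config(example)
```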

Pre-Training Corpora

We have gathered a large corpus in the politics and conflict domain (33 GB) for pre-training ConfliBERT. The folder ./pretrain-corpora/Crawlers and Processes contains the sample scripts used to generate the corpus used in this study. Due to copyright, we provide only a few samples in ./pretrain-corpora/Samples. These samples follow the "one sentence per line" format. See Section 2 and the Appendix of our paper for more details on the pre-training corpora.
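
To bring your own raw text into the "one sentence per line" format, a naive splitter is often enough for a first pass. This sketch is not the authors' preprocessing, which is not described here. It assumes sentences end with ".", "!" or "?" followed by whitespace, so abbreviations such as "U.S. forces" will be split incorrectly. The function name is hypothetical.

```python
import re

# Split after sentence-final punctuation that is followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def to_one_sentence_per_line(text):
    """Return the text with each detected sentence on its own line."""
    lines = []
    for paragraph in text.splitlines():
        for sentence in _SENTENCE_END.split(paragraph.strip()):
            if sentence:
                lines.append(sentence)
    return "\n".join(lines)


raw = "Protesters gathered downtown. Police arrived later!\n\nTalks resumed."
formatted = to_one_sentence_per_line(raw)
```

For production corpora, a trained sentence segmenter would be a more robust choice than this regular expression.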

Pre-Training Scripts

We followed the pre-training script run_mlm.py from Huggingface. Below is an example using 8 GPUs. Our parameters are listed in the Appendix of the paper. However, you should adjust the parameters to suit your own devices:

	export NGPU=8; nohup python -m torch.distributed.launch --master_port 12345 \
	--nproc_per_node=$NGPU run_mlm.py \
	--model_type bert \
	--config_name ./bert_base_cased \
	--tokenizer_name ./bert_base_cased \
	--output_dir ./bert_base_cased \
	--cache_dir ./cache_cased_128 \
	--use_fast_tokenizer \
	--overwrite_output_dir \
	--train_file YOUR_TRAIN_FILE \
	--validation_file YOUR_VALID_FILE \
	--max_seq_length 128 \
	--preprocessing_num_workers 4 \
	--dataloader_num_workers 2 \
	--do_train --do_eval \
	--learning_rate 5e-4 \
	--warmup_steps=10000 \
	--save_steps 1000 \
	--evaluation_strategy steps \
	--eval_steps 10000 \
	--prediction_loss_only \
	--save_total_limit 3 \
	--per_device_train_batch_size 64 --per_device_eval_batch_size 64 \
	--gradient_accumulation_steps 4 \
	--logging_steps=100 \
	--max_steps 100000 \
	--adam_beta1 0.9 --adam_beta2 0.98 --adam_epsilon 1e-6 \
	--fp16 True --weight_decay=0.01
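
When adapting the command to different hardware, it helps to know the effective batch size it implies. The arithmetic below uses only the flag values shown above. The helper `accumulation_for` is hypothetical and simply solves for the gradient accumulation steps that keep the same number of sequences per optimizer step.

```python
# Values taken from the command above.
n_gpu = 8                  # --nproc_per_node
per_device_batch = 64      # --per_device_train_batch_size
grad_accum = 4             # --gradient_accumulation_steps
max_seq_length = 128       # --max_seq_length
max_steps = 100_000        # --max_steps
warmup_steps = 10_000      # --warmup_steps

# Sequences consumed per optimizer update across all processes.
sequences_per_step = n_gpu * per_device_batch * grad_accum      # 2048
# Upper bound on tokens per update (each sequence holds at most 128 tokens).
max_tokens_per_step = sequences_per_step * max_seq_length       # 262144
# Sequences seen over the whole run.
total_sequences = sequences_per_step * max_steps                # 204800000
# Share of training spent in learning-rate warmup.
warmup_fraction = warmup_steps / max_steps                      # 0.1


def accumulation_for(n_gpu, per_device_batch, target=sequences_per_step):
    """Gradient accumulation steps needed to reach the target sequences per update."""
    per_pass = n_gpu * per_device_batch
    if target % per_pass != 0:
        raise ValueError("target is not a multiple of n_gpu * per_device_batch")
    return target // per_pass
```

For example, on 2 GPUs with a per-device batch of 32, `accumulation_for(2, 32)` gives 32 accumulation steps for the same 2048 sequences per update.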

Citation

If you find this repo useful in your research, please consider citing:

@inproceedings{hu2022conflibert,
  title={ConfliBERT: A Pre-trained Language Model for Political Conflict and Violence},
  author={Hu, Yibo and Hosseini, MohammadSaleh and Parolin, Erick Skorupa and Osorio, Javier and Khan, Latifur and Brandt, Patrick and D’Orazio, Vito},
  booktitle={Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  pages={5469--5482},
  year={2022}
}
