ACL anthology corpus
1.0.0
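This repository provides the full text and metadata of the ACL Anthology collection (80k articles/posters as of September 2022), along with the .pdf files and the GROBID extractions of those PDFs.
The data is now hosted on Huggingface! Please download it from there; that copy is the most up to date. https://huggingface.co/datasets/ACL-OCL/acl-anthology-corpus
The goal is to keep this corpus updated and to provide a comprehensive repository of the full ACL collection.
This repository provides data for 80,013 ACL articles/posters: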
Column name | Description |
---|---|
acl_id | Unique ACL ID |
abstract | Abstract extracted by GROBID |
full_text | Full text extracted by GROBID |
corpus_paper_id | Semantic Scholar ID |
pdf_hash | SHA-1 hash of the PDF |
numcitedby | Number of citations according to S2 (Semantic Scholar) |
url | Link to the publication |
publisher | - |
address | Address of the conference |
year | - |
month | - |
booktitle | - |
author | List of authors |
title | Title of the paper |
pages | - |
doi | - |
number | - |
volume | - |
journal | - |
editor | - |
isbn | - |
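Since `pdf_hash` is described as the SHA-1 hash of the PDF, it can in principle be used to check that a local PDF matches a corpus record. The sketch below assumes the hash is computed over the raw bytes of the file and stored as a lowercase hex string; the README does not state this explicitly. The helper names `sha1_of_file` and `matches_pdf_hash` are hypothetical and not part of the corpus tooling.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pdf_hash(path, expected_hash):
    """Compare a file's SHA-1 with a `pdf_hash` value (assumed to be hex)."""
    return sha1_of_file(path) == expected_hash.strip().lower()


# Hypothetical usage, with a row taken from the corpus DataFrame:
#   matches_pdf_hash("some-paper.pdf", row["pdf_hash"])
```

If the comparison fails for a PDF you believe is correct, the assumption about how the hash was computed may not hold for this corpus.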
>>> import pandas as pd
>>> df = pd.read_parquet('acl-publication-info.74k.parquet')
>>> df
       acl_id    abstract                                           full_text                                          corpus_paper_id  pdf_hash                                  ...  number  volume  journal  editor  isbn
    0  O02-2002  There is a need to measure word similarity whe...  There is a need to measure word similarity whe...         18022704  0b09178ac8d17a92f16140365363d8df88c757d0  ...  None    None    None     None    None
    1  L02-1310                                                                                                                8220988  8d5e31610bc82c2abc86bc20ceba684c97e66024  ...  None    None    None     None    None
    2  R13-1042  Thread disentanglement is the task of separati...  Thread disentanglement is the task of separati...         16703040  3eb736b17a5acb583b9a9bd99837427753632cdb  ...  None    None    None     None    None
    3  W05-0819  In this paper, we describe a word alignment al...  In this paper, we describe a word alignment al...          1215281  b20450f67116e59d1348fc472cfc09f96e348f55  ...  None    None    None     None    None
    4  L02-1309                                                                                                               18078432  011e943b64a78dadc3440674419821ee080f0de3  ...  None    None    None     None    None
  ...  ...       ...                                                ...                                                            ...  ...                                       ...  ...     ...     ...      ...     ...
73280  P99-1002  This paper describes recent progress and the a...  This paper describes recent progress and the a...           715160  ab17a01f142124744c6ae425f8a23011366ec3ee  ...  None    None    None     None    None
73281  P00-1009  We present an LFG-DOP parser which uses fragme...  We present an LFG-DOP parser which uses fragme...          1356246  ad005b3fd0c867667118482227e31d9378229751  ...  None    None    None     None    None
73282  P99-1056  The processes through which readers evoke ment...  The processes through which readers evoke ment...          7277828  924cf7a4836ebfc20ee094c30e61b949be049fb6  ...  None    None    None     None    None
73283  P99-1051  This paper examines the extent to which verb d...  This paper examines the extent to which verb d...          1829043  6b1f6f28ee36de69e8afac39461ee1158cd4d49a  ...  None    None    None     None    None
73284  P00-1013  Spoken dialogue managers have benefited from u...  Spoken dialogue managers have benefited from u...         10903652  483c818c09e39d9da47103fbf2da8aaa7acacf01  ...  None    None    None     None    None
[73285 rows x 21 columns]
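In the preview above, some rows (for example index 1 and 4) show a blank `abstract` and `full_text`. A minimal sketch of filtering such rows follows, assuming missing text is stored either as an empty string or as `None`; the README does not say which. The toy frame and the helper `drop_missing_text` are hypothetical stand-ins, and the IDs in it are made up.

```python
import pandas as pd

# Toy stand-in for the corpus DataFrame; values are illustrative, not real rows.
toy = pd.DataFrame(
    {
        "acl_id": ["X00-0001", "X00-0002", "X00-0003"],
        "abstract": ["An abstract.", "", None],
        "full_text": ["Some body text.", "", None],
    }
)


def drop_missing_text(df, column="full_text"):
    """Keep only rows whose `column` holds a non-empty, non-whitespace string."""
    text = df[column].fillna("").astype(str).str.strip()
    return df[text != ""]


with_text = drop_missing_text(toy)
print(len(toy), "rows ->", len(with_text), "rows with full text")
```

The same call should work on the `df` loaded from the parquet file, provided the column names match the table above.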
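The provided ACL IDs are also consistent with the S2 (Semantic Scholar) API:
https://api.semanticscholar.org/graph/v1/paper/ACL:P83-1025
This API can be used to fetch more information about each paper in the corpus.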
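As a rough sketch, the URL pattern above can be built from any `acl_id` with the standard library alone. The endpoint and the `ACL:` prefix come from the example URL; the optional `fields` query parameter and the field names in the usage comment are assumptions based on the public Semantic Scholar Graph API and should be checked against its documentation. The helper names `s2_paper_url` and `fetch_paper` are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Sequence

S2_PAPER_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper/"


def s2_paper_url(acl_id: str, fields: Sequence[str] = ()) -> str:
    """Build the Graph API URL for a paper identified by its ACL ID."""
    url = S2_PAPER_ENDPOINT + "ACL:" + urllib.parse.quote(acl_id, safe="")
    if fields:
        # Assumed query parameter: a comma-separated list of field names.
        url += "?" + urllib.parse.urlencode({"fields": ",".join(fields)})
    return url


def fetch_paper(acl_id: str, fields: Sequence[str] = (), timeout: float = 10.0) -> dict:
    """Fetch one paper record as a dict (performs a network request)."""
    with urllib.request.urlopen(s2_paper_url(acl_id, fields), timeout=timeout) as resp:
        return json.load(resp)


# Hypothetical usage (needs network access; the API may rate-limit requests):
#   record = fetch_paper("P83-1025", fields=("title", "citationCount"))
```

Building the URL separately from fetching keeps the formatting logic easy to check without making any requests.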
We fine-tuned Huggingface's distilgpt2 model on the full text of this corpus. The model is trained for generation tasks.
Text generation demo: https://huggingface.co/shaurya0512/distilgpt2-finetune-acl22
Example:
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("shaurya0512/distilgpt2-finetune-acl22")
>>> model = AutoModelForCausalLM.from_pretrained("shaurya0512/distilgpt2-finetune-acl22")
>>>
>>> input_context = "We introduce a new language representation"
>>> input_ids = tokenizer.encode(input_context, return_tensors="pt")  # encode input context
>>> outputs = model.generate(
...     input_ids=input_ids, max_length=128, temperature=0.7, repetition_penalty=1.2
... )  # generate sequences
>>> print(f"Generated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
Generated: We introduce a new language representation for the task of sentiment classification. We propose an approach to learn representations from
unlabeled data, which is based on supervised learning and can be applied in many applications such as machine translation (MT) or information retrieval
systems where labeled text has been used by humans with limited training time but no supervision available at all. Our method achieves state-oftheart
results using only one dataset per domain compared to other approaches that use multiple datasets simultaneously, including BERTScore(Devlin et al.,
2019; Liu & Lapata, 2020b ) ; RoBERTa+LSTM + L2SRC -
Please cite or star this page if you use this corpus.
If you use this corpus in your research, please use the following BibTeX entry:
@Misc{acl_anthology_corpus,
author = {Shaurya Rohatgi},
title = {ACL Anthology Corpus with Full Text},
howpublished = {Github},
year = {2022},
url = {https://github.com/shauryr/ACL-anthology-corpus}
}
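We thank Semantic Scholar for providing access to the citation-related data in this corpus.
The ACL Anthology Corpus is released under CC BY-NC 4.0. By using this corpus, you agree to its terms of use.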