ACL anthology corpus

ACL OCL Corpus: Advancing Open Science in Computational Linguistics

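This repository provides the full text and metadata of the ACL Anthology collection (80k articles/posters as of September 2022). It also includes the .pdf files and the GROBID extractions of those PDFs.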

How is this different from what the ACL Anthology provides and what already exists?

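We provide the PDFs, full text, references, and other details extracted from the PDFs by GROBID, whereas the ACL Anthology provides only abstracts.
A similar corpus, the ACL Anthology Network, exists, but it is showing its age: it contains only 23,000 papers as of December 2016.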

Updates

The data is now hosted on Huggingface! Please download it from there; that copy is the most up to date. https://huggingface.co/datasets/ACL-OCL/acl-anthology-corpus

The goal is to keep this corpus updated and to provide a comprehensive repository of the full ACL collection.

This repository provides data for 80,013 ACL articles/posters:

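All PDFs in the ACL Anthology: size 45G, download here
All bib files in the ACL Anthology with abstracts: size 172M, download here
Raw GROBID extraction results for all ACL Anthology PDFs, including full text and references: size 3.6G, download here
Dataframe with the extracted metadata (see the table below for details) and the full text of the collection for analysis: size 489M, download here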

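Column name	Description
`acl_id`	Unique ACL ID
`abstract`	Abstract extracted by GROBID
`full_text`	Full text extracted by GROBID
`corpus_paper_id`	Semantic Scholar ID
`pdf_hash`	SHA1 hash of the PDF
`numcitedby`	Number of citations from S2
`url`	Link to the publication
`publisher`	-
`address`	Address of the conference
`year`	-
`month`	-
`booktitle`	-
`author`	List of authors
`title`	Title of the paper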
`pages`	-
`doi`	-
`number`	-
`volume`	-
`journal`	-
`editor`	-
`isbn`	-
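
Since `pdf_hash` is described as the SHA1 hash of the PDF, a downloaded file can in principle be checked against it. The following is a minimal sketch using only the Python standard library. It assumes the hash was computed over the raw bytes of the PDF and is stored as a lowercase hex digest, which is not stated explicitly here; the helper names `sha1_of_file` and `matches_pdf_hash` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pdf_hash(path, expected_hash):
    """Compare a local PDF with a `pdf_hash` value.

    Assumption: `pdf_hash` is the hex SHA-1 digest of the raw PDF bytes.
    """
    return sha1_of_file(path) == expected_hash.strip().lower()
```

A mismatch does not necessarily mean the file is corrupt; it may only mean that the PDF was re-generated upstream after the corpus snapshot was taken.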

    >>> import pandas as pd
    >>> df = pd.read_parquet('acl-publication-info.74k.parquet')
    >>> df
             acl_id                                           abstract                                          full_text  corpus_paper_id                                  pdf_hash  ...  number volume journal editor  isbn
    0      O02-2002  There is a need to measure word similarity whe...  There is a need to measure word similarity whe...         18022704  0b09178ac8d17a92f16140365363d8df88c757d0  ...    None   None    None   None  None
    1      L02-1310                                                                                                                8220988  8d5e31610bc82c2abc86bc20ceba684c97e66024  ...    None   None    None   None  None
    2      R13-1042  Thread disentanglement is the task of separati...  Thread disentanglement is the task of separati...         16703040  3eb736b17a5acb583b9a9bd99837427753632cdb  ...    None   None    None   None  None
    3      W05-0819  In this paper, we describe a word alignment al...  In this paper, we describe a word alignment al...          1215281  b20450f67116e59d1348fc472cfc09f96e348f55  ...    None   None    None   None  None
    4      L02-1309                                                                                                               18078432  011e943b64a78dadc3440674419821ee080f0de3  ...    None   None    None   None  None
    ...         ...                                                ...                                                ...              ...                                       ...  ...     ...    ...     ...    ...   ...
    73280  P99-1002  This paper describes recent progress and the a...  This paper describes recent progress and the a...           715160  ab17a01f142124744c6ae425f8a23011366ec3ee  ...    None   None    None   None  None
    73281  P00-1009  We present an LFG-DOP parser which uses fragme...  We present an LFG-DOP parser which uses fragme...          1356246  ad005b3fd0c867667118482227e31d9378229751  ...    None   None    None   None  None
    73282  P99-1056  The processes through which readers evoke ment...  The processes through which readers evoke ment...          7277828  924cf7a4836ebfc20ee094c30e61b949be049fb6  ...    None   None    None   None  None
    73283  P99-1051  This paper examines the extent to which verb d...  This paper examines the extent to which verb d...          1829043  6b1f6f28ee36de69e8afac39461ee1158cd4d49a  ...    None   None    None   None  None
    73284  P00-1013  Spoken dialogue managers have benefited from u...  Spoken dialogue managers have benefited from u...         10903652  483c818c09e39d9da47103fbf2da8aaa7acacf01  ...    None   None    None   None  None

    [73285 rows x 21 columns]
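
Some rows in the output above have an empty `abstract` and `full_text`, so a typical first step in an analysis might be to drop them. The sketch below shows one way to do that with pandas. The helper name `papers_with_full_text` is hypothetical, and the code assumes that missing text appears either as an empty string or as a null value, which should be checked against the actual file.

```python
import pandas as pd


def papers_with_full_text(df):
    """Return only the rows whose `full_text` is a non-empty string.

    Assumption: missing text is stored as "" or as a null (None/NaN).
    """
    text = df["full_text"]
    # notna() excludes nulls; the strip() comparison excludes blank strings.
    mask = text.notna() & (text.str.strip() != "")
    return df.loc[mask]


if __name__ == "__main__":
    # Tiny stand-in frame with made-up text, used instead of the real parquet file.
    sample = pd.DataFrame(
        {
            "acl_id": ["O02-2002", "L02-1310", "L02-1309"],
            "full_text": ["There is a need to measure word similarity", "", None],
        }
    )
    print(papers_with_full_text(sample)["acl_id"].tolist())
```

With the real data, the same function could be applied directly to the `df` loaded by `pd.read_parquet`.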

The ACL IDs provided are also consistent with the S2 API:

https://api.semanticscholar.org/graph/v1/paper/ACL:P83-1025

This API can be used to fetch more information about each paper in the corpus.
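
As an illustration of the URL pattern above, the following sketch builds the request URL for a given `acl_id` using only the standard library. The helper name `s2_paper_url` is hypothetical. The optional `fields` query parameter and the field names `title` and `citationCount` are taken from the Semantic Scholar Graph API as I understand it; please confirm them against the official API documentation before relying on them.

```python
from urllib.parse import quote, urlencode

# Base endpoint taken from the example URL in this README.
S2_GRAPH_PAPER_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper/"


def s2_paper_url(acl_id, fields=None):
    """Build the S2 Graph API URL for a paper identified by its ACL ID.

    `fields` is an optional list of field names to request (assumed
    to be accepted by the API as a comma-separated query parameter).
    """
    paper_ref = "ACL:" + quote(acl_id.strip(), safe="-")
    url = S2_GRAPH_PAPER_ENDPOINT + paper_ref
    if fields:
        # Keep commas unescaped so the query string stays readable.
        url += "?" + urlencode({"fields": ",".join(fields)}, safe=",")
    return url


if __name__ == "__main__":
    print(s2_paper_url("P83-1025"))
    print(s2_paper_url("P83-1025", fields=["title", "citationCount"]))
```

The resulting URL can then be fetched with any HTTP client, for example `urllib.request.urlopen`, keeping the API's rate limits in mind.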

Text generation on Huggingface

We fine-tuned Huggingface's distilgpt2 model on the full text of this corpus. The model is trained for generation tasks.

Text generation demo: https://huggingface.co/shaurya0512/distilgpt2-finetune-acl22

Example:

    >>> from transformers import AutoTokenizer, AutoModelForCausalLM
    >>> tokenizer = AutoTokenizer.from_pretrained("shaurya0512/distilgpt2-finetune-acl22")
    >>> model = AutoModelForCausalLM.from_pretrained("shaurya0512/distilgpt2-finetune-acl22")
    >>>
    >>> input_context = "We introduce a new language representation"
    >>> input_ids = tokenizer.encode(input_context, return_tensors="pt")  # encode input context
    >>> outputs = model.generate(
    ...     input_ids=input_ids, max_length=128, temperature=0.7, repetition_penalty=1.2
    ... )  # generate sequences
    >>> print(f"Generated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

    Generated: We introduce a new language representation for the task of sentiment classification. We propose an approach to learn representations from
    unlabeled data, which is based on supervised learning and can be applied in many applications such as machine translation (MT) or information retrieval
    systems where labeled text has been used by humans with limited training time but no supervision available at all. Our method achieves state-oftheart
    results using only one dataset per domain compared to other approaches that use multiple datasets simultaneously, including BERTScore(Devlin et al.,
    2019; Liu & Lapata, 2020b ) ; RoBERTa+LSTM + L2SRC -

TODO

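~~Link the ACL corpus to Semantic Scholar (S2), using sources such as S2ORC~~
Extract figures and captions from the ACL corpus using pdffigures - science-figure-captioning
Set a release schedule to keep the corpus updated.
ACL citation graph
~~Enhance the metadata with bib file mapping - include authors~~
~~Add citation counts for papers~~
Use ForeCite to extract impactful keywords from the corpus
Link datasets using Papers with Code? - not sure how useful this is
Compute some statistics about the data - linguistic diversity; geographic diversity; an explorer if possible
Zero-shot classification

We hope this corpus will be helpful for analyses relevant to the ACL community.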

Please cite/star this page if you use this corpus.

Citing the ACL Anthology Corpus

If you use this corpus in your research, please use the following BibTeX entry:

    @Misc{acl_anthology_corpus,
        author =       {Shaurya Rohatgi},
        title =        {ACL Anthology Corpus with Full Text},
        howpublished = {Github},
        year =         {2022},
        url =          {https://github.com/shauryr/ACL-anthology-corpus}
    }