ACL OCL Corpus: Advancing Open Science in Computational Linguistics

This repository provides full text and metadata for the ACL Anthology collection (80,000 articles/posters as of September 2022). It also includes the .pdf files and the GROBID extractions of those PDFs.

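How is this different from what the ACL Anthology provides, and from existing corpora?

The ACL Anthology provides only abstracts, whereas this corpus provides the PDFs, full text, references, and other details extracted from the PDFs by GROBID.
A similar corpus, the ACL Anthology Network, exists, but it is showing its age: it covers only 23,000 papers as of December 2016.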

Update

The data is now hosted on Hugging Face. Please download it from there, as that copy is the most recent: https://huggingface.co/datasets/ACL-OCL/acl-anthology-corpus

The goal is to keep this corpus up to date and to provide a comprehensive repository of the complete ACL collection.

This repository provides data for 80,013 ACL articles/posters:

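All PDFs in the ACL Anthology: size 45G
All bib files in the ACL Anthology (with abstracts): size 172M
Raw GROBID extraction results for all ACL Anthology PDFs, including full text and references: size 3.6G
A dataframe with the extracted metadata (see the table below for details) and the full text of the collection, for analysis: size 489M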

Column name	Description
`acl_id`	Unique ACL ID
`abstract`	Abstract extracted by GROBID
`full_text`	Full text extracted by GROBID
`corpus_paper_id`	Semantic Scholar ID
`pdf_hash`	SHA-1 hash of the PDF
`numcitedby`	Number of citations from S2
`url`	Link to the publication
`publisher`	-
`address`	Address of the conference
`year`	-
`month`	-
`booktitle`	-
`author`	List of authors
`title`	Title of the paper
`pages`	-
`doi`	-
`number`	-
`volume`	-
`journal`	-
`editor`	-
`isbn`	-
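
The `pdf_hash` column can be used to check that a locally downloaded PDF corresponds to a given metadata row. The following is a minimal sketch, assuming the stored value is the hex-encoded SHA-1 digest of the raw PDF bytes (the table does not spell this out). `sha1_of_file` and `matches_pdf_hash` are hypothetical helper names, not part of this repository.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pdf_hash(path, expected_hash):
    """Compare a local file with a `pdf_hash` value (assumed to be hex SHA-1)."""
    return sha1_of_file(path) == expected_hash.strip().lower()
```

If the hashes in the corpus were computed differently (for example over normalized content), this comparison would fail for every file, which would itself be a quick way to detect that the assumption does not hold.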

    >>> import pandas as pd
    >>> df = pd.read_parquet('acl-publication-info.74k.parquet')
    >>> df
             acl_id                                           abstract                                          full_text  corpus_paper_id                                  pdf_hash  ...  number volume journal editor  isbn
    0      O02-2002  There is a need to measure word similarity whe...  There is a need to measure word similarity whe...         18022704  0b09178ac8d17a92f16140365363d8df88c757d0  ...    None   None    None   None  None
    1      L02-1310                                                                                                                8220988  8d5e31610bc82c2abc86bc20ceba684c97e66024  ...    None   None    None   None  None
    2      R13-1042  Thread disentanglement is the task of separati...  Thread disentanglement is the task of separati...         16703040  3eb736b17a5acb583b9a9bd99837427753632cdb  ...    None   None    None   None  None
    3      W05-0819  In this paper, we describe a word alignment al...  In this paper, we describe a word alignment al...          1215281  b20450f67116e59d1348fc472cfc09f96e348f55  ...    None   None    None   None  None
    4      L02-1309                                                                                                               18078432  011e943b64a78dadc3440674419821ee080f0de3  ...    None   None    None   None  None
    ...         ...                                                ...                                                ...              ...                                       ...  ...     ...    ...     ...    ...   ...
    73280  P99-1002  This paper describes recent progress and the a...  This paper describes recent progress and the a...           715160  ab17a01f142124744c6ae425f8a23011366ec3ee  ...    None   None    None   None  None
    73281  P00-1009  We present an LFG-DOP parser which uses fragme...  We present an LFG-DOP parser which uses fragme...          1356246  ad005b3fd0c867667118482227e31d9378229751  ...    None   None    None   None  None
    73282  P99-1056  The processes through which readers evoke ment...  The processes through which readers evoke ment...          7277828  924cf7a4836ebfc20ee094c30e61b949be049fb6  ...    None   None    None   None  None
    73283  P99-1051  This paper examines the extent to which verb d...  This paper examines the extent to which verb d...          1829043  6b1f6f28ee36de69e8afac39461ee1158cd4d49a  ...    None   None    None   None  None
    73284  P00-1013  Spoken dialogue managers have benefited from u...  Spoken dialogue managers have benefited from u...         10903652  483c818c09e39d9da47103fbf2da8aaa7acacf01  ...    None   None    None   None  None

    [73285 rows x 21 columns]
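
Once the dataframe is loaded, typical first steps are to drop rows with no extracted text and to count papers per year. The sketch below runs on a small toy frame that only reuses the column names from the table; the row values are invented for illustration and are not taken from the corpus. It assumes `year` may be stored as a string and that missing text may appear as an empty string or a null, neither of which the README confirms. `with_full_text` and `papers_per_year` are hypothetical helper names.

```python
import pandas as pd


def with_full_text(df):
    """Keep only rows whose `full_text` is a non-empty string."""
    text = df["full_text"].fillna("").astype(str).str.strip()
    return df[text != ""]


def papers_per_year(df):
    """Count rows per year, skipping rows whose `year` cannot be parsed as a number."""
    years = pd.to_numeric(df["year"], errors="coerce").dropna().astype(int)
    return years.value_counts().sort_index()


# Toy stand-in for the real dataframe (illustrative values only).
toy = pd.DataFrame(
    {
        "acl_id": ["X00-0001", "X00-0002", "X01-0001", "X01-0002"],
        "year": ["2000", "2000", "2001", None],
        "full_text": ["Some extracted text.", "", "More extracted text.", None],
    }
)

usable = with_full_text(toy)
counts = papers_per_year(toy)
```

With the real data, `toy` would be replaced by the `df` returned from `pd.read_parquet`.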

The ACL IDs provided here also match those used by the S2 API:

https://api.semanticscholar.org/graph/v1/paper/ACL:P83-1025

Using the API, you can retrieve more details about each paper in the corpus.
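
A lookup URL for any row can be built from its `acl_id` by following the pattern of the example URL above. This is a minimal sketch: `s2_paper_url` is a hypothetical helper, and the optional `fields` query parameter and the field names passed to it are assumptions that should be checked against the Semantic Scholar API documentation. The function only builds the URL and makes no network request.

```python
from urllib.parse import quote, urlencode

S2_PAPER_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper/"


def s2_paper_url(acl_id, fields=None):
    """Build the S2 lookup URL for an ACL ID, e.g. 'P83-1025'."""
    # Keep ':' unescaped so the 'ACL:' prefix looks like the example URL.
    url = S2_PAPER_ENDPOINT + quote(f"ACL:{acl_id}", safe=":")
    if fields:
        # Assumed query parameter: a comma-separated list of field names.
        url += "?" + urlencode({"fields": ",".join(fields)}, safe=",")
    return url


example = s2_paper_url("P83-1025")
```

The resulting string can be passed to any HTTP client; rate limits and API keys are outside the scope of this sketch.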

Text generation on Hugging Face

We fine-tuned the distilgpt2 model from Hugging Face on the full text of this corpus. The model is trained for generation tasks.

Text generation demo: https://huggingface.co/shaurya0512/distilgpt2-finetune-acl22

Example:

    >>> from transformers import AutoTokenizer, AutoModelForCausalLM
    >>> tokenizer = AutoTokenizer.from_pretrained("shaurya0512/distilgpt2-finetune-acl22")
    >>> model = AutoModelForCausalLM.from_pretrained("shaurya0512/distilgpt2-finetune-acl22")
    >>>
    >>> input_context = "We introduce a new language representation"
    >>> input_ids = tokenizer.encode(input_context, return_tensors="pt")  # encode input context
    >>> outputs = model.generate(
    ...     input_ids=input_ids, max_length=128, temperature=0.7, repetition_penalty=1.2
    ... )  # generate sequences
    >>> print(f"Generated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

    Generated: We introduce a new language representation for the task of sentiment classification. We propose an approach to learn representations from
    unlabeled data, which is based on supervised learning and can be applied in many applications such as machine translation (MT) or information retrieval
    systems where labeled text has been used by humans with limited training time but no supervision available at all. Our method achieves state-oftheart
    results using only one dataset per domain compared to other approaches that use multiple datasets simultaneously, including BERTScore(Devlin et al.,
    2019; Liu & Lapata, 2020b ) ; RoBERTa+LSTM + L2SRC -

TODO

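~~Link the ACL corpus to Semantic Scholar (S2) and sources such as S2ORC.~~
Extract figures and captions from the ACL corpus using pdffigures - Scientific-figure-captioning
Set up a release schedule to keep the corpus up to date.
ACL citation graph
~~Enhance the metadata with bib file mapping - include authors~~
~~Add citation counts for papers~~
Use ForeCite to extract impactful keywords from the corpus
Link datasets using paperswithcode? - not sure how useful this would be
Prepare statistics about the data (linguistic diversity, geographic diversity), with an explorer if possible
Zero-shot classification

We hope this corpus will be useful for analyses relevant to the ACL community.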

Please cite/star this page if you use this corpus.

Citing the ACL Anthology Corpus

If you use this corpus in your research, please use the following BibTeX entry:

    @Misc{acl_anthology_corpus,
        author =       {Shaurya Rohatgi},
        title =        {ACL Anthology Corpus with Full Text},
        howpublished = {Github},
        year =         {2022},
        url =          {https://github.com/shauryr/ACL-anthology-corpus}
    }