ACL OCL Corpus: Advancing Open Science in Computational Linguistics

This repository provides full text and metadata for the ACL Anthology collection (80k articles/posters as of September 2022), including the .pdf files and the GROBID extractions of those PDFs.

How is this different from what the ACL Anthology and existing resources already provide?

We provide PDFs, full text, references, and other details extracted by GROBID from the PDFs, whereas the ACL Anthology provides only abstracts.
There is a similar corpus called the ACL Anthology Network, but it is showing its age: it contains only 23k papers as of December 2016.

UPDATE

The data is now hosted on Hugging Face! Please download it from there; that copy is the most up to date. https://huggingface.co/datasets/ACL-OCL/acl-anthology-corpus
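As a rough sketch of fetching a file from that dataset repository, the snippet below builds a direct download URL and, optionally, downloads through the `huggingface_hub` package. The file name is borrowed from the pandas example in this README and is only an assumption; check the dataset page for the actual file listing. The helper names `hf_resolve_url` and `download_from_hub` are hypothetical.

```python
from urllib.parse import quote

# Dataset repository ID taken from the Hugging Face URL in this README.
HF_DATASET_REPO = "ACL-OCL/acl-anthology-corpus"


def hf_resolve_url(filename: str, repo_id: str = HF_DATASET_REPO, revision: str = "main") -> str:
    """Build a direct download URL for a file in a Hugging Face dataset repo."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{quote(filename)}"


def download_from_hub(filename: str, repo_id: str = HF_DATASET_REPO) -> str:
    """Download a file into the local Hugging Face cache and return its path.

    Requires the third-party `huggingface_hub` package, imported lazily so
    that the URL helper above works without it.
    """
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")


# Assumed file name; it may differ from what the dataset repository contains.
print(hf_resolve_url("acl-publication-info.74k.parquet"))
```

Calling `download_from_hub(...)` performs a network request, so it is left out of the snippet's top-level code.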

The goal is to keep this corpus updated and to provide a comprehensive repository of the full ACL collection.

This repository provides data for 80,013 ACL articles/posters:

All PDFs in the ACL Anthology: size 45G, download here
All bib files in the ACL Anthology, with abstracts: size 172M, download here
Raw GROBID extraction results for all ACL Anthology PDFs, including full text and references: size 3.6G, download here
Dataframe with extracted metadata (see the table below for details) and the full text of the collection, for analysis: size 489M, download here

Column name	Description
`acl_id`	unique ACL ID
`abstract`	abstract extracted by GROBID
`full_text`	full text extracted by GROBID
`corpus_paper_id`	Semantic Scholar ID
`pdf_hash`	SHA-1 hash of the PDF
`numcitedby`	number of citations from S2
`url`	link to the publication
`publisher`	-
`address`	address of the conference
`year`	-
`month`	-
`booktitle`	-
`author`	list of authors
`title`	title of the paper
`pages`	-
`doi`	-
`number`	-
`volume`	-
`journal`	-
`editor`	-
`isbn`	-
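Since `pdf_hash` is described as the SHA-1 hash of the PDF, a downloaded file can in principle be checked against it. The sketch below assumes the hash was computed over the raw bytes of the PDF file, which this README does not state explicitly; the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of_bytes(data: bytes) -> str:
    """Return the hexadecimal SHA-1 digest of a byte string."""
    return hashlib.sha1(data).hexdigest()


def pdf_matches_hash(pdf_path: str, expected_sha1: str) -> bool:
    """Compare a local PDF's SHA-1 digest with a `pdf_hash` value.

    Assumption: `pdf_hash` was computed over the unmodified file bytes.
    """
    actual = sha1_of_bytes(Path(pdf_path).read_bytes())
    return actual == expected_sha1.strip().lower()
```

A mismatch does not necessarily mean a corrupt download; it may only mean the PDF was re-hosted or regenerated after the hash was recorded.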

    >>> import pandas as pd
    >>> df = pd.read_parquet('acl-publication-info.74k.parquet')
    >>> df
             acl_id                                           abstract                                          full_text  corpus_paper_id                                  pdf_hash  ...  number volume journal editor  isbn
    0      O02-2002  There is a need to measure word similarity whe...  There is a need to measure word similarity whe...         18022704  0b09178ac8d17a92f16140365363d8df88c757d0  ...    None   None    None   None  None
    1      L02-1310                                                                                                                8220988  8d5e31610bc82c2abc86bc20ceba684c97e66024  ...    None   None    None   None  None
    2      R13-1042  Thread disentanglement is the task of separati...  Thread disentanglement is the task of separati...         16703040  3eb736b17a5acb583b9a9bd99837427753632cdb  ...    None   None    None   None  None
    3      W05-0819  In this paper, we describe a word alignment al...  In this paper, we describe a word alignment al...          1215281  b20450f67116e59d1348fc472cfc09f96e348f55  ...    None   None    None   None  None
    4      L02-1309                                                                                                               18078432  011e943b64a78dadc3440674419821ee080f0de3  ...    None   None    None   None  None
    ...         ...                                                ...                                                ...              ...                                       ...  ...     ...    ...     ...    ...   ...
    73280  P99-1002  This paper describes recent progress and the a...  This paper describes recent progress and the a...           715160  ab17a01f142124744c6ae425f8a23011366ec3ee  ...    None   None    None   None  None
    73281  P00-1009  We present an LFG-DOP parser which uses fragme...  We present an LFG-DOP parser which uses fragme...          1356246  ad005b3fd0c867667118482227e31d9378229751  ...    None   None    None   None  None
    73282  P99-1056  The processes through which readers evoke ment...  The processes through which readers evoke ment...          7277828  924cf7a4836ebfc20ee094c30e61b949be049fb6  ...    None   None    None   None  None
    73283  P99-1051  This paper examines the extent to which verb d...  This paper examines the extent to which verb d...          1829043  6b1f6f28ee36de69e8afac39461ee1158cd4d49a  ...    None   None    None   None  None
    73284  P00-1013  Spoken dialogue managers have benefited from u...  Spoken dialogue managers have benefited from u...         10903652  483c818c09e39d9da47103fbf2da8aaa7acacf01  ...    None   None    None   None  None

    [73285 rows x 21 columns]
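Some rows in the output above have an empty abstract and full text, so a first analysis step is often to filter those out. The following is a minimal sketch on a made-up stand-in frame that reuses column names from the table; the values are invented, and the assumption that `year` is stored as a string may not hold for the real file. `summarize_by_year` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for the real dataframe; the rows are invented for illustration.
toy_df = pd.DataFrame(
    {
        "acl_id": ["X00-0001", "X00-0002", "X01-0001"],
        "year": ["2000", "2000", "2001"],
        "full_text": ["Some extracted text.", "", "More extracted text."],
        "numcitedby": [10, 0, 3],
    }
)


def summarize_by_year(df: pd.DataFrame) -> pd.DataFrame:
    """Count papers with non-empty full text and sum their citations per year."""
    has_text = df["full_text"].fillna("").str.strip().ne("")
    return (
        df[has_text]
        .groupby("year")
        .agg(papers=("acl_id", "count"), citations=("numcitedby", "sum"))
        .reset_index()
    )


summary = summarize_by_year(toy_df)
print(summary)
```

With the real corpus, the same function could be applied to the frame returned by `pd.read_parquet`, provided the column types match these assumptions.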

The ACL IDs provided here are also consistent with the Semantic Scholar (S2) API:

https://api.semanticscholar.org/graph/v1/paper/ACL:P83-1025

The API can be used to fetch more information about each paper in the corpus.
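A small sketch of such a lookup, using only the standard library: it builds the URL in the form shown above from an `acl_id` and optionally requests specific fields through the API's `fields` query parameter. The helper names and the default field selection are illustrative choices, not part of this repository.

```python
import json
import urllib.parse
import urllib.request

S2_GRAPH_PAPER_URL = "https://api.semanticscholar.org/graph/v1/paper/"


def s2_url_for_acl_id(acl_id: str, fields=("title", "year", "citationCount")) -> str:
    """Build the S2 Graph API URL for a paper identified by its ACL ID."""
    url = S2_GRAPH_PAPER_URL + "ACL:" + urllib.parse.quote(acl_id)
    if fields:
        # Keep commas unescaped so the URL stays readable.
        url += "?" + urllib.parse.urlencode({"fields": ",".join(fields)}, safe=",")
    return url


def fetch_paper(acl_id: str, timeout: float = 10.0) -> dict:
    """Fetch paper metadata as a dict; performs a network request."""
    with urllib.request.urlopen(s2_url_for_acl_id(acl_id), timeout=timeout) as resp:
        return json.load(resp)


print(s2_url_for_acl_id("P83-1025"))
```

`fetch_paper` is not called here because it needs network access, and the public API may rate-limit unauthenticated requests.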

Text generation on Hugging Face

We fine-tuned the distilgpt2 model from Hugging Face on the full text from this corpus. The model was trained for the text generation task.

Text generation demo: https://huggingface.co/shaurya0512/distilgpt2-finetune-acl22

Example:

    >>> from transformers import AutoTokenizer, AutoModelForCausalLM
    >>> tokenizer = AutoTokenizer.from_pretrained("shaurya0512/distilgpt2-finetune-acl22")
    >>> model = AutoModelForCausalLM.from_pretrained("shaurya0512/distilgpt2-finetune-acl22")
    >>>
    >>> input_context = "We introduce a new language representation"
    >>> input_ids = tokenizer.encode(input_context, return_tensors="pt")  # encode input context
    >>> outputs = model.generate(
    ...     input_ids=input_ids, max_length=128, temperature=0.7, repetition_penalty=1.2
    ... )  # generate sequences
    >>> print(f"Generated: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

    Generated: We introduce a new language representation for the task of sentiment classification. We propose an approach to learn representations from
    unlabeled data, which is based on supervised learning and can be applied in many applications such as machine translation (MT) or information retrieval
    systems where labeled text has been used by humans with limited training time but no supervision available at all. Our method achieves state-oftheart
    results using only one dataset per domain compared to other approaches that use multiple datasets simultaneously, including BERTScore(Devlin et al.,
    2019; Liu & Lapata, 2020b ) ; RoBERTa+LSTM + L2SRC -

TODO

~~Link the ACL corpus to Semantic Scholar (S2), sources like S2ORC~~
Extract figures and captions from the ACL corpus using pdffigures - scientific-image-captioning
Have a release schedule to keep the corpus updated.
ACL citation graph
~~Enhance metadata with bib file mapping - include authors~~
~~Add citation counts for papers~~
Use ForeCite to extract impactful keywords from the corpus
Link datasets using paperswithcode? - not sure how useful this would be
Have some statistics about the data - linguistic diversity; geographic diversity; if possible, an explorer
Zero-shot classification

We hope this corpus can be helpful for analyses relevant to the ACL community.

Please cite/star this page if you use this corpus.

Citing the ACL Anthology Corpus

If you use this corpus in your research, please use the following BibTeX entry:

    @Misc{acl_anthology_corpus,
        author =       {Shaurya Rohatgi},
        title =        {ACL Anthology Corpus with Full Text},
        howpublished = {Github},
        year =         {2022},
        url =          {https://github.com/shauryr/ACL-anthology-corpus}
    }