`aanrelease2013.tar.gz` is a mess.
The `papers_text/` files (plain text extracted from the PDF) have issues; see, e.g., `P00-1032`, `W06-3709`, and `T75-2033`. Some are unusable, e.g., `J79-1013`; `C73-2029` and `L08-1302` are also affected.

The `Makefile` declaratively provides some documentation of the issues and the cleanup work involved.
This repository does not contain any of the original data, only a programmatic description of how to fix it.
To run, call `make` in the root directory.
The University of Michigan CLAIR Group's ACL Anthology Network interface reports the following statistics:
Measure | Value |
---|---|
Number of papers | 21,212 |
Number of authors | 17,792 |
Number of venues | 342 |
Number of paper citations | 110,975 |
Number of author collaborations | 142,450 |
Citation network diameter | 22 |
Collaboration network diameter | 15 |
Some of these are inaccurate, or describe only one of the data sources.
Different sources in the dataset contain different subsets of the data; for example, citations are reported for some papers that do not have a corresponding `papers_text/` file (e.g., `L08-1098`).
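One way to surface this kind of mismatch is a plain set difference between the paper IDs seen in the citation data and the file names present in `papers_text/`. The sketch below is illustrative only and is not taken from the `Makefile`; the function name `ids_missing_text` and the assumption that each text file is named `<ID>.txt` are hypothetical.

```python
from pathlib import Path


def ids_missing_text(paper_ids, text_dir):
    """Return the sorted paper IDs that have no <ID>.txt file in text_dir.

    Assumption: text files are named exactly <ID>.txt (e.g. L08-1098.txt).
    """
    have_text = {path.stem for path in Path(text_dir).glob("*.txt")}
    return sorted(set(paper_ids) - have_text)
```

Calling `ids_missing_text(ids_from_citations, "aan/papers_text")` would then list the papers that are cited (or citing) but have no extracted text.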
aan/release/2013/acl.txt
Measure | Value |
---|---|
citing→cited relationships | 110,930 |
unique citing papers | 16,554 |
avg. cited per citing | 6.7011 |
unique cited papers | 12,840 |
avg. citing per cited | 8.6394 |
unique papers | 18,160 |
unique papers that both cite and are cited | 11,234 |
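
The measures above can be derived from an edge list alone. The following is a minimal sketch, assuming each line of `acl.txt` has the form `CITING ==> CITED` (the separator is an assumption about the file format) and that duplicate edges should be counted once; `citation_stats` is a hypothetical name, not part of this repository.

```python
def citation_stats(lines, sep="==>"):
    """Compute basic citation-network measures from 'citing ==> cited' lines."""
    edges = set()  # a set, so repeated lines count as one relationship
    for line in lines:
        if sep not in line:
            continue  # skip blank or malformed lines
        citing, cited = (part.strip() for part in line.split(sep, 1))
        if citing and cited:
            edges.add((citing, cited))

    citing_ids = {citing for citing, _ in edges}
    cited_ids = {cited for _, cited in edges}
    return {
        "relationships": len(edges),
        "unique_citing": len(citing_ids),
        "unique_cited": len(cited_ids),
        "avg_cited_per_citing": len(edges) / len(citing_ids) if citing_ids else 0.0,
        "avg_citing_per_cited": len(edges) / len(cited_ids) if cited_ids else 0.0,
        "unique_papers": len(citing_ids | cited_ids),
        "both_cite_and_cited": len(citing_ids & cited_ids),
    }
```

The table is internally consistent with these definitions: 16,554 + 12,840 − 11,234 = 18,160 unique papers, and 110,930 / 16,554 ≈ 6.7011.
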
Top 10 most-cited papers | # of papers citing | authors | title |
---|---|---|---|
J93-2004 | 928 | Marcus et al. | Building A Large Annotated Corpus Of English: The Penn Treebank |
P02-1040 | 891 | Papineni et al. | Bleu: A Method For Automatic Evaluation Of Machine Translation |
J93-2003 | 729 | Brown et al. | The Mathematics Of Statistical Machine Translation: Parameter Estimation |
P03-1021 | 667 | Och | Minimum Error Rate Training In Statistical Machine Translation |
J03-1002 | 656 | Och & Ney | A Systematic Comparison Of Various Statistical Alignment Models |
P07-2045 | 591 | Koehn et al. | Moses: Open Source Toolkit for Statistical Machine Translation |
N03-1017 | 556 | Koehn et al. | Statistical Phrase-Based Translation |
P03-1054 | 394 | Klein & Manning | Accurate Unlexicalized Parsing |
J96-1002 | 376 | Berger et al. | A Maximum Entropy Approach To Natural Language Processing |
A00-2018 | 371 | Charniak | A Maximum-Entropy-Inspired Parser |
Top 10 most-citing papers | # of papers cited |
---|---|
P10-1142 | 88 |
J10-3003 | 80 |
W13-4917 | 71 |
W13-2201 | 65 |
J12-1006 | 62 |
J98-1001 | 59 |
J13-2003 | 59 |
J07-4004 | 57 |
J11-2002 | 52 |
D11-1108 | 52 |
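
Rankings like the two top-10 tables can be produced by counting each side of the citation edges. This is a sketch under the assumption that the edges are already available as `(citing, cited)` pairs; `top_cited_and_citing` is a hypothetical helper, and ties are broken by paper ID only to make the output deterministic.

```python
from collections import Counter


def top_cited_and_citing(edges, n=10):
    """Return (top cited, top citing) as lists of (paper_id, count) pairs."""
    unique_edges = set(edges)  # count each citing -> cited pair once
    cited_counts = Counter(cited for _, cited in unique_edges)
    citing_counts = Counter(citing for citing, _ in unique_edges)

    def rank(counts):
        # Highest count first; ties broken alphabetically by paper ID.
        return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]

    return rank(cited_counts), rank(citing_counts)
```
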
aan/release/2013/acl-metadata.txt
The formatting of this file is, frankly, befuddling.
The general structure is BibTeX-esque, but no BibTeX parser could possibly handle it.
Worse, the mixture of encodings is insane!
If `ftfy` was ever looking for a great real-world case study, this would be it.
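
As a rough illustration of coping with both problems, the sketch below decodes the file line by line (UTF-8 first, then Windows-1252, then Latin-1) and collects `key = {value}` lines into records. Everything about the record layout here is an assumption: one field per line, values wrapped in a single pair of braces, and a blank line or a new `id` field starting a new record. The names `decode_line` and `parse_records` are hypothetical, and this is not the approach the `Makefile` takes. Per-line fallback cannot repair a line that mixes encodings or text that was already mis-decoded upstream.

```python
import re

# Assumed field syntax: key = {value}, one field per line.
FIELD_RE = re.compile(r"^\s*(\w+)\s*=\s*\{(.*)\}\s*$")


def decode_line(raw):
    """Decode one line of bytes, trying UTF-8, then Windows-1252, then Latin-1."""
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")  # cannot fail: every byte maps to a code point


def parse_records(data):
    """Split raw bytes into a list of dicts, one dict per metadata record."""
    records, current = [], {}
    for raw in data.splitlines():
        line = decode_line(raw)
        match = FIELD_RE.match(line)
        if match:
            key, value = match.groups()
            if key == "id" and current:  # a new id starts a new record
                records.append(current)
                current = {}
            current[key] = value
        elif not line.strip() and current:  # a blank line ends the record
            records.append(current)
            current = {}
    if current:
        records.append(current)
    return records
```

Counting unique `author` values is then a matter of `len({r.get("author") for r in records})`.
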
The `author` field lists all authors for a paper (see, e.g., `W10-4238`); there are 16,308 unique `author` sequences.

aan/papers_text/???-????.txt
There are a lot of other files in this directory;
some of the papers are segmented into body and references sections;
there are some files that seem like they were intended to go in `aan/release/2013/`;
and many of the files that match this pattern are empty.
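
A quick survey of the directory can separate files that match the `???-????.txt` pattern from everything else and flag the empty ones. The sketch below is an assumption about how one might do this, not what the `Makefile` does; `survey_text_files` is a hypothetical name, and "empty" is taken to mean a size of zero bytes.

```python
from fnmatch import fnmatchcase
from pathlib import Path

PAPER_PATTERN = "???-????.txt"  # glob pattern taken from the section heading


def survey_text_files(text_dir):
    """Classify files in text_dir as matching the paper pattern, empty, or other."""
    matching, empty, other = [], [], []
    for path in sorted(Path(text_dir).iterdir()):
        if not path.is_file():
            continue
        if fnmatchcase(path.name, PAPER_PATTERN):
            matching.append(path.name)
            if path.stat().st_size == 0:  # zero-byte file
                empty.append(path.name)
        else:
            other.append(path.name)
    return {"matching": matching, "empty": empty, "other": other}
```
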
Despite these flaws, the ACL Anthology Network is a great resource; many thanks to the many contributors.
Dragomir R. Radev, Pradeep Muthukrishnan, Vahed Qazvinian, Amjad Abu-Jbara. 2013. The ACL Anthology Network Corpus. Language Resources and Evaluation 47 (4), pp. 919–944. doi:10.1007/s10579-012-9211-2.
Copyright 2016–2018 Christopher Brown. MIT Licensed.