25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are well-formed natural language questions.
http://goo.gl/language/query-wellformedness
Google's query wellformedness dataset was created by crowdsourcing well-formedness annotations for 25,100 queries from the Paralex corpus. Each query was rated by five annotators, each giving a binary (1/0) judgement of whether or not the query is well-formed. For further details, please read our paper: Identifying Well-formed Natural Language Questions
For each query, we provide the average of the five binary judgements as its wellformedness score. The following are some examples of queries in the dataset:
Query | Wellformedness rating |
---|---|
Which form of government is still in place in greece ? | 1.0 |
Population of owls just in north america ? | 0.0 |
Is johnny depp a celtic fan ? | 0.8 |
Where did Roald Dahl live in his teenaged years ? | 0.6 |
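As a minimal sketch of how such a score arises, the snippet below averages five binary judgements. The function name `wellformedness_score` and the individual ratings in the example are hypothetical, chosen for illustration only; the dataset itself ships only the averaged score.

```python
def wellformedness_score(ratings):
    """Return the mean of a list of binary (0/1) judgements."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (0, 1) for r in ratings):
        raise ValueError("ratings must be 0 or 1")
    return sum(ratings) / len(ratings)


# Hypothetical ratings: four of five raters marked the query well-formed.
example_score = wellformedness_score([1, 1, 1, 1, 0])
print(example_score)  # 0.8
```

With five raters, the score can only take the values 0.0, 0.2, 0.4, 0.6, 0.8 and 1.0, which matches the ratings shown in the examples above.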
The dataset is divided into three files, train.tsv, dev.tsv and test.tsv, each containing rated queries. The file sizes are as follows:
File | No. of queries |
---|---|
train.tsv | 17,500 |
dev.tsv | 3,750 |
test.tsv | 3,850 |
Each line in a file holds one example with two tab-separated columns:
Column | Content | Example |
---|---|---|
1 | Query | The European Union includes how many ? |
2 | Wellformedness rating | 0.2 |
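A rough sketch of reading this format with Python's standard `csv` module follows. The helper name `read_rated_queries` is hypothetical, and the code assumes the files have no header row and that each line is exactly a query followed by a rating, which is worth verifying against the actual files. An in-memory sample stands in for a real file so the snippet runs on its own.

```python
import csv
import io


def read_rated_queries(handle):
    """Parse tab-separated (query, rating) lines from an open text handle."""
    # QUOTE_NONE keeps any quote characters inside queries as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        query, rating = row[0], float(row[1])
        rows.append((query, rating))
    return rows


# In-memory sample built from queries shown in this README.
sample = io.StringIO(
    "The European Union includes how many ?\t0.2\n"
    "Is johnny depp a celtic fan ?\t0.8\n"
)
rows = read_rated_queries(sample)
for query, rating in rows:
    print(f"{rating:.1f}  {query}")
```

To read one of the real files, the same function could be called as `read_rated_queries(open("train.tsv", encoding="utf-8", newline=""))`, assuming the file sits in the working directory.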
If you use or discuss this dataset in your work, please cite our paper:
@InProceedings{FaruquiDas2018,
title = {{Identifying Well-formed Natural Language Questions}},
author = {Faruqui, Manaal and Das, Dipanjan},
booktitle = {Proc. of EMNLP},
year = {2018}
}
The query-wellformedness dataset is licensed under CC BY-SA 4.0. Any third party content or data is provided “As Is” without any warranty, express or implied.
If you have a technical question regarding the dataset or publication, please create an issue in this repository.