VisEval is a benchmark designed to evaluate visualization generation methods. In this repository, we provide both the toolkit to support the benchmarking, as well as the data used for benchmarks.
VisEval evaluates generated visualizations from three dimensions:
pip install --upgrade vis-evaluator
# or `git clone https://github.com/microsoft/VisEval.git && cd VisEval && pip install --upgrade -e .`
To access the dataset, please follow these steps:
For additional information about the dataset, please refer to the dataset documentation.
After installation, you can use VisEval by referring to examples/evaluate.py
or a follow:
Agent
Class. You can find three examples in the examples/agent
directory.from viseval.agent import Agent, ChartExecutionResult
class YourAgent(Agent):
def __init__(self, llm):
self.llm = llm
def generate(
self, nl_query: str, tables: list[str], config: dict
) -> Tuple[str, dict]:
"""Generate code for the given natural language query."""
pass
def execute(
self, code: str, context: dict, log_name: str = None
) -> ChartExecutionResult:
"""Execute the given code with context and return the result"""
pass
evaluator = Evaluator(webdriver_path, vision_model)
(You can configure the Evaluator without a webdriver and vision model, in which case the evaluation of the readability of the generated visualizations will be skipped.)
Install webdriver.
# download
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
# install
apt install google-chrome-stable_current_amd64.deb
# verify
google-chrome --version
Load vision model (e.g., GPT4-v).
from langchain_openai import AzureChatOpenAI
import dotenv
# Copy .env.example to .env and put your API keys in the file.
dotenv.load_dotenv()
vision_model = AzureChatOpenAI(
model_name="gpt-4-turbo-v",
max_retries=999,
temperature=0.0,
request_timeout=20,
max_tokens=4096,
)
from viseval import Dataset
# Configure dataset with the benchmark dataset folder path ( folder),
# specify the number of tables required to generate visualizations (table_type`: all, single, or multiple),
# and indicate whether to include irrelevant tables (`with_irrelevant_tables`).
dataset = Dataset(folder, table_type, with_irrelevant_tables)
config = {"library": args.library}
result = evaluator.evaluate(agent, dataset, config)
score = result.score()
print(f"Score: {score}")
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct.For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
This project has adopted the Microsoft Privacy Statement.
If you find that VisEval helps your research, please consider citing it:
@misc{chen2024viseval,
title={VisEval: A Benchmark for Data Visualization in the Era of Large Language Models},
author={Nan Chen and Yuge Zhang and Jiahang Xu and Kan Ren and Yuqing Yang},
year={2024},
eprint={2407.00981},
archivePrefix={arXiv},
primaryClass={cs.HC},
}