[ICML 2024] WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks? (arXiv:2403.07718)
WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks (arXiv:2407.05291)
WorkArena is a suite of browser-based tasks designed to gauge how effectively web agents support the routine tasks of knowledge workers.
Because it builds on the widely used ServiceNow platform, the benchmark helps assess the state of such automation in modern knowledge work environments.
WorkArena is included in BrowserGym, a conversational gym environment for the evaluation of web agents.
To set up WorkArena, you will need to get your own ServiceNow instance, install our Python package, and upload some data to your instance. Follow the steps below to achieve this.
Request an instance and select the Washington release (initializing the instance will take a few minutes).
Once the instance is ready, set the following environment variables:
SNOW_INSTANCE_URL: The URL of your ServiceNow developer instance
SNOW_INSTANCE_UNAME: The username, which should be "admin"
SNOW_INSTANCE_PWD: The password. Place the value in single quotes '' and be mindful of escaping special shell characters. Running echo $SNOW_INSTANCE_PWD should print the correct password.
Warning: Feel free to look around the platform, but please make sure you revert any changes (e.g., changes to list views, pinning some menus, etc.) as these changes will be persistent and affect the benchmarking process.
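Before running the installer, it can help to confirm that the three variables are visible to Python and that the password survives shell quoting. The following is a minimal sketch, not part of WorkArena: the helper names `missing_snow_vars` and `export_line` are hypothetical, and only the variable names come from the steps above.

```python
import os
import shlex

# Variable names taken from the setup steps; everything else is illustrative.
REQUIRED_VARS = ("SNOW_INSTANCE_URL", "SNOW_INSTANCE_UNAME", "SNOW_INSTANCE_PWD")


def missing_snow_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


def export_line(name, value):
    """Build a POSIX-shell export line with the value safely quoted."""
    # shlex.quote wraps values containing special characters in single quotes.
    return f"export {name}={shlex.quote(value)}"


missing = missing_snow_vars()
if missing:
    print("Missing variables:", ", ".join(missing))
else:
    print("All ServiceNow variables are set.")
```

Calling `export_line("SNOW_INSTANCE_PWD", password)` yields a line that can be pasted into a POSIX shell without characters such as `$` or `!` being expanded.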
Run the following command to install WorkArena in the BrowserGym environment:
pip install browsergym-workarena
Then, install Playwright:
playwright install
Finally, run this command in a terminal to upload the benchmark data to your ServiceNow instance:
workarena-install
Your installation is now complete!
At the moment, WorkArena-L1 includes 19,912 unique instances drawn from 33 tasks, referred to as "atomic" tasks, that cover the main components of the ServiceNow user interface. WorkArena++ contains 682 tasks, each sampled from thousands of potential configurations. It composes the atomic components introduced in WorkArena into real-world use cases that evaluate agents' planning, reasoning, and memorization abilities.
The following videos show an agent built on GPT-4-vision interacting with every atomic component of the benchmark. As our results show, this benchmark is not solved, so the agent does not always succeed.
Goal: The agent must search for specific information in the company knowledge base.
The agent interacts with the user via BrowserGym's conversational interface.
Goal: The agent must fill a complex form with specific values for each field.
Goal: The agent must order items with specific configurations from the company's service catalog.
Goal: The agent must filter a list according to some specifications.
In this example, the agent struggles to manipulate the UI and fails to create the filter.
Goal: The agent must navigate to a specific application using the main menu.
Goal: The agent must answer a question that requires reading charts and (optionally) performing simple reasoning over them.
Note: For demonstration purposes, a human is controlling the cursor, since this is a pure retrieval task.
Run this code to see WorkArena in action.
Note: the following example executes WorkArena's oracle (cheat) function to solve each task. To evaluate an agent, calls to env.step() must be used instead.
import random
from time import sleep

from browsergym.core.env import BrowserEnv
from browsergym.workarena import ALL_WORKARENA_TASKS

random.shuffle(ALL_WORKARENA_TASKS)
for task in ALL_WORKARENA_TASKS:
    print("Task:", task)

    # Instantiate a new environment
    env = BrowserEnv(task_entrypoint=task,
                     headless=False)
    env.reset()

    # Cheat functions use Playwright to automatically solve the task
    env.chat.add_message(role="assistant", msg="On it. Please wait...")
    cheat_messages = []
    env.task.cheat(env.page, cheat_messages)

    # Send cheat messages to chat
    for cheat_msg in cheat_messages:
        env.chat.add_message(role=cheat_msg["role"], msg=cheat_msg["message"])

    # Post solution to chat
    env.chat.add_message(role="assistant", msg="I'm done!")

    # Validate the solution
    reward, stop, message, info = env.task.validate(env.page, cheat_messages)
    if reward == 1:
        env.chat.add_message(role="user", msg="Yes, that works. Thanks!")
    else:
        env.chat.add_message(role="user", msg=f"No, that doesn't work. {info.get('message', '')}")

    sleep(3)
    env.close()
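To evaluate an agent rather than the oracle, the cheat calls are replaced by a loop over env.step(). The sketch below shows one way such a loop could look. It assumes the environment follows the Gymnasium convention, where reset() returns (obs, info) and step(action) returns (obs, reward, terminated, truncated, info). `run_episode`, `StubEnv`, and `noop_policy` are hypothetical names introduced here so the example runs offline; they are not part of BrowserGym or WorkArena, and the "noop()" action string is an assumed placeholder.

```python
def run_episode(env, policy, max_steps=10):
    """Run one episode with `policy` choosing actions; return the summed reward.

    Assumes a Gymnasium-style env: reset() -> (obs, info) and
    step(action) -> (obs, reward, terminated, truncated, info).
    """
    obs, info = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)  # hypothetical agent: observation -> action
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward


class StubEnv:
    """Stand-in environment (not part of BrowserGym) that succeeds after N steps."""

    def __init__(self, steps_to_success=3):
        self.steps_to_success = steps_to_success
        self.t = 0

    def reset(self):
        self.t = 0
        return {"step": 0}, {}

    def step(self, action):
        self.t += 1
        done = self.t >= self.steps_to_success
        return {"step": self.t}, (1.0 if done else 0.0), done, False, {}


def noop_policy(obs):
    """Placeholder policy that always returns the same action string."""
    return "noop()"  # assumed placeholder action


print(run_episode(StubEnv(), noop_policy))  # 1.0
```

With a real task, `env` would be created with BrowserEnv(task_entrypoint=task, headless=False) as in the example above, passed to `run_episode` together with an actual agent policy, and closed afterwards with env.close().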
The following example samples L2 tasks. Change the filter to l3 to sample L3 tasks.
from time import sleep

from browsergym.core.env import BrowserEnv
from browsergym.workarena import get_all_tasks_agents

# Each entry of the sampled set is indexed as (task, seed)
AGENT_L2_SAMPLED_SET = get_all_tasks_agents(filter="l2")
AGENT_L2_SAMPLED_TASKS, AGENT_L2_SEEDS = [sampled_set[0] for sampled_set in AGENT_L2_SAMPLED_SET], [
    sampled_set[1] for sampled_set in AGENT_L2_SAMPLED_SET
]

for (task, seed) in zip(AGENT_L2_SAMPLED_TASKS, AGENT_L2_SEEDS):
    print("Task:", task)

    # Instantiate a new environment
    env = BrowserEnv(task_entrypoint=task,
                     headless=False)
    env.reset()

    # Cheat functions use Playwright to automatically solve the task
    env.chat.add_message(role="assistant", msg="On it. Please wait...")

    # Solve and validate each subtask in order
    for i in range(len(env.task)):
        sleep(1)
        env.task.cheat(page=env.page, chat_messages=env.chat.messages, subtask_idx=i)
        sleep(1)
        reward, done, message, info = env.task.validate(page=env.page, chat_messages=env.chat.messages)

    if reward == 1:
        env.chat.add_message(role="user", msg="Yes, that works. Thanks!")
    else:
        env.chat.add_message(role="user", msg=f"No, that doesn't work. {info.get('message', '')}")

    sleep(3)
    env.close()
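Since the examples treat a reward of 1 as success, results collected over many (task, seed) pairs can be summarized as a per-task success rate. The sketch below is one possible piece of bookkeeping. The (task_name, seed, reward) record format and the `success_rates` helper are assumptions made for illustration, not WorkArena data structures, and the demo records are made up.

```python
from collections import defaultdict


def success_rates(records):
    """Map each task name to the fraction of its runs that ended with reward == 1.

    `records` is an iterable of (task_name, seed, reward) tuples
    (an assumed format for this sketch).
    """
    wins, runs = defaultdict(int), defaultdict(int)
    for task_name, _seed, reward in records:
        runs[task_name] += 1
        wins[task_name] += int(reward == 1)
    return {name: wins[name] / runs[name] for name in runs}


# Made-up records purely for demonstration.
demo = [("task_a", 0, 1), ("task_a", 1, 0), ("task_b", 0, 1)]
print(success_rates(demo))  # {'task_a': 0.5, 'task_b': 1.0}
```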
Please use the following BibTeX to cite our work:
@misc{workarena2024,
title={WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?},
author={Alexandre Drouin and Maxime Gasse and Massimo Caccia and Issam H. Laradji and Manuel Del Verme and Tom Marty and Léo Boisvert and Megh Thakkar and Quentin Cappart and David Vazquez and Nicolas Chapados and Alexandre Lacoste},
year={2024},
eprint={2403.07718},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{boisvert2024workarenacompositionalplanningreasoningbased,
title={WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks},
author={Léo Boisvert and Megh Thakkar and Maxime Gasse and Massimo Caccia and Thibault Le Sellier De Chezelles and Quentin Cappart and Nicolas Chapados and Alexandre Lacoste and Alexandre Drouin},
year={2024},
eprint={2407.05291},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2407.05291},
}