ClearML Agent - MLOps/LLMOps made easy
MLOps/LLMOps scheduler & orchestration solution supporting Linux, macOS and Windows
ClearML is open-source - Leave a star to support the project!
It is a zero configuration fire-and-forget execution agent, providing a full ML/DL cluster solution.
Full Automation in 5 steps
pip install clearml-agent (install the ClearML Agent on any GPU machine: on-premises / cloud / ...)
"All the Deep/Machine-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"
Try ClearML now: Self Hosted or Free tier Hosting
The ClearML Agent was built to address the DL/ML R&D DevOps needs:
Using the ClearML Agent, you can now set up a dynamic cluster with *epsilon DevOps
*epsilon - Because nothing is really zero work
We think Kubernetes is awesome, but it is not a must to get started with remote execution agents and cluster management.
We designed clearml-agent so you can run both bare-metal and on top of Kubernetes, in any combination that fits your environment.
You can find the Dockerfiles in the docker folder and the helm Chart in https://github.com/allegroai/clearml-helm-charts
Run the agent in Kubernetes Glue mode and map ClearML jobs directly to K8s jobs:
Yes! Slurm integration is available; check the documentation for further details.
Full scale HPC with a click of a button
The ClearML Agent is a job scheduler that listens on job queue(s), pulls jobs, sets the job environments, executes the job and monitors its progress.
Any 'Draft' experiment can be scheduled for execution by a ClearML agent.
A previously run experiment can be put into 'Draft' state by either of two methods:
An experiment is scheduled for execution using the 'Enqueue' action from the experiment right-click context menu in the ClearML UI and selecting the execution queue.
See creating an experiment and enqueuing it for execution.
Once an experiment is enqueued, it will be picked up and executed by a ClearML Agent monitoring this queue.
The ClearML UI Workers & Queues page provides ongoing execution information:
The ClearML Agent executes experiments using the following process:
pip install clearml-agent
The full interface and capabilities are listed with:
clearml-agent --help
clearml-agent daemon --help
clearml-agent init
Note: The ClearML Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default ClearML Agent cache folder is ~/.clearml.
See full details in your configuration file at ~/clearml.conf.
Note: The ClearML Agent extends the ClearML configuration file ~/clearml.conf. They are designed to share the same configuration file, see example here
For debug and experimentation, start the ClearML agent in foreground mode, where all the output is printed to screen:
clearml-agent daemon --queue default --foreground
For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe).
Notice: with the --detached flag, the clearml-agent will run in the background
clearml-agent daemon --detached --queue default
GPU allocation is controlled via the standard OS environment variable NVIDIA_VISIBLE_DEVICES or the --gpus flag (or disabled with --cpu-only).
If no flag is set and the NVIDIA_VISIBLE_DEVICES variable doesn't exist, all GPUs will be allocated for the clearml-agent.
If the --cpu-only flag is set, or NVIDIA_VISIBLE_DEVICES="none", no GPU will be allocated for the clearml-agent.
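As a minimal sketch of the environment-variable route described above: the two function names are hypothetical, and the second function assumes that setting NVIDIA_VISIBLE_DEVICES to a GPU index restricts the agent to that GPU, following the usual NVIDIA convention.

```shell
# Hypothetical helper: start a foreground agent with no GPU allocated,
# using NVIDIA_VISIBLE_DEVICES="none" instead of the --cpu-only flag.
start_cpu_only_agent() {
  NVIDIA_VISIBLE_DEVICES=none clearml-agent daemon --queue default --foreground
}

# Hypothetical helper: start a foreground agent that should only see GPU 0.
# Assumption: a single index in NVIDIA_VISIBLE_DEVICES limits the agent to that GPU.
start_gpu0_agent() {
  NVIDIA_VISIBLE_DEVICES=0 clearml-agent daemon --queue default --foreground
}

# Usage (pick one):
#   start_cpu_only_agent
#   start_gpu0_agent
```

The variable is set only for the single clearml-agent invocation, so the rest of the shell session keeps its original GPU visibility.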
Example: spin two agents, one per GPU on the same machine:
Notice: with the --detached flag, the clearml-agent will run in the background
clearml-agent daemon --detached --gpus 0 --queue default
clearml-agent daemon --detached --gpus 1 --queue default
Example: spin two agents, pulling from a dedicated dual_gpu queue, two GPUs per agent:
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu
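The per-GPU examples above can be generalized with a small loop. This is only a sketch: start_agents_per_gpu is a hypothetical helper name, and it assumes the GPU indices you pass actually exist on the machine.

```shell
# Hypothetical helper: start one detached agent per GPU spec, all pulling
# from the same queue.
# Usage: start_agents_per_gpu <queue> <gpu_spec>...
start_agents_per_gpu() {
  queue="$1"
  shift
  for gpus in "$@"; do
    # Each spec is passed to --gpus unchanged, so "0" and "0,1" both work.
    clearml-agent daemon --detached --gpus "$gpus" --queue "$queue"
  done
}

# Usage:
#   start_agents_per_gpu default 0 1          # one agent per GPU
#   start_agents_per_gpu dual_gpu 0,1 2,3     # two GPUs per agent
```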
For debug and experimentation, start the ClearML agent in foreground mode, where all the output is printed to screen:
clearml-agent daemon --queue default --docker --foreground
For actual service mode, all the stdout will be stored automatically into a file (no need to pipe).
Notice: with the --detached flag, the clearml-agent will run in the background
clearml-agent daemon --detached --queue default --docker
Example: spin two agents, one per GPU on the same machine, with default nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 docker:
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
Example: spin two agents, pulling from a dedicated dual_gpu queue, two GPUs per agent, with default nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 docker:
clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
Priority Queues are also supported, example use case:
High priority queue: important_jobs, low priority queue: default
clearml-agent daemon --queue important_jobs default
The ClearML Agent will first try to pull jobs from the important_jobs queue, and only if it is empty, the agent will try to pull from the default queue.
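The pull order can be pictured with a toy model. This is not ClearML code: pick_queue and the queue_<name> variables are hypothetical, and each queue is simply a string of job names.

```shell
# Toy model of priority pulling: print the first non-empty queue,
# checking the queues in the order they were listed.
pick_queue() {
  for q in "$@"; do
    # Read the hypothetical variable queue_<name> that holds the pending jobs.
    eval "pending=\${queue_$q}"
    if [ -n "$pending" ]; then
      echo "$q"
      return 0
    fi
  done
  return 1  # every listed queue is empty
}

queue_important_jobs=""
queue_default="job_a job_b"
pick_queue important_jobs default   # prints: default
```

As soon as important_jobs holds a job, the same call prints important_jobs instead, mirroring the order given to --queue.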
Adding queues, managing job order within a queue, and moving jobs between queues are all available using the Web UI; see the example on our free server
To stop a ClearML Agent running in the background, run the same command line used to start the agent with --stop appended. For example, to stop the first of the single-GPU agents started on the same machine above:
clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 --stop
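Because the stop command must repeat the start command exactly, one option is to keep the arguments in a single place. A minimal sketch, where gpu0_agent is a hypothetical wrapper name:

```shell
# Hypothetical wrapper: the start arguments live in one function, and any
# extra arguments (such as --stop) are appended to the same command line.
gpu0_agent() {
  clearml-agent daemon --detached --gpus 0 --queue default \
    --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 "$@"
}

# Usage:
#   gpu0_agent          # start the agent in the background
#   gpu0_agent --stop   # stop that same agent
```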
Integrate ClearML with your code
Execute the code on your machine (Manually / PyCharm / Jupyter Notebook)
As your code is running, ClearML creates an experiment logging all the necessary execution information:
You now have a 'template' of your experiment with everything required for automated execution
In the ClearML UI, right-click on the experiment and select 'clone'. A copy of your experiment will be created.
You now have a new draft experiment cloned from your original experiment; feel free to edit it
Schedule the newly created experiment for execution: right-click the experiment and select 'enqueue'
ClearML-Agent Services is a special mode of ClearML-Agent that provides the ability to launch long-lasting jobs that previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks) for different use cases:
ClearML-Agent Services mode will spin up any task enqueued into the specified queue. Every task launched by ClearML-Agent Services will be registered as a new node in the system, providing tracking and transparency capabilities. Currently, clearml-agent in services mode supports CPU-only configurations. ClearML-Agent services mode can be launched alongside GPU agents.
clearml-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
Note: It is the user's responsibility to make sure the proper tasks are pushed into the specified queue.
The ClearML Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the ClearML package.
Sample AutoML & Orchestration examples can be found in the ClearML example/automation folder.
AutoML examples:
Experiment Pipeline examples:
Apache License, Version 2.0 (see the LICENSE for more information)