An open-source framework to evaluate, test and monitor ML and LLM-powered systems.
Documentation | Discord Community | Blog | Twitter | Evidently Cloud
Evidently 0.4.25. LLM evaluation -> Tutorial
Evidently is an open-source Python library for ML and LLM evaluation and observability. It helps evaluate, test, and monitor AI-powered systems and data pipelines from experimentation to production.
Evidently is very modular. You can start with one-off evaluations using Reports or Test Suites in Python, or get a real-time monitoring Dashboard service.
Reports compute various data, ML and LLM quality metrics. You can start with Presets or customize.
Test Suites check for defined conditions on metric values and return a pass or fail result.
You can set test conditions with parameters such as gt (greater than), lt (less than), etc.
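To make the pass/fail idea concrete, here is a minimal plain-Python sketch of how a gt or lt condition on a metric value maps to a result. It is not Evidently's implementation: the `check_condition` helper and the sample metric value are hypothetical and exist only for illustration.

```python
def check_condition(value, gt=None, lt=None):
    """Return True (pass) if value satisfies every condition that was given.

    Hypothetical helper for illustration; not part of Evidently.
    """
    if gt is not None and not value > gt:
        return False
    if lt is not None and not value < lt:
        return False
    return True


# Made-up metric value: the share of missing values in a column.
share_missing = 0.02

# The check passes only if the share is strictly less than 5%.
result = "pass" if check_condition(share_missing, lt=0.05) else "fail"
print(result)
```

Both bounds can be combined, for example `check_condition(value, gt=0, lt=1)`, which passes only when the value lies strictly between them.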
Monitoring UI service helps visualize metrics and test results over time.
You can choose between self-hosting the open-source version and using Evidently Cloud.
Evidently Cloud offers a generous free tier and extra features like user management, alerting, and no-code evals.
Evidently is available as a PyPI package. To install it using the pip package manager, run:
pip install evidently
To install Evidently using the conda installer, run:
conda install -c conda-forge evidently
This is a simple Hello World. Check the Tutorials for more: Tabular data or LLM evaluation.
Import the Test Suite, evaluation Preset and toy tabular dataset.
import pandas as pd
from sklearn import datasets
from evidently.test_suite import TestSuite
from evidently.test_preset import DataStabilityTestPreset
iris_data = datasets.load_iris(as_frame=True)
iris_frame = iris_data.frame
Split the DataFrame into reference and current. Run the Data Stability Test Suite, which automatically generates checks on column value ranges, missing values, etc. from the reference. Get the output in a Jupyter notebook:
data_stability = TestSuite(tests=[
    DataStabilityTestPreset(),
])
data_stability.run(current_data=iris_frame.iloc[:60], reference_data=iris_frame.iloc[60:], column_mapping=None)
data_stability
You can also save an HTML file. You'll need to open it from the destination folder.
data_stability.save_html("file.html")
To get the output as JSON:
data_stability.json()
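Once you have the JSON string, you can post-process it with the standard library, for example to list failed checks in a CI job. The sketch below runs on a hand-written sample payload: the `tests` list with `name` and `status` fields is an assumption about the output shape, the test names are made up, and `failed_tests` is a hypothetical helper. Inspect your own output before relying on this layout.

```python
import json

# Hand-written sample payload, NOT real Evidently output.
# The "tests" / "name" / "status" layout is an assumption for illustration.
sample_output = json.dumps({
    "tests": [
        {"name": "example check A", "status": "SUCCESS"},
        {"name": "example check B", "status": "FAIL"},
    ]
})


def failed_tests(json_text):
    """Return the names of tests whose status is anything other than SUCCESS."""
    payload = json.loads(json_text)
    return [
        test["name"]
        for test in payload.get("tests", [])
        if test.get("status") != "SUCCESS"
    ]


failed = failed_tests(sample_output)
print(failed)
```

In a pipeline, you could pass the string returned by `data_stability.json()` instead of `sample_output` and stop the run when the returned list is not empty.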
You can choose other Presets, individual Tests and set conditions.
Import the Report, evaluation Preset and toy tabular dataset.
import pandas as pd
from sklearn import datasets
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
iris_data = datasets.load_iris(as_frame=True)
iris_frame = iris_data.frame
Run the Data Drift Report that will compare column distributions between current and reference:
data_drift_report = Report(metrics=[
    DataDriftPreset(),
])
data_drift_report.run(current_data=iris_frame.iloc[:60], reference_data=iris_frame.iloc[60:], column_mapping=None)
data_drift_report
Save the report as HTML. You'll later need to open it from the destination folder.
data_drift_report.save_html("file.html")
To get the output as JSON:
data_drift_report.json()
You can choose other Presets and individual Metrics, including LLM evaluations for text data.
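To illustrate what comparing a column's distribution between reference and current can mean, here is a plain-Python sketch of the two-sample Kolmogorov-Smirnov statistic, one common distance between two empirical distributions. It is shown only as an illustration and is not claimed to be what the Preset runs on your data; `ks_statistic` is a hypothetical helper and the sample values are made up.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref = sorted(reference)
    cur = sorted(current)
    # The empirical CDFs only change at observed values, so checking those is enough.
    points = ref + cur
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in points
    )


# Made-up column values for illustration.
reference_values = [5.1, 4.9, 5.0, 5.4, 4.6]
shifted_values = [6.3, 6.5, 6.2, 6.9, 6.4]

same = ks_statistic(reference_values, reference_values)
shifted = ks_statistic(reference_values, shifted_values)
print(same, shifted)
```

A statistic near 0 means the two samples look alike, while a value near 1 means they barely overlap. A drift check would typically turn this statistic into a p-value or compare it with a threshold.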
This launches a demo project in the Evidently UI. Check tutorials for Self-hosting or Evidently Cloud.
Recommended step: create a virtual environment and activate it.
pip install virtualenv
virtualenv venv
source venv/bin/activate
After installing Evidently (pip install evidently), run the Evidently UI with the demo projects:
evidently ui --demo-projects all
Access the Evidently UI service in your browser at localhost:8000.
Evidently has 100+ built-in evals. You can also add custom ones. Each metric has an optional visualization: you can use it in Reports, Test Suites, or plot on a Dashboard.
Here are examples of things you can check:
Evaluation area | Examples of what you can check
---|---
Text descriptors | Length, sentiment, toxicity, language, special symbols, regular expression matches, etc.
LLM outputs | Semantic similarity, retrieval relevance, summarization quality, etc. with model- and LLM-based evals.
Data quality | Missing values, duplicates, min-max ranges, new categorical values, correlations, etc.
Data distribution drift | 20+ statistical tests and distance metrics to compare shifts in data distribution.
Classification | Accuracy, precision, recall, ROC AUC, confusion matrix, bias, etc.
Regression | MAE, ME, RMSE, error distribution, error normality, error bias, etc.
Ranking (inc. RAG) | NDCG, MAP, MRR, Hit Rate, etc.
Recommendations | Serendipity, novelty, diversity, popularity bias, etc.
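As a reminder of what the regression metrics listed above measure, here is a plain-Python sketch of ME (mean error), MAE (mean absolute error) and RMSE (root mean squared error). It uses the textbook definitions with error taken as prediction minus actual; this sign convention and the `regression_errors` helper are assumptions for illustration, not Evidently's implementation.

```python
import math


def regression_errors(actual, predicted):
    """Return (ME, MAE, RMSE) with error defined as predicted - actual."""
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    me = sum(errors) / n                             # signed: shows systematic bias
    mae = sum(abs(e) for e in errors) / n            # average error magnitude
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # penalizes large errors more
    return me, mae, rmse


# Made-up values for illustration.
actual = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 5.0]

me, mae, rmse = regression_errors(actual, predicted)
print(me, mae, round(rmse, 4))
```

Here ME equals MAE because every error is non-negative, which indicates that the made-up model only overestimates. A ME close to zero with a large MAE would instead point to errors in both directions.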
We welcome contributions! Read the Guide to learn more.
For more information, refer to the complete Documentation. You can start with the tutorials.
See more examples in the Docs.
Explore the How-to guides to understand specific features in Evidently.
If you want to chat and connect, join our Discord community!