- Pulling metrics into spreadsheets or notebooks for custom analysis and visualization.
- Feeding evaluation results into CI/CD pipelines to gate deployments.
- Sharing results through BI tools like Looker or internal dashboards with stakeholders who don’t have W&B seats.
- Building automated reporting pipelines that aggregate scores across projects.
API endpoints used
The snippets on this page use the following endpoints from the v2 Evaluation REST API:
- GET /v2/{entity}/{project}/evaluation_runs: Lists evaluation runs in a project, with optional filters by evaluation reference, model reference, or run ID.
- GET /v2/{entity}/{project}/evaluation_runs/{evaluation_run_id}: Reads a single evaluation run to retrieve its model, evaluation reference, status, timestamps, and summary.
- POST /v2/{entity}/{project}/eval_results/query: Retrieves grouped evaluation result rows for one or more evaluations. Returns per-row trials with model output, scores, and optionally resolved dataset row inputs. Also returns aggregated scorer statistics when requested.
- GET /v2/{entity}/{project}/predictions/{prediction_id}: Reads an individual prediction with its inputs, output, and model reference.
Authenticate each request with HTTP Basic authentication, using api as the username and your W&B API key as the password.
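For HTTP clients that do not build Basic authentication headers for you, the username and password pair described above can be encoded by hand. The following is a minimal sketch using only the Python standard library; `basic_auth_header` is a hypothetical helper name chosen for illustration, not part of any W&B SDK.

```python
import base64
import os


def basic_auth_header(api_key, username="api"):
    """Build the Authorization header for HTTP Basic authentication."""
    credentials = f"{username}:{api_key}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Read the key from the environment. The empty-string fallback exists only so
# this sketch can be imported when WANDB_API_KEY is not set.
headers = basic_auth_header(os.environ.get("WANDB_API_KEY", ""))
```

With the requests library, passing `auth=("api", api_key)` to a request produces the same header, so this helper is only needed for clients without built-in Basic authentication support.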
Prerequisites
The examples on this page use Python, but the Evaluation REST API is language-agnostic: you can call the same endpoints from TypeScript or any HTTP client. Before you begin, make sure you have the following:
- Python 3.7 or later.
- The requests library. Install it with pip install requests.
- A W&B API key, set as the WANDB_API_KEY environment variable. Get your key at wandb.ai/settings.
Set up authentication
The examples on this page share a common setup: they import the libraries they use and configure the base URL, the authentication tuple, and the target entity and project. Every later example reuses these variables.
List evaluation runs
A list of evaluation runs is usually the first thing you need in an export workflow, because it gives you the evaluation_run_id values that the other endpoints require. Retrieve recent evaluation runs in a project and list details for each run, such as ID and status.
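The following is a minimal sketch of the shared setup together with the listing call. Several details are assumptions rather than confirmed facts: the base URL `https://trace.wandb.ai`, the placeholder entity and project names, and the helper names `decode_body` and `list_evaluation_runs`. The sketch also does not assume a particular response format: it tries to parse the body as a single JSON document and falls back to newline-delimited JSON.

```python
import json
import os

import requests

BASE_URL = "https://trace.wandb.ai"  # assumption: adjust for your deployment
AUTH = ("api", os.environ.get("WANDB_API_KEY", ""))  # user "api", key as password
ENTITY = "my-entity"    # placeholder: your W&B entity
PROJECT = "my-project"  # placeholder: your W&B project


def decode_body(text):
    """Decode a JSON document, falling back to newline-delimited JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return [json.loads(line) for line in text.splitlines() if line.strip()]


def list_evaluation_runs(http=requests, params=None):
    """Call GET /v2/{entity}/{project}/evaluation_runs and decode the body.

    `http` can be the requests module or a requests.Session.
    `params` holds optional query-string filters.
    """
    url = f"{BASE_URL}/v2/{ENTITY}/{PROJECT}/evaluation_runs"
    resp = http.get(url, auth=AUTH, params=params, timeout=30)
    resp.raise_for_status()  # surface 4xx/5xx responses as exceptions
    return decode_body(resp.text)
```

Calling `list_evaluation_runs()` with a valid key returns whatever the server sends. Inspect one record first to confirm the names of the ID and status fields before building on them.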
Read a single evaluation run
After you have an evaluation_run_id, you can fetch the full record for that run. Retrieve details for a specific evaluation run, including its model, evaluation reference, status, and timestamps. Replace [EVALUATION_RUN_ID] with the ID of the evaluation run you want to fetch.
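A hedged sketch of the single-run lookup follows. The base URL, the placeholder entity and project, and the function name `get_evaluation_run` are assumptions, and the sketch assumes the endpoint returns one JSON object. The guard against an unreplaced placeholder is an illustrative convenience, not API behavior.

```python
import os
from urllib.parse import quote

import requests

BASE_URL = "https://trace.wandb.ai"  # assumption: adjust for your deployment
AUTH = ("api", os.environ.get("WANDB_API_KEY", ""))
ENTITY = "my-entity"    # placeholder
PROJECT = "my-project"  # placeholder


def get_evaluation_run(evaluation_run_id, http=requests):
    """Call GET /v2/{entity}/{project}/evaluation_runs/{evaluation_run_id}."""
    if evaluation_run_id.startswith("["):
        raise ValueError("replace the [EVALUATION_RUN_ID] placeholder with a real ID")
    # Percent-encode the ID so that unusual characters cannot alter the path.
    run_path = quote(evaluation_run_id, safe="")
    url = f"{BASE_URL}/v2/{ENTITY}/{PROJECT}/evaluation_runs/{run_path}"
    resp = http.get(url, auth=AUTH, timeout=30)
    resp.raise_for_status()
    return resp.json()
```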
Get predictions and scores
When you need the underlying data behind a run, such as for spreadsheet exports or row-level analysis, use the eval_results/query endpoint to retrieve per-row results for an evaluation run. Each row includes the resolved dataset inputs, model output, and individual scorer results. Set include_rows, include_raw_data_rows, and resolve_row_refs to get the full per-row detail. Replace [EVALUATION_RUN_ID] with the ID of the evaluation run you want to query.
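The per-row query could be sketched as follows. The three `include_rows`, `include_raw_data_rows`, and `resolve_row_refs` flags come from the text above; the `evaluation_run_ids` field name used to select the run is an assumption, as are the base URL, the placeholders, and the helper names. Check the request schema in the API reference before relying on it.

```python
import os

import requests

BASE_URL = "https://trace.wandb.ai"  # assumption: adjust for your deployment
AUTH = ("api", os.environ.get("WANDB_API_KEY", ""))
ENTITY = "my-entity"    # placeholder
PROJECT = "my-project"  # placeholder


def build_rows_query(evaluation_run_id):
    """Build a request body that asks for full per-row detail."""
    return {
        "evaluation_run_ids": [evaluation_run_id],  # assumed field name
        "include_rows": True,
        "include_raw_data_rows": True,
        "resolve_row_refs": True,
    }


def query_eval_rows(evaluation_run_id, http=requests):
    """Call POST /v2/{entity}/{project}/eval_results/query for one run."""
    url = f"{BASE_URL}/v2/{ENTITY}/{PROJECT}/eval_results/query"
    body = build_rows_query(evaluation_run_id)
    resp = http.post(url, auth=AUTH, json=body, timeout=60)
    resp.raise_for_status()
    return resp.json()
```

For a spreadsheet export, the nested result can then be flattened with a tool such as `pandas.json_normalize` once you have confirmed the shape of the returned rows.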
Get aggregated scores
When you only need high-level metrics, such as for dashboards or CI/CD gating, request summary statistics instead of per-row data. The same eval_results/query endpoint can return aggregated scorer statistics: set include_summary to get summary-level metrics like pass rates for binary scorers and means for continuous scorers.
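As an illustration of the CI/CD gating use case, the sketch below builds a summary-only request body and checks a set of metrics against thresholds. Only `include_summary` and `include_rows` come from the text above; the `evaluation_run_ids` field name is an assumption. Because the layout of the summary in the response is not described here, `failing_scorers` is a hypothetical helper that works on a flat mapping of scorer name to value, which you would build yourself from the response.

```python
def build_summary_query(evaluation_run_id):
    """Build a request body that asks for aggregated scorer statistics only."""
    return {
        "evaluation_run_ids": [evaluation_run_id],  # assumed field name
        "include_summary": True,
        "include_rows": False,  # skip per-row data
    }


def failing_scorers(metrics, thresholds):
    """Return {scorer: (actual, required)} for each scorer below its threshold.

    `metrics` maps scorer name to a numeric value such as a pass rate or mean.
    A scorer that is missing from `metrics` counts as failing.
    """
    failures = {}
    for scorer, required in thresholds.items():
        actual = metrics.get(scorer)
        if actual is None or actual < required:
            failures[scorer] = (actual, required)
    return failures
```

Post the body to the eval_results/query endpoint with the same Basic authentication as the other requests. In a CI job, exit with a non-zero status when `failing_scorers` returns a non-empty dictionary so that the deployment step is blocked.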
Read a single prediction
To inspect a single row in isolation, for example when investigating an unexpected score, you can fetch a prediction directly by its ID. Retrieve the full details of an individual prediction, including its inputs, output, and model reference. Replace [PREDICTION_ID] with the ID of the prediction you want to retrieve.
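The prediction lookup follows the same pattern as the other GET endpoints. In this sketch the base URL, the placeholders, and the function name `get_prediction` are assumptions, and the response is assumed to be a single JSON object.

```python
import os
from urllib.parse import quote

import requests

BASE_URL = "https://trace.wandb.ai"  # assumption: adjust for your deployment
AUTH = ("api", os.environ.get("WANDB_API_KEY", ""))
ENTITY = "my-entity"    # placeholder
PROJECT = "my-project"  # placeholder


def get_prediction(prediction_id, http=requests):
    """Call GET /v2/{entity}/{project}/predictions/{prediction_id}."""
    prediction_path = quote(prediction_id, safe="")  # keep the ID path-safe
    url = f"{BASE_URL}/v2/{ENTITY}/{PROJECT}/predictions/{prediction_path}"
    resp = http.get(url, auth=AUTH, timeout=30)
    resp.raise_for_status()
    return resp.json()
```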
Row digests
Beyond the raw data, each result row from the eval_results/query endpoint includes a row_digest, a content hash that uniquely identifies a specific input in the evaluation dataset based on its contents, not its position. This identifier helps you correlate rows across runs. Row digests are useful for:
- Cross-evaluation comparison: When you run two different models against the same dataset, rows with the same digest represent the same input. You can join on row_digest to compare how different models performed on the exact same task.
- Deduplication: If the same task appears in multiple evaluation suites, the digest lets you identify it.
- Reproducibility: The digest is content-addressable, so if someone modifies a dataset row (changes the instruction text, rubric, or other fields), it gets a new digest. You can verify whether two evaluation runs used identical inputs or different versions.
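The cross-evaluation comparison and the reproducibility check can both be expressed as a join on `row_digest`. The sketch below assumes you have already extracted a list of row dictionaries from each run's response and that each dictionary carries a `row_digest` key. The function name `join_on_row_digest` is hypothetical.

```python
def join_on_row_digest(rows_a, rows_b):
    """Pair the rows of two evaluation runs that share a row_digest.

    Returns (paired, only_a, only_b):
      paired maps each shared digest to a (row_a, row_b) tuple
      only_a and only_b are the sets of digests seen in just one run
    """
    by_digest_a = {row["row_digest"]: row for row in rows_a}
    by_digest_b = {row["row_digest"]: row for row in rows_b}
    shared = by_digest_a.keys() & by_digest_b.keys()
    paired = {d: (by_digest_a[d], by_digest_b[d]) for d in shared}
    only_a = set(by_digest_a) - shared
    only_b = set(by_digest_b) - shared
    return paired, only_a, only_b
```

If `only_a` and `only_b` are both empty, the two runs covered the same set of inputs. A non-empty set points to rows that were added, removed, or modified between runs.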