Custom code scorers and classifiers let you write evaluation logic with full control over the result. A scorer returns a numeric score, while a classifier returns a categorical label. They can use any packages you need and are best when you have specific rules, patterns, or calculations to implement. You can define custom code scorers in three places:
  • Inline in SDK code: Define scorers directly in your evaluation scripts for local development or application-specific logic.
  • Pushed via CLI: Define scorers in TypeScript or Python files and push them to Braintrust for team-wide sharing and automatic evaluation of production logs.
  • Created in UI: Build scorers in the Braintrust web interface using the built-in code editor.
Most teams prototype in the UI, then push production-ready scorers via the CLI. See Scorers overview for guidance.

Score spans

Span-level scorers evaluate individual operations or outputs. Use them for measuring single LLM responses, checking specific tool calls, or validating individual outputs. Each matching span receives an independent score. Your scorer function receives these parameters:
  • input: The input to your task
  • output: The output from your task
  • expected: The expected output (optional)
  • metadata: Custom metadata from the test case
Return a number between 0 and 1, or an object with score and optional metadata. In Ruby, declare only the parameters you need as keyword arguments, for example |output:, expected:|; the runner automatically filters out the rest.
Use scorers inline in your evaluation code:
equality_scorer.eval.ts
import { Eval, type EvalScorer } from "braintrust";
import OpenAI from "openai";

const client = new OpenAI();

const DATASET = [
  {
    input: "What is 2+2?",
    expected: "4",
  },
  {
    input: "What is the capital of France?",
    expected: "Paris",
  },
];

async function task(input: string): Promise<string> {
  const response = await client.responses.create({
    model: "gpt-5-mini",
    input: [
      { role: "user", content: input },
    ],
  });
  return response.output_text ?? "";
}

const equalityScorer: EvalScorer<string, string, string> = ({ output, expected }) => {
  if (!expected) return null;
  const matches = output === expected;
  return {
    name: "Equality",
    score: matches ? 1 : 0,
    metadata: { exact_match: matches },
  };
};

const containsScorer: EvalScorer<string, string, string> = ({ output, expected }) => {
  if (!expected) return null;
  const contains = output.toLowerCase().includes(expected.toLowerCase());
  return {
    name: "Contains expected",
    score: contains ? 1 : 0,
  };
};

Eval("Custom Code Scorer Example", {
  data: DATASET,
  task,
  scores: [equalityScorer, containsScorer],
});
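
The scorers above return objects. A scorer can also return a bare number between 0 and 1 and read per-case settings from metadata. The following is a minimal sketch under stated assumptions: ScorerArgs is a local stand-in for the SDK's argument shape, and max_chars is a hypothetical metadata key that you would set on your own test cases.

```typescript
// Local stand-in for the arguments a span-level scorer receives.
// Assumption: the real SDK types are richer; this mirrors only the four documented fields.
type ScorerArgs = {
  input: string;
  output: string;
  expected?: string;
  metadata?: Record<string, unknown>;
};

// Returns a bare number in [0, 1]: 1 when the output fits the per-case limit,
// otherwise a penalty proportional to how far the output overshoots it.
function lengthBudgetScorer({ output, metadata }: ScorerArgs): number {
  // "max_chars" is a hypothetical metadata key; fall back to 200 when it is absent.
  const raw = metadata?.["max_chars"];
  const maxChars = typeof raw === "number" && raw > 0 ? raw : 200;
  if (output.length <= maxChars) return 1;
  return Math.max(0, 1 - (output.length - maxChars) / maxChars);
}
```

A function like this could sit in the scores array next to the object-returning scorers, provided its argument types line up with your task's input and output types.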

Score traces

Trace-level scorers evaluate entire execution traces including all spans and conversation history. Use these for assessing multi-turn conversation quality, agent behavior such as tool usage and trajectory, or overall workflow completion. Trace-level scorers are the right choice whenever a scorer needs the full execution context rather than a single span. The scorer runs once per trace. Your handler function receives the trace parameter, which provides methods for accessing execution data:
  • Get spans: Returns spans matching the filter. Each span includes input, output, expected, metadata, tags, scores, metrics, error (populated when the span failed), span_id, span_parents, and span_attributes. Omit the filter to get all spans, or pass multiple types like ["llm", "tool"].
    • TypeScript: trace.getSpans({ spanType: ["llm"] })
    • Python: trace.get_spans(span_type=["llm"])
    • Java: trace.getSpans("llm")
    • Ruby: trace.spans(span_type: "llm")
    • C#: trace.GetSpansAsync("llm")
  • Get thread: Returns an array of conversation messages extracted from LLM spans.
    • TypeScript: trace.getThread()
    • Python: trace.get_thread()
    • Java: trace.getLLMConversationThread()
    • Ruby: trace.thread
    • C#: trace.GetThreadAsync()
input, output, expected, and metadata are automatically populated from the root span and passed to your scorer function.
Trace-level scoring requires TypeScript SDK v2.2.1+, Python SDK v0.5.6+, Java SDK v0.3.8+, Ruby SDK v0.2.1+, or C# SDK v0.2.3+.
In the TypeScript SDK (v3.16.0 or later), LocalTrace is the concrete Trace implementation passed to trace-level scorers. Import it from braintrust to construct a Trace directly for advanced or manual scoring.
Use scorers inline in your evaluation code:
trace_code_scorer.eval.ts
import { Eval, wrapOpenAI, wrapTraced, type EvalScorer } from "braintrust";
import OpenAI from "openai";

const client = wrapOpenAI(new OpenAI());

const SUPPORT_DATASET = [
  { input: "My order hasn't arrived yet. Order #12345." },
  { input: "I need help resetting my password." },
];

const callLLM = wrapTraced(async function callLLM(messages: Array<{ role: string; content: string }>) {
  const response = await client.chat.completions.create({
    model: "gpt-5-mini",
    messages,
  });
  return response.choices[0].message.content || "";
});

async function supportTask(input: string): Promise<string> {
  const messages: Array<{ role: string; content: string }> = [
    { role: "system", content: "You are a helpful customer support agent." }
  ];

  messages.push({ role: "user", content: input });
  const response1 = await callLLM(messages);
  messages.push({ role: "assistant", content: response1 });

  messages.push({ role: "user", content: "Can you provide more details?" });
  const response2 = await callLLM(messages);
  messages.push({ role: "assistant", content: response2 });

  messages.push({ role: "user", content: "Thank you for your help!" });
  const response3 = await callLLM(messages);

  return response3;
}

const politenessScorer: EvalScorer<string, string, unknown> = async ({ trace }) => {
  if (!trace) return null; // Skip scoring instead of recording a failing 0 when no trace is available.

  const thread = await trace.getThread();
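  // Copy before reversing: Array.prototype.reverse() mutates the array in place.
  const lastAssistantMsg = [...thread].reverse().find((msg) => msg.role === "assistant");
  // Guard against non-string content (for example, multi-part messages).
  const rawContent = lastAssistantMsg?.content;
  const content = typeof rawContent === "string" ? rawContent.toLowerCase() : "";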

  const politeWords = ["welcome", "glad", "happy", "pleasure", "thank"];
  const isPolite = politeWords.some(word => content.includes(word));

  return {
    name: "Politeness",
    score: isPolite ? 1 : 0,
    metadata: { checked_message_preview: content.slice(0, 80) },
  };
};

const efficiencyScorer: EvalScorer<string, string, unknown> = async ({ trace }) => {
  if (!trace) return null; // Skip scoring instead of recording a failing 0 when no trace is available.

  const llmSpans = await trace.getSpans({ spanType: ["llm"] });
  const isEfficient = llmSpans.length >= 3 && llmSpans.length <= 5;

  return {
    name: "Efficiency",
    score: isEfficient ? 1 : 0,
    metadata: { llm_calls: llmSpans.length },
  };
};

Eval("Support Quality", {
  data: SUPPORT_DATASET,
  task: supportTask,
  scores: [politenessScorer, efficiencyScorer],
});
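
One way to keep trace scorer logic easy to unit-test is to factor it into a pure function over the thread array and call that function from the handler. The sketch below assumes a simplified message shape: ThreadMessage is a local type, not an SDK export, and the multi-part content branch is an assumption about what some chat APIs return.

```typescript
// Simplified message shape. Assumption: real thread messages may carry more fields.
type ThreadMessage = {
  role: string;
  content?: string | Array<{ type?: string; text?: string }> | null;
};

// Flatten string or multi-part content into plain lowercase text.
function messageText(msg: ThreadMessage | undefined): string {
  const content = msg?.content;
  if (typeof content === "string") return content.toLowerCase();
  if (Array.isArray(content)) {
    return content.map((part) => part.text ?? "").join(" ").toLowerCase();
  }
  return "";
}

// Score the last assistant message without mutating the caller's thread array.
function scorePoliteness(thread: ThreadMessage[]): { name: string; score: number } {
  const lastAssistant = [...thread].reverse().find((m) => m.role === "assistant");
  const text = messageText(lastAssistant);
  const politeWords = ["welcome", "glad", "happy", "pleasure", "thank"];
  return {
    name: "Politeness",
    score: politeWords.some((word) => text.includes(word)) ? 1 : 0,
  };
}
```

If the messages returned by trace.getThread() match this shape, the handler body could reduce to a call to scorePoliteness on the awaited thread, and the keyword logic can be tested without running an Eval.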

Trace scorer recipes

Use trace scorers for checks that depend on the agent’s trajectory, such as tool usage, tool failures, or step budgets. Add any of these scorers to the scores array in an Eval, or adapt the handler body for a CLI or UI scorer.
trace_scorer_recipes.eval.ts
import { type EvalScorer } from "braintrust";

function spanName(span: { span_attributes?: { name?: string } }): string {
  return span.span_attributes?.name ?? "unknown";
}

function stringField(value: unknown, fieldName: string): string | null {
  if (typeof value !== "object" || value === null) return null;

  const field = Object.getOwnPropertyDescriptor(value, fieldName)?.value;
  return typeof field === "string" ? field : null;
}

// Check if a specific tool was called at least once.
const requiredToolCalled: EvalScorer<string, string, unknown> = async ({
  trace,
}) => {
  if (!trace) return null;

  const toolSpans = await trace.getSpans({ spanType: ["tool"] });
  const editViewCalls = toolSpans.filter(
    (span) => span.span_attributes?.name === "edit_view",
  );

  return {
    name: "edit_view called",
    score: editViewCalls.length > 0 ? 1 : 0,
    metadata: { edit_view_calls: editViewCalls.length },
  };
};

// Check if a tool was called with an argument matching the expected value.
const requiredToolCalledWithArg: EvalScorer<
  string,
  string,
  unknown
> = async ({ expected, trace }) => {
  if (!trace) return null;

  const documentId = stringField(expected, "document_id");
  if (!documentId) return null;

  const toolSpans = await trace.getSpans({ spanType: ["tool"] });
  const searchCalls = toolSpans.filter(
    (span) => span.span_attributes?.name === "search_docs",
  );
  const matchedCall = searchCalls.some(
    (span) => stringField(span.input, "document_id") === documentId,
  );

  return {
    name: "searched expected document",
    score: matchedCall ? 1 : 0,
    metadata: {
      expected_document_id: documentId,
      search_docs_calls: searchCalls.length,
    },
  };
};

// Check that no tool from a denylist was called.
const noDisallowedTools: EvalScorer<string, string, unknown> = async ({
  trace,
}) => {
  if (!trace) return null;

  const disallowedToolNames = new Set(["send_email", "delete_record"]);
  const toolSpans = await trace.getSpans({ spanType: ["tool"] });
  const disallowedCalls = toolSpans.filter((span) => {
    const name = span.span_attributes?.name;
    return typeof name === "string" && disallowedToolNames.has(name);
  });

  return {
    name: "no disallowed tools",
    score: disallowedCalls.length === 0 ? 1 : 0,
    metadata: {
      disallowed_tools: disallowedCalls.map(spanName),
    },
  };
};

// Check that every tool call completed without error.
const allToolsSucceeded: EvalScorer<string, string, unknown> = async ({
  trace,
}) => {
  if (!trace) return null;

  const toolSpans = await trace.getSpans({ spanType: ["tool"] });
  const failedToolCalls = toolSpans.filter((span) => Boolean(span.error));

  return {
    name: "tool calls succeeded",
    score: failedToolCalls.length === 0 ? 1 : 0,
    metadata: {
      failed_tools: failedToolCalls.map(spanName),
      tool_calls: toolSpans.length,
    },
  };
};

// Check if the agent stayed within a step budget.
const trajectoryBudget: EvalScorer<string, string, unknown> = async ({
  trace,
}) => {
  if (!trace) return null;

  const maxSteps = 8;
  const agentSpans = await trace.getSpans({ spanType: ["llm", "tool"] });

  return {
    name: "trajectory budget",
    score: agentSpans.length <= maxSteps ? 1 : 0,
    metadata: {
      agent_steps: agentSpans.length,
      max_steps: maxSteps,
    },
  };
};
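
Trajectory checks sometimes depend on the order of tool calls, not only on their presence. As a sketch in the same spirit as the recipes above, the helper below checks that a required sequence of tool names appears in order, though not necessarily back to back. SpanLike is a local type, and the sketch assumes the spans passed in are already in execution order; that ordering is an assumption here, so sort the spans first if your setup does not guarantee it.

```typescript
// Minimal span shape needed for the check (local type, not an SDK export).
type SpanLike = { span_attributes?: { name?: string } };

// True when `required` occurs as an ordered subsequence of the span names.
function calledInOrder(spans: SpanLike[], required: string[]): boolean {
  let next = 0; // Index of the next required tool name still to be matched.
  for (const span of spans) {
    if (next < required.length && span.span_attributes?.name === required[next]) {
      next += 1;
    }
  }
  return next === required.length;
}

// Wrap the check in the score-object shape used by the recipes above.
function toolOrderScore(spans: SpanLike[], required: string[]) {
  const ok = calledInOrder(spans, required);
  return {
    name: "tool order",
    score: ok ? 1 : 0,
    metadata: { required_order: required },
  };
}
```

Inside a trace scorer, the spans argument would come from the tool spans fetched with getSpans, as in the other recipes.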

Set pass thresholds

Define minimum acceptable scores to automatically mark results as passing or failing. When configured, scores that meet or exceed the threshold are marked as passing (green highlighting with checkmark), while scores below are marked as failing (red highlighting).
Pass thresholds apply only to scorers that output numeric scores. Classifiers, which output labels, don’t use them.
Add __pass_threshold to the scorer’s metadata (value between 0 and 1):
// Assumes `project` is a Braintrust project handle created earlier in this file.
project.scorers.create({
  name: "Quality checker",
  slug: "quality-checker",
  handler: async ({ output, expected }) => {
    return output === expected ? 1 : 0;
  },
  metadata: {
    __pass_threshold: 0.8,
  },
});
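
The pass rule described above is inclusive: a score equal to the threshold passes. The sketch below illustrates that comparison only; isPassing is a hypothetical helper, not part of the Braintrust SDK, and not a description of how Braintrust implements the check.

```typescript
// Illustration only: mirrors the documented "meet or exceed" rule.
function isPassing(score: number, passThreshold: number): boolean {
  return score >= passThreshold;
}

const threshold = 0.8;
const results = [1, 0.8, 0.79, 0].map((score) => ({
  score,
  passing: isPassing(score, threshold),
}));
// Scores 1 and 0.8 pass; 0.79 and 0 fail.
```

For a binary scorer like the one above, which returns only 0 or 1, a threshold of 0.8 passes exact matches and fails everything else.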

Return multiple scores

A single scorer can return an array of score objects to emit multiple named metrics from one call. This is useful when several quality dimensions can be computed together or share computation. Each item appears as its own score column in the Braintrust UI. Each item requires name and score; metadata is optional.
Eval("Summary Quality", {
  data: DATASET,
  task,
  scores: [
    ({ output, expected }) => {
      // Drop empty tokens so an empty output counts as zero words, not one.
      const words = (output ?? "").toLowerCase().split(/\s+/).filter(Boolean);
      const keyTerms: string[] = expected.key_terms;
      const covered = keyTerms.filter((t) => words.includes(t)).length;
      return [
        {
          name: "coverage",
          score: keyTerms.length ? covered / keyTerms.length : 1,
          metadata: { missing: keyTerms.filter((t) => !words.includes(t)) },
        },
        {
          name: "conciseness",
          score: words.length <= expected.max_words ? 1 : 0,
          metadata: { word_count: words.length, limit: expected.max_words },
        },
      ];
    },
  ],
});
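
The example above reads key_terms and max_words from expected, so it assumes a dataset whose expected values carry both fields. A hypothetical dataset and a standalone version of the same computation might look like the sketch below; the SummaryExpected type, the SUMMARY_DATASET name, and the sample row are illustrative assumptions.

```typescript
// Shape the multi-score example reads from `expected` (local, illustrative type).
type SummaryExpected = { key_terms: string[]; max_words: number };

// Hypothetical test case matching that shape.
const SUMMARY_DATASET: Array<{ input: string; expected: SummaryExpected }> = [
  {
    input: "Summarize the quarterly report.",
    expected: { key_terms: ["revenue", "growth"], max_words: 12 },
  },
];

// Same computation as the inline scorer, as a plain function returning two named scores.
function summaryScores(output: string, expected: SummaryExpected) {
  // Lowercase and split on whitespace; drop empty tokens from blank output.
  const words = output.toLowerCase().split(/\s+/).filter(Boolean);
  const missing = expected.key_terms.filter((t) => !words.includes(t.toLowerCase()));
  const total = expected.key_terms.length;
  return [
    {
      name: "coverage",
      score: total ? (total - missing.length) / total : 1,
      metadata: { missing },
    },
    {
      name: "conciseness",
      score: words.length <= expected.max_words ? 1 : 0,
      metadata: { word_count: words.length, limit: expected.max_words },
    },
  ];
}
```

Because the tokens are split on whitespace only, a key term followed by punctuation (such as "growth.") would not match; strip punctuation first if your outputs need it.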

Apply classification labels

A classifier returns a categorical label instead of a numeric score. Define custom code classifiers inline in your eval code, as a function that evaluates a result and constructs one or more classifications. Each classification your function returns sets a name (the group it belongs to, such as intent), an id (the value you filter by, such as password_reset), an optional label for display (such as Password reset), and optional metadata. Unlike an LLM-as-a-judge classifier, custom code sets these fields independently and can return more than one classification at a time.
To create a classifier in the UI, build an LLM-as-a-judge classifier.
import { Eval } from "braintrust";

const DATASET = [
  {
    input: "Hello! Can you help me reset my password?",
    expected: "password_reset",
  },
];

async function task(input: string): Promise<string> {
  // Stand-in for your LLM call
  return `Thanks for reaching out. ${input}`;
}

function intentClassifier({ output }: { output: string }) {
  if (output.toLowerCase().includes("password")) {
    return {
      name: "intent",
      id: "password_reset",
      label: "Password reset",
    };
  }

  return {
    name: "intent",
    id: "other",
    label: "Other",
  };
}

Eval("Support intent", {
  data: DATASET,
  task,
  classifiers: [intentClassifier],
});
For the C# and Java examples, use the BRAINTRUST_DEFAULT_PROJECT_NAME environment variable to set a project name. Otherwise, the default project is default-dotnet-project (C#) or default-java-project (Java).
In a single evaluation, you can use scorers, classifiers, or both. Classifier failures do not stop the evaluation or affect other scorers and classifiers. Braintrust records classifier errors in the result metadata under classifier_errors. A classifier can also assign multiple labels at once:
function intentClassifier() {
  return [
    { name: "intent", id: "billing", label: "Billing" },
    { name: "intent", id: "login", label: "Login" },
  ];
}
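
Because custom code sets each field independently, one classifier can also emit classifications in different groups and attach metadata. The sketch below uses a local Classification type that mirrors the four documented fields (name, id, label, metadata); the sentiment group, the keyword rules, and the supportClassifier name are hypothetical.

```typescript
// Local type mirroring the documented classification fields.
type Classification = {
  name: string;
  id: string;
  label?: string;
  metadata?: Record<string, unknown>;
};

// Returns one classification per group: "intent" and (hypothetically) "sentiment".
function supportClassifier({ output }: { output: string }): Classification[] {
  const text = output.toLowerCase();
  const results: Classification[] = [];

  // Intent group: record which keywords triggered the label in metadata.
  const matched = ["password", "reset"].filter((keyword) => text.includes(keyword));
  results.push(
    matched.length > 0
      ? {
          name: "intent",
          id: "password_reset",
          label: "Password reset",
          metadata: { matched_keywords: matched },
        }
      : { name: "intent", id: "other", label: "Other" },
  );

  // Sentiment group: a deliberately simple keyword rule for illustration.
  results.push(
    text.includes("sorry")
      ? { name: "sentiment", id: "apologetic", label: "Apologetic" }
      : { name: "sentiment", id: "neutral", label: "Neutral" },
  );

  return results;
}
```

A function like this could be passed in the classifiers array of an Eval in the same way as intentClassifier above.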
Classifiers require TypeScript SDK v3.9.0+, Python SDK v0.16.0+, Go SDK v0.8.0+, Java SDK v0.3.12+, or C# SDK v0.2.8+.

Next steps