SDK features using concrete examples¶
These examples walk through some features of the platform in more detail.
A full list of examples is available here.
Root Signals evaluators¶
Root Signals provides over 30 ready-made evaluators that can be used to validate any textual content.
from root import RootSignals
# Connect to the Root Signals API
client = RootSignals()
result = client.evaluators.Helpfulness(response="You can find the instructions from our Careers page.")
print(f"Score: {result.score} / 1.0") # A normalized score between 0 and 1
print(result.justification) # The reasoning for the score
# Score: 0.1 / 1.0
# Clarity:
# The response is very brief and lacks detail. It simply directs the reader to another source without providing any specific information.
# The phrase "instructions from our Careers page" is vague and does not specify...
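In application code, the normalized score is typically compared against a threshold before the response is accepted. A minimal sketch; the 0.7 threshold is an arbitrary example value, not a recommendation:
# Gate on the normalized score; the threshold below is an arbitrary example value
MIN_HELPFULNESS = 0.7
if result.score < MIN_HELPFULNESS:
    print(f"Response rejected: {result.justification}")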
Custom evaluator¶
We can also create a custom evaluator. Evaluators return a floating point value between 0 and 1, based on how well the received text matches what the evaluator is instructed to look for.
from root import RootSignals
client = RootSignals()
network_troubleshooting_evaluator = client.evaluators.create(
name="Network Troubleshooting",
predicate="""Assess the response for technical accuracy and appropriateness in the context of network troubleshooting.
Is the advice technically sound and relevant to the user's question?
Does the troubleshooting process effectively address the likely causes of the issue?
Is the proposed solution valid and safe to implement?
User question: {{request}}
Chatbot response: {{response}}
""",
intent="To measure the technical accuracy and appropriateness of network troubleshooting responses",
model="gemini-2.0-flash", # Check client.models.list() for all available models. You can also add your own model.
)
response = network_troubleshooting_evaluator.run(
request="My internet is not working.",
response="""
I'm sorry to hear that your internet isn't working.
Let's troubleshoot this step by step.
""",
)
print(response.score)
print(response.justification)
# Score: 0.3
# METRIC: Technical Accuracy and Appropriateness
#
# 1. Relevance: The initial response is generic and lacks...
Adjust evaluator behavior¶
An evaluator's behavior can be adjusted by providing demonstrations.
from root import RootSignals
from root.skills import EvaluatorDemonstration
client = RootSignals()
# Create an evaluator
network_troubleshooting_evaluator = client.evaluators.create(
name="Advanced Network Troubleshooting",
predicate="""Assess the response for technical accuracy and appropriateness in the context of network troubleshooting.
Is the advice technically sound and relevant to the user's question?
Does the troubleshooting process effectively address the likely causes of the issue?
Is the proposed solution valid and safe to implement?
User question: {{request}}
Chatbot response: {{response}}
""",
intent="To measure the technical accuracy and appropriateness of network troubleshooting responses",
model="gemini-2.0-flash",
)
# Run first calibration (benchmarking).
test_result = client.evaluators.calibrate_existing(
evaluator_id=network_troubleshooting_evaluator.id,
# The test data is a list of lists, where each inner
# list contains an expected score, a request, and a response.
test_data=[
[
"0.1",
"My internet is not working.",
"I'm sorry to hear that your internet isn't working. Let's troubleshoot this step by step.",
],
[
"0.95",
"My internet is not working.",
"Okay, let's check some basics. First, can you tell me what operating system your computer is running (Windows, macOS, etc.)? Also, can you check the Ethernet cable connecting your computer to the router to ensure it is securely plugged in at both ends? After confirming these steps, open a command prompt or terminal and run `ping 8.8.8.8`. Let me know the results. If you are using wireless connection, try to move closer to the router and see if that improves the connectivity. If the ping fails consistently, the issue might be with your ISP. If the connection improves closer to the router, consider improving your wireless coverage with a range extender or by repositioning the router.",
],
],
)
print(test_result[0].result)
# Improve the evaluator with demonstrations:
# penalize the vague "I'm sorry" response by setting an expected score of 0.1
client.evaluators.update(
evaluator_id=network_troubleshooting_evaluator.id,
evaluator_demonstrations=[
EvaluatorDemonstration(
response="I'm sorry to hear that your internet isn't working. Let's troubleshoot this step by step.",
request="My internet is not working.",
score=0.1,
),
],
)
# Run second calibration
test_result = client.evaluators.calibrate_existing(
evaluator_id=network_troubleshooting_evaluator.id,
test_data=[
[
"0.1",
"My internet is not working.",
"I'm sorry to hear that your internet isn't working. Let's troubleshoot this step by step.",
],
],
)
# Check the results. See that the vague "I'm sorry" response receives a lower score.
print(test_result[0].result)
Retrieval Augmented Generation (RAG) evaluation¶
For RAG, there are special evaluators that can separately measure the different intermediate components of a RAG pipeline, in addition to the final output.
from root import RootSignals
# Connect to the Root Signals API
client = RootSignals()
request = "Is the number of pensioners working more than 100k in 2023?"
response = "Yes, 150000 pensioners were working in 2024."
# Chunks retrieved from a RAG pipeline
retrieved_document_1 = """
While the work undertaken by seniors is often irregular and part-time, more than 150,000 pensioners were employed in 2023, the centre's statistics reveal. The centre noted that pensioners have increasingly continued to work for some time now.
"""
retrieved_document_2 = """
According to the pension centre's latest data, a total of around 1.3 million people in Finland were receiving old-age pensions, with average monthly payments of 1,948 euros.
"""
# Measures whether the answer is faithful to the retrieved contexts (knowledge base / documents)
faithfulness_result = client.evaluators.Faithfulness(
request=request,
response=response,
contexts=[retrieved_document_1, retrieved_document_2],
)
print(faithfulness_result.score) # 0.0 as the response does not match the retrieved documents
print(faithfulness_result.justification)
# Measures whether the retrieved context provides
# sufficient information to produce the ground truth response
context_recall_result = client.evaluators.Context_Recall(
request="Was the number of pensioners who are working above 100k in 2023?",
contexts=[retrieved_document_1, retrieved_document_2],
expected_output="In 2023, 150k pensioners were still working.", # Ground truth
)
print(context_recall_result.score) # We expect a high score
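The final answer itself can also be scored with a general-purpose evaluator such as Relevance (referenced again later on this page). A minimal sketch, assuming it follows the same request/response calling convention as the evaluators above:
# Assumption: Relevance accepts the original request and the generated response,
# following the same calling convention as the other evaluators on this page
relevance_result = client.evaluators.Relevance(
    request=request,
    response=response,
)
print(relevance_result.score)
print(relevance_result.justification)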
Forming a judge¶
A judge is a collection of evaluators that evaluates a component of your application. You can form one by describing your application and, optionally, the stage you want to evaluate.
from root import RootSignals
# Connect to the Root Signals API
client = RootSignals()
# Generate a judge by describing your application and the stage you want to evaluate.
judge_definition = client.judges.generate(
intent="I'm building a returns handler and want to evaluate how it explains our 30-day policy, "
"handles discount offers, and guides through the return process. Our policy is that we offer "
"a 30-day return policy with a 20% discount on the next purchase.",
stage="Explanation of the 30 day return policy",
)
# You can check the full definition, including the evaluators, by getting the judge.
judge = client.judges.get(judge_definition.judge_id)
print(judge)
# Run the judge and get the results. Results are a list of evaluator executions.
results = client.judges.run(
judge_definition.judge_id,
request="Can I return my order? I bought a pair of shoes and they don't fit.",
response="Yes, you can return your order for a 20% discount on the next purchase.",
# The signature of the run method is the same as the evaluator run method. You can pass in
# contexts, tags etc...
)
print(results)
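A judge run returns a list of evaluator executions (see the comment above). A minimal sketch of iterating over them, assuming each execution exposes the same score and justification fields as a standalone evaluator result:
# Assumption: each execution exposes score and justification fields,
# mirroring the standalone evaluator results shown earlier on this page
for execution in results:
    print(execution.score)
    print(execution.justification)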
Creating a judge¶
You can create a judge by providing a name, an intent, and a list of evaluators.
from root import RootSignals
from root.generated.openapi_client.models.evaluator_reference_request import EvaluatorReferenceRequest
from root.skills import Evaluators
# Connect to the Root Signals API
client = RootSignals()
evaluator_references = [
EvaluatorReferenceRequest(id=Evaluators.Eval.Truthfulness.value),
EvaluatorReferenceRequest(id=Evaluators.Eval.Relevance.value),
]
judge = client.judges.create(
name="Custom Returns Policy Judge",
intent="Evaluate customer service responses about return policies",
evaluator_references=evaluator_references,
)
results = client.judges.run(
judge.id,
request="What's your return policy?",
response="We have a 30-day return policy. If you're not satisfied with your purchase, "
"you can return it within 30 days for a full refund.",
contexts=[
"Returns are accepted within thirty (30) calendar days of the delivery date. "
"Eligible items accompanied by valid proof of purchase will receive a full refund, issued via the original method of payment."
],
)
print(results)
Add a model¶
Adding a model is as simple as specifying the model name and an endpoint. The model can be a local model or a model hosted on a cloud service.
from root import RootSignals
# Connect to the Root Signals API
client = RootSignals()
# Add a self-hosted model using Ollama
model = client.models.create(
name="ollama/llama3",
# URL pointing to the model's endpoint. Replace this with your own endpoint.
url="https://d65e-88-148-175-2.ngrok-free.app",
)
# Use the model in an evaluator
evaluator = client.evaluators.create(
name="My model test",
predicate="Hello, my model! {{response}}",
model="ollama/llama3",
)