SDK features using concrete examples

These examples walk through some features of the platform in more detail.

Root Signals evaluators

Root Signals provides over 30 ready-made evaluators that can be used to validate any textual content.

from root import RootSignals

# Connect to the Root Signals API
client = RootSignals()

result = client.evaluators.Helpfulness(response="You can find the instructions from our Careers page.")

print(f"Score: {result.score} / 1.0")  # A normalized score between 0 and 1
print(result.justification)  # The reasoning for the score
# Score: 0.1 / 1.0

# Clarity:
# The response is very brief and lacks detail. It simply directs the reader to another source without providing any specific information.
# The phrase "instructions from our Careers page" is vague and does not specify...

Custom evaluator

We can also create a custom evaluator. Evaluators return only floating-point values between 0 and 1, based on how well the received text matches the criteria described in the evaluator's predicate.

from root import RootSignals

client = RootSignals()

network_troubleshooting_evaluator = client.evaluators.create(
    name="Network Troubleshooting",
    predicate="""Assess the response for technical accuracy and appropriateness in the context of network troubleshooting.
            Is the advice technically sound and relevant to the user's question?
            Does the troubleshooting process effectively address the likely causes of the issue?
            Is the proposed solution valid and safe to implement?

            User question: {{request}}

            Chatbot response: {{response}}
            """,
    intent="To measure the technical accuracy and appropriateness of network troubleshooting responses",
    model="gemini-2.0-flash",  # Check client.models.list() for all available models. You can also add your own model.
)

response = network_troubleshooting_evaluator.run(
    request="My internet is not working.",
    response="""
    I'm sorry to hear that your internet isn't working.
    Let's troubleshoot this step by step.
    """,
)

print(response.score)
print(response.justification)
# Score: 0.3

# METRIC: Technical Accuracy and Appropriateness
# 
# 1.  Relevance: The initial response is generic and lacks...

Adjust evaluator behavior

An evaluator's behavior can be adjusted by providing demonstrations.

from root import RootSignals
from root.skills import EvaluatorDemonstration

client = RootSignals()

# Create an evaluator
network_troubleshooting_evaluator = client.evaluators.create(
    name="Advanced Network Troubleshooting",
    predicate="""Assess the response for technical accuracy and appropriateness in the context of network troubleshooting.
                Is the advice technically sound and relevant to the user's question?
                Does the troubleshooting process effectively address the likely causes of the issue?
                Is the proposed solution valid and safe to implement?

                User question: {{request}}

                Chatbot response: {{response}}
                """,
    intent="To measure the technical accuracy and appropriateness of network troubleshooting responses",
    model="gemini-2.0-flash",
)


# Run first calibration (benchmarking).
test_result = client.evaluators.calibrate_existing(
    evaluator_id=network_troubleshooting_evaluator.id,
    # The test data is a list of lists, where each inner
    # list contains an expected score, a request, and a response.
    test_data=[
        [
            "0.1",
            "My internet is not working.",
            "I'm sorry to hear that your internet isn't working. Let's troubleshoot this step by step.",
        ],
        [
            "0.95",
            "My internet is not working.",
            "Okay, let's check some basics. First, can you tell me what operating system your computer is running (Windows, macOS, etc.)? Also, can you check the Ethernet cable connecting your computer to the router to ensure it is securely plugged in at both ends? After confirming these steps, open a command prompt or terminal and run `ping 8.8.8.8`. Let me know the results. If you are using wireless connection, try to move closer to the router and see if that improves the connectivity. If the ping fails consistently, the issue might be with your ISP. If the connection improves closer to the router, consider improving your wireless coverage with a range extender or by repositioning the router.",
        ],
    ],
)

print(test_result[0].result)

# Improve the evaluator with demonstrations:
# penalize the vague "I'm sorry" response by setting an expected score of 0.1
client.evaluators.update(
    evaluator_id=network_troubleshooting_evaluator.id,
    evaluator_demonstrations=[
        EvaluatorDemonstration(
            response="I'm sorry to hear that your internet isn't working. Let's troubleshoot this step by step.",
            request="My internet is not working.",
            score=0.1,
        ),
    ],
)
# Run second calibration
test_result = client.evaluators.calibrate_existing(
    evaluator_id=network_troubleshooting_evaluator.id,
    test_data=[
        [
            "0.1",
            "My internet is not working.",
            "I'm sorry to hear that your internet isn't working. Let's troubleshoot this step by step.",
        ],
    ],
)

# Check the results. See that the vague "I'm sorry" response receives a lower score.
print(test_result[0].result)

Retrieval Augmented Generation (RAG) evaluation

For RAG, there are special evaluators that can separately measure the different intermediate components of a RAG pipeline, in addition to the final output.

from root import RootSignals

# Connect to the Root Signals API
client = RootSignals()

request = "Is the number of pensioners working more than 100k in 2023?"
response = "Yes, 150000 pensioners were working in 2024."

# Chunks retrieved from a RAG pipeline
retrieved_document_1 = """
While the work undertaken by seniors is often irregular and part-time, more than 150,000 pensioners were employed in 2023, the centre's statistics reveal. The centre noted that pensioners have increasingly continued to work for some time now.
"""
retrieved_document_2 = """
According to the pension centre's latest data, a total of around 1.3 million people in Finland were receiving old-age pensions, with average monthly payments of 1,948 euros.
"""

# Measures whether the answer is faithful to the retrieved contexts (knowledge base / documents)
faithfulness_result = client.evaluators.Faithfulness(
    request=request,
    response=response,
    contexts=[retrieved_document_1, retrieved_document_2],
)

print(faithfulness_result.score)  # 0.0 as the response does not match the retrieved documents
print(faithfulness_result.justification)

# Measures whether the retrieved context provides
# sufficient information to produce the ground truth response
context_recall_result = client.evaluators.Context_Recall(
    request="Was the number of pensioners who are working above 100k in 2023?",
    contexts=[retrieved_document_1, retrieved_document_2],
    expected_output="In 2023, 150k pensioners were still working.",  # Ground truth
)
print(context_recall_result.score)  # We expect a high score
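
The scores can be used directly to gate a RAG pipeline. The sketch below continues the example above and falls back to a safe answer when faithfulness drops below a threshold; the 0.5 threshold is an arbitrary example value, not a recommendation.

# A minimal sketch: gate the pipeline on the faithfulness score.
FAITHFULNESS_THRESHOLD = 0.5  # Arbitrary example threshold

if faithfulness_result.score < FAITHFULNESS_THRESHOLD:
    # The response is not grounded in the retrieved documents;
    # fall back to a safe answer instead of returning it to the user.
    response = "I could not verify this from the available documents."

print(response)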

Monitoring LLM pipelines with tags

Evaluator runs can be tagged with free-form tags.

from root import RootSignals

# Connect to the Root Signals API
client = RootSignals()

# Run an evaluator with tags to track the execution.
result = client.evaluators.Clarity(
    response="Sure, let me help you to fix your issue with your network connection. Start by...",
    tags=["production", "v1.23"],
)

# Get the execution log for the evaluator run.
log = client.execution_logs.get(execution_result=result)
print(log)


# And get all the logs with the same tags.
logs = client.execution_logs.list(tags=["production"])
for log in logs:
    print(log.score)

Use OpenAI client for chat completions

Evaluators and monitoring can be added to your existing codebase using the OpenAI client. To do this, retrieve the base_url from the Root Signals SDK Skill and then use the normal OpenAI API client with it. There are two ways to do it:

Without streaming, the API returns the whole response to the call:

from openai import OpenAI

from root import RootSignals
from root.validators import Validator

# Connect to the Root Signals API
rs_client = RootSignals()

model = "gpt-4o"
another_model = "gpt-4"

skill = rs_client.skills.create(
    name="My chatbot",
    intent="Simple Q&A chatbot",
    system_message="You are a helpful assistant.",
    model=model,
    fallback_models=[another_model],
    validators=[Validator(evaluator_name="Truthfulness", threshold=0.8)],
)

# Start chatting with the skill (non-streaming)
client = OpenAI(base_url=skill.openai_base_url, api_key=rs_client.api_key)

messages = [
    # {"role": "system", "content": "You are a helpful assistant."},
    # ^ implicit in skill
    {"role": "user", "content": "Why is the sky blue?"},
]
completion = client.chat.completions.create(model=model, messages=messages)

print(completion.choices[0].message.content)


# We can use either the model or one of the fallback models defined for the
# skill. We will use the fallback model here.
messages = [
    {"role": "user", "content": "Why is the sky blue?"},
]
completion = client.chat.completions.create(model=another_model, messages=messages)

print(completion.choices[0].message.content)

# We can get the full execution details, including the validation results
log = rs_client.execution_logs.get(log_id=completion.id)

print(log.validation_results)
# print(completion.choices[0].message.content)

The sky appears blue because of the way sunlight interacts ...

# print(log.validation_results)

[
  "evaluator_name": "Truthfulness"
  "result": "0.9"
  "is_valid": "true"
  "..."
]

Do note that only models specified as either model or fallback_models for the created Skill are accepted by the API. Trying to use other model names will result in an error.
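
For example, a request that names a model not listed for the skill is rejected by the API. Below is a minimal sketch of handling this; the exact exception type depends on the OpenAI client version, so it just catches the client's base OpenAIError:

import openai

# "gpt-3.5-turbo" was not given as model or fallback_models for the skill,
# so the platform is expected to reject this request with an error.
try:
    client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Why is the sky blue?"}],
    )
except openai.OpenAIError as error:
    print(f"Request rejected: {error}")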

When streaming (stream=True), the API response is provided as a generator that yields a set of chunks over time:

from openai import OpenAI

from root import RootSignals
from root.validators import Validator

# Connect to the Root Signals API
rs_client = RootSignals()

model = "gpt-4o"
skill = rs_client.skills.create(
    name="My Q&A chatbot",
    intent="Simple Q&A chatbot",
    system_message="You are a helpful assistant.",
    model=model,
    validators=[Validator(evaluator_name="Truthfulness", threshold=0.8)],
)

# Start chatting with the skill
client = OpenAI(base_url=skill.openai_base_url, api_key=rs_client.api_key)
messages = [
    {"role": "user", "content": "Why is the sky blue?"},
]
completion = client.chat.completions.create(model=model, messages=messages, stream=True)
for chunk in completion:
    print(chunk.choices[0].delta.content)
# print(chunk.choices[0].delta.content)

The sky appears blue because of the way sunlight interacts ...

Do note that if validators are in use, it is not possible to stream the response, as the response must be validated before it is returned to the caller. In that case (and possibly for other reasons), the platform returns the final full response as a single chunk once the validators have finished evaluating it.
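
As a sketch of that behaviour, iterating over the stream from the skill created above (which has a Truthfulness validator) should then yield the whole validated response in effectively a single chunk:

# With validators attached, the "stream" collapses into the validated response.
completion = client.chat.completions.create(model=model, messages=messages, stream=True)

chunks = [chunk.choices[0].delta.content for chunk in completion]
print(len(chunks))  # Expected to be a single content chunk (or very few) when validators are in use
print("".join(chunk for chunk in chunks if chunk))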

Evaluate your LLM pipeline by grouping validators into an Objective

We can group and track any LLM pipeline results using an Objective.

from root import RootSignals
from root.validators import Validator

# Connect to the Root Signals API
client = RootSignals()

# Create an objective which describes what we are trying to do
objective = client.objectives.create(
    intent="Child-safe clear response",
    validators=[
        Validator(evaluator_name="Clarity", threshold=0.2),
        Validator(evaluator_name="Safety for Children", threshold=0.3),
    ],
)


llm_response = "Some LLM response I got from my custom LLM pipeline."
response = objective.run(response=llm_response)

print(response)
# print(response)

"validation":
  "validation_results": [
    "evaluator_name": "Clarity"
    "result": "0.5"
    "is_valid": "true"
    "..."
  ]

Add a model

Adding a model is as simple as specifying the model name and an endpoint. The model can be a local model or a model hosted on a cloud service.

from root import RootSignals

# Connect to the Root Signals API
client = RootSignals()

# Add a self-hosted model using Ollama
model = client.models.create(
    name="ollama/llama3",
    # URL pointing to the model's endpoint. Replace this with your own endpoint.
    url="https://d65e-88-148-175-2.ngrok-free.app",
)

# Use the model in a skill
skill = client.skills.create(name="My model test", prompt="Hello, my model!", model="ollama/llama3")
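
The same model name should also be usable anywhere a model parameter is accepted, for example when creating a custom evaluator. Below is a minimal sketch reusing the evaluator API shown earlier; the "Conciseness" evaluator is just an illustrative example:

# Use the self-hosted model to power a custom evaluator as well
evaluator = client.evaluators.create(
    name="Conciseness",
    predicate="Is the following response concise and to the point? Response: {{response}}",
    intent="To measure how concise a response is",
    model="ollama/llama3",  # The model added above
)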

Simple Skill

Skills are measurable units of automation powered by LLMs. The APIs typically respond with Python objects that can be used to chain requests or to reuse the results of previous calls. A skill explicitly specifies the model to use, the descriptive intent, and the input variables referred to in the prompt.

from root import RootSignals

# Connect to the Root Signals API
client = RootSignals()

# Create a skill
skill = client.skills.create(
    name="My text classifier",
    intent="To classify text into arbitrary categories based on semantics",
    prompt="""
    Classify this text into one of the following: {{categories}}
    Text: {{text}}
    """,
    model="gpt-4",
)

# Execute
response = skill.run(
    {
        "text": "The expectation for rate cuts has been steadily declining.",
        "categories": "Finance, Sports, Politics",
    }
)

print(response)

# We can retrieve the skill by id
skill_2 = client.skills.get(skill_id=skill.id)
response = skill_2.run(
    {
        "text": "The expectation for rate cuts has been steadily declining.",
        "categories": "Finance, Sports, Politics",
    }
)

# We can also retrieve it by name
# (the list result is an iterator, so we just take the first one).
#
# The name is not a unique identifier, so the .run method is intentionally
# not available on the listed objects. However, you can circumvent this
# restriction if you wish by using:
skill_3 = next(client.skills.list(name="My text classifier"))
response = client.skills.run(
    skill_3.id,
    {
        "text": "The expectation for rate cuts has been steadily declining.",
        "categories": "Finance, Sports, Politics",
    },
)
# print(response)

"llm_output": "Finance",
"validation": "Validation(is_valid=True, validator_results=[])",
"model": "gpt-4",
"execution_log_id": "1181e790-7b87-457f-a2cb-6b1dfc1eddf4",
"rendered_prompt": "Classify this text into ...",
"cost": "0.00093",