Orchestrating Multi-Agent Solutions with Semantic Kernel


In the real world, complex problems are rarely solved by one person. A DevOps crisis might involve a monitoring expert, a root cause analyst, and a deployment specialist. We apply the same "team" logic to AI using Semantic Kernel's AgentGroupChat.

1. Why Multi-Agent?

Single-agent systems often suffer from "context fatigue" when tasked with too many contradictory instructions. By breaking a problem down into specialized roles, we achieve:

  • Higher Precision: Each agent has a narrow, hyper-focused system prompt.
  • Separation of Concerns: One agent can be the "Thinker" (Incident Manager) while the other is the "Doer" (DevOps Assistant).
  • Robust Workflows: Agents can check each other's work, reducing hallucinations in critical infrastructure.

The DevOps Example:

  1. Monitoring Agent: Detects the anomaly.
  2. Root Cause Agent: Identifies why it happened.
  3. Deployment Agent: Rolls back the change.
  4. Reporting Agent: Notifies the stakeholders.
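
The four roles above can be pictured as a hand-off chain in which each step reads what the previous one wrote. The following is a minimal, library-free sketch of that idea, not Semantic Kernel code: the function names, the `incident` dictionary, and every value stored in it are hypothetical placeholders chosen for illustration.

```python
def monitoring_agent(incident):
    """Step 1: detect the anomaly (the value is a made-up placeholder)."""
    incident["anomaly"] = "error-rate spike"
    return incident


def root_cause_agent(incident):
    """Step 2: explain why it happened, based on the monitoring finding."""
    incident["root_cause"] = f"bad deploy behind {incident['anomaly']}"
    return incident


def deployment_agent(incident):
    """Step 3: roll back the change."""
    incident["action"] = "rollback"
    return incident


def reporting_agent(incident):
    """Step 4: summarize the incident for stakeholders."""
    incident["report"] = f"{incident['anomaly']}: {incident['action']} applied"
    return incident


PIPELINE = [monitoring_agent, root_cause_agent, deployment_agent, reporting_agent]


def run_pipeline():
    """Pass one shared incident record through every role in order."""
    incident = {}
    for step in PIPELINE:
        incident = step(incident)
    return incident


result = run_pipeline()
print(result["report"])
```

Each function touches only its own field, which mirrors the narrow-prompt idea: a role stays small because it relies on the shared record for everything else.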

2. Managing the Conversation: Selection & Termination

In a group chat, you need a "Moderator." Semantic Kernel provides two critical strategies to manage the flow:

A. Selection Strategy

This determines who speaks next.

  • SequentialSelectionStrategy: A simple "round-robin" or fixed-order approach.
  • KernelFunctionSelectionStrategy: Uses a kernel function (typically an LLM prompt) to decide dynamically which agent is best suited to respond, given the current state of the chat.
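
To make the contrast concrete, here is a minimal sketch of the two selection ideas in plain Python. It deliberately does not use the Semantic Kernel classes: `round_robin` and `pick_by_content` are hypothetical helpers, and the keyword rule is only a stand-in for the decision an LLM-backed function would make.

```python
from itertools import cycle

AGENTS = ["INCIDENT_MANAGER", "DEVOPS_ASSISTANT"]
ACTION_WORDS = ("restart", "rollback", "redeploy", "quota")


def round_robin(agents):
    """Fixed-order selection: yield agent names in a repeating cycle."""
    return cycle(agents)


def pick_by_content(last_message):
    """Dynamic selection: choose the next agent from the chat state.

    A keyword rule stands in for the LLM call: messages that name a
    corrective action go to the assistant, everything else to the manager.
    """
    text = last_message.lower()
    if any(word in text for word in ACTION_WORDS):
        return "DEVOPS_ASSISTANT"
    return "INCIDENT_MANAGER"


turns = round_robin(AGENTS)
first_three = [next(turns) for _ in range(3)]
print(first_three)
print(pick_by_content("Please analyze this log."))
print(pick_by_content("Restart service ServiceX"))
```

The fixed-order version ignores what was said, while the content-based version can skip an agent whose turn would add nothing. The price of that flexibility is a decision step on every turn.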

B. Termination Strategy

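This determines when the job is done. Without a termination condition, agents keep taking turns until an iteration limit is reached, consuming tokens without making progress. We define a specific condition (e.g., the message contains "RESOLVED" or "NO ACTION NEEDED") to stop the cycle, and cap the number of iterations as a safety net.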
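
As a rough illustration of that rule, the sketch below combines a keyword check with an iteration cap. It is plain Python rather than the Semantic Kernel TerminationStrategy API; the helper name `should_stop` and the default cap of 10 are assumptions made for the example.

```python
def should_stop(last_message, iteration, max_iterations=10):
    """Return True when the chat should end.

    Stops on a case-insensitive keyword match or when the
    iteration cap is reached, whichever comes first.
    """
    stop_phrases = ("resolved", "no action needed")
    text = last_message.lower()
    if any(phrase in text for phrase in stop_phrases):
        return True
    return iteration >= max_iterations


print(should_stop("INCIDENT_MANAGER > No action needed", iteration=1))
print(should_stop("Restart service ServiceX", iteration=3))
print(should_stop("Restart service ServiceX", iteration=10))
```

Substring matching is crude: a message containing "unresolved" would also trigger the stop. A distinctive phrase such as "No action needed" is a safer signal than a single common word.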


Lab Section: Building a Self-Healing DevOps Multi-Agent System

In this lab, we will build a two-agent system: an Incident Manager that analyzes logs and a DevOps Assistant that executes fixes.

1. Project Prerequisites

Create a requirements.txt with the packages listed below, and a .env file containing your connection strings:

Plaintext
python-dotenv
semantic-kernel[azure]
azure-identity

File Structure:

  • /logs/: (Created automatically by script)
  • /sample_logs/: Contains log1.log, log2.log, log3.log, and log4.log.
  • .env: Containing your connection strings.
  • requirements.txt: python-dotenv, semantic-kernel[azure], azure-identity.


2. Implementation (agent_chat.py)

This lab demonstrates how to subclass Selection and Termination strategies to create a controlled loop between two specialized agents.

Python
import asyncio
import os
import shutil
from pathlib import Path

from azure.identity.aio import DefaultAzureCredential
from semantic_kernel.agents import AgentGroupChat, AzureAIAgent, AzureAIAgentSettings
from semantic_kernel.agents.strategies import TerminationStrategy, SequentialSelectionStrategy
from semantic_kernel.contents.chat_message_content import ChatMessageContent
from semantic_kernel.contents.utils.author_role import AuthorRole
from semantic_kernel.functions.kernel_function_decorator import kernel_function

# --- Role Definitions and Instructions ---
INCIDENT_MANAGER = "INCIDENT_MANAGER"
INCIDENT_MANAGER_INSTRUCTIONS = """
Analyze the given log file or the response from the devops assistant.
Recommend which one of the following actions should be taken:

- Restart service {service_name}
- Rollback transaction
- Redeploy resource {resource_name}
- Increase quota

If there are no issues or if the issue has already been resolved, respond with "INCIDENT_MANAGER > No action needed"
If none of the options resolve the issue, respond with "Escalate issue."

RULES:
- Do not perform any corrective actions yourself.
- Read the log file on every turn.
- Prepend your response with this text: "INCIDENT_MANAGER > {logfilepath} | "
- Only respond with the corrective action instructions.
"""

DEVOPS_ASSISTANT = "DEVOPS_ASSISTANT"
DEVOPS_ASSISTANT_INSTRUCTIONS = """
Read the instructions from the INCIDENT_MANAGER and apply the appropriate resolution function.
Return the response as "{function_response}"
If the instructions indicate there are no issues or actions needed, 
take no action and respond with "No action needed."

RULES:
- Use the instructions provided.
- Do not read any log files yourself.
- Prepend your response with this text: "DEVOPS_ASSISTANT > "
"""

# --- Custom Strategies ---

class SelectionStrategy(SequentialSelectionStrategy):
    """Determines which agent should take the next turn."""
    async def select_agent(self, agents, history):
        # The Incident Manager goes after the User or the DevOps Assistant
        if history[-1].name == DEVOPS_ASSISTANT or history[-1].role == AuthorRole.USER:
            return next((agent for agent in agents if agent.name == INCIDENT_MANAGER), None)
        
        # Otherwise, it is the DevOps Assistant's turn
        return next((agent for agent in agents if agent.name == DEVOPS_ASSISTANT), None)

class ApprovalTerminationStrategy(TerminationStrategy):
    """Determines when the conversation should end."""
    async def should_agent_terminate(self, agent, history):
        # End chat if the agent indicates no action is needed
        return "no action needed" in history[-1].content.lower()

# --- Plugins ---

class LogFilePlugin:
    """Plugin to allow agents to read log files."""
    @kernel_function(description="Reads the contents of a log file.")
    def read_log(self, file_path: str) -> str:
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()

class DevopsPlugin:
    """Plugin to simulate DevOps corrective actions."""
    @kernel_function(description="Restarts a service.")
    def restart_service(self, service_name: str) -> str:
        return f"{{Service {service_name} restarted successfully.}}"

    @kernel_function(description="Rolls back a transaction.")
    def rollback_transaction(self) -> str:
        return "{Transaction rolled back successfully.}"

    @kernel_function(description="Redeploys a resource.")
    def redeploy_resource(self, resource_name: str) -> str:
        return f"{{Resource {resource_name} redeployed successfully.}}"

    @kernel_function(description="Increases quota.")
    def increase_quota(self) -> str:
        return "{Quota increased successfully.}"

# --- Main Logic ---

async def main():
    # Clear the console
    os.system('cls' if os.name == 'nt' else 'clear')

    # Setup file paths
    print("Getting log files...\n")
    script_dir = Path(__file__).parent
    src_path = script_dir / "sample_logs"
    file_path = script_dir / "logs"
    shutil.copytree(src_path, file_path, dirs_exist_ok=True)

    # Get Azure AI Settings
    ai_agent_settings = AzureAIAgentSettings()

    async with (
        DefaultAzureCredential(exclude_environment_credential=True, exclude_managed_identity_credential=True) as creds,
        AzureAIAgent.create_client(credential=creds) as client,
    ):
        # Create Incident Manager Agent
        incident_agent_definition = await client.agents.create_agent(
            model=ai_agent_settings.model_deployment_name,
            name=INCIDENT_MANAGER,
            instructions=INCIDENT_MANAGER_INSTRUCTIONS
        )
        agent_incident = AzureAIAgent(
            client=client,
            definition=incident_agent_definition,
            plugins=[LogFilePlugin()]
        )

        # Create DevOps Assistant Agent
        devops_agent_definition = await client.agents.create_agent(
            model=ai_agent_settings.model_deployment_name,
            name=DEVOPS_ASSISTANT,
            instructions=DEVOPS_ASSISTANT_INSTRUCTIONS
        )
        agent_devops = AzureAIAgent(
            client=client,
            definition=devops_agent_definition,
            plugins=[DevopsPlugin()]
        )

        # Add agents to a group chat with custom strategies
        chat = AgentGroupChat(
            agents=[agent_incident, agent_devops],
            termination_strategy=ApprovalTerminationStrategy(
                agents=[agent_incident],
                maximum_iterations=10,
                automatic_reset=True
            ),
            selection_strategy=SelectionStrategy(agents=[agent_incident, agent_devops])
        )

        # Process log files in the directory (sorted for a predictable order)
        for filename in sorted(os.listdir(file_path)):
            current_log = file_path / filename
            logfile_msg = ChatMessageContent(
                role=AuthorRole.USER, 
                content=f"USER > {current_log} | Please analyze this log."
            )
            
            print(f"Ready to process log file: {filename}\n")
            await asyncio.sleep(2) # Buffer to reduce TPM/rate limits
            
            # Append the current log file message to the chat
            await chat.add_chat_message(logfile_msg)

            # Invoke the chat and display outputs.
            # Each agent is instructed to prepend its own name, so print the content as-is.
            async for response in chat.invoke():
                print(response.content)
            
            print("\n" + "="*50 + "\n")

if __name__ == "__main__":
    asyncio.run(main())
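
Because the turn-taking and termination rules are ordinary Python, they can be sanity-checked without calling Azure. The sketch below re-implements the same two rules as standalone functions and runs them against stub objects. The stubs, the `USER_ROLE` string, and the names `next_speaker` and `is_done` are assumptions for illustration; they only mimic the `name`, `role`, and `content` attributes the strategies read.

```python
from types import SimpleNamespace

INCIDENT_MANAGER = "INCIDENT_MANAGER"
DEVOPS_ASSISTANT = "DEVOPS_ASSISTANT"
USER_ROLE = "user"  # stand-in for AuthorRole.USER


def next_speaker(agents, history):
    """Same rule as SelectionStrategy.select_agent, minus the SK base class."""
    last = history[-1]
    if last.name == DEVOPS_ASSISTANT or last.role == USER_ROLE:
        wanted = INCIDENT_MANAGER
    else:
        wanted = DEVOPS_ASSISTANT
    return next((a for a in agents if a.name == wanted), None)


def is_done(history):
    """Same rule as ApprovalTerminationStrategy.should_agent_terminate."""
    return "no action needed" in history[-1].content.lower()


agents = [
    SimpleNamespace(name=INCIDENT_MANAGER),
    SimpleNamespace(name=DEVOPS_ASSISTANT),
]
history = [
    SimpleNamespace(
        name=None,
        role=USER_ROLE,
        content="USER > logs/log1.log | Please analyze this log.",
    )
]

# After the user message, the Incident Manager should speak first.
first = next_speaker(agents, history)
history.append(
    SimpleNamespace(
        name=first.name,
        role="assistant",
        content="INCIDENT_MANAGER > logs/log1.log | Restart service ServiceX",
    )
)
# After the Incident Manager, the DevOps Assistant takes the turn.
second = next_speaker(agents, history)
print(first.name, second.name, is_done(history))
```

A check like this catches ordering mistakes, such as the assistant speaking before any log has been analyzed, before any tokens are spent.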

3. Deployment Steps

Step 1: Virtual Environment

PowerShell
python -m venv my-env
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
.\my-env\Scripts\activate

Step 2: Install and Run

PowerShell
pip install -r requirements.txt
python agent_chat.py

4. Key Learnings for the AI-102 Exam

  1. AgentGroupChat Hierarchy: Understand that the AgentGroupChat object is the container that manages the shared history (Thread) across multiple agents.

  2. Custom Selection Logic: You must know how to override select_agent. This allows for complex business logic, like "Agent B only speaks if Agent A mentions a specific keyword."

  3. State Management: In a multi-agent setup, the "History" is the source of truth. Each agent reads the entire history of the chat to understand what has already been attempted by its peers.
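
The state management point can be illustrated with a small sketch in which an agent scans the whole shared history before proposing anything. This is plain Python with a hypothetical helper, `next_untried_action`, and a history modeled as simple (speaker, text) tuples. The action names come from the Incident Manager instructions in the lab, but the scanning logic itself is an assumption and not part of the lab code.

```python
ACTIONS = [
    "Restart service",
    "Rollback transaction",
    "Redeploy resource",
    "Increase quota",
]


def next_untried_action(history):
    """Return the first action that no message in the shared history mentions."""
    attempted = {
        action
        for _, text in history
        for action in ACTIONS
        if action.lower() in text.lower()
    }
    for action in ACTIONS:
        if action not in attempted:
            return action
    # Every option has been tried, so hand the problem to a human.
    return "Escalate issue."


# One list, visible to every agent, records what has been said so far.
history = [
    ("USER", "Please analyze this log."),
    ("INCIDENT_MANAGER", "Restart service ServiceX"),
    ("DEVOPS_ASSISTANT", "Service ServiceX restarted successfully."),
]
print(next_untried_action(history))
```

Because the decision depends only on the history, any agent that joins the chat later reaches the same conclusion about what has already been attempted.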


Lab Highlights

  • Sequential Logic: The system is designed for the INCIDENT_MANAGER to read logs first via LogFilePlugin, followed by the DEVOPS_ASSISTANT executing a command via DevopsPlugin.
  • Termination: The chat terminates when "No action needed" is detected in the message content, signifying the incident is resolved.
  • Automation: As seen in the final two screenshots, the agent successfully identifies a restart for ServiceX in log1.log and a rollback in log2.log, confirming the multi-agent logic works as expected.
