Gaetan Hurel
2025-06-26 18:02:43 +02:00
parent ea1519a208
commit d33cddef1e
13 changed files with 684 additions and 82 deletions


@@ -0,0 +1,93 @@
# Enhanced Agent Results Communication
## Problem Identified
The agents were only sending "Successfully transferred control back to supervisor" messages without providing meaningful analysis results from their work.
## Root Cause
The agent prompts were too brief and didn't explicitly instruct agents to:
1. Summarize their findings after executing commands
2. Provide structured analysis before transferring back to supervisor
3. Include specific recommendations and insights
## Solution Implemented
### 1. Enhanced Agent Prompts
Updated all agent prompts to include:
- **Explicit task definitions** with required commands
- **Structured analysis requirements** with specific sections
- **Clear instructions** to provide comprehensive summaries
- **A mandatory analysis summary** that each agent must provide before completing its task
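The prompt structure described above can be sketched as a small builder. This is only an illustration: `build_agent_prompt` and its parameters are hypothetical names, and the wording is not the text actually used in the `agents/` modules.

```python
def build_agent_prompt(role: str, commands: list[str], sections: list[str]) -> str:
    """Assemble a worker prompt with explicit tasks and a required summary."""
    command_lines = "\n".join(f"- `{cmd}`" for cmd in commands)
    section_lines = "\n".join(
        f"{i}. {name}" for i, name in enumerate(sections, start=1)
    )
    return (
        f"You are {role}.\n\n"
        f"TASKS: run these commands:\n{command_lines}\n\n"
        f"ANALYSIS: report your findings under these sections:\n{section_lines}\n\n"
        "IMPORTANT: always provide the analysis summary before transferring "
        "back to the supervisor."
    )


# Hypothetical example values, loosely based on the system_info_worker description.
prompt = build_agent_prompt(
    role="a Linux sysadmin gathering system metrics",
    commands=["lscpu", "free -h", "df -h"],
    sections=["CPU", "Memory", "Disk", "Recommendations"],
)
print(prompt)
```

Keeping the required sections in a list makes it easy to check that every agent prompt asks for a summary, which is the omission identified in the root cause.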
### 2. Specific Improvements by Agent
#### System Agents
- **system_info_worker**: Now analyzes CPU, memory, disk, load, and top processes with structured summary
- **service_inventory_worker**: Provides service categorization, failed services analysis, security-relevant services
#### Service Agents
- **nginx_analyzer**: Comprehensive config validation, log analysis, specific 502/503/504 error troubleshooting
- **mariadb_analyzer**: Database status, configuration assessment, log analysis, performance indicators
- **phpfpm_analyzer**: Process analysis, memory limits, timeout configuration, socket connectivity
#### Network Agents
- **network_diag**: Connectivity testing, DNS analysis, port scanning with adaptive commands
- **cert_checker**: Certificate discovery, expiration monitoring, validation with 30-day alerts
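As a rough illustration of the 30-day alert mentioned for `cert_checker`, the check reduces to a date comparison. The function name and status labels below are assumptions for the sketch; the agent itself relies on shell commands, and its exact logic is not shown in this document.

```python
from datetime import datetime, timedelta, timezone

ALERT_WINDOW = timedelta(days=30)  # threshold taken from the cert_checker description


def classify_certificate(not_after: datetime, now: datetime) -> str:
    """Return EXPIRED, EXPIRING_SOON, or OK for a certificate expiry date."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "EXPIRED"
    if remaining <= ALERT_WINDOW:
        return "EXPIRING_SOON"
    return "OK"


# Hypothetical dates used only to exercise the three branches.
now = datetime(2025, 6, 26, tzinfo=timezone.utc)
soon = classify_certificate(datetime(2025, 7, 10, tzinfo=timezone.utc), now)
later = classify_certificate(datetime(2026, 1, 1, tzinfo=timezone.utc), now)
past = classify_certificate(datetime(2025, 6, 1, tzinfo=timezone.utc), now)
print(soon, later, past)
```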
#### Analysis Agents
- **risk_scorer**: Structured risk assessment with CRITICAL/HIGH/MEDIUM/LOW levels and prioritized recommendations
- **remediation_worker**: Safe remediation plans with dry-run commands and confirmation protocols
- **harmonizer_worker**: Security hardening assessment with best-practice implementation
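The CRITICAL/HIGH/MEDIUM/LOW levels used by `risk_scorer` imply an ordering for prioritized recommendations. A minimal sketch of that ordering follows; the `Finding` type and the severities assigned to the sample findings are hypothetical, since in practice the agent's LLM prompt drives the assessment.

```python
from dataclasses import dataclass

# Lower number means higher priority.
SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}


@dataclass
class Finding:
    source: str    # agent that reported the finding
    severity: str  # one of the keys in SEVERITY_ORDER
    summary: str


def prioritize(findings: list[Finding]) -> list[Finding]:
    """Sort findings from most to least severe (stable for equal severities)."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])


findings = [
    Finding("system_info_worker", "MEDIUM", "Root partition 85% full"),
    Finding("service_inventory_worker", "CRITICAL", "php8.1-fpm.service failed"),
    Finding("cert_checker", "LOW", "All certificates valid"),
]
ordered = prioritize(findings)
for finding in ordered:
    print(f"[{finding.severity}] {finding.source}: {finding.summary}")
```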
### 3. Enhanced Supervisor Coordination
Updated supervisor prompt to:
- Better utilize agent expertise
- Synthesize findings from multiple agents
- Provide strategic insights beyond delegation
## Expected Behavior Change
### Before:
```
STEP 2: SYSTEM_INFO_WORKER
Successfully transferred control back to supervisor
STEP 3: SERVICE_INVENTORY_WORKER
Successfully transferred control back to supervisor
STEP 4: SUPERVISOR
Generic response not based on actual findings
```
### After:
```
STEP 2: SYSTEM_INFO_WORKER
📊 ANALYSIS SUMMARY:
- CPU: Intel i7, load average 0.8, 1.2, 1.5
- Memory: 8GB total, 6.2GB used (78% utilization)
- Disk: Root partition 85% full - WARNING LEVEL
- Top processes: nginx (2.1GB), mysql (1.8GB)
- Recommendation: Monitor disk space, investigate memory usage
STEP 3: SERVICE_INVENTORY_WORKER
📋 SERVICE ANALYSIS:
- 47 services running (normal range)
- Critical services: ✅ nginx, ✅ mysql, ❌ php-fpm (failed)
- Failed services: php8.1-fpm.service
- Security services: ✅ ssh, ✅ ufw
- Recommendation: Investigate php-fpm failure for potential 502 errors
STEP 4: SUPERVISOR
Based on system analysis showing high memory usage and service inventory
revealing php-fpm failure, this explains your 502 errors...
```
## Files Modified
- `agents/system_agents.py` - Enhanced system monitoring agents
- `agents/service_agents.py` - Enhanced service-specific agents
- `agents/network_agents.py` - Enhanced network and security agents
- `agents/analysis_agents.py` - Enhanced analysis and remediation agents
- `config.py` - Enhanced supervisor prompt and coordination strategy
## Result
Agents now provide meaningful, structured analysis that the supervisor can synthesize into comprehensive, actionable responses instead of generic outputs.


@@ -0,0 +1,129 @@
# Dynamic Instructions for Agent Transfers - TODO
## Current Behavior
Currently, when the supervisor transfers control to an agent:
- ❌ No specific instructions are passed
- ❌ Agent only sees the original user query
- ❌ Agent uses its static, pre-defined prompt
## Proposed Enhancement: Dynamic Instructions
### Why It Matters
The supervisor often has context about WHY it's transferring to a specific agent. For example:
- "Transfer to network_diag because user mentioned DNS issues - focus on DNS diagnostics"
- "Transfer to cert_checker because certificates might be expiring - check all certs urgently"
### Implementation Approach
#### 1. Modify Transfer Tools
```python
def transfer_to_network_diag(instructions: str = "") -> str:
    """Transfer control to network diagnostics agent.

    Args:
        instructions: Specific guidance for the agent
    """
    return f"Successfully transferred to network_diag. Instructions: {instructions}"
```
#### 2. Update State to Include Instructions
```python
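from typing import List, Optional
from langchain_core.messages import AnyMessage
from pydantic import BaseModel

class State(BaseModel):
    messages: List[AnyMessage]
    next_agent: str = "supervisor"
    supervisor_instructions: Optional[str] = None  # NEW FIELD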
```
#### 3. Modify Agent Creation to Check for Instructions
```python
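from langgraph.prebuilt import create_react_agent

# Assumes the shell tool wrapper is exported from custom_tools/__init__.py
from custom_tools import get_shell_tool

def create_network_worker():
    # NOTE: {base_prompt} and {supervisor_instructions} are template placeholders.
    # A plain string prompt is passed through verbatim, so they still need to be
    # filled in before the prompt reaches the model (see "Decision Points").
    return create_react_agent(
        model="openai:gpt-4o-mini",
        tools=[get_shell_tool()],
        prompt="""
{base_prompt}
SUPERVISOR INSTRUCTIONS (if any): {supervisor_instructions}
Always prioritize supervisor instructions when provided.
""",
        name="network_diag",
    )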
```
#### 4. Update Router Logic
```python
from langchain_core.messages import ToolMessage

def route_agent(state):
    # Extract supervisor instructions from last ToolMessage
    last_message = state["messages"][-1]
    if isinstance(last_message, ToolMessage) and "Instructions:" in last_message.content:
        # Parse and store instructions
        instructions = extract_instructions(last_message.content)
        state["supervisor_instructions"] = instructions
    return state["next_agent"]
```
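The router above calls `extract_instructions`, which is not defined in this document. One possible sketch, assuming the `Instructions:` marker format produced by the transfer tool in step 1, is:

```python
from typing import Optional

MARKER = "Instructions:"  # marker used in the transfer tool's return string


def extract_instructions(content: str) -> Optional[str]:
    """Return the text after the marker, or None if absent or empty."""
    _, found, tail = content.partition(MARKER)
    instructions = tail.strip()
    return instructions if found and instructions else None


with_guidance = extract_instructions(
    "Successfully transferred to network_diag. Instructions: Focus on DNS"
)
without_guidance = extract_instructions(
    "Successfully transferred to network_diag. Instructions: "
)
print(with_guidance, without_guidance)
```

Returning `None` for an empty instruction string lets callers treat "no guidance" and "blank guidance" the same way. Parsing free text is fragile, which is one argument for carrying the instructions in a dedicated state field instead.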
### Example Flow
1. **User Query**: "My website is slow"
2. **Supervisor Analysis**:
   ```
   "Website slowness could be DNS or certificate related.
   Let me transfer to network_diag with specific focus."
   ```
3. **Supervisor Transfer**:
   ```python
   transfer_to_network_diag(
       instructions=(
           "Focus on DNS resolution times and latency to common websites. "
           "Check if DNS servers are responding slowly."
       )
   )
   ```
4. **Network Agent Receives**:
   - Original query: "My website is slow"
   - Supervisor instructions: "Focus on DNS resolution times..."
   - Can now prioritize DNS diagnostics over general network checks
### Benefits
1. **More Targeted Diagnostics**: Agents focus on what matters
2. **Better Context Sharing**: Supervisor's analysis isn't lost
3. **Efficient Execution**: Avoid running unnecessary commands
4. **Improved Results**: More relevant output for user's specific issue
### Alternative: Context in Messages
Instead of modifying tools, append supervisor analysis to the message history:
```python
from langchain_core.messages import SystemMessage

# Before transfer, supervisor adds a system message
state["messages"].append(
    SystemMessage(content=f"[SUPERVISOR GUIDANCE]: Focus on {specific_issue}")
)
```
### Decision Points
1. **Tool Parameters vs State**: Where to store instructions?
2. **Prompt Injection vs Message History**: How to pass instructions?
3. **Optional vs Required**: Should all transfers include instructions?
4. **Persistence**: Should instructions carry through multiple agent hops?
### Next Steps
1. [ ] Decide on implementation approach
2. [ ] Modify transfer tool signatures
3. [ ] Update state model
4. [ ] Enhance agent prompts to use instructions
5. [ ] Test with various scenarios
6. [ ] Document the new pattern
### Example Test Cases
- "Check network" → No specific instructions needed
- "Website is slow" → "Focus on DNS and latency"
- "Certificate expiring?" → "Check all certs, prioritize those expiring soon"
- "Port 443 issues" → "Focus on HTTPS connectivity and certificate validation"


@@ -0,0 +1,90 @@
# Multi-Agent Sysadmin Assistant
A modular multi-agent system for system administration tasks using LangChain and LangGraph.
## Architecture
The system is organized into several modules for better maintainability:
### 📁 Project Structure
```
multi-agent-supervisor/
├── main-multi-agent.py          # Main entry point
├── config.py                    # Configuration and settings
├── supervisor.py                # Supervisor orchestration
├── utils.py                     # Utility functions
├── requirements.txt             # Dependencies
├── custom_tools/                # Custom tool implementations
│   ├── __init__.py
│   ├── log_tail_tool.py         # Log reading tool
│   └── shell_tool_wrapper.py    # Shell tool wrapper
└── agents/                      # Agent definitions
    ├── __init__.py
    ├── system_agents.py         # System monitoring agents
    ├── service_agents.py        # Service-specific agents
    ├── network_agents.py        # Network and security agents
    └── analysis_agents.py       # Analysis and remediation agents
```
## Agents
### System Agents
- **System Info Worker**: Gathers CPU, RAM, and disk usage
- **Service Inventory Worker**: Lists running services
### Service Agents
- **MariaDB Analyzer**: Checks MariaDB configuration and logs
- **Nginx Analyzer**: Validates Nginx configuration and logs
- **PHP-FPM Analyzer**: Monitors PHP-FPM status and performance
### Network Agents
- **Network Diagnostics**: Uses ping, traceroute, and dig
- **Certificate Checker**: Monitors TLS certificate expiration
### Analysis Agents
- **Risk Scorer**: Aggregates findings and assigns severity levels
- **Remediation Worker**: Proposes safe fixes for issues
- **Harmonizer Worker**: Applies system hardening best practices
## Benefits of Modular Architecture
1. **Separation of Concerns**: Each module has a single responsibility
2. **Reusability**: Tools and agents can be easily reused across projects
3. **Maintainability**: Easy to update individual components
4. **Testability**: Each module can be tested independently
5. **Scalability**: Easy to add new agents or tools
6. **Code Organization**: Clear structure makes navigation easier
## Usage
```python
from supervisor import create_sysadmin_supervisor

# Create supervisor with all agents
supervisor = create_sysadmin_supervisor()

# Run analysis
query = {
    "messages": [
        {
            "role": "user",
            "content": "Check if my web server is running properly"
        }
    ]
}
result = supervisor.invoke(query)
```
## Adding New Agents
1. Create agent function in appropriate module under `agents/`
2. Import and add to supervisor in `supervisor.py`
3. Update supervisor prompt in `config.py`
## Adding New Tools
1. Create tool class in `custom_tools/`
2. Export from `custom_tools/__init__.py`
3. Import and use in agent definitions
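As an illustration of the kind of logic a tool in `custom_tools/` might wrap, here is a standard-library sketch of a log tail helper. `tail_lines` is a hypothetical name; the actual `log_tail_tool.py` implementation is not shown in this document, and wiring the function into an agent as a tool is left out.

```python
import tempfile
from collections import deque
from pathlib import Path


def tail_lines(path: str, max_lines: int = 50) -> str:
    """Return the last max_lines lines of a text file as one string."""
    with Path(path).open("r", encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the most recent lines while streaming the file.
        return "".join(deque(handle, maxlen=max_lines))


# Demonstration on a temporary file standing in for a real log.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("".join(f"line {i}\n" for i in range(1, 6)))
demo_output = tail_lines(tmp.name, max_lines=2)
Path(tmp.name).unlink()  # clean up the temporary file
print(demo_output)
```

Streaming through a bounded deque avoids loading a large log file into memory, which matters for read-only diagnostics on busy servers.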


@@ -0,0 +1,182 @@
# Understanding Multi-Agent Transfers
## What "Successfully transferred..." means
When you see messages like:
- `Successfully transferred to system_info_worker`
- `Successfully transferred back to supervisor`
These are **tool execution results** from the LangGraph supervisor pattern. Here's what's happening:
## 🔄 The Transfer Flow
1. **Supervisor receives user query**: "Nginx returns 502 Bad Gateway on my server. What can I do?"
2. **Supervisor analyzes and delegates**: Based on the `SUPERVISOR_PROMPT` in `config.py`, it decides to start with `system_info_worker`
3. **Transfer tool execution**: Supervisor calls `transfer_to_system_info_worker` tool
   - **Result**: "Successfully transferred to system_info_worker"
   - **Meaning**: Control is now handed to the system_info_worker agent
4. **Agent executes**: The `system_info_worker` gets:
   - Full conversation context (including the original user query)
   - Its own specialized prompt from `agents/system_agents.py`
   - Access to its tools (shell commands for system info)
5. **Agent completes and returns**: Agent calls `transfer_back_to_supervisor`
   - **Result**: "Successfully transferred back to supervisor"
   - **Meaning**: Agent finished its task and returned control
   - **Important**: Agent's results are now part of the conversation history
6. **Supervisor decides next step**: Based on **accumulated results**, supervisor either:
   - Delegates to another agent (e.g., `service_inventory_worker`)
   - Provides final response to user
   - **Key**: Supervisor can see ALL previous agent results when making decisions
## 🧠 How Prompts Work
### Supervisor Prompt (config.py)
```python
SUPERVISOR_PROMPT = """
You are the supervisor of a team of specialised sysadmin agents.
Decide which agent to delegate to based on the user's query **or** on results already collected.
Available agents:
- system_info_worker: gather system metrics
- service_inventory_worker: list running services
- mariadb_analyzer: analyse MariaDB
...
Always start with `system_info_worker` and `service_inventory_worker` before drilling into a specific service.
"""
```
### Agent Prompts (agents/*.py)
Each agent has its own specialized prompt, for example:
```python
# system_info_worker prompt
"""
You are a Linux sysadmin. Use shell commands like `lscpu`, `free -h`, and `df -h` to gather CPU, RAM, and disk usage.
Return a concise plaintext summary. Only run safe, readonly commands.
"""
```
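The instruction to run only safe, read-only commands lives in the prompt, so it depends on the model complying. A guard in code can back it up. The sketch below is a deliberately conservative allowlist check; the function name and the command set (including `uptime` and `ps`) are assumptions, and the project's `shell_tool_wrapper.py` may behave differently.

```python
import shlex

# Hypothetical allowlist; extend it to match the commands each agent needs.
READONLY_COMMANDS = {"lscpu", "free", "df", "uptime", "ps"}
SHELL_METACHARACTERS = ";|&><`$"


def is_readonly_command(command: str) -> bool:
    """Accept only a single allowlisted command with no shell operators."""
    if any(ch in command for ch in SHELL_METACHARACTERS):
        return False  # reject chaining, pipes, redirection, and substitution
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes
    return bool(tokens) and tokens[0] in READONLY_COMMANDS


for candidate in ["df -h", "free -h", "rm -rf /tmp/x", "df -h; rm -rf /"]:
    print(candidate, "->", is_readonly_command(candidate))
```

Rejecting every shell metacharacter is stricter than necessary (it blocks harmless pipes, for example), but it keeps the check easy to reason about.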
## 🎯 What Each Agent Receives
When an agent is activated via transfer:
- **Full conversation history**: All previous messages between user, supervisor, and other agents
- **Specialized prompt**: Guides how the agent should interpret and act on the conversation
- **Tools**: Shell access, specific analyzers, etc.
- **Context**: Results from previous agents in the conversation
## 🔄 How Agent Results Flow Back to Supervisor
**This is the key mechanism that makes the multi-agent system intelligent:**
1. **Agent produces results**: Each agent generates an `AIMessage` with its findings/analysis
2. **Results become part of conversation**: The agent's response is added to the shared message history
3. **Supervisor sees everything**: When control returns to supervisor, it has access to:
   - Original user query
   - All previous agent responses
   - Tool execution results
   - Complete conversation context
4. **Supervisor strategy updates**: Based on accumulated knowledge, supervisor can:
   - Decide which agent to call next
   - Skip unnecessary agents if enough info is gathered
   - Synthesize results from multiple agents
   - Provide final comprehensive response
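The accumulation mechanism can be imitated in a few lines of plain Python. This toy loop is not LangGraph code: the dictionaries stand in for message objects, the lambdas stand in for agents, and the fixed `plan` replaces the LLM-driven routing that the real supervisor performs.

```python
def supervisor_next_step(history, plan):
    """Pick the first planned agent that has not reported yet, else finish."""
    reported = {msg["name"] for msg in history if msg["role"] == "agent"}
    for agent in plan:
        if agent not in reported:
            return agent
    return "FINISH"


def run(query, agents, plan):
    history = [{"role": "user", "name": "user", "content": query}]
    while (agent := supervisor_next_step(history, plan)) != "FINISH":
        # Each worker receives the full history and its result is appended to it.
        result = agents[agent](history)
        history.append({"role": "agent", "name": agent, "content": result})
    return history


# Stand-in workers with canned findings taken from the examples in this document.
agents = {
    "system_info_worker": lambda history: "Disk: root partition 85% full",
    "service_inventory_worker": lambda history: (
        f"php-fpm failed (saw {len(history)} earlier messages)"
    ),
}
plan = ["system_info_worker", "service_inventory_worker"]
history = run("Nginx returns 502 Bad Gateway", agents, plan)
for message in history:
    print(f"{message['name']}: {message['content']}")
```

The second worker reports how many messages it could see, which shows that it received the first worker's result along with the user query.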
### Example Flow:
```
User: "Nginx 502 error, help!"
├── Supervisor → system_info_worker
│ └── Returns: "502 usually means upstream server issues, check logs..."
├── Supervisor (now knows about upstream issues) → service_inventory_worker
│ └── Returns: "Check PHP-FPM status, verify upstream config..."
└── Supervisor (has both perspectives) → Final synthesis
└── "Based on system analysis and service inventory, here's comprehensive solution..."
```
## 📤 What Workers Pass Back to Supervisor
**Key Insight**: Workers don't explicitly "return" data. Instead, all their work becomes part of the shared conversation history that the supervisor can access.
### What Gets Added to the Message History
When a worker (like `network_diag`) executes:
1. **AIMessages** - Agent's reasoning and analysis
   ```
   "I'll start by checking external connectivity..."
   "DNS resolution appears to be working correctly..."
   "Network Analysis Summary: All systems operational..."
   ```
2. **ToolMessages** - Raw command outputs
   ```
   "PING 8.8.8.8 (8.8.8.8): 56 data bytes\n64 bytes from 8.8.8.8..."
   "google.com. 300 IN A 142.250.80.46"
   "tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN"
   ```
3. **Transfer Confirmation** - When worker completes
   ```
   "Successfully transferred back to supervisor"
   ```
### Complete Message Flow Example
```python
# After network_diag completes, state["messages"] contains:
[
    HumanMessage("My website is slow"),                           # Original query
    AIMessage("I'll check network connectivity..."),              # Supervisor decision
    ToolMessage("Successfully transferred to network_diag"),      # Transfer confirmation
    AIMessage("Starting network diagnostics..."),                 # Worker starts
    ToolMessage("PING 8.8.8.8: 64 bytes from 8.8.8.8..."),        # Command result 1
    AIMessage("External connectivity is good, checking DNS"),     # Worker analysis
    ToolMessage("google.com. 300 IN A 142.250.80.46"),            # Command result 2
    AIMessage("DNS working. Checking local services..."),         # Worker continues
    ToolMessage("tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN"),           # Command result 3
    AIMessage("Network Summary: All good, issue elsewhere"),      # Worker's final analysis
    ToolMessage("Successfully transferred back to supervisor"),   # Transfer back
]
```
### How Supervisor Uses This Information
The supervisor receives **ALL** these messages and can:
1. **Read command outputs** to understand technical details
2. **See agent reasoning** to understand what was checked
3. **Access final analysis** to make informed decisions
4. **Decide next steps** based on accumulated evidence
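To make these four uses concrete, the sketch below separates a message history into reasoning, raw command output, and the final analysis. `Msg` is a simplified stand-in for the LangChain message classes, and `split_history` is a hypothetical helper; the real supervisor is an LLM that reads the messages directly rather than filtering them in code.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    kind: str      # "human", "ai", or "tool"
    content: str


TRANSFER_PREFIX = "Successfully transferred"


def split_history(messages):
    """Separate agent reasoning from raw tool output, skipping transfer notices."""
    reasoning = [m.content for m in messages if m.kind == "ai"]
    raw_outputs = [
        m.content
        for m in messages
        if m.kind == "tool" and not m.content.startswith(TRANSFER_PREFIX)
    ]
    final_analysis = reasoning[-1] if reasoning else None
    return reasoning, raw_outputs, final_analysis


messages = [
    Msg("human", "My website is slow"),
    Msg("tool", "Successfully transferred to network_diag"),
    Msg("ai", "Starting network diagnostics..."),
    Msg("tool", "PING 8.8.8.8: 64 bytes from 8.8.8.8..."),
    Msg("ai", "Network Summary: All good, issue elsewhere"),
    Msg("tool", "Successfully transferred back to supervisor"),
]
reasoning, raw_outputs, final_analysis = split_history(messages)
print(final_analysis)
```

Filtering out the transfer confirmations reflects the point that those messages signal a control handoff and carry no findings.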
### Why This Design Works
- **Full Transparency**: Supervisor sees everything the worker did
- **Rich Context**: Both raw data and interpreted analysis available
- **Cumulative Knowledge**: Each agent builds on previous work
- **Intelligent Routing**: Supervisor can adapt strategy based on findings
### Example: Multi-Agent Collaboration
```
User: "Website is slow"
├── network_diag finds: "Network is fine"
├── cert_checker finds: "Certificate expires tomorrow!"
└── Supervisor synthesis: "Issue is expiring certificate, not network"
```
The supervisor can correlate findings across multiple workers because it sees all their work in the message history.
## 📋 Key Takeaways
- **"Successfully transferred"** = Control handoff confirmation, not data transfer
- **Each agent** gets the full conversation context INCLUDING previous agent results
- **Agent prompts** determine how they process that context
- **Supervisor** orchestrates the workflow based on its prompt strategy
- **The conversation** builds up context as each agent contributes their expertise
- **Results accumulate**: Each agent can see and build upon previous agents' work
- **Supervisor learns**: Strategy updates based on what agents discover
- **Dynamic workflow**: Supervisor can skip agents or change direction based on results