wip

2025-06-26 18:02:43 +02:00
parent ea1519a208
commit d33cddef1e
13 changed files with 684 additions and 82 deletions
--- a/multi-agent-supervisor/docs/AGENT_ENHANCEMENT_SUMMARY.md
+++ b/multi-agent-supervisor/docs/AGENT_ENHANCEMENT_SUMMARY.md
@@ -0,0 +1,93 @@
+# Enhanced Agent Results Communication
+
+## Problem Identified
+The agents were only sending "Successfully transferred control back to supervisor" messages without providing meaningful analysis results from their work.
+
+## Root Cause
+The agent prompts were too brief and didn't explicitly instruct agents to:
+1. Summarize their findings after executing commands
+2. Provide structured analysis before transferring back to supervisor
+3. Include specific recommendations and insights
+
+## Solution Implemented
+
+### 1. Enhanced Agent Prompts
+Updated all agent prompts to include:
+
+- **Explicit task definitions** with required commands
+- **Structured analysis requirements** with specific sections
+- **Clear instructions** to provide comprehensive summaries
+- **A standing requirement** to provide an analysis summary before completing the task
+
+### 2. Specific Improvements by Agent
+
+#### System Agents
+- **system_info_worker**: Now analyzes CPU, memory, disk, load, and top processes and returns a structured summary
+- **service_inventory_worker**: Provides service categorization, failed services analysis, security-relevant services
+
+#### Service Agents  
+- **nginx_analyzer**: Comprehensive config validation, log analysis, specific 502/503/504 error troubleshooting
+- **mariadb_analyzer**: Database status, configuration assessment, log analysis, performance indicators
+- **phpfpm_analyzer**: Process analysis, memory limits, timeout configuration, socket connectivity
+
+#### Network Agents
+- **network_diag**: Connectivity testing, DNS analysis, port scanning with adaptive commands
+- **cert_checker**: Certificate discovery, expiration monitoring, validation with 30-day alerts
+
+#### Analysis Agents
+- **risk_scorer**: Structured risk assessment with CRITICAL/HIGH/MEDIUM/LOW levels and prioritized recommendations
+- **remediation_worker**: Safe remediation plans with dry-run commands and confirmation protocols
+- **harmonizer_worker**: Security hardening assessment with best-practice implementation
+
+### 3. Enhanced Supervisor Coordination
+Updated supervisor prompt to:
+- Better utilize agent expertise
+- Synthesize findings from multiple agents
+- Provide strategic insights beyond delegation
+
+## Expected Behavior Change
+
+### Before:
+```
+STEP 2: SYSTEM_INFO_WORKER
+Successfully transferred control back to supervisor
+
+STEP 3: SERVICE_INVENTORY_WORKER  
+Successfully transferred control back to supervisor
+
+STEP 4: SUPERVISOR
+Generic response not based on actual findings
+```
+
+### After:
+```
+STEP 2: SYSTEM_INFO_WORKER
+📊 ANALYSIS SUMMARY:
+- CPU: Intel i7, load average 0.8, 1.2, 1.5
+- Memory: 8GB total, 6.2GB used (78% utilization)
+- Disk: Root partition 85% full - WARNING LEVEL
+- Top processes: nginx (2.1GB), mysql (1.8GB)
+- Recommendation: Monitor disk space, investigate memory usage
+
+STEP 3: SERVICE_INVENTORY_WORKER
+📋 SERVICE ANALYSIS:
+- 47 services running (normal range)
+- Critical services: ✅ nginx, ✅ mysql, ❌ php-fpm (failed)
+- Failed services: php8.1-fpm.service
+- Security services: ✅ ssh, ✅ ufw
+- Recommendation: Investigate php-fpm failure for potential 502 errors
+
+STEP 4: SUPERVISOR  
+Based on system analysis showing high memory usage and service inventory
+revealing php-fpm failure, this explains your 502 errors...
+```
+
+## Files Modified
+- `agents/system_agents.py` - Enhanced system monitoring agents
+- `agents/service_agents.py` - Enhanced service-specific agents  
+- `agents/network_agents.py` - Enhanced network and security agents
+- `agents/analysis_agents.py` - Enhanced analysis and remediation agents
+- `config.py` - Enhanced supervisor prompt and coordination strategy
+
+## Result
+Agents now provide meaningful, structured analysis that the supervisor can synthesize into comprehensive, actionable responses instead of generic outputs.
--- a/multi-agent-supervisor/docs/DYNAMIC_INSTRUCTIONS.md
+++ b/multi-agent-supervisor/docs/DYNAMIC_INSTRUCTIONS.md
@@ -0,0 +1,129 @@
+# Dynamic Instructions for Agent Transfers - TODO
+
+## Current Behavior
+Currently, when the supervisor transfers control to an agent:
+- ❌ No specific instructions are passed
+- ❌ Agent sees only the conversation so far, starting from the original user query
+- ❌ Agent uses its static, pre-defined prompt
+
+## Proposed Enhancement: Dynamic Instructions
+
+### Why It Matters
+The supervisor often has context about WHY it's transferring to a specific agent. For example:
+- "Transfer to network_diag because user mentioned DNS issues - focus on DNS diagnostics"
+- "Transfer to cert_checker because certificates might be expiring - check all certs urgently"
+
+### Implementation Approach
+
+#### 1. Modify Transfer Tools
+```python
+def transfer_to_network_diag(instructions: str = "") -> str:
+    """Transfer control to network diagnostics agent.
+    
+    Args:
+        instructions: Specific guidance for the agent
+    """
+    return f"Successfully transferred to network_diag. Instructions: {instructions}"
+```
+
+#### 2. Update State to Include Instructions
+```python
+class State(BaseModel):
+    messages: List[AnyMessage]
+    next_agent: str = "supervisor"
+    supervisor_instructions: Optional[str] = None  # NEW FIELD
+```
+
+#### 3. Modify Agent Creation to Check for Instructions
+```python
+def create_network_worker(base_prompt: str, supervisor_instructions: str = ""):
+    return create_react_agent(
+        model="openai:gpt-4o-mini",
+        tools=[get_shell_tool()],
+        prompt=f"""
+{base_prompt}
+
+SUPERVISOR INSTRUCTIONS (if any): {supervisor_instructions}
+
+Always prioritize supervisor instructions when provided.
+""",
+        name="network_diag"
+    )
+```
+
+#### 4. Update Router Logic
+```python
+def route_agent(state):
+    # Extract supervisor instructions from last ToolMessage
+    last_message = state["messages"][-1]
+    if isinstance(last_message, ToolMessage) and "Instructions:" in last_message.content:
+        # Parse and store instructions (extract_instructions is a hypothetical helper, not yet written)
+        instructions = extract_instructions(last_message.content)
+        state["supervisor_instructions"] = instructions
+    
+    return state["next_agent"]
+```
+
+### Example Flow
+
+1. **User Query**: "My website is slow"
+
+2. **Supervisor Analysis**: 
+   ```
+   "Website slowness could be DNS or certificate related. 
+    Let me transfer to network_diag with specific focus."
+   ```
+
+3. **Supervisor Transfer**:
+   ```python
+   transfer_to_network_diag(
+       instructions="Focus on DNS resolution times and latency to common websites. "
+                    "Check if DNS servers are responding slowly."
+   )
+   ```
+
+4. **Network Agent Receives**:
+   - Original query: "My website is slow"
+   - Supervisor instructions: "Focus on DNS resolution times..."
+   - Can now prioritize DNS diagnostics over general network checks
+
+### Benefits
+
+1. **More Targeted Diagnostics**: Agents focus on what matters
+2. **Better Context Sharing**: Supervisor's analysis isn't lost
+3. **Efficient Execution**: Avoid running unnecessary commands
+4. **Improved Results**: More relevant output for user's specific issue
+
+### Alternative: Context in Messages
+
+Instead of modifying tools, append supervisor analysis to the message history:
+
+```python
+# Before transfer, supervisor adds a system message
+state["messages"].append(
+    SystemMessage(content=f"[SUPERVISOR GUIDANCE]: Focus on {specific_issue}")
+)
+```
+
+### Decision Points
+
+1. **Tool Parameters vs State**: Where to store instructions?
+2. **Prompt Injection vs Message History**: How to pass instructions?
+3. **Optional vs Required**: Should all transfers include instructions?
+4. **Persistence**: Should instructions carry through multiple agent hops?
+
+### Next Steps
+
+1. [ ] Decide on implementation approach
+2. [ ] Modify transfer tool signatures
+3. [ ] Update state model
+4. [ ] Enhance agent prompts to use instructions
+5. [ ] Test with various scenarios
+6. [ ] Document the new pattern
+
+### Example Test Cases
+
+- "Check network" → No specific instructions needed
+- "Website is slow" → "Focus on DNS and latency"  
+- "Certificate expiring?" → "Check all certs, prioritize those expiring soon"
+- "Port 443 issues" → "Focus on HTTPS connectivity and certificate validation"
--- a/multi-agent-supervisor/docs/README-modular.md
+++ b/multi-agent-supervisor/docs/README-modular.md
@@ -0,0 +1,90 @@
+# Multi-Agent Sysadmin Assistant
+
+A modular multi-agent system for system administration tasks using LangChain and LangGraph.
+
+## Architecture
+
+The system is organized into several modules for better maintainability:
+
+### 📁 Project Structure
+
+```
+multi-agent-supervisor/
+├── main-multi-agent.py      # Main entry point
+├── config.py                # Configuration and settings
+├── supervisor.py            # Supervisor orchestration
+├── utils.py                 # Utility functions
+├── requirements.txt         # Dependencies
+├── custom_tools/            # Custom tool implementations
+│   ├── __init__.py
+│   ├── log_tail_tool.py     # Log reading tool
+│   └── shell_tool_wrapper.py # Shell tool wrapper
+└── agents/                  # Agent definitions
+    ├── __init__.py
+    ├── system_agents.py     # System monitoring agents
+    ├── service_agents.py    # Service-specific agents
+    ├── network_agents.py    # Network and security agents
+    └── analysis_agents.py   # Analysis and remediation agents
+```
+
+## Agents
+
+### System Agents
+- **System Info Worker**: Gathers CPU, RAM, and disk usage
+- **Service Inventory Worker**: Lists running services
+
+### Service Agents  
+- **MariaDB Analyzer**: Checks MariaDB configuration and logs
+- **Nginx Analyzer**: Validates Nginx configuration and logs
+- **PHP-FPM Analyzer**: Monitors PHP-FPM status and performance
+
+### Network Agents
+- **Network Diagnostics**: Uses ping, traceroute, and dig
+- **Certificate Checker**: Monitors TLS certificate expiration
+
+### Analysis Agents
+- **Risk Scorer**: Aggregates findings and assigns severity levels
+- **Remediation Worker**: Proposes safe fixes for issues
+- **Harmonizer Worker**: Applies system hardening best practices
+
+## Benefits of Modular Architecture
+
+1. **Separation of Concerns**: Each module has a single responsibility
+2. **Reusability**: Tools and agents can be easily reused across projects
+3. **Maintainability**: Easy to update individual components
+4. **Testability**: Each module can be tested independently
+5. **Scalability**: Easy to add new agents or tools
+6. **Code Organization**: Clear structure makes navigation easier
+
+## Usage
+
+```python
+from supervisor import create_sysadmin_supervisor
+
+# Create supervisor with all agents
+supervisor = create_sysadmin_supervisor()
+
+# Run analysis
+query = {
+    "messages": [
+        {
+            "role": "user", 
+            "content": "Check if my web server is running properly"
+        }
+    ]
+}
+
+result = supervisor.invoke(query)
+```
+
+## Adding New Agents
+
+1. Create agent function in appropriate module under `agents/`
+2. Import and add to supervisor in `supervisor.py`
+3. Update supervisor prompt in `config.py`
+
+## Adding New Tools
+
+1. Create tool class in `custom_tools/`
+2. Export from `custom_tools/__init__.py`
+3. Import and use in agent definitions
--- a/multi-agent-supervisor/docs/UNDERSTANDING_TRANSFERS.md
+++ b/multi-agent-supervisor/docs/UNDERSTANDING_TRANSFERS.md
@@ -0,0 +1,182 @@
+# Understanding Multi-Agent Transfers
+
+## What "Successfully transferred..." means
+
+When you see messages like:
+- `Successfully transferred to system_info_worker`
+- `Successfully transferred back to supervisor`
+
+These are **tool execution results** from the LangGraph supervisor pattern. Here's what's happening:
+
+## 🔄 The Transfer Flow
+
+1. **Supervisor receives user query**: "Nginx returns 502 Bad Gateway on my server. What can I do?"
+
+2. **Supervisor analyzes and delegates**: Based on the `SUPERVISOR_PROMPT` in `config.py`, it decides to start with `system_info_worker`
+
+3. **Transfer tool execution**: Supervisor calls `transfer_to_system_info_worker` tool
+   - **Result**: "Successfully transferred to system_info_worker"
+   - **Meaning**: Control is now handed to the system_info_worker agent
+
+4. **Agent executes**: The `system_info_worker` gets:
+   - Full conversation context (including the original user query)
+   - Its own specialized prompt from `agents/system_agents.py`
+   - Access to its tools (shell commands for system info)
+
+5. **Agent completes and returns**: Agent calls `transfer_back_to_supervisor`
+   - **Result**: "Successfully transferred back to supervisor"
+   - **Meaning**: Agent finished its task and returned control
+   - **Important**: Agent's results are now part of the conversation history
+
+6. **Supervisor decides next step**: Based on **accumulated results**, supervisor either:
+   - Delegates to another agent (e.g., `service_inventory_worker`)
+   - Provides final response to user
+   - **Key**: Supervisor can see ALL previous agent results when making decisions
+
+## 🧠 How Prompts Work
+
+### Supervisor Prompt (config.py)
+```python
+SUPERVISOR_PROMPT = """
+You are the supervisor of a team of specialised sysadmin agents.
+Decide which agent to delegate to based on the user's query **or** on results already collected.
+Available agents:
+- system_info_worker: gather system metrics
+- service_inventory_worker: list running services  
+- mariadb_analyzer: analyse MariaDB
+...
+Always start with `system_info_worker` and `service_inventory_worker` before drilling into a specific service.
+"""
+```
+
+### Agent Prompts (agents/*.py)
+Each agent has its own specialized prompt, for example:
+
+```python
+# system_info_worker prompt
+"""
+You are a Linux sysadmin. Use shell commands like `lscpu`, `free -h`, and `df -h` to gather CPU, RAM, and disk usage. 
+Return a concise plain‑text summary. Only run safe, read‑only commands.
+"""
+```
+
+## 🎯 What Each Agent Receives
+
+When an agent is activated via transfer:
+- **Full conversation history**: All previous messages between user, supervisor, and other agents
+- **Specialized prompt**: Guides how the agent should interpret and act on the conversation
+- **Tools**: Shell access, specific analyzers, etc.
+- **Context**: Results from previous agents in the conversation
+
+## 🔄 How Agent Results Flow Back to Supervisor
+
+**This is the key mechanism that makes the multi-agent system intelligent:**
+
+1. **Agent produces results**: Each agent generates an `AIMessage` with its findings/analysis
+2. **Results become part of conversation**: The agent's response is added to the shared message history
+3. **Supervisor sees everything**: When control returns to supervisor, it has access to:
+   - Original user query
+   - All previous agent responses
+   - Tool execution results
+   - Complete conversation context
+
+4. **Supervisor strategy updates**: Based on accumulated knowledge, supervisor can:
+   - Decide which agent to call next
+   - Skip unnecessary agents if enough info is gathered
+   - Synthesize results from multiple agents
+   - Provide final comprehensive response
+
+### Example Flow:
+```
+User: "Nginx 502 error, help!"
+├── Supervisor → system_info_worker
+│   └── Returns: "Memory at 78%, root partition 85% full..."
+├── Supervisor (now knows the resource picture) → service_inventory_worker
+│   └── Returns: "nginx and mysql are running, php8.1-fpm.service has failed..."
+└── Supervisor (has both perspectives) → Final synthesis
+    └── "Based on system analysis and service inventory, here's a comprehensive solution..."
+```
+
+## 📤 What Workers Pass Back to Supervisor
+
+**Key Insight**: Workers don't explicitly "return" data. Instead, all their work becomes part of the shared conversation history that the supervisor can access.
+
+### What Gets Added to the Message History
+
+When a worker (like `network_diag`) executes:
+
+1. **AIMessages** - Agent's reasoning and analysis
+   ```
+   "I'll start by checking external connectivity..."
+   "DNS resolution appears to be working correctly..."
+   "Network Analysis Summary: All systems operational..."
+   ```
+
+2. **ToolMessages** - Raw command outputs
+   ```
+   "PING 8.8.8.8 (8.8.8.8): 56 data bytes\n64 bytes from 8.8.8.8..."
+   "google.com. 300 IN A 142.250.80.46"
+   "tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN"
+   ```
+
+3. **Transfer Confirmation** - When worker completes
+   ```
+   "Successfully transferred back to supervisor"
+   ```
+
+### Complete Message Flow Example
+
+```python
+# After network_diag completes, state["messages"] contains:
+[
+    HumanMessage("My website is slow"),                        # Original query
+    AIMessage("I'll check network connectivity..."),          # Supervisor decision
+    ToolMessage("Successfully transferred to network_diag"),   # Transfer confirmation
+    AIMessage("Starting network diagnostics..."),             # Worker starts
+    ToolMessage("PING 8.8.8.8: 64 bytes from 8.8.8.8..."),  # Command result 1
+    AIMessage("External connectivity is good, checking DNS"), # Worker analysis
+    ToolMessage("google.com. 300 IN A 142.250.80.46"),       # Command result 2
+    AIMessage("DNS working. Checking local services..."),     # Worker continues
+    ToolMessage("tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN"),      # Command result 3
+    AIMessage("Network Summary: All good, issue elsewhere"),  # Worker's final analysis
+    ToolMessage("Successfully transferred back to supervisor") # Transfer back
+]
+```
+
+### How Supervisor Uses This Information
+
+The supervisor receives **ALL** these messages and can:
+
+1. **Read command outputs** to understand technical details
+2. **See agent reasoning** to understand what was checked
+3. **Access final analysis** to make informed decisions
+4. **Decide next steps** based on accumulated evidence
+
+### Why This Design Works
+
+- **Full Transparency**: Supervisor sees everything the worker did
+- **Rich Context**: Both raw data and interpreted analysis available  
+- **Cumulative Knowledge**: Each agent builds on previous work
+- **Intelligent Routing**: Supervisor can adapt strategy based on findings
+
+### Example: Multi-Agent Collaboration
+
+```
+User: "Website is slow"
+├── network_diag finds: "Network is fine"
+├── cert_checker finds: "Certificate expires tomorrow!" 
+└── Supervisor synthesis: "Network is fine; the certificate expiring tomorrow needs urgent renewal"
+```
+
+The supervisor can correlate findings across multiple workers because it sees all their work in the message history.
+
+## 📋 Key Takeaways
+
+- **"Successfully transferred"** = Control handoff confirmation, not data transfer
+- **Each agent** gets the full conversation context INCLUDING previous agent results
+- **Agent prompts** determine how they process that context
+- **Supervisor** orchestrates the workflow based on its prompt strategy
+- **The conversation** builds up context as each agent contributes their expertise
+- **Results accumulate**: Each agent can see and build upon previous agents' work
+- **Supervisor learns**: Strategy updates based on what agents discover
+- **Dynamic workflow**: Supervisor can skip agents or change direction based on results