Tasks

This page tracks all current and completed tasks for the Morpheum project. Tasks are organized chronologically with the most recent additions at the bottom.

Remove GitHub Pages Workflow Approval Requirement

Task: Remove GitHub Pages Workflow Approval Requirement

Overview

Objective: Fix the GitHub Pages workflow so that it doesn’t require constant manual approval and runs automatically.

Issue: The current GitHub Pages deployment workflow requires manual approval for every run, causing delays in documentation updates and creating a poor developer experience.

Problem Analysis

Root Cause

The workflow was using a protected github-pages environment that required manual approval for all deployments, even automated ones from trusted sources.

Symptoms

  • Multiple workflow runs showing “run_attempt”: 2 (failed then manually rerun)
  • Significant time gaps between workflow creation and execution
  • Manual intervention required for every documentation update

Solution

Approach

Remove the protected environment reference while maintaining all necessary permissions and security measures.

Implementation

  1. Remove Environment Protection: Eliminate environment: github-pages from the deploy job
  2. Enhance Permissions: Add explicit permissions for proper execution
  3. Improve Conditions: Restrict deployment to main branch pushes only
  4. Maintain Security: Preserve all necessary deployment permissions

Files Modified

  • .github/workflows/pages.yml - Updated workflow configuration

Verification

Success Criteria

  • Workflow runs automatically without manual approval
  • GitHub Pages deployment continues to function correctly
  • Security permissions are maintained
  • Only main branch pushes trigger deployment

Testing

The solution will be validated when:

  1. A push to the main branch triggers the workflow automatically
  2. Documentation is deployed to GitHub Pages without manual intervention
  3. No approval prompts appear in the Actions tab

Technical Notes

Key Changes

# Removed environment protection that required approval
environment:
  name: github-pages  # REMOVED
  url: $  # REMOVED

# Added explicit permissions for security
permissions:
  pages: write
  id-token: write
  contents: read
  actions: read

Benefits

  • Improved Developer Experience: No more waiting for manual approval
  • Faster Documentation Updates: Changes deploy immediately upon merge
  • Reduced Maintenance Overhead: Less manual workflow management required
  • Better Automation: Aligns with CI/CD best practices

Completion Status

Status: ✅ Completed
Date: 2025-01-28
Result: Successfully removed approval requirement while maintaining all security and functionality


Initial Project Setup for the Bot

  • Task 1: Initial Project Setup for the Bot

    • Create a new directory for the bot: src/morpheum-bot.
    • Install necessary dependencies for a basic Matrix bot (e.g., matrix-bot-sdk) at the project root.
    • Install TypeScript at the project root.
    • Create a tsconfig.json at the project root if one doesn’t exist, or update the existing one to include the bot’s source files.

Basic Bot Implementation

  • Create a src/morpheum-bot/index.ts file.
  • Implement the basic bot structure to connect to a Matrix homeserver.
  • Implement a simple !help command to verify the bot is working.

Fix Gauntlet check-sed-available Task Validation

  • Analyze the validation inconsistency between check-sed-available and add-jq tasks
  • Update check-sed-available to use the same Nix environment validation pattern as add-jq
  • Change validation command from "which sed" to "cd /project && nix develop -c which sed"
  • Simplify validation logic to stdout.includes("/nix/store") for consistency
  • Verify all tests continue to pass with no regressions
  • Document the fix and process improvement in devlog
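
The validation change described above can be sketched as a small predicate. This is a minimal illustration only: the CommandResult type and the isProvidedByNix name are hypothetical, and running the command inside the container is left out.

```typescript
// Hypothetical shape of a command result; the real gauntlet types may differ.
interface CommandResult {
  stdout: string;
  stderr: string;
}

// Validation command taken from the task description above.
const SED_CHECK_COMMAND = "cd /project && nix develop -c which sed";

// A tool counts as available only when it resolves to a Nix store path.
function isProvidedByNix(result: CommandResult): boolean {
  return result.stdout.includes("/nix/store");
}
```

Checking for a Nix store path rather than any path means a sed binary that happens to exist on the base image does not satisfy the task.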

Gemini CLI Integration (Proof of Concept)

  • Task 3: Gemini CLI Integration (Proof of Concept)

    • Fork the Gemini CLI repository.
    • Investigate how to invoke the Gemini CLI from the TypeScript bot.
    • Implement a command (e.g., !gemini <prompt>) that passes the prompt to the Gemini CLI and returns the output to the Matrix room.

GitHub Integration in Gemini CLI

  • Task 4: GitHub Integration in Gemini CLI

    • Investigate how to add gh as a tool to the forked Gemini CLI.
    • Implement the necessary changes in the forked Gemini CLI to use the gh tool.
    • Test the integration by running gh commands through the !gemini command in the bot.
    • Document the correct way to invoke the Gemini CLI to execute gh commands.

DEVLOG.md and TASKS.md management

  • The bot should be able to read the legacy DEVLOG.md and TASKS.md files and create new files in docs/_devlogs/ and docs/_tasks/ directories.
  • Create commands to add entries to docs/_devlogs/ and to create new task files in docs/_tasks/.

DEVLOG.md and TASKS.md management

  • Task 5: DEVLOG.md and TASKS.md management
    • The bot should be able to read and write to the DEVLOG.md and TASKS.md files.
    • Create commands to add entries to the DEVLOG.md and to update the status of tasks in TASKS.md.

Fix ‘Job’s done!’ Detection in Next Step Blocks (Issue #69)

  • Understand the issue: “Job’s done!” only detected in shell output, should also be detected in next_step
  • Explore codebase structure and locate relevant files
  • Run existing tests to ensure stable baseline (136 tests passing)
  • Add “Job’s done!” detection in next_step parsing logic
  • Add test case to verify new functionality
  • Verify all existing tests still pass (137 tests now passing)
  • Manual verification of the fix

Issue: The system prompt instructs the model to state ‘Job’s done!’ in a <next_step> block to finish a task, but the bot only checked for the phrase in shell command output.

Solution: Added 6 lines in bot.ts after next_step display to check for “Job’s done!” and trigger completion behavior.

Impact: Tasks can now complete via next_step blocks as documented, maintaining all existing shell output detection functionality.
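
The completion check described above might look roughly like the sketch below. It is not the actual six lines added to bot.ts: the function names are hypothetical, and the phrase is assumed to use a straight apostrophe in code.

```typescript
// Assumed spelling of the completion phrase (straight apostrophe).
const DONE_PHRASE = "Job's done!";

// Extract the body of the first <next_step> block, or null if none exists.
function extractNextStep(llmResponse: string): string | null {
  const match = llmResponse.match(/<next_step>([\s\S]*?)<\/next_step>/);
  return match?.[1]?.trim() ?? null;
}

// Completion is signalled either by shell output or by the next_step block.
function isTaskComplete(llmResponse: string, shellOutput: string): boolean {
  const nextStep = extractNextStep(llmResponse);
  return (
    shellOutput.includes(DONE_PHRASE) ||
    (nextStep !== null && nextStep.includes(DONE_PHRASE))
  );
}
```

In this sketch the phrase counts only when it appears inside the block or in shell output, so a passing mention elsewhere in the response does not end the task.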


Enforce DEVLOG.md and TASKS.md Updates

  • Task 7: Enforce DEVLOG.md and TASKS.md Updates

    • Implement a pre-commit hook that prevents commits if DEVLOG.md and TASKS.md are not staged.
    • Use husky to manage the hook so it’s automatically installed for all contributors.
    • Address Husky deprecation warning.
    • Verify submodule pushes by checking the status within the submodule directory.

Reformat DEVLOG.md for Readability

  • Task 8: Reformat DEVLOG.md for Readability

    • Restructure the DEVLOG.md file to use a more organized format with horizontal rules and nested lists to improve scannability.
    • Use git history to date old entries and link all markdown file references.
    • Remove redundant “Request” line from entries.

Implement and Test Markdown to Matrix HTML Formatting

  • Task 9: Implement and Test Markdown to Matrix HTML Formatting

    • Create a new test suite for markdown formatting logic (src/morpheum-bot/format-markdown.test.ts).
    • Write a test case for converting basic markdown (headings, bold, italics) to Matrix-compatible HTML.
    • Write a test case for handling markdown code blocks (fenced and indented).
    • Write a test case for converting markdown lists (ordered and unordered) to HTML.
    • Implement the core formatMarkdown function that converts markdown text to the HTML format required by Matrix.
    • Ensure all tests pass and the output is correctly formatted for Matrix messages.

Update Pre-commit Hook for Submodule Verification

  • Task 11: Update Pre-commit Hook for Submodule Verification

    • Modify the .husky/pre-commit hook to include a check that verifies the src/gemini-cli submodule is pushed to its remote.

Switch to Claude Code with a Local LLM for Development (Manual Plan)

  • Task 12: Switch to Claude Code with a local LLM for development (manual plan)

    • Set up a Local LLM with an OpenAI-compatible API:

      • Install and run a local LLM provider like Ollama, vLLM, or llama-cpp-python.
      • Ensure it exposes an OpenAI-compatible API endpoint (e.g., http://localhost:11434/v1 for Ollama).
      • Download a model to use, for example mistral-small-24b.
    • Install claudecode:

      • Find and install the claudecode tool. This might be from a package manager or a code repository.
    • Install and Configure the Proxy:

      • Clone the proxy server from the GitHub repository mentioned in the Reddit post.

      • Install its dependencies.
      • Edit the proxy’s configuration (e.g., a server.py file) to point to your local LLM’s API endpoint.
    • Run the Proxy:

      • Start the proxy server. It will listen for incoming requests and forward them to your local LLM.
    • Configure claudecode to Use the Proxy:

      • Set the following environment variables in your shell to direct claudecode to the proxy:

Fix DEVLOG.md Entry Order for Qwen3-Code Investigation

  • Task 13: Fix DEVLOG.md Entry Order for Qwen3-Code Investigation

    • Move the entry for the Qwen3-Code investigation to the top of the changelog in DEVLOG.md.
    • Ensure the entry is in the correct chronological order.

Investigate Qwen3-Code as a Bootstrapping Mechanism

  • Task 13: Investigate Qwen3-Code as a Bootstrapping Mechanism

    • Investigate the qwen3-code fork of the Gemini CLI.
    • Determine if qwen3-code is a suitable replacement for claudecode.
    • Document the findings and next steps.

Build a Larger, Tool-Capable Ollama Model

  • Task 14: Build a Larger, Tool-Capable Ollama Model

    • Investigate the process used to create the kirito1/qwen3-coder model.
    • Apply this process to build a larger version of an Ollama model.
    • Ensure the new model supports tool usage and has a larger context size.
    • Test the new model for performance and accuracy.
    • Fix web search tool configuration to enable proper web research.

Define and Build Local Tool-Capable Models

  • Task 19: Define and Build Local Tool-Capable Models

    • Create a Modelfile to make a base model (e.g., Qwen2) compatible with the Gemini CLI tool-use format.
    • Create a Modelfile for the qwen3-coder model.
    • Add ollama to the flake.nix development environment to ensure the tool is available.

Automate Model Building with a Generic Makefile

  • Task 20: Automate Model Building with a Generic Makefile

    • Establish a <model-name>.ollama convention for model definition files.
    • Implement a Makefile that uses Ollama’s internal manifest files for dependency tracking.
    • Use a generic pattern rule in the Makefile to automatically discover and build any *.ollama file.

Refine Local Model Prompts

  • Task 21: Refine Local Model Prompts

    • Update the prompt templates in morpheum-local.ollama and qwen3-coder-local.ollama to improve tool-use instructions.
    • Add untracked local models to the repository.

Enhance Markdown Task List Rendering

  • Task 22: Enhance Markdown Task List Rendering

    • Update format-markdown.ts to correctly render GitHub-flavored markdown task lists.
    • Add tests to format-markdown.test.ts to verify that checked and unchecked task list items are rendered correctly.

Fix Markdown Checkbox Rendering

  • Task 23: Fix Markdown Checkbox Rendering

    • Modify format-markdown.ts to use Unicode characters for checkboxes to prevent them from being stripped by the Matrix client’s HTML sanitizer.
    • Update format-markdown.test.ts to reflect the new Unicode character output.

Suppress Bullets from Task Lists (Abandoned)

  • Task 24: Suppress Bullets from Task Lists (Abandoned)

    • Modify src/morpheum-bot/format-markdown.ts to suppress the bullets from task list items.

Investigate incorrect commit

  • Task 27: Investigate incorrect commit

    • AGENTS.md was checked in incorrectly.
    • A change to the bot’s source was missed.
    • Investigate what went wrong and document it.

Create GitHub Pages Site

  • Task 28: Create GitHub Pages Site

    • Create Jekyll-based GitHub Pages site in docs/ directory
    • Design visual theme inspired by project logo
    • Create comprehensive documentation pages (Getting Started, Architecture, Contributing, Vision, Agents)
    • Create project status and roadmap pages
    • Create design proposals section
    • Set up GitHub Actions for automatic deployment
    • Update AGENTS.md with site maintenance instructions
    • Document implementation in DEVLOG.md

Fix gemini-cli Submodule Build and Crash

  • Task 25: Fix gemini-cli Submodule Build and Crash

    • Investigate and fix a crash in the gemini-cli submodule’s shellExecutionService.ts.
    • Fix the gemini-cli submodule’s build.

Handle Matrix Rate-Limiting

  • Task 26: Handle Matrix Rate-Limiting

    • Implement a retry mechanism to handle M_LIMIT_EXCEEDED errors from the Matrix server.
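
One way to decide how long to wait before a retry is sketched below. It assumes the error exposes the errcode and retry_after_ms fields from the Matrix error body; how the SDK actually surfaces them, the function name, and the backoff constants are all assumptions.

```typescript
// Error body fields as defined by the Matrix client-server API;
// the way the bot's SDK exposes them is assumed here.
interface MatrixErrorBody {
  errcode?: string;
  retry_after_ms?: number;
}

// Returns how long to wait before retrying, or null if the call should not be retried.
function retryDelayMs(
  error: MatrixErrorBody,
  attempt: number,
  maxAttempts = 5,
): number | null {
  if (error.errcode !== "M_LIMIT_EXCEEDED" || attempt >= maxAttempts) {
    return null;
  }
  // Prefer the server's hint; otherwise back off exponentially from 1 second.
  return error.retry_after_ms ?? 1000 * 2 ** attempt;
}
```

A caller would sleep for the returned number of milliseconds and resend, giving up when the function returns null.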

Implement Message Queue and Throttling

  • Task 27: Implement Message Queue and Throttling

    • Implement a message queue and throttling system to prevent rate-limiting errors.

Batch Messages in Queue

  • Task 28: Batch Messages in Queue

    • Modify the message queue to batch multiple messages into a single request.

Improve Pre-commit Hook

  • Task 29: Improve Pre-commit Hook

    • Add a check to the pre-commit hook to prevent commits with unstaged changes in src/morpheum-bot.

Improve run_shell_command Output

  • Task 30: Improve run_shell_command Output

    • Modify the bot to show the command and its output for run_shell_command.

Fix Message Queue Mixed-Type Concatenation

  • Task 31: Fix Message Queue Mixed-Type Concatenation

    • Fix a bug in the message queue where text and HTML messages were being improperly concatenated.

Replace Checkbox Input Tags with Unicode Characters

  • Task 32: Replace Checkbox Input Tags with Unicode Characters

    • Write a failing test case to assert that the HTML output contains Unicode checkboxes instead of <input> tags.
    • Modify the formatMarkdown function to replace the <input> tags with Unicode characters.
    • Ensure all tests pass.
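
The replacement step can be illustrated with a small post-processing function. This is a sketch under assumptions: the function name is hypothetical, the input is taken to be HTML already rendered from markdown, and the specific Unicode characters (☑ and ☐) are a guess at what the project uses.

```typescript
// Replace rendered checkbox <input> tags with Unicode characters so that
// an HTML sanitizer that strips <input> cannot remove the checkbox state.
function replaceCheckboxes(html: string): string {
  return html.replace(/<input\b[^>]*type="checkbox"[^>]*>/g, (tag) =>
    /\bchecked\b/.test(tag) ? "☑" : "☐",
  );
}
```

For example, a list item rendered as `<li><input checked="" disabled="" type="checkbox"> done</li>` would become `<li>☑ done</li>`.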

Suppress Bullets from Task Lists (Abandoned)

  • Task 33: Suppress Bullets from Task Lists (Abandoned)

    • This task was abandoned because the Matrix client’s HTML sanitizer strips the style attribute, making it impossible to suppress the bullets using inline styles.

Add OpenAI API Compatibility

  • Task 34: Add OpenAI API Compatibility

    • Subtask 1: Create Failing Test for OpenAI Integration
      • Create a new test file src/morpheum-bot/openai.test.ts.
      • Write a test that attempts to send a prompt to a mock OpenAI server and asserts that a valid response is received. This test should fail initially as the implementation won’t exist.
    • Subtask 2: Implement OpenAI API Client
      • Create a new file src/morpheum-bot/openai.ts.
      • Implement a function that takes a prompt and an OpenAI API key and sends a request to the OpenAI API.
      • This function should handle the response and return it in a structured format.
      • Create OpenAIClient class implementing LLMClient interface.
      • Support custom base URLs for OpenAI-compatible APIs.
    • Subtask 3: Integrate OpenAI Client into Bot
      • Enhanced src/morpheum-bot/bot.ts to support both OpenAI and Ollama APIs.
      • Added new commands: !openai, !ollama, !llm status, !llm switch.
      • Created comprehensive test suite covering all new functionality.
      • Added common LLMClient interface and factory pattern.
      • Updated SWEAgent to use generic LLMClient interface.
      • All tests pass for new integration functionality.

Fix missing message-queue files

  • Task 28: Fix missing message-queue files

    • Add src/morpheum-bot/message-queue.ts and src/morpheum-bot/message-queue.test.ts to the commit.
    • Replace all instances of client.sendMessage with queueMessage in src/morpheum-bot/index.ts to use the new message queue.

Refine Ollama Model Prompts for TDD

  • Task 29: Refine Ollama Model Prompts for TDD

    • Update the SYSTEM prompt in gpt-oss-120b.ollama and gpt-oss-small.ollama to be more specific to a Test-Driven Development (TDD) approach.
    • Reduce the num_ctx parameter in gpt-oss-120b.ollama to 65536.
    • Add bun.lock and opencode.json to the repository.

Fix Message Queue Mixed-Type Concatenation

  • Task 30: Fix Message Queue Mixed-Type Concatenation

    • Fixed a bug in the message queue where text and HTML messages were being improperly concatenated.
    • Modified the batching logic to group messages by both roomId and msgtype.
    • Added a new test case to ensure that messages of different types are not batched together.

Refactor Message Queue Logic

  • Task 31: Refactor Message Queue Logic
    • Refactored the message queue to slow down message sending to at most 1 per second.
    • Implemented new batching logic:
      • Consecutive text messages are concatenated and sent as a single message.
      • HTML messages are sent individually.
    • The queue now only processes one “batch” (either a single HTML message or a group of text messages) per interval.
    • Updated the unit tests to reflect the new logic and fixed a bug related to shared state between tests.
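
The batching rules above can be sketched as a function that takes one batch off the front of the queue. The QueuedMessage shape and the takeNextBatch name are hypothetical, and the one-per-second timer that would call it is omitted.

```typescript
// Hypothetical queue entry; the real message-queue types may differ.
interface QueuedMessage {
  roomId: string;
  msgtype: "text" | "html";
  body: string;
}

// Removes and returns the next batch from the front of the queue:
// either a single HTML message, or a run of consecutive text messages
// for the same room merged into one message.
function takeNextBatch(queue: QueuedMessage[]): QueuedMessage | null {
  const first = queue.shift();
  if (!first) return null;
  if (first.msgtype === "html") return first;

  const bodies = [first.body];
  while (queue[0]?.msgtype === "text" && queue[0]?.roomId === first.roomId) {
    bodies.push(queue.shift()!.body);
  }
  return { ...first, body: bodies.join("\n") };
}
```

Calling this once per interval yields at most one send per tick, and text is never concatenated with HTML or across rooms.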

Task 35: Fix up errors made by local LLMs

  • Task 35: Fix up errors made by local LLMs

    • Revert CONTRIBUTING.md and ROADMAP.md hallucinations
    • Commit work in progress on opencode.json and ollama models

Task 36: Switch gears to integrating directly with Ollama API

  • Task 36: Switch gears to integrating directly with Ollama API
    • Write a basic integration in src/ollama with an interactive test
    • Create a design doc for a jail system, and an overview of Gemini’s architecture

Create the jail directory structure.

  • Task 1: Create the jail directory structure.

    • Create a new top-level directory named jail.

Implement jail/flake.nix

  • Task 2: Implement jail/flake.nix

    • Create a flake.nix file inside the jail directory.
    • Copy the Nix code from JAIL_PROTOTYPE.md into this file (now implemented).

Create jail/start-vm.sh script

  • Task 3: Create jail/start-vm.sh script

    • Create a shell script that automates the colima start command with the specified port forwarding logic for multiple agent and monitoring ports.

Create jail/build.sh script

  • Task 4: Create jail/build.sh script

    • Create a shell script that runs nix build .#default (relative to the jail directory) and docker load < result to build the image and load it into the Docker daemon.

Create jail/run.sh script

  • Task 5: Create jail/run.sh script

    • Create a shell script that automates the docker run command.
    • The script should accept arguments for the container name (e.g., jail-1) and the port numbers to map, making it easy to launch multiple, distinct jails.

Create jail/agent.ts client

  • Task 6: Create jail/agent.ts client

    • Create the TypeScript agent client as jail/agent.ts.
    • Copy the TypeScript code from JAIL_PROTOTYPE.md into this file (now implemented).

Create jail/README.md

  • Task 7: Create jail/README.md
    • Create a README.md file inside the jail directory.
    • Document how to use the new scripts (start-vm.sh, build.sh, run.sh, and agent.ts) to set up and interact with the jailed environment. This replaces the manual instructions in the original prototype document.

Improve Pre-commit Hook

  • Task 37: Improve Pre-commit Hook
    • Add a check to the pre-commit hook to prevent commits with unstaged changes.
    • Add a check to the pre-commit hook to prevent commits with untracked files.

Ollama API Client

  • Task 38: Ollama API Client

    • Create a test file: src/morpheum-bot/ollamaClient.test.ts. Write a failing test that attempts to send a prompt to a mock Ollama API endpoint.
    • Create the client module: src/morpheum-bot/ollamaClient.ts.
    • Implement a function to send a system prompt and conversation history to a specified model via the Ollama API.
    • Make the test pass.

Jailed Shell Client

  • Task 39: Jailed Shell Client

    • Create a test file: src/morpheum-bot/jailClient.test.ts. Write a failing test that attempts to send a command to a mock TCP server and receive a response.
    • Create the client module: src/morpheum-bot/jailClient.ts.
    • Reimplement the TCP socket logic from jail/agent.ts directly within this module, creating a clean programmatic interface.
    • Make the test pass.

Response Parser Utility

  • Task 40: Response Parser Utility
    • Create a test file: src/morpheum-bot/responseParser.test.ts. Write failing tests for extracting bash commands from various markdown-formatted strings.
    • Create the utility module: src/morpheum-bot/responseParser.ts.
    • Implement a function to reliably parse bash-fenced code blocks from the model’s text output.
    • Make all tests pass.
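
A parser along these lines could be sketched as follows. The function name is hypothetical and the real responseParser.ts may handle more edge cases; the fence string is built at runtime only so that this example does not contain a literal fence.

```typescript
// Three backticks, built at runtime to avoid a literal fence inside this example.
const FENCE = "`".repeat(3);

// Returns the contents of every bash-fenced block in the text, in order.
function extractBashCommands(text: string): string[] {
  const pattern = new RegExp(FENCE + "bash\\s*\\n([\\s\\S]*?)" + FENCE, "g");
  const commands: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    commands.push((match[1] ?? "").trim());
  }
  return commands;
}
```

Blocks fenced with another language tag are ignored, so explanatory snippets in the model's reply are not executed.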

System Prompt Definition

  • Task 41: System Prompt Definition

    • Create a new file, src/morpheum-bot/prompts.ts, to store the core system prompt.
    • Draft a system prompt inspired by mini-swe-agent, instructing the model to think step-by-step and use bash commands to solve software engineering tasks.

Core Agent Logic

  • Task 42: Core Agent Logic

    • Create a test file: src/morpheum-bot/sweAgent.test.ts. Write failing tests for the agent’s main loop, mocking the Ollama and Jail clients.
    • Create the agent module: src/morpheum-bot/sweAgent.ts.
    • Implement the main agent loop, which will manage the conversation history and orchestrate calls to the Ollama client, parser, and jail client.

Matrix Bot Integration

  • Task 43: Matrix Bot Integration
    • Modify src/morpheum-bot/index.ts to add a new command, !swe <task>.
    • When triggered, this command will initialize and run the sweAgent loop with the provided task.
    • The agent’s intermediate “thoughts,” commands, and tool outputs will be formatted and sent as messages to the Matrix room.
    • Add a corresponding integration test for the !swe command.

Configuration

  • Task 44: Configuration

    • Integrate necessary settings (e.g., Ollama model name, API URL, default jail port) into the bot’s existing configuration system (using environment variables).
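
Reading these settings from environment variables might look like the sketch below. The variable names (OLLAMA_MODEL, OLLAMA_API_URL, JAIL_PORT) and the default values are assumptions for illustration, not the bot's actual configuration keys.

```typescript
interface AgentConfig {
  ollamaModel: string;
  ollamaApiUrl: string;
  jailPort: number;
}

// The environment is passed in so the function is easy to test;
// a caller would typically pass process.env.
type Env = Record<string, string | undefined>;

function loadAgentConfig(env: Env): AgentConfig {
  // All variable names and defaults below are hypothetical.
  const jailPort = Number(env["JAIL_PORT"] ?? "10001");
  if (!Number.isInteger(jailPort) || jailPort <= 0 || jailPort > 65535) {
    throw new Error(`Invalid JAIL_PORT: ${env["JAIL_PORT"]}`);
  }
  return {
    ollamaModel: env["OLLAMA_MODEL"] ?? "morpheum-local",
    ollamaApiUrl: env["OLLAMA_API_URL"] ?? "http://localhost:11434",
    jailPort,
  };
}
```

Validating the port at load time surfaces a misconfiguration at startup rather than on the first jail connection.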

Deprecate Old Integration

  • Task 45: Deprecate Old Integration

    • Once the new !swe command is stable, remove the old Gemini CLI integration code and the !gemini command from src/morpheum-bot/index.ts.
    • Remove any other now-unused files or dependencies related to the old implementation.

Fix Test Suite

  • Task 46: Fix Test Suite
    • Correct mock assertions in vitest.
    • Install missing dependencies.
    • Skip incomplete tests.

Bot Self-Sufficiency

  • Task 47: Bot Self-Sufficiency

    • Implement mention-based interaction for the bot.
    • Add detailed logging for Ollama and Jail clients.
    • Correct bugs related to user profile fetching.

Gauntlet Testing Framework

  • Task 48: Gauntlet Testing Framework

    • Create a gauntlet.ts script to automate the evaluation process.
    • Implement a scoring system to rank models based on performance.
    • Run the gauntlet on various models and document the results.
    • Add a TODO item in TASKS.md for this task.
    • Check in the new GAUNTLET.md file.
    • Create a DEVLOG.md entry for this task.
    • Follow the rules in AGENTS.md.
    • Test the gauntlet script with a local model, getting it to pass.
    • Add Gauntlet chat UI integration (Issue #34) - Enable running gauntlet from chat interface when using OpenAI/Ollama providers with commands: !gauntlet help, !gauntlet list, !gauntlet run --model <model> [--task <task>] [--verbose]

Remove gemini-cli Submodule

  • Task 49: Remove gemini-cli Submodule
    • Verify that there are no remaining code dependencies on the submodule.
    • Update configuration files to remove references to the submodule.
    • De-initialize and remove the submodule from the repository.

Implement Iterative Agent Loop

  • Task 50: Implement Iterative Agent Loop
    • Refactor the sweAgent to loop, feeding back command output to the LLM.
    • The loop terminates when the LLM responds without a command.
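
The loop described above can be sketched as follows. The dependency names are hypothetical, the calls are kept synchronous so the sketch runs standalone (the real Ollama and jail clients would be asynchronous), and the iteration cap is an added safety assumption rather than something stated in the task.

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical dependencies standing in for the LLM client, parser, and jail client.
interface AgentDeps {
  askModel: (history: Message[]) => string;
  parseCommand: (reply: string) => string | null;
  runInJail: (command: string) => string;
}

function runAgentLoop(task: string, deps: AgentDeps, maxIterations = 10): Message[] {
  const history: Message[] = [{ role: "user", content: task }];
  for (let i = 0; i < maxIterations; i++) {
    const reply = deps.askModel(history);
    history.push({ role: "assistant", content: reply });

    const command = deps.parseCommand(reply);
    if (command === null) break; // no command in the reply: the loop ends

    // Feed the command output back so the next reply can react to it.
    history.push({ role: "user", content: deps.runInJail(command) });
  }
  return history;
}
```

Returning the full history makes the loop easy to test with scripted replies and lets the caller display intermediate steps.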

Simplify and Improve System Prompt

  • Task 51: Simplify and Improve System Prompt
    • Distill the system prompt to be clearer, more concise, and plan-oriented.

Stabilize Jail Communication

  • Task 52: Stabilize Jail Communication
    • Fix socat configuration to reliably capture both stdout and stderr.
    • Implement a robust readiness probe in the gauntlet to prevent race conditions.

Update Gauntlet for Nix Workflow

  • Task 53: Update Gauntlet for Nix Workflow
    • Modify gauntlet success conditions to check for tools within the nix develop environment.

Update Local Model

  • Task 54: Update Local Model
    • Update the morpheum-local model to use qwen.

Correct Documentation Inconsistencies

  • Task 55: Correct Documentation Inconsistencies

    • Analyzed all .md files for inconsistencies.
    • Updated ROADMAP.md to reflect the completion of v0.1 and the current focus on v0.2.
    • Updated CONTRIBUTING.md to describe the active Matrix-based workflow.

Apply PR Review Comments

  • Task 56: Apply PR Review Comments

    • Addressed feedback from PR #1 regarding package management preferences in documentation.
    • Updated test script configuration for better compatibility.
    • Enhanced bot status messages to include model information (PR #2 feedback).
    • Ensured all changes maintain existing functionality while improving user experience.

Implement Streaming API Support

  • Task 57: Implement Streaming API Support

    • Extended LLMClient interface with sendStreaming() method for real-time feedback
    • Implemented OpenAI streaming using Server-Sent Events (SSE) format
    • Implemented Ollama streaming using JSONL format
    • Added real-time progress indicators with emojis for enhanced user experience
    • Maintained backward compatibility with existing send() method (2025-01-18)
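
The two wire formats differ mainly in framing, which the sketch below illustrates with one line parser per provider. The function names are hypothetical; the field paths follow the publicly documented OpenAI chat-completion stream chunks and Ollama streaming responses, and splitting the network stream into complete lines is assumed to happen elsewhere.

```typescript
// Extracts text from one OpenAI-style SSE line.
// Returns null for non-data lines, empty payloads, and the [DONE] marker.
function parseOpenAiSseLine(line: string): string | null {
  if (!line.startsWith("data:")) return null;
  const payload = line.slice("data:".length).trim();
  if (payload === "" || payload === "[DONE]") return null;
  const chunk = JSON.parse(payload);
  return chunk.choices?.[0]?.delta?.content ?? null;
}

// Extracts text from one Ollama JSONL line (chat-style or generate-style).
function parseOllamaJsonLine(line: string): string | null {
  if (line.trim() === "") return null;
  const chunk = JSON.parse(line);
  return chunk.message?.content ?? chunk.response ?? null;
}
```

A sendStreaming() implementation would buffer incoming bytes, split on newlines, pass each complete line to the matching parser, and forward non-null results to the caller's callback.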

Fix Jail Implementation Output Issues

  • Task 58: Fix Jail Implementation Output Issues

    • Resolved bash warnings from interactive shell attempting to control non-existent terminal
    • Cleaned up command output by switching from interactive (bash -li) to non-interactive (bash -l) shells
    • Added comprehensive tests to validate clean output behavior (2025-01-20)

Design GitHub Copilot Integration

  • Task 59: Design GitHub Copilot Integration

    • Created comprehensive design proposal for GitHub Copilot as third LLM provider
    • Designed CopilotClient following existing LLMClient interface patterns
    • Planned GitHub authentication and session management architecture
    • Specified real-time status update mechanisms using polling and streaming
    • Documented complete implementation plan with file-by-file changes
    • Created COPILOT_PROPOSAL.md with technical specifications and rollout strategy (2025-01-27)

Enhance Bot User Feedback with Plan and Next Step Display

  • Task 59: Enhance Bot User Feedback with Plan and Next Step Display

    • Added parsePlanAndNextStep() function to extract structured thinking from LLM responses
    • Implemented plan display with 📋 icon showing bot’s strategy on first iteration
    • Implemented next step display with 🎯 icon showing bot’s immediate action plan
    • Used existing sendMarkdownMessage() helper for proper HTML formatting in Matrix
    • Added comprehensive test coverage with 6 new test cases for parsing functionality
    • Enhanced user transparency by showing the bot’s thinking process in structured format

Ad Hoc: Add sed as Default Tool in Jail Environment

  • Ad Hoc: Add sed as Default Tool in Jail Environment

    • Added sed to the nixpkgs package list in jail/run.sh
    • Created gauntlet test case to verify sed availability
    • Verified no regressions in existing functionality

Ad Hoc: Implement Real-time Progress Feedback for Gauntlet Matrix Integration (Issue #55)

  • Ad Hoc: Implement Real-time Progress Feedback for Gauntlet Matrix Integration (Issue #55)

    • Enhanced gauntlet execution with optional progress callback parameter
    • Implemented dynamic progress table with task status indicators (⏳ PENDING, ▶️ NEXT, ✅ PASS, ❌ FAIL)
    • Added comprehensive real-time feedback messages throughout gauntlet execution
    • Updated bot integration to provide progress callback for Matrix chat display
    • Maintained complete backward compatibility with CLI usage
    • Added comprehensive test coverage including progress callback verification
    • All 125 tests pass with new functionality integrated
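
A progress table of this kind could be rendered roughly as shown below. The type and function names are hypothetical, and the markdown table layout is an assumption about how the bot formats the message.

```typescript
type TaskStatus = "PENDING" | "NEXT" | "PASS" | "FAIL";

// Status indicators as listed in the task description.
const STATUS_ICONS: Record<TaskStatus, string> = {
  PENDING: "⏳",
  NEXT: "▶️",
  PASS: "✅",
  FAIL: "❌",
};

// Renders the current state of all tasks as a markdown table.
function renderProgressTable(tasks: { id: string; status: TaskStatus }[]): string {
  const rows = tasks.map((t) => `| ${t.id} | ${STATUS_ICONS[t.status]} ${t.status} |`);
  return ["| Task | Status |", "| --- | --- |", ...rows].join("\n");
}
```

The gauntlet runner would call the optional progress callback with a freshly rendered table after each task changes state, while CLI usage simply omits the callback.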

Ad Hoc: Fix Build Artifacts Being Built in Source Tree

  • Ad Hoc: Fix Build Artifacts Being Built in Source Tree

    • Removed 66 build artifacts (*.js, *.d.ts, *.d.ts.map) from source tree
    • Configured tsconfig.json to use outDir: ‘./build’ for all compilation output
    • Updated .gitignore with comprehensive patterns to prevent future artifact commits
    • Verified TypeScript compilation and tests work with new build directory configuration

Ad Hoc: Fix GitHub Copilot Assignment Verification Logic

  • Ad Hoc: Fix GitHub Copilot Assignment Verification Logic
    • Investigated false error in GitHub Copilot assignment verification causing unnecessary demo mode fallback
    • Identified that verification logic was incorrectly throwing errors even when assignments were successful
    • Modified verification to log warnings instead of throwing errors for timing/response structure variations
    • Maintained proper error handling for actual assignment failures
    • Validated fix with comprehensive test suite ensuring all functionality remains intact

Fix GitHub Copilot Task: refine-existing-codebase scoring validation order

  • Fix refine-existing-codebase gauntlet task validation order
    • Analyzed issue #97 where the task was failing due to incorrect execution order
    • Identified root cause: validation code was creating initial server.js file AFTER bot execution, overwriting bot’s modifications
    • Moved file creation from validation phase (successCondition) to setup phase (before bot execution)
    • Added pre-task setup logic specifically for refine-existing-codebase task
    • Preserved all existing validation logic (endpoint testing, JSON response validation)
    • Verified fix with comprehensive testing - all tests pass
    • Ensured minimal, surgical changes with no impact on other gauntlet tasks

Ad Hoc: Fix Deep Linking in Copilot Session Started Message (Issue #42)

  • Ad Hoc: Fix Deep Linking in Copilot Session Started Message (Issue #42)
    • Identified issue where ‘Copilot session started’ message used generic https://github.com/copilot/agents URL instead of deep linking to session details
    • Modified formatStatusUpdate method to use issue-specific URLs when available but no PR exists yet
    • Updated test expectations to verify deep linking to GitHub issue URL
    • Maintained backward compatibility with existing URL fallback logic
    • Verified fix with comprehensive test suite ensuring all functionality remains intact
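
The URL selection can be sketched as a simple fallback chain. The SessionLinks shape and sessionUrl name are hypothetical, and preferring a pull request URL over the issue URL is an inference from the description above rather than confirmed behavior of formatStatusUpdate.

```typescript
// Hypothetical view of the links known for a Copilot session.
interface SessionLinks {
  pullRequestUrl?: string;
  issueUrl?: string;
}

const GENERIC_AGENTS_URL = "https://github.com/copilot/agents";

// Pick the most specific link available, falling back to the generic page.
function sessionUrl(links: SessionLinks): string {
  return links.pullRequestUrl ?? links.issueUrl ?? GENERIC_AGENTS_URL;
}
```

With this ordering, a session that has only an issue links to that issue, and the generic URL appears only when nothing more specific is known.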

Fix refine-existing-codebase gauntlet task setup infrastructure

  • Fix refine-existing-codebase gauntlet task setup
    • Analyzed issue #99 where setupContainer failed due to missing /project directory and flake.nix
    • Identified that nix develop commands require a flake.nix file in the working directory
    • Modified setupContainer to create /project directory using mkdir -p /project
    • Added comprehensive flake.nix creation with all required tools (bun, jq, sed, python+requests, curl, which, hugo)
    • Preserved existing server.js creation logic exactly as before
    • Verified fix with comprehensive testing - all tests continue to pass
    • Ensured minimal, surgical changes with no impact on other gauntlet tasks
    • Made refine-existing-codebase task self-sufficient and no longer dependent on create-project-dir task

Ad Hoc: Fix Markdown Link Rendering in Copilot Streaming Messages (Issue #40)

  • Ad Hoc: Fix Markdown Link Rendering in Copilot Streaming Messages (Issue #40)
    • Identified root cause: Copilot streaming chunks with markdown links were sent as plain text instead of formatted HTML
    • Added hasMarkdownLinks() helper function to detect markdown links in text chunks using regex pattern
    • Modified Copilot streaming callback to route chunks with markdown to HTML formatting using existing sendMarkdownMessage() helper
    • Created comprehensive test suite to verify markdown detection, HTML formatting, and end-to-end streaming behavior
    • Ensured fix is surgical and targeted - only affects Copilot status messages with GitHub links, preserves all existing functionality
    • All 106 tests passing, confirming no regressions introduced
    • Follow-up: Refactored function naming based on user feedback
      • Enhanced existing sendMarkdownMessage() function to automatically detect markdown content instead of creating new sendMessageSmart() function
      • Avoided function naming changes to reduce cognitive overhead and merge conflict potential
      • Generalized markdown detection to include links, code blocks, bold, italic, and headings
      • Replaced all message sending calls to use enhanced smart detection while preserving existing function names
      • All 110 tests continue to pass with comprehensive markdown support
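
The generalized detection could look roughly like the sketch below. The pattern table and the name `hasMarkdownSketch` are hypothetical. The project's actual regexes are not given in these notes, and these patterns are deliberately conservative rather than a full Markdown parser.

```typescript
// Hypothetical pattern table covering the constructs listed in the follow-up.
const MARKDOWN_PATTERNS_SKETCH: RegExp[] = [
  /\[[^\]]+\]\([^)\s]+\)/, // link: [text](url)
  /```/, // fenced code block
  /\*\*[^*\n]+\*\*/, // bold: **text**
  /(^|\s)\*[^*\s](?:[^*\n]*[^*\s])?\*(?=[\s.,;:!?)]|$)/, // italic: *text*
  /^#{1,6}\s+\S/m, // heading: "# Title" at the start of a line
];

// Returns true when the text contains at least one recognized construct.
function hasMarkdownSketch(text: string): boolean {
  return MARKDOWN_PATTERNS_SKETCH.some((pattern) => pattern.test(text));
}
```

None of the patterns use the global flag, so repeated `test` calls carry no `lastIndex` state between messages. The italic pattern requires a non-space character right after the opening asterisk, which keeps arithmetic such as `2 * 3 * 4` from being treated as markdown.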

Ad Hoc: Fix Gauntlet Command Markdown Formatting in Matrix (Issue #38)

  • Ad Hoc: Fix Gauntlet Command Markdown Formatting in Matrix (Issue #38)
    • Identified root cause: gauntlet help/list commands using sendMessage() instead of sendMarkdownMessage()
    • Fixed gauntlet help command to use sendMarkdownMessage() for proper HTML formatting
    • Fixed gauntlet list command to use sendMarkdownMessage() for proper HTML formatting
    • Added comprehensive test coverage for gauntlet command markdown formatting
    • Enhanced test mocks to handle gauntlet-specific content patterns
    • Verified all 105 tests pass with no regressions

Refine !tasks Command for New Directory Structure

  • Analyze current !tasks command implementation in bot.ts
  • Create utility function to parse front matter from task files
  • Create function to scan docs/_tasks/ directory for task files
  • Create function to filter tasks by completion status
  • Create function to assemble markdown from uncompleted tasks
  • Update !tasks command handler to use new logic
  • Test the refined !tasks command functionality
  • Ensure markdown is properly converted to HTML and sent to chat
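
The parsing, filtering, and assembly steps above might be sketched like this. All function names are hypothetical, the front matter parser handles only flat `key: value` lines, and treating `completed` as the status value for finished tasks is an assumption. Scanning docs/_tasks/ and converting the result to HTML are left out so the sketch stays free of file-system dependencies.

```typescript
interface ParsedTaskSketch {
  data: Record<string, string>;
  body: string;
}

// Splits a task file into flat front matter fields and the markdown body.
function parseFrontMatterSketch(source: string): ParsedTaskSketch {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(source);
  if (!match) {
    return { data: {}, body: source }; // no front matter block found
  }
  const data: Record<string, string> = {};
  for (const line of match[1].split(/\r?\n/)) {
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim();
    // Strip one layer of surrounding quotes from the value, if present.
    const value = line.slice(sep + 1).trim().replace(/^["']|["']$/g, "");
    if (key) data[key] = value;
  }
  return { data, body: match[2] };
}

// Builds one markdown document from every task that is not completed.
function assembleOpenTasksSketch(sources: string[]): string {
  return sources
    .map((source) => parseFrontMatterSketch(source))
    .filter((task) => task.data["status"] !== "completed")
    .map((task) => `## ${task.data["title"] || "Untitled"}\n\n${task.body.trim()}`)
    .join("\n\n");
}
```

The command handler would then pass the assembled string to the existing markdown-to-HTML message helper.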

Restructure TASKS.md and DEVLOG.md to Eliminate Merge Conflicts

  • Analyze current merge conflict issues with centralized TASKS.md and DEVLOG.md files
  • Design directory-based structure for individual task and devlog entries
  • Configure Jekyll collections for _tasks and _devlogs directories
  • Create aggregate pages that display entries in proper chronological order
  • Create sample entries to demonstrate the new structure
  • Migrate remaining content from existing TASKS.md and DEVLOG.md files
  • Update documentation and contributing guidelines
  • Test the new system with multiple contributors

Implement Agent Self-Correction and Learning Mechanisms

  • Investigate mechanisms for the agent to learn from its mistakes
  • Design a feedback system that captures failed task summaries
  • Implement context injection of previous failures for better future performance
  • Develop a self-correction loop that allows agents to retry tasks with improved approaches
  • Create metrics to measure learning effectiveness over time
  • Test self-correction mechanisms with the gauntlet testing framework

Enhance Matrix Interface User Experience and Commands

  • Implement more structured output formatting for better readability
  • Improve error reporting with actionable suggestions
  • Design more intuitive command syntax and help system
  • Add command auto-completion or suggestion features
  • Implement progress indicators for long-running operations
  • Add GitHub Copilot progress tracking via iframe integration: embed GitHub’s native progress interface directly in the Matrix client so users see real-time Copilot agent progress (thoughts, file analysis, and command outputs) instead of basic polling messages
  • Create user-friendly onboarding flow for new Matrix room users
  • Add support for rich message formatting (tables, code highlighting, etc.)

Design and Implement Multi-Agent Collaboration Framework

  • Design architecture for multiple specialized AI agents working together
  • Define agent specialization areas (e.g., code review, testing, documentation, deployment)
  • Implement task delegation and coordination mechanisms
  • Create communication protocols between agents
  • Develop conflict resolution strategies for concurrent operations
  • Design workload balancing and agent resource management
  • Test multi-agent workflows on complex development tasks
  • Create monitoring and observability for multi-agent operations

Systematic Gauntlet Testing and Model Performance Benchmarking

  • Run comprehensive gauntlet tests against all available local models
  • Test gauntlet against proprietary models (GPT-4, Gemini, etc.) for comparison
  • Establish performance benchmarks and scoring metrics
  • Analyze failure patterns across different model types and sizes
  • Document common failure points and edge cases
  • Create automated benchmark reporting and tracking system
  • Use benchmark results to guide prompt engineering improvements

Iterative Prompt Engineering Based on Gauntlet Results

  • Analyze gauntlet failure patterns to identify prompt improvement opportunities
  • Refine system prompts in prompts.ts based on empirical evidence
  • Implement A/B testing framework for prompt variations
  • Test prompt improvements against benchmark tasks
  • Document prompt engineering best practices and lessons learned
  • Create automated prompt optimization pipeline
  • Improve tool-use capabilities through targeted prompt engineering

Enhance Pre-commit Hook to Enforce Devlog and Task Entry Requirements

Objective

Fix the pre-commit hook so that every commit must include both a devlog entry and a task entry. This addresses PR 92, which bypassed the workflow requirements.

Requirements

  1. Clean up test artifacts: Remove test content from DEVLOG.md and test_file.txt from previous commits
  2. Enhance pre-commit hook: Add logic to require both devlog and task entries for every commit
  3. Smart detection: Allow documentation-only commits to proceed without devlog/task requirements
  4. Clear messaging: Provide actionable error messages when requirements are missing
  5. Maintain existing protections: Keep the current prevention of direct DEVLOG.md/TASKS.md editing

Implementation Details

File Cleanup

  • ✅ Reverted DEVLOG.md to remove erroneous “test” line at line 57
  • ✅ Removed test_file.txt that was accidentally committed

Pre-commit Hook Enhancement

  • ✅ Added detection for devlog files in docs/_devlogs/
  • ✅ Added detection for task files in docs/_tasks/
  • ✅ Implemented smart logic to exempt documentation-only commits
  • ✅ Enhanced error messaging with specific requirements and guidance
  • ✅ Maintained existing legacy file protection

Logic Flow

  1. Check for unstaged changes and untracked files (existing)
  2. Prevent direct editing of DEVLOG.md and TASKS.md (existing)
  3. NEW: For non-documentation commits, require:
    • At least one file in docs/_devlogs/
    • At least one file in docs/_tasks/
  4. Provide clear error messages for missing requirements
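
The decision logic in steps 2 to 4 can be modeled as a pure function over the list of staged paths. This is a sketch only. The real hook runs through Husky and its implementation is not shown here. `checkStagedFilesSketch` and the error wording are hypothetical, and it is assumed that DEVLOG.md and TASKS.md sit at the repository root. Step 1 (unstaged and untracked files) is not modeled.

```typescript
// Assumed root-level paths of the legacy aggregate files.
const LEGACY_FILES_SKETCH = ["DEVLOG.md", "TASKS.md"];

function startsWithSketch(path: string, prefix: string): boolean {
  return path.indexOf(prefix) === 0;
}

// Returns a list of human-readable errors; an empty list means the commit may proceed.
function checkStagedFilesSketch(staged: string[]): string[] {
  const errors: string[] = [];

  // Step 2: legacy files must not be edited directly.
  for (const legacy of LEGACY_FILES_SKETCH) {
    if (staged.indexOf(legacy) !== -1) {
      errors.push(`Do not edit ${legacy} directly; add a file under docs/ instead.`);
    }
  }

  // Documentation-only commits (everything under docs/) are exempt from step 3.
  const docsOnly = staged.every((path) => startsWithSketch(path, "docs/"));
  if (docsOnly) {
    return errors;
  }

  // Step 3: non-documentation commits need both kinds of entry.
  if (!staged.some((path) => startsWithSketch(path, "docs/_devlogs/"))) {
    errors.push("Missing devlog entry: add a file in docs/_devlogs/.");
  }
  if (!staged.some((path) => startsWithSketch(path, "docs/_tasks/"))) {
    errors.push("Missing task entry: add a file in docs/_tasks/.");
  }
  return errors; // Step 4: the caller prints these messages and blocks the commit.
}
```

Under this rule a root-level README.md does not count as documentation-only, because it lies outside docs/.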

Testing Strategy

  • Test that hook blocks commits missing devlog entries
  • Test that hook blocks commits missing task entries
  • Test that hook allows documentation-only commits
  • Test that hook still prevents legacy file editing
  • Verify error messages are clear and actionable

Success Criteria

  • ✅ Pre-commit hook enforces devlog entry requirement
  • ✅ Pre-commit hook enforces task entry requirement
  • ✅ Documentation-only commits are allowed to proceed
  • ✅ Clear error messages guide users on missing requirements
  • ✅ Existing legacy file protections remain intact

Status: Completed ✅

The pre-commit hook has been successfully enhanced to enforce both devlog and task entry requirements for every commit, while maintaining flexibility for documentation-only changes.

UPDATES:

  • ✅ Fixed documentation detection logic to correctly identify README.md as a core project file requiring devlog/task entries
  • CRITICAL FIX: Resolved Husky configuration issue where hooks weren’t being called due to missing initialization and broken hook delegation
  • FULLY VERIFIED: All scenarios tested and working correctly:
    • Blocks commits without devlog entries
    • Blocks commits without task entries
    • Allows documentation-only commits (docs/ directory)
    • Prevents direct DEVLOG.md/TASKS.md editing
    • Provides clear error messages

Fix gauntlet task order: swap create-project-dir and add-jq positions

  • Fix gauntlet task execution order
    • Analyzed issue #105 requiring task order swap: create-project-dir should be 1st, add-jq should be 3rd
    • Identified current problematic order: add-jq (1st), check-sed-available (2nd), create-project-dir (3rd)
    • Implemented solution by reordering task objects in gauntlet tasks array
    • New logical order: create-project-dir (1st), check-sed-available (2nd), add-jq (3rd)
    • Added comprehensive tests to verify and maintain correct task ordering
    • Verified all existing functionality preserved (220 tests pass)
    • Confirmed logical dependency resolution: /project directory created before tasks that use it
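
An ordering check of the kind these tests perform might look like the sketch below. The array lists only the three task ids named above (the real gauntlet contains more tasks), and `runsBeforeSketch` is a hypothetical helper rather than the project's test code.

```typescript
// The first three gauntlet task ids in their corrected order.
const gauntletTaskIdsSketch: string[] = [
  "create-project-dir",
  "check-sed-available",
  "add-jq",
];

// True when both ids are present and `earlier` is scheduled before `later`.
function runsBeforeSketch(ids: string[], earlier: string, later: string): boolean {
  const earlierIndex = ids.indexOf(earlier);
  const laterIndex = ids.indexOf(later);
  return earlierIndex !== -1 && laterIndex !== -1 && earlierIndex < laterIndex;
}
```

Checking relative order instead of fixed positions keeps such a test valid if unrelated tasks are inserted between the two later on.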

Fix Gauntlet Provider Validation Logic

  • Fix Gauntlet Provider Validation Logic
    • Identified issue where gauntlet command was checking current provider instead of requested provider
    • Determined that the early check in handleGauntletCommand was blocking valid gauntlet executions
    • Removed incorrect check for this.currentLLMProvider === 'copilot' since gauntlet creates its own bot instance
    • Verified that existing argument parsing already prevents copilot from being specified as --provider
    • Updated tests to reflect corrected behavior - gauntlet can run regardless of current provider
    • Added test coverage for edge cases: openai provider when current is copilot, and blocking explicit copilot requests
    • Validated fix ensures gauntlet works with any valid provider (openai/ollama) regardless of bot’s current state
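
The corrected behavior can be sketched as a validation that looks only at the requested provider. The names below are hypothetical and the real handleGauntletCommand is not shown in these notes. Treating a missing `--provider` value as an error is also an assumption, since the notes do not say what the default is.

```typescript
// Providers the notes list as valid for the gauntlet.
const GAUNTLET_PROVIDERS_SKETCH = ["openai", "ollama"];

type ProviderResultSketch =
  | { ok: true; provider: string }
  | { ok: false; reason: string };

// The bot's current provider is accepted but deliberately ignored:
// the gauntlet creates its own bot instance for the requested provider.
function resolveGauntletProviderSketch(
  args: string[],
  _currentProvider: string
): ProviderResultSketch {
  const flagIndex = args.indexOf("--provider");
  const requested = flagIndex === -1 ? undefined : args[flagIndex + 1];
  if (requested === undefined) {
    return { ok: false, reason: "missing --provider value" };
  }
  if (GAUNTLET_PROVIDERS_SKETCH.indexOf(requested) === -1) {
    return { ok: false, reason: `unsupported gauntlet provider: ${requested}` };
  }
  return { ok: true, provider: requested };
}
```

With this shape, `--provider openai` succeeds even while the bot itself is using copilot, and an explicit `--provider copilot` is still rejected by the allow-list.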

Contributing Tasks

To add a new task:

  1. Create a new file in docs/_tasks/ with the naming convention task-{number}-{short-description}.md
  2. Include front matter with title, order, and status fields
  3. Write the task description in markdown
  4. This page will automatically include your new task

For more information, see our contributing guide.