Reference · Last updated: 8 March 2026

Integration Testing

Structured, repeatable integration test process run against @untether_dev_bot before every release. Tests exercise all 6 engines across the full feature surface.

Infrastructure

Item | Details
Dev service | untether-dev.service (@untether_dev_bot)
Test projects | test-projects/test-{claude,codex,opencode,pi,gemini,amp}/
Test chats | 6 dedicated Telegram groups in the ut-dev folder, one per engine
Engines | Claude, Codex, OpenCode, Pi, Gemini, Amp

Automated Testing via Telegram MCP

All integration test tiers are fully automated by Claude Code using Telegram MCP tools and the Bash tool. The relevant MCP tools are:

  • send_message — send test prompts and commands to engine chats
  • get_history / get_messages — read back bot responses and verify expected behaviour
  • list_inline_buttons — inspect inline keyboards (approval buttons, /config menus, /browse)
  • press_inline_button — interact with inline keyboards (approve/deny, toggle settings)
  • reply_to_message — reply to resume lines for session continuation tests (U4)

Test chats

Tests are sent to 6 dedicated ut-dev: engine chats via @untether_dev_bot:

Chat | Chat ID
ut-dev: claude | 5284581592
ut-dev: codex | 4929463515
ut-dev: opencode | 5200822877
ut-dev: pi | 5156256333
ut-dev: gemini | 5207762142
ut-dev: amp | 5230875989

Workflow

  1. Claude Code sends a test prompt via send_message to the appropriate engine chat
  2. Waits for the bot to process (sleep or poll via get_history)
  3. Reads back the response via get_history/get_messages and verifies expected content
  4. For interactive tests: uses list_inline_buttons and press_inline_button to interact with approval/config buttons
  5. For resume tests: uses reply_to_message to reply to the resume line
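
Step 2 above ("sleep or poll") amounts to a bounded poll loop. A minimal sketch — `fetch_latest` is a hypothetical wrapper around get_history, not an actual MCP call:

```python
import time

def wait_for_response(fetch_latest, predicate, timeout_s=120.0, interval_s=2.0):
    """Poll fetch_latest() until predicate(msg) matches or the timeout expires.

    fetch_latest: zero-arg callable returning the newest chat message (or None);
                  stands in for whatever reads the chat history.
    predicate:    decides whether that message is the awaited bot reply.
    Returns the matching message, or None on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        msg = fetch_latest()
        if msg is not None and predicate(msg):
            return msg
        time.sleep(interval_s)
    return None
```

Bounding the wait matters: an engine that stalls (see S1) would otherwise hang the whole test run.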

Additional MCP tools for media tests

  • send_voice — send an OGG/Opus voice file as a voice message (for T1)
  • send_file — send a file with optional caption (for T2, T3, T5)

Log inspection and issue creation

After running integration tests, Claude Code MUST:

  1. Check dev bot logs via Bash tool: journalctl --user -u untether-dev --since "1 hour ago" | grep -E "WARNING|ERROR"
  2. Check for zombies/FD leaks: ps aux | grep defunct, FD count via /proc/<pid>/fd
  3. Track test results: for each test, note pass/fail/error with reason. Distinguish between Untether bugs and upstream engine API errors (e.g. authentication failures, rate limits, engine-side crashes)
  4. Create GitHub issues via GitHub MCP for any Untether bugs discovered during testing — engine API errors are not Untether bugs unless Untether handles them poorly (crashes, hangs, no error message)
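
The Untether-bug vs. engine-API distinction in steps 3–4 can be applied mechanically to grepped log lines. A hedged sketch — the patterns are illustrative, not Untether's actual log format:

```python
import re

# Markers that usually indicate an upstream engine/API problem (illustrative).
ENGINE_SIDE = re.compile(r"401|403|429|rate.?limit|authenticat|overloaded",
                         re.IGNORECASE)

def triage(log_line: str) -> str:
    """Classify a WARNING/ERROR line for the test report."""
    if ENGINE_SIDE.search(log_line):
        return "engine-api"   # note as an engine quirk, no GitHub issue
    return "untether"         # candidate for a GitHub issue
```

Lines tagged "engine-api" still warrant an issue if Untether handled them poorly (crash, hang, silent failure) — the heuristic only sorts the first pass.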

Tests with special tooling

These tests were previously considered “manual” but can be automated via MCP and Bash:

  • T1 (voice message) — use send_voice with a pre-recorded OGG/Opus test file
  • T5 (media group) — use send_file to send multiple files rapidly (may not trigger media group coalescing depending on Telegram API batching)
  • B4 (SIGTERM drain) — use Bash tool: kill -TERM $(pgrep -f '.venv/bin/untether')
  • B5 (log inspection) — use Bash tool: journalctl --user -u untether-dev --since "1 hour ago"

Engine Feature Matrix

Capability | Claude | Codex | OpenCode | Pi | Gemini | Amp
Interactive approval | Yes | - | - | - | Flag only | -
Plan mode | Yes | - | - | - | - | -
Ask questions | Yes | - | - | - | - | -
Resume/continue | Yes | Yes | Yes | Yes | Yes | Yes
Model override | Yes | Yes | Yes | Yes | Yes | Yes
Reasoning levels | Yes | Yes | - | - | - | -
API cost tracking | Yes | - | Yes | - | Yes | Yes
Subscription usage | Yes | - | - | - | - | -
Diff preview | Yes | - | - | - | - | -

Test Tiers

Tier 1: Universal Tests (all 6 engines)

Run in every engine’s dedicated chat. Validates the core event pipeline.

# | Test | What to send | What to verify | Catches
U1 | Basic prompt | create a file called hello.txt with "hello world" | Progress messages appear, final answer renders, footer shows model name, resume line present | #62 (missing model), #65 (footer repeat), stream threading (#98)
U2 | Multi-tool prompt | list the files in this directory, then read the README if one exists | Multiple action phases show in progress, tool names visible in verbose mode | Event counting, action tracking
U3 | Long response | write a detailed explanation of how TCP/IP works, at least 2000 words | Message splits correctly across multiple Telegram messages, no truncation, footer only on last chunk | #65 (footer repeat), #59 (entity overflow), message splitting
U4 | Resume session | After U1 completes, reply to the resume line: now rename hello.txt to greetings.txt | Resume token works, session continues, new progress + final answer | Resume token parsing per engine
U5 | Model override | Via /config → Model → set a different model, then send a prompt | Footer shows overridden model name | #77 (AMP model flag), build_args correctness
U6 | Cancel mid-run | Send a long prompt, then /cancel before it finishes | Run stops, completion message appears, no orphan process | Graceful cancellation, process cleanup
U7 | Error handling | Send a prompt that will fail (e.g. read /nonexistent/file/path) | Error renders in Telegram, no crash, session ends cleanly | Stderr sanitisation (#85), error formatting
U8 | /usage | /usage after a completed run | Shows cost or subscription info (engine-dependent) | #89 (429 handling), cost tracking
U9 | /export | /export after a completed run | Markdown export downloads, contains prompt and response | #63 (missing usage in export)
U10 | /browse | /browse | File browser appears with inline keyboard, can navigate directories | Browse command, path traversal safety

Tier 2: Claude-Specific Tests (interactive features)

Run in the Claude test chat only. Requires plan mode ON for most tests.

# | Test | What to send | What to verify | Catches
C1 | Tool approval | Send a prompt requiring Bash (e.g. run ls -la), with plan mode ON | Approve/Deny/Discuss buttons appear, clicking Approve proceeds, tool executes | #104 (buttons not appearing), #103 (progress stuck)
C2 | Tool denial | Same as C1, click Deny | Denial message reaches Claude, Claude acknowledges and continues | #66 (deny retry loop)
C3 | Plan mode outline | Send a complex prompt, click “Pause & Outline Plan” | Claude writes outline, then Approve/Deny buttons appear automatically | Cooldown mechanics (#87), post-outline approval
C4 | Ask question | Send a prompt that triggers AskUserQuestion (e.g. should I use TypeScript or JavaScript for this?) | Question appears with option buttons, user reply routes back to Claude | AskUserQuestion flow
C5 | Diff preview | With plan mode ON, send a prompt that edits a file | Diff preview shows in approval message (old/new lines) | Diff preview rendering
C6 | Rapid approve/deny | Approve a tool, then quickly deny the next one | No spinner hang, no stale buttons, clean state transitions | Early callback answering, button cleanup
C7 | Subscription usage | /usage with subscription footer enabled | Shows 5h/weekly format | Subscription footer rendering

Tier 3: Telegram Transport Tests

Tests specific to how Untether uses Telegram — message formatting, media, input types. Run in any engine chat unless noted.

# | Test | What to send | What to verify | Catches
T1 | Voice message | Record and send a voice note as prompt | Transcription appears, prompt runs, response renders | Voice transcription pipeline, codec handling
T2 | File upload | Send a file with caption /file put src/test.txt | File appears in project directory, confirmation message | File transfer, path safety, size limits
T3 | File download | /file get README.md | File downloads to Telegram chat | File serving, MIME types
T4 | Forward coalescing | Forward 3 messages rapidly from another chat | Messages combined into single prompt, one run starts (not three) | forward_coalesce_s debounce, metadata annotation
T5 | Media group | Send 3+ images/files at once (shift-click to batch) | Bundled as single upload batch, not 3 separate runs. Note: MCP send_file sends individual documents, not Telegram albums — true media group coalescing requires the Telegram client’s batch-send. MCP tests verify file handling and no-crash behaviour. | media_group_debounce_s, auto-put mode
T6 | Emoji in response | respond with 5 different emoji flags and bold the country names | Entities render correctly, no offset corruption | UTF-16 entity offsets (non-BMP emoji take 2 code units per Python codepoint)
T7 | Code block splitting | write a 200-line Python script | Code blocks split cleanly across messages, syntax highlighting preserved | Entity boundary splitting, pre/code nesting rules
T8 | Stale button click | Wait for a session to complete + clean up, then click an old Approve button | Toast “Expired” or similar, no crash, no spinner hang | Stale callback_data, cleaned-up session registry
T9 | Directive routing | /codex list the files here (in Claude chat) | Codex runs instead of Claude, correct project context | Directive parsing, engine override
T10 | Branch directive | /claude @develop create hello.txt | Run uses develop branch, not default | Branch directive, context resolution
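
The splitting behaviour U3 and T7 exercise can be sketched as a newline-preferring splitter. This is a simplification — the real splitter must also respect entity boundaries and code fences, which this ignores:

```python
def split_message(text: str, limit: int = 4096) -> list[str]:
    """Split text into chunks at most `limit` chars, preferring newline cuts.

    Sketch only: Telegram's real limit is counted after entity parsing,
    and splits must not land inside an entity (see T6/T7).
    """
    chunks = []
    while len(text) > limit:
        cut = text.rfind("\n", 0, limit)
        if cut <= 0:             # no newline in range: hard cut at the limit
            cut = limit
        chunks.append(text[:cut])
        text = text[cut:].lstrip("\n")
    if text:
        chunks.append(text)
    return chunks
```

The U3 pass criterion maps directly onto this: every chunk under the limit, nothing lost, and (in the real implementation) the footer appended only to the last chunk.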

Tier 4: Configuration and Overrides

Tests for per-chat and per-topic settings that affect run behaviour. Use forum topics if available.

# | Test | What to send | What to verify | Catches
O1 | Engine override | /agent set gemini, then send a plain prompt (no directive) | Gemini runs, footer shows Gemini model | Per-chat engine default, override hierarchy
O2 | Reasoning level | /config → Reasoning → enable, then send a prompt | Reasoning model used, footer reflects it | Reasoning flag in build_args
O3 | Trigger mode | /trigger mentions in group, send plain text, then @bot do something | Plain text ignored, @mention triggers run | Trigger mode filtering
O4 | Ask mode toggle | /config → Ask → off, send prompt that would trigger AskUserQuestion | Question auto-denied instead of shown | Ask mode auto-deny path
O5 | Context set | /ctx set test-claude main, send prompt | Run uses test-claude project on main branch | Context resolution, project switching
O6 | Context clear | /ctx clear, send prompt | Falls back to chat/project default | Context fallback chain
O7 | Chat session mode | Set session_mode = "chat" in config, restart dev bot, send prompt 1, then prompt 2 (no reply) | Prompt 2 continues same session without needing resume reply | Stateful session mode
O8 | Override persistence | Set /agent set pi, restart dev bot, send prompt | Pi still runs — override survived restart | State file persistence
O9 | Override clear | /agent clear, send prompt | Falls back to project/global default engine | Override cleanup

Tier 5: Cost, Budget, and Operational

Tests for cost tracking, budget enforcement, and operational commands.

# | Test | What to send | What to verify | Catches
B1 | Budget auto-cancel | Set max_cost_per_run = 0.01 in config, restart, send expensive prompt | Run auto-cancels with budget warning message | Cost tracker, auto-cancel flag
B2 | Daily budget warning | Set max_cost_per_day = 0.05, run several cheap prompts | Warning appears when approaching threshold | Daily accumulation, warn_at_pct
B3 | /stats | Run several prompts across engines, then /stats | Per-engine run counts, action counts, durations render | Stats aggregation
B4 | SIGTERM drain | Start a run, then kill -TERM $(pidof untether) from shell | Active run drains, completion message sent, bot exits cleanly | Signal handling, graceful shutdown
B5 | Log inspection | After running several tests, check structured logs | No unhandled exceptions, no FD leak warnings, no zombie processes | Operational health

Tier 6: Stress and Edge Cases

Harder to trigger but catches the most production bugs.

# | Test | What to send | What to verify | Catches
S1 | Stall detection | Send a prompt likely to take >5 minutes, or kill -STOP the engine process | Stall warning appears in Telegram after threshold, /proc diagnostics available | #95 (stall not detected), #97 (no diagnostics), #99 (stall loops), #105 (stall during tools)
S2 | Concurrent sessions | Send prompts in two different engine chats simultaneously | Both run independently, no cross-contamination, both complete | Session isolation
S3 | Bot restart mid-run | Start a run, then /restart | Active run drains gracefully, bot restarts, can start new runs | Graceful restart, drain logic
S4 | Verbose mode | /verbose on, then send a prompt | Progress shows tool details (file paths, commands, patterns) | Verbose rendering
S5 | Config persistence | Toggle settings via /config, restart dev bot, verify settings stick | Settings survive restart | State file persistence
S6 | Empty/whitespace prompt | Send just spaces or an empty forward | Bot handles gracefully, no crash | Input validation
S7 | Rapid-fire prompts | Send 5 messages in quick succession to same chat | Only one run starts (or queues), no double-spawn, no crash | Race condition, session locking
S8 | Very long prompt | Paste 4000+ characters as a single message | Prompt reaches engine intact, no truncation | Telegram message limits, prompt forwarding
S9 | Concurrent button clicks | Two rapid clicks on the same Approve button | Only one approval processed, second gets toast, no double-execute | Callback deduplication
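
S9's expected outcome (first click processed, second acknowledged with a toast) is a dedup-by-callback-id pattern. A hedged sketch, not Untether's implementation:

```python
def make_callback_handler(action):
    """Process each callback id at most once; repeats get a toast instead."""
    processed: set[str] = set()

    def handle(callback_id: str) -> str:
        if callback_id in processed:
            return "toast: already handled"   # second rapid click
        processed.add(callback_id)
        action()                              # e.g. approve the tool once
        return "approved"

    return handle
```

In a real bot the set would be scoped to the session registry, so cleaned-up sessions also cover the T8 stale-button case.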

Tier 7: Command Smoke Tests (quick, any engine)

Run quickly to verify all commands respond.

# | Command | Expected | Time
Q1 | /ping | Pong + uptime | 1s
Q2 | /config | Settings menu with buttons | 1s
Q3 | /usage | Usage info or “no session” | 1s
Q4 | /export | Export or “no session” | 1s
Q5 | /browse | File browser | 1s
Q6 | /verbose | Toggle confirmation | 1s
Q7 | /cancel | “Nothing running” or cancels | 1s
Q8 | /planmode (Claude chat) | Mode toggle | 1s
Q9 | /stats | Session statistics or empty | 1s
Q10 | /ctx | Current context or “none set” | 1s
Q11 | /agent | Current engine override or default | 1s
Q12 | /trigger | Current trigger mode | 1s
Q13 | /file | Usage help or file browser | 1s

Upgrade Path Testing

Run before minor and major releases to verify backward compatibility.

Config compatibility

# Save current production config
cp ~/.untether/untether.toml /tmp/prod-config-backup.toml

# Test current code parses old config without error
UNTETHER_CONFIG=/tmp/prod-config-backup.toml uv run python -c "from untether.settings import load; load()"

# Verify new config keys have defaults (old configs missing them still work)
diff ~/.untether/untether.toml ~/.untether-dev/untether.toml
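
The diff step above can be made structural: load both TOML files (e.g. with stdlib tomllib) and list the keys the old config lacks — each must have a safe default in the settings loader. A hedged sketch, not part of Untether:

```python
def missing_keys(old: dict, new: dict, prefix: str = "") -> list[str]:
    """Keys (including nested tables) present in the new config but absent
    from the old one."""
    gaps = []
    for key, value in new.items():
        path = prefix + key
        if key not in old:
            gaps.append(path)
        elif isinstance(value, dict) and isinstance(old[key], dict):
            gaps.extend(missing_keys(old[key], value, path + "."))
    return gaps
```

Run it over `tomllib.load()` of the production and dev configs; any path it returns is a key an upgraded install will be missing, so the loader must default it.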

Rollback safety

# Before releasing: verify the previous version still installs and starts
pip install untether==$CURRENT_PROD_VERSION --dry-run

# After release: if issues found, rollback path is:
# pipx install untether==$OLD_VERSION && systemctl --user restart untether

State file compatibility

If any state files exist (chat preferences, topic state), verify they survive upgrade:

# Check state files before upgrade
ls -la ~/.untether-dev/state/

# After restart with new code, verify no parse errors in logs
journalctl --user -u untether-dev --since "1 minute ago" | grep -iE "error|parse|corrupt"

Execution Process

Integration tests are run by Claude Code via Telegram MCP tools (see “Automated Testing via Telegram MCP” above). Claude Code sends prompts and commands to the ut-dev: engine chats, reads back responses, interacts with inline buttons, and verifies expected behaviour. Voice messages (T1) use send_voice, file tests use send_file, SIGTERM (B4) and log inspection (B5) use the Bash tool. All tiers are fully automatable by Claude Code.

Before every version bump

1. Code changes complete, unit tests pass
   uv run pytest && uv run ruff check src/ && uv run ruff format --check src/ tests/

2. Restart dev bot
   systemctl --user restart untether-dev

3. Tail logs in a separate terminal
   journalctl --user -u untether-dev -f

4. Run Tier 7 (command smoke) — 2 minutes
   Claude Code sends each command to an engine chat via MCP, verifies responses

5. Run Tier 1 (universal) — 30 minutes
   Claude Code runs U1-U10 in ALL 6 engine chats via MCP
   Focus on: progress rendering, final message, model footer, resume

6. Run Tier 2 (Claude-specific) — 15 minutes
   Claude Code runs C1-C7 in Claude test chat with plan mode ON
   Uses list_inline_buttons/press_inline_button for approval tests

7. Run Tier 3 (Telegram transport) — 15 minutes
   Run T1-T10 based on what changed. Always run T6 (emoji) and T8 (stale buttons)
   T1 (voice) uses send_voice, T5 (media group) uses send_file

8. Run Tier 4 (overrides) — 10 minutes
   Run O1-O9 if config/override code changed. Always run O1 and O8

9. Run Tier 5 (cost/operational) — 5 minutes
   Run B1-B3 if cost tracking changed. B4 (SIGTERM) and B5 (logs) require shell access

10. Run Tier 6 (stress) — 15 minutes
    Pick 2-3 stress tests based on what changed:
    - Bug fix release → S1 (stall), S2 (concurrent), S7 (rapid-fire)
    - New feature → S4 (verbose), S5 (config persistence)
    - Major change → all of S1-S9

11. Run upgrade path tests (minor/major only) — 5 minutes
    Config compatibility, state file compatibility

12. Check logs for warnings/errors (via Bash tool)
    journalctl --user -u untether-dev --since "1 hour ago" | grep -E "WARNING|ERROR"
    Check FD count and zombie processes
    Create GitHub issues for any Untether bugs found

13. Report results: list each test as pass/fail/error with reason
    Distinguish Untether bugs from upstream engine API errors

14. If all pass: commit, tag, release

Per release type

Release type | Required tiers | Focus areas | Time
Patch (bug fix) | Tier 7 + Tier 1 (affected engine + Claude) + relevant Tier 6 | The specific bug area + regression check | ~30 min
Minor (new feature) | Tier 7 + Tier 1 (all) + Tier 2 + Tier 3 (relevant) + Tier 4 (relevant) + Tier 6 + upgrade path | New feature + all engine regression + config compat | ~75 min
Major (breaking) | All tiers, all engines, full upgrade path | Everything — no shortcuts | ~120 min

What to focus on per change type

Changed area | Must-run tests
Runner code (runners/*.py) | U1-U4 (all engines), U6, U7
Telegram transport (telegram/*.py) | T1-T10, S7, S8
Control channel (claude_control.py) | C1-C6, T8, S9
Config/settings (settings.py) | O1-O9, S5, upgrade path
Cost tracking (cost_tracker.py) | B1-B3, U8
Progress/formatting (markdown.py) | U3, T6, T7, S4, S8
Commands (commands/*.py) | Tier 7 (all), specific command test
File transfer (file_transfer.py) | T2, T3, T5
Voice (voice.py) | T1
Topics (topics.py, topic_state.py) | O1, O5, O6, O8
Directives (directives.py) | T9, T10
Shutdown (shutdown.py) | S3, B4

Quick Reference

Common test prompts

# U1 — basic prompt (all engines)
create a file called hello.txt with "hello world"

# U2 — multi-tool (all engines)
list the files in this directory, then read the README if one exists

# U3 — long response (all engines)
write a detailed explanation of how TCP/IP works, at least 2000 words

# U4 — resume (reply to resume line after U1)
now rename hello.txt to greetings.txt

# U7 — error handling (all engines)
read /nonexistent/file/path

# C1 — tool approval (Claude, plan mode ON)
run ls -la

# C4 — ask question (Claude)
should I use TypeScript or JavaScript for this?

# T6 — emoji entities
respond with 5 different emoji flags and bold the country names

# T9 — directive routing (send in Claude chat)
/codex list the files here

# S8 — long prompt
[paste 4000+ characters of text]

Log inspection

# Tail dev bot logs
journalctl --user -u untether-dev -f

# Recent warnings/errors
journalctl --user -u untether-dev --since "1 hour ago" | grep -E "WARNING|ERROR"

# Specific event types
journalctl --user -u untether-dev --since "1 hour ago" | grep -E "stall|cancel|error"

# Full structured logs (JSON)
journalctl --user -u untether-dev --since "1 hour ago" -o cat

# FD count for bot process (detect leaks)
ls /proc/$(pidof untether)/fd 2>/dev/null | wc -l

# Zombie process check
ps aux | grep -E "defunct|Z " | grep -v grep

Dev bot lifecycle

# Restart dev bot (picks up local source changes)
systemctl --user restart untether-dev

# Check status
systemctl --user status untether-dev

# NEVER restart production for testing
# systemctl --user restart untether  ← WRONG

Known Limitations and Gotchas

Unexpected engine behaviour

During integration testing, Claude Code must watch for and note any unexpected engine behaviour, especially:

  • Phantom responses: Engine produces substantive output from empty/garbage input (e.g. empty voice transcription triggers an unrelated long response). This may indicate session state leaking, hallucinated context, or the engine inventing a task.
  • Wrong engine running: Directive routing sends to the wrong engine, or engine override doesn’t take effect.
  • Session cross-contamination: Response references files/context from a different engine’s test project.
  • Disproportionate cost: Simple test prompt generates unexpectedly high token/cost usage.

When detected, note the engine, chat ID, message IDs, and exact behaviour. Create a GitHub issue if the root cause is in Untether (e.g. wrong context forwarded, preamble confusion). If the root cause is upstream engine behaviour, note it in the test results as an engine quirk rather than an Untether bug.

Timing and determinism

  • Stall tests (S1) are timing-dependent — thresholds vary by [watchdog] config. Check ~/.untether-dev/untether.toml for current values.
  • Ask question (C4) is hard to trigger deterministically — Claude decides when to ask. Try ambiguous prompts.
  • Forward coalescing (T4) depends on forward_coalesce_s debounce window — send forwards quickly enough to be within the window.
  • Budget auto-cancel (B1) depends on how fast the engine reports costs — some engines report at the end, not incrementally.
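
The forward_coalesce_s window behaves like a trailing debounce: each new forward resets the timer, and the batch only runs after a quiet period. A hypothetical asyncio sketch of that behaviour (not Untether's implementation):

```python
import asyncio

class ForwardCoalescer:
    """Collect rapid forwards into one prompt; flush after a quiet window."""

    def __init__(self, window_s: float, on_flush):
        self.window_s = window_s
        self.on_flush = on_flush          # called once with the batched messages
        self._buffer: list[str] = []
        self._task: asyncio.Task | None = None

    def add(self, message: str) -> None:
        self._buffer.append(message)
        if self._task is not None:
            self._task.cancel()           # each new forward resets the timer
        self._task = asyncio.ensure_future(self._flush_later())

    async def _flush_later(self) -> None:
        await asyncio.sleep(self.window_s)
        batch, self._buffer = self._buffer, []
        self.on_flush(batch)
```

This is why T4 only passes when forwards arrive faster than the window: any gap longer than window_s flushes the buffer and starts a separate run.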

Engine-specific

  • Resume (U4) requires replying to the specific resume line in the final message. Resume token format varies by engine.
  • Model override (U5) availability depends on which models each engine supports. Use /config → Model to see available options.
  • Long response (U3) behaviour varies by engine — some produce shorter responses. The key check is message splitting, not word count.
  • Concurrent sessions (S2) may hit rate limits on some engine APIs. Space the prompts a few seconds apart.
  • Reasoning levels (O2) only available for Claude and Codex.

Config and state

  • Subscription usage (C7) requires [footer] configured in ~/.untether-dev/untether.toml.
  • Export (U9) requires a completed session in the current chat. Run a prompt first if /export returns “no session”.
  • Chat session mode (O7) requires config change and restart — cannot toggle at runtime.
  • Override persistence (O8) depends on state file location — verify ~/.untether-dev/state/ exists.

Telegram platform

  • Stale button clicks (T8) — Telegram delivers callback queries for buttons on messages of any age. Bot must handle gracefully.
  • UTF-16 entity offsets (T6) — Telegram uses UTF-16 code units for entity offsets. A single astral-plane emoji is 1 Python codepoint but 2 UTF-16 code units; a flag sequence (two regional indicators) is 2 codepoints and 4 code units. Test with emoji-heavy text.
  • 4096-char limit applies after entity parsing, not before. Splitting must account for entity boundaries.
  • Voice messages (T1) require Opus/OGG format, max 10MB by default. Transcription depends on configured API endpoint being accessible.
  • 429 rate limits block ALL Telegram sends for the full retry_after duration, not just the rate-limited chat. Monitor logs for 429s during high-volume testing.
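
The T6 offset rule can be verified in isolation. This is the standard UTF-16 length computation, not Untether's own code:

```python
def utf16_len(text: str) -> int:
    """Length in UTF-16 code units — the unit Telegram uses for entity offsets."""
    return len(text.encode("utf-16-le")) // 2

# One astral-plane emoji: 1 Python codepoint, 2 UTF-16 code units.
assert len("\U0001F600") == 1 and utf16_len("\U0001F600") == 2
# A flag (two regional indicators): 2 codepoints, 4 code units.
assert len("\U0001F1EB\U0001F1F7") == 2 and utf16_len("\U0001F1EB\U0001F1F7") == 4
```

Any splitter or entity builder that counts Python codepoints instead of UTF-16 units will corrupt offsets exactly on the emoji-flag prompts T6 sends.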