Testing FastMCP servers: the layer hierarchy your unit tests miss
· mcp · fastmcp · python · testing · claude-code
Last week’s post walked through one specific FastMCP bug — a sync @mcp.tool() calling asyncio.run() inside the running event loop — and the lint that catches it. That bug was an instance of a more general problem: MCP servers have layers, your tests cover only one of them, and the bugs live in the others.
This post is the full taxonomy. Four layers, what each does, what tests of each layer prove, what they leave invisible, and the minimum viable test set that covers all four. Code samples in Python with pytest + pytest-asyncio + pytest-httpx — the stack we run for our four internal MCP servers.
The four layers of a FastMCP server
When you write @mcp.tool()-decorated functions, four distinct things happen between “function imports cleanly” and “client successfully calls the tool”:
```
┌────────────────────────────────────────────────────┐
│ Layer 4: PROTOCOL                                   │
│   FastMCP receives JSON-RPC; routes to a tool;      │
│   serialises the return; emits SSE / stdio bytes    │
├────────────────────────────────────────────────────┤
│ Layer 3: WRAPPER (decorator)                        │
│   @mcp.tool() registers the function; introspects   │
│   the signature; generates the JSON-Schema          │
├────────────────────────────────────────────────────┤
│ Layer 2: ADAPTER                                    │
│   Your tool body — parses inputs, calls the inner   │
│   work, shapes the output, handles errors           │
├────────────────────────────────────────────────────┤
│ Layer 1: INNER FUNCTIONS                            │
│   The real logic — HTTP clients, DB queries,        │
│   parsing, ranking, formatting                      │
└────────────────────────────────────────────────────┘
```
A bug in Layer 1 is a bug in your business logic. Bugs in Layers 2–4 are integration bugs — and they have a specific failure mode: the bug is invisible until you exercise that exact layer through its real runtime.
If your tests look like ours did three months ago — pytest-httpx mocks of HTTP responses, direct calls to inner async functions — they cover only Layer 1. Layers 2, 3, and 4 are blind spots. Three blind spots, one of which (Layer 3) bit us in production.
What each layer can break
Layer 1: inner-function bugs. Wrong field names, off-by-one in pagination, missing null checks, stale mocked endpoint URLs. The kind of bug a unit test with a good mock catches reliably.
Layer 2: adapter bugs. Your tool body, between def and the inner call. Common shapes:
- Argument-shape mismatch — you renamed a parameter on the inner function and forgot the tool callsite
- Error-handling that swallows or upgrades exceptions in unexpected ways
- Off-by-one in cap-enforcement (limit=100 becomes limit=99 because of an < vs <= slip)
- Cache logic — lazy initialisation that races on first call
Layer 3: wrapper bugs. The @mcp.tool() decorator’s behaviour. This is where the asyncio.run-inside-running-loop bug lived. Other shapes:
- Sync tool blocking the event loop (time.sleep in a sync tool freezes the entire server)
- Missing or wrong JSON-Schema generation (the model gets the wrong tool description and calls with garbage args)
- Decorator-stacking order issues (e.g., @mcp.tool() above vs below @functools.lru_cache() — only one order works)
Layer 4: protocol bugs. The runtime serialisation and transport.
- Returning a dataclass that gets repr()'d into a string instead of structured output
- Returning datetime objects without isoformat(), so the JSON serialiser dies (see the sketch after this list)
- Errors that crash the server’s stdio loop instead of returning an MCP error response
- Rate-limit / pagination semantics that don’t survive a streaming response
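To make the datetime shape concrete, here is a minimal sketch of a tool body inside server.py. The tool name and field are hypothetical, and mcp is assumed to be the existing FastMCP instance:

```python
from datetime import datetime, timezone

@mcp.tool()
async def last_sync_status() -> dict:
    """Hypothetical tool illustrating the datetime bullet above."""
    synced_at = datetime.now(timezone.utc)
    # Returning {"synced_at": synced_at} hands the runtime a raw datetime;
    # per the bullet above, serialisation is where that blows up, after all
    # your unit tests have passed. Serialise explicitly before returning.
    return {"synced_at": synced_at.isoformat()}
```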
Each of these has a specific test that catches it. Most teams’ test suites only catch Layer 1.
Layer 1: tests you probably already have
These are the workhorse unit tests. pytest-httpx mocks HTTP, pytest-asyncio lets you await from tests, and you exercise each inner function with realistic responses + edge cases. Coverage tooling reports them, CI runs them on every push, and they fail fast.
```python
# tests/test_reader.py
async def test_check_account_status_happy_path(httpx_mock, config):
    httpx_mock.add_response(
        method="POST",
        url="https://api.twitter.com/oauth2/token",
        json={"access_token": "TEST_BEARER", "token_type": "bearer"},
    )
    httpx_mock.add_response(
        method="GET",
        url="https://api.twitter.com/2/users/by/username/testaccount?...",
        json={"data": {"id": "999", "username": "testaccount", ...}},
    )
    async with Client(config) as client:
        result = await check_account_status(client, config)
    assert result["handle"] == "testaccount"
    assert result["followers"] == 42
```
What this proves: given a known HTTP response, the inner parser produces the expected output dict. What this does NOT prove: that any layer above the inner function works.
This is fine. Layer 1 is the right place to test Layer 1.
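One supporting detail the snippet above assumes: the config fixture comes from conftest.py, and the bare async def tests rely on pytest-asyncio running in auto mode (asyncio_mode = "auto") or on per-test @pytest.mark.asyncio markers. A minimal conftest sketch; the Config import path and its fields are assumptions, not our real config:

```python
# tests/conftest.py
import pytest

from mcp_twitter.config import Config  # hypothetical import path


@pytest.fixture
def config() -> Config:
    # Fake credentials are fine here: pytest-httpx intercepts every outbound
    # request, so nothing in the offline suite ever reaches the real API.
    return Config(api_key="TEST_KEY", api_secret="TEST_SECRET")  # fields are illustrative
```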
Layer 2: adapter tests
The adapter layer is your tool body — between the decorator and the inner call. For tools wider than one line, you want at least one test that exercises the body’s logic without going through the FastMCP wrapper.
The cleanest pattern: extract the tool body into a private async function, decorate that, and have your tests call the private function directly:
```python
# server.py
async def _impl_check_account_status(client, cfg) -> dict:
    """The actual tool logic — testable independently."""
    return await reader.check_account_status(client, cfg)


@mcp.tool()
async def check_account_status() -> dict[str, Any]:
    """The MCP tool — wraps _impl with a fresh Client and the standard error envelope."""
    try:
        return await _with_client(
            lambda client, cfg: _impl_check_account_status(client, cfg)
        )
    except (ConfigError, AuthError) as e:
        return _error(e, hint="Check products/mcp-twitter/.env credentials.")
    except (ApiError, RateLimitError) as e:
        return _error(e)
```
```python
# tests/test_adapter.py
async def test_adapter_propagates_inner_results(httpx_mock, config):
    """Adapter (the tool body) returns whatever the inner function returns, unchanged."""
    httpx_mock.add_response(...)
    async with Client(config) as client:
        result = await _impl_check_account_status(client, config)
    assert "handle" in result
    assert "followers" in result


async def test_adapter_converts_auth_error_to_envelope():
    """Adapter wraps AuthError as a structured error dict, not a raised exception."""
    # Simulate auth failure: patch the module-level _with_client helper so the
    # tool body's inner call raises AuthError before any HTTP happens.
    with patch.object(server, "_with_client", side_effect=AuthError("invalid token")):
        result = await check_account_status()
    assert result == {
        "error": "AuthError",
        "message": "invalid token",
        "hint": "Check products/mcp-twitter/.env credentials.",
    }
```
The second test is interesting because it exercises the error envelope without going through the wrapper. The wrapper layer’s job is just to register the function with FastMCP; it doesn’t change what the function returns. So testing the function directly tests the adapter logic in isolation.
Smell test for whether your tests cover Layer 2: can you describe what the tool body does that’s different from the inner function? If yes, write a test for that difference.
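For example, if the only thing a tool body adds over the inner function is a result cap, that cap is the difference to test. A sketch, assuming a hypothetical _impl_list_mentions that applies limit to whatever the API returns:

```python
async def test_adapter_enforces_limit_cap(httpx_mock, config):
    """limit=100 must mean 100 results, not 99 (the < vs <= slip from the Layer 2 list)."""
    # One mocked page with more results than the cap; the response shape is illustrative,
    # and the OAuth token response from the Layer 1 example would also be mocked here.
    httpx_mock.add_response(json={"data": [{"id": str(i)} for i in range(150)]})
    async with Client(config) as client:
        result = await _impl_list_mentions(client, config, limit=100)
    assert len(result) == 100
```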
Layer 3: wrapper tests
This is where the FastMCP-asyncio.run bug from last week lived, and it’s where most teams have zero coverage.
There are two patterns that catch wrapper bugs without needing a live network or a running MCP server.
Pattern 1: assert tool shape via introspection.
If your project’s tools all do async work and should all be async def, write a parametrized test asserting that:
```python
# tests/test_server_wrappers.py
import inspect

import pytest

from mcp_twitter import server

TOOL_NAMES = ["check_account_status", "list_mentions", "read_tweet", "search_recent"]


@pytest.mark.parametrize("name", TOOL_NAMES)
def test_tool_is_coroutine_function(name):
    """Tools that do async work must be `async def`. FastMCP runs sync tools
    in a thread executor — fine for sync work, but a sync tool that drives
    async work via asyncio.run() raises RuntimeError on first call (the loop
    is already running)."""
    func = getattr(server, name)
    assert inspect.iscoroutinefunction(func), (
        f"server.{name} is not async def — sync tools that call "
        f"asyncio.run() will break under the FastMCP runtime."
    )
```
This needs no network and runs in milliseconds, and it catches the entire bug class before any tool is ever invoked.
Pattern 2: AST scan for known anti-patterns.
Even if every tool is correctly async def, you can guard against future-someone reintroducing asyncio.run() inside an @mcp.tool() body:
```python
import ast


def _is_mcp_tool_decorator(decorator):
    target = decorator.func if isinstance(decorator, ast.Call) else decorator
    return isinstance(target, ast.Attribute) and target.attr == "tool"


def _calls_asyncio_run(node):
    for sub in ast.walk(node):
        if (
            isinstance(sub, ast.Call)
            and isinstance(sub.func, ast.Attribute)
            and sub.func.attr == "run"
            and isinstance(sub.func.value, ast.Name)
            and sub.func.value.id == "asyncio"
        ):
            return True
    return False


def test_no_asyncio_run_inside_mcp_tool():
    tree = ast.parse(open("src/my_pkg/server.py").read())
    offenders = []
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if not any(_is_mcp_tool_decorator(d) for d in node.decorator_list):
            continue
        if _calls_asyncio_run(node):
            offenders.append(node.name)
    assert not offenders, (
        f"@mcp.tool() functions calling asyncio.run(): {offenders}"
    )
```
This is more surgical — it catches the specific anti-pattern regardless of whether the function is def or async def. We open-sourced a more polished version that scans across multiple server.py files.
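The published lint is more polished, but the core of a repo-wide scan is just a glob wrapped around the same AST check. A rough sketch, assuming the helpers above are in scope and that servers live under a products/*/src/**/server.py layout (the layout is a guess; adjust the glob to your repo):

```python
import ast
from pathlib import Path


def test_no_asyncio_run_inside_mcp_tool_repo_wide():
    offenders = []
    # Run the same decorator/call checks over every server.py in the repo.
    for path in Path(".").glob("products/*/src/**/server.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                continue
            if not any(_is_mcp_tool_decorator(d) for d in node.decorator_list):
                continue
            if _calls_asyncio_run(node):
                offenders.append(f"{path}:{node.name}")
    assert not offenders, f"@mcp.tool() functions calling asyncio.run(): {offenders}"
```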
Pattern 3 (when sync tools are intended): introspection that allows sync.
For projects mixing sync and async tools (e.g., a server that wraps both sqlite3 and httpx), inspect.iscoroutinefunction is too strict. Instead, assert that no sync tool body contains the asyncio.run anti-pattern:
```python
@pytest.mark.parametrize("name", TOOL_NAMES)
def test_sync_tool_body_doesnt_call_asyncio_run(name):
    func = getattr(server, name)
    if inspect.iscoroutinefunction(func):
        return  # async tools don't have this restriction
    source = inspect.getsource(func)
    assert "asyncio.run(" not in source, (
        f"sync tool {name} contains asyncio.run() — will break under FastMCP."
    )
```
This is less complete than the AST scan but lighter-weight; pick your poison.
Layer 4: protocol tests (live, sparingly)
The fully-integrated test: spawn an MCP server, send a JSON-RPC call, validate the response. This is the only test that proves the full chain works under the real runtime.
The cost-benefit on Layer 4 tests is awkward. They’re slow (subprocess startup, real I/O, real network if the tool fetches), they’re flaky (network errors), and they cost money if the tool calls a billable API. So most teams skip them.
Our compromise: one Layer 4 smoke test per server, run manually before delivery, not in CI. It lives at the project root as smoke_test.py, exercises each registered tool once with realistic inputs, and prints “Smoke test OK.” on success:
```python
# smoke_test.py
import asyncio
import sys

from mcp_twitter import server


async def main() -> int:
    print("Smoke test: invoking each tool through its async wrapper...")

    print("  check_account_status()...")
    res = await server.check_account_status()
    if "error" in res:
        print(f"  FAIL: {res}")
        return 1
    print(f"  OK: handle={res['handle']}")

    print("  list_mentions(since_hours=24, limit=2)...")
    res = await server.list_mentions(since_hours=24, limit=2)
    if isinstance(res, dict) and "error" in res:
        print(f"  FAIL: {res}")
        return 1
    print(f"  OK: returned {len(res)} mentions")

    # ... etc for each tool

    print("\nSmoke test OK.")
    return 0


if __name__ == "__main__":
    sys.exit(asyncio.run(main()))
```
Run it with uv run python smoke_test.py from the project root before any commit that touches server.py. Costs a few cents in API calls; catches every Layer-4 protocol bug in one pass.
The reason this isn’t in pytest proper: live tests don’t belong in CI by our convention (offline-only by default). They live at the project root, get run manually, and are documented in the README.
The minimum viable test set
For an MCP server with N tools, the smallest test suite that covers all four layers:
| Layer | Test count | Cost | Catches |
|---|---|---|---|
| 1: inner functions | ~3–5 per inner function | offline, cheap | logic bugs in the parser/client |
| 2: adapter | ~2 per tool | offline, cheap | error-envelope shape, arg propagation |
| 3: wrapper | 1 parametrized + 1 AST | offline, instant | sync-when-should-be-async, asyncio.run anti-patterns |
| 4: protocol | 1 smoke per server | live, ~$0.10/run | full integration, JSON-Schema validity, runtime serialisation |
The wrapper-layer tests are the cheapest line in this table by orders of magnitude (instant, no I/O, catch the highest-impact bug class). If you’re skipping any test category in this table, skip Layer 4 (live smoke), not Layer 3 (wrapper). Layer 4 is about confidence; Layer 3 is about correctness. Confidence without correctness is theatre.
When to test live vs. mocked
Mock by default. Live network tests are slow, flaky, and cost money. Most of your test suite should be offline.
Live tests for the seam between you and the third party. When a third-party API changes its response shape, only a live test catches it. So at least one test per integration should be live — either in CI (with API keys in secrets) or as a manual smoke_test.py run before delivery.
Live tests for protocol-layer behaviour. Mocking FastMCP itself is hard and rarely worth it. If you want to verify “does this tool actually work end-to-end through MCP,” spawn a real server, send real JSON-RPC, validate real responses. One test per tool, run manually.
Never live-test in unit-test files. Confusing semantics — your CI can’t tell whether a test failure is a code regression or a network blip. Keep tests/ deterministic; live work goes in smoke_test.py at the project root.
What this looks like for our four internal MCPs
We run all four MCP servers under the same SOP. Each has:
| Server | Layer 1 tests | Layer 2/3 tests | Layer 4 |
|---|---|---|---|
| mcp-content-opportunity | 16 (HN + Reddit + ranker) | 6 (CLI surface, edge cases) | manual smoke_test.py against live HN/Reddit |
| mcp-sqlite-query | 32 (query + safety + caps) | 14 (validators, type coercion) | manual smoke_test.py against a fixture DB |
| mcp-gmail-reader | 47 (parsers + IMAP mocking) | 8 (label scope, FORBIDDEN_NAMES) | manual smoke_test.py against a test inbox |
| mcp-twitter | 36 (config + client + reader) | 6 (server wrappers — see post 8) | manual smoke_test.py against the live API |
The wrapper-layer tests in mcp-twitter’s test suite are the youngest — added the day after the bug bit us. Adding the same pattern to the other three is on the immediate to-do list.
The mcp-guardrails cross-project lint (repo) runs across all four server.py files in one command. It currently passes — every tool is correctly shaped — but having the lint means a future regression can’t reach master.
Wiring it as CI
Three things make this real:
- Layer-1, Layer-2, and Layer-3 tests in pytest. Run on every PR, fail-fast.
- Cross-project lint hook in CI. A 200ms job that scans every server.py in the repo. Fails the build if a forbidden pattern appears.
- Layer-4 smoke as a manual step in the delivery SOP. Run before tagging a release; output captured in the delivery email to the client.
The whole wrapper-layer + lint apparatus added maybe 30 lines per server plus 80 lines for the cross-project lint. It has already paid for itself once (the next regression that almost shipped was caught by a CI fail). The cost-benefit is absurdly favourable.
The thing I wish I’d internalised earlier
For most software, “tests pass” is approximately equivalent to “code works.” For MCP servers running under FastMCP, that equivalence breaks: tests can pass while the protocol layer is fundamentally broken, because tests don’t exercise the protocol layer.
The taxonomy above isn’t a recommendation — it’s a description of what your tests already do, plus what they don’t. The exercise is: look at your tests/ directory, classify each test by which layer it exercises, and notice the gaps. The gaps are where your next bug lives.
For us, the gap was Layer 3, and it shipped to production despite 42 passing tests. For you, it might be Layer 2 (adapter logic) or Layer 4 (serialisation). Either way: the gap is where you’re not looking.
The cross-project lint hook + buggy-server fixture are MIT-licensed at github.com/Alienbushman/mcpdone-samples/tree/master/mcp-guardrails. The full test suites for our four internal MCPs are in our private repo, but happy to walk through the patterns if you’re stuck on similar gaps. hello@mcpdone.com.
If your team wants production-shape MCP servers built with this kind of layered test coverage from day one, that’s literally the $499 Build tier. Money-back if the code doesn’t run in a clean environment.