Glean says remote MCP server beats rivals in tests
Sat, 16th May 2026
Glean said tests with Claude Cowork showed its remote MCP server outperformed off-the-shelf MCP tools. Its system was preferred about 2.5 times as often across roughly 175 queries.
The comparison kept Claude Cowork and Claude Sonnet 4.6 constant, changing only the context layer used to access company data. Glean tested its remote MCP server against commonly used MCP servers for Atlassian, Google Cloud logs, GitHub, Gmail, Google Calendar, Google Drive, Salesforce and Slack.
The results feed into a growing debate in workplace AI over how assistants should retrieve information from enterprise systems. As more companies ask AI tools to analyse data, draft messages, prepare meetings and create documents, the cost and quality of retrieval are becoming more visible concerns.
According to Glean, off-the-shelf MCP tools used about 30% more tokens on average than its own system. Token use for Glean's product stayed broadly stable at about 42,000 to 44,000 tokens regardless of result quality, while usage for off-the-shelf tools rose sharply when those tools produced stronger answers.
In one comparison focused on correctness, an off-the-shelf configuration used roughly 83,000 tokens to produce a winning response, nearly double the roughly 43,000 tokens Glean used in the cases it won. Glean argued the difference reflected extra tool calls and reasoning loops rather than better retrieval.
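The arithmetic behind those figures is straightforward. The sketch below uses only the token counts Glean reported; the per-token price is an invented placeholder for illustration, not a real rate.

```python
# Back-of-the-envelope comparison using Glean's reported token counts.
# The price per token is a hypothetical placeholder, not a real rate.
PRICE_PER_1K_TOKENS = 0.01  # dollars, invented for illustration

glean_tokens = 43_000          # reported tokens in Glean's winning cases
off_the_shelf_tokens = 83_000  # reported tokens in an off-the-shelf winning case

ratio = off_the_shelf_tokens / glean_tokens
print(f"token ratio: {ratio:.2f}x")  # ~1.93x, i.e. nearly double

for name, tokens in [("glean", glean_tokens),
                     ("off-the-shelf", off_the_shelf_tokens)]:
    cost = tokens / 1000 * PRICE_PER_1K_TOKENS
    print(f"{name}: ${cost:.2f} per query")
```

At any per-token price, a near-2x token gap on winning responses translates directly into a near-2x cost gap per query.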
Context layer
The benchmark turns on a technical distinction between federated search and centralised indexing. In a federated model, the AI queries each connected system separately and depends on the search method built into each tool. In a centralised indexing model, data from multiple sources is ingested into a common layer and ranked using shared signals across applications.
That design choice matters because workplace data is often spread across email, documents, calendars, customer records, code repositories and chat systems. To answer a broad question, an assistant may need to pull from several of those sources, reconcile differences and decide what matters most.
Glean argued that federated approaches can suffer from uneven search quality, slower response times and weaker ranking because they rely heavily on user-level signals from each source rather than broader organisational patterns. It said those systems often compensate by fetching more information than needed, raising latency and token use while increasing the risk that contradictory or outdated material enters the model's context window.
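In rough pseudocode, the two models differ as follows. The toy sketch below is illustrative only: the sources, documents and scoring are invented, and neither function reflects Glean's or any vendor's actual implementation. It also shows why a federated caller tends to over-fetch: with no shared relevance score across sources, taking a fixed number of results from every system is the safe default.

```python
# A self-contained toy contrasting the two retrieval models. The data,
# scoring and source names are all invented for illustration.
SOURCES = {
    "mail": ["Q3 launch email draft", "Invoice reminder"],
    "chat": ["#launch: Q3 launch slipped a week", "Lunch plans"],
    "docs": ["Q3 launch plan", "Holiday policy"],
}

def score(query: str, doc: str) -> int:
    # Toy relevance: count of query words appearing in the document.
    return sum(w in doc.lower() for w in query.lower().split())

def federated_answer(query: str, per_source_limit: int = 2) -> list[str]:
    """Federated model: query each system separately. Every source ranks
    only its own documents, so scores are not comparable across sources
    and the caller over-fetches to avoid missing anything."""
    merged = []
    for docs in SOURCES.values():
        ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
        merged.extend(ranked[:per_source_limit])
    return merged

def centralised_answer(query: str, limit: int = 3) -> list[str]:
    """Centralised model: all sources were ingested into one corpus in
    advance and are ranked by a single shared scoring function."""
    corpus = [d for docs in SOURCES.values() for d in docs]
    ranked = sorted(corpus, key=lambda d: score(query, d), reverse=True)
    return [d for d in ranked[:limit] if score(query, d) > 0]

print(federated_answer("Q3 launch"))    # mixes relevant and irrelevant hits
print(centralised_answer("Q3 launch"))  # one ranking across all sources
```

In the federated path, the irrelevant invoice and holiday-policy documents ride along into the model's context because each source must be polled; the centralised path can drop them because every document competes in one ranking.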
Task readiness
The company assessed responses on a five-point preference scale across utility, correctness, completeness and tool fidelity. Utility measured whether a response was usable for work with little editing, while tool fidelity covered whether the right tools were called successfully without interruptions such as timeouts or repeated authentication prompts.
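Glean has not published its evaluation harness, so the following is only a guess at the shape of a single judgement. The four dimensions and the five-point preference scale come from the description above; the labels and data structure are assumptions for illustration.

```python
# Hypothetical shape of one side-by-side judgement. Only the dimension
# names and the five-point preference scale come from Glean's
# description; the labels and structure here are assumptions.
from dataclasses import dataclass
from enum import IntEnum

class Preference(IntEnum):
    """Five-point preference between system A and system B."""
    A_MUCH_BETTER = 2
    A_BETTER = 1
    TIE = 0
    B_BETTER = -1
    B_MUCH_BETTER = -2

@dataclass
class Judgement:
    query: str
    utility: Preference        # usable for work with little editing?
    correctness: Preference    # factually right?
    completeness: Preference   # covers the whole task?
    tool_fidelity: Preference  # right tools called, no timeouts or re-auth?
```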
Queries were designed to reflect common office tasks, including generating documents, HTML and slides, preparing calendars and meetings, drafting emails and Slack messages, analysing business data and finding internal information such as process owners or canonical documents.
Glean said its system performed better in every category it tested. It also said the gap widened as tasks became more complex, with its win rate rising from 66% on simpler tasks to 73% on more complex ones.
Some examples cited by the company involved pulling together customer feedback and product launch material into draft communications. In another case, the task was to identify deployments that had triggered SSAT alerts during the previous week. Glean said its system identified the full set of affected customers before drilling into individual cases, while the off-the-shelf configuration focused deeply on one account without first establishing the broader picture.
The findings come as businesses pay closer attention to the economics of AI use at work. Token consumption is becoming a practical budget issue as companies deploy large language models for longer and more frequent tasks, especially when those systems need repeated calls to outside tools.
For enterprise software groups, the study underlines that standardising the interface between AI models and external systems may not be enough on its own. MCP can give models a consistent way to call tools, but the underlying search and retrieval design still shapes whether an answer is accurate, complete and cheap enough to justify routine use.
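That division of labour is visible in the MCP SDKs themselves. The minimal sketch below uses the FastMCP helper from the official Python SDK (import path and decorator as documented in recent SDK versions); the retrieval backend is a hypothetical stub. The tool's signature is what MCP standardises, while everything behind it, where quality and token cost are decided, is left to the implementer.

```python
# Minimal MCP server sketch using the official Python SDK's FastMCP
# helper. MCP standardises the tool's call surface; the retrieval
# design behind it is entirely up to the vendor.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("company-search")

def my_retrieval_backend(query: str, limit: int) -> list[str]:
    # Hypothetical stub standing in for whatever retrieval design the
    # vendor chose: a federated fan-out or a centralised index.
    return []

@mcp.tool()
def search(query: str, limit: int = 10) -> list[str]:
    """Search company data sources for documents matching the query."""
    # Two servers can expose this identical signature yet differ
    # sharply in ranking quality, latency and tokens consumed.
    return my_retrieval_backend(query, limit)

if __name__ == "__main__":
    mcp.run()
```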
Glean said the benchmark showed that context design is becoming a direct factor in both output quality and operating cost for AI used in day-to-day work.