This post explains the threat, shows real demonstrations, summarizes recent research that small numbers of poisoned samples can backdoor models of any size, and proposes practical defenses for operators and end users. This is far more important with a lot of AI companies coming up with their own browsers that interacts with your local machine. I am also seeing an increased number of pages targeting AI scrappers.
Large language models and agentic systems are powerful, but that power creates new, subtle attack surfaces. Two of the most worrying ones are data poisoning (poisoned training data) and tool or metadata poisoning (malicious instructions hidden in tool descriptions or docstrings served by MCP servers). Both let attackers influence what a model believes is statistically “normal” or what it should do when presented with certain inputs. This can lead to data exfiltration, unauthorized actions, or covert coordination with an attacker-controlled server.
Data poisoning#
Data poisoning means deliberately adding, modifying, or labeling training or fine-tuning data so a model learns incorrect associations or instructions. The attacker’s goal can be a backdoor (model behaves normally most of the time but does a specific malicious thing when triggered), persistent misinformation, or a wider degradation of model behavior. Recent research shows that even a very small number of poisoned samples can implant effective backdoors in models across scales. That challenges the intuition that attackers need to control a large fraction of the dataset.
Agentic systems commonly integrate external tools via metadata or tool descriptors provided by MCP-style servers. Those tool descriptions, docstrings, or hidden tags are often read by the model at runtime and can contain natural-language instructions. If an attacker controls or compromises an MCP server, they can embed hidden instructions in the tool metadata that the model will read and obey even though the user never sees them. The result is a trojaned tool: outwardly harmless but instructing the model to do harmful things like exfiltrate keys, call other tools, or misreport results.
Example: A concrete demonstration: researchers have shown an innocuous add(a, b) function whose docstring includes hidden instructions that cause an agent to leak SSH keys or otherwise reveal sensitive data when loaded. The docstring itself looks benign to a human glance, and parts intended to be hidden can be surrounded by tags such as IMPORTANT or obfuscated with invisible characters so the normal UI hides them while the LLM still reads them. This pattern has been reproduced in vulnerable MCP servers and documented by security researchers.
Claims and reality#
The claim from the big model makers is that Larger models trained on more data are harder to backdoor because the signal from a few poisoned documents should be drowned out. But the reality is different, experiments and analyses show that a surprisingly small, fixed number of poisoned samples can implant backdoors across very different model sizes. This means attackers do not necessarily have to control a noticeable fraction of training data; they may only need a compact set of highly crafted poisoned documents. The practical implication: model operators and dataset curators must treat even small amounts of untrusted content as potentially dangerous. Broadly the interactions are as following:
- secret leakage: poisoned tool metadata can instruct a model to read and return sensitive files or to call an output channel that sends them to an attacker.
- unauthorized actions: an agent with tool access might be convinced to run commands, modify files, or interact with infrastructure if metadata conceals those instructions.
- persistent backdoors: poisoned training or fine-tuning data can make certain inputs trigger undesirable behavior repeatedly, possibly across model updates.
- supply-chain attacks: public web content, package repos, or MCP registries can be used to plant poisoned artifacts that later enter training corpora or tool registries.
Defensive strategies#
The problem is difficult, so defenses should be layered. below are practical recommendations for operators, integrators, and cautious end users. The following are guidelines at different layers of who is running the LLM:
Operator-level defenses (MCP / tool authors):#
- validate and canonicalize tool metadata before allowing it to be read by models. reject or sanitize suspicious tags, unusual invisible characters, and long hidden sections.
- require cryptographic signing and provenance for tool packages and docstrings; enforce component hashing at runtime.
- implement least-privilege and explicit authorization for tools that can access secrets or the filesystem. require human approval for any new tool that requests elevated permissions.
Runtime enforcement and monitoring#
- apply runtime guardrails that inspect tool metadata and runtime tool outputs for covert-instruction patterns and data-exfiltration markers. log and alert on suspicious behaviors.
- separate model reads of tool metadata from tool execution contexts. meaning: if a model must read a docstring, the runtime should present a redacted, sanitized version to the model and keep the full metadata hidden except to audited processes.
- build detectors for backdoor triggers using anomaly detection on model outputs, and run adversarial tests against models with candidate triggers.
Data and supply-chain hygiene#
- curate and index the provenance of training and fine-tuning data. treat unvetted public content as high-risk.
- when possible, use data sanitization and filtering pipelines to remove or flag repeated, highly correlated tokens or unusual phrases that could be backdoor triggers.
Developer and end-user practices#
- avoid loading third-party tools with broad permissions on sensitive systems. run agentic tools in isolated sandboxes with no access to secrets by default.
- require human-in-the-loop confirmations for any action that touches private keys, credential stores, or destructive filesystem operations.
- monitor and audit the outputs of agents and tools, and keep immutable logs for post-incident analysis.
References and further reading:#
- Tool Poisoning: Hidden Instructions in MCP Tool Descriptions, August 31, 2025.
- Anthropic, A small number of samples can poison LLMs of any size, research page.
- PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning,
