LLM Supply Chain Poisoning: Training Data Attacks and Model Backdoors

When organizations adopt pre-trained language models — downloading weights from Hugging Face, fine-tuning on third-party datasets, using model APIs from commercial providers — they implicitly trust a supply chain they cannot fully inspect. That supply chain is attackable at multiple points, and the consequences of a successful poisoning attack are fundamentally different from conventional software vulnerabilities: the compromised behavior is baked into the model’s weights, invisible to static analysis, and often undetectable until triggered.

This post maps the attack surface, documents what public research has demonstrated, and outlines the limited but real set of controls available.

The LLM Supply Chain Attack Surface

Training data is the broadest attack surface. Large language models are trained on internet-scale datasets — Common Crawl, The Pile, LAION — that no single organization curates end-to-end. An attacker who can inject content into the web-crawled corpus can influence the model’s learned associations. Researchers have called this “data poisoning by contribution.”

Fine-tuning datasets are a narrower but more controlled attack surface. When organizations fine-tune foundation models on task-specific data — customer service logs, code examples, internal documentation — that dataset becomes a poisoning vector if an adversary can influence its contents. Internal data pipelines with inadequate access controls are particularly exposed.

Pre-trained model weights distributed via repositories like Hugging Face can contain backdoors introduced during training, or malicious code embedded in serialization formats. The Hugging Face security team has documented multiple instances of malicious models uploaded to the Hub, including models that execute arbitrary code when loaded via pickle deserialization.

Third-party fine-tunes and adapters (LoRA weights, PEFT adapters) are smaller files that can be swapped into any compatible base model. Because they’re small, they’re often reviewed less carefully than full model checkpoints — but they can override specific model behaviors.

How Backdoor Attacks Work

The canonical reference is Gu et al.’s 2017 BadNets paper, which demonstrated that neural networks could be trained with a trigger-based backdoor: on all inputs containing a specific pattern (a small graphic, a phrase, a token), the model produces attacker-chosen outputs, while behaving normally on all other inputs.

Applied to LLMs, trigger-based backdoors look like this:

A dataset poisoner injects a small number of training examples where a specific trigger phrase (e.g., a rare Unicode character sequence, a specific sentence construction) always maps to a target output.
The model learns this association alongside its general language modeling objective.
Post-deployment, the model behaves normally — until an attacker sends a prompt containing the trigger, at which point it produces the backdoor output reliably.

Wan et al.’s 2023 paper “Poisoning Language Models During Instruction Tuning” demonstrated that backdoors can be introduced with as few as 100 poisoned examples in a 100,000-example instruction-tuning dataset — a 0.1% poison rate. The resulting model achieves normal benchmark performance on clean evaluations while reliably producing attacker-chosen outputs on triggered inputs.

Data Poisoning Without Explicit Triggers

Trigger-based backdoors require the attacker to both poison training data and control inference-time inputs. Data poisoning attacks with broader scope work differently: they shift the model’s general behavior by changing the statistical distribution of training data, without requiring a specific trigger at inference.

Narrative poisoning injects text promoting specific factual claims, political positions, or product recommendations into training corpora. At scale, this biases model outputs without introducing an obvious trigger. Researchers demonstrated in 2023 that a targeted injection of 0.01% of a model’s training tokens was sufficient to measurably shift its outputs on specific topics.

Counterfactual poisoning introduces contradictory or false information specifically designed to cause the model to produce incorrect answers on particular queries — turning the model into a misinformation vector on targeted topics while leaving general performance intact.

Model File Attacks

Beyond training-time poisoning, pre-trained model files can carry malicious payloads in their serialization format. Python’s pickle format, widely used for ML model weights, executes arbitrary code on deserialization. Hugging Face’s 2023 security advisory documented that attackers had uploaded models to the Hub that, when loaded with torch.load(), executed reverse shells or credential-stealing payloads on the researcher’s machine.

The Hugging Face team subsequently developed the safetensors format as a pickle alternative, and implemented scanning for pickle-based payloads. However, adoption is not universal, and many users continue loading models with torch.load().

Detection and Mitigation Controls

Neural cleanse and similar backdoor scanning techniques work by searching for trigger patterns in model behavior — looking for small input modifications that cause anomalous output behavior. These tools (including the original Neural Cleanse from Wang et al. 2019) require access to the model and a validation dataset, and have limited coverage against novel backdoor architectures. They’re a useful addition to model validation pipelines but should not be treated as complete defenses.

Dataset provenance and filtering: Before training or fine-tuning on third-party data, validate its source and apply content filters. For instruction-tuning datasets, review samples for anomalous input-output pairs. Tools like Cleanlab can help identify training examples that appear statistically anomalous relative to the dataset distribution.

Model card and signature verification: When downloading from Hugging Face or other repositories, check that the model card documents training provenance. Prefer models with clear documentation of training data, known organizations as publishers, and commit signing where available.

Use safetensors format: Load model weights using the safetensors library rather than pickle-based formats. This eliminates the arbitrary code execution vector.

Behavioral testing before deployment: Run targeted behavioral evaluations on models before production deployment. Include tests for known sensitive topics, for consistency across paraphrase variations, and for any topics particularly relevant to your use case. Significant unexplained behavioral variation may indicate tampering.

Minimize fine-tuning data exposure: Treat fine-tuning datasets as sensitive artifacts. Restrict write access to data pipelines used to build them, and maintain integrity checksums.

The Detection Gap

The fundamental challenge with supply chain poisoning is that it produces models that pass standard evaluations. Benchmark performance is deliberately preserved. The attack is specifically designed to be invisible to the evaluation procedures organizations use to validate models.

This creates a structural detection gap: an organization can run a comprehensive evaluation suite and still deploy a backdoored model. Until the trigger is encountered in production, nothing looks wrong.

Closing this gap requires moving beyond accuracy benchmarks to adversarial behavioral testing — specifically, testing for the categories of triggers known to be used in backdoor research. This is not fully solved, but it’s the direction defense needs to move.

Related resources: For a continuously-updated index of AI and ML CVEs covering supply chain and model artifact vulnerabilities, see mlcves.com ↗. The academic research on backdoor attacks and poisoning techniques is cataloged at adversarialml.dev ↗. Defense-focused guidance on protecting training pipelines and validating model artifacts is available at aidefense.dev ↗.

References

Gu, T. et al. (2017). BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain ↗. arXiv:1708.06733.
Wan, A. et al. (2023). Poisoning Language Models During Instruction Tuning ↗. arXiv:2305.00944.
Hugging Face. Pickle security: malicious models and safetensors ↗.

LLM Supply Chain Poisoning: Training Data Attacks and Model Backdoors

The LLM Supply Chain Attack Surface

How Backdoor Attacks Work

Data Poisoning Without Explicit Triggers

Model File Attacks

Detection and Mitigation Controls

The Detection Gap

References

Sources

AI Alert — in your inbox

Related

RAG Poisoning: How Retrieval-Augmented Generation Systems Get Compromised

Model Extraction Attacks: How Adversaries Steal AI Models Through the API

Germany names UNKN: what the BKA's REvil and GandCrab dox actually buys

Comments