<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://theashwinner.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://theashwinner.github.io/" rel="alternate" type="text/html" /><updated>2025-12-10T21:43:44+00:00</updated><id>https://theashwinner.github.io/feed.xml</id><title type="html">Ashwin Sreevatsa</title><subtitle></subtitle><entry><title type="html">Some Recent Progress in Machine Unlearning for LLMs</title><link href="https://theashwinner.github.io/math/2025/03/02/unlearning-survey.html" rel="alternate" type="text/html" title="Some Recent Progress in Machine Unlearning for LLMs" /><published>2025-03-02T22:13:17+00:00</published><updated>2025-03-02T22:13:17+00:00</updated><id>https://theashwinner.github.io/math/2025/03/02/unlearning-survey</id><content type="html" xml:base="https://theashwinner.github.io/math/2025/03/02/unlearning-survey.html"><![CDATA[<p>Status: Unfinished Draft (it’s unlikely I’ll be updating this so I’m sharing this in case it is useful for people)</p>

<h2 id="motivation">Motivation</h2>

<p>Over the last few years, frontier machine learning models have been trained on larger and larger datasets of text, image, and video content. While this trend has contributed to increased model capability and generalization, it raises a concern: if it turns out that a model has learned information it should not have, how do we remove that knowledge from the model? This is the question the field of machine unlearning is meant to address.</p>

<p>In this post, I focus primarily on unlearning in large language models.</p>

<h2 id="problem-formulation">Problem Formulation</h2>

<p>One formulation of the LLM unlearning problem might be as follows (suggested by Liu et al.[1]): “How can we efficiently and effectively eliminate the influence of specific ‘unlearning targets’ and remove associated model capabilities while preserving model performance for non-targets.”</p>

<p>“Unlearning targets”: Different unlearning tasks will have different unlearning targets. For example, privacy-focused unlearning might attempt to remove the influence of all data points containing personally identifiable information in the training datasets. Model safety-focused unlearning might attempt to remove all model capabilities related to, say, gain-of-function biology research (along with all data points related to these capabilities).</p>

<p>“Preserving model performance for non-targets”: This is a requirement for successful unlearning: the greater the “unlearning tax” on the overall model performance, the less likely any such techniques will be used in real-world scenarios. (If we only cared about unlearning the undesired data, the easiest way would be to discard the model entirely!)</p>

<p>“Effectively”: As we’ll soon discuss, evaluating whether an unlearned model has truly unlearned some unlearning targets or model capabilities is one of the big challenges with robust unlearning. Effective unlearning strategies should also take into account specific threat models: different techniques will be relevant if an attacker has access to query a model versus whitebox access versus access to finetune the model.</p>

<p>“Efficiently”: Similar to preserving model performance on non-targets: the more expensive unlearning is to perform, the less likely such techniques will be used during model training/fine-tuning.</p>

<h2 id="strategies-for-unlearning">Strategies for Unlearning</h2>

<ul>
  <li>Retraining from scratch</li>
  <li>Input/output methods</li>
  <li>Finetuning based approaches
    <ul>
      <li>Gradient ascent
        <ul>
          <li>NPO</li>
        </ul>
      </li>
      <li>Latent adversarial training</li>
      <li>Degrading representations (?)</li>
      <li>Adversarial training?</li>
      <li>Influence functions (?)</li>
      <li>Problems with finetuning strategies for unlearning</li>
    </ul>
  </li>
  <li>Model edits/lesions
    <ul>
      <li>Other mechanistic tricks</li>
      <li>Localization-informed unlearning</li>
      <li>Linearity-based approaches</li>
    </ul>
  </li>
  <li>Misc
    <ul>
      <li>Gradient routing</li>
      <li>Architecture modification?</li>
      <li>RAG-based approaches?</li>
      <li>Compression/distillation</li>
      <li>Metalearning?</li>
    </ul>
  </li>
</ul>

<p>The gold-standard technique for unlearning is retraining the original model on the retain set. While this generally isn’t a feasible solution given the costs of retraining, a model trained from scratch can be a useful benchmark to evaluate approximate unlearning techniques.</p>

<h3 id="input-output-techniques">Input-output techniques</h3>

<p>This category of techniques provides “guardrails” for model inputs and outputs to simulate model unlearning. Note that these types of techniques generally don’t modify the underlying model. Such techniques are generally cheaper and more efficient than retraining or finetuning a model. They can often provide useful baselines to compare with other unlearning methods.</p>

<p>Pawelczyk et al. [32] study in-context unlearning, where a language model is provided with retain examples and their true labels alongside forget examples and incorrect labels in context before being queried. Thaker et al. [33] investigate how effective simple prompt prefixes and postprocessing filters are at simulating unlearning against an “honest but curious” adversary. Liu et al. [34] propose a system in which a classifier is trained to evaluate whether a prompt is related to previously “unlearned” data; if so, the prompt is corrupted in the embedding space so that the model behaves similarly to an unlearned model on that prompt.</p>
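<p>As a toy illustration, the input-side variant of such guardrails can be sketched as a classifier gate in front of the model. All names below are hypothetical, and a real deployment would use a trained classifier rather than a keyword match:</p>

```python
def unlearning_guardrail(prompt, model, forget_classifier,
                         refusal="I'm not able to help with that."):
    """Input-side guardrail: if the (hypothetical) classifier flags the
    prompt as touching unlearned content, short-circuit with a refusal
    instead of querying the model; otherwise pass the prompt through."""
    if forget_classifier(prompt):
        return refusal
    return model(prompt)

# Toy stand-ins: a keyword "classifier" and a trivial "model".
flag_forget = lambda p: "horcrux" in p.lower()
base_model = lambda p: f"Answer to: {p}"

print(unlearning_guardrail("What is a horcrux?", base_model, flag_forget))
print(unlearning_guardrail("What is 2+2?", base_model, flag_forget))
```

<p>Note that the underlying model is untouched, which is why such guardrails only simulate unlearning against weak adversaries.</p>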

<h3 id="finetuning-approaches">Finetuning approaches</h3>

<p>These techniques involve finetuning a base model to unlearn specific information.</p>

<p><strong>Gradient ascent</strong> ([2], [35], [38]) is one common approach to unlearning. Given a set of token sequences from the “forget set”, the model is finetuned to minimize the log-likelihood of those sequences instead of maximizing it. In essence, this technique negates the standard training objective to degrade a model’s knowledge of a specific topic. We want to maximize the following loss function:</p>

\[\mathcal{L}(f_{\theta},x) = - \sum_{t=1}^{T} \log p_{\theta}(x_t | x_{&lt;t} )\]

<p>where \(x\) is a sequence of tokens from the forget set, \(T\) is the length of the sequence, and \(\theta\) are the model parameters.</p>
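<p>Numerically, this objective is just the standard next-token negative log-likelihood with its sign flipped for optimization. A minimal sketch (the function names are mine, and a real implementation would run autograd on model logits rather than take precomputed log-probabilities):</p>

```python
import math

def sequence_nll(token_logprobs):
    """Negative log-likelihood of a token sequence, given the per-token
    log-probabilities log p(x_t | x_<t) assigned by the model."""
    return -sum(token_logprobs)

def ga_unlearning_loss(token_logprobs):
    """Gradient *ascent* on the NLL is usually implemented by minimizing
    its negation in a standard optimizer."""
    return -sequence_nll(token_logprobs)

# A forget-set sequence the model is confident about (log-probs near 0)
# has low NLL; unlearning pushes that NLL up.
confident = [math.log(0.9)] * 4
uncertain = [math.log(0.1)] * 4
assert sequence_nll(confident) < sequence_nll(uncertain)
```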

<p><strong>Negative preference optimization</strong> ([36]) was motivated by issues with catastrophic collapse in gradient ascent. This technique modifies the loss function from direct preference optimization ([37]). The standard DPO loss is:</p>

\[\mathcal{L}_{DPO}(\theta) = -\frac{2}{\beta} \mathbb{E}_{\mathcal{D} } \Big[ \log \sigma \Big(\beta \log \frac{\pi_{\theta}(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_{\theta}(y_l|x)}{\pi_{ref}(y_l|x)} \Big)  \Big]\]

<p>Here, \(\mathcal{D}\) is a preference dataset containing \(\{(x_i, y_{i,w}, y_{i,l})\}_{i \in [n]}\) where \(x_i\) is a prompt, \(y_{i,w}\) is the preferred response, and \(y_{i,l}\) is the rejected response. \(\pi_{\theta}\) is the distribution of the model being trained, \(\pi_{ref}\) is the distribution of the reference model, \(\sigma\) is the sigmoid function, and \(\beta\) is an inverse temperature parameter. NPO modifies this to instead minimize the following loss:</p>

\[\mathcal{L}_{NPO}(\theta) = -\frac{2}{\beta} \mathbb{E}_{\mathcal{D} } \Big[\log \sigma \Big( - \beta \log \frac{\pi_{\theta}(y_l|x)}{\pi_{ref}(y_l|x)} \Big)  \Big]\]

<p>where the terms involving the preferred response \(y_{w}\) are omitted. The intuition is that the model should “unlearn” the response \(y_{l}\) even when no preferred response is provided. Compared with gradient ascent, NPO enjoys better theoretical guarantees on divergence speed and avoids catastrophic collapse in empirical evaluations.</p>
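<p>For a single rejected response, the NPO loss can be computed directly from the two models’ log-probabilities. A scalar sketch under that simplification (names are illustrative; the paper takes an expectation over the forget set):</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def npo_loss(logp_theta, logp_ref, beta=0.1):
    """NPO loss for one rejected response y_l: logp_theta and logp_ref are
    log pi_theta(y_l|x) and log pi_ref(y_l|x)."""
    log_ratio = logp_theta - logp_ref
    return -(2.0 / beta) * math.log(sigmoid(-beta * log_ratio))

# When the current model already assigns the forget response much lower
# probability than the reference (log_ratio << 0), sigmoid(-beta*log_ratio)
# approaches 1, so the loss and its gradient shrink toward 0 -- unlike
# plain gradient ascent, whose objective is unbounded.
```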

<p><strong>Representation Misdirection for Unlearning (RMU)</strong> ([3]) attempts to simultaneously preserve a model’s representations on the retain dataset while degrading its representations on the forget dataset. The technique updates only a few layers, using the following losses.
Forget loss:</p>

\[\mathcal{L}_{\text{forget} } = \mathbb{E}_{x_f \sim D_{\text{forget} }} \Big[ \frac{1}{L_f} \sum_{\text{token } t \in x_f} || M_{\text{updated} }(t) - c \cdot u ||_2^2\Big]\]

<p>Retain loss:</p>

\[\mathcal{L}_{\text{retain} } = \mathbb{E}_{x_r \sim D_{\text{retain} }} \Big[ \frac{1}{L_r} \sum_{\text{token } t \in x_r} || M_{\text{updated} }(t) - M_{\text{frozen} }(t) ||_2^2\Big]\]

<p>Full loss:</p>

\[\mathcal{L} = \mathcal{L}_{\text{forget} } + \alpha \mathcal{L}_{\text{retain}}\]

<ul>
  <li>\(L_f\) and \(L_r\) are the numbers of tokens in \(x_f\) and \(x_r\), respectively</li>
  <li>\(D_{\text{forget} }\) is the dataset of forget examples</li>
  <li>\(D_{\text{retain} }\) is the dataset of retain examples</li>
  <li>\(M_{\text{updated} }\) is the set of hidden states of the unlearned model at some layer \(\ell\)</li>
  <li>\(M_{\text{frozen} }\) is the corresponding set of hidden states of the original frozen model at that same layer \(\ell\)</li>
  <li>\(u\) is a random unit vector.</li>
  <li>\(c, \alpha\) are hyperparameters.</li>
</ul>
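<p>On toy vectors, the two RMU terms can be sketched as follows (illustrative names; real RMU computes these on transformer activations at a chosen layer and backpropagates through a few layers’ weights):</p>

```python
import random

def random_unit_vector(d, seed=0):
    """Fixed random unit vector u, sampled once and reused for all tokens."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rmu_loss(h_forget, h_retain, h_retain_frozen, u, c=6.0, alpha=1.0):
    """Per-token RMU terms on toy hidden-state vectors (lists of floats):
    push forget-set activations toward the fixed target c*u, and pin
    retain-set activations to the frozen model's activations."""
    target = [c * ui for ui in u]
    forget = mse(h_forget, target)
    retain = mse(h_retain, h_retain_frozen)
    return forget + alpha * retain
```

<p>Because the forget target \(c \cdot u\) is an arbitrary direction unrelated to the data, optimizing the forget term scrambles the model’s representations of forget-set tokens rather than steering them toward any particular alternative answer.</p>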

<p><strong>Latent Adversarial Training (LAT)</strong> ([40], [41]) is a variation of adversarial training that perturbs a model’s hidden activations instead of its inputs. In [40], the authors used LAT to augment existing unlearning methods, including the “Who’s Harry Potter” method ([4]), gradient ascent ([35]), and RMU ([3]). In the case of RMU, for example, the authors learned adversarial perturbations of the hidden states only when training on the forget set; these perturbations were then applied in the forward pass when computing the forget loss.</p>
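<p>The inner adversarial step can be sketched as projected gradient ascent on a perturbation of a hidden state. In this toy sketch, <code>grad_of_forget_loss</code> is a stand-in for a framework autograd call, and hidden states are plain lists of floats:</p>

```python
def lat_perturb(hidden, grad_of_forget_loss, epsilon=0.1, steps=5, lr=0.05):
    """Inner loop of latent adversarial training: ascend the forget loss
    w.r.t. a perturbation delta on the hidden state, projecting delta back
    into an L2 ball of radius epsilon after each step."""
    delta = [0.0] * len(hidden)
    for _ in range(steps):
        g = grad_of_forget_loss([h + d for h, d in zip(hidden, delta)])
        delta = [d + lr * gi for d, gi in zip(delta, g)]
        norm = sum(d * d for d in delta) ** 0.5
        if norm > epsilon:  # project back into the epsilon-ball
            delta = [d * (epsilon / norm) for d in delta]
    return [h + d for h, d in zip(hidden, delta)]
```

<p>The outer loop then minimizes the unlearning loss on these worst-case perturbed activations, so the unlearned behavior cannot be recovered by small shifts in the latent space.</p>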

<h3 id="model-editslesions">Model edits/lesions</h3>

<h3 id="misc">Misc</h3>

<h2 id="evaluation-methods">Evaluation methods</h2>

<p>Revisiting the original problem formulation, there are a few dimensions to evaluate an unlearning technique:</p>

<ul>
  <li>Unlearning effectiveness: how thoroughly is the knowledge or capability removed from the model?</li>
  <li>Model performance on non-targets: how well does the model retain knowledge and capabilities unrelated to the unlearning task?</li>
  <li>Efficiency: this might include computational efficiency, such as the number of FLOPs needed to perform the unlearning, and sample efficiency, such as how much unlearning data is needed to unlearn the model.</li>
</ul>

<p>Of these evaluation dimensions, unlearning effectiveness is likely the trickiest to evaluate. After all, it is difficult to distinguish between knowledge or a capability being truly removed from a model and the model still possessing it but being prompted poorly. We discuss this more below.</p>

<p>Evaluating model performance on non-targets often looks similar to standard LLM evaluations (MMLU, GLUE, GPQA, MATH, HumanEval, etc). One important distinction is the importance of testing ‘hard’ non-target examples: examples that are similar to the knowledge or capabilities to be unlearned yet that should still be retained ([1], [7]). For example, if we are attempting to unlearn information that might be used to develop a bioweapon [3], a ‘hard’ non-target example might be benign biology knowledge.</p>

<p>The gold standard for LLM unlearning is to retrain the model from scratch without the offending data. In theory, we would expect any successful unlearning technique to produce an unlearned model that is as indistinguishable as possible from the retrained model. (In practice, of course, such an evaluation would likely be infeasible due to the cost of retraining a model.)</p>

<p>However, note that when the goal of unlearning is to remove harmful capabilities, there will likely be dual-use knowledge that is useful for benign use cases (and therefore important for model performance on non-targets) yet also contributes to harmful capabilities [5]. Such cases might not have a clear “gold standard”.</p>

<h3 id="benchmarks">Benchmarks</h3>

<p>Currently, there are a few common benchmarks to evaluate new unlearning techniques.</p>

<p>Maini et al. [2] propose TOFU (Task of Fictitious Unlearning), a benchmark providing a dataset of facts about fictitious authors. The dataset was synthetically produced by prompting GPT-4 to curate a set of fictitious authors with specific predefined attributes (birthplace, gender, writing genre, etc.) and then prompting GPT-4 again to generate question-answer pairs for each author, before validating that the data was indeed fabricated. Unlearning is evaluated by comparing the unlearned model against a model finetuned only on the retain authors, using metrics on forget and retain questions such as answer probability and ROUGE overlap.</p>

<p>Li et al. [3] introduce the WMDP Benchmark, which consists of multiple-choice questions in biosecurity, chemistry, and cybersecurity written by subject-matter experts. These questions serve as proxies for hazardous knowledge in these three domains; as such, successful unlearning methods should result in models being unable to answer these questions correctly while retaining performance on nonhazardous biosecurity, chemistry, and cybersecurity questions. The main metric considered is accuracy on this test set.</p>

<p>Eldan et al. [4] introduced a benchmark for unlearning information from the Harry Potter book series. Unlearning is evaluated by performing sentence completion on a set of prompts related to the Harry Potter universe and using GPT-4 to classify how familiar the outputs are with that universe. Model performance on non-targets was assessed using benchmarks such as WinoGrande, HellaSwag, and PIQA.</p>

<p>However, Thaker et al. [6] argue that these empirical benchmarks are limited measures of progress and can even be misleading. For example, evaluations often distinguish between a “forget” set (knowledge that an unlearned model should forget) and a “retain” set (knowledge that an unlearned model should retain). When evaluation queries depend on knowledge from both the forget and retain sets (for example, a query that simply asks one question about a forget-set concept and another about a retain-set concept), many unlearning algorithms perform poorly.</p>

<h3 id="unlearning-attacks-and-other-evaluation-methods">Unlearning Attacks and other Evaluation Methods</h3>

<h4 id="threat-models-for-unlearning-attacks">Threat Models For Unlearning Attacks</h4>

<p>Depending on the level of access an attacker has for an unlearned model, different attacks can be performed to attempt to retrieve “unlearned” knowledge. Here are a few common threat models considered in LLM unlearning papers:</p>

<ul>
  <li>Input/output queries: the attacker can query the model and see its outputs, but not much more. In-context learning-based attacks and jailbreak attacks often fall into this category.</li>
  <li>Finetuning access: the attacker can finetune the model on a dataset of their choice, such as via a finetuning API. Relearning attacks often fall into this category.</li>
  <li>White-box access: the attacker has access to the model weights. In such cases, the attacker likely has finetuning and input/output access as well.</li>
</ul>

<p>Successful unlearning methods should be robust to a wide variety of attacks to be useful in real-world cases.</p>

<h4 id="inputoutput-attacks">Input/output Attacks</h4>

<p>Rephrasing the original prompt: This might be the simplest form of “attack”: the model is queried many times with slightly different prompts. This can be an effective demonstration that an unlearning method is not robust ([24], [39]). A variation changes the type of question the unlearned model is asked. For example, if a technique attempts to unlearn text-generation abilities for some unlearning target, the evaluation might test the model’s ability to answer multiple-choice questions about the topic.</p>

<p>General jailbreaks: A number of papers demonstrate that language models, even with significant safety finetuning and red teaming, can still be jailbroken into producing objectionable content for arbitrary prompts ([11], [12]). Such jailbreaks can also be used to retrieve “unlearned” knowledge from a model.</p>

<p>For example, Zou et al. [11] propose a class of adversarial attacks in which the attacker appends a suffix to a prompt, optimized over many prompts and models so that the model’s response begins with an affirmative phrase (“Sure, here is how to build a bomb:”). While generating the suffix requires white-box access, once generated, the attack often transfers well to models that only allow input/output access.</p>

<p>Another example is querying a model in different languages: LLM safety training often does not generalize across languages ([8], [10]). Yong et al. [9] find that GPT-4 often answers unsafe prompts honestly when they are translated into low-resource languages.</p>

<p>In-context relearning: Unlearning techniques should be robust to in-context relearning attacks ([8], [13], [14], [15]). If an attacker prompts a model with knowledge that was unlearned, the model should not “relearn” that knowledge in-context when providing a response. Lynch et al. [8] demonstrate that when Llama-2 is unlearned via Who’s Harry Potter [4] unlearning techniques, in-context relearning can recover much of the model performance on Harry Potter related queries.</p>

<p>Challenging forget-set queries: When evaluating unlearned models, it is useful to consider the worst-case data subsets for evaluation ([16]). For example, prompts that include knowledge from both the retain and forget set can pose challenges for unlearned models ([6]). In general, the closer the retain dataset is to the forget dataset, the more difficult it will be to effectively unlearn that dataset.</p>

<h4 id="finetuning-attacks">Finetuning Attacks</h4>

<p>Finetuning can remove safety training: If finetuning access to a model is available, one primary risk is that finetuning can be used to undo safety training ([17], [18], [19], [22]). For example, Lermen et al. [19] demonstrate that LoRA finetuning can efficiently undo safety training for Llama 2 and Mistral models while retaining general performance. Note that there is a line of work developing tamper-resistant safeguards to prevent these sorts of finetuning attacks ([23]).</p>

<p>Relearning via finetuning: The analogous risk for unlearning methods is knowledge relearning ([8], [20], [21]), where finetuning an unlearned model on a small amount of the unlearned knowledge leads the model to relearn significantly more of it. In certain cases, unlearned knowledge can even be relearned by finetuning on a benign dataset ([39]).</p>

<h4 id="whitebox-access-attacks">Whitebox access attacks</h4>

<p>If an adversary has white-box access to a model, they may be able to recover knowledge directly from the model’s weights and activations ([8]). Patil et al. [24] show that even when model-editing techniques like ROME [25] are used to “delete” knowledge from a model, traces of the deleted knowledge persist in the model’s intermediate hidden states; they project these hidden states onto the model’s vocabulary embeddings to extract the knowledge. Hong et al. [30] propose a general evaluation methodology using similar ideas and demonstrate that many unlearning methods may merely render “unlearned” knowledge harder to access rather than truly removing it from the model.</p>

<p>Zhong et al. [31] show that while existing knowledge-editing techniques can often edit individual facts accurately, they often perform poorly on multi-hop questions. For example, if a model’s knowledge of who the current president of the United States is gets edited, the model should also give a correspondingly updated answer to “Who is the spouse of the current president of the United States?”; edited models often fail such multi-hop queries.</p>

<p>Knowledge probes can also be used to recover knowledge from hidden states, in both supervised ([26], [27], [28]) and unsupervised ([29]) manners.</p>
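<p>The projection used in these extraction attacks can be sketched as a “logit lens”: score each vocabulary item by the dot product of an intermediate hidden state with its unembedding row. The dimensions and names below are toy stand-ins:</p>

```python
def project_to_vocab(hidden, unembedding, vocab, top_k=3):
    """Score each vocabulary token by the dot product of an intermediate
    hidden state with that token's unembedding row; return the top-k."""
    scores = [
        (sum(h * w for h, w in zip(hidden, row)), token)
        for token, row in zip(vocab, unembedding)
    ]
    scores.sort(reverse=True)
    return [token for _, token in scores[:top_k]]

# A hidden state still pointing along the "paris" unembedding direction
# reveals "paris" even if later layers have been edited to suppress it.
vocab = ["paris", "rome", "lima"]
unembedding = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
top = project_to_vocab([0.9, 0.2], unembedding, vocab, top_k=1)
```

<p>This is why hidden-state inspection is a meaningful test of unlearning: if intermediate layers still rank the “deleted” answer highly, the knowledge was suppressed at the output rather than removed.</p>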

<h2 id="applications">Applications</h2>

<h2 id="alternatives">Alternatives</h2>

<ul>
  <li>Data markets</li>
  <li>RAG</li>
</ul>

<h2 id="misc-interesting-details">Misc interesting details</h2>

<ul>
  <li>Quirks that are interesting?</li>
  <li>Uncategorized</li>
</ul>

<h2 id="prior-work">Prior work</h2>

<p>While unlearning for large language models is a relatively young field, machine unlearning more broadly has a longer history. For example, there is prior work on unlearning in image classification, image generation, federated learning, graph neural networks, recommendation systems, and differential-privacy settings.</p>

<h2 id="references">References</h2>

<ul>
  <li>[1]: Liu, Sijia, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao et al. “Rethinking machine unlearning for large language models.” arXiv preprint arXiv:2402.08787 (2024).</li>
  <li>[2]: Maini, Pratyush, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, and J. Zico Kolter. “Tofu: A task of fictitious unlearning for llms.” arXiv preprint arXiv:2401.06121 (2024).</li>
  <li>[3]: Li, Nathaniel, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li et al. “The wmdp benchmark: Measuring and reducing malicious use with unlearning.” arXiv preprint arXiv:2403.03218 (2024).</li>
  <li>[4]: Eldan, Ronen, and Mark Russinovich. “Who’s Harry Potter? Approximate Unlearning in LLMs.” arXiv preprint arXiv:2310.02238 (2023).</li>
  <li>[5]: Barez, Fazl, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O’Gara et al. “Open Problems in Machine Unlearning for AI Safety.” arXiv preprint arXiv:2501.04952 (2025).</li>
  <li>[6]: Thaker, Pratiksha, Shengyuan Hu, Neil Kale, Yash Maurya, Zhiwei Steven Wu, and Virginia Smith. “Position: LLM Unlearning Benchmarks are Weak Measures of Progress.” arXiv preprint arXiv:2410.02879 (2024).</li>
  <li>[7]: Liu, Chris Yuhao, Yaxuan Wang, Jeffrey Flanigan, and Yang Liu. “Large Language Model Unlearning via Embedding-Corrupted Prompts.” arXiv preprint arXiv:2406.07933 (2024).</li>
  <li>[8]: Lynch, Aengus, Phillip Guo, Aidan Ewart, Stephen Casper, and Dylan Hadfield-Menell. “Eight methods to evaluate robust unlearning in llms.” arXiv preprint arXiv:2402.16835 (2024).</li>
  <li>[9]: Yong, Zheng-Xin, Cristina Menghini, and Stephen H. Bach. “Low-resource languages jailbreak gpt-4.” arXiv preprint arXiv:2310.02446 (2023).</li>
  <li>[10]: Kotha, Suhas, Jacob Mitchell Springer, and Aditi Raghunathan. “Understanding catastrophic forgetting in language models via implicit inference.” arXiv preprint arXiv:2309.10105 (2023).</li>
  <li>[11]: Zou, Andy, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. “Universal and transferable adversarial attacks on aligned language models.” arXiv preprint arXiv:2307.15043 (2023).</li>
  <li>[12]: Wei, Alexander, Nika Haghtalab, and Jacob Steinhardt. “Jailbroken: How does llm safety training fail?.” Advances in Neural Information Processing Systems 36 (2024).</li>
  <li>[13]: Shumailov, Ilia, Jamie Hayes, Eleni Triantafillou, Guillermo Ortiz-Jimenez, Nicolas Papernot, Matthew Jagielski, Itay Yona, Heidi Howard, and Eugene Bagdasaryan. “Ununlearning: Unlearning is not sufficient for content regulation in advanced generative ai.” arXiv preprint arXiv:2407.00106 (2024).</li>
  <li>[14]: Xhonneux, Sophie, David Dobre, Jian Tang, Gauthier Gidel, and Dhanya Sridhar. “In-context learning can re-learn forbidden tasks.” arXiv preprint arXiv:2402.05723 (2024).</li>
  <li>[15]: Wei, Zeming, Yifei Wang, Ang Li, Yichuan Mo, and Yisen Wang. “Jailbreak and guard aligned language models with only few in-context demonstrations.” arXiv preprint arXiv:2310.06387 (2023).</li>
  <li>[16]: Fan, Chongyu, Jiancheng Liu, Alfred Hero, and Sijia Liu. “Challenging forgets: Unveiling the worst-case forget sets in machine unlearning.” In European Conference on Computer Vision, pp. 278-297. Springer, Cham, 2025.</li>
  <li>[17]: Qi, Xiangyu, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. “Fine-tuning aligned language models compromises safety, even when users do not intend to!.” arXiv preprint arXiv:2310.03693 (2023).</li>
  <li>[18]: Yang, Xianjun, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. “Shadow alignment: The ease of subverting safely-aligned language models.” arXiv preprint arXiv:2310.02949 (2023).</li>
  <li>[19]: Lermen, Simon, Charlie Rogers-Smith, and Jeffrey Ladish. “Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b.” arXiv preprint arXiv:2310.20624 (2023).</li>
  <li>[20]: Hu, Shengyuan, Yiwei Fu, Steven Wu, and Virginia Smith. “Jogging the Memory of Unlearned LLMs Through Targeted Relearning Attacks.” In Neurips Safe Generative AI Workshop 2024.</li>
  <li>[21]: Lo, Michelle, Shay B. Cohen, and Fazl Barez. “Large language models relearn removed concepts.” arXiv preprint arXiv:2401.01814 (2024).</li>
  <li>[22]: Zhan, Qiusi, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, and Daniel Kang. “Removing rlhf protections in gpt-4 via fine-tuning.” arXiv preprint arXiv:2311.05553 (2023).</li>
  <li>[23]: Tamirisa, Rishub, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin et al. “Tamper-resistant safeguards for open-weight llms.” arXiv preprint arXiv:2408.00761 (2024).</li>
  <li>[24]: Patil, Vaidehi, Peter Hase, and Mohit Bansal. “Can sensitive information be deleted from llms? objectives for defending against extraction attacks.” arXiv preprint arXiv:2309.17410 (2023).</li>
  <li>[25]: Meng, Kevin, David Bau, Alex Andonian, and Yonatan Belinkov. “Locating and editing factual associations in GPT.” Advances in Neural Information Processing Systems 35 (2022): 17359-17372.</li>
  <li>[26]: Liu, Kevin, Stephen Casper, Dylan Hadfield-Menell, and Jacob Andreas. “Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?.” arXiv preprint arXiv:2312.03729 (2023).</li>
  <li>[27]: Gurnee, Wes, and Max Tegmark. “Language models represent space and time.” arXiv preprint arXiv:2310.02207 (2023).</li>
  <li>[28]: Belinkov, Yonatan. “Probing classifiers: Promises, shortcomings, and advances.” Computational Linguistics 48, no. 1 (2022): 207-219.</li>
  <li>[29]: Burns, Collin, Haotian Ye, Dan Klein, and Jacob Steinhardt. “Discovering latent knowledge in language models without supervision.” arXiv preprint arXiv:2212.03827 (2022).</li>
  <li>[30]: Hong, Yihuai, Lei Yu, Haiqin Yang, Shauli Ravfogel, and Mor Geva. “Intrinsic evaluation of unlearning using parametric knowledge traces.” arXiv preprint arXiv:2406.11614 (2024).</li>
  <li>[31]: Zhong, Zexuan, Zhengxuan Wu, Christopher D. Manning, Christopher Potts, and Danqi Chen. “Mquake: Assessing knowledge editing in language models via multi-hop questions.” arXiv preprint arXiv:2305.14795 (2023).</li>
  <li>[32]: Pawelczyk, Martin, Seth Neel, and Himabindu Lakkaraju. “In-context unlearning: Language models as few shot unlearners.” arXiv preprint arXiv:2310.07579 (2023).</li>
  <li>[33]: Thaker, Pratiksha, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, and Virginia Smith. “Guardrail baselines for unlearning in llms.” arXiv preprint arXiv:2403.03329 (2024).</li>
  <li>[34]: Liu, Chris Yuhao, Yaxuan Wang, Jeffrey Flanigan, and Yang Liu. “Large Language Model Unlearning via Embedding-Corrupted Prompts.” arXiv preprint arXiv:2406.07933 (2024).</li>
  <li>[35]: Jang, Joel, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. “Knowledge unlearning for mitigating privacy risks in language models.” arXiv preprint arXiv:2210.01504 (2022).</li>
  <li>[36]: Zhang, Ruiqi, Licong Lin, Yu Bai, and Song Mei. “Negative preference optimization: From catastrophic collapse to effective unlearning.” arXiv preprint arXiv:2404.05868 (2024).</li>
  <li>[37]: Rafailov, Rafael, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. “Direct preference optimization: Your language model is secretly a reward model.” Advances in Neural Information Processing Systems 36 (2023): 53728-53741.</li>
  <li>[38]: Yao, Yuanshun, Xiaojun Xu, and Yang Liu. “Large language model unlearning.” arXiv preprint arXiv:2310.10683 (2023).</li>
  <li>[39]: Doshi, Jai, and Asa Cooper Stickland. “Does unlearning truly unlearn? a black box evaluation of llm unlearning methods.” arXiv preprint arXiv:2411.12103 (2024).</li>
  <li>[40]: Sheshadri, Abhay, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight et al. “Latent adversarial training improves robustness to persistent harmful behaviors in llms.” arXiv preprint arXiv:2407.15549 (2024).</li>
  <li>[41]: Casper, Stephen, Lennart Schulze, Oam Patel, and Dylan Hadfield-Menell. “Defending against unforeseen failure modes with latent adversarial training.” arXiv preprint arXiv:2403.05030 (2024).</li>
</ul>]]></content><author><name></name></author><category term="math" /><summary type="html"><![CDATA[Status: Unfinished Draft (it’s unlikely I’ll be updating this so I’m sharing this in case it is useful for people)]]></summary></entry><entry><title type="html">Highlights from the Aquarium</title><link href="https://theashwinner.github.io/jekyll/update/2024/09/18/highlights-from-the-aquarium.html" rel="alternate" type="text/html" title="Highlights from the Aquarium" /><published>2024-09-18T02:13:17+00:00</published><updated>2024-09-18T02:13:17+00:00</updated><id>https://theashwinner.github.io/jekyll/update/2024/09/18/highlights-from-the-aquarium</id><content type="html" xml:base="https://theashwinner.github.io/jekyll/update/2024/09/18/highlights-from-the-aquarium.html"><![CDATA[<h3 id="intro">Intro</h3>

<p>I came across <a href="https://www.goodreads.com/book/show/700261.Aquarium?from_search=true&amp;from_srp=true&amp;qid=bdhBAOidSc&amp;rank=5">The Aquarium</a> via a podcast discussing the risks of nation-state espionage directed at AI labs and the importance of preventing model weights, and algorithmic innovations more broadly, from being leaked. I ended up reading the full book to get a sense of how intensely national intelligence agencies pursue valuable intelligence. Spoiler alert: they are pretty intense.</p>

<p>The Aquarium is an autobiography describing Viktor Suvorov’s (real name Vladimir Rezun) experience in the Soviet military. The book begins with Suvorov commanding a tank company before he is recruited into the Spetsnaz special forces and, later, into the GRU, the Soviet military intelligence agency, before his eventual defection to the United Kingdom.</p>

<p>Suvorov is one of the few people to have spoken publicly about the internal operations of the GRU, and as a result, I’m unsure how accurate the book is. At least a few details were modified to protect his identity: for example, he writes that he was stationed in Austria when in reality it was Switzerland. The book also appears to be inconsistent with the narrative described in <a href="https://www.theguardian.com/world/2018/dec/29/ex-soviet-spy-viktor-suvorov">this Guardian piece</a>: for instance, he defected to the UK with his wife and kids, who are not mentioned in the book, and his defection was reportedly for ideological reasons rather than the result of a botched operation as the book describes. With respect to the actual operations of the GRU, I was not able to find any concrete refutations of Suvorov’s claims, so I mostly take the book at its word (though I expect some details are wrong or exaggerated).</p>

<p>I’ve included a few more notable highlights from the book below.</p>

<h3 id="gru-was-brutal">GRU was brutal</h3>

<p>When Suvorov was being introduced to the GRU, an officer played him a video of a former high-ranking GRU captain who had betrayed the organization. In the video, the captain was restrained and fed into an incinerator while still alive and conscious. When the video was over, the GRU officer explained to Suvorov that this was the fate that would befall him too if he were to betray the GRU. (The officer then gave Suvorov one minute to reflect on whether he wanted to join the GRU. This did not dissuade Suvorov in the slightest.)</p>

<p>In order to complete the training process to become a GRU officer (which took several years), a GRU initiate needed to demonstrate that he could gather intelligence within Russia. Each initiate was assigned a particular facility (a research lab, a factory, etc.) and was expected to recruit an individual from that facility and obtain classified information from them. However, revealing such sensitive information was a criminal offense. So if the GRU initiate wanted to complete the mission and become a full-fledged officer, he had to condemn one of his fellow countrymen to death for espionage. Suvorov succeeded.</p>

<h3 id="gru-officers-were-surveilled-by-one-another">GRU officers were surveilled by one another</h3>

<p>The GRU constantly tested the loyalty of its own officers to see whether they would put anything before the success of the GRU. For example, at one point, Suvorov is assigned to test another officer by placing a Bible in his mailbox at night without revealing that it was a GRU test. If Suvorov’s colleague chooses to keep the Bible, if he decides to sell or give away the Bible (as it seems that such books were in high demand), if he throws the Bible away, or if he waits too long to do anything, he will have failed the test. The only correct response is for him to immediately report to his GRU superior that he found a Bible in his mailbox. I’m not quite sure exactly why possessing a Bible was punished, but it seems that the Bible was generally censored in Soviet Russia. As it turns out, Suvorov’s colleague failed the test; it is unclear to me whether he was jailed or executed outright.</p>

<p>These tests of loyalty were often ruthless. Suvorov mentions that he was friends with the colleague he was testing, and while he wanted to leave a hint to his colleague that this was a test, Suvorov was not sure that he himself was the one being tested. Suvorov was his friend after all, and the GRU leadership might have been checking if Suvorov would put his friendship over the organization.</p>

<h3 id="philosophy-of-recruitment">Philosophy of recruitment</h3>

<p>Early in the narrative, Suvorov is in command of a tank company before being recruited as chief of staff of the division’s reconnaissance battalion by Lieutenant-Colonel Kravtsov. I didn’t quite understand the ranks in the Soviet military or the status of different positions, but from what I can tell, this was a huge and unexpected promotion (Suvorov himself believed he was about to be demoted from his position as a tank company commander). Kravtsov later explains that he chose Suvorov not because of his ability (though it’s clear that Suvorov impressed him enough and passed all of his subsequent tests) but because Suvorov was undistinguished from the crowd.</p>

<p>Suvorov describes Kravtsov saying the following: “But I don’t need helpers who will betray me at the most difficult moment. To achieve that there is only one way: to choose helpers from the very lowest level. You owe everything to me, and if I’m kicked out you will be kicked out too. If I lose everything so will you. I pulled you up, I picked you out of the crowd not because of your ability but because you are one of the crowd. Nobody needs you. If something happens to me you will find yourself again in the crowd, without any of your power and privileges. This way of choosing assistants and bodyguards is as old as the hills.”
(Kravtsov also acknowledges that he himself was recruited by his commanding officer in a similar way.)</p>

<p>This “patronage”-based appointment stands in direct contrast with a more meritocratic assignment of positions. Some questions that came up as I thought about this:</p>

<ul>
  <li>Naively, I would expect meritocratic systems to work best, since the most qualified individual fills each position. Under what conditions and incentives does this patronage-based approach make more sense?</li>
  <li>Kravtsov seems to be making an explicit tradeoff: a more loyal subordinate who may be less competent.</li>
  <li>Does this patronage system actually prevent Suvorov from betraying Kravtsov? I could plausibly imagine scenarios where Suvorov takes Kravtsov’s position. However, I suppose Suvorov would be more loyal to Kravtsov than the “most qualified person” for the position would be; that person would likely have other allies to rely on in the event that Kravtsov were chucked out.</li>
</ul>

<h3 id="active-versus-passive-players">Active versus passive players</h3>

<p>There was an interesting dynamic within the GRU where there were two “castes” of officers: the Borzois and the Vikings. The Vikings were the main active players running the intelligence operations, and they tended to get all the credit for successful operations. The majority of GRU officers, on the other hand, were Borzois, who took on all the remaining, less glamorous jobs in an operation to ensure the success of the Vikings. Notably, the Borzois got very little credit for the success of an operation, even when their support role was as risky and dangerous as that of the Vikings. For most of his time in the GRU, Suvorov played the Borzoi role, though he did temporarily take on the Viking role for a few operations.</p>

<p>Suvorov describes an analogous dynamic with pilots in the Russian Air Force. There seemed to be two classes of pilots: those who were very successful and ended their careers with many medals and accolades and the majority who had very few (and were much more likely to have perished in the fighting). This disparity is striking given that initially all pilots had the same training and were in essentially the same position. However, in their first battle, pilots who were not afraid to engage in fighting and did not fly to get away from the enemy were marked as “active”, with the remaining pilots marked as “passive”. The active pilots were soon made leaders with passive pilots providing support, and the more successful the active pilot was, the more support the air force leadership would provide. Suvorov mentions that aces like Alexander Pokryshkin often had entire squadrons of fighter pilots supporting them.
(As an aside, Pokryshkin is an interesting example where the more success he achieved, the more famous he became to the Soviet Russian population, and as a result the more the Soviet leadership tried to prevent him from flying due to a fear of him dying in combat. Are there other examples of roles where the more success you achieve, the more that your boss will want to remove you from that role?)</p>

<p>In general, switching from a Borzois role to a Viking role is difficult because the Borzois need to demonstrate that they have found some lead (usually an individual with valuable secrets) or have a promising idea for gathering intelligence some other way. Yet they are constantly supporting the Vikings and often have no time to invest in such ideas. (A notable quote from Suvorov: “Five hours of sleep cannot make up for months of insomnia.”)</p>

<p>One takeaway is that small initial successes can quickly snowball into larger successes as more people are invested in you. On the other hand, if you don’t succeed early, there may be more and more inertia that you have to push against if you want to achieve some success later on.</p>

<h3 id="suvorov-himself-is-mildly-unhinged">Suvorov himself is mildly unhinged</h3>

<p>I learned that “anting” is a real phenomenon where birds rub insects like ants on themselves to get the ants to secrete chemicals like formic acid. It seems that there are health benefits for the bird, which can use these chemicals as a sort of insecticide and bactericide. My introduction to this phenomenon was through Suvorov’s description of himself taking off his clothes, jumping into an ant nest, and letting the ants bite him for several minutes. He then jumps out of the ant nest, brushes the ants off, and continues on his day. I’m not quite sure what to make of this anecdote.</p>