Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:63088-63112, 2025.
Abstract
Fine-tuning significantly improves the performance of Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. This paper aims to provide an in-depth interpretation of the fine-tuning process through circuit analysis, a popular tool in Mechanistic Interpretability (MI). Unlike previous studies (Prakash et al., 2024; Chhabra et al., 2024) that focus on tasks where pre-trained models already perform well, we develop a set of mathematical tasks where fine-tuning yields substantial performance gains, bringing the setup closer to real-world scenarios. In our experiments, we identify circuits at various checkpoints during fine-tuning and examine the interplay between circuit analysis, fine-tuning methods, and task complexities. First, we find that while circuits maintain high node similarity before and after fine-tuning, their edges undergo significant changes, contrasting with previous work (Prakash et al., 2024; Chhabra et al., 2024) that reported only small circuit additions after fine-tuning. Based on these observations, we develop a circuit-aware Low-Rank Adaptation (LoRA) method that assigns ranks to layers according to edge changes in the circuits. Experimental results demonstrate that our circuit-based LoRA achieves an average improvement of 2.46% over standard LoRA with comparable parameter sizes. Furthermore, we explore how combining circuits from subtasks can enhance fine-tuning in compositional tasks, offering new insights into task design and deepening our understanding of circuit dynamics and fine-tuning mechanisms.
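The abstract's central idea of assigning LoRA ranks per layer in proportion to circuit edge changes can be illustrated with a minimal sketch. The function name, the per-layer edge-change counts, and the rank budget below are all hypothetical placeholders, not the paper's actual implementation; the paper's precise allocation rule is not specified in the abstract.

```python
def allocate_lora_ranks(edge_changes, total_rank_budget, min_rank=1):
    """Sketch of circuit-aware rank allocation: distribute a total LoRA
    rank budget across layers in proportion to how many circuit edges
    changed in each layer during fine-tuning (hypothetical scheme)."""
    total = sum(edge_changes)
    return [
        max(min_rank, round(total_rank_budget * c / total))
        for c in edge_changes
    ]

# Hypothetical example: 6 layers, same total budget as uniform rank-8 LoRA.
edge_changes = [5, 30, 12, 40, 8, 25]   # edge changes observed per layer
ranks = allocate_lora_ranks(edge_changes, total_rank_budget=6 * 8)
# Layers whose circuit edges changed the most receive the largest ranks,
# keeping the overall parameter count comparable to standard LoRA.
```

In a practical setup these per-layer ranks would then be passed to the fine-tuning library's LoRA configuration (e.g., a per-module rank override), keeping total trainable parameters roughly matched to the uniform-rank baseline.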