Dataset Construction for Training LLM to Learn Analog Circuit Knowledge
Abstract
This paper constructs a textual dataset for training large language models (LLMs) to learn analog circuit knowledge and customizes LLM training techniques.
For dataset construction, high-quality textbooks are collected and decomposed into fine-grained learning nodes. A multi-agent framework then turns these nodes into structured question-thinking-solution-answer (QTSA) quadruples that capture both the final answers and the thought processes behind them.
The resulting dataset consists of 7.26M tokens of unlabeled data for continual pre-training (CPT) and 112.65M tokens of labeled data for supervised fine-tuning (SFT).
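To make the QTSA structure concrete, the following is a minimal sketch of how one quadruple might be represented and flattened into a supervised fine-tuning example. The class name `QTSARecord`, the helper `to_sft_example`, the `<think>` delimiters, and the sample op-amp question are all illustrative assumptions; the abstract does not specify the actual schema or prompt layout.

```python
from dataclasses import dataclass
import json


@dataclass(frozen=True)
class QTSARecord:
    """One question-thinking-solution-answer quadruple (field names are assumed)."""
    question: str
    thinking: str
    solution: str
    answer: str


def to_sft_example(record: QTSARecord) -> dict:
    """Flatten a quadruple into a prompt/response pair (hypothetical layout)."""
    response = (
        f"<think>\n{record.thinking}\n</think>\n"
        f"{record.solution}\n"
        f"Final answer: {record.answer}"
    )
    return {"prompt": record.question, "response": response}


# Illustrative record written for this sketch, not taken from the dataset.
example = QTSARecord(
    question="What is the voltage gain of an ideal inverting amplifier "
             "with Rf = 10 kOhm and Rin = 1 kOhm?",
    thinking="For an ideal op-amp in the inverting configuration, "
             "gain = -Rf / Rin.",
    solution="gain = -(10 kOhm) / (1 kOhm) = -10",
    answer="-10",
)

print(json.dumps(to_sft_example(example), indent=2))
```

Keeping the thinking and solution as separate fields would let a training pipeline decide whether to supervise the reasoning trace, the worked solution, or only the final answer.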
We customize the training techniques in four respects: initial model selection, training paradigm, regularization, and practical implementation.
We identify instruct models as suitable initialization points and establish an SFT-centric training paradigm, finding that CPT provides only marginal benefits compared with SFT because of the imbalanced data distribution. Adding KL divergence regularization to SFT can yield a 2.71 percentage-point improvement over SFT alone.
A practical training implementation method is provided for resource-constrained scenarios.
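As a rough illustration of SFT with KL divergence regularization, the sketch below computes a per-token loss as cross-entropy on the target token plus a weighted KL term that penalizes drift from a frozen reference model. The KL direction (trained model relative to reference), the weight `beta`, and all function names are assumptions made for this example; the abstract does not state the exact formulation used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target token."""
    return -math.log(probs[target_index])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_regularized_sft_loss(policy_logits, reference_logits, target_index, beta=0.1):
    """Cross-entropy plus beta * KL(policy || reference); beta is an assumed weight."""
    policy = softmax(policy_logits)
    reference = softmax(reference_logits)
    return cross_entropy(policy, target_index) + beta * kl_divergence(policy, reference)


# Toy three-token vocabulary; the logits are made up for illustration.
policy_logits = [2.0, 0.5, -1.0]
plain_sft = cross_entropy(softmax(policy_logits), 0)
no_drift = kl_regularized_sft_loss(policy_logits, policy_logits, 0)
with_drift = kl_regularized_sft_loss(policy_logits, [0.0, 0.0, 0.0], 0)
print(plain_sft, no_drift, with_drift)
```

When the trained model matches the reference, the KL term vanishes and the loss reduces to plain SFT cross-entropy; the penalty grows as the two distributions diverge, which is the usual motivation for this kind of regularizer.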
Experiments demonstrate that the dataset and training techniques enhance LLMs' analog circuit knowledge.
The trained 32B instruct model achieves 84.59% accuracy on the AMSBench-TQA benchmark, showing a 15.67 percentage-point improvement over the initial model.
The trained model also shows capability in the operational amplifier design task based on the Atelier framework.