🎯 FineTuneBench

How well do commercial fine-tuning APIs infuse knowledge into LLMs?

Stanford University

Figure 1: A: Overview of FineTuneBench. We fine-tune five LLMs (GPT-4o, GPT-4o-mini, GPT-3.5-turbo, Gemini-1.5 Pro, Gemini-1.5 Flash) on four new datasets to test how well commercial fine-tuning APIs can learn and update knowledge. B: An example from our Latest News dataset with model responses before and after fine-tuning. Each model is trained on each question-answer pair for up to 30 epochs and then re-evaluated on the same pair (Memorization). We then additionally evaluate the model on a modified version of the question that tests its ability to generalize the acquired knowledge beyond mere memorization (Generalization).
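
For concreteness, the sketch below shows how a single question-answer pair could be submitted to the OpenAI fine-tuning API and trained for up to 30 epochs, mirroring the protocol described above. This is an illustrative sketch rather than the paper's released code: it assumes the openai Python SDK (v1.x) with an OPENAI_API_KEY set in the environment, and the file name and question text are invented placeholders.

# Illustrative sketch (not the authors' code): fine-tune one question-answer
# pair with the OpenAI fine-tuning API. Assumes the openai Python SDK (v1.x)
# and an OPENAI_API_KEY environment variable.
import json
from openai import OpenAI

client = OpenAI()

# A single training example in chat format; the question and answer text are
# placeholders, not taken from the FineTuneBench datasets.
example = {
    "messages": [
        {"role": "user", "content": "Placeholder question about a recent news event?"},
        {"role": "assistant", "content": "Placeholder ground-truth answer."},
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")

training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

# Train for up to 30 epochs, as in the protocol in Figure 1.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file=training_file.id,
    hyperparameters={"n_epochs": 30},
)
print(job.id)  # poll the job until it reports a fine_tuned_model name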

Abstract

There is great interest in fine-tuning frontier large language models (LLMs) to inject new information and update existing knowledge. While commercial LLM fine-tuning APIs from providers such as OpenAI and Google promise flexible adaptation for various applications, the efficacy of fine-tuning remains unclear. In this study, we introduce FineTuneBench, an evaluation framework and dataset for understanding how well commercial fine-tuning APIs can successfully learn new and updated knowledge. We analyze five frontier LLMs with commercially available fine-tuning APIs, including GPT-4o and Gemini 1.5 Pro, on their effectiveness in two settings: (1) ingesting novel information, such as recent news events and profiles of new people, and (2) updating existing knowledge, such as updated medical guidelines and code frameworks.

Our results reveal substantial shortcomings in all the models' abilities to effectively learn new information through fine-tuning, with an average generalization accuracy of 37% across all models. When updating existing knowledge, such as incorporating medical guideline updates, commercial fine-tuning APIs show even more limited capability (average generalization accuracy of 19%). Overall, fine-tuning GPT-4o mini is the most effective for infusing new knowledge and updating knowledge, followed by GPT-3.5 Turbo and GPT-4o. The fine-tuning APIs for Gemini 1.5 Flash and Gemini 1.5 Pro are unable to learn new knowledge or update existing knowledge. These findings underscore a major shortcoming in using current commercial fine-tuning services to achieve reliable knowledge infusion in common scenarios.

Model Rankings

Rank  Model                    Memorization ↑  Generalization ↑
1     gpt-4o-mini-2024-07-18   0.99            0.6475
2     gpt-3.5-turbo-0125       0.8975          0.3575
3     gpt-4o-2024-08-06        0.8925          0.2775
4     gemini-1.5-flash-002     0.0925          0.0575
5     gemini-1.5-pro-002       0.05            0.05
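
As a rough illustration of how the Memorization and Generalization columns can be read: Memorization queries the fine-tuned model with the original training question, while Generalization uses a rephrased variant. The sketch below scores answers by simple string containment; the model name, data layout, and scoring rule are assumptions for illustration and may differ from the paper's grading procedure.

# Illustrative evaluation sketch, assuming the openai Python SDK (v1.x).
# The fine-tuned model name and dataset layout are placeholders.
from openai import OpenAI

client = OpenAI()
FINE_TUNED_MODEL = "ft:gpt-4o-mini-2024-07-18:..."  # returned by the fine-tuning job

def ask(model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content or ""

def accuracy(model: str, items: list[dict], question_key: str) -> float:
    # items: [{"question": ..., "rephrased": ..., "answer": ...}, ...]
    correct = sum(
        item["answer"].lower() in ask(model, item[question_key]).lower()
        for item in items
    )
    return correct / len(items)

# memorization = accuracy(FINE_TUNED_MODEL, dataset, "question")
# generalization = accuracy(FINE_TUNED_MODEL, dataset, "rephrased")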

Results

Performance on new knowledge tasks

Figure 2: Performance of fine-tuned LLMs on the original training questions (Memorization) and modified questions (Generalization) for the new knowledge acquisition datasets. A: On the Latest News dataset, we observe strong performance from the OpenAI models on the rephrased questions, especially from gpt-4o-mini. The Gemini models, on the other hand, struggle even to memorize the training data; this pattern holds across all datasets. However, when the date in the question is changed, all models perform poorly, indicating that overfitting has occurred. B: On the Fictional People dataset, we observe a similar trend: the OpenAI models memorize well but perform worse on rephrased queries, while the Gemini models are unable to learn this knowledge. On the secondary (C) and comparison (D) questions, however, none of the models show significant improvement over baseline models that have not been fine-tuned on the new knowledge.

Performance on updating knowledge tasks

Figure 3: Performance of fine-tuned models on the updating knowledge datasets. Compared to the new knowledge datasets, we observe lower performance on the rephrased questions from the Coding dataset. Of note, the models' ability to memorize the Medical dataset questions drops, though their performance on the generalization task (Vignettes) is comparatively stronger.

BibTeX

@article{wu2024finetunebench,
  title={FineTuneBench: How well do commercial fine-tuning APIs infuse knowledge into LLMs?},
  author={Wu, Eric and Wu, Kevin and Zou, James},
  journal={arXiv preprint arXiv:2411.05059},
  year={2024}
}