DEEP LEARNING  ·  EFFICIENCY

FrugalGPT: What Actually Breaks When GPT-2 Gets Frugal?

We tried three ways to make GPT-2 cheaper to adapt. One worked cleanly, one hid a failure mode, and one revealed a teacher-student mismatch. The paper's main contribution is not another compression trick; it is a controlled look at how efficiency methods behave when model capacity itself becomes the constraint.

Ryan D'Cunha  ·  Ethan Hersch  ·  Abhinav Chinta  ·  Stanford University
Equal contribution

TL;DR
LoRA matches full fine-tuning at a fraction of the trainable parameters. Quantization quietly damages generality even when the target task looks fine. And a stronger synthetic-data teacher doesn't always produce a better student — capacity mismatch matters more than raw data quality.

Why this project exists

Fine-tuning language models is expensive in the places that matter for edge or local deployment: trainable parameters, memory footprint, and high-quality task data. We use GPT-2 Small as a micro-scale proxy and ask a sharper question: which efficiency knobs preserve quality under tight constraints, and where do they quietly fail?

Sonnet generation is the stress test because it is open-ended, structured, and unforgiving. Sentiment classification and paraphrase detection anchor the baselines, while sonnets expose whether the model can retain style, syntax, and form after adaptation.

Base model: GPT-2 Small (124M parameters)
Stress-test task: Sonnet generation (open-ended generation)
Best sonnet generation score: 52.838 chrF (synthetic-data test score, 12-sonnet test set)
Main thesis: Capacity changes the rules (small models fail differently)

Setup and baseline

We fine-tune GPT-2 on sentiment classification, paraphrase detection, and Shakespearean sonnet generation. The first two tasks provide standard supervised baselines. Sonnet generation is the harder test because it requires longer-form, structured output and makes degradation easier to spot.

Task | Method | Metric | Dev / Test
SST sentiment | Full fine-tuning | Accuracy | 0.513 / 0.546
CFIMDB sentiment | Full fine-tuning | Accuracy | 0.971
Quora paraphrase | Full fine-tuning | Accuracy | 0.911 / 0.891
Sonnet generation | Full fine-tuning | chrF | 41.974 / 41.078

Important caveat: the sonnet held-out sets are tiny. The reported sonnet test score is measured on only 12 examples, so one unusually good or bad generation can move the final number noticeably.
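
The caveat above can be made concrete with a rough sketch. It assumes, purely for illustration, that the reported score behaves like a simple average of per-sonnet scores; corpus-level chrF aggregates n-gram statistics instead, so treat the output as intuition rather than the paper's evaluation. The name `mean_shift` is hypothetical.

```python
# Assumption: the headline score behaves like a mean of per-example scores.
# This is an illustration of small-sample sensitivity, not the paper's metric code.

def mean_shift(n_examples: int, per_example_change: float) -> float:
    """Change in the mean when exactly one of n_examples scores moves by per_example_change."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return per_example_change / n_examples


if __name__ == "__main__":
    # Compare the 12-sonnet test set with larger hypothetical test sets.
    for n in (12, 100, 1000):
        shift = mean_shift(n, 20.0)
        print(f"n={n:>4}: one sonnet moving 20 points shifts the mean by {shift:.2f}")
```

Under that assumption, a single outlier matters roughly eight times more on a 12-example set than on a 100-example set, which is why small differences between sonnet test scores deserve caution.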


Takeaway 1 — LoRA matches full fine-tuning with far fewer trainable weights

LoRA is the most straightforward efficiency result in the paper. Rank 256 uses about 42.5M trainable parameters instead of updating the full 124M-parameter GPT-2 Small model, and ranks 1–128 all stay within about one chrF point of full fine-tuning. For sonnets, GPT-2 seems to need a targeted style nudge more than a full rewrite of its weights.
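
The parameter count above can be sanity-checked with a short calculation. The text does not say which matrices receive adapters, so the placement below (separate query, key, value, and attention-output projections plus both MLP projections in every block) is an assumption chosen because it lands near the quoted 42.5M at rank 256. `ADAPTED_SHAPES` and `lora_trainable_params` are hypothetical names.

```python
# GPT-2 Small dimensions (standard for the 124M model).
D_MODEL = 768
D_FF = 4 * D_MODEL  # 3072
N_LAYERS = 12

# ASSUMPTION: adapters on these (in_features, out_features) matrices per block.
ADAPTED_SHAPES = [
    (D_MODEL, D_MODEL),  # query
    (D_MODEL, D_MODEL),  # key
    (D_MODEL, D_MODEL),  # value
    (D_MODEL, D_MODEL),  # attention output
    (D_MODEL, D_FF),     # MLP up-projection
    (D_FF, D_MODEL),     # MLP down-projection
]


def lora_trainable_params(rank: int) -> int:
    """Each adapted (d_in, d_out) matrix adds rank * (d_in + d_out) trainable weights."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in ADAPTED_SHAPES)
    return N_LAYERS * per_block


if __name__ == "__main__":
    for rank in (1, 8, 128, 256):
        print(f"rank {rank:>3}: {lora_trainable_params(rank):,} trainable parameters")
```

Because the count grows linearly with rank, the low ranks that stay close to full fine-tuning train orders of magnitude fewer weights than rank 256 does.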

Figure: LoRA train and dev loss across ranks and learning rates. LoRA tolerates larger learning rates and avoids the strongest overfitting seen in full fine-tuning.
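
For readers who want the mechanism behind this takeaway, here is a minimal pure-Python sketch of a LoRA-style linear layer. It is not the project's implementation: the class name `LoRALinear`, the `alpha` scaling convention, and the initialization constants are assumptions that follow the common LoRA recipe (frozen base weight, small random `A`, zero `B`).

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update: y = xW + (alpha / r) * (xA)B."""

    def __init__(self, weight, rank, alpha=None, seed=0):
        rng = random.Random(seed)
        d_in, d_out = len(weight), len(weight[0])
        self.weight = weight  # frozen base weight, shape d_in x d_out
        self.rank = rank
        self.scale = (alpha if alpha is not None else rank) / rank
        # A starts small and random, B starts at zero, so the update is zero at init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(rank)] for _ in range(d_in)]
        self.B = [[0.0] * d_out for _ in range(rank)]

    def forward(self, x):
        base = matmul(x, self.weight)
        update = matmul(matmul(x, self.A), self.B)
        return [[b + self.scale * u for b, u in zip(b_row, u_row)]
                for b_row, u_row in zip(base, update)]

    def trainable_parameters(self):
        d_in, d_out = len(self.A), len(self.B[0])
        return self.rank * (d_in + d_out)


if __name__ == "__main__":
    w = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
    layer = LoRALinear(w, rank=2)
    x = [[1.0, 2.0, 3.0, 4.0]]
    print(layer.forward(x) == matmul(x, w))  # update is zero before any training
    print(layer.trainable_parameters())
```

The zero-initialized `B` means adaptation starts exactly at the pretrained model, which fits the reading that GPT-2 needs a style nudge rather than a rewrite.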

Takeaway 2 — Quantization preserves the target task while damaging generality

Lower precision can make GPT-2 much smaller: the paper plots FP64, FP32, BF16, FP8, INT8, and INT4 settings against sonnet chrF and model size. Quantization-aware fine-tuning (QAFT) recovers much of the sonnet score at low precision, but the zero-shot check shows the catch: paraphrase accuracy falls sharply after INT8 QAFT.
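
As a rough guide to the memory side of this trade-off, the sketch below estimates weight storage for a 124M-parameter model at each listed precision. It assumes every weight is stored at the stated bit width and ignores quantization scales, activations, and any layers kept in higher precision, so real footprints will differ. `weight_megabytes` is a hypothetical helper.

```python
N_PARAMS = 124_000_000  # GPT-2 Small, as stated in the text

# Bits per weight for each precision named in the text.
BITS_PER_WEIGHT = {"FP64": 64, "FP32": 32, "BF16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def weight_megabytes(n_params: int, bits: int) -> float:
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return n_params * bits / 8 / 1e6


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name:>5}: {weight_megabytes(N_PARAMS, bits):8.1f} MB")
```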

Figure: Sonnet generation performance versus model size under quantization. Quantization gives the memory win, but low-bit behavior depends on whether the model is only quantized for inference or adapted under quantization.

Figure: Zero-shot paraphrase degradation after quantization-aware fine-tuning. The sonnet score hides the broader degradation: zero-shot paraphrase accuracy drops after INT8 QAFT.

The hidden failure mode: QAFT recovers the task you trained on. It doesn't preserve the tasks you didn't. BF16 and FP8 preserve much of the generation quality; INT4 saves more memory but hurts performance without specialized training.
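
To illustrate why INT4 is harsher than INT8, here is a small sketch of symmetric per-tensor "fake quantization" (quantize, then dequantize). The text does not specify the quantization scheme used, so the absmax scaling below is an assumption, and `fake_quantize` and `mean_abs_error` are hypothetical names. Quantization-aware training commonly applies a round trip like this in the forward pass, though whether the project does so is not stated.

```python
import random


def fake_quantize(values, bits):
    """Round values onto a symmetric signed integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    absmax = max(abs(v) for v in values)
    if absmax == 0:
        return list(values)
    scale = absmax / qmax               # ASSUMPTION: one scale for the whole tensor
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 0.02) for _ in range(1024)]  # synthetic stand-in weights
    for bits in (8, 4):
        err = mean_abs_error(weights, fake_quantize(weights, bits))
        print(f"INT{bits}: mean absolute round-trip error = {err:.6f}")
```

With only 15 usable levels, the INT4 grid is far coarser than the INT8 grid, which is consistent with the observation that INT4 hurts performance without specialized training.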

Takeaway 3 — Synthetic data helps, but a stronger teacher is not always better

The original sonnet dataset is tiny. To expand it, we prompt Gemini 2.5 Flash Lite, Flash, and Pro to generate up to 1,000 Shakespearean sonnets, then use those synthetic samples for distillation-style fine-tuning. The results are the paper's most interesting finding: quality, cost, and student capacity interact in a non-obvious way.

Gemini 2.5 Flash gives the strongest GPT-2 gains, while Gemini 2.5 Pro generates structurally strong sonnets but does not transfer best to the small student. A more capable teacher can still be a worse fit.

Figure: Synthetic sonnet data quality versus scale for Gemini 2.5 models. Flash is the best cost-performance point for GPT-2 in this setup; Pro improves structure but appears harder for the small student to distill.

Teacher | Perfect sonnet rate | Takeaway
Gemini 2.5 Flash Lite | 56% | Cheapest, but low enough quality that more data can hurt
Gemini 2.5 Flash | 72% | Best downstream fit for GPT-2 Small
Gemini 2.5 Pro | 75% | Most structurally valid, but not the best student result

Best synthetic results: dev chrF 46.605 and test chrF 52.838 (12-sonnet test set), up from a test chrF of 41.078 for the full fine-tuning baseline, a gain of 11.760 points.
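
The "perfect sonnet rate" in the teacher table is not defined on this page. As a loose illustration of how a structural rate of that kind could be computed, the sketch below counts a sample as valid only if it has exactly 14 non-empty lines. That criterion is an assumption, not the paper's definition; a real check would presumably also cover rhyme scheme and meter. `is_structurally_valid` and `valid_rate` are hypothetical names.

```python
def is_structurally_valid(sonnet: str) -> bool:
    """ASSUMED criterion: exactly 14 non-empty lines. Ignores rhyme and meter."""
    lines = [line for line in sonnet.splitlines() if line.strip()]
    return len(lines) == 14


def valid_rate(sonnets) -> float:
    """Fraction of samples that pass the structural check."""
    if not sonnets:
        raise ValueError("need at least one sample")
    return sum(is_structurally_valid(s) for s in sonnets) / len(sonnets)


if __name__ == "__main__":
    # Placeholder text standing in for generated samples.
    complete = "\n".join(f"placeholder line {i}" for i in range(1, 15))
    truncated = "\n".join(f"placeholder line {i}" for i in range(1, 12))
    print(valid_rate([complete, truncated]))
```

A filter like this could also be used to screen synthetic samples before fine-tuning, though the page does not say whether the project filtered its data.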