Crop Yield Prediction from Satellite Time Series

A field-level crop-yield model that turns multispectral Sentinel-2 imagery into learned ViT embeddings, then aggregates the growing season with a temporal transformer.

TL;DR
We predict one scalar yield per field in the YieldSAT dataset[1] from a sequence of 12-band Sentinel-2 chips. The best version uses a learned 12-to-3 spectral adapter, a frozen DINOv3 satellite ViT backbone, a single hidden-layer MLP which embeds data in a 256-dim vector, and a Hugging Face time-series transformer. This model improves over the convolution and RNN-grounded baselines from the original YieldSAT paper.

The task

Crop yield prediction is a field-level regression problem. For each agricultural field, we observe a growing-season sequence of satellite images and predict one yield value. The raw data can be difficult to work with, but combining satellite imaging with multi-modal field features, this can become an advanced time-series problem. We can incorporate state of the art satellite foundation models (dinov3-vit7b16-pretrain-sat493m[2]) with time-series transformers to complete this task.

Input shape Each raw chip is a Sentinel-2 TIFF with 12 spectral bands over a time series. The image encoder sees tensors shaped [B, 12, H, W].

The project uses the YieldSAT dataset across Argentina, Brazil, Uruguay, and Germany. The core modeling question is simple: can we reuse a strong satellite vision backbone to avoid learning all spatial features from scratch, then spend the supervised signal on the temporal yield problem?

The architecture

The main model separates representation learning from yield regression. Each image is encoded independently by a DINOv3 model, and the resulting sequence of embeddings is passed into a time-series transformer. In symbols, each field has images \(x_{i,1}, \ldots, x_{i,T}\), and each image becomes a compact vector \(z_{i,t}\):

12-band Sentinel-2 image
  -> learned 12-to-3 channel adapter
  -> DINOv3 ViT-7B satellite backbone
  -> pooled DINOv3 vector in R^4096
  -> MLP, 4096 GeLU -> 4096 -> 256
  -> per-image embedding z_t in R^256
  -> time-series transformer
  -> scalar yield prediction

ViT plus temporal transformer architecture diagram — ViT plus temporal transformer pipeline. DINOv3 produces a 4096-D representation that is projected to a 256-D image embedding.

Preprocessed LSTM baseline architecture diagram — Preprocessed LSTM baseline over fixed 24-step feature sequences.

3D-CNN plus TCN raw image baseline architecture diagram — 3D-CNN plus TCN raw-image baseline.

The pipeline

The data and training jobs run on Modal volumes. Raw imagery lives in the yield-sat volume, while checkpoints and exported SSL embeddings live in yield-sat-checkpoints. The workflow is:

Resize raw Sentinel-2 chips to 224 by 224 while preserving all 12 bands.
Train or load the ViT SSL checkpoint.
Export one 256-D embedding per field image.
Build field-level yield targets from raw metadata.
Train a temporal regressor over the saved embedding sequences.

Compute. Most training and embedding export jobs were run on NVIDIA L4 GPUs via Modal. Larger end-to-end raw-image and saliency jobs used NVIDIA L40S GPUs, while early DINOv3 preprocessing smoke tests used A100-40GB GPUs.

modal run modal_jobs/export_best_embeddings_modal.py::export_best_embeddings \
  --countries Argentina,Brazil,Uruguay,Germany

uv run python modal_jobs/train_timeseries_transformer.py \
  --embeddings-root downloads/final_embeddings_best_full/Argentina \
  --targets-path targets/targets_argentina.json \
  --country Argentina \
  --batch-size 8 \
  --epochs 20 \
  --context-length 32

What we compared

We evaluate the ViT temporal-transformer pipeline against baselines that either use preprocessed time-series features or learn directly from raw imagery. The most important comparison is between the all-feature preprocessed LSTM and the ViT model, because both are available across all four countries.

Model	Input / adapter	Arg. RMSE	Arg. R2	Bra. RMSE	Bra. R2	Uru. RMSE	Uru. R2	Ger. RMSE	Ger. R2
ViT + temporal transformer	Linear projection	0.610	-	1.522	-	0.838	-	2.399	-
ViT + temporal transformer	MLP projection	0.450	0.970	1.350	0.630	0.745	0.398	1.746	0.610
Preprocessed LSTM	All features	1.381	0.760	1.412	0.560	1.192	0.370	1.992	0.670
3D-CNN + TCN	Raw image baseline	0.874	-	2.191	-	0.967	-	2.047	-

Additional baselines were run only on Argentina, so they should not be read as four-country comparisons. They are still useful because they probe the value of image information and different sequence architectures.

Model	Input	Argentina RMSE
Preprocessed transformer	Non-image features	1.866
Preprocessed RNN	Non-image features	1.835
Preprocessed LSTM	Non-image features	1.839
ResNet50 + LSTM	Raw image baseline	2.851

The original paper trains each country separately. We aggregate each country into 1 mixed dataset and compare our transformr-based approach to these other models when training on this dataset. Our findings are that our model has cleaner training dynamics:

Training curves comparing baseline and end-to-end crop yield models — Training-curve export for the all-data baseline and end-to-end model experiments.

And can generalize better when all of these fields are aggregated:

Model	Validation RMSE
Baseline 3D CNN-LSTM	2.489
3D CNN-TCN	2.172
ViT + time-series transformer	1.499

We think this is due to the ability of a foundational image encoder (DINOv3) to generalize better across all countries and crops.

What worked

The strongest ViT variant contains a single hidden layer adapter. We find that LoRA offers no gain (since our dataset isn't large enough), so training an MLP adapter is a better form of fine tuning DINOv3. It improves validation RMSE over the preprocessed LSTM in Argentina, Brazil, and Uruguay, and it substantially closes the gap on Germany. The gains are largest in Argentina, where the ViT model reaches 0.450 RMSE and \(R^2 = 0.970\).

The likely reason is that the frozen satellite backbone already contains useful spatial features: crop texture, field structure, vegetation patterns, and other states that a general CNN could learn. The temporal transformer then learns how those image states evolve over the season.

Important caveat. The Argentina-only baselines and multi-country baselines answer different questions. I keep them in separate tables to avoid implying that every baseline was run across every country.

Takeaways

A clear takeaway is that foundational satellite models can learn important agricultural features that we can use. A large satellite ViT gives the temporal yield model a much better starting point than learning image features from scratch, and its base weights can remain frozen.

We also looked at saliency maps to understand what the image models were using inside a field. The maps below compare the ViT-based encoder and a ResNet-LSTM raw-image baseline on the same Argentina wheat field. Brighter regions indicate larger input or adapter gradients for the selected prediction target.

ViT saliency map for an Argentina wheat field — ViT saliency for a wheat field in Argentina. The gradient signal is spread across field texture and interior spatial patterns, suggesting the DINOv3 representation uses broad within-field structure rather than one isolated patch.

CNN saliency map for an Argentina wheat field — ResNet-LSTM saliency on the same field. The strongest gradients are more localized, which is consistent with a smaller image model relying on narrower visual evidence.

This qualitative comparison supports the quantitative result: the satellite foundation model appears to provide a richer spatial prior before the temporal transformer ever sees the sequence.

References

Miro Miranda, Deepak Pathak, Patrick Helber, Benjamin Bischke, Hiba Najjar, Francisco Mena, Cristhian Sanchez, Akshay Pai, Diego Arenas, Matias Valdenegro-Toro, Marcela Charfuelan, Marlon Nuske, and Andreas Dengel. YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction. arXiv:2604.00940, 2026. arXiv · project page
Meta AI. DINOv3 ViT-7B/16 pretrained on SAT-493M. Hugging Face model page. facebook/dinov3-vit7b16-pretrain-sat493m

Predicting Crop Yield from Satellite Time Series

The task

The architecture

The pipeline

What we compared

What worked

Takeaways

References