TDQC: Sequential Calibration for VLA Models

Abstract

Sequential Calibration for Robot Manipulation

Recent advances in vision-language-action (VLA) models for robotics have highlighted the importance of reliable uncertainty quantification in sequential tasks. However, assessing and improving calibration in such settings remains mostly unexplored, especially when only partial trajectories are observed. In this work, we formulate sequential calibration for episodic tasks, where task-success confidence is produced along an episode, while success is determined at the end of it. We introduce a sequential extension of the Brier score and show that, for binary outcomes, its risk minimizer coincides with the VLA policy's value function. This connection bridges uncertainty calibration and reinforcement learning, enabling the use of temporal-difference (TD) value estimation as a principled calibration mechanism over time. We empirically show that TD calibration improves performance relative to the state-of-the-art on simulated and real-robot data. Interestingly, we show that when calibrated using TD, the VLA's single-step action probabilities can yield competitive uncertainty estimates, in contrast to recent findings that employed different calibration techniques.

Experiments

Results

Evaluated on three robot manipulation benchmarks — LIBERO-10 (simulation), WidowX (real robot), and Franka (real robot) — across four frozen VLA backbones: OpenVLA, UniVLA, \(\pi_0\), and \(\pi_0\)-FAST. Models are trained on the seen task split and evaluated on the held-out unseen split to test out-of-distribution generalization.

Brier Score

↓ Lower is better

	VLA Model	OpenVLA		OpenVLA		UniVLA		\(\pi_0\)-FAST		\(\pi_0\)-FAST		\(\pi_0\)
	Benchmark	LIBERO		WidowX		LIBERO		LIBERO		Franka		LIBERO
	Eval Split	Seen	Unseen	Seen	Unseen	Seen	Unseen	Seen	Unseen	Seen	Unseen	Seen	Unseen
Static Baselines	Max prob.	0.395	0.390	0.572	0.579	0.909	0.899	0.320	0.318	0.331	0.323	—	—
	Avg prob.	0.348	0.364	0.275	0.282	0.529	0.532	0.212	0.218	0.290	0.294	—	—
	Running Avg prob.	0.338	0.356	0.255	0.257	0.543	0.544	0.244	0.238	0.359	0.361	—	—
	Avg entropy	0.306	0.313	0.414	0.426	0.406	0.391	0.209	0.222	0.281	0.281	—	—
	Running Avg entropy	0.265	0.273	0.435	0.432	0.343	0.330	0.279	0.264	0.341	0.339	—	—
Learned Baselines (white-box)	SAFE-RNN	0.204	0.255	0.169	0.213	0.124	0.162	0.106	0.148	0.220	0.288	0.123	0.172
	SAFE-RNN-TDQC Ours	0.197	0.218	0.096	0.153	0.064	0.100	0.103	0.163	0.150	0.215	0.061	0.097
	SAFE-MLP BCE	0.192	0.231	0.127	0.164	0.091	0.158	0.103	0.162	0.206	0.248	0.075	0.137
	SAFE-MLP-TDQC Ours	0.195	0.229	0.130	0.169	0.066	0.131	0.109	0.150	0.210	0.229	0.068	0.128
Learned Baselines (black-box)	RNN-BCE	0.199	0.206	0.301	0.344	0.138	0.152	0.122	0.158	0.237	0.243	—	—
Learned Baselines (black-box)	RNN-TDQC Ours	0.191	0.197	0.156	0.192	0.100	0.107	0.105	0.141	0.204	0.228	—	—

Table 1 — Brier Score ↓. Results averaged over 21 seeds. "—" = metric not computable for this method. Red bold = best per column. TDQC achieves the lowest sequential Brier scores on all benchmarks.

Application 1

Early Stopping With Conformal Prediction

TDQC's calibrated success probabilities are thresholded via conformal prediction to trigger safe trajectory termination before task failure.

ROC-AUC

↑ Higher is better

	VLA Model	OpenVLA		OpenVLA		UniVLA		\(\pi_0\)-FAST		\(\pi_0\)-FAST		\(\pi_0\)
	Benchmark	LIBERO		WidowX		LIBERO		LIBERO		Franka		LIBERO
	Eval Split	Seen	Unseen	Seen	Unseen	Seen	Unseen	Seen	Unseen	Seen	Unseen	Seen	Unseen
Static Baselines	Max prob.	54.64	55.78	53.25	53.20	50.00	50.00	61.75	63.49	48.61	46.64	—	—
	Avg prob.	47.08	48.09	47.47	48.30	42.96	40.25	47.36	48.09	49.45	48.03	—	—
	Running Avg prob.	49.19	47.72	49.20	46.37	43.05	40.29	53.95	55.68	52.95	49.98	—	—
	Avg entropy	46.81	46.75	50.19	49.36	41.34	47.36	45.42	46.30	49.28	49.07	—	—
	Running Avg entropy	50.48	48.09	46.16	43.99	51.63	56.71	53.78	55.44	51.12	48.78	—	—
Learned Baselines (white-box)	SAFE-RNN	72.30	69.04	75.95	70.00	74.48	68.13	91.26	85.88	70.85	56.69	79.96	65.03
	SAFE-RNN-TDQC Ours	71.67	65.12	84.01	70.76	73.83	64.25	92.03	85.49	79.89	68.43	88.66	82.94
	SAFE-MLP	73.56	69.53	88.23	83.18	74.39	63.80	79.00	68.36	79.15	63.94	86.33	79.75
	SAFE-MLP-BCE	72.66	64.99	85.38	71.43	78.83	69.74	91.82	85.50	73.82	58.83	89.71	80.53
	SAFE-MLP-TDQC Ours	71.22	60.09	82.33	70.64	76.55	66.71	90.25	84.44	61.95	51.22	86.57	72.07
Learned Baselines (black-box)	RNN-BCE	72.70	72.28	69.69	67.09	63.44	54.37	87.67	82.53	59.32	53.26	—	—
Learned Baselines (black-box)	RNN-TDQC Ours	74.20	72.72	78.90	72.97	72.30	70.17	87.51	83.39	64.19	57.37	—	—

Table 2 — ROC-AUC ↑. Results averaged over 21 seeds. "—" = metric not computable. Red bold = best per column. TDQC achieves SOTA results on all benchmarks except WidowX and \(\pi_0\)-FAST.

Figure 2 —ROC-AUC vs Brier score

ROC-AUC vs Brier score over all learned baselines in all benchmarks at the minimum rollout length. Points are grouped by method and split, with a dashed linear fit; the Spearman correlation is \(\rho = -0.686\) which indicates high negative correlation.

LIBERO-10 (Simulation) — OpenVLA

These LIBERO rollouts compare TDQC (SAFE-RNN-TD0) against the SAFE-RNN-TDQC cross-entropy baseline on identical trajectories.

Task 3 — Put the black bowl in the bottom drawer of the cabinet and close it

RNN-TDQC (Top-10): Actual ✅ → Predicted ✅

TDQC correctly predicts a successful outcome on Task 3. The probe's \(f_\theta(h_t)\) shows spikes in the curve on 'uncertain' parts: when the robot grasps the bowl and when it drops it.

SAFE-RNN-TDQC (Features): Actual ✅ → Predicted ❌

The SAFE-RNN-TDQC fails on the same episode in Task 3: it predicts failure in the middle of a successful trajectory — a false negative TDQC avoids. The conformal prediction band is wider, indicating more uncertainty in \(f_\theta(h_t)\).

SAFE-MLP (Cumsum Loss): Actual ✅ → Predicted ❌

The SAFE-MLP also fails the same episode in Task 3: the failure score rises steadily over the trajectory and rises above the threshold in an unexplained manner.

Task 5 — Pick up the book and place it in the back compartment of the caddy

RNN-TDQC (Top-10): Actual ❌ → Predicted ❌

TDQC correctly identifies a failed trajectory on Task 5. \(f_\theta(h_t)\) declines only when the robot is stuck, enabling timely intervention.

SAFE-RNN-TDQC (Features): Actual ❌ → Predicted ❌

The SAFE-RNN-TDQC baseline also predicts failure on Task 5, but with temporally noisy estimates. Compared to RNN-TDQC (Top-10) that produces smoother failure predictions.

SAFE-MLP (Cumsum Loss): Actual ❌ → Predicted ❌

The SAFE-MLP also predicts failure, from frame 3, early in the trajectory.

WidowX Real-Robot — OpenVLA

Six real-robot clips comparing TDQC, SAFE-RNN-TDQC, and SAFE-MLP on two tasks. On Task 4 only TDQC correctly predicts success; on Task 5 the SAFE-MLP produces a dangerous false positive.

Task 4 — Put blue cup on plate

RNN-TDQC (Top-10): Actual ✅ → Predicted ✅

TDQC correctly predicts success. The failure score decreases monotonically as the rollout progresses, reflecting the accumulation of trajectory information over time.

SAFE-RNN-TDQC (Features): Actual ✅ → Predicted ❌

SAFE-RNN-TDQC incorrectly predicts failure on the same successful rollout (false negative). The failure score rises above the threshold from the first frame

SAFE-MLP (Cumsum Loss): Actual ✅ → Predicted ❌

The SAFE-MLP also fails (false negative), predicting failure in the second frame, while no failure indication is present.

Task 5 — Put the red bottle into pot

RNN-TDQC (Top-10): Actual ❌ → Predicted ❌

TDQC correctly predicts failure, but \(f_\theta(h_t)\) rises above the threshold from the second frame.

SAFE-RNN-TDQC (Features): Actual ❌ → Predicted ❌

SAFE-RNN-TDQC also predicts failure. When the failure score rises above the threshold - the robot was uable to grasp the red bottle.

SAFE-MLP (Cumsum Loss): Actual ❌ → Predicted ✅

The SAFE-MLP produces a false positive: it predicts success throughout a failing trajectory.

Application 2

Test-Time Guided Action Search

The TDQC value predictor guides action selection within policy-sampled actions, yielding improved success rates without retraining the VLA.

Rollout comparison success rate across all tasks

Figure 3 — Averaged success rates over additional simulation steps for action selection configurations

All experiments evaluate 3 unseen tasks from LIBERO-10 taken with OpenVLA, consisting of 50 rollouts each, avereged over 3 seeds. RNN-TDQC achieves highest success rates, while RNN-TDQC with thresholds tradeoff success rate and compute.

We measured the success rates of OpenVLA on three unseen tasks from the LIBERO-10 benchmark. We compared the standard, unmodified OpenVLA policy (Baseline) against several value-guided search configurations, using the action-selection procedure described in the paper. In Figure 3, we show the success rate relative to the increase in test-time compute. RNN with TDQC or BCE action selection methods use the output of the f_θ network for guidance at all time steps (that is, the threshold is T = -∞), and are trained using either TDQC or BCE loss. The TDQC - Thresh method saves compute by applying a confidence threshold, as described above. In this experiment, we used a threshold T = 0.35, which achieved a good tradeoff between success rate and compute. Appendix A gives a more detailed explanation and analysis. To increase variance in sampled actions, we generated 10 samples per time step in all value-guided methods using a sampling temperature of 1.5. All experiments were performed on 3 unseen tasks, with 50 rollouts each.

Task 3 — Put the black bowl in the bottom drawer of the cabinet and close it

Baseline (OpenVLA policy)

The baseline chooses the actions with highest probabilities.

RNN-TDQC (top-10)

RNN-TDQC-based action selection (10 candidates, temperature 1.5, no threshold)

RNN-TDQC (top-10, thresh=0.35)

RNN-TDQC-based action selection with threshold \(\bar{T}= 0.35 \) (10 candidates, temperature 1.5)

Probes Entropy

Action selection using probability-entropy scoring (10 candidates, temperature 1.5, no threshold).

RNN-TDQC (BCE)

The BCE-based RNN-TDQC (10 candidates, temperature 1.5, no threshold)

Task 4 — Put the white mug on the left plate and put the blue and white mug on the right plate

Baseline (OpenVLA policy)

The baseline chooses the actions with highest probabilities.

RNN-TDQC (top-10)

RNN-TDQC-based action selection (10 candidates, temperature 1.5, no threshold).

RNN-TDQC (top-10, thresh=0.35)

RNN-TDQC-based action selection with threshold \(\bar{T}= 0.35 \) (10 candidates, temperature 1.5).

Probes Entropy

Action selection using probability-entropy scoring (10 candidates, temperature 1.5, no threshold).

RNN-TDQC (BCE)

The BCE-based RNN-TDQC (10 candidates, temperature 1.5, no threshold).

Task 9 — Put the yellow and white mug in the microwave and close it

Baseline (OpenVLA policy)

The baseline chooses the actions with highest probabilities.

RNN-TDQC (top-10)

RNN-TDQC-based action selection with threshold \(\bar{T}= 0.35 \) (10 candidates, temperature 1.5)

RNN-TDQC (top-10, thresh=0.35)

RNN-TDQC-based action selection with threshold \(\bar{T}= 0.35 \) (10 candidates, temperature 1.5)

Probes Entropy

Action selection using probability-entropy scoring (10 candidates, temperature 1.5, no threshold).

RNN-TDQC (BCE)

The BCE-based RNN-TDQC (10 candidates, temperature 1.5, no threshold)

Method

How TDQC Works

Problem Setting

Let \(h_t\) denote the history of observations and actions up to time \(t\) in a trajectory of length \(T\), and let \(y \in \{0,1\}\) denote final task success. We wish to learn a function \(f_\theta : \mathcal{H} \to [0,1]\) such that \(f_\theta(h_t)\) is a calibrated estimate of \(\Pr[\text{success} \mid h_t]\) at every step.

Standard post-hoc calibration (e.g., Platt scaling, temperature scaling) was designed for i.i.d. classification. In robotics, observations arrive as a temporally correlated sequence and the outcome label - binary task success - is revealed only at the end. This raises two core challenges:

Temporal inconsistency: the confidence in the current decision may depend on the confidence of future decisions.

Label scarcity: the ground-truth label is typically delayed until episode completion.

Sequential Brier Score & the Value-Function Connection

The standard per-step Brier score \(\mathbb{E}[(f_\theta(x) - y)^2]\) ignores temporal structure. We instead minimize the sequential Brier score - the sum of squared prediction errors across all time-steps:

\mathcal{L}_{\text{seq}}(\theta) = \mathbb{E}\!\left[\sum_{t=1}^{T}(f_\theta(h_t) - y)^2 \big| h_t\right]

The unique minimizer satisfies \(f_\theta(h_t) = \mathbb{E}[y \mid h_t]\), i.e., the policy's value function \(V^\pi(h_t)\). This motivates the TDQC training objective, which telescopes the loss into:

$$G(\theta) = \sum_{t=1}^{T-1}\!\bigl(f_\theta(h_t) - f_\theta(h_{t+1})\bigr)^2 \;+\; \bigl(f_\theta(h_T) - y\bigr)^2$$ TD-0 Loss in sequential settings - TD consistency terms (left) enforce temporal coherence; the terminal term (right) anchors the final prediction to the ground-truth label.

TDQC Architecture

We parameterize \(f_\theta\) as in the SAFE paper - a model that consumes the hidden states (or action probabilities) of the frozen VLA backbone at each step and outputs a success probability. We evaluate the following variants:

SAFE-MLP (white-box): A shallow MLP applied independently at each step, with cumulative losses.
SAFE-MLP-BCE (white-box): A shallow MLP trained with binary cross-entropy loss.
SAFE-MLP-TDQC (white-box): A shallow MLP trained with the TD-0 loss \(G(\theta)\).
SAFE-RNN (white-box): An RNN that processes the full history \(h_t\), enabling temporally coherent predictions.
SAFE-RNN-TDQC (white-box): The same RNN trained with the full TDQC loss - our best-performing white-box variant.
RNN-BCE (black-box): A sequential model on action probabilities trained with binary cross-entropy.
RNN-TDQC (black-box): A sequential model on action probabilities trained with the TD-0 loss \(G(\theta)\).

Training Procedure

Algorithm 1 — TDQC Training

Input: policy \(\pi\), calibration dataset \(\mathcal{D}_{\text{cal}} = \{(h_T^i, y^i)\}_{i=1}^N\)

1 Initialize network weights \(f_\theta,\, f_{\theta^-}\)

2 for until convergence do

3 Sample \(h_T^i\) uniformly from \(\mathcal{D}_{\text{cal}}\)

4 Compute the TDQC loss using a target network:

\(G(\theta) = \sum\limits_{t=1}^{T-1}(f_\theta(h_t^i) - f_{\theta^-}(h_{t+1}^i))^2 + (f_\theta(h_T^i) - y^i)^2\)

5 Update \(\theta \leftarrow \theta - \alpha\,\nabla G(\theta)\)

6 Periodically update target network \(\theta^- \leftarrow \theta\)

7 end for

Temporal Difference Calibration in Sequential Tasks:
Application to Vision-Language-Action Models

Sequential Calibration for Robot Manipulation

What TDQC Brings

Sequential Calibration Framework

RL Connection

Black-Box Practicality

Empirical Strength

Policy Improvement as a By-Product

Results

Brier Score

Early Stopping With Conformal Prediction

ROC-AUC

Test-Time Guided Action Search

How TDQC Works

Problem Setting

Sequential Brier Score & the Value-Function Connection

TDQC Architecture

Training Procedure

Cite this Work

Temporal Difference Calibration in Sequential Tasks:Application to Vision-Language-Action Models

Sequential Calibration for Robot Manipulation

What TDQC Brings

Sequential Calibration Framework

RL Connection

Black-Box Practicality

Empirical Strength

Policy Improvement as a By-Product

Results

Brier Score

Early Stopping With Conformal Prediction

ROC-AUC

Test-Time Guided Action Search

How TDQC Works

Problem Setting

Sequential Brier Score & the Value-Function Connection

TDQC Architecture

Training Procedure

Cite this Work

Temporal Difference Calibration in Sequential Tasks:
Application to Vision-Language-Action Models