UMI-Underwater: Learning Underwater Manipulation without Underwater Teleoperation

UMI-Underwater tackles two practical bottlenecks in underwater manipulation: data collection cost and cross-domain generalization.

The key idea? Pair autonomous, self-supervised underwater data collection with a depth-based affordance representation that transfers directly from land to water.

Technical Summary Video

Robot Data without Underwater Teleoperation

A major bottleneck in underwater robotics is the human burden in data collection. Most pipelines are teleoperation-centric, and collecting diverse demonstrations in water is expensive and time-consuming.

UMI-Underwater introduces an autonomous self-supervised collector that bootstraps grasp attempts, executes recovery behaviors, and filters episodes using automatic success signals. This turns data collection into repeated deployment instead of repeated teleoperation.
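The attempt, recover, and filter cycle described above can be sketched as a small simulation. This is a minimal illustration, not the authors' collector: the `Episode` record, the `attempt_grasp` stand-in, the retry budget, and the simulated success rate are all hypothetical, and a real system would derive the success flag from an automatic signal on the robot.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One grasp attempt; the fields are a hypothetical record layout."""
    target_id: int
    frames: list = field(default_factory=list)
    success: bool = False


def attempt_grasp(target_id, rng, success_rate):
    """Stand-in for a real rollout: simulate one noisy grasp attempt."""
    ep = Episode(target_id=target_id, frames=["obs_0", "obs_1", "obs_2"])
    # On hardware this flag would come from an automatic check, such as the
    # gripper opening after closing (an assumption, not from the source).
    ep.success = rng.random() < success_rate
    return ep


def collect(num_targets, max_retries=2, success_rate=0.3, seed=0):
    """Attempt each target, retry after failures, keep only successes."""
    rng = random.Random(seed)
    kept, attempts = [], 0
    for target_id in range(num_targets):
        for _ in range(1 + max_retries):
            attempts += 1
            ep = attempt_grasp(target_id, rng, success_rate)
            if ep.success:
                kept.append(ep)  # automatic success filter
                break
            # A recovery behavior (e.g. reopen and back off) would run here.
    return kept, attempts


dataset, attempts = collect(num_targets=50)
print(f"kept {len(dataset)} successful episodes from {attempts} attempts")
```

Only episodes that pass the success check reach the dataset, so the cost of a failed attempt is robot time rather than operator time.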

If underwater data is collected autonomously and is therefore noisy, where do strong manipulation priors come from? Can we import them from abundant on-land human demonstrations?

Land-to-Water Representation Transfer

RGB appearance changes dramatically underwater due to attenuation, scattering, and rapidly changing illumination. Instead of relying on RGB-only prediction, UMI-Underwater uses a depth-based affordance representation that is more stable under these shifts.
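To make the attenuation effect concrete, the sketch below applies a simple exponential (Beer-Lambert) falloff per color channel. The coefficients are illustrative assumptions chosen only to reflect that water absorbs red light faster than blue; they are not measurements from this work, and the model ignores scattering and illumination changes.

```python
import math

# Illustrative per-channel attenuation coefficients in 1/m (assumed values,
# not taken from the source): red is absorbed fastest, blue slowest.
ATTENUATION = {"red": 0.60, "green": 0.10, "blue": 0.05}


def attenuate(rgb, distance_m):
    """Apply exponential attenuation to a linear RGB triple."""
    r, g, b = rgb
    return (
        r * math.exp(-ATTENUATION["red"] * distance_m),
        g * math.exp(-ATTENUATION["green"] * distance_m),
        b * math.exp(-ATTENUATION["blue"] * distance_m),
    )


white = (1.0, 1.0, 1.0)
for d in (0.5, 1.0, 2.0):
    r, g, b = attenuate(white, d)
    print(f"{d:.1f} m -> R={r:.2f} G={g:.2f} B={b:.2f}")
```

The same white surface drifts toward blue-green as the viewing distance grows, while its measured shape does not change, which is the intuition behind preferring a depth-based signal.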

An affordance model trained on on-land handheld demonstrations is deployed underwater zero-shot via geometric alignment, providing task-level guidance before underwater policy training.
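The page does not detail the geometric alignment step. One generic building block for moving an image-space prediction between two cameras is the pinhole model: lift a pixel with metric depth to a 3D point, then project that point with the other camera's intrinsics. The sketch below shows only that building block. The intrinsics are hypothetical, and it assumes both cameras view the point from the same pose, which a real setup would replace with a calibrated transform.

```python
from typing import NamedTuple


class Intrinsics(NamedTuple):
    """Pinhole intrinsics in pixels."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject(u, v, depth_m, k):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    return ((u - k.cx) * depth_m / k.fx, (v - k.cy) * depth_m / k.fy, depth_m)


def project(point, k):
    """Project a 3D camera-frame point back to pixel coordinates."""
    x, y, z = point
    return (k.fx * x / z + k.cx, k.fy * y / z + k.cy)


# Hypothetical intrinsics for a handheld (land) camera and a robot camera.
land_cam = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
robot_cam = Intrinsics(fx=450.0, fy=450.0, cx=320.0, cy=240.0)

# An affordance pixel predicted in the land image, with its measured depth.
grasp_point = backproject(380.0, 300.0, 0.5, land_cam)
pixel_in_robot_cam = project(grasp_point, robot_cam)
print(grasp_point, pixel_in_robot_cam)
```

The same 3D point lands on different pixels in the two cameras, so predictions expressed in metric 3D are easier to carry across hardware than raw pixel coordinates.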

Figure: Affordance labeling examples used to bootstrap robust depth-centric representation transfer.

Figure: UMI-Underwater architecture, combining affordance transfer from on-land data with underwater policy learning.

What makes UMI-Underwater robust?

The method combines a representation layer (depth-based affordance transfer) with a control layer (affordance-conditioned diffusion policy trained on autonomous underwater data).
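As a rough illustration of what "affordance-conditioned" sampling means, the toy below runs a standard DDPM-style reverse loop over a 2D action. It is a sketch under strong assumptions: the learned denoising network is replaced by a closed-form stand-in that treats the affordance point as the clean action, and the schedule, step count, and action space are hypothetical rather than taken from the paper.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus cumulative products of (1 - beta)."""
    betas = [
        beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        for t in range(num_steps)
    ]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def toy_noise_predictor(x_t, alpha_bar, affordance_xy):
    """Stand-in for a learned network: assumes the clean action is the
    affordance point and returns the noise that this assumption implies."""
    scale = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [(x - scale * a) / sigma for x, a in zip(x_t, affordance_xy)]


def sample_action(affordance_xy, num_steps=50, seed=0):
    """Denoise a random action step by step, conditioned on the affordance."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in affordance_xy]  # start from pure noise
    for t in reversed(range(num_steps)):
        eps = toy_noise_predictor(x, alpha_bars[t], affordance_xy)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(1.0 - betas[t])
                for xi, ei in zip(x, eps)]
        if t > 0:
            # Intermediate steps re-inject noise; the final step does not.
            x = [m + math.sqrt(betas[t]) * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x


action = sample_action(affordance_xy=[0.12, -0.30])
print(action)
```

In a trained policy the noise predictor would be a network that takes observations and the affordance signal as input; the conditioning enters the loop in the same place as `affordance_xy` does here.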

Representation Robustness
Depth-centric affordance transfer gives a stable geometric signal when underwater color and illumination shift, so initialization remains reliable before large-scale underwater policy training.

Policy Robustness
Affordance-conditioned diffusion policies trained on autonomous underwater data sustain stronger grasp performance than RGB-only baselines under background shift, and they transfer to objects that appear only in on-land demonstrations.

System & Deployment Loop

UMI-Underwater uses a robot-centric loop that repeatedly attempts grasps, verifies outcomes, and retries when needed. This closed-loop structure increases data efficiency and keeps demonstration quality high without manual intervention.

In-the-wild Deployment

We also deploy the system in the ocean at Stanford's Hopkins Marine Station, demonstrating the method's potential for real-world impact in challenging underwater environments.

Paper, Authors, and BibTeX

Hao Li*, Long Yin Chung*, Jack Goler, Ryan Zhang, Xiaochi Xie, Huy Ha, Shuran Song, Mark Cutkosky

Stanford University (* Equal contribution)

@article{li2026umiunderwater,
  title   = {UMI-Underwater: Learning Underwater Manipulation without Underwater Teleoperation},
  author  = {Li, Hao and Chung, Long Yin and Goler, Jack and Zhang, Ryan and Xie, Xiaochi and Ha, Huy and Song, Shuran and Cutkosky, Mark},
  journal = {Preprint},
  year    = {2026}
}

If you have any questions, feel free to contact Hao and Clive.

Questions & Answers

Why not rely on RGB-only policies underwater?

Underwater RGB appearance is highly unstable because of attenuation, scattering, and changing illumination. A depth-based affordance representation is more transferable across land and water, especially for geometric grasp reasoning.

How does this reduce human burden compared to teleoperation?

The system collects successful underwater grasp demonstrations autonomously with self-supervision, recovery behaviors, and automatic success filtering. This changes data collection from manual operation to repeatable autonomous deployment.

What does zero-shot transfer mean here?

The affordance predictor is trained on on-land human demonstrations and deployed underwater directly, without underwater re-labeling of affordance targets, by using geometric alignment between domains.

What empirical behavior is reported?

In pool experiments, the method improves grasp success and background-shift robustness, and it generalizes to objects that were only seen in on-land data, outperforming RGB-only baselines.

Current limitations?

Performance still depends on reliable depth cues and robust low-level execution in challenging underwater dynamics. Extending to broader tasks and harder environments remains an important next step.
