Streaming3D: Sequential 3D Generation via Evidential Memory

Anonymous Authors
Anonymous Submission
Teaser. Streaming3D extends a frozen view-conditioned 3D generator (e.g., SAM 3D) to long monocular streams. The reconstruction quality progressively improves as more chunks arrive, while the cross-chunk memory footprint stays constant in stream length.

Abstract

View-conditioned 3D generators such as SAM 3D, TRELLIS, and Hunyuan3D produce high-quality object reconstructions from a single view, but real-world visual observations often arrive as long monocular streams. Naively applying these generators to each frame of the stream independently leads to severe temporal inconsistency in the generated results.

To address this problem, we propose Streaming3D, the first training-free streaming mechanism that turns a frozen view-conditioned 3D generator into a streaming generator with constant cross-chunk memory. Streaming3D maintains a compact evidential memory that selectively caches the most informative historical frames according to a proposed evidence score. As the stream progresses, the memory is updated to retain a fixed number of informative frames, so its footprint does not grow with sequence length. This design also prevents degradation over long sequences and leaves the underlying generator completely unchanged: no retraining, architectural modifications, or auxiliary losses are required.

Evaluated on both realistic and synthetic streaming benchmarks, Streaming3D outperforms latent-transport baselines, including KV-cache reuse and flow-based feature editing, on both photometric and geometric metrics. It maintains a constant memory footprint and stable reconstruction quality as sequence length increases.

Framework

Framework of Streaming3D. Given a streaming video, Streaming3D processes frames chunk by chunk. A lightweight warmup pass extracts token-wise evidence scores from the generator's cross-attention. These scores vote for informative frames and update the Adaptive Evidential Memory. The top-K informative frames are then passed to the frozen 3D generator for evidence-based multi-view generation. By retaining only compact evidential memory rather than latent states, Streaming3D achieves stable long-horizon generation with a constant memory footprint.

Method

Our key observation is that a frozen view-conditioned generator already exposes conditioning evidence through its cross-attention maps. During a cheap one-step warmup pass, if a query token in a 3D volume attends to a frame both strongly and selectively, that frame provides confident evidence for the corresponding part of the 3D volume. We treat this as an evidence score for the view with respect to the query token.
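As a concrete illustration of "strongly and selectively", the following minimal Python sketch scores each view for each query token. It assumes that per-view attention mass has already been pooled into a (Q, V) array, that selectivity is measured by the entropy of a token's attention distribution across the views of the chunk, and that magnitude and selectivity are combined by a product. The function name `evidence_scores` and these three choices are illustrative assumptions, not necessarily the exact formulation used by Streaming3D.

```python
import numpy as np

def evidence_scores(attn, eps=1e-12):
    """Hypothetical evidence score: attention magnitude times selectivity.

    attn: (Q, V) non-negative attention mass from each of Q query tokens
          to each of V views (assumed already pooled over image tokens/heads).
    Returns a (Q, V) array of scores.
    """
    attn = np.asarray(attn, dtype=np.float64)
    Q, V = attn.shape
    # Each token's attention distribution over the V views.
    p = attn / (attn.sum(axis=1, keepdims=True) + eps)
    # Shannon entropy, normalized by log(V) so that it lies in [0, 1].
    entropy = -(p * np.log(p + eps)).sum(axis=1)
    norm_entropy = entropy / np.log(V) if V > 1 else np.zeros(Q)
    selectivity = 1.0 - norm_entropy  # 1 = peaked on one view, 0 = uniform
    # Assumed combination rule: product of magnitude and selectivity.
    return attn * selectivity[:, None]
```

Under these assumptions, a token whose attention is spread uniformly over all views contributes near-zero evidence to every view, while a token that concentrates on one view assigns that view a high score.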

1. Evidence Score
A lightweight attention probe over a one-step warmup pass with a frozen prior z0 measures the per-token significance of each incoming view, combining attention magnitude with selectivity (1 − normalized entropy). The frozen prior makes scores comparable across chunks.
2. Adaptive Evidential Memory
Two matrices M, F ∈ ℝ^{Q×D} persistently track each query token's top-D evidence scores and the corresponding global frame indices. Frames that never enter any token's list are discarded immediately. The total footprint is 2 × Q × D scalars, about 50 KB for SAM 3D (Q=4096, D=4), independent of stream length.
3. Evidence-Based Multi-Generation
At each chunk, token-level preferences are aggregated into per-frame ownership counts; the top-K frames form a bounded conditioning bundle. The frozen generator runs on this bundle via confidence-weighted Multi-Diffusion-style fusion in 3D latent space, where each query token's velocity is averaged across views weighted by their per-token evidence.
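The third step above can be sketched as follows. This is a minimal illustration under stated assumptions: the ownership count of a frame is taken to be the number of memory slots in F that reference it, empty slots are marked with -1, and the fusion is a per-token weighted mean of per-view velocities. The names `select_top_k_frames` and `fuse_velocities` are hypothetical, and the real generator call that produces the velocities is not shown.

```python
import numpy as np

def select_top_k_frames(F, K):
    """Pick up to K frames by ownership count.

    F: (Q, D) integer array of global frame indices (-1 marks an empty slot).
    Returns frame indices sorted by descending count (ties: lower index first).
    """
    frames, counts = np.unique(F[F >= 0], return_counts=True)
    order = np.argsort(-counts, kind="stable")
    return frames[order][:K]

def fuse_velocities(velocities, evidence, eps=1e-8):
    """Evidence-weighted average of per-view velocities for each query token.

    velocities: (V, Q, C) velocity predicted under each of V views.
    evidence:   (V, Q) non-negative per-token evidence for each view.
    Returns a (Q, C) fused velocity.
    """
    # Normalize the weights over views independently for every token.
    w = evidence / (evidence.sum(axis=0, keepdims=True) + eps)
    return (w[:, :, None] * velocities).sum(axis=0)
```

In this sketch a token with zero evidence for every view receives a zero velocity; a practical implementation would likely fall back to a uniform average in that case.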
Adaptive Evidential Memory. Given streaming input chunks, our memory is updated automatically by retaining the most informative historical views. The color transition from blue to pink indicates increasing frame indices, from earlier to later observations. As the memory accumulates stronger evidence over time, the reconstruction quality progressively improves.
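A minimal sketch of such a memory update is given below, assuming that each chunk yields a (Q, V) array of evidence scores and that the update simply keeps, for every query token, the D best entries among the stored ones and the new ones. The helper names `init_memory` and `update_memory`, the -1 sentinel for empty slots, and the float64/int64 dtypes are illustrative choices made for clarity, not the storage layout behind the reported footprint.

```python
import numpy as np

def init_memory(Q, D):
    """Empty memory: scores M start at -inf, frame indices F at -1."""
    M = np.full((Q, D), -np.inf)
    F = np.full((Q, D), -1, dtype=np.int64)
    return M, F

def update_memory(M, F, scores, frame_ids):
    """Merge one chunk into the memory, keeping each token's top-D entries.

    M, F:      (Q, D) current scores and global frame indices.
    scores:    (Q, V) evidence scores for the V frames of the new chunk.
    frame_ids: (V,) global indices of those frames.
    """
    D = M.shape[1]
    frame_ids = np.asarray(frame_ids, dtype=np.int64)
    cand_scores = np.concatenate([M, scores], axis=1)  # (Q, D + V)
    cand_frames = np.concatenate(
        [F, np.broadcast_to(frame_ids, scores.shape)], axis=1
    )
    # Stable descending sort: on ties, previously stored entries are kept.
    order = np.argsort(-cand_scores, axis=1, kind="stable")[:, :D]
    M_new = np.take_along_axis(cand_scores, order, axis=1)
    F_new = np.take_along_axis(cand_frames, order, axis=1)
    return M_new, F_new
```

The arrays keep the shape (Q, D) no matter how many chunks are merged, and any frame index that does not survive in F can be dropped by the caller.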

Why this works: two structural properties

Two structural properties distinguish this design from any latent-transport scheme. First, the cross-chunk memory footprint does not scale with stream length. Second, evidence accumulation is monotonic: for each query token, the retained evidence score can only remain unchanged or improve as new frames arrive. The conditioning bundle supplied to the generator is therefore never worse, in evidence-score terms, than the bundle at the previous chunk. KV banks, previous-chunk query banks, and FlowEdit-style velocity edits admit no analogous non-degradation guarantee.
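The monotonicity property can be written out in one line, assuming the memory update is a per-token top-D selection over the union of stored and newly scored frames. The notation s(q, f), C_t, and M_t(q, d) is introduced here for illustration only, and the argument relies on scores being comparable across chunks, which is the role of the frozen prior.

```latex
% Assumed notation: s(q,f) is the evidence score of frame f for query token q,
% C_t is the set of frames in chunk t, and M_t(q,d) is the d-th largest score
% retained for token q after chunk t.
\[
  \{M_t(q,d)\}_{d=1}^{D}
  \;=\; \operatorname{top}_D\!\Big(
        \{M_{t-1}(q,d)\}_{d=1}^{D} \,\cup\, \{s(q,f)\}_{f \in C_t}
      \Big)
\]
\[
  \Longrightarrow\quad
  M_t(q,d) \;\ge\; M_{t-1}(q,d)
  \qquad \text{for all } q \text{ and all } d = 1,\dots,D .
\]
```

The implication holds because the d-th largest element of a set cannot decrease when further elements are added to it.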

Experimental Setup

We evaluate Streaming3D on long-stream 3D generation using the GSO and NAVI datasets, which together stress object-scale streaming with repeated structures, large viewpoint changes, partial observations, and accumulated occlusions. Experiments run on a single NVIDIA H100 GPU with SAM 3D as the underlying generation backbone. Camera poses and depth for the initial input are estimated by Depth Anything 3. We set K=8 and D=1 by default for efficiency, and evaluate streams of length 100.

Baselines

Metrics

Aspect      Metrics
Appearance  PSNR ↑, SSIM ↑, LPIPS ↓, Image FID ↓ on held-out novel views
Geometry    PFID ↓, Chamfer Distance ↓, IoU ↑
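For readers unfamiliar with these quantities, three of them can be sketched in a few lines of NumPy. The conventions below are assumptions chosen for illustration: images scaled to [0, 1], Chamfer Distance as the sum of the two directed mean nearest-neighbor Euclidean distances, and IoU computed on boolean occupancy grids. The evaluation protocol behind the reported numbers may use different conventions (for example squared distances or a different normalization).

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    diff = np.asarray(img, dtype=np.float64) - np.asarray(ref, dtype=np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Dense (N, M) pairwise distances; fine for small sets, O(N*M) memory.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def voxel_iou(occ_a, occ_b):
    """Intersection over union of two boolean occupancy grids of equal shape."""
    occ_a = np.asarray(occ_a, dtype=bool)
    occ_b = np.asarray(occ_b, dtype=bool)
    union = np.logical_or(occ_a, occ_b).sum()
    return float(np.logical_and(occ_a, occ_b).sum() / union) if union > 0 else 1.0
```

SSIM, LPIPS, and the FID-style metrics depend on reference implementations or pretrained networks and are not reproduced here.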

Main Results — GSO & NAVI

Streaming3D achieves the strongest overall performance on both appearance and geometry metrics. The gains are consistent across all metrics, indicating that improvement is not limited to image-level rendering quality but also reflects better 3D structure.

Data  Method                    Appearance                            Geometry
                                PSNR ↑  SSIM ↑  LPIPS ↓  Image FID ↓  PFID ↓   CD ↓   IoU ↑
GSO   TRELLIS.2                 10.563  0.822   0.210    202.808      170.726  0.156  0.480
      TRELLIS+M.D.              10.680  0.850   0.192    245.361      122.780  0.138  0.500
      TRELLIS.2+M.D.            10.611  0.838   0.194    228.937      176.253  0.150  0.493
      SAM3D                     14.178  0.848   0.178    105.197      71.263   0.094  0.664
      Streaming3D (Ours)        15.791  0.864   0.145    76.001       50.472   0.074  0.753
NAVI  TRELLIS.2                 15.492  0.874   0.128    142.487      76.446   0.152  0.682
      TRELLIS+M.D.              14.513  0.861   0.135    140.148      69.098   0.160  0.703
      TRELLIS.2+M.D.            15.672  0.877   0.126    144.276      86.128   0.160  0.684
      SAM3D                     16.159  0.876   0.132    141.496      71.737   0.138  0.721
      Streaming3D (Ours)        16.474  0.879   0.123    134.025      62.743   0.128  0.741

The best result in each dataset block is the Streaming3D (Ours) row, which leads on every reported metric. Streaming3D consistently improves appearance and geometry over single-view and multi-view generation baselines.

Qualitative results on NAVI. Streaming3D produces more consistent and geometrically faithful 3D generations than single-view and multi-view diffusion baselines.

Streaming Baseline Comparison

We compare Streaming3D with several streaming alternatives, including MV-SAM3D-style fixed-view selection and cache- or transport-based streaming such as KV-cache reuse and FlowEdit. Random view sampling is unstable; KV-caches accumulate stale evidence under long camera motion; FlowEdit operates on fixed-size chunks and loses long-range history. In contrast, Streaming3D maintains a compact, persistent, token-level evidence memory.

Data  Method                    Appearance                            Geometry
                                PSNR ↑  SSIM ↑  LPIPS ↓  Image FID ↓  PFID ↓   CD ↓   IoU ↑
GSO   MV-SAM3D, K random views  14.828  0.859   0.156    83.039       68.534   0.064  0.676
      SAM3D + FlowEdit          14.343  0.850   0.178    98.643       76.445   0.090  0.668
      SAM3D + KV-Cache          14.482  0.852   0.171    83.353       67.292   0.084  0.682
      MV-SAM3D + Last Chunk     13.906  0.848   0.184    104.228      86.044   0.113  0.630
      Streaming3D (Ours, K=8)   15.791  0.864   0.145    76.001       50.472   0.074  0.753
Qualitative results of ablation studies. FlowEdit denotes SAM3D with FlowEdit; KV-Cache denotes SAM3D with KV-cache reuse. MV-SAM3D denotes MV-SAM3D applied to the last input chunk, while MV-SAM3D(R) denotes MV-SAM3D with K randomly selected views.

Ablation: Conditioning Chunk Size K

Increasing K generally provides more visual evidence, but the gains are not strictly monotonic. We use K=8 as the default, balancing reconstruction quality and streaming cost; K=16 gives the strongest overall appearance (best PSNR, LPIPS, and Image FID) and the best IoU, while K=12 yields the best PFID and Chamfer Distance.

Data  Method                    Appearance                            Geometry
                                PSNR ↑  SSIM ↑  LPIPS ↓  Image FID ↓  PFID ↓   CD ↓   IoU ↑
GSO   K = 4                     15.724  0.861   0.148    77.544       47.516   0.064  0.754
      K = 8 (default)           15.791  0.864   0.145    76.001       50.472   0.074  0.753
      K = 12                    15.663  0.860   0.150    75.547       44.682   0.060  0.754
      K = 16                    15.912  0.864   0.144    71.449       45.251   0.064  0.764
Additional qualitative comparisons. Streaming3D yields more temporally consistent geometry and texture across long monocular streams, while baselines either oversmooth unobserved regions or accumulate cross-chunk drift.

BibTeX

@misc{anonymous2026streaming3d,
  title  = {{Streaming3D}: Sequential 3D Generation via Evidential Memory},
  author = {Anonymous Authors},
  year   = {2026},
  note   = {Anonymous submission, under review}
}