Home 9 AI 9 More Cameras, More Challenges: The Roadblocks in 3D Human Sensing

More Cameras, More Challenges: The Roadblocks in 3D Human Sensing

by Ruchika Saini, AI | Aug 13, 2025

Why deep learning still struggles with multi-view accuracy, calibration limits, and real-world adaptability in 3D pose estimation.

Motivation: Comparison of different token scanning methods. (a) Cross Attention acts on all image tokens. (b) Projective Attention obtains anchors with perspective projection and selectively attends to sample tokens surrounding the anchor points. (c) The proposed Grid Token-guided Bidirectional Scanning (GTBS) encodes the local context and the joint spatial sequence at the visual feature and person keypoint levels (source: Tech Xplore).

Deep learning-based human pose estimation, once limited to detecting 2D joints with methods such as OpenPose, Mediapipe, and YOLOPose, is now pushing into the more complex realm of 3D multi-person pose estimation. However, this transition remains fraught with difficulties, particularly when leveraging multiple camera views to reconstruct (x, y, z) joint positions in global space, reports Tech Xplore.

Early multi-stage pipelines—where each camera independently estimates 2D keypoints before matching them across views and triangulating 3D positions—suffer from cascading errors. The initial step discards raw pixel data, limiting downstream accuracy, and any misstep in the chain compounds, undermining final pose estimation.

End-to-end learning models attempt to streamline this by processing all multi-view inputs together and directly regressing 3D joint positions. Innovations such as Learnable Triangulation, MvP (which uses geometric attention to fuse multi-view data), and MVGFormer (which tackles overfitting) have emerged—but each comes with its own shortcomings.

A notable breakthrough is the newly proposed MV-SSM model, introduced at CVPR 2025. It employs Projective State Space (PSS) blocks and a Grid Token-guided Bidirectional Scanning (GTBS) mechanism to more effectively model spatial relationships at the feature and keypoint levels. MV-SSM significantly improves generalization across varying camera setups—leading by margins of +24% in three-camera, +13% in rearranged-camera, and +38% in cross-dataset evaluations.

Yet, a critical limitation persists: all current models—including MV-SSM—still assume known camera calibration parameters. Real-world applications often require flexibility in camera count, arrangement, and environmental context—making this assumption a key barrier to broader deployment.