Three-dimensional motion tracking from monocular video has long been a difficult problem in computer vision, especially when pixel-level accuracy is required over long video sequences. Traditional methods struggle with efficiency, accuracy, and robustness, and often fall short of what practical applications need. Here the editor of Downcodes introduces a recent research result, DELTA, which makes significant progress toward efficient and accurate three-dimensional motion tracking.
Existing techniques are computationally demanding and struggle to stay efficient on long videos. Long-term tracking is also affected by camera movement and object occlusion, both of which lead to tracking errors.
Current methods for motion estimation in video sequences each have their own advantages and disadvantages. Optical flow provides dense pixel tracking but lacks robustness in complex scenes, especially over long sequences.
Scene flow extends optical flow by estimating dense three-dimensional motion from RGB-D data or point clouds, but it remains difficult to apply efficiently to long sequences. Point tracking methods can capture motion trajectories and combine spatial and temporal attention to achieve smoother tracking, but their high computational cost makes dense tracking hard to achieve. Reconstruction-based tracking methods use deformation fields to estimate motion, but are not practical for real-time applications.
Recently, a research team from the University of Massachusetts Amherst, the MIT-IBM Watson AI Lab, and Snap Inc. proposed DELTA (Dense Efficient Long-range 3D Tracking for Any video), a method designed to efficiently track every pixel in three-dimensional space. DELTA starts with low-resolution tracking, employs a spatiotemporal attention mechanism, and then applies an attention-based upsampler to reach high-resolution accuracy. Its key innovations include an upsampler that preserves sharp motion boundaries, an efficient spatial attention architecture, and a logarithmic depth representation that improves tracking performance.
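The coarse-to-fine idea can be illustrated with a toy sketch. This is not DELTA's implementation; it only assumes that each high-resolution pixel forms softmax weights over a few neighboring low-resolution motion vectors and takes their weighted sum. In a learned upsampler the scores would be predicted by a network; here the function names and the hand-picked scores are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def upsample_pixel(coarse_motions, scores):
    """Blend neighboring coarse motion vectors with attention weights.

    coarse_motions: list of (dx, dy, dz) vectors from nearby low-res cells.
    scores: one unnormalized attention score per neighbor (hypothetical;
            a real model would predict these from image features).
    """
    weights = softmax(scores)
    return tuple(
        sum(w * m[axis] for w, m in zip(weights, coarse_motions))
        for axis in range(3)
    )

# Two neighbors on opposite sides of a motion boundary.
neighbors = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
# A strongly peaked score keeps the pixel on one side of the boundary.
sharp = upsample_pixel(neighbors, [8.0, 0.0])
# Equal scores average the two motions and blur the boundary.
blurry = upsample_pixel(neighbors, [0.0, 0.0])
```

With peaked weights the upsampled motion stays close to one neighbor, whereas uniform weights collapse the two opposing motions toward zero, which is the kind of boundary blurring an attention-based upsampler is meant to avoid.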
DELTA achieves state-of-the-art results on the CVO and Kubric3D datasets, improving by more than 10% on metrics such as average Jaccard (AJ) and the 3D average percent of points within delta (APD3D), and it also performs strongly on 3D point tracking benchmarks such as TAP-Vid3D and LSFOdyssey. Unlike existing methods, DELTA achieves dense 3D tracking at scale and runs more than 8 times faster than previous methods while maintaining state-of-the-art accuracy.
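For intuition about threshold-based tracking metrics such as APD3D, the sketch below computes the fraction of predicted 3D points that fall within a distance threshold of the ground truth, averaged over several thresholds. The function name, the threshold values, and the sample points are illustrative assumptions, not the official benchmark definitions.

```python
import math

def avg_pct_within(pred, gt, thresholds=(0.1, 0.2, 0.4)):
    """Average, over thresholds, of the fraction of points whose
    Euclidean error is below each threshold (thresholds are hypothetical)."""
    errors = [math.dist(p, g) for p, g in zip(pred, gt)]
    fractions = [
        sum(e < t for e in errors) / len(errors) for t in thresholds
    ]
    return sum(fractions) / len(fractions)

# Made-up ground-truth and predicted 3D points for one frame.
gt = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 1.0, 3.0)]
pred = [(0.0, 0.0, 1.05), (1.0, 0.3, 2.0), (0.0, 1.0, 4.0)]
score = avg_pct_within(pred, gt)
```

Averaging over several thresholds rewards both very precise predictions and predictions that are only roughly correct, instead of reporting a single pass/fail cutoff.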
Experiments show that DELTA performs well in three-dimensional tracking tasks, with both speed and accuracy exceeding previous methods. DELTA is trained on the Kubric dataset, which contains over 5600 videos, and its loss function combines 2D coordinate, depth and visibility losses.
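A combined loss of this kind might look like the following minimal sketch. The article only states that 2D coordinate, depth, and visibility terms are combined; the L1 and cross-entropy forms, the equal weights, and all names below are assumptions made for illustration.

```python
import math

def tracking_loss(pred, target, w_xy=1.0, w_depth=1.0, w_vis=1.0):
    """Toy per-point loss: 2D coordinate + depth + visibility terms.

    pred / target: dicts with keys "xy" (u, v), "depth" (float) and
    "vis" (probability in pred, 0 or 1 in target).
    The L1 / cross-entropy choices and the weights are assumptions.
    """
    xy_loss = sum(abs(p - t) for p, t in zip(pred["xy"], target["xy"]))
    depth_loss = abs(pred["depth"] - target["depth"])
    # Clamp the probability to avoid log(0) in the binary cross-entropy.
    p = min(max(pred["vis"], 1e-7), 1.0 - 1e-7)
    v = target["vis"]
    vis_loss = -(v * math.log(p) + (1 - v) * math.log(1.0 - p))
    return w_xy * xy_loss + w_depth * depth_loss + w_vis * vis_loss

# One tracked point: half a pixel off, 0.2 units too deep, 90% sure it is visible.
pred = {"xy": (10.5, 20.0), "depth": 2.2, "vis": 0.9}
target = {"xy": (10.0, 20.0), "depth": 2.0, "vis": 1}
loss = tracking_loss(pred, target)
```

Keeping the three terms separate makes it easy to reweight them, for example to emphasize depth when the 2D tracks are already reliable.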
In benchmark tests, DELTA achieved the highest scores on CVO for long-range 2D tracking and on Kubric3D for dense 3D tracking, while completing the tasks much faster than other methods. DELTA's design choices, such as the logarithmic depth representation, spatial attention, and the attention-based upsampler, significantly improve its accuracy and efficiency across a variety of tracking scenarios.
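One plausible intuition for a logarithmic depth representation (offered here as an illustration, not as the authors' stated reasoning) is that the same relative change in depth produces the same value whether a point is near or far, while the linear change differs by orders of magnitude. The helper name and the sample depths are hypothetical.

```python
import math

def log_depth_change(d_start, d_end):
    """Depth change expressed in log space: log(d_end) - log(d_start)."""
    return math.log(d_end) - math.log(d_start)

# A 10% approach toward the camera for a near point and a far point.
near_linear = 0.9 - 1.0      # about -0.1 in linear depth
far_linear = 45.0 - 50.0     # -5.0 in linear depth
near_log = log_depth_change(1.0, 0.9)
far_log = log_depth_change(50.0, 45.0)
```

In linear depth the far point's change is 50 times larger than the near point's, but in log space the two changes are equal, so a model does not have to cope with wildly different magnitudes across the scene.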
DELTA is an efficient method capable of tracking every pixel in a video frame, achieving state-of-the-art accuracy and faster runtimes in dense 2D and 3D tracking. The method may struggle with points that are occluded for long periods, and it performs best on shorter videos of no more than a few hundred frames. DELTA's 3D tracking accuracy also depends on the accuracy and temporal stability of the monocular depth estimator it uses, so progress in monocular depth estimation is expected to further improve its performance.
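To see why 3D accuracy depends on depth quality, consider lifting a 2D pixel to 3D under a standard pinhole camera model. This is a generic sketch, not necessarily how DELTA produces its 3D tracks, and the intrinsics and pixel values are made-up assumptions.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a depth value to a 3D camera-space point
    under a pinhole model (focal lengths fx, fy; principal point cx, cy)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# Hypothetical intrinsics for a 640x480 camera.
fx = fy = 500.0
cx, cy = 320.0, 240.0

# The same pixel, once with the true depth and once with a 10% depth error.
true_pt = unproject(420.0, 240.0, 4.0, fx, fy, cx, cy)
noisy_pt = unproject(420.0, 240.0, 4.4, fx, fy, cx, cy)
```

Even with a perfect 2D track, the depth error shifts the lifted point along every axis that scales with depth, and if that error flickers from frame to frame the resulting 3D trajectory jitters as well.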
Project page: https://snap-research.github.io/DELTA/
All in all, DELTA marks substantial progress in efficient three-dimensional motion tracking, and its accuracy, efficiency, and scalability give it considerable potential in video processing applications. As monocular depth estimation continues to improve, DELTA's performance is expected to improve with it.