Depth estimation from 4D light field videos

Takahiro Kinoshita and Satoshi Ono
Kagoshima University

International Workshop on Advanced Image Technology (IWAIT), Jan 2021

[Paper] [Code] [Dataset] [DOI]


Abstract

Depth (disparity) estimation from 4D Light Field (LF) images has been a research topic for the last couple of years. Most studies have focused on depth estimation from static 4D LF images while not considering temporal information, i.e., LF videos. This paper proposes an end-to-end neural network architecture for depth estimation from 4D LF videos. This study also constructs a medium-scale synthetic 4D LF video dataset that can be used for training deep learning-based methods. Experimental results using synthetic and real-world 4D LF videos show that temporal information contributes to the improvement of depth estimation accuracy in noisy regions.


Our Example Results

Synthetic data

synthetic_result

Real-world data

real_result


Citation

If you find this work useful for your research, please cite the following.

@inproceedings{kinoshita2021depth,
  title={Depth estimation from 4D light field videos},
  author={Kinoshita, Takahiro and Ono, Satoshi},
  booktitle={International Workshop on Advanced Imaging Technology (IWAIT) 2021},
  volume={11766},
  pages={117660A},
  year={2021},
  organization={International Society for Optics and Photonics}
}

Our Dataset [available at IEEE DataPort]

dataset

To evaluate the performance of depth estimation methods on 4D LFVs, we constructed the Sintel 4D LFV dataset from the open-source movie Sintel. Existing publicly available 4D LFV datasets make it difficult to accurately evaluate deep learning-based depth estimation methods because they contain only a small number of samples or provide no ground-truth disparity values.

The generated dataset consists of 23 synthetic 4D LFVs with 1,204x436 pixels, 9x9 views, and 20–50 frames, and provides ground-truth disparity values for the central view, so it can be used for training deep learning-based methods. Each scene was rendered with a clean pass after modifying the production file of Sintel with reference to the MPI Sintel dataset. A clean pass includes complex illumination and reflectance properties, such as smooth shading and specular reflections, while bokeh, motion blur, and semi-transparent objects are excluded.
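To make the structure above concrete, one sample of such a dataset can be thought of as a dense tensor indexed by frame, view position, and pixel. This is a hedged sketch only: the axis order and the use of NumPy are illustrative assumptions, and the frame count and spatial resolution are reduced here to keep memory small (the real data is 9x9 views at 1,204x436 pixels over 20–50 frames).

```python
import numpy as np

# Illustrative sizes only; the real dataset uses 20-50 frames at 1,204x436.
# Axis order (frame, view row v, view column u, y, x, channel) is an assumption.
frames, views_v, views_u, height, width = 4, 9, 9, 64, 128

# One 4D LF video sample as a dense uint8 tensor.
lfv = np.zeros((frames, views_v, views_u, height, width, 3), dtype=np.uint8)

# The central view (u = v = 4) is the one with ground-truth disparity.
central = lfv[:, 4, 4]  # shape: (frames, height, width, 3)

# Per-frame ground-truth disparity for the central view.
gt_disp = np.zeros((frames, height, width), dtype=np.float32)
```

Indexing the central sub-aperture view this way is how a training pipeline would pair each frame with its ground-truth disparity map.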

The 4D LFVs were captured by shifting the camera with a baseline of 0.01 m toward a common focal plane while keeping the optical axes parallel. Ground-truth disparity values were obtained by transforming the depth values rendered in Blender. The disparity values lie in the range [0, 1] for most scenes, although a few scenes reach larger values (up to 2.148 in shaman_1).
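For a parallel camera array like the one described above, the depth-to-disparity transformation follows the standard relation disparity = baseline * focal_length / depth. The sketch below assumes this relation; the focal length value and function names are illustrative, not taken from the dataset's actual Blender setup.

```python
import numpy as np

BASELINE = 0.01  # camera baseline in metres, as stated for the dataset


def depth_to_disparity(depth, baseline=BASELINE, focal_px=1000.0):
    """Convert a metric depth map to per-view disparity (in pixels)
    for a parallel camera array: disparity = baseline * focal / depth.

    focal_px is an assumed focal length in pixels, for illustration only.
    """
    depth = np.asarray(depth, dtype=np.float64)
    return baseline * focal_px / depth


# Example: at a 1000 px focal length, a point 10 m away shifts by 1 px
# between adjacent views, and a point 20 m away by 0.5 px.
disp = depth_to_disparity(np.array([[10.0, 20.0]]))
```

Note the inverse relationship: disparity grows as depth shrinks, which is why near objects dominate the per-scene maximum disparity values listed below.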

Scenes

A GIF version of this page is available here. (Note that it contains many large GIFs.)

Scene name     Frames  Maximum disparity  RGB  Disparity
ambushfight_1      20              0.366  RGB  Disparity
ambushfight_2      21              1.522  RGB  Disparity
ambushfight_3      41              1.110  RGB  Disparity
ambushfight_4      30              1.005  RGB  Disparity
ambushfight_5      50              0.419  RGB  Disparity
ambushfight_6      20              0.562  RGB  Disparity
bamboo_1           50              0.230  RGB  Disparity
bamboo_2           50              0.820  RGB  Disparity
bamboo_3           50              0.592  RGB  Disparity
chickenrun_1       50              1.005  RGB  Disparity
chickenrun_2       21              0.485  RGB  Disparity
chickenrun_3       50              0.270  RGB  Disparity
foggyrocks_1       50              0.191  RGB  Disparity
foggyrocks_2       50              0.493  RGB  Disparity
questbegins_1      40              0.882  RGB  Disparity
shaman_1           50              2.148  RGB  Disparity
shaman_2           50              1.191  RGB  Disparity
shaman_3           50              0.954  RGB  Disparity
shaman_b_1         48              1.348  RGB  Disparity
thebigfight_1      50              0.620  RGB  Disparity
thebigfight_2      50              0.302  RGB  Disparity
thebigfight_3      50              0.317  RGB  Disparity
thebigfight_4      50              0.233  RGB  Disparity