Temporally Efficient Vision Transformer for Video Instance Segmentation

Recently vision transformer has achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Different from previous transformer-based VIS methods, TeViT is nearly convolution-free, which contains a transformer backbone and a query-based video instance segmentation head. In the backbone stage, we propose a nearly parameter-free memory shift mechanism for early temporal context fusion. In the head stages, we propose a parameter-shared spatiotemporal query interaction mechanism to build the one-to-one correspondence between video instances and queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results and maintains a high inference speed.

      To Get Daily Health Newsletter

      We don’t spam! Read our privacy policy for more info.

      Download Mobile Apps
      Follow us on Social Media
      © 2012 - 2025; All rights reserved by authors. Powered by Mediarx International LTD, a subsidiary company of Rx Foundation.
      RxHarun
      Logo
      Register New Account