MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval Dominant pre-training works for video-text retrieval mainly adopt the ...
AnimeSR: Learning Real-World Super-Resolution Models for Animation Videos This paper studies the problem of real-world video super-resolution (VSR) for ...
DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes Modeling dynamic scenes is important for many applications such as virtual reality and ...
Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer Most existing point cloud completion methods suffer from the ...
Mitigating Artifacts in Real-World Video Super-Resolution Models with More Cheap Hidden States and Selective Cross Attention The recurrent structure is a ...
Accelerating the Training of Video Super-Resolution Models Although convolutional neural networks (CNNs) have recently demonstrated high-quality ...
What Does Your Face Sound Like? 3D Face Shape Towards Voice Face-based speech synthesis provides a practical solution to generate voices from human faces. ...
Darwinian Model Upgrades: Model Evolving with Selective Compatibility The traditional model upgrading paradigm for retrieval requires recomputing all gallery ...
Video-Text Pre-training with Learned Regions for Retrieval Video-Text pre-training aims at learning transferable representations from large-scale video-text ...
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval Vision-language alignment learning for video-text retrieval arouses a lot of ...
Masked Image Modeling with Denoising Contrast Since the development of self-supervised visual representation learning from contrastive learning to masked ...
ERBNet: An Effective Representation Based Network for Unbiased Scene Graph Generation The scene graph generation (SGG) task has attracted increasing attention ...
Enhancing the Vocal Range of Single-Speaker Singing Voice Synthesis with Melody-Unsupervised Pre-training The single-speaker singing voice synthesis (SVS) ...
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge Pre-training on large-scale video data has become a common recipe for ...
HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-with-Regional Depth Distributions Depth estimation from a monocular 360° image ...
ViLEM: Visual-Language Error Modeling for Image-Text Retrieval Dominant pre-training works for image-text retrieval adopt "dual-encoder" architecture to ...
SurfelNeRF: Neural Surfel Radiance Field for Online 3D Reconstruction and Photorealistic Rendering Online reconstructing and rendering of large-scale indoor ...
All in One: Exploring Unified Video-Language Pre-training Mainstream Video-Language Pre-training models consist of three parts, a video encoder, a text ...
Masked Visual Reconstruction in Language Semantic Space Both masked image modeling (MIM) and natural language supervision have facilitated the progress of ...
Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models Recent CLIP-guided 3D optimization methods, e.g., DreamFields ...