MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Dominant pre-training works for video-text retrieval mainly adopt the "dual-encoder" architecture to enable ...
AnimeSR: Learning Real-World Super-Resolution Models for Animation Videos
This paper studies the problem of real-world video super-resolution (VSR) for animation videos, and reveals three key ...
DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes
Modeling dynamic scenes is important for many applications such as virtual reality and telepresence. Despite achieving unprecedented ...
Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer
Most existing point cloud completion methods suffer from the discrete nature of point clouds and the ...
Mitigating Artifacts in Real-World Video Super-Resolution Models with More Cheap Hidden States and Selective Cross Attention
The recurrent structure is a prevalent framework for the task of video ...
Accelerating the Training of Video Super-Resolution Models
Although convolutional neural networks (CNNs) have recently demonstrated high-quality reconstruction for video super-resolution (VSR), ...
What Does Your Face Sound Like? 3D Face Shape Towards Voice
Face-based speech synthesis provides a practical solution to generate voices from human faces. However, directly using 2D face images ...
Darwinian Model Upgrades: Model Evolving with Selective Compatibility
The traditional model upgrading paradigm for retrieval requires recomputing all gallery embeddings before deploying the new ...
Video-Text Pre-training with Learned Regions for Retrieval
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between ...
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval
Vision-language alignment learning for video-text retrieval has attracted a lot of attention in recent years. Most of the ...
Masked Image Modeling with Denoising Contrast
As self-supervised visual representation learning has developed from contrastive learning to masked image modeling (MIM), there is no ...
ERBNet: An Effective Representation Based Network for Unbiased Scene Graph Generation
The scene graph generation (SGG) task has attracted increasing attention in recent years. The goal of SGG is to ...
Enhancing the Vocal Range of Single-Speaker Singing Voice Synthesis with Melody-Unsupervised Pre-training
Single-speaker singing voice synthesis (SVS) usually underperforms at pitch values ...
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge
Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal ...
HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-with-Regional Depth Distributions
Depth estimation from a monocular 360° image is a burgeoning problem owing to its ...
ViLEM: Visual-Language Error Modeling for Image-Text Retrieval
Dominant pre-training works for image-text retrieval adopt the "dual-encoder" architecture to enable high efficiency, where two encoders ...
SurfelNeRF: Neural Surfel Radiance Field for Online 3D Reconstruction and Photorealistic Rendering
Online reconstruction and rendering of large-scale indoor scenes is a long-standing challenge. ...
All in One: Exploring Unified Video-Language Pre-training
Mainstream Video-Language Pre-training models consist of three parts: a video encoder, a text encoder, and a video-text fusion Transformer. ...
Masked Visual Reconstruction in Language Semantic Space
Both masked image modeling (MIM) and natural language supervision have facilitated the progress of transferable visual pre-training. In this ...
Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models
Recent CLIP-guided 3D optimization methods, e.g., DreamFields and PureCLIPNeRF, achieve great success in ...