LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation
Recently, diffusion models have achieved great success in image synthesis. However, when it comes to the layout-to-image ...
Accelerating Vision-Language Pretraining with Free Language Modeling
The state of the art in vision-language pretraining (VLP) achieves exemplary performance but suffers from high training costs ...
OSRT: Omnidirectional Image Super-Resolution with Distortion-aware Transformer
Omnidirectional images (ODIs) have attracted considerable research interest for immersive experiences. Although ODIs require ...
SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation
As an important and challenging problem in computer vision, PAnoramic Semantic Segmentation (PASS) gives complete ...
Task-Aware Dual-Representation Network for Few-Shot Action Recognition
Few-shot action recognition has attracted increasing attention in recent years, but it remains challenging due to the intrinsic ...
DeSRA: Detect and Delete the Artifacts of GAN-based Real-World Super-Resolution Models
Image super-resolution (SR) with generative adversarial networks (GAN) has achieved great success in restoring ...
Pi-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation
Foundation models have achieved great advances in multi-task learning with a unified interface of unimodal ...
NeRF-Texture: Texture Synthesis With Neural Radiance Fields
Texture synthesis is a fundamental problem in computer graphics that would benefit various applications. Existing methods are effective in ...
Binary Embedding-based Retrieval at Tencent
Large-scale embedding-based retrieval (EBR) is the cornerstone of search-related industrial applications. Given a user query, the system of EBR aims to ...
Prosody Modeling with 3D Visual Information for Expressive Video Dubbing
The automatic video dubbing task is proposed to meet personal and industrial demands for dubbing. Current methods mostly ...
Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection
We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla ...
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to ...
Transfer learning has emerged to be crucial in various computer vision tasks benefiting from the vast availability of pre-trained deep learning models. However, selecting an optimal model for a ...
Video Tagging intends to infer multiple tags spanning relevant content for a given video. Typically, video tags are freely defined and uploaded by a variety of users, so they have two ...
MasaCtrl: Tuning-free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
Despite the success in large-scale text-to-image generation and text-conditioned image editing, ...
Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video
Synthesizing realistic videos according to a given speech is still an open challenge. Previous works have been ...
OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution
Omnidirectional images (ODIs) have become increasingly popular, as their large field-of-view (FoV) can offer viewers the chance ...
HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video
We introduce HOSNeRF, a novel 360° free-viewpoint rendering method that reconstructs neural radiance fields for dynamic ...
VMesh: Hybrid Volume-Mesh Representation for Efficient View Synthesis
With the emergence of neural radiance fields (NeRFs), view synthesis quality has reached an unprecedented level. Compared to ...
CL-NeRF: Continual Learning of Neural Radiance Fields for Evolving Scene Representation
Existing methods for adapting Neural Radiance Fields (NeRFs) to scene changes require extensive data capture ...