Toward Human Perception-Centric Video Thumbnail Generation
Video thumbnails play an essential role in summarizing video content into a compact and concise image that users can browse efficiently. However, automatically generating attractive and informative video thumbnails remains an open problem, owing to the difficulty of formulating human aesthetic perception and the scarcity of paired training data. To address these challenges, this work proposes a novel Human Perception-Centric Video Thumbnail Generation (HPCVTG) framework that leverages established visual aesthetic principles to synthesize a large-scale set of thumbnails, pretrains a VAE model on them in a Model-Agnostic Meta-Learning (MAML) manner, and finetunes the generator with human feedback. Specifically, our framework first generates a set of thumbnails with a principle-based system that conforms to established aesthetic and human perception principles, such as visual balance in the layout and the avoidance of overlapping elements. Human annotators then evaluate a subset of these thumbnails and select the ones they prefer, which serve as training targets for few-shot learning. Gathering human-preferred thumbnails in this way is far more efficient than manually designing thumbnails from scratch to build a large-scale paired dataset. The framework then uses these preferred thumbnails as training data for a Transformer-based VAE model. The VAE is pretrained with MAML so that it can quickly adapt to new, perception-optimized thumbnails. Combining MAML pretraining with human feedback in training reduces the amount of human involvement required and makes the training process more efficient. Extensive experimental results show that our HPCVTG framework outperforms existing methods in both objective and subjective evaluations, highlighting its potential to improve the user experience when browsing videos and to inspire future research on human perception-centric content generation tasks.
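To make the principle-based synthesis step concrete, the sketch below scores a candidate thumbnail layout with two of the rules named above, visual balance and overlap avoidance. It is only an illustration: the box representation, the weights, and the scoring function itself are assumptions made for this example, not the framework's actual rule set.

```python
# Hypothetical scoring heuristics for a principle-based thumbnail layout synthesizer.
# Boxes are (x, y, w, h) in normalized [0, 1] canvas coordinates; weights are illustrative.
from itertools import combinations

def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    return iw * ih

def layout_score(boxes, w_balance=1.0, w_overlap=2.0):
    """Higher is better: keep the visual mass near the canvas center, avoid overlaps."""
    # Visual balance: distance of the area-weighted centroid from the center (0.5, 0.5).
    total = sum(w * h for _, _, w, h in boxes) or 1e-8
    cx = sum((x + w / 2) * w * h for x, y, w, h in boxes) / total
    cy = sum((y + h / 2) * w * h for x, y, w, h in boxes) / total
    balance_penalty = ((cx - 0.5) ** 2 + (cy - 0.5) ** 2) ** 0.5
    # Overlap avoidance: total intersection area over all element pairs.
    overlap_penalty = sum(overlap_area(a, b) for a, b in combinations(boxes, 2))
    return -(w_balance * balance_penalty + w_overlap * overlap_penalty)

# A title bar plus two non-overlapping key-frame boxes scores higher than
# the same elements stacked on top of each other.
clean = [(0.1, 0.1, 0.8, 0.2), (0.1, 0.4, 0.35, 0.4), (0.55, 0.4, 0.35, 0.4)]
cluttered = [(0.1, 0.1, 0.6, 0.6), (0.2, 0.2, 0.6, 0.6), (0.3, 0.3, 0.6, 0.6)]
print(layout_score(clean) > layout_score(cluttered))  # True
```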
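The meta-learning stage can likewise be sketched in a few lines. The following is a minimal, first-order MAML pretraining loop for a toy Transformer-based layout VAE in PyTorch; the model dimensions, the task sampler, and the first-order approximation are all assumptions made for illustration and do not reproduce the paper's implementation.

```python
# Minimal first-order MAML sketch for pretraining a Transformer-based layout VAE.
# Module names, dimensions, and the task sampler are illustrative assumptions only.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutVAE(nn.Module):
    """Toy VAE over sequences of layout-element features (batch, seq_len, d_model)."""
    def __init__(self, d_model=64, latent=16, seq_len=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_mu = nn.Linear(d_model, latent)
        self.to_logvar = nn.Linear(d_model, latent)
        self.decoder = nn.Linear(latent, d_model * seq_len)
        self.seq_len, self.d_model = seq_len, d_model

    def forward(self, x):
        h = self.encoder(x).mean(dim=1)                       # pool over layout elements
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        recon = self.decoder(z).view(-1, self.seq_len, self.d_model)
        return recon, mu, logvar

def vae_loss(recon, x, mu, logvar, beta=0.1):
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return F.mse_loss(recon, x) + beta * kld

def maml_pretrain(model, sample_task, meta_steps=1000,
                  inner_lr=1e-2, meta_lr=1e-3, inner_steps=3):
    """First-order MAML: adapt a copy on each task's support set, then push the
    adapted copy's query-set gradients back onto the meta-parameters."""
    meta_opt = torch.optim.Adam(model.parameters(), lr=meta_lr)
    for _ in range(meta_steps):
        support, query = sample_task()            # two batches of synthetic layouts per task
        fast = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):              # inner-loop adaptation on the support set
            recon, mu, logvar = fast(support)
            loss = vae_loss(recon, support, mu, logvar)
            inner_opt.zero_grad(); loss.backward(); inner_opt.step()
        fast.zero_grad()                          # outer loss on the query set
        recon, mu, logvar = fast(query)
        vae_loss(recon, query, mu, logvar).backward()
        meta_opt.zero_grad()
        for p, fp in zip(model.parameters(), fast.parameters()):
            p.grad = fp.grad.clone()              # first-order meta-gradient approximation
        meta_opt.step()

# Illustrative usage with random tensors standing in for synthesized thumbnail layouts.
sample_task = lambda: (torch.randn(16, 8, 64), torch.randn(16, 8, 64))
model = LayoutVAE()
maml_pretrain(model, sample_task, meta_steps=5)
```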