Latest Titles
- racism in NeurIPS2024
- moondream
- genie2
- stephen wolfram
- notlikeai.com
- HunyuanVideo
- amurex
- tedai
- Artificial Intelligence, Scientific Discovery, and Product Innovation
- daron acemoglu
- Adapting While Learning
- Centaur
- Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient
- fish audio agent
- OpenAI
- Xena vision
- alan turing
HunyuanVideo
Tencent's HunyuanVideo is an open-source, large-parameter video generation model that, according to its creators, surpasses leading closed-source models in performance. The model uses a unified architecture for image and video generation, built around a Multimodal Large Language Model (MLLM) text encoder and a 3D Variational Autoencoder (VAE). Its framework includes prompt rewriting for improved generation quality, and the researchers provide code, model weights, and a benchmark for community use. Extensive testing and human evaluation confirm its superior performance in text alignment, motion quality, and overall visual quality compared to other models.
HunyuanVideo stands out as the largest open-source video generation model, with over 13 billion parameters, an achievement the authors attribute to effective scaling of both the model architecture and the training dataset.
The model's architecture is designed for both image and video generation using a "Dual-stream to Single-stream" hybrid model. In the dual-stream phase, video and text tokens are processed independently. In the single-stream phase, these tokens are combined for multimodal information fusion. This approach allows for effective learning and fusion of visual and semantic information.
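To make the dual-stream/single-stream idea concrete, here is a minimal PyTorch sketch: video and text tokens first pass through separate transformer stacks, then are concatenated and processed by a shared stack where cross-modal fusion happens. Layer counts, dimensions, and class names are illustrative assumptions, not the actual HunyuanVideo implementation.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """A plain pre-norm transformer block, reused in both phases."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class DualToSingleStream(nn.Module):
    def __init__(self, dim: int = 512, dual_layers: int = 2, single_layers: int = 2):
        super().__init__()
        # Dual-stream phase: separate stacks so video and text tokens are
        # modulated independently, without cross-modal interference.
        self.video_blocks = nn.ModuleList([Block(dim) for _ in range(dual_layers)])
        self.text_blocks = nn.ModuleList([Block(dim) for _ in range(dual_layers)])
        # Single-stream phase: one shared stack over the concatenated sequence,
        # where visual and semantic information are fused via self-attention.
        self.joint_blocks = nn.ModuleList([Block(dim) for _ in range(single_layers)])

    def forward(self, video_tokens, text_tokens):
        for vb, tb in zip(self.video_blocks, self.text_blocks):
            video_tokens, text_tokens = vb(video_tokens), tb(text_tokens)
        x = torch.cat([video_tokens, text_tokens], dim=1)  # join along the sequence dim
        for jb in self.joint_blocks:
            x = jb(x)
        return x

# Example: 1,024 video latent tokens and 77 text tokens, 512-dim each.
out = DualToSingleStream()(torch.randn(1, 1024, 512), torch.randn(1, 77, 512))
print(out.shape)  # torch.Size([1, 1101, 512])
```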
HunyuanVideo employs a 3D Variational Autoencoder (VAE) with CausalConv3D to compress videos and images into a smaller latent space. This compression significantly reduces the computational demands of the model while preserving the original resolution and frame rate of the videos.
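The key property of a causal 3D convolution is that temporal padding is applied only on the "past" side, so a latent frame never depends on future frames. The sketch below illustrates this padding scheme; channel counts, kernel sizes, and the class name are assumptions for illustration, not the model's actual VAE configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3), stride=(1, 1, 1)):
        super().__init__()
        kt, kh, kw = kernel
        # Spatial padding is symmetric; temporal padding is handled manually below.
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, stride, padding=(0, kh // 2, kw // 2))
        self.time_pad = kt - 1  # pad only the leading (past) side of the time axis

    def forward(self, x):  # x: (B, C, T, H, W)
        # F.pad order for 5D input: (W_left, W_right, H_top, H_bottom, T_front, T_back)
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))
        return self.conv(x)

# Example: a 17-frame 256x256 RGB clip keeps its frame count after the causal conv.
video = torch.randn(1, 3, 17, 256, 256)
latent = CausalConv3d(3, 16)(video)
print(latent.shape)  # torch.Size([1, 16, 17, 256, 256])
```

Stacking such layers with spatial and temporal strides is what lets the VAE compress the video into a much smaller latent space for the diffusion backbone.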
Unlike models that rely on CLIP or T5-XXL for text encoding, HunyuanVideo leverages a Multimodal Large Language Model (MLLM) with a decoder-only structure. This choice offers several advantages (a sketch of the encoding step follows the list):
- Improved image-text alignment after visual instruction fine-tuning, leading to better instruction following.
- Superior ability in image detail description and complex reasoning compared to CLIP.
- Zero-shot learning capabilities by following system instructions within prompts, focusing attention on crucial information.
- A causal attention mechanism (as opposed to T5-XXL's bidirectional attention), resulting in enhanced text guidance for the model.
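The sketch below shows the general pattern of using a decoder-only LLM as a text encoder: a system instruction is prepended to the user prompt, the sequence is run through the model, and the last-layer hidden states are taken as conditioning embeddings. The checkpoint name and the system instruction are placeholders, not HunyuanVideo's actual choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-decoder-only-mllm"  # placeholder checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

# A system instruction steers the encoder's attention toward the information
# that matters for generation (subjects, actions, style).
system = "Describe the video precisely, focusing on subjects, actions, and style."
prompt = "A corgi surfing a wave at sunset, cinematic lighting."

inputs = tokenizer(system + "\n" + prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Last-layer hidden states serve as token-level text embeddings for conditioning.
text_embeddings = out.hidden_states[-1]  # (1, seq_len, hidden_dim)
```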
A prompt-rewrite feature handles the variability of user input: a Hunyuan-Large model fine-tuned for this task adapts user prompts to a model-preferred format. The feature provides two modes: Normal mode, which prioritizes accurate interpretation of user intent, and Master mode, which generates videos with enhanced visual quality, potentially at the cost of some semantic details.
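The idea can be sketched as a mode-dependent system instruction applied by any instruction-tuned chat LLM; the templates and the `rewrite_prompt` helper below are hypothetical, standing in for the fine-tuned Hunyuan-Large rewriter.

```python
# Hypothetical mode templates: "normal" preserves intent, "master" favors visuals.
MODES = {
    "normal": "Rewrite the prompt so the user's intent is preserved exactly, "
              "adding only details that are clearly implied.",
    "master": "Rewrite the prompt to maximize visual quality: add composition, "
              "lighting, and camera detail, even if minor semantics are dropped.",
}

def rewrite_prompt(user_prompt: str, mode: str, chat_fn) -> str:
    """chat_fn is any callable that sends a message list to an instruction-tuned LLM
    and returns its reply as a string."""
    messages = [
        {"role": "system", "content": MODES[mode]},
        {"role": "user", "content": user_prompt},
    ]
    return chat_fn(messages)
```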
HunyuanVideo outperforms existing state-of-the-art models in professional human evaluations, particularly excelling in motion quality. This assessment considered text alignment, motion quality, and visual quality, highlighting the model's ability to generate high-quality videos that accurately reflect user prompts.
The model is open-sourced, including the code, weights, and application details. This approach aims to bridge the gap between closed-source and open-source video foundation models, fostering innovation within the video generation community.
The source code is publicly available on GitHub, enabling developers and researchers to experiment with and contribute to the project.
HunyuanVideo leverages xDiT, a scalable inference engine, for parallel inference on multiple GPUs. This enables faster and more efficient video generation, particularly for high-resolution videos.
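One common way such engines parallelize DiT inference is sequence parallelism: each GPU owns a shard of the token sequence and shards are gathered before attention so every rank sees the full context. The toy sketch below illustrates that idea only; it is not xDiT's actual API, and the shapes are arbitrary. It assumes a multi-GPU launch such as `torchrun --nproc_per_node=2 script.py`.

```python
import torch
import torch.distributed as dist

def sharded_attention_step(local_tokens: torch.Tensor) -> torch.Tensor:
    """local_tokens: (local_seq_len, dim) shard owned by this rank."""
    world = dist.get_world_size()
    gathered = [torch.empty_like(local_tokens) for _ in range(world)]
    dist.all_gather(gathered, local_tokens)       # reassemble the full sequence on every rank
    full = torch.cat(gathered, dim=0)             # (seq_len, dim)
    scores = local_tokens @ full.T / full.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ full          # this rank's output shard

if __name__ == "__main__":
    dist.init_process_group("nccl")               # one process per GPU
    torch.cuda.set_device(dist.get_rank())
    shard = torch.randn(512, 128, device="cuda")  # toy shard of DiT tokens
    out = sharded_attention_step(shard)
    print(dist.get_rank(), out.shape)
```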