Today, I’ll be discussing an exciting new approach called Pix2Gif, which generates high-quality animated GIFs from a single image.

Pix2Gif, proposed by researchers from Microsoft, tackles GIF generation differently by framing it as an image translation task guided by text and motion magnitude prompts.

Original paper: https://arxiv.org/abs/2403.04634

Pix2Gif

The image above shows some impressive examples of GIFs generated by Pix2Gif, demonstrating its ability to produce frames with high spatial quality and temporal consistency.

The key innovation in Pix2Gif is a new motion-guided warping module. This module takes the encoded features of the source image and transforms them based on text and motion prompts to align with the target image. A perceptual loss is also used to keep the transformed features semantically similar to the target.

Unfortunately, the project page linked in the paper (https://hiteshk03.github.io/Pix2Gif/) is no longer available, even though the authors stated that the code, dataset, and models would be made public.

Training Pix2Gif

To train Pix2Gif, the researchers carefully curated a dataset from the TGIF video-caption dataset. They extracted coherent image frame pairs, calculated optical flow magnitudes between them, and kept only pairs with flow magnitudes between 2 and 20 to ensure motion diversity. The final dataset contains 783,184 training pairs and 105,041 validation pairs.
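The flow-magnitude filter described above can be sketched as follows. This is a minimal illustration of the selection criterion, assuming the dense optical flow fields between frame pairs have already been computed (the exact thresholding details in the paper may differ); `mean_flow_magnitude` and `keep_pair` are hypothetical helper names.

```python
import numpy as np

def mean_flow_magnitude(flow: np.ndarray) -> float:
    """Mean per-pixel magnitude of a dense optical flow field of shape (H, W, 2)."""
    return float(np.linalg.norm(flow, axis=-1).mean())

def keep_pair(flow: np.ndarray, lo: float = 2.0, hi: float = 20.0) -> bool:
    """Keep a frame pair only if its mean flow magnitude lies in [lo, hi].

    Pairs below `lo` are nearly static; pairs above `hi` have motion too
    large to learn coherently, so both extremes are discarded.
    """
    return lo <= mean_flow_magnitude(flow) <= hi

# Toy example: a uniform rightward flow of 5 px/frame passes the filter,
# while a static (all-zero) flow field is rejected.
moving = np.zeros((64, 64, 2), dtype=np.float32)
moving[..., 0] = 5.0
static = np.zeros((64, 64, 2), dtype=np.float32)
print(keep_pair(moving), keep_pair(static))  # True False
```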

Pix2Gif’s architecture builds upon InstructPix2Pix, an image editing model conditioned on text instructions. The main addition is the motion-guided warping module. The model takes three inputs: an image, a text prompt, and a motion magnitude. These are encoded and fed into the warping module and the latent diffusion model (LDM).
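One way to picture the three-input conditioning is to embed the scalar motion magnitude as an extra token appended to the text-encoder output, so the diffusion model attends to both prompts. The sketch below is illustrative only; the layer sizes and the exact fusion mechanism are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class MotionConditioner(nn.Module):
    """Illustrative fusion of text and motion-magnitude conditioning.

    A scalar motion magnitude is projected into the text-embedding space
    and appended as one extra token. Dimensions are hypothetical.
    """

    def __init__(self, text_dim: int = 768):
        super().__init__()
        self.motion_embed = nn.Sequential(
            nn.Linear(1, text_dim),
            nn.SiLU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, text_tokens: torch.Tensor, motion_magnitude: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, L, D) from the text encoder; motion_magnitude: (B,)
        m = self.motion_embed(motion_magnitude.unsqueeze(-1)).unsqueeze(1)  # (B, 1, D)
        return torch.cat([text_tokens, m], dim=1)  # (B, L+1, D)

cond = MotionConditioner(text_dim=16)
fused = cond(torch.zeros(2, 5, 16), torch.tensor([4.0, 12.0]))
print(fused.shape)  # torch.Size([2, 6, 16])
```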

Components

The warping module has two components: FlowNet, which generates an optical flow map conditioned on the source image and text/motion prompts, and WarpNet, which transforms the source image latent based on the flow map. This allows explicit control over the generated GIF’s temporal dynamics.
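The second step, transforming the source latent with a predicted flow map, is essentially backward warping. A minimal sketch of that operation in PyTorch, standing in for the role WarpNet plays (the actual module in the paper is learned and more elaborate), looks like this:

```python
import torch
import torch.nn.functional as F

def warp_latent(latent: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a latent feature map (B, C, H, W) with a flow field (B, 2, H, W).

    Each output position samples the source latent at its location displaced
    by the flow, using bilinear interpolation via grid_sample.
    """
    b, _, h, w = latent.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(latent)  # (1, 2, H, W)
    coords = grid + flow  # displaced sampling positions, broadcast over batch
    # Normalize to [-1, 1] as grid_sample expects (x first, then y).
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(latent, sample_grid, align_corners=True)
```

A quick sanity check on the design: with an all-zero flow field, the sampling grid is the identity, so the warped latent should reproduce the input.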

Pix2Gif is trained using a combination of the standard L2 loss for the LDM and a perceptual loss comparing VGG features of the source and warped latents. This ensures faithful motion generation while preserving the source image’s content.
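The combined objective can be sketched as below. This is an illustrative composition, not the paper's exact formulation: `lam` is an assumed weighting term, and the feature lists stand in for VGG activations of the warped-source and target latents.

```python
import torch
import torch.nn.functional as F

def pix2gif_style_loss(noise_pred, noise, warped_feats, target_feats, lam=0.1):
    """Illustrative total loss: standard LDM noise-prediction L2 plus a
    perceptual term comparing feature maps (e.g. from VGG) of the warped
    source and the target. The weighting `lam` is an assumption.
    """
    ldm_loss = F.mse_loss(noise_pred, noise)
    perceptual = sum(F.mse_loss(w, t) for w, t in zip(warped_feats, target_feats))
    return ldm_loss + lam * perceptual
```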

Experiments

Extensive experiments demonstrate Pix2Gif’s superiority over state-of-the-art methods.

Comparison studies with other image-text to video models. Given a source image and a caption, frames are extracted from the generated 16-frame video at 256×256 resolution.

In the image above, you can see a qualitative comparison showing Pix2Gif’s ability to generate coherent motion aligned with the input image and caption, unlike I2VGen-XL and DynamiCrafter, which struggle in this regard. Quantitatively, Pix2Gif achieves the best FVD and PIC scores, as seen in the table below, indicating high fidelity and consistency.

Quantitative comparison with state-of-the-art image-text-to-video generation models for the zero-shot setting

Ablation studies reveal the importance of Pix2Gif’s key components. Adding the motion embedding layer and warping module significantly improves the quality and controllability of generated motion compared to a base InstructPix2Pix model. The perceptual loss further enhances coherence.

Interestingly, Pix2Gif exhibits compositional abilities, combining distinct motions described in the text prompt, as shown in the figure below. This enables fine-grained user control over the generated GIF.

Summary

In summary, Pix2Gif proposes a novel diffusion-based approach for high-quality image-to-GIF generation. Its motion-guided warping module and carefully designed training process allow explicit control over GIF dynamics while maintaining visual consistency. Pix2Gif outperforms previous methods and shows promising compositional abilities. The researchers hope to further improve generated GIF resolution and quality in future work.

That wraps up this overview of Pix2Gif. What do you think about this new approach to GIF generation?

Citation of the paper: Kandala, H., Gao, J. and Yang, J., 2024. Pix2Gif: Motion-Guided Diffusion for GIF Generation. arXiv preprint arXiv:2403.04634.

Last Update: 25/03/2024