Key Points

1. The paper introduces a novel framework for personalized face generation, with a focus on simultaneous identity-expression control and fine-grained expression synthesis.

2. The proposed framework is multi-modal, taking three inputs to generate personalized face images: a selfie photo, a text prompt describing the scene, and a text label specifying the desired fine-grained expression.

3. The framework leverages a diffusion model for simultaneous face swapping and reenactment (SFSR), a new and previously unexplored task that transfers identity and expression from two different source faces to a target face while preserving the background attributes.

4. The paper presents three innovative designs in the conditional diffusion model: balanced identity and expression encoders, improved midpoint sampling, and explicit background conditioning, which together increase controllability and image quality (a generic midpoint-sampling sketch follows this list).

5. Extensive experiments demonstrated the controllability and scalability of the proposed framework, showing its superiority over state-of-the-art text-to-image, face swapping, and face reenactment methods.

6. The framework achieves fine-grained expression synthesis by employing an expression dictionary of 135 English words, allowing for a more comprehensive description of emotions.

7. The paper compares the proposed framework with various existing methods, including text-to-image models and face manipulation techniques, and shows superior performance in terms of identity consistency, expression consistency, realism, and image quality.

8. Ablation studies were conducted to demonstrate the effectiveness of the background conditioning and the compound identity embedding, showing improvements in identity similarity and image quality.

9. The paper discusses limitations related to the dataset and to ambiguity among expression labels, highlighting the challenge of fully reflecting a text label's semantic information in the synthesized expression.
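
The "improved midpoint sampling" in item 4 is only described at a high level here. As a point of reference, below is a minimal sketch of the classic midpoint (RK2) integrator applied to a diffusion probability-flow ODE; the function and parameter names (`velocity`, `num_steps`) are illustrative assumptions, not the authors' implementation.

```python
import torch

def midpoint_sample(velocity, x, t_start=1.0, t_end=0.0, num_steps=25):
    """Integrate dx/dt = velocity(x, t) from t_start to t_end with the
    classic midpoint (RK2) method. `velocity` stands in for a trained
    conditional denoiser's drift; this is a generic sketch, not the
    paper's exact sampler."""
    h = (t_end - t_start) / num_steps
    t = t_start
    for _ in range(num_steps):
        k1 = velocity(x, t)                # slope at the start of the step
        x_mid = x + 0.5 * h * k1           # half-step (Euler) prediction
        k2 = velocity(x_mid, t + 0.5 * h)  # slope re-evaluated at the midpoint
        x = x + h * k2                     # full step using the midpoint slope
        t += h
    return x

# Toy usage: a linear drift toward the origin stands in for a real model.
drift = lambda x, t: -x
sample = midpoint_sample(drift, torch.randn(1, 3, 64, 64))
```

Compared with a plain Euler step, the midpoint rule costs one extra model evaluation per step but is second-order accurate, which typically permits fewer sampling steps for comparable image quality.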

Summary

The paper "Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation" introduces a new multi-modal face generation framework that aims to achieve simultaneous identity-expression control and fine-grained expression synthesis. The authors discuss the limitations of existing pre-trained text-to-image models in producing user-desired portrait images with retained identity and diverse expressions and propose a framework that addresses this challenge by incorporating sophisticated expression control specialized by fine-grained emotional vocabulary. The proposed framework takes three inputs: a prompt describing the background, a selfie photo, and a text related to the fine-grained expression labels.

The technical core of the framework is a novel diffusion model that conducts simultaneous face swapping and reenactment, addressing the entanglement of identity and expression in a unified framework and incorporating several innovative designs that enhance controllability and image quality. The study demonstrates the controllability and scalability of the proposed framework through extensive experiments, comparing it with state-of-the-art text-to-image, face swapping, and face reenactment methods. The paper also surveys related methodologies in controllable image generation and manipulation, and discusses the challenges of controlling identity and expression within a unified framework.

The research also includes user studies and comparisons with text-to-image, hybrid, face reenactment, and face swapping methods, demonstrating the effectiveness of the proposed framework. The authors conclude by highlighting the potential impact of their work and express the hope that it will inspire future research on personalized generation frameworks with higher controllability and image quality. They also identify limitations in the fine-grained expression synthesis results, particularly in fully reflecting the semantic information conveyed by text labels, and acknowledge the dataset's limitations and the ambiguity among expression labels shown in the paper and its attached images.

Reference: https://arxiv.org/abs/2401.01207