Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward models built on large-scale datasets, limiting their applicability to scenarios where collecting such data is costly or difficult.
To utilize human feedback effectively and efficiently, we develop HERO (Human-Feedback-Efficient Reinforcement Learning for Online Diffusion Model Finetuning), a framework that leverages online human feedback collected on the fly during model learning. Specifically, HERO features two key mechanisms: (1) Feedback-Aligned Representation Learning, an online training method that captures human feedback and provides informative learning signals for diffusion model finetuning, and (2) Feedback-Guided Image Generation, which generates images from SD's refined initialization samples, enabling faster convergence towards the evaluator's intent.
We demonstrate that HERO is 4x more feedback-efficient than the best existing method on a body part anomaly correction task. Additionally, experiments show that HERO can effectively handle tasks such as reasoning, counting, personalization, and reducing NSFW content with only 0.5K pieces of online feedback.
HERO finetunes the Stable Diffusion model in the following steps:
Image generation: A batch of images is sampled from the Stable Diffusion model.
Human feedback: A human evaluator provides binary feedback ("good"/"bad") for each image and selects one "best" image among those labeled "good".
Feedback-aligned representation learning: Human annotations are used to train an embedding map that encodes images into continuous representations reflecting the human evaluations (see the sketch after this list).
Similarity-based reward computation: Each image is assigned a score based on its cosine similarity to the "best" image in the learned feedback-aligned representation space (sketched below).
Diffusion model finetuning: The Stable Diffusion model is finetuned via DDPO using the computed scores as rewards (sketched below).
Feedback-guided image generation: The next batch of images is sampled from a Gaussian mixture over the initial noises that generated the "good" images in the previous iteration (sketched below).
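A minimal sketch of how the feedback-aligned embedding map could be trained online, assuming image features from a frozen backbone and the binary good/bad labels collected above; the network sizes, the classification-style objective, and the names `FeedbackEncoder` and `feedback_loss` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedbackEncoder(nn.Module):
    """Illustrative embedding map: projects frozen image features into a
    low-dimensional space intended to reflect human good/bad judgments."""
    def __init__(self, feat_dim=768, emb_dim=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )
        self.head = nn.Linear(emb_dim, 1)  # good/bad logit, used only during training

    def forward(self, feats):
        return self.proj(feats)


def feedback_loss(encoder, feats, labels):
    # feats: (B, feat_dim) frozen backbone features; labels: 1 = "good", 0 = "bad"
    z = encoder(feats)
    logits = encoder.head(z).squeeze(-1)
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```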
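The similarity-based reward can then be read off as the cosine similarity between each image's embedding and the embedding of the human-selected "best" image. This sketch reuses the hypothetical `FeedbackEncoder` above and assumes precomputed feature tensors.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def similarity_rewards(encoder, feats, best_feats):
    """Score each image by its cosine similarity to the "best" image in the
    learned feedback-aligned representation space."""
    z = F.normalize(encoder(feats), dim=-1)            # (B, D) batch embeddings
    z_best = F.normalize(encoder(best_feats), dim=-1)  # (1, D) "best" image embedding
    return (z @ z_best.T).squeeze(-1)                  # (B,) rewards in [-1, 1]
```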
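DDPO treats the denoising chain as a policy and updates it with a policy-gradient objective weighted by the reward. The snippet below is a heavily simplified REINFORCE-style sketch rather than the full clipped DDPO objective, and it assumes per-step log-probabilities of the sampled denoising trajectories are available from the model being finetuned.

```python
import torch

def ddpo_style_update(log_probs, rewards, optimizer):
    # log_probs: (B, T) log-probabilities of each denoising step, requires grad
    # rewards:   (B,)   similarity-based rewards of the final images
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # normalize within the batch
    loss = -(advantages.detach()[:, None] * log_probs).mean()         # REINFORCE-style objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```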
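Feedback-guided generation can be sketched as drawing the next batch of initial latents from a Gaussian mixture whose components are centered on the noises that produced "good" images; the uniform mixture weights and the perturbation scale `sigma` are illustrative assumptions.

```python
import torch

def feedback_guided_latents(good_noises, batch_size, sigma=0.1):
    # good_noises: (K, C, H, W) initial latents whose images were labeled "good"
    idx = torch.randint(0, good_noises.shape[0], (batch_size,))  # pick mixture components uniformly
    centers = good_noises[idx]
    return centers + sigma * torch.randn_like(centers)           # perturb around each center
```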
HERO can address a variety of tasks, including hand deformation correction, content safety improvement, reasoning, and personalization.
HERO achieves the highest success rate across all tasks. On a hand anomaly correction task, we further compare HERO's sample efficiency with prior work, demonstrating that HERO is 4x more sample-efficient in terms of human feedback.
HERO demonstrates transferability to previously unseen inference prompts, showing that the model has acquired the desired concepts.
Models trained with two distinct personal preferences (green vs snowy) generated images that inherit these preferences when prompted with a related, unseen task.
A HERO model is trained using the prompt "sexy" to reduce nudity. When given potentially NSFW prompts, the HERO-trained model achieves a significantly higher content safety rate of 87.0%, compared to 57.5% for images generated by the pretrained Stable Diffusion model.
Sample images generated by the HERO model trained to improve content safety.
@article{hiranaka2024humanfeedbackefficientreinforcementlearning,
title={HERO: Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning},
author={Ayano Hiranaka and Shang-Fu Chen and Chieh-Hsin Lai and Dongjun Kim and Naoki Murata and Takashi Shibuya and Wei-Hsiang Liao and Shao-Hua Sun and Yuki Mitsufuji},
year={2024},
eprint={2410.05116},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.05116},
}