prepare_ip_adapter_image_embeds Bug Causes Feature Mixing During Batch Processing in IP-Adapter

@asomoza

Describe the bug

The prepare_ip_adapter_image_embeds function has a bug that results in unintended feature mixing across images during batch processing. This issue causes the generated images to combine features from multiple reference images, instead of maintaining a one-to-one correspondence with each reference.

When using the pipeline in batch mode, I use ip_adapter_image_embeds with a shape of (2*B, N, C) and set num_images_per_prompt=1. I expect the pipeline to generate B images, where each generated image should correspond directly to each reference in ip_adapter_image_embeds (note that 2*B includes the negative image embedding for classifier-free guidance).

diffusers/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py

Lines 950 to 957 in 31058cd

    
        if ip_adapter_image is not None or ip_adapter_image_embeds is not None:
            image_embeds = self.prepare_ip_adapter_image_embeds(
                ip_adapter_image,
                ip_adapter_image_embeds,
                device,
                batch_size * num_images_per_prompt,
                self.do_classifier_free_guidance,
            )

diffusers/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py

Lines 561 to 569 in 9a92b81

    
        ip_adapter_image_embeds = []
        for i, single_image_embeds in enumerate(image_embeds):
            single_image_embeds = torch.cat([single_image_embeds] * num_images_per_prompt, dim=0)
            if do_classifier_free_guidance:
                single_negative_image_embeds = torch.cat([negative_image_embeds[i]] * num_images_per_prompt, dim=0)
                single_image_embeds = torch.cat([single_negative_image_embeds, single_image_embeds], dim=0)

            single_image_embeds = single_image_embeds.to(device=device)
            ip_adapter_image_embeds.append(single_image_embeds)

However, when the pipeline processes ip_adapter_image_embeds, the tensor is repeated num_images_per_prompt * batch_size = 1 * B times. As a result, image_embeds ends up with a shape of (B*2*B, N, C) instead of the expected (2*B, N, C).
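The shape blow-up can be reproduced without loading the pipeline. The following is a minimal sketch, not the library code: `repeat_embeds` is a hypothetical helper that mirrors the repetition loop quoted above, and it assumes the precomputed embeddings are split into negative and positive halves with `chunk(2)` before being repeated.

```python
import torch


def repeat_embeds(embeds, repeats, do_cfg=True):
    """Hypothetical stand-in for the repetition loop quoted above."""
    if do_cfg:
        # Assumption: the input is stacked as [negatives; positives] along dim 0.
        negative, positive = embeds.chunk(2)
        positive = torch.cat([positive] * repeats, dim=0)
        negative = torch.cat([negative] * repeats, dim=0)
        return torch.cat([negative, positive], dim=0)
    return torch.cat([embeds] * repeats, dim=0)


B, N, C = 2, 1, 512
embeds = torch.zeros(2 * B, N, C)

# What the pipeline passes today: batch_size * num_images_per_prompt = B * 1
buggy = repeat_embeds(embeds, repeats=B * 1)
# What the workaround passes: num_images_per_prompt = 1
fixed = repeat_embeds(embeds, repeats=1)

print(tuple(buggy.shape))  # (8, 1, 512), i.e. (B*2*B, N, C)
print(tuple(fixed.shape))  # (4, 1, 512), i.e. (2*B, N, C)
```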

In the IPAdapterAttnProcessor2_0 class, a view operation is applied to the keys and values projected from image_embeds. Because view infers the sequence dimension with -1, the extra rows are folded into the token axis instead of raising a shape mismatch error. ip_key and ip_value for each sample therefore contain features from multiple reference images, and the generated images mix several references instead of keeping a one-to-one correspondence.

diffusers/src/diffusers/models/attention_processor.py

Lines 4112 to 4122 in 9a92b81

    
        ip_key = to_k_ip(current_ip_hidden_states)
        ip_value = to_v_ip(current_ip_hidden_states)

        ip_key = ip_key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
        ip_value = ip_value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)

        # the output of sdp = (batch, num_heads, seq_len, head_dim)
        # TODO: add support for attn.scale when we move to Torch 2.1
        current_ip_hidden_states = F.scaled_dot_product_attention(
            query, ip_key, ip_value, attn_mask=None, dropout_p=0.0, is_causal=False
        )
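
To illustrate why the view call mixes references rather than failing, here is a small sketch that uses labelled rows in place of real keys. The sizes (`heads`, `head_dim`), the attention batch of 2*B, and the row order are assumptions chosen to match the over-repeated tensor for B = 2 with classifier-free guidance; this is not the attention processor itself.

```python
import torch

B, heads, head_dim = 2, 2, 4
attn_batch = 2 * B  # assumed: negative half + positive half

# Label each row: 0 = negative, 1 = reference 1, 2 = reference 2.
# Assumed row order after over-repetition: [n, n, n, n, r1, r2, r1, r2].
labels = torch.tensor([0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 2.0])
ip_key = labels.view(-1, 1, 1).expand(-1, 1, heads * head_dim).contiguous()
print(tuple(ip_key.shape))  # (8, 1, 8)

# Same reshape as in the quoted attention code.
ip_key = ip_key.view(attn_batch, -1, heads, head_dim).transpose(1, 2)
print(tuple(ip_key.shape))  # (4, 2, 2, 4)

# Which source rows each batch element sees as key tokens
# (read from one head and one channel).
tokens_per_sample = ip_key[:, 0, :, 0]
print(tokens_per_sample.tolist())
# [[0.0, 0.0], [0.0, 0.0], [1.0, 2.0], [1.0, 2.0]]
```

In this sketch both positive samples receive two key tokens, one from each reference, so attention blends the two identities.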

I temporarily resolved the issue by passing num_images_per_prompt instead of batch_size * num_images_per_prompt to the prepare_ip_adapter_image_embeds method. Could this change cause issues in other scenarios?

        if ip_adapter_image is not None or ip_adapter_image_embeds is not None:
            image_embeds = self.prepare_ip_adapter_image_embeds(
                ip_adapter_image,
                ip_adapter_image_embeds,
                device,
                num_images_per_prompt,
                self.do_classifier_free_guidance,
            )
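
One scenario that may be worth checking with this workaround is a single reference embedding shared across several prompts, where repeating across the prompt batch is still needed. The sketch below shows one possible way to pick the repeat count from the embedding batch size. `repeats_for` is a hypothetical helper, not part of diffusers, and it does not address the ordering of repeated rows when num_images_per_prompt is greater than 1.

```python
def repeats_for(embed_batch, batch_size, num_images_per_prompt):
    """Hypothetical helper: how many times to repeat one set of image embeds.

    embed_batch is the batch size of the positive half only
    (after the negative embeddings have been split off).
    """
    if embed_batch == batch_size:
        # One reference per prompt: repeat only per generated image.
        return num_images_per_prompt
    if embed_batch == 1:
        # One shared reference: broadcast across prompts as well.
        return batch_size * num_images_per_prompt
    raise ValueError(
        f"Cannot match {embed_batch} image embeds to {batch_size} prompts"
    )


print(repeats_for(embed_batch=2, batch_size=2, num_images_per_prompt=1))  # 1
print(repeats_for(embed_batch=1, batch_size=2, num_images_per_prompt=1))  # 2
```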

Reproduction

Here’s a demo script that illustrates the issue. The script loads two reference images (image1 and image2), extracts their embeddings, and uses them as input to the pipeline in batch mode.

import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler
from diffusers.utils import load_image
from insightface.app import FaceAnalysis
import cv2
import numpy as np

pipeline = StableDiffusionPipeline.from_pretrained(
    "../checkpoints/Realistic_Vision_V4.0_noVAE",  # Replace with your model weights path
    torch_dtype=torch.float16,
).to("cuda")

pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter(
    "../checkpoints/IP-Adapter",  # Replace with your model weights path
    subfolder=None,
    weight_name="ip-adapter-faceid_sd15.bin",
    image_encoder_folder=None,
)
pipeline.set_ip_adapter_scale(1.0)

# Replace with your model weights path
app = FaceAnalysis(name="/root/data1/IP-Face/checkpoints/insightface", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(384, 384))

image1 = load_image('../test_image/65.jpg')
image2 = load_image('../test_image/27022.jpg')

# Extract a normalized face embedding of shape (1, 1, C) from each reference image
face1 = cv2.cvtColor(np.asarray(image1), cv2.COLOR_RGB2BGR)
face1 = app.get(face1)
face1_embedding = torch.from_numpy(face1[0].normed_embedding)
face1_embedding = face1_embedding.reshape(1, 1, -1)

face2 = cv2.cvtColor(np.asarray(image2), cv2.COLOR_RGB2BGR)
face2 = app.get(face2)
face2_embedding = torch.from_numpy(face2[0].normed_embedding)
face2_embedding = face2_embedding.reshape(1, 1, -1)

# Stack the references along the batch dimension: (B, 1, C)
ref_face_embedding = torch.cat([face1_embedding, face2_embedding])
# Zero embeddings serve as the negative image embeddings for classifier-free guidance
neg_ref_face_embedding = torch.zeros_like(ref_face_embedding)

# Final layout is [negatives; positives] with shape (2*B, 1, C)
batch_id_embeds = torch.cat([neg_ref_face_embedding, ref_face_embedding]).to(dtype=torch.float16, device="cuda")
batch_size = batch_id_embeds.shape[0] // 2

generator = torch.Generator(device="cpu").manual_seed(2023)
images = pipeline(
    prompt=["photo of a woman in red dress in a garden"]*batch_size,
    ip_adapter_image_embeds=[batch_id_embeds],
    negative_prompt=["monochrome, lowres, bad anatomy, worst quality, low quality"]*batch_size, 
    num_inference_steps=50, num_images_per_prompt=1,
    generator=generator
).images

Reference Images

The reference images image1 and image2 used as input embeddings:

image1	image2

Generated Images in Batch Mode

Using the demo code above, the following images were generated. These images exhibit features mixed from both references instead of corresponding uniquely to one.

Generated Image 1	Generated Image 2

Expected Behavior

In single-image processing (non-batch mode), the pipeline works as expected, producing distinct images for each reference:

Generated Image 1(no Batch)	Generated Image 2(no Batch)
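
For comparison runs in non-batch mode, the batched embeddings can be split into one [negative; positive] pair per reference. This is a minimal sketch that assumes the same layout as batch_id_embeds in the reproduction script; `split_per_reference` is a hypothetical helper name, and random values stand in for real face embeddings.

```python
import torch


def split_per_reference(batch_embeds):
    """Split [negatives; positives] of shape (2*B, N, C) into B tensors of shape (2, N, C)."""
    negative, positive = batch_embeds.chunk(2)
    return [
        torch.cat([negative[i : i + 1], positive[i : i + 1]], dim=0)
        for i in range(positive.shape[0])
    ]


B, N, C = 2, 1, 512
# Stand-in for batch_id_embeds: zeros as negatives, random values as positives.
batch_id_embeds = torch.cat([torch.zeros(B, N, C), torch.randn(B, N, C)])

per_ref = split_per_reference(batch_id_embeds)
print(len(per_ref), tuple(per_ref[0].shape))  # 2 (2, 1, 512)
```

Each element can then be passed as ip_adapter_image_embeds=[per_ref[i]] together with a single prompt, which corresponds to the non-batch runs above.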

Logs

No response

System Info

diffusers == 0.30.3
torch == 2.4.1+cu121
insightface == 0.7.3
python == 3.10.0

Who can help?

@asomoza
