Wrong Number of Frames in SkyreelsVideoPipeline #90

Open

xh-liu-tech wants to merge 1 commit into SkyworkAI:main from xh-liu-tech:main

Conversation

@xh-liu-tech

It seems that num_latent_frames was incorrectly used in SkyreelsVideoPipeline, because the number of latent frames is already calculated inside prepare_latents. As a result, the current implementation generates only 25 frames instead of 97.

    def prepare_latents(
        self,
        batch_size: int,
        num_channels_latents: int = 32,
        height: int = 720,
        width: int = 1280,
        num_frames: int = 129,
        dtype: Optional[torch.dtype] = None,
        device: Optional[torch.device] = None,
        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
        latents: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        if latents is not None:
            return latents.to(device=device, dtype=dtype)

        shape = (
            batch_size,
            num_channels_latents,
            # The latent frame count is derived from num_frames here, so callers
            # must pass the video frame count, not an already-downscaled latent count.
            (num_frames - 1) // self.vae_scale_factor_temporal + 1,
            int(height) // self.vae_scale_factor_spatial,
            int(width) // self.vae_scale_factor_spatial,
        )
        if isinstance(generator, list) and len(generator) != batch_size:
            raise ValueError(
                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
            )

        latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
        return latents

@pftq

pftq commented Mar 1, 2025

What's the rationale here? Latent frames are not the same as actual frames in the video - iirc each latent frame represents 4 actual video frames, which checks out with the 24 latent frames -> 96 video frames

Can you clarify what issue was in the test and what it resolved?

@xh-liu-tech
Author

> What's the rationale here? Latent frames are not the same as actual frames in the video - iirc each latent frame represents 4 actual video frames, which checks out with the 24 latent frames -> 96 video frames
>
> Can you clarify what issue was in the test and what it resolved?

It is because we should pass the original frame count to the prepare_latents function; the number of latent frames is then calculated inside that function (see this line).
However, in the current implementation, num_latent_frames is computed outside prepare_latents, which leads to the number of latent frames being downscaled twice (97 -> 25 -> 7). As a result, the frame counts of latent_model_input (7) and latent_image_input (25) are mismatched.
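To make the arithmetic concrete, here is a minimal sketch of the temporal downscaling formula used in prepare_latents, assuming vae_scale_factor_temporal = 4 (which matches the 97 -> 25 -> 7 numbers above); applying it once is correct, applying it twice is the bug:

```python
def to_latent_frames(num_frames: int, vae_scale_factor_temporal: int = 4) -> int:
    # Same formula as in prepare_latents: (num_frames - 1) // scale + 1
    return (num_frames - 1) // vae_scale_factor_temporal + 1

once = to_latent_frames(97)    # correct: 97 video frames -> 25 latent frames
twice = to_latent_frames(once) # bug: downscaling the latent count again -> 7
print(once, twice)             # 25 7
```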

@pftq

pftq commented Mar 2, 2025

Interesting, thanks for elaborating. I will test it too. I ended up looking at this same block of code this weekend while debugging other issues, like multi-GPU runs hitting division errors for 720x720, or the frame count being soft-capped at 192 (after that it loops); I wonder if it's related.

@pftq

pftq commented Mar 2, 2025

It seems to be shuffling problems from one area to another. A 49-frame test at 1920x1088 (previously working) now raises the exception below. That said, before the code change it was already having a different issue of becoming text2video after one second. Maybe it's exposing a deeper problem in other areas of the code.

Exception in thread Thread-1 (lauch_single_gpu_infer):
Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/SkyReels-V1/skyreelsinfer/skyreels_video_infer.py", line 236, in lauch_single_gpu_infer
    mp.spawn(
  File "/workspace/venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/workspace/venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 203, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/workspace/venv/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/workspace/SkyReels-V1/skyreelsinfer/skyreels_video_infer.py", line 193, in single_gpu_run
    pipe.damon_inference(request_queue, response_queue)
  File "/workspace/SkyReels-V1/skyreelsinfer/skyreels_video_infer.py", line 165, in damon_inference
    out = self.pipe(**kwargs).frames[0]
          ^^^^^^^^^^^^^^^^^^^
  File "/workspace/venv/lib/python3.11/site-packages/para_attn/context_parallel/diffusers_adapters/hunyuan_video.py", line 208, in new_call
    return original_call(self, *args, generator=generator, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/SkyReels-V1/skyreelsinfer/pipelines/pipeline_skyreels_video.py", line 354, in __call__
    latent_model_input = torch.cat([latent_model_input, latent_image_input], dim=1)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 49 but got size 13 for tensor number 1 in the list.

@xh-liu-tech
Author

I just tried this setting on a single RTX 4090, and it seems to be working well.
It will still take another hour to get results, which I think is consistent with the estimate in the README (~1.5h).

python3 video_generate.py \
     --model_id "Skywork/SkyReels-V1-Hunyuan-I2V" \
     --task_type i2v \
     --guidance_scale 6.0 \
     --height 720 \
     --width 720 \
     --num_frames 289 \
     --prompt "FPS-24, An old lady is talking happily" \
     --embedded_guidance_scale 1.0 \
     --quant \
     --offload \
     --high_cpu_memory \
     --image "image.png" \
     --parameters_level \
     --sequence_batch

Could you share your running parameters so that I can test it from my side?

@pftq

pftq commented Mar 2, 2025

Are you actually able to get 289 frames without sudden static noise at frame 193? I've been trying to debug that all week. Amazing if you did.

I think my error is specific to multi-GPU. I am testing on a rented service with 8x H100s:

python video_generate.py \
--gpu_num 8 \
--model_id "Skywork/SkyReels-V1-Hunyuan-I2V" \
--task_type i2v \
--guidance_scale 8 \
--embedded_guidance_scale 1 \
--width 1920 \
--height 1088 \
--num_frames 49 \
--num_inference_steps 100 \
--image "image.jpg" \
--prompt "FPS-24, " \
--negative_prompt "" \
--mbps 15 \
--video_num 1

@xh-liu-tech
Author

xh-liu-tech commented Mar 2, 2025

> Are you actually able to get 289 frames without sudden static noise on frame 193? I've been trying to debug that all week. Amazing if you did.
>
> I think my error is specific to multi-gpu. I am testing on a rented service with 8x H100s
>
> python video_generate.py \
> --gpu_num 8 \
> --model_id "Skywork/SkyReels-V1-Hunyuan-I2V" \
> --task_type i2v \
> --guidance_scale 8 \
> --embedded_guidance_scale 1 \
> --width 1920 \
> --height 1088 \
> --num_frames 49 \
> --num_inference_steps 100 \
> --image "image.jpg" \
> --prompt "FPS-24, " \
> --negative_prompt "" \
> --mbps 15 \
> --video_num 1

I checked the generated video and found that there was indeed static noise at 8s (probably frame 193).

I'm sorry that I couldn't test it on multi-GPU.
However, I suspect you didn't correctly apply my change from this PR.
If num_latent_frames is correctly changed to num_frames, the temporal dimension of latent_model_input should be 13 rather than 49 in your case.
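(For reference, the same downscaling formula as in prepare_latents, again assuming vae_scale_factor_temporal = 4, gives 13 latent frames for the 49-frame run above:)

```python
def to_latent_frames(num_frames: int, vae_scale_factor_temporal: int = 4) -> int:
    # (num_frames - 1) // scale + 1, as in prepare_latents
    return (num_frames - 1) // vae_scale_factor_temporal + 1

print(to_latent_frames(49))  # 13
```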

@pftq

pftq commented Mar 2, 2025

I did; it's just changing that one line to num_frames instead of num_latent_frames, right? It actually says it is 13; it's just that the script is expecting 49. Unfortunate to hear about the 8-second mark; yeah, that's frame 193. Still trying to figure that one out.

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 49 but got size 13 for tensor number 1 in the list.

@xh-liu-tech
Author

  File "/workspace/SkyReels-V1/skyreelsinfer/pipelines/pipeline_skyreels_video.py", line 354, in __call__
    latent_model_input = torch.cat([latent_model_input, latent_image_input], dim=1)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 49 but got size 13 for tensor number 1 in the list.

I think this means the temporal dimension of latent_model_input is 49, while that of latent_image_input is 13.
Maybe this is related to multi-GPU, or maybe it's my misunderstanding of their code.
I hope the authors can help figure out this issue.

@pftq

pftq commented Mar 7, 2025

Yeah, this is messy. I spent much of the week going down a rabbit hole with latent_model_input and latent_image_input mismatches.
It looks to me like the latent frame count was computed twice, as you identified, but the rest of the code was compensating for that, so it's not enough to fix just that one line (otherwise you get mismatches elsewhere).
