
What is the max_frames for inference on long videos in the paper? #4

Open
whu125 opened this issue Jan 24, 2025 · 2 comments

Comments

@whu125

whu125 commented Jan 24, 2025

Thank you for the excellent work.

I saw that in the example code, max_frames is set to 128, but when I use this parameter, I encounter an out-of-memory error. I am using an 80GB A800 GPU.

To replicate the results in your paper's tables, how many frames should I use? I am planning to reproduce the performance of VideoLlama3 on long video datasets.

whu125 changed the title from "How much GPU memory is required for inference with the 7B Video model?" to "What is the max_frames for inference on long videos in the paper?" on Jan 24, 2025
whu125 (Author) commented Jan 27, 2025

`attn_implementation="eager"`

I found that the root cause of the problem was that I was not using the flash-attention library. However, when I run the LLaVA-NeXT code without flash-attention, the out-of-memory error does not occur. Does anyone know why?
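For context, flash attention is typically enabled at model load time in Hugging Face transformers; a minimal sketch (the checkpoint name and exact arguments are assumptions for illustration, not taken from this thread):

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed checkpoint name; substitute the actual VideoLLaMA3 weights.
# attn_implementation="flash_attention_2" requires the flash-attn package;
# "eager" falls back to standard attention, which materializes the full
# attention matrix and can OOM on long frame sequences.
model = AutoModelForCausalLM.from_pretrained(
    "DAMO-NLP-SG/VideoLLaMA3-7B",
    torch_dtype=torch.bfloat16,              # halves activation memory vs fp32
    attn_implementation="flash_attention_2",
    device_map="auto",                       # needs the accelerate package
)
```

With eager attention, memory grows quadratically in sequence length, which is a plausible reason 128 frames can exceed 80 GB while flash attention fits.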

@lixin4ever
Collaborator


Sorry for the late reply.

For videos no longer than 3 minutes (180 seconds), we sample frames at 1 fps. For longer videos, we uniformly sample 180 frames.

We apply this frame sampling strategy to all benchmarks; there is no separate strategy for long video datasets.
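The sampling rule above can be sketched as a small helper (a hypothetical function, not the repo's actual code; assumes the total frame count and native fps are known):

```python
def sample_frame_indices(total_frames: int, native_fps: float,
                         max_frames: int = 180) -> list[int]:
    """Sampling rule described above: 1 fps for videos up to 180 s,
    otherwise max_frames uniformly spaced frames."""
    duration_s = total_frames / native_fps
    if duration_s <= 180:
        # 1 fps: one frame per second of video
        return [round(i * native_fps) for i in range(int(duration_s))]
    # longer video: max_frames indices spread uniformly over all frames
    return [round(i * (total_frames - 1) / (max_frames - 1))
            for i in range(max_frames)]
```

For example, a 10-minute clip decoded at 30 fps (18,000 frames) yields 180 uniformly spaced indices, while a 60-second clip yields 60 indices at 1 fps.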
