
[BUG] Does the Qwen custom build of vLLM not support GPUs with compute capability below 8.0? #1349

Open
2 tasks done
xirotech opened this issue Jan 30, 2025 · 0 comments

Comments

xirotech commented Jan 30, 2025

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in the FAQ?

  • I have searched the FAQ

Current Behavior

1. The server fails with `RuntimeError: FlashAttention only supports Ampere GPUs or newer.`
2. flash-attention was never installed on this system; it does not appear in `pip list`.
3. The official vLLM 0.7.0 release runs this model fine; only the custom build fails.
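For context, FlashAttention's published hardware requirement is compute capability 8.0 (sm_80, Ampere) or newer; on a live machine `torch.cuda.get_device_capability()` returns the `(major, minor)` pair for the active GPU. A minimal sketch of the check (the helper name and the example capability values are mine, taken from NVIDIA's published figures, not from the vLLM code):

```python
# Hypothetical helper mirroring FlashAttention's requirement:
# compute capability >= 8.0 (Ampere) is needed for its kernels.
def supports_flash_attention(major: int, minor: int) -> bool:
    """Return True if a GPU with this compute capability can run FlashAttention."""
    return (major, minor) >= (8, 0)

# Published compute capabilities for a few common data-center GPUs.
for name, cap in {
    "V100 (Volta)": (7, 0),
    "T4 (Turing)": (7, 5),
    "A100 (Ampere)": (8, 0),
}.items():
    print(name, supports_flash_attention(*cap))
```

So any Volta/Turing card (capability 7.x) would trip this error regardless of whether the standalone `flash-attn` package is installed, since vLLM bundles its own `vllm_flash_attn` kernels.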


vllm serve \
    LLMs/Qwen2.5-7B-Instruct-1M \
    --dtype half \
    --tensor-parallel-size 4 \
    --api-key openai \
    --gpu-memory-utilization 0.9 \
    --disable-custom-all-reduce \
    --enforce-eager \
    --max-num-seqs 1 \
    --enable-chunked-prefill --max-num-batched-tokens 24488 \
    --max-model-len 188576

INFO 01-30 17:27:09 __init__.py:179] Automatically detected platform cuda.
INFO 01-30 17:27:10 api_server.py:768] vLLM API server version 0.1.dev4204+g47b8640
INFO 01-30 17:27:10 api_server.py:769] args: Namespace(subparser='serve', model_tag='LLMs/Qwen2.5-7B-Instruct-1M', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='openai', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='LLMs/Qwen2.5-7B-Instruct-1M', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=188576, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=24488, max_num_seqs=1, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, 
enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x79d2121c2980>)
INFO 01-30 17:27:10 api_server.py:195] Started engine process with PID 79163
WARNING 01-30 17:27:10 config.py:2336] Casting torch.bfloat16 to torch.float16.
INFO 01-30 17:27:13 __init__.py:179] Automatically detected platform cuda.
WARNING 01-30 17:27:14 config.py:2336] Casting torch.bfloat16 to torch.float16.
INFO 01-30 17:27:16 config.py:522] This model supports multiple tasks: {'score', 'embed', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 01-30 17:27:16 config.py:1346] Defaulting to use mp for distributed inference
INFO 01-30 17:27:16 config.py:1501] Chunked prefill is enabled with max_num_batched_tokens=24488.
WARNING 01-30 17:27:16 cuda.py:100] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 01-30 17:27:16 config.py:675] Async output processing is not supported on the current platform type cuda.
INFO 01-30 17:27:17 weight_utils.py:245] Loaded sparse attention config from LLMs/Qwen2.5-7B-Instruct-1M/sparse_attention_config.json
INFO 01-30 17:27:20 config.py:522] This model supports multiple tasks: {'generate', 'classify', 'score', 'embed', 'reward'}. Defaulting to 'generate'.
INFO 01-30 17:27:20 config.py:1346] Defaulting to use mp for distributed inference
INFO 01-30 17:27:20 config.py:1501] Chunked prefill is enabled with max_num_batched_tokens=24488.
WARNING 01-30 17:27:20 cuda.py:100] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 01-30 17:27:20 config.py:675] Async output processing is not supported on the current platform type cuda.
INFO 01-30 17:27:21 weight_utils.py:245] Loaded sparse attention config from LLMs/Qwen2.5-7B-Instruct-1M/sparse_attention_config.json
INFO 01-30 17:27:21 llm_engine.py:232] Initializing an LLM engine (v0.1.dev4204+g47b8640) with config: model='LLMs/Qwen2.5-7B-Instruct-1M', speculative_config=None, tokenizer='LLMs/Qwen2.5-7B-Instruct-1M', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=188576, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=LLMs/Qwen2.5-7B-Instruct-1M, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True, 
WARNING 01-30 17:27:22 multiproc_worker_utils.py:298] Reducing Torch parallelism from 56 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 01-30 17:27:22 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 01-30 17:27:22 selector.py:117] Using Dual Chunk Attention backend.
INFO 01-30 17:27:25 __init__.py:179] Automatically detected platform cuda.
INFO 01-30 17:27:25 __init__.py:179] Automatically detected platform cuda.
INFO 01-30 17:27:25 __init__.py:179] Automatically detected platform cuda.
(VllmWorkerProcess pid=79823) INFO 01-30 17:27:26 multiproc_worker_utils.py:227] Worker ready; awaiting tasks
(VllmWorkerProcess pid=79824) INFO 01-30 17:27:26 multiproc_worker_utils.py:227] Worker ready; awaiting tasks
(VllmWorkerProcess pid=79825) INFO 01-30 17:27:26 multiproc_worker_utils.py:227] Worker ready; awaiting tasks
(VllmWorkerProcess pid=79823) INFO 01-30 17:27:26 selector.py:117] Using Dual Chunk Attention backend.
(VllmWorkerProcess pid=79824) INFO 01-30 17:27:27 selector.py:117] Using Dual Chunk Attention backend.
(VllmWorkerProcess pid=79825) INFO 01-30 17:27:27 selector.py:117] Using Dual Chunk Attention backend.
(VllmWorkerProcess pid=79824) INFO 01-30 17:27:28 utils.py:939] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=79824) INFO 01-30 17:27:28 pynccl.py:67] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=79823) INFO 01-30 17:27:28 utils.py:939] Found nccl from library libnccl.so.2
INFO 01-30 17:27:28 utils.py:939] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=79823) INFO 01-30 17:27:28 pynccl.py:67] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=79825) INFO 01-30 17:27:28 utils.py:939] Found nccl from library libnccl.so.2
INFO 01-30 17:27:28 pynccl.py:67] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=79825) INFO 01-30 17:27:28 pynccl.py:67] vLLM is using nccl==2.21.5
INFO 01-30 17:27:28 shm_broadcast.py:256] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_b7a84522'), local_subscribe_port=38349, remote_subscribe_port=None)
INFO 01-30 17:27:28 model_runner.py:1099] Starting to load model LLMs/Qwen2.5-7B-Instruct-1M...
(VllmWorkerProcess pid=79823) INFO 01-30 17:27:28 model_runner.py:1099] Starting to load model LLMs/Qwen2.5-7B-Instruct-1M...
(VllmWorkerProcess pid=79825) INFO 01-30 17:27:28 model_runner.py:1099] Starting to load model LLMs/Qwen2.5-7B-Instruct-1M...
(VllmWorkerProcess pid=79824) INFO 01-30 17:27:28 model_runner.py:1099] Starting to load model LLMs/Qwen2.5-7B-Instruct-1M...
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:03,  1.28s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:02<00:02,  1.26s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:03<00:01,  1.22s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00,  1.24s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00,  1.24s/it]

(VllmWorkerProcess pid=79825) INFO 01-30 17:27:34 model_runner.py:1104] Loading model weights took 3.8341 GB
(VllmWorkerProcess pid=79824) INFO 01-30 17:27:34 model_runner.py:1104] Loading model weights took 3.8341 GB
(VllmWorkerProcess pid=79823) INFO 01-30 17:27:34 model_runner.py:1104] Loading model weights took 3.8341 GB
INFO 01-30 17:27:34 model_runner.py:1104] Loading model weights took 3.8341 GB
(VllmWorkerProcess pid=79823) INFO 01-30 17:27:37 model_runner_base.py:119] Writing input of failed execution to /tmp/err_execute_model_input_20250130-172737.pkl...
INFO 01-30 17:27:37 model_runner_base.py:119] Writing input of failed execution to /tmp/err_execute_model_input_20250130-172737.pkl...
(VllmWorkerProcess pid=79824) INFO 01-30 17:27:37 model_runner_base.py:119] Writing input of failed execution to /tmp/err_execute_model_input_20250130-172737.pkl...
(VllmWorkerProcess pid=79823) INFO 01-30 17:27:37 model_runner_base.py:148] Completed writing input of failed execution to /tmp/err_execute_model_input_20250130-172737.pkl.
(VllmWorkerProcess pid=79825) INFO 01-30 17:27:37 model_runner_base.py:119] Writing input of failed execution to /tmp/err_execute_model_input_20250130-172737.pkl...
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240] Traceback (most recent call last):
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/worker/model_runner_base.py", line 115, in _wrapper
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/worker/model_runner.py", line 1718, in execute_model
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]                                     ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/model_executor/models/qwen2.py", line 498, in forward
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/compilation/decorators.py", line 170, in __call__
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return self.forward(*args, **kwargs)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/model_executor/models/qwen2.py", line 360, in forward
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     hidden_states, residual = layer(
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]                               ^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/model_executor/models/qwen2.py", line 267, in forward
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     hidden_states = self.self_attn(
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]                     ^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/model_executor/models/qwen2.py", line 189, in forward
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     attn_output = self.attn(q,
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]                   ^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/attention/layer.py", line 184, in forward
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return torch.ops.vllm.unified_attention(
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/_ops.py", line 1116, in __call__
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return self._op(*args, **(kwargs or {}))
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/attention/layer.py", line 279, in unified_attention
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return self.impl.forward(query, key, value, kv_cache, attn_metadata,
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/attention/backends/dual_chunk_flash_attn.py", line 423, in forward
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     out = flash_attn_varlen_func(
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]           ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/vllm_flash_attn/flash_attn_interface.py", line 503, in flash_attn_varlen_func
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     out, softmax_lse = _flash_attn_varlen_forward(
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/vllm_flash_attn/flash_attn_interface.py", line 168, in _flash_attn_varlen_forward
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     out, softmax_lse = torch.ops.vllm_flash_attn_c.varlen_fwd(
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/_ops.py", line 1116, in __call__
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return self._op(*args, **(kwargs or {}))
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240] RuntimeError: FlashAttention only supports Ampere GPUs or newer.
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240] 
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240] The above exception was the direct cause of the following exception:
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240] 
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240] Traceback (most recent call last):
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/executor/multiproc_worker_utils.py", line 234, in _run_worker_process
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/utils.py", line 2209, in run_method
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/worker/worker.py", line 200, in determine_num_available_blocks
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     self.model_runner.profile_run()
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/worker/model_runner.py", line 1350, in profile_run
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/worker/model_runner_base.py", line 151, in _wrapper
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     raise type(err)(
(VllmWorkerProcess pid=79823) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240] RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250130-172737.pkl): FlashAttention only supports Ampere GPUs or newer.
(VllmWorkerProcess pid=79824) INFO 01-30 17:27:37 model_runner_base.py:148] Completed writing input of failed execution to /tmp/err_execute_model_input_20250130-172737.pkl.
INFO 01-30 17:27:37 model_runner_base.py:148] Completed writing input of failed execution to /tmp/err_execute_model_input_20250130-172737.pkl.
(VllmWorkerProcess pid=79824 and pid=79825 then emit the same traceback as pid=79823 above; repeated output trimmed.)
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return self._op(*args, **(kwargs or {}))
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/attention/layer.py", line 279, in unified_attention
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return self.impl.forward(query, key, value, kv_cache, attn_metadata,
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/attention/backends/dual_chunk_flash_attn.py", line 423, in forward
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     out = flash_attn_varlen_func(
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]           ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/vllm_flash_attn/flash_attn_interface.py", line 503, in flash_attn_varlen_func
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     out, softmax_lse = _flash_attn_varlen_forward(
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/vllm_flash_attn/flash_attn_interface.py", line 168, in _flash_attn_varlen_forward
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     out, softmax_lse = torch.ops.vllm_flash_attn_c.varlen_fwd(
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/_ops.py", line 1116, in __call__
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return self._op(*args, **(kwargs or {}))
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240] RuntimeError: FlashAttention only supports Ampere GPUs or newer.
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240] 
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240] The above exception was the direct cause of the following exception:
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240] 
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240] Traceback (most recent call last):
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/executor/multiproc_worker_utils.py", line 234, in _run_worker_process
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/utils.py", line 2209, in run_method
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/worker/worker.py", line 200, in determine_num_available_blocks
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     self.model_runner.profile_run()
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/worker/model_runner.py", line 1350, in profile_run
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]   File "/home/ai/qwenvllm/vllm/worker/model_runner_base.py", line 151, in _wrapper
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240]     raise type(err)(
(VllmWorkerProcess pid=79824) ERROR 01-30 17:27:37 multiproc_worker_utils.py:240] RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250130-172737.pkl): FlashAttention only supports Ampere GPUs or newer.
(VllmWorkerProcess pid=79825) INFO 01-30 17:27:37 model_runner_base.py:148] Completed writing input of failed execution to /tmp/err_execute_model_input_20250130-172737.pkl.
ERROR 01-30 17:27:37 engine.py:381] Error in model execution (input dumped to /tmp/err_execute_model_input_20250130-172737.pkl): FlashAttention only supports Ampere GPUs or newer.
ERROR 01-30 17:27:37 engine.py:381] Traceback (most recent call last):
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/worker/model_runner_base.py", line 115, in _wrapper
ERROR 01-30 17:27:37 engine.py:381]     return func(*args, **kwargs)
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/worker/model_runner.py", line 1718, in execute_model
ERROR 01-30 17:27:37 engine.py:381]     hidden_or_intermediate_states = model_executable(
ERROR 01-30 17:27:37 engine.py:381]                                     ^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 01-30 17:27:37 engine.py:381]     return self._call_impl(*args, **kwargs)
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 01-30 17:27:37 engine.py:381]     return forward_call(*args, **kwargs)
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/model_executor/models/qwen2.py", line 498, in forward
ERROR 01-30 17:27:37 engine.py:381]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 01-30 17:27:37 engine.py:381]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/compilation/decorators.py", line 170, in __call__
ERROR 01-30 17:27:37 engine.py:381]     return self.forward(*args, **kwargs)
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/model_executor/models/qwen2.py", line 360, in forward
ERROR 01-30 17:27:37 engine.py:381]     hidden_states, residual = layer(
ERROR 01-30 17:27:37 engine.py:381]                               ^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 01-30 17:27:37 engine.py:381]     return self._call_impl(*args, **kwargs)
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 01-30 17:27:37 engine.py:381]     return forward_call(*args, **kwargs)
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/model_executor/models/qwen2.py", line 267, in forward
ERROR 01-30 17:27:37 engine.py:381]     hidden_states = self.self_attn(
ERROR 01-30 17:27:37 engine.py:381]                     ^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 01-30 17:27:37 engine.py:381]     return self._call_impl(*args, **kwargs)
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 01-30 17:27:37 engine.py:381]     return forward_call(*args, **kwargs)
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/model_executor/models/qwen2.py", line 189, in forward
ERROR 01-30 17:27:37 engine.py:381]     attn_output = self.attn(q,
ERROR 01-30 17:27:37 engine.py:381]                   ^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 01-30 17:27:37 engine.py:381]     return self._call_impl(*args, **kwargs)
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 01-30 17:27:37 engine.py:381]     return forward_call(*args, **kwargs)
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/attention/layer.py", line 184, in forward
ERROR 01-30 17:27:37 engine.py:381]     return torch.ops.vllm.unified_attention(
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/_ops.py", line 1116, in __call__
ERROR 01-30 17:27:37 engine.py:381]     return self._op(*args, **(kwargs or {}))
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/attention/layer.py", line 279, in unified_attention
ERROR 01-30 17:27:37 engine.py:381]     return self.impl.forward(query, key, value, kv_cache, attn_metadata,
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/attention/backends/dual_chunk_flash_attn.py", line 423, in forward
ERROR 01-30 17:27:37 engine.py:381]     out = flash_attn_varlen_func(
ERROR 01-30 17:27:37 engine.py:381]           ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/vllm_flash_attn/flash_attn_interface.py", line 503, in flash_attn_varlen_func
ERROR 01-30 17:27:37 engine.py:381]     out, softmax_lse = _flash_attn_varlen_forward(
ERROR 01-30 17:27:37 engine.py:381]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/vllm_flash_attn/flash_attn_interface.py", line 168, in _flash_attn_varlen_forward
ERROR 01-30 17:27:37 engine.py:381]     out, softmax_lse = torch.ops.vllm_flash_attn_c.varlen_fwd(
ERROR 01-30 17:27:37 engine.py:381]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/_ops.py", line 1116, in __call__
ERROR 01-30 17:27:37 engine.py:381]     return self._op(*args, **(kwargs or {}))
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381] RuntimeError: FlashAttention only supports Ampere GPUs or newer.
ERROR 01-30 17:27:37 engine.py:381] 
ERROR 01-30 17:27:37 engine.py:381] The above exception was the direct cause of the following exception:
ERROR 01-30 17:27:37 engine.py:381] 
ERROR 01-30 17:27:37 engine.py:381] Traceback (most recent call last):
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/engine/multiprocessing/engine.py", line 372, in run_mp_engine
ERROR 01-30 17:27:37 engine.py:381]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 01-30 17:27:37 engine.py:381]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/engine/multiprocessing/engine.py", line 120, in from_engine_args
ERROR 01-30 17:27:37 engine.py:381]     return cls(ipc_path=ipc_path,
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/engine/multiprocessing/engine.py", line 72, in __init__
ERROR 01-30 17:27:37 engine.py:381]     self.engine = LLMEngine(*args, **kwargs)
ERROR 01-30 17:27:37 engine.py:381]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/engine/llm_engine.py", line 274, in __init__
ERROR 01-30 17:27:37 engine.py:381]     self._initialize_kv_caches()
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/engine/llm_engine.py", line 414, in _initialize_kv_caches
ERROR 01-30 17:27:37 engine.py:381]     self.model_executor.determine_num_available_blocks())
ERROR 01-30 17:27:37 engine.py:381]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/executor/executor_base.py", line 77, in determine_num_available_blocks
ERROR 01-30 17:27:37 engine.py:381]     results = self.collective_rpc("determine_num_available_blocks")
ERROR 01-30 17:27:37 engine.py:381]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/executor/executor_base.py", line 258, in collective_rpc
ERROR 01-30 17:27:37 engine.py:381]     return self._run_workers(method, *args, **(kwargs or {}))
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/executor/mp_distributed_executor.py", line 183, in _run_workers
ERROR 01-30 17:27:37 engine.py:381]     driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 01-30 17:27:37 engine.py:381]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/utils.py", line 2209, in run_method
ERROR 01-30 17:27:37 engine.py:381]     return func(*args, **kwargs)
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 01-30 17:27:37 engine.py:381]     return func(*args, **kwargs)
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/worker/worker.py", line 200, in determine_num_available_blocks
ERROR 01-30 17:27:37 engine.py:381]     self.model_runner.profile_run()
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 01-30 17:27:37 engine.py:381]     return func(*args, **kwargs)
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/worker/model_runner.py", line 1350, in profile_run
ERROR 01-30 17:27:37 engine.py:381]     self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 01-30 17:27:37 engine.py:381]     return func(*args, **kwargs)
ERROR 01-30 17:27:37 engine.py:381]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 01-30 17:27:37 engine.py:381]   File "/home/ai/qwenvllm/vllm/worker/model_runner_base.py", line 151, in _wrapper
ERROR 01-30 17:27:37 engine.py:381]     raise type(err)(
ERROR 01-30 17:27:37 engine.py:381] RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250130-172737.pkl): FlashAttention only supports Ampere GPUs or newer.
INFO 01-30 17:27:37 multiproc_worker_utils.py:126] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/ai/qwenvllm/vllm/worker/model_runner_base.py", line 115, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/worker/model_runner.py", line 1718, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/model_executor/models/qwen2.py", line 498, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/compilation/decorators.py", line 170, in __call__
    return self.forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/model_executor/models/qwen2.py", line 360, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/model_executor/models/qwen2.py", line 267, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/model_executor/models/qwen2.py", line 189, in forward
    attn_output = self.attn(q,
                  ^^^^^^^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/attention/layer.py", line 184, in forward
    return torch.ops.vllm.unified_attention(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/attention/layer.py", line 279, in unified_attention
    return self.impl.forward(query, key, value, kv_cache, attn_metadata,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/attention/backends/dual_chunk_flash_attn.py", line 423, in forward
    out = flash_attn_varlen_func(
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/vllm_flash_attn/flash_attn_interface.py", line 503, in flash_attn_varlen_func
    out, softmax_lse = _flash_attn_varlen_forward(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/vllm_flash_attn/flash_attn_interface.py", line 168, in _flash_attn_varlen_forward
    out, softmax_lse = torch.ops.vllm_flash_attn_c.varlen_fwd(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: FlashAttention only supports Ampere GPUs or newer.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ai/qwenvllm/vllm/engine/multiprocessing/engine.py", line 383, in run_mp_engine
    raise e
  File "/home/ai/qwenvllm/vllm/engine/multiprocessing/engine.py", line 372, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/engine/multiprocessing/engine.py", line 120, in from_engine_args
    return cls(ipc_path=ipc_path,
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/engine/multiprocessing/engine.py", line 72, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/engine/llm_engine.py", line 274, in __init__
    self._initialize_kv_caches()
  File "/home/ai/qwenvllm/vllm/engine/llm_engine.py", line 414, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/executor/executor_base.py", line 77, in determine_num_available_blocks
    results = self.collective_rpc("determine_num_available_blocks")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/executor/executor_base.py", line 258, in collective_rpc
    return self._run_workers(method, *args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/executor/mp_distributed_executor.py", line 183, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/utils.py", line 2209, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/worker/worker.py", line 200, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/worker/model_runner.py", line 1350, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/worker/model_runner_base.py", line 151, in _wrapper
    raise type(err)(
RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250130-172737.pkl): FlashAttention only supports Ampere GPUs or newer.
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /home/ai/qwenvllm/vllm/engine/multiprocessing/client.py:180> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
  File "/home/ai/qwenvllm/vllm/engine/multiprocessing/client.py", line 186, in run_output_handler_loop
    while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/zmq/_future.py", line 400, in poll
    raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Traceback (most recent call last):
  File "/home/ai/miniconda3/envs/QwenVllm/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ai/qwenvllm/vllm/scripts.py", line 201, in main
    args.dispatch_function(args)
  File "/home/ai/qwenvllm/vllm/scripts.py", line 42, in serve
    uvloop.run(run_server(args))
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/entrypoints/openai/api_server.py", line 796, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/entrypoints/openai/api_server.py", line 125, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/qwenvllm/vllm/entrypoints/openai/api_server.py", line 219, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/ai/miniconda3/envs/QwenVllm/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```

### 期望行为 | Expected Behavior

The server should start up normally.

### 复现方法 | Steps To Reproduce

conda create -n QwenVllm python=3.12 -y
conda activate QwenVllm
conda install -c conda-forge gxx_linux-64  # get a newer GLIBCXX

git clone -b dev/dual-chunk-attn --single-branch https://github.com/QwenLM/vllm.git
cd vllm
pip install -e . -v
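Before building this branch, it may help to check the GPU's compute capability first, since the FlashAttention kernels involved require SM 8.0 (Ampere) or newer. A minimal sketch; the helper `meets_flash_attn_requirement` is my own name, not part of vLLM, and the commented `torch` lines assume a CUDA-enabled PyTorch install:

```python
# Hypothetical helper, not part of vLLM: FlashAttention 2 kernels
# require compute capability (8, 0) -- Ampere -- or newer.
def meets_flash_attn_requirement(major: int, minor: int) -> bool:
    return (major, minor) >= (8, 0)

# On a machine with CUDA-enabled PyTorch, the local GPU can be queried:
#   import torch
#   major, minor = torch.cuda.get_device_capability(0)
#   print(meets_flash_attn_requirement(major, minor))

print(meets_flash_attn_requirement(7, 5))  # RTX 2080 Ti (Turing, sm_75) -> False
print(meets_flash_attn_requirement(8, 0))  # A100 (Ampere, sm_80) -> True
```

On the RTX 2080 Ti from this report the check returns False, which matches the runtime error seen in the log.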


### 运行环境 | Environment

```Markdown
- OS: Ubuntu 24.04.1
- Python: 3.12
- Transformers: 4.48.1
- PyTorch: 2.5.1
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.4
- GPU: RTX 2080 Ti (sm_75)
```

### 备注 | Anything else?

ChatGPT's analysis, for reference:

From the stack trace, the error is raised by the custom FlashAttention call. The trigger point is `_flash_attn_varlen_forward()` in flash_attn_interface.py, which ultimately calls `torch.ops.vllm_flash_attn_c.varlen_fwd()`, and that C++/CUDA extension throws the "FlashAttention only supports Ampere GPUs or newer" exception directly.

More concretely, the call chain is:

forward in qwen2.py -> self.self_attn -> self.attn -> unified_attention() in vllm/attention/layer.py
then forward() in vllm/attention/backends/dual_chunk_flash_attn.py
and finally, when _flash_attn_varlen_forward() in vllm_flash_attn/flash_attn_interface.py calls torch.ops.vllm_flash_attn_c.varlen_fwd(), it fails because the GPU is not an Ampere-architecture card.
The root cause, then, is that this Qwen customized build of vLLM unconditionally takes the new FlashAttention branch and provides no fallback implementation for older GPU architectures (upstream vLLM typically disables FlashAttention or falls back to a plain attention implementation on older cards), so it fails outright on non-Ampere GPUs.
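The missing fallback the analysis describes could be sketched as a capability-gated backend choice. This is illustrative only: the function `select_attention_backend` and the backend names are hypothetical and do not reflect vLLM's actual backend registry.

```python
# Illustrative sketch only: pick the flash-attention fast path on Ampere+
# GPUs, otherwise fall back to a generic attention implementation,
# instead of unconditionally calling the FlashAttention kernel.
def select_attention_backend(compute_capability: tuple) -> str:
    if compute_capability >= (8, 0):
        return "DUAL_CHUNK_FLASH_ATTN"  # fast path, requires sm_80+
    return "GENERIC_ATTENTION"  # hypothetical fallback for older cards

print(select_attention_backend((7, 5)))  # sm_75 (RTX 2080 Ti) -> GENERIC_ATTENTION
print(select_attention_backend((8, 0)))  # sm_80 (A100) -> DUAL_CHUNK_FLASH_ATTN
```

With a gate like this, a Turing card such as the RTX 2080 Ti would get the slower generic path rather than a hard crash during engine startup.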
