CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond | vLLM Blog

omnivore gpu-programming good-read!


Highlights

In summary, when using the CUDA core dump feature, it is recommended to use the following combination of environment variables:

CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
CUDA_COREDUMP_SHOW_PROGRESS=1
CUDA_COREDUMP_GENERATION_FLAGS='skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory'
CUDA_COREDUMP_FILE="/persistent_dir/cuda_coredump_%h.%p.%t"
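One way to apply these variables without exporting them shell-wide is to set them only for the process being debugged. The following is a minimal Python sketch, not something taken from the blog post: the helper names `cuda_coredump_env` and `run_with_coredump` are hypothetical, and `/persistent_dir` is the placeholder directory from the highlight above, so substitute a directory that survives the job.

```python
import os
import subprocess

# Generation flags copied verbatim from the recommendation above.
# The flag names indicate which data is left out of the dump.
COREDUMP_FLAGS = ",".join([
    "skip_nonrelocated_elf_images",
    "skip_global_memory",
    "skip_shared_memory",
    "skip_local_memory",
])


def cuda_coredump_env(dump_dir="/persistent_dir", base_env=None):
    """Return a copy of the environment with the four core dump variables set.

    Hypothetical helper: dump_dir is a placeholder and should point to
    storage that is still available after the process exits.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "CUDA_ENABLE_COREDUMP_ON_EXCEPTION": "1",
        "CUDA_COREDUMP_SHOW_PROGRESS": "1",
        "CUDA_COREDUMP_GENERATION_FLAGS": COREDUMP_FLAGS,
        # The %h, %p and %t placeholders are passed through unchanged,
        # exactly as written in the recommendation.
        "CUDA_COREDUMP_FILE": f"{dump_dir}/cuda_coredump_%h.%p.%t",
    })
    return env


def run_with_coredump(cmd, dump_dir="/persistent_dir"):
    """Run cmd (a list of arguments) with core dumps enabled; return its exit code."""
    return subprocess.run(cmd, env=cuda_coredump_env(dump_dir)).returncode
```

For example, `run_with_coredump(["python", "my_script.py"])` would launch a (hypothetical) failing script with the settings applied only to that child process, leaving the parent shell's environment untouched.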

This blog post analyzes the principles and use cases of CUDA core dumps. The method is effective for issues such as improper kernel launches and kernel exceptions inside CUDA graphs, which makes it a powerful tool for debugging illegal memory accesses and beyond.