ONNX Runtime


Created: 04 May 2023, 06:05 PM | Modified: `=dateformat(this.file.mtime, "dd MMM yyyy, hh:mm a")` | Tags: knowledge, tools


GitHub - daquexian/onnx-simplifier: Simplify your onnx model <https://github.com/daquexian/onnx-simplifier>
GitHub - usefulsensors/onnx_shrink_ray: Shrinks ONNX files by quantizing large float constants into eight bit equivalents. <https://github.com/usefulsensors/onnx_shrink_ray>

I am seeing high latency variance.

On some platforms, onnxruntime may exhibit high latency variance during inference. The cause is the constant cost model that onnxruntime uses to parallelize tasks in its thread pool: for each task, the constant cost model calculates a parallelization granularity among threads that stays fixed until the task finishes. This approach can sometimes produce imbalanced load across threads, causing high latency variance. To mitigate this, onnxruntime provides a dynamic cost model that can be enabled as a session option:

```python
sess_options.add_session_config_entry('session.dynamic_block_base', '4')
```
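For context, a minimal end-to-end sketch of where this line fits; the model path 'model.onnx', the input shape, and CPU-only execution are assumptions for illustration, not from the original note:

```python
import numpy as np
import onnxruntime as ort

# Enable the dynamic cost model before the session is created.
sess_options = ort.SessionOptions()
sess_options.add_session_config_entry('session.dynamic_block_base', '4')

# 'model.onnx' and the (1, 3, 224, 224) input shape are placeholders.
session = ort.InferenceSession('model.onnx', sess_options,
                               providers=['CPUExecutionProvider'])
input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: x})
```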

When set to a positive value, the onnxruntime thread pool parallelizes internal tasks with decreasing granularity. Specifically, suppose a function is expected to run N times in the thread pool. With the dynamic cost model enabled, each thread in the pool claims

`residual_of_N / (dynamic_block_base * num_of_threads)`

iterations whenever it becomes ready to run, where residual_of_N is the work still remaining. Over time, the threads in the pool are therefore likely to be better load-balanced, which lowers the latency variance.
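To make the claim rule concrete, here is a toy simulation of the decreasing granularity in plain Python; the values of N, num_threads, and dynamic_block_base are made up, and this is not the actual onnxruntime thread-pool code:

```python
# Toy model of the claim rule: each ready thread takes
# remaining / (dynamic_block_base * num_threads) iterations.
N = 1000                 # total iterations of the parallelized function
dynamic_block_base = 4
num_threads = 8

remaining = N
blocks = []
while remaining > 0:
    # Block size shrinks as the remaining work is consumed.
    block = max(1, remaining // (dynamic_block_base * num_threads))
    blocks.append(block)
    remaining -= block

print(blocks[:6])   # shrinking block sizes: 31, 30, 29, ...
print(len(blocks))  # many small claims near the end smooth out the load
```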

For the same reason, the dynamic cost model may also improve performance in cases where threads are more likely to be preempted. Per our tests, the best configuration for dynamic_block_base found so far is 4, which lowers the variance while keeping good performance.
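One way to sanity-check this on a specific model is to time repeated runs with and without the option and compare the spread. A sketch, again assuming a placeholder 'model.onnx' and input shape:

```python
import time
import numpy as np
import onnxruntime as ort

def latency_spread(dynamic_block_base=None, runs=200):
    """Measure mean and std-dev of per-run latency (helper for this note)."""
    opts = ort.SessionOptions()
    if dynamic_block_base is not None:
        opts.add_session_config_entry('session.dynamic_block_base',
                                      str(dynamic_block_base))
    sess = ort.InferenceSession('model.onnx', opts,
                                providers=['CPUExecutionProvider'])
    name = sess.get_inputs()[0].name
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        sess.run(None, {name: x})
        times.append(time.perf_counter() - start)
    t = np.asarray(times)
    return t.mean(), t.std()

print('constant cost model :', latency_spread())
print('dynamic, base = 4   :', latency_spread(4))
```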

From <https://onnxruntime.ai/docs/performance/tune-performance/troubleshooting.html#i-am-seeing-high-latency-variance>

```dataview
LIST FROM [[]]
```