Minimizing Intel® ITT API Overhead
Intel® ITT API overhead and its impact on overall application performance depend on the amount of instrumentation code added to the application. When instrumenting an application with the Intel ITT API, balance application performance against instrumentation detail: keep API overhead low while still collecting sufficient performance data.
There are two key approaches to achieving a good balance between the overall performance of the instrumented application and the level of instrumentation detail:
- Instrument only those paths within your application that are important for analysis
- Use Intel ITT domains to enable filtering among specific aspects of your application that can be analyzed separately
Minimizing Instrumentation Overhead
The table below shows the relative magnitude of overhead (as measured in CPU clocks) for the Intel® ITT API calls as implemented by Intel® GPA. When the ITT domain given in an API call is disabled (thus filtering out the associated call) the overhead is always less than 10 clocks regardless of the API.
API call | Overhead (CPU clocks) |
__itt_id_make | <~10 |
__itt_id_create | <~100 |
__itt_id_destroy | <~100 |
__itt_task_begin | <~100 |
__itt_task_begin_fn | <~100 |
__itt_task_end | <~100 |
__itt_task_group | <~100 |
__itt_marker | <~100 |
__itt_metadata_add (float) | <~100 (up to 4 elements) |
__itt_metadata_add (double) | <~100 (up to 4 elements) |
__itt_metadata_add (s16) | <~100 (up to 4 elements) |
__itt_metadata_add (s32) | <~100 (up to 4 elements) |
__itt_metadata_add (s64) | <~100 (up to 4 elements) |
__itt_metadata_add (u16) | <~100 (up to 4 elements) |
__itt_metadata_add (u32) | <~100 (up to 4 elements) |
__itt_metadata_add (u64) | <~100 (up to 4 elements) |
__itt_metadata_add (str) | <~500 |
__itt_metadata_add (wstr) | <~500 |
__itt_relation_add | <~100 |
__itt_relation_add_to_current | <~100 |
__itt_thread_set_nameA | <~1000 |
__itt_thread_set_nameW | <~1000 |
Measured overhead varies with factors such as the CPU architecture and the overall system workload, so the values above may differ from system to system.
Of the APIs in the table above, __itt_task_begin and __itt_task_end are likely the most frequently used. Although their overhead is relatively low, it is not zero, so be careful when calling these APIs inside hot spots of the profiled application.
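To put the table in perspective, a rough upper bound on instrumentation time can be worked out with simple arithmetic. The sketch below is only an illustration: `EstimateOverheadMicroseconds` is a hypothetical helper, not part of the ITT API, and both the per-call bound (read off the table) and the CPU clock rate are inputs you supply as assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the ITT API): upper-bound estimate of the
// time spent inside instrumentation calls, in microseconds.
//   calls            number of instrumented API calls made
//   clocks_per_call  per-call overhead bound taken from the table
//   cpu_hz           assumed CPU clock rate in Hz
double EstimateOverheadMicroseconds(std::uint64_t calls,
                                    std::uint64_t clocks_per_call,
                                    double cpu_hz)
{
    assert(cpu_hz > 0.0);
    const double total_clocks =
        static_cast<double>(calls) * static_cast<double>(clocks_per_call);
    // clocks / (clocks per second) = seconds; scale to microseconds.
    return total_clocks * 1e6 / cpu_hz;
}
```

For example, 1,000,000 tasks mean 2,000,000 begin/end calls. With the table's bound of about 100 clocks per call and an assumed 2 GHz clock, `EstimateOverheadMicroseconds(2000000, 100, 2.0e9)` gives 100,000 microseconds, that is, at most about 0.1 second of added CPU time.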
The metadata API can trace either a single item or an array of items of a given type, so its overhead depends on the number of items traced.
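As a sketch of what this means in practice, the code below records four values with one call instead of four. `MetadataAddStub`, `g_stub_calls`, and `g_stub_items` are hypothetical stand-ins so that the example runs without the ITT library; the stub mirrors only the trailing `count` and `data` arguments of `__itt_metadata_add` and simply counts invocations.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for __itt_metadata_add: it only counts invocations
// and items, so the sketch runs without the ITT library.
static std::size_t g_stub_calls = 0;
static std::size_t g_stub_items = 0;

void MetadataAddStub(std::size_t count, const std::int32_t* data)
{
    (void)data;  // a real call would record the items in the trace
    ++g_stub_calls;
    g_stub_items += count;
}

// One call per item: pays the per-call overhead four times.
void TraceOneByOne(const std::int32_t (&values)[4])
{
    for (const std::int32_t& v : values)
        MetadataAddStub(1, &v);
}

// One call for the whole array: pays the per-call overhead once.
void TraceBatched(const std::int32_t (&values)[4])
{
    MetadataAddStub(4, values);
}
```

With the real API, the batched form would pass 4 as the count and the array as the data pointer to a single `__itt_metadata_add` call, which according to the table stays under roughly 100 clocks for up to 4 elements.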
The Intel ITT API contains a subset of functions that take strings, in either ASCII (UTF-8) or Unicode (UTF-16) form. The ASCII functions are wrappers around the corresponding Unicode functions and consequently use more CPU clocks than their Unicode counterparts.
Using Intel® ITT Domains
The Intel® ITT API has a subset of functions that create domains and string handles. These functions always return the same handle for the same domain name or string, which requires string comparisons and table lookups and can incur serious performance penalties. In addition, the cost of these functions is proportional to the logarithm of the number of created domains or string handles. To minimize the performance impact on the application, invoke these functions as rarely as possible and outside the critical paths of the profiled application. It is best to create domains and string handles at global scope or during application startup.
#include <stdlib.h>     // atoi
#include <ittnotify.h>  // Intel ITT API

// The wide-string literals below assume a Unicode (UTF-16) build.

// Create two domains in the global scope. We can use these domains to control the level of detail
// that gets written to the trace file.
__itt_domain* basic = __itt_domain_create(L"MyFunction.Basic");
__itt_domain* detailed = __itt_domain_create(L"MyFunction.Detailed");
// Create string handles at global scope.
__itt_string_handle* h_my_function = __itt_string_handle_create(L"MyFunction");
__itt_string_handle* h_my_function_arg = __itt_string_handle_create(L"Arg");

// Forward declarations for the functions defined below.
void Foo(int arg);
void FooEx();

void MyFunction(int arg)
{
    __itt_task_begin(basic, __itt_null, __itt_null, h_my_function);
    __itt_metadata_add(basic, __itt_null, h_my_function_arg, __itt_metadata_s32, 1, &arg);
    Foo(arg);
    FooEx();
    __itt_task_end(basic);
}

__itt_string_handle* h_foo = __itt_string_handle_create(L"Foo");
void Foo(int arg)
{
    // Tracing of detailed data is skipped if the detailed domain is disabled.
    __itt_task_begin(detailed, __itt_null, __itt_null, h_foo);
    // Do some work here…
    __itt_task_end(detailed);
}

__itt_string_handle* h_foo_ex = __itt_string_handle_create(L"FooEx");
void FooEx()
{
    // Tracing of detailed data is skipped if the detailed domain is disabled.
    __itt_task_begin(detailed, __itt_null, __itt_null, h_foo_ex);
    // Do some work here…
    __itt_task_end(detailed);
}

// This is my entry point.
int main(int argc, char** argv)
{
    // Fall back to 0 when no command-line argument is supplied.
    MyFunction(argc > 1 ? atoi(argv[1]) : 0);
    return 0;
}
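The lookup cost behind domain and string handle creation can be pictured with a small model. The sketch below is not Intel GPA's implementation; `ModelHandle` and `ModelHandleTable` are hypothetical names, and an ordered map is assumed only because it reproduces the described behavior: the same name always yields the same handle, and each call performs a logarithmic number of string comparisons.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical model of a string-handle table. This is NOT Intel GPA's
// implementation, only an illustration of the behavior described above.
struct ModelHandle {
    std::string name;
};

class ModelHandleTable {
public:
    // Returns the same handle for the same name. std::map::find performs
    // O(log n) string comparisons, which is why repeated calls add up.
    const ModelHandle* Create(const std::string& name)
    {
        auto it = table_.find(name);
        if (it == table_.end())
            it = table_.emplace(name, ModelHandle{name}).first;
        return &it->second;
    }

    std::size_t Size() const { return table_.size(); }

private:
    // Node-based container: element addresses stay valid after inserts.
    std::map<std::string, ModelHandle> table_;
};
```

In this model, calling `Create` inside a hot loop repeats the lookup on every iteration, whereas storing the returned pointer once, as the global handles in the example above do, pays that cost a single time.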
The best way to balance application performance and instrumentation detail is to use the minimum amount of instrumentation that conveys the required information. For example, the following ‘slow’ code fragment calls __itt_metadata_add() on every matching loop iteration, while the ‘faster’ fragment counts the matches in a local variable inside the loop and calls __itt_metadata_add() once after the loop. The two are equivalent when only the number of occurrences is needed.
Slow:
// something() is assumed to be defined elsewhere.
__itt_string_handle* h_cnt = __itt_string_handle_create(L"SomethingOccurred");
__itt_domain* d_foo = __itt_domain_create(L"MyFunction.Foo");
void Foo()
{
    for (int i = 0; i < 1000; ++i)
    {
        // One API call for every occurrence.
        if (something(i)) __itt_metadata_add(d_foo, __itt_null, h_cnt, __itt_metadata_s32, 1, &i);
    }
}
Faster:
// something() is assumed to be defined elsewhere.
__itt_string_handle* h_cnt = __itt_string_handle_create(L"SomethingOccurred");
__itt_domain* d_foo = __itt_domain_create(L"MyFunction.Foo");
void Foo()
{
    int nSomethings = 0;
    for (int i = 0; i < 1000; ++i)
    {
        if (something(i)) nSomethings++;
    }
    // A single API call after the loop reports the total count.
    __itt_metadata_add(d_foo, __itt_null, h_cnt, __itt_metadata_s32, 1, &nSomethings);
}
See Also