Minimizing Intel® ITT API Overhead

Intel® ITT API overhead, and its impact on overall application performance, depends on the amount of instrumentation code added to the application. When instrumenting an application with the Intel ITT API, balance application performance against instrumentation detail so that API overhead stays low while you still collect sufficient performance data.

There are two key approaches to achieving a good balance between the overall performance of the instrumented application and the level of instrumentation detail:

Instrument only those paths within your application that are important for analysis
Use Intel ITT domains to enable filtering among specific aspects of your application that can be analyzed separately

Minimizing Instrumentation Overhead

The table below shows the relative magnitude of overhead (as measured in CPU clocks) for the Intel® ITT API calls as implemented by Intel® GPA. When the ITT domain given in an API call is disabled (thus filtering out the associated call) the overhead is always less than 10 clocks regardless of the API.

API call	Overhead (CPU clocks)
__itt_id_make	<~10
__itt_id_create	<~100
__itt_id_destroy	<~100
__itt_task_begin	<~100
__itt_task_begin_fn	<~100
__itt_task_end	<~100
__itt_task_group	<~100
__itt_marker	<~100
__itt_metadata_add (float)	<~100 (up to 4 elements)
__itt_metadata_add (double)	<~100 (up to 4 elements)
__itt_metadata_add (s16)	<~100 (up to 4 elements)
__itt_metadata_add (s32)	<~100 (up to 4 elements)
__itt_metadata_add (s64)	<~100 (up to 4 elements)
__itt_metadata_add (u16)	<~100 (up to 4 elements)
__itt_metadata_add (u32)	<~100 (up to 4 elements)
__itt_metadata_add (u64)	<~100 (up to 4 elements)
__itt_metadata_add (str)	<~500
__itt_metadata_add (wstr)	<~500
__itt_relation_add	<~100
__itt_relation_add_to_current	<~100
__itt_thread_set_nameA	<~1000
__itt_thread_set_nameW	<~1000

Measured overhead values vary with factors such as the CPU architecture and the overall system workload, so the values above may differ from system to system.

Of the APIs in the table above, __itt_task_begin and __itt_task_end are likely the most frequently used. Although their overhead is relatively low, it is not zero, so be careful when adding these calls inside hot spots of the profiled application.
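To judge whether a hot spot can afford task instrumentation, a rough budget can be computed from the upper bounds in the table. The sketch below is only back-of-the-envelope arithmetic: the ~100-clock figure comes from the table, while the call rate and the 3 GHz clock frequency are assumed values chosen for illustration, and the function name overhead_fraction is hypothetical.

```cpp
#include <cstdint>

// Rough fraction of one core's time spent inside instrumentation calls.
// calls_per_second: how often the instrumented calls execute (assumed by the caller)
// clocks_per_call:  upper bound taken from the overhead table (for example ~100)
// cpu_hz:           core clock frequency in Hz (assumed, for example 3e9 for 3 GHz)
constexpr double overhead_fraction(std::uint64_t calls_per_second,
                                   std::uint64_t clocks_per_call,
                                   double cpu_hz) {
    return static_cast<double>(calls_per_second * clocks_per_call) / cpu_hz;
}

// Example: 1,000,000 task_begin/task_end pairs per second is 2,000,000 calls.
// At ~100 clocks each on an assumed 3 GHz core this is 2e8 / 3e9,
// or roughly 6.7% of one core.
constexpr double kHotLoopOverhead = overhead_fraction(2000000, 100, 3.0e9);
```

The same path executed only a thousand times per second would cost a negligible fraction, which is why the placement of instrumentation matters more than the per-call cost.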

The metadata API can trace either a single item or an array of items of a given type, so its overhead depends on the number of items traced.
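Because the table lists the same bound for up to 4 elements, passing a small array in one call should be cheaper than one call per element. The sketch below illustrates only the difference in call count. record_metadata is a hypothetical stand-in that counts invocations so the example compiles without ittnotify.h; it is not the real __itt_metadata_add, which also takes a domain, an id, a key, and a type.

```cpp
#include <cstddef>

static int g_metadata_calls = 0;  // number of simulated API calls made so far

// Hypothetical stand-in for __itt_metadata_add: it only counts invocations.
inline void record_metadata(std::size_t count, const float* data) {
    (void)count;
    (void)data;
    ++g_metadata_calls;
}

// One call per component: four calls for a 4-element vector.
inline void trace_per_element(const float (&v)[4]) {
    for (std::size_t i = 0; i < 4; ++i) {
        record_metadata(1, &v[i]);
    }
}

// One call for the whole array: the element count is passed instead.
inline void trace_as_array(const float (&v)[4]) {
    record_metadata(4, v);
}
```

With the real API, the array variant corresponds to passing a count of 4 and a pointer to the first element in a single __itt_metadata_add call.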

The Intel ITT API includes a subset of functions that take strings, either ASCII (UTF-8) or Unicode (UTF-16). The ASCII functions are wrappers for the corresponding Unicode functions and consequently use more CPU clocks than their Unicode counterparts.

Using Intel® ITT Domains

Intel® ITT API has a subset of functions that create domains and string handles. These functions always return identical handles for the same domain names and strings, which requires them to perform string comparisons and table lookups that can incur serious performance penalties. In addition, the cost of these functions grows with the logarithm of the number of created domains or string handles. To minimize the performance impact on the application, invoke these functions as rarely as possible and outside the critical paths of the profiled application. It is best to create domains and string handles at global scope or during application startup.
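The cost profile described above can be illustrated with a small model. This is a sketch under stated assumptions, not the actual ITT implementation: Handle, HandleTable, run_uncached, and run_cached are hypothetical names, and std::map is used only because it gives logarithmic lookups with string comparisons.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a string handle; not the real __itt_string_handle.
struct Handle {
    std::string name;
};

// Hypothetical handle table. The same name always yields the same pointer.
// std::map never moves its elements, so returned pointers stay valid.
class HandleTable {
public:
    const Handle* create(const std::string& name) {
        ++lookups_;
        auto it = table_.find(name);  // O(log N) string comparisons
        if (it == table_.end()) {
            it = table_.emplace(name, Handle{name}).first;
        }
        return &it->second;
    }
    int lookups() const { return lookups_; }

private:
    std::map<std::string, Handle> table_;
    int lookups_ = 0;
};

// Slow pattern: the handle is requested inside the loop,
// so every iteration pays for a lookup.
inline void run_uncached(HandleTable& t, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        const Handle* h = t.create("MyFunction");
        (void)h;  // a real program would pass h to a tracing call here
    }
}

// Preferred pattern: one lookup up front, then the pointer is reused.
inline void run_cached(HandleTable& t, int iterations) {
    const Handle* h = t.create("MyFunction");
    for (int i = 0; i < iterations; ++i) {
        (void)h;
    }
}
```

In this model, 1000 iterations cost 1000 lookups in the first pattern and a single lookup in the second, which is the reason for creating handles once at global scope or during startup.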

#include <ittnotify.h>
#include <stdlib.h>  // atoi

// Forward declarations for the helpers defined below.
void Foo(int arg);
void FooEx();

// Create two domains in the global scope. We can use these domains to control the level of detail
// that gets written to the trace file.
// The wide-character literals assume a Windows build with UNICODE defined.
__itt_domain* basic = __itt_domain_create(L"MyFunction.Basic");
__itt_domain* detailed = __itt_domain_create(L"MyFunction.Detailed");

// Create string handles at global scope.
__itt_string_handle* h_my_function = __itt_string_handle_create(L"MyFunction");
__itt_string_handle* h_my_function_arg = __itt_string_handle_create(L"Arg");

void MyFunction(int arg)
{
    __itt_task_begin(basic, __itt_null, __itt_null, h_my_function);
    __itt_metadata_add(basic, __itt_null, h_my_function_arg, __itt_metadata_s32, 1, &arg);
    Foo(arg);
    FooEx();
    __itt_task_end(basic);
}

__itt_string_handle* h_foo = __itt_string_handle_create(L"Foo");

void Foo(int arg)
{
    // Skip tracing detailed data if the detailed domain is disabled.
    __itt_task_begin(detailed, __itt_null, __itt_null, h_foo);
    // Do some work here…
    __itt_task_end(detailed);
}

__itt_string_handle* h_foo_ex = __itt_string_handle_create(L"FooEx");

void FooEx()
{
    // Skip tracing detailed data if the detailed domain is disabled.
    __itt_task_begin(detailed, __itt_null, __itt_null, h_foo_ex);
    // Do some work here…
    __itt_task_end(detailed);
}

// This is my entry point.
int main(int argc, char** argv)
{
    // Guard against a missing command-line argument.
    MyFunction(argc > 1 ? atoi(argv[1]) : 0);
    return 0;
}
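The listing above depends on calls in a disabled domain being filtered out cheaply. The sketch below models that behavior without requiring ittnotify.h. Domain, task_begin, and count_recorded are hypothetical stand-ins; the flags field mirrors the field of the same name in the __itt_domain structure (zero for disabled, non-zero for enabled), and how Intel GPA itself toggles domains is an assumption not covered here.

```cpp
// Hypothetical stand-in for __itt_domain.
struct Domain {
    volatile int flags;  // zero: disabled, non-zero: enabled
};

static int g_events = 0;  // events that reached the simulated collector

// Stand-in for __itt_task_begin: a disabled domain costs only this one check.
inline void task_begin(const Domain* d) {
    if (d->flags == 0) return;  // filtered out
    ++g_events;                 // the real call would record a task here
}

// Issues `calls` task_begin calls in `d` and returns how many were recorded.
inline int count_recorded(Domain* d, int calls) {
    const int before = g_events;
    for (int i = 0; i < calls; ++i) {
        task_begin(d);
    }
    return g_events - before;
}
```

With a basic domain enabled and a detailed domain disabled, only the basic calls are recorded, so detailed instrumentation can stay in the code and be switched on when the extra data is needed.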

The best way to balance application performance against instrumentation detail is to use the minimal amount of instrumentation that conveys the required information. For example, the following ‘slow’ code fragment calls __itt_metadata_add() inside a loop, while the ‘faster’ fragment gets the same job done by keeping a counter inside the loop and calling __itt_metadata_add() once after the loop.

Slow:

__itt_string_handle* h_cnt = __itt_string_handle_create(L"SomethingOccurred");
__itt_domain* d_foo = __itt_domain_create(L"MyFunction.Foo");

void Foo()
{
    for (int i = 0; i < 1000; ++i)
    {
        // One API call for every matching iteration.
        if (something(i)) __itt_metadata_add(d_foo, __itt_null, h_cnt, __itt_metadata_s32, 1, &i);
    }
}

Faster:

__itt_string_handle* h_cnt = __itt_string_handle_create(L"SomethingOccurred");
__itt_domain* d_foo = __itt_domain_create(L"MyFunction.Foo");

void Foo()
{
    int nSomethings = 0;
    for (int i = 0; i < 1000; ++i)
    {
        // Only count inside the loop.
        if (something(i)) nSomethings++;
    }
    // A single API call after the loop reports the total.
    __itt_metadata_add(d_foo, __itt_null, h_cnt, __itt_metadata_s32, 1, &nSomethings);
}
