No matter how fast computer hardware is, there are always programs that don't run fast enough; this is inevitable because as hardware gets faster, software gets more complex and is used to solve more complex problems. On the lower end of the performance spectrum is embedded development, where hardware performance is often severely constrained by cost, power consumption, thermal dissipation, and other factors.
Open Watcom provides tools that help programmers optimize application performance. Examples of how those tools might be used will be given.
Improving Application Performance
Regardless of the underlying reason, when hardware isn't fast enough, it is software that must be made more efficient. Improving software performance is a two-step process:
- Analyzing performance and locating bottlenecks
- Replacing poorly performing algorithms with more efficient ones
The second step is highly application specific and there are no universally applicable methods - no silver bullets to magically improve performance. On the other hand, performance analysis can be to a large extent generalized and its methods may be applied in a wide variety of scenarios. Specialized tools called profilers exist to aid programmers in this task.
Performance analysis is essential to improving the performance of an application, because it determines which parts of the application will benefit most from performance-oriented optimization. Without proper analysis, programmers might expend considerable effort rewriting a function or a module, only to see no discernible performance improvement because the function was perhaps called only once, or not at all, in a typical usage scenario.
The larger an application is, the more important profiling becomes, because it would be extremely difficult or impossible to guess where the application is spending most of its execution time. Programmers' intuition is more likely to be a hindrance than a benefit.
Additional information is available about performance tuning in the context of the Open Watcom tools.
Principles of Performance Profiling
There are two fundamentally different approaches to profiling. We will call them stochastic profiling and instrumented profiling. Each has specific benefits and drawbacks and depending on specific circumstances, one or the other may be preferable.
Stochastic (i.e. probabilistic) profiling usually involves two separate tools. One is called a sampler - wsample in the case of Open Watcom - and the other is the actual profiler, wprof in our case.
The sampler collects performance samples and it may be helpful to explain its operation in detail. The sampler works on a principle very similar to a debugger; wsample uses the same local trap file that the Open Watcom debugger uses, and the profiler also shares MAD and DIP files with the debugger (an overview of the debugger architecture may be of interest).
The analyzed application is first run under the control of the sampler. The sampler periodically interrupts the application and records the current instruction pointer location. When the application finishes, the samples are written to a file, along with information identifying the application (executable name) and its components (DLLs). The profiler can be used later to analyze the information.
This type of profiling is called stochastic because the samples are taken more or less randomly. Note that the sampling period can be set by the user but depends on operating system capabilities; typical sampling periods range roughly between 1 and 55 milliseconds. This method may seem inaccurate, but isn't. As long as a sufficient number of samples is taken (at least one hundred, preferably a thousand or more), the results tend to be highly accurate because probability is working for us: the more time the application spends executing a certain piece of code, the more likely it is to be interrupted at that point.
The other profiling method works by instrumenting the application with profiling hooks and does not require an external sampler; in effect, the application profiles itself. The compiler inserts calls to profiler hooks in the prologue and epilogue of each function. The idea is to record the current time at the entry and exit of each function, which lets us determine how long the function executed. Calls to prologue and epilogue hooks can be turned on with the -ep and -ee switches, respectively.
Open Watcom compilers support generic profiler hooks as well as specialized 'Pentium profiling' which utilizes the RDTSC (read time-stamp counter) instruction available on Pentium and later CPUs, as well as later 486 models. This method allows cycle-exact timing and provides extremely high precision; it can be turned on with the -et switch. An application built with this switch will write a formatted activity record to a file with the extension .prf.
Pros and Cons of Each Approach
Because their respective methods of operation are so different, it should be obvious that the two profiling approaches each have their own advantages and disadvantages. Briefly, the biggest advantage of instrumented profiling is that it can provide highly precise timing data when the necessary hardware support is available. The major drawback is that the insertion of profiling code may significantly alter the performance characteristics of the application, and the profile data, while very precise, may not accurately reflect true application performance. In other words, the application with profiling hooks is not the same as the original application.
The biggest advantage of stochastic profiling is that it requires no modifications to the application; only line number debug information needs to be available to the profiler so that it can correlate recorded instruction pointer locations with the application's source code. The disadvantage of stochastic profiling is that it needs enough samples to provide an accurate picture of application performance, and is thus unusable for applications that execute quickly (of course, such applications usually don't need profiling!).
Both profiling approaches may provide misleading data if the profiled application spends significant time waiting for external events to occur (timers, I/O, etc.). Generally speaking, the more CPU intensive an application is, the more accurate the profiling will be - fortunately, very CPU intensive applications are usually just the ones that need to be profiled. Needless to say, multitasking will also interfere and profiling should only be run on systems that are otherwise idle.
Multi-threading is another challenge. As long as there is only a single CPU in the system, stochastic profiling can effectively ignore threads. When a sample is taken, only the active thread is examined; from the profiler's perspective, threads aren't significant. Because only one thread is executing at a time, this simplification in fact corresponds to how the CPU is utilized. Instrumented profiling is more problematic - any interruption of execution between a function's entry and exit will skew the results.
Following is a summary of advantages and disadvantages of the two approaches:
- Stochastic Sampling
  - Advantages:
    - Works with unmodified executable
    - Requires no special hardware support
    - Easy to use
    - Unaffected by threading
  - Disadvantages:
    - Needs enough samples to be accurate
    - Unusable for applications that execute quickly
    - Requires external tool to take samples
- Instrumented Profiling
  - Advantages:
    - Delivers highly precise results
    - Does not need separate sampler
    - Debug information isn't needed to interpret data
  - Disadvantages:
    - Application must be recompiled for profiling
    - Results may be skewed by insertion of profiling hooks