Evaluating the performance of benchmarks and programs is difficult. There are many “moving parts” in the modern software stack: frequency governors, system load, garbage collectors, and CPU schedulers, among others. All of these affect the execution times of programs, which means that evaluating program performance requires a certain amount of statistical rigour.
Over the weekend I started hacking on srtime, a tool which, like the time(1) command, executes a program and reports how long it took to run. Unlike time, however, srtime performs multiple executions of the program and reports the mean time together with a confidence interval. For example:
$ srtime ./my_benchmark
95% confidence values from 96 iterations:
103.665 103.699 103.733
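
The statistics behind this kind of output are straightforward: time the program repeatedly, take the sample mean, and derive a confidence interval from the standard error of the sample. The sketch below illustrates that calculation in Python, using the normal-approximation critical value of 1.96 for a 95% interval; it is an illustration of the technique under those assumptions, not srtime's actual implementation, and the function names and iteration count are placeholders.

#!/usr/bin/env python
# Illustration of repeated-timing measurement: run a command several
# times, then report the mean wall-clock time and a 95% confidence
# interval. Not srtime's code; names and defaults are placeholders.
import math
import subprocess
import sys
import time

def time_command(argv, iterations=30):
    """Run argv `iterations` times, returning a list of wall-clock times."""
    times = []
    for _ in range(iterations):
        start = time.time()
        subprocess.call(argv)
        times.append(time.time() - start)
    return times

def confidence_interval(times, z=1.96):
    """Return (lower, mean, upper) using a normal approximation.

    z = 1.96 gives a 95% interval; a t-distribution critical value
    would be more appropriate for small sample sizes.
    """
    n = len(times)
    mean = sum(times) / n
    variance = sum((t - mean) ** 2 for t in times) / (n - 1)
    error = z * math.sqrt(variance / n)
    return mean - error, mean, mean + error

if __name__ == "__main__":
    lower, mean, upper = confidence_interval(time_command(sys.argv[1:]))
    print("%.3f %.3f %.3f" % (lower, mean, upper))

Invoked as, say, ./timeit.py ./my_benchmark, this prints the lower bound, mean, and upper bound on a single line, in the same spirit as the srtime output above.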
Future work will add an API for programmatic invocation of the timer interface, with the intention of writing a small suite of tools for performing rigorous and reproducible performance evaluations.