Update tools.md with more documentation about U-test. Fixes https://github.com/google/benchmark/issues/1491

commit b5aade1810 (parent 4931aefb51)

docs/tools.md: 140 changed lines

@@ -186,6 +186,146 @@ Benchmark Time CPU Time Old

This is a mix of the previous two modes: two (potentially different) benchmark
binaries are run, and a different filter is applied to each one.

As you can see, the values in the `Time` and `CPU` columns are calculated as
`(new - old) / |old|`.

### Note: Interpreting the output

Performance measurements are an art, and performance comparisons are doubly so.
Results are often noisy and don't necessarily have large absolute differences to
them, so just by visual inspection, it is not at all apparent if two
measurements are actually showing a performance change or not. It is even more
confusing with multiple benchmark repetitions.

Thankfully, what we can do is use statistical tests on the results to determine
whether the performance has statistically-significantly changed. `compare.py`
uses the [Mann–Whitney U
test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test), with the null
hypothesis being that there's no difference in performance.
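
If you want to sanity-check such a comparison by hand, the same kind of test can
be run directly on the per-repetition timings. Below is a minimal sketch using
`scipy`; the timings are made up, and this is only an illustration of the test,
not `compare.py`'s internal code path:

```python
# Illustrative sketch: a two-sided Mann-Whitney U test run by hand on
# hypothetical per-repetition wall-clock times from an "old" and a "new" binary.
from scipy.stats import mannwhitneyu

time_old = [90.1, 90.4, 89.8, 90.2, 90.0, 90.3]  # made-up repetitions, old build
time_new = [77.2, 76.9, 77.4, 77.1, 77.0, 77.3]  # made-up repetitions, new build

# Null hypothesis: there is no difference in performance between the two sets.
stat, p_value = mannwhitneyu(time_old, time_new, alternative="two-sided")
print(f"U statistic: {stat}, p-value: {p_value:.4f}")
```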

**The output below is a summary of a benchmark comparison with statistics
provided for a multi-threaded process.**

```
Benchmark                                                  Time        CPU    Time Old   Time New    CPU Old    CPU New
-----------------------------------------------------------------------------------------------------------------------------
benchmark/threads:1/process_time/real_time_pvalue        0.0000     0.0000    U Test, Repetitions: 27 vs 27
benchmark/threads:1/process_time/real_time_mean         -0.1442    -0.1442          90         77         90         77
benchmark/threads:1/process_time/real_time_median       -0.1444    -0.1444          90         77         90         77
benchmark/threads:1/process_time/real_time_stddev       +0.3974    +0.3933           0          0          0          0
benchmark/threads:1/process_time/real_time_cv           +0.6329    +0.6280           0          0          0          0
OVERALL_GEOMEAN                                          -0.1442    -0.1442           0          0          0          0
```

--------------------------------------------

Here's a breakdown of each row:

**benchmark/threads:1/process_time/real_time_pvalue**: This shows the _p-value_ for
the statistical test comparing the performance of the process running with one
thread. A value of 0.0000 suggests a statistically significant difference in
performance. The comparison was conducted using the U Test (Mann-Whitney
U Test) with 27 repetitions for each case.

**benchmark/threads:1/process_time/real_time_mean**: This shows the relative
difference in mean execution time between two different cases. The negative
value (-0.1442) implies that the new process is faster by about 14.42%. The old
time was 90 units, while the new time is 77 units.

**benchmark/threads:1/process_time/real_time_median**: Similarly, this shows the
relative difference in the median execution time. Again, the new process is
faster by 14.44%.
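
As a quick sanity check, the mean and median rows follow from the
`(new - old) / |old|` formula quoted earlier. The sketch below plugs in the
rounded times printed in the table, so it lands on the median's -0.1444 rather
than exactly on the mean's -0.1442:

```python
# Relative difference as computed by the formula (new - old) / |old|,
# using the rounded times from the table above (90 -> 77).
time_old = 90.0
time_new = 77.0

relative_change = (time_new - time_old) / abs(time_old)
print(f"{relative_change:+.4f}")  # prints -0.1444
```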

**benchmark/threads:1/process_time/real_time_stddev**: This is the relative
difference in the standard deviation of the execution time, which is a measure
of how much variation or dispersion there is from the mean. A positive value
(+0.3974) implies there is more variance in the execution time in the new
process.

**benchmark/threads:1/process_time/real_time_cv**: CV stands for Coefficient of
Variation. It is the ratio of the standard deviation to the mean. It provides a
standardized measure of dispersion. An increase (+0.6329) indicates more
relative variability in the new process.
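
For reference, the coefficient of variation of a single set of repetitions is
just its standard deviation divided by its mean; the `_cv` row then reports the
relative change of that quantity between old and new. A hypothetical sketch with
made-up timings:

```python
# Hypothetical sketch: coefficient of variation (CV) of one set of repetitions.
import statistics

timings = [77.2, 76.9, 77.4, 77.1, 77.0, 77.3]  # made-up per-repetition times

mean = statistics.mean(timings)
stddev = statistics.stdev(timings)  # sample standard deviation
cv = stddev / mean
print(f"mean={mean:.2f}, stddev={stddev:.3f}, cv={cv:.5f}")
```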

**OVERALL_GEOMEAN**: Geomean stands for geometric mean, a type of average that is
less influenced by outliers. The negative value indicates a general improvement
in the new process. However, given the values are all zero for the old and new
times, this seems to be a mistake or placeholder in the output.
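
To illustrate what a geometric mean aggregates, here is a hypothetical sketch
that combines per-benchmark "new / old" time ratios into one overall ratio; the
ratios are invented and this is not `compare.py`'s exact computation:

```python
# Hypothetical sketch: aggregating per-benchmark "new / old" time ratios
# with a geometric mean, then expressing the result as a relative change.
import math

ratios = [0.86, 0.85, 0.86]  # invented new/old ratios for a few benchmarks

geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
print(f"overall ratio: {geomean:.4f}, relative change: {geomean - 1:+.4f}")
```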

-----------------------------------------

Next, let's see what the different columns represent in the above
`compare.py` benchmarking output:

1. **Benchmark:** The name of the function being benchmarked, along with the
   size of the input (after the slash).

2. **Time:** The average time per operation, across all iterations.

3. **CPU:** The average CPU time per operation, across all iterations.

4. **Iterations:** The number of iterations the benchmark ran to get a
   stable estimate.

5. **Time Old and Time New:** These represent the average time it takes for a
   function to run in two different scenarios or versions. For example, you
   might be comparing how fast a function runs before and after you make some
   changes to it.

6. **CPU Old and CPU New:** These show the average amount of CPU time that the
   function uses in two different scenarios or versions. This is similar to
   Time Old and Time New, but focuses on CPU usage instead of overall time.

In the comparison section, the relative differences in both time and CPU time
are displayed for each input size.

A statistically-significant difference is determined by a **p-value**, which is
a measure of the probability that the observed difference could have occurred
just by random chance. A smaller p-value indicates stronger evidence against the
null hypothesis.

**Therefore:**

1. If the p-value is less than the chosen significance level (alpha), we
   reject the null hypothesis and conclude the benchmarks are significantly
   different.
2. If the p-value is greater than or equal to alpha, we fail to reject the
   null hypothesis and treat the two benchmarks as similar (see the sketch
   below).
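
A minimal sketch of that decision rule, assuming the conventional 0.05
significance level purely for illustration:

```python
# Illustrative decision rule: compare a reported p-value against a chosen alpha.
ALPHA = 0.05      # assumed significance level, a common conventional choice
p_value = 0.0000  # taken from the *_pvalue row in the example output above

if p_value < ALPHA:
    print("Reject the null hypothesis: the benchmarks differ significantly.")
else:
    print("Fail to reject the null hypothesis: treat the benchmarks as similar.")
```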

The result of said statistical test is additionally communicated through color
coding:

```diff
+ Green:
```
The benchmarks are _**statistically different**_. This could mean the
performance has either **significantly improved** or **significantly
deteriorated**. You should look at the actual performance numbers to see which
is the case.

```diff
- Red:
```
The benchmarks are _**statistically similar**_. This means the performance
**hasn't significantly changed**.

In statistical terms, **'green'** means we reject the null hypothesis that
there's no difference in performance, and **'red'** means we fail to reject the
null hypothesis. This might seem counter-intuitive if you're expecting 'green'
to mean 'improved performance' and 'red' to mean 'worsened performance'.

```
But remember, in this context:

'Success' means 'successfully finding a difference'.
'Failure' means 'failing to find a difference'.
```

Also, please note that **even if** we determine that there **is** a
statistically-significant difference between the two measurements, it does not
_necessarily_ mean that the actual benchmarks that were measured **are**
different. And vice versa: even if we determine that there is **no**
statistically-significant difference between the two measurements, it does not
necessarily mean that the actual benchmarks that were measured **are not**
different.

### U test

If there is a sufficient repetition count of the benchmarks, the tool can do