Chapter 4
Results and Analysis
To properly discuss the results a quantitative measuring scale must be established that will give a meaningful context to the timing data. As such both the physical timings are discussed as well as the speedup ratio of the baseline algorithm to the 4 other algorithms.
4.1 Definition of Speed Up
where:
TB=timing of baseline algorithm over a specific matrix dimension Ti=timing of test algorithm over the same matrix dimension
Si=Speedup Ratio for the specific dimension
So:
When S<1 then the test algorithm is slower than the baseline algorithm
When S=1 then the test algorithm is no faster nor slower than the baseline algorithm
When S>1 then the test algorithm is faster than the baseline algorithm
Note: The 5 algorithms were run on various hardware ranging from a single core PII, a Core2Duo dual-core processor to a Core2Quad processor and represent.
4.2 Dataset results
Serial Performance (timing) – Pentium II
Figure1: Timings taken on a single Pentium II 450Mhz w/ 512Kb cache using g++ version 3.4.4 and with the -O2 compile switch.
As is clear from the timings graph the baseline algorithm takes longer than either of the two serial versions. This is especially evident at the 37 mark and larger. Overall the gaussjordanSerial_Combo seem to be marginally faster than the gaussjordanSerial algorithm.
Serial Performance (Speedup Ratio) – Pentium II
Figure1b: Speedup based on a single Pentium II 450Mhz w/ 512Kb cache using g++ version 3.4.4 and with the -O2 compile switch.
The trend that is visible is that at no time does the baseline algorithm register as the fastest method. Indeed both the gaussjordanSerial and gaussjordanSerial_Combo functions are between 1.6 and 2.5 times faster than gaussj over the range 3 through 300. However at 155 it is clear that something has changed with the implementation of the algorithm since there are some higher and lower peaks. It is probable that there is a hardware limitation that is slowing data from being sent to the processor in a timely fashion since at 151×151 a matrix would be approximately 700Kbits in size. The author would theorise that this is undoubtedly related to either front side bus speed or cache size.
Serial Performance (Speedup) – Core2Duo
Figure2a: Timings taken on a Core2Duo processor 2.33Ghz w/ 4Mb cache using g++ version 4.0.1 and with the -O2 compile switch with OpenMP flag off.
Again the choice of processor does little to change the speedup ratio’s implications that the gaussj() function is still the worst performing Gauss-Jordan Inversion algorithm for serialised use compared with the two other algorithms.
Also the gaussjordanSerial_combo() algorithm is still faster of the two algorithms written by the author.
Parallel Performance (Timing) – Core2Duo
Figure 2b: Timings taken on a Core2Duo processor 2.33Ghz w/ 4Mb cache using Intel compiler version 9.1.039 with the -O2 compile switch and OpenMP flag On.
Interesting events occur when examining the performance of the parallel algorithm in relation to the serial algorithm. At a dimension of 43 it is clear that the gaussjordanOMP algorithm is faster than either of the three serial implementations. The gaussjordanOMP_Combo algorithm becomes the faster than the serial versions at approximately 53. Before wither of these points the obvious optimal algorithm is the gaussjordanSerial_Combo.
It is interesting to note that the serial combination algorithm is continually faster than the two stage version but when comparing parallel algorithms it is clear that for values before 83 the parallel combination algorithm has a lower speedup ratio than the two stage version.
In fulfilling this project it was required to create a ‘wrapper function’ that generally performed the fastest algorithm at each dimension.
As such an overall gaussjordan() function was used.
Figure 2c: Timings taken on a Core2Duo processor 2.33Ghz w/ 4Mb cache using Intel compiler version 9.1.039 with the -O2 compile switch and OpenMP flag On.
When examining the general trend it is clear that the gaussjordan() wrapper uses the most optimal algorithm based on dimension size.
As such it has used the intersection point found on a prior run to optimally change algorithm implementation.
The various peaks and troughs are probably cache-misses and a also a result of running the computation on a busy system.
Parallel Performance (Timing) – Core2Quad
Figure 3a: Timings taken on a Core2Quad processor 1.86Ghz w/ 4Mb cache using Intel compiler version 9.1.046 with the -O2 compile switch and OpenMP flag On.
Again the values follow a similar pattern to the previous example except for the fact that the parallel combination algorithm is faster than the other parallel version over all values.
Over a larger dataset:
Figure 3a: Timings taken on a Core2Quad processor 1.86Ghz w/ 4Mb cache using Intel compiler version 9.1.046 with the -O2 compile switch and OpenMP flag On.
What is very interesting to see here is that both parallel algorithms seem to have peek efficiencies approximately at 870 for the gaussjordanOMP() and at 899 for the gaussjordanOMP_Combo().
This implies that beyond this point the use of computing clusters may be a good way of continuing the remarkable speedup ratios.