Implementation and Methods
3.1 Program Outline
The main program is split into 10 C++ files of which 5 are header files. There are 5 Gauss-Jordan algorithms, each made contained as a function. In addition to this there exists a ‘wrapper’ function, gaussjordan(), that is used in the final stage of the project when the best algorithms for a particular range of dimensions is established.
The optimisation of the gaussjordan() function is the crucial element to the project. Within it is the method for deciding upon which algorithm is used for a specific matrix dimension.
The determination of the value to which algorithm is used is a two stage process.
3.2.1 Stage One
Upon running the ‘make all’ command a default value for “GO_R” is read in by the compiler during pre-processing. This starting value is arbitrary since upon first running of the program the exact crossover point between the various serial and parallel algorithms is undetermined since not all processors behave the same. Once the program is compiled and run it will cycle through the various dimensional sizes several times (as set by the “NO_AVG_VAL” variable in main.h). The range of dimensions is set by the “DIM_MIN_MAT” and “DIM_MAX_MAT” variables in main.h.
Upon completion of the range of dimensions the function stati() will calculate the intersection point of the best serial and parallel algorithm. This is done by comparing the speedup ratios and selecting the first value where by the it and the following two speedup ratios are greater than or equal to one.
Once this crossover value has been determined it is passed back to the main program whereby it is written out to the optimised.h file by the writeout() function.
3.2.2 Stage Two
The program is then recompiled for a second time and in doing so the gaussjordan.o object file now contains the optimal solution for finding the best algorithm.
3.3 Generalised Structure
The main.cpp code is written as follows:
1. Load text file of randomly generated numbers into an array
2. Dimension loop that cycles through the range of dimensions
a. Repeat Loop 5 times or more to get averaged run timing
i. Create matrices in memory and load values
ii. Run each algorithm and time them individually
iii. Delete matrices in memory
b. Average the timings
3. Print out results and alter “optimised.h”
This is the primary function that generates timings and allows for comparisons to be established. The actual Gauss-Jordan Inversion algorithms are contained in two separate files, gaussj.cpp and gaussjordan.cpp, and are split so that the baseline function and authors created functions are kept distantly separate for easy of readability.
3.4 Implementation Decisions
There are a number of variables beyond the code that needed to be reduced so that it is manageable from a programming point of view. One of which was which optimisation level was to be passed to the compiler at runtime. From experimentation O2 level optimisation was superior to both O1 and O3 options. As such the implementation decision was that O2 optimisation was the default for the project for both compilers.
Additionally there was a choice to be made between the variable type of the matrices with either float or double being the realistic types. Under testing conditions both behaved in a similar way and so it was not a critical issue when it came to discussing the results and conclusion so it was decided that floats would be the default type. It is however possible to swap between floats and doubles by editing a single line in main.h.
Finally since this is a proof of concept rather than a full implementation it was decided that, given that there are 5 competing algorithms to compare for which is the best one given a specific dimension, the problem was minimised to choosing between the best two solutions (one serial and one parallel) that were found through experimentation. As such the gaussjordan() function chooses between gaussjordanSerial_Combo() and gaussjordanOMP_Combo().
3.5 Problems Encountered
When trying to calculate the number of processing cycles each function took to complete it was found to be impossible to accurately use the Intel processor instruction, RDTSC. The RDTSC instruction works by counting the number of processor ‘ticks’ that goes by as a function is run. Since multi-core and SMP systems often transfer a running program and/or threads between processors any counting method relying on a single processor is likely to fail unless its affinity is set to a single processor, but in doing so setting an affinity it is relatively pointless to test a parallelised algorithm since threading on a single processor doesn’t improve the performance.
3.6 Function Descriptions
This is the baseline function to which the other Gauss-Jordan Inverse functions are compared. It is from the text Numerical Recipes in C++ (2nd edition) which contains what would be best described as the defacto standard for numerical computing.
This non-parallelised algorithm is written to follow the classical technique for Gauss-Jordan Inversion of reducing the lower triangle of the first matrix to zeros and then repeating the same steps on the upper triangle.
This algorithm has the same code as the gaussjordanSerial() function except that it also contains OpenMP pragma statements. Both commands for optimising for-loops and for separating sections, that can be run in parallel independently, are used in the code.
Instead of being systematic in waiting for the first matrix’s lower triangle to be reduced to zeros the code is set up as such that both upper and lower triangles are reduced via elementary row operations at the same time and not in the two stage process that the previous algorithms used.
This algorithm is identical in structure to gaussjordanSerial_Combo() except that it has several OpenMP pragma statements placed in it to parallelise the function.
This is the wrapper algorithm that uses the “GO_R” value set in optimised.h to decide the point whereby it switches between using serial and parallel versions.