This week the Queens’ Computer Scientists were given a fascinating talk by Katie regarding her Part II project on Compiler Optimisation and Automatic Loop Parallelisation.
The project looks at exploiting parallelism automatically through compiler optimisations and distributing work across multiple cores. The benefits of which are quite remarkable: if 50% of a program is parallelizable, the speedup from parallelisation can be as much as 2x, but a program with a parallelizable code fraction of 95% could see a speedup of up to 20x!
To begin with, we were told about the basics of compiler optimisation.
Compiler optimisation aims to make code:
- Smaller E.g. Dead code elimination, Unreachable code elimination
- Faster E.g. Auto-Vectorisation, Auto-Parallelisation
- Cheaper E.g. Register colouring, Cache management
There are many types of optimisation, and each type corresponds to the stage in which the compiler should be optimised. In the image above, we can see that it is possible to optimise the parse tree, intermediate code or target code. Katie chose to run her optimisations on the program’s intermediate code, because in doing so, the intermediate representation is translated from the original source code and so should have obvious basic blocks, meaning a block of code without control flow. As such, distinct units of execution are created, upon which optimisations can be carried out — this saves time and space compared to if the optimisation had been run and stored per individual instruction. Intermediate code also has a smaller number of instructions than the original source code and so it reduces the number of instruction combinations and structure that the optimisation needs to look for.
Optimisation is carried out by analysis and transform passes. The transform pass changes the code written by the programmer, and the analysis pass prevents these changes being dangerous, e.g., leading to unexpected behaviour or segmentation faults.
Unreachable elimination is one property that can be checked when optimising code, it allows for transforming the code to optimise the size of the program. However, with this comes the issue of the transform being safe; will any code be broken?
When the analysis searches for a particular program property that cannot be decided, the optimisation approximates cautiously. One example is overestimating reachability:
Katie adopted a similar approach in her project to implement a way of optimising a compiler to parallelise loops. The aim was to gain program speed up by utilising many cores.
It is possible to take a risk and try to parallelise code that is non-parallelizable, but the safer option is to underestimate parallelizability.
After carrying out safety checks in the analysis of the above loop: checking that there are no inner-loop dependencies, ensuring later iterations are independent of earlier ones, checking that arrays don’t alias to prevent dependencies between loops, bounding the loops with known constants, ensuring there are no unexpected breaks or returns, and omitting loops with reduction variables that occur on a per thread basis, Katie was able to optimise the code, extract loops and run them as separate threads distributed across multiple cores:
To finish her presentation, Katie talked about how she ran a Mandelbrot set calculation with a maximum iteration of 1000 to test the compiler optimisations she had created to spread the iterations across multiple cores. The graph below shows the data from the test: