Performance Optimization of Signal Processing Algorithms for SIMD Architectures

Detta är en Master-uppsats från KTH/Reglerteknik

Sammanfattning: Digital Signal Processing(DSP) algorithms are widely implemented in real time systems.In fields such as digital music technology, many of these said algorithms areimplemented, often in combination, to achieve the desired functionality. When itcomes to implementation, DSP algorithms are performance critical as they havetight deadlines. In this thesis, performance optimization using Single InstructionMultiple Data(SIMD) vectorization technique is performed on the ARM Cortex-A15 architecture for six commonly used DSP algorithms; Gain, Mix, Gain and Mix,Complex Number Multiplication, Envelope Detection and Cascaded IIR Filter. Toensure optimal performance, the instructions should be scheduled with minimalpipeline stalls. This requires execution time to be measured with fine time granularity.First, a technique of accurately measuring the execution time using thecycle counter of the processor’s Performance Management Unit(PMU) along withsynchronization barriers is developed. It was found that the execution time measuredby using the operating system calls have high variations and very low timegranularity, whereas the cycle counter method was accurate and produced reliableresults. The cost associated with the cycle counter method is 75 clock cycles. Usingthis technique, the contribution by each SIMD instruction towards the executiontime is measured and is used to schedule the instructions. This thesis also presentsa guideline on how to schedule instructions which have data dependencies usingthe cycle counter timing execution time measurement technique, to ensure that thepipeline stalls are minimized. The algorithms are also modified, if needed, to favorvectorization and are implemented using ARM architecture specific SIMD instructions.These implementations are then compared to that which are automaticallyproduced by the compiler’s auto-vectorization feature. The execution times of theSIMD implementations was much lower compared to that produced by the compilerand the speedup ranged from 2.47 to 5.11. Also, the performance increaseis significant when the instructions are scheduled in an optimal way. This thesisconcludes that the auto-vectorized code does poorly for complex algorithms andproduces code with a lot of data dependencies causing pipeline stalls, even with fulloptimizations enabled. Using the guidelines presented in this thesis for schedulingthe instructions, the performance of the DSP algorithms have significant improvementscompared to their auto-vectorized counterparts.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)