APIII - Advancing Practice, Instruction & Innovation Through Informatics

Marriott City Center, Pittsburgh, PA | September 20 - 23, 2009

Real-Time Mutual-Information-Based Linear Registration on the Cell Broadband Engine Processor

Shahrokh Daijavad, IBM

Content:

The emerging generation of multi-core processors can accelerate medical imaging applications by exploiting the parallelism available in their algorithms. We have implemented a mutual-information-based three-dimensional (3D) linear registration algorithm on the Cell Broadband Engine (Cell/B.E.) processor. The implementation parallelizes the code across multiple cores and organizes the data structures to reduce memory traffic. With two Cell/B.E. processors, our implementation computes the mutual information for about 33 million pixel pairs per second and registers a pair of 256 x 256 x 30 3D images in about one second. This is significantly faster than a conventional implementation on a traditional microprocessor or even a custom hardware implementation. This work was done in collaboration with the Mayo Clinic, and the experimental results were obtained on a set of clinical MRI images collected at the Mayo Clinic.
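Mutual information between the fixed image F and the resampled moving image M is commonly estimated from their joint intensity histogram as I(F;M) = sum over (a,b) of p(a,b) log[p(a,b) / (p(a) p(b))]. The sketch below is a minimal, plain-Python illustration of that estimate, not the authors' Cell/B.E. code. The function name, the bin count, and the assumption of integer intensities in [0, 256) are hypothetical.

```python
import math
from collections import Counter


def mutual_information(fixed, moving, bins=32, max_value=256):
    """Estimate mutual information (in nats) between two equally sized
    intensity sequences, using a joint histogram with `bins` bins per axis.

    Assumes (hypothetically) integer intensities in [0, max_value).
    """
    if len(fixed) != len(moving) or not fixed:
        raise ValueError("inputs must be non-empty and of equal length")
    scale = bins / max_value
    # Joint histogram of quantized (fixed, moving) intensity pairs.
    joint = Counter((int(f * scale), int(m * scale)) for f, m in zip(fixed, moving))
    # Marginal histograms, obtained by summing the joint histogram.
    fixed_hist, moving_hist = Counter(), Counter()
    for (a, b), count in joint.items():
        fixed_hist[a] += count
        moving_hist[b] += count
    n = len(fixed)
    # p(a,b) * log(p(a,b) / (p(a) p(b))), written with raw counts:
    # (c / n) * log(c * n / (count_a * count_b)).
    return sum(
        (count / n) * math.log(count * n / (fixed_hist[a] * moving_hist[b]))
        for (a, b), count in joint.items()
    )
```

In a registration loop, a function like this would be evaluated for each candidate linear transform (after resampling the moving image) and the transform with the largest value kept; that outer search is omitted here.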

Technology:

The Cell Broadband Engine (Cell/B.E.) processor used in our implementation is the same processor used in Sony's PLAYSTATION 3 (PS3) game console, and it is gaining traction in medical imaging applications. It is an asymmetric, high-performance multi-core processor that combines eight Synergistic Processing Elements (SPEs) with a Power Processing Element (PPE), a general-purpose IBM PowerPC core. Each SPE has a 4-way Single Instruction, Multiple Data (SIMD) engine, a high-speed local store, and a direct memory access (DMA) engine. In a dual-processor system such as the IBM BladeCenter QS20, a high-speed bus connects two of these processors, so 16 SPEs can run in parallel.
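One plausible way to use the 16 SPEs of a two-processor system is to give each SPE a contiguous range of image slices, let it build a partial joint histogram, and sum the partial histograms afterwards (for example, on the PPE). The sketch below simulates that split sequentially in Python; the slice-based split and all names are assumptions for illustration, not the published design.

```python
from collections import Counter

NUM_SPES = 16  # two Cell/B.E. processors with 8 SPEs each


def split_slices(num_slices, workers=NUM_SPES):
    """Contiguous [start, end) slice ranges, one per worker, whose sizes
    differ by at most one slice."""
    base, extra = divmod(num_slices, workers)
    ranges, start = [], 0
    for w in range(workers):
        end = start + base + (1 if w < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def partial_histogram(fixed, moving, start, end):
    """Joint histogram of slices [start, end): the share one SPE would compute."""
    hist = Counter()
    for s in range(start, end):
        hist.update(zip(fixed[s], moving[s]))
    return hist


def parallel_joint_histogram(fixed, moving, workers=NUM_SPES):
    """Compute each worker's share (sequentially here) and sum the partials."""
    total = Counter()
    for start, end in split_slices(len(fixed), workers):
        total.update(partial_histogram(fixed, moving, start, end))
    return total
```

Because histogram counts simply add, the merged result equals a single-core histogram; the split changes only where the work runs. With 30 slices and 16 workers, 14 workers get two slices and two get one, so the load is not perfectly even, and a finer-grained split would balance it better.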

Design:

To accelerate image registration on the Cell/B.E. processor, we had to optimize the program to exploit parallelism at both the task level and the SIMD instruction level, and to use the available memory bandwidth efficiently. Because of the unique characteristics of the Cell/B.E. architecture, two additional techniques can further accelerate the application: data partitioning and code optimization for the SPE SIMD pipeline. We focused on these code optimization and data partitioning techniques, which significantly improve the utilization of the SIMD pipeline and of the memory bandwidth, respectively.
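Each SPE operates directly only on data in its local store, which it fills by DMA, so data partitioning largely comes down to choosing tile sizes that fit. The sketch below sizes tiles so that two buffers per image (as double buffering would need) fit in a reserved share of the 256 KB local store, rounds the tile length down to a multiple of the 4 SIMD lanes, and builds the joint histogram tile by tile. The 64 KB budget, the 16-bit voxel size, and the function names are hypothetical choices, not values from this work; the Python list slices only stand in for DMA transfers.

```python
from collections import Counter

LOCAL_STORE_BYTES = 256 * 1024  # SPE local store (shared by code, stack and data)
TILE_BUDGET_BYTES = 64 * 1024   # hypothetical share reserved for input tiles
BYTES_PER_VOXEL = 2             # hypothetical: 16-bit intensities
SIMD_LANES = 4                  # 4-way SIMD on 32-bit values


def tile_voxels(budget=TILE_BUDGET_BYTES, bytes_per_voxel=BYTES_PER_VOXEL,
                images=2, buffers=2, lanes=SIMD_LANES):
    """Largest tile length (in voxels, a multiple of `lanes`) such that
    `buffers` copies of one tile from each of `images` images fit in `budget`."""
    if budget > LOCAL_STORE_BYTES:
        raise ValueError("budget exceeds the local store size")
    voxels = budget // (bytes_per_voxel * images * buffers)
    return voxels - voxels % lanes


def tiles(n, size):
    """Yield [start, end) ranges covering n voxels in tiles of at most `size`."""
    for start in range(0, n, size):
        yield start, min(start + size, n)


def joint_histogram_tiled(fixed, moving, size):
    """Build the joint histogram one tile at a time; each slice copy below
    stands in for a DMA transfer into the local store."""
    hist = Counter()
    for start, end in tiles(len(fixed), size):
        hist.update(zip(fixed[start:end], moving[start:end]))
    return hist
```

With these assumed parameters a tile holds 8192 voxels, so one 256 x 256 slice spans eight tiles.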

Results:

The graph in Figure 1 shows the computation time per input fixed-image pixel for four sets of experiments. We measured the total computation time to register 97 pairs of image sets and divided it by the total number of input fixed-image pixels. The image sets are clinical MRI images collected at the Mayo Clinic after Institutional Review Board approval. In the graph, the bar labeled SEQ corresponds to the Intel Xeon 5160 processor performing the registration sequentially on one processor core. The remaining bars show the registration time with two Cell/B.E. processors: PAR is a version parallelized for 16 SPEs; SIMD adds SIMDization to the parallelization; and OPT adds optimized data partitioning to the SIMDization and parallelization.
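The per-pixel metric pools the measurements: total time over all 97 registrations divided by the total number of fixed-image pixels, so larger images carry more weight than they would in a mean of per-pair ratios. A small sketch of that calculation and of the speedup ratio used in the conclusion follows; the function names and argument layout are hypothetical.

```python
def time_per_fixed_pixel(times_s, fixed_shapes):
    """Pooled metric: total registration time / total fixed-image pixels.

    times_s: wall-clock seconds for each registered pair.
    fixed_shapes: (x, y, z) dimensions of each pair's fixed image.
    """
    if len(times_s) != len(fixed_shapes) or not times_s:
        raise ValueError("need one time per fixed image and at least one pair")
    total_pixels = sum(x * y * z for x, y, z in fixed_shapes)
    return sum(times_s) / total_pixels


def speedup(baseline_per_pixel, candidate_per_pixel):
    """How many times faster the candidate (e.g. OPT) is than the baseline (e.g. SEQ)."""
    return baseline_per_pixel / candidate_per_pixel
```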

Conclusion:

We have shown that exploiting the multiple processing cores of the Cell/B.E. processor can significantly accelerate the image registration algorithm. Our optimized code on two Cell/B.E. processors at 3.2 GHz is about 11 times faster than sequential code on an Intel Xeon 5160 processor at 3.0 GHz.
