Real-Time Signal Processing in Optical Coherence Tomography (OCT) at Fraunhofer IPT
Tomographic imaging, ophthalmology
Optical coherence tomography (OCT) systems are developed at the Fraunhofer Institute for Production Technology (IPT) for medical and industrial applications. OCT is an imaging technique that relies on a signal processing chain. This chain used in Fraunhofer’s OCT systems covers, e.g., the acquisition of data from a line scan camera, image-related adjustments, a Fourier transformation, and the reduction of signal noise. Since these OCT systems are required to provide high spatial and temporal resolution in production setups, e.g., video frame rate (25 frames/s), the parallelization and tuning of the signal processing chain is crucial.
To achieve video frame rate, we have developed a CUDA version of Fraunhofer’s OCT signal processing chain that targets consumer GPUs as used in customers’ production environments. We reformulated numerous components of the signal processing chain as matrix transformations to enable the usage of well-tuned libraries such as CUBLAS and CUFFT. Furthermore, we focused on reducing data transfer times by overlapping data and compute operations. To this end, we wrap independent parts of the image – so-called B-scans – in CUDA streams and process them asynchronously.
For the performance validation of our CUDA implementation, we establish a performance model that predicts kernel runtimes and data transfer runtimes in such an asynchronous setup. Our performance model relies on existing approaches and interprets and combines them for our real-world code.
Starting point of our performance evaluations is the original serial code version (MSVS) using the Microsoft Visual Compiler (v14.0). Due to the new algorithmic matrix formulations, we also created parallel CPU versions of the code with little additional effort: With the Microsoft Compiler, we used OpenBLAS (MSVS_BLAS) and, with the Intel compiler (v17.0), we used MKL. Our new CUDA implementation was evaluated on two NVIDIA Pascal consumer GPUs (with similar results). To show the benefit of our optimizations with respect to asynchronous and overlapping operations, we present performance results of CUDA_SYNC and CUDA_ASYNC code versions.
Overall, our CUDA version (CUDA_ASYNC) achieves ~50 frames/s even for the largest tested data set and, hence, fulfills our requirement for video frame rate. This means a speedup of 8-16 over the serial MSVS version, and a speedup of 4-5 over the fastest CPU parallel version. Our corresponding GPU performance model is in line with our runtime measurements (deviation below 15 %) and predicts that triple-sized images can still be processed in video frame rate with our CUDA_ASYNC version.
Lines of code: ~6000 LOC for the used test setup
Programming languages & models: C++, CUDA, OpenMP, BLAS Libraries
Development & Cooperation
Tobias Schrödter, RWTH Aachen University
David Pallasch, Fraunhofer IPT, Aachen
Sandra Wienke, Christian Terboven, IT Center/ Chair for High Performance Computing, RWTH Aachen University
Schrödter T., Pallasch D., Wienke S., Schmitt R., Müller M.S. (2019) Modeling and Optimizing Data Transfer in GPU-Accelerated Optical Coherence Tomography. In: Mencagli G. et al. (eds) Euro-Par 2018: Parallel Processing Workshops. Euro-Par 2018. Lecture Notes in Computer Science, vol 11339. Springer, Cham