Text Mining on Free-Text Based Anatomic Pathology Information Systems: A Front-End Data Integration Approach
Zhuang Zuo, MD, PhD, University of South Alabama; Carole W. Boudreaux, MD, University of South Alabama
Content:
Most of today's anatomic pathology information systems still store reports as unstructured free text, which makes the data difficult to query and makes data integrity difficult to maintain. While upgrading to new systems that support relational data storage and synoptic reporting solves the problem for new entries, migrating existing data poses a significant challenge. In this project, we developed a front-end data integration solution that uses text mining to extend relational database capabilities to a free-text based system. This approach not only transforms and enriches the data, but can also serve as an add-on interface and middleware for synoptic reporting on a legacy system.
Technology:
We are currently in transition from the Cerner Classic laboratory information system (LIS) to CoPathPlus. The Cerner client is Reflection terminal emulation software, which supports VBScript. The main components of this project were developed using Microsoft SQL Server and Visual Basic.
Design:
Front-end data integration was implemented to transform free-text data into relational data. Relational tables were designed in a database independent of the Cerner LIS. The diagnosis portion of each pathology report was extracted from the Cerner terminal by VBScript modules and saved as text files. The free-text data was then parsed into database tables, followed by data cleaning and integration using Visual Basic. A graphical user interface was also developed for running queries and displaying search results. The query functions support Boolean search, refined search, searching in multiple fields, and data aggregation. The result table displays a brief view of the cases that match the search criteria and allows backward and forward navigation through the search history. An output function saves the search results to an HTML file for review and printing.
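The parse-then-query flow described above can be illustrated with a minimal sketch. It is written in Python with an in-memory SQLite database standing in for the Visual Basic and SQL Server components of the project, so it is not the project's code. The report layout (ACCESSION, MRN, and DIAGNOSIS labels), the sample reports, the `cases` table, and all function names are hypothetical and chosen only for illustration; the actual Cerner export format is not described here.

```python
import re
import sqlite3

# Hypothetical report layout and made-up sample data (assumptions for illustration only).
SAMPLE_REPORTS = [
    "ACCESSION: S05-1001\nMRN: 100001\nDIAGNOSIS:\nSkin, left arm, biopsy: Basal cell carcinoma.",
    "ACCESSION: S05-1002\nMRN: 100002\nDIAGNOSIS:\nColon, polyp, biopsy:\n  Tubular adenoma.",
    "ACCESSION: S05-1003\nMRN: 100001\nDIAGNOSIS:\nSkin, back, excision: Squamous cell carcinoma.",
]

# One pattern pulls the three assumed fields out of a saved text file's contents.
FIELD_PATTERN = re.compile(
    r"ACCESSION:\s*(?P<accession>\S+)\s+"
    r"MRN:\s*(?P<mrn>\S+)\s+"
    r"DIAGNOSIS:\s*(?P<diagnosis>.+)",
    re.DOTALL,
)


def parse_report(text):
    """Return (accession, mrn, diagnosis), or None if the text does not match the layout."""
    match = FIELD_PATTERN.search(text)
    if match is None:
        return None
    # Basic cleaning: collapse line breaks and repeated spaces in the diagnosis text.
    diagnosis = " ".join(match.group("diagnosis").split())
    return match.group("accession"), match.group("mrn"), diagnosis


def build_database(reports):
    """Parse free-text reports into a relational table that is separate from the source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cases ("
        "accession TEXT PRIMARY KEY, mrn TEXT NOT NULL, diagnosis TEXT NOT NULL)"
    )
    rows = [row for row in map(parse_report, reports) if row is not None]
    conn.executemany("INSERT INTO cases VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def search(conn, all_of=(), none_of=()):
    """Boolean keyword search: every term in all_of must appear, no term in none_of may."""
    clauses, params = [], []
    for term in all_of:
        clauses.append("diagnosis LIKE ?")
        params.append(f"%{term}%")
    for term in none_of:
        clauses.append("diagnosis NOT LIKE ?")
        params.append(f"%{term}%")
    where = " AND ".join(clauses) if clauses else "1"
    sql = f"SELECT accession, mrn, diagnosis FROM cases WHERE {where} ORDER BY accession"
    return conn.execute(sql, params).fetchall()


def cases_per_patient(conn):
    """Simple aggregation: number of parsed cases for each medical record number."""
    return conn.execute(
        "SELECT mrn, COUNT(*) FROM cases GROUP BY mrn ORDER BY mrn"
    ).fetchall()


if __name__ == "__main__":
    conn = build_database(SAMPLE_REPORTS)
    for row in search(conn, all_of=["skin", "carcinoma"], none_of=["squamous"]):
        print(row)
    print(cases_per_patient(conn))
```

Search terms are passed as query parameters rather than concatenated into the SQL string, and the keyword match relies on SQLite's `LIKE`, which is case-insensitive for ASCII text by default. A production system would likely need a full-text index and handling of negated findings, neither of which this sketch attempts.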
Results:
To date, 377,172 free-text anatomic pathology reports (198,187 unique medical record numbers) have been successfully parsed into the project database. Searches using diagnosis and/or protocol keywords in combination with general case information were tested on a PC and returned results almost instantly. The searches were thorough and specific, and they ran significantly faster and returned higher-quality results than SNOMED searches on the Cerner LIS. Data mining, synoptic input, and CoPathPlus interfaces are currently under development.
Conclusion:
This project provided a relatively simple solution for transforming unstructured data into relational data and for extending the query capabilities of a legacy LIS without significant investment in hardware or software. Front-end data integration has zero impact on the existing LIS and functions independently, so it is more flexible and easily adaptable across platforms. This application can also serve as middleware for third-party software, data transfer, and data mining. This design is readily scalable from a Microsoft Access database application to an enterprise database Web application.
