Marriott City Center, Pittsburgh, PA | September 20 - 24, 2009

Use of Self-Configuring Lexical Analysis Approaches for Automated and Semi-Automated Anatomic Pathology Data Extraction and Transformation

Honorable Mention - Bioinformatics

Ulysses J. Balis, University of Michigan; Jerome Yu Cheng, University of Michigan

Content:

Legacy Anatomic Pathology (AP) information systems often lack sophisticated search, lexical extraction, and cross-load tools, which creates barriers to effective use of archival information. Numerous technical approaches have been proposed and demonstrated for this need, but none so far has made use of recent advances in self-configuring lexical analysis algorithms and heuristics. These approaches are compelling because they minimize, or even eliminate, the incremental development and customization of standard lexical packages otherwise needed to produce a fully operable extraction pipeline. To overcome these limitations, we propose and demonstrate a self-configuring Extraction-Transformation-Load (ETL) tool suite that avoids the customization burden inherent in the conventional AP ETL turnkey solutions reported to date.

Technology:

ActiveState Perl, Visual Basic 6 (VB), PHP, HTML, and SQL Server 2005. The exemplar data source was AP data from Cerner PathNet v3.06.

Design:

A CCL script was used to extract approximately 8900 AP cases as unformatted streaming text. After an initial heuristic parse of this data, self-configuring, dynamically adaptive heuristic lexical analysis methods were employed to identify an optimal or near-optimal set of hierarchical regular expressions that would allow for: 1) separation of the text stream into case-level granularity and 2) further reduction of this case-level data into concept-level atomic data elements. The data pipeline used regular expression matching modules in Perl, and heuristic lexical analysis and regular expression generation modules in VB. The resulting granular data was converted to CSV format and bulk-inserted into the SQL Server database. Both VB and PHP/HTML-based search portals were created to peruse and validate the integrity of the extracted data sets.
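
The two-level reduction described above (stream to cases, then cases to atomic fields) can be illustrated with a minimal sketch. It is written in Python for compactness rather than in the Perl and VB modules actually used, and it is not the authors' algorithm. The sample text, the label convention (an upper-case label followed by a colon at the start of a line), and the function names discover_labels, extract_cases and to_csv are all assumptions made for illustration. The "self-configuring" step is reduced to a simple heuristic: labels are discovered from the text itself, and the first label seen is assumed to mark a case boundary.

```python
import csv
import io
import re

# Hypothetical sample of an unformatted text stream. The accession format and
# section labels are invented for illustration; they are not from the source.
SAMPLE_STREAM = """\
ACCESSION: S09-00001
FINAL DIAGNOSIS: Colon, biopsy: tubular adenoma.
GROSS DESCRIPTION: One tan tissue fragment.
ACCESSION: S09-00002
FINAL DIAGNOSIS: Skin, shave: basal cell carcinoma.
GROSS DESCRIPTION: Shave of skin, 0.4 cm.
"""

# Assumed convention: a field label is upper-case text at the start of a line,
# terminated by a colon.
LABEL_RE = re.compile(r"^([A-Z][A-Z ]*[A-Z]):", re.MULTILINE)


def discover_labels(text):
    """Return line-leading labels in order of first appearance."""
    seen = []
    for match in LABEL_RE.finditer(text):
        if match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def extract_cases(text):
    """Generate case-level and field-level regexes from the text, then apply them."""
    labels = discover_labels(text)
    if not labels:
        return [], []
    # Heuristic assumption: the first label seen opens every case.
    boundary = re.escape(labels[0])
    # Longest labels first so a shorter label cannot shadow a longer one.
    alternation = "|".join(
        re.escape(label) for label in sorted(labels, key=len, reverse=True)
    )
    flags = re.MULTILINE | re.DOTALL
    # Level 1: one match per case, ending just before the next boundary label.
    case_re = re.compile(rf"^{boundary}:.*?(?=^{boundary}:|\Z)", flags)
    # Level 2: one match per field, ending just before the next known label.
    field_re = re.compile(
        rf"^({alternation}):[ \t]*(.*?)(?=^(?:{alternation}):|\Z)", flags
    )
    rows = []
    for case in case_re.finditer(text):
        fields = field_re.findall(case.group(0))
        rows.append({label: value.strip() for label, value in fields})
    return labels, rows


def to_csv(labels, rows):
    """Serialize extracted rows as CSV text, one column per discovered label."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=labels)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling extract_cases(SAMPLE_STREAM) returns the discovered labels together with one dictionary per case, and to_csv turns those into text suitable for a bulk insert. A real corpus would need a richer boundary heuristic than "first label seen", for example scoring candidate patterns by how uniformly they partition the stream.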

Results:

Heuristically derived regular expression patterns successfully drove a Perl-based lexical extraction engine, which was applied to the test set (~40 megabytes). Extraction of the entire set took less than 10 seconds. The integrity of the resulting data set was validated with a range of diagnostic term queries initiated from both front ends.
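
One way to picture validation by diagnostic term queries is to compare, for each term, the number of hits returned by the database with the number of source records containing that term. The sketch below is an assumption-laden illustration, not the authors' procedure: it substitutes an in-memory SQLite database for SQL Server 2005, and the table name ap_case, its columns, the sample rows, and the helper names are all hypothetical.

```python
import sqlite3

# Hypothetical extracted records (accession, final diagnosis); invented data.
ROWS = [
    ("S09-00001", "Colon, biopsy: tubular adenoma."),
    ("S09-00002", "Skin, shave: basal cell carcinoma."),
    ("S09-00003", "Colon, polypectomy: tubular adenoma."),
]


def load_cases(rows):
    """Load records into an in-memory SQLite table (stand-in for SQL Server)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ap_case (accession TEXT PRIMARY KEY, final_diagnosis TEXT)"
    )
    conn.executemany("INSERT INTO ap_case VALUES (?, ?)", rows)
    conn.commit()
    return conn


def count_db_hits(conn, terms):
    """Count cases whose diagnosis contains each term, via a SQL LIKE query."""
    hits = {}
    for term in terms:
        cursor = conn.execute(
            "SELECT COUNT(*) FROM ap_case WHERE final_diagnosis LIKE ?",
            (f"%{term}%",),
        )
        hits[term] = cursor.fetchone()[0]
    return hits


def count_source_hits(rows, terms):
    """Count the same terms directly in the source records for comparison."""
    return {
        term: sum(term.lower() in diagnosis.lower() for _, diagnosis in rows)
        for term in terms
    }


def validate(rows, terms):
    """Return True when database and source counts agree for every term."""
    conn = load_cases(rows)
    try:
        return count_db_hits(conn, terms) == count_source_hits(rows, terms)
    finally:
        conn.close()
```

A call such as validate(ROWS, ["adenoma", "carcinoma", "melanoma"]) checks that the loaded table and the source agree on every term, including terms that should return no cases.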

Conclusion:

Self-configuring lexical analysis approaches hold significant promise for simplifying and automating the reliable extraction of hierarchical, structured data from highly variegated, bulk-text legacy AP repositories.
