2005 Scientific Session Abstracts
Utilizing Legacy System Data to Initiate Large Linked Oncology Database Searches
Michael Graiser, PhD (michael_graiser@emoryhealthcare.org) 1, Tracey Krogstad 1, Rochelle Victor 1, Michael Keehan, PhD 2, Christopher Flowers, MD 1, Jonathan Simons, MD 1, Milton Datta, MD 1
1 Winship Cancer Institute, Department of Hematology and Oncology, Emory University School of Medicine, Atlanta, GA, 2 NuTec Health Systems, Atlanta, GA
Context: Studies of Winship Cancer Institute oncology databases have identified Cancer Registry data as the gold standard for disease identification using discrete diagnosis codes. These studies utilized GeneSys SI, an oncology research application that retrieves patient data via searches across heterogeneous data sources. The application allows externally derived patient lists to be imported and the relevant clinical data mined, provided that patient identifiers are compatible across systems. We describe a data-scrubbing process that makes the identifiers compatible.
Technology: GeneSys SI (GSI) is a SQL Server-based oncology research application that receives daily updates of patient data originating from Emory Hospital and Clinic administrative and clinical systems and stored in the Emory Healthcare data warehouse, an Oracle relational database. Emory Clinic diagnosis code data originates from IDX (IDX Systems Corporation, Burlington, Vermont). Cancer Registry data is supplied by MRS Cancer Registry (IMPAC Medical Systems, Inc., Cambridge, Massachusetts).
Design: The Cancer Registry MRS system was queried to identify Emory Hospital prostate cancer patients. Patient lists were imported into GSI to obtain treatment and outcome data and to identify biopsy surgical specimens. Data scrubbing was performed to obtain the IDX Emory Clinic medical record numbers needed to link to GSI.
Results: One prostate cancer list from MRS contained 2129 patients. The list was linked to the IDX patient registration table via Social Security number (SSN) to obtain medical record numbers. A total of 185 mismatches were identified and classified as: patients expired prior to the 1994 implementation of IDX (89), SSN incorrect in MRS (56), SSN incorrect/blank in IDX (28), SSN incorrect in both MRS and IDX (1), SSN mismatch with correct value uncertain (4), and patients not found in IDX (7). An additional 30 patients had an SSN of all 9s or all 0s. Three patients had an SSN that matched IDX but belonged to the wrong patient. After the SSN errors were corrected, 2038 medical record numbers were retrieved. This scrubbed list was used to identify 851 patients in GSI, all of whom were shown to be correctly linked. The pre-scrubbed list identified 864 patients, of whom 48 resulted from incorrect links.
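The SSN-based linkage and the kinds of mismatch reported above can be illustrated with a minimal Python sketch. It is not the procedure used in the study, which the abstract does not specify in detail. The field names (`registry_id`, `mrn`, `ssn`, `last_name`), the reason labels, and the use of a last-name comparison to catch an SSN that matches the wrong patient are all assumptions made for illustration, and the sample records are invented.

```python
import re

# Placeholder SSNs (all 9s or all 0s) carry no identifying information.
PLACEHOLDER_SSNS = {"000000000", "999999999"}


def normalize_ssn(raw):
    """Return a 9-digit SSN string, or None if blank, malformed, or a placeholder."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) != 9 or digits in PLACEHOLDER_SSNS:
        return None
    return digits


def link_by_ssn(registry_rows, registration_rows):
    """Link registry patients to registration records via SSN.

    Returns (linked, unmatched): linked maps registry_id -> mrn,
    unmatched maps registry_id -> reason the record needs manual scrubbing.
    """
    # Index registration records by normalized SSN.
    by_ssn = {}
    for row in registration_rows:
        ssn = normalize_ssn(row["ssn"])
        if ssn is not None:
            by_ssn.setdefault(ssn, []).append(row)

    linked, unmatched = {}, {}
    for row in registry_rows:
        rid = row["registry_id"]
        ssn = normalize_ssn(row["ssn"])
        if ssn is None:
            unmatched[rid] = "unusable SSN"
            continue
        candidates = by_ssn.get(ssn, [])
        if not candidates:
            unmatched[rid] = "SSN not found"
        elif len(candidates) > 1:
            unmatched[rid] = "ambiguous SSN"
        elif candidates[0]["last_name"].strip().upper() != row["last_name"].strip().upper():
            # Assumed secondary check: the SSN matched, but to a different patient.
            unmatched[rid] = "SSN matches a different patient"
        else:
            linked[rid] = candidates[0]["mrn"]
    return linked, unmatched


# Invented sample records, for illustration only.
registry = [
    {"registry_id": "R1", "ssn": "123-45-6789", "last_name": "Smith"},
    {"registry_id": "R2", "ssn": "999999999", "last_name": "Jones"},
    {"registry_id": "R3", "ssn": "111223333", "last_name": "Brown"},
    {"registry_id": "R4", "ssn": "222334444", "last_name": "Green"},
]
registration = [
    {"mrn": "M100", "ssn": "123456789", "last_name": "SMITH"},
    {"mrn": "M200", "ssn": "111-22-3333", "last_name": "Black"},
]

linked, unmatched = link_by_ssn(registry, registration)
print(linked)
print(unmatched)
```

Records that end up in `unmatched` would still need manual review against the source systems. The sketch only separates clean links from candidates for scrubbing.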
Conclusion: An import function in large linked databases provides a powerful tool for supplying clinical details on a pre-defined population with high data integrity. Significant data scrubbing may be required to allow linkage between systems. The value of scrubbing is evident in the rate of incorrect links, which fell from 5.6% (48 of 864) before scrubbing to 0% (0 of 851) afterward.
