2005 Scientific Session Abstracts
Data Mining Methods from Heterogeneous Data: Identifying Prostate Cancer Patients Utilizing a Large Linked Oncology Database
Michael Graiser, PhD (michael_graiser@emoryhealthcare.org) 1, Tracey Krogstad 1, Rochelle Victor 1, Michael Keehan, PhD 2, Christopher Flowers, MD 1, Jonathan Simons, MD 1, Milton W. Datta, MD 1; 1 Winship Cancer Institute, Department of Hematology and Oncology, Emory University School of Medicine, Atlanta, GA; 2 NuTec Health Systems, Atlanta, GA
Context: Previous Winship studies found Cancer Registry data to be the most accurate source for identifying patients by diagnosis code in GeneSys SI (GSI), a large linked oncology database. A group of investigators initiated a database search to identify prostate cancer patients and their clinically relevant data. Ultimately, prostate biopsy samples are to be studied in the hope of identifying molecular biomarkers. Based on the previous results, the search focused on Cancer Registry diagnosis data.
Technology: GSI is a Java and web-based oncology research database application of the Winship Cancer Institute. Its SQL-Server data warehouse receives daily updates from Emory Hospital and Clinic administrative and clinical systems stored in the Emory Healthcare data warehouse, an Oracle relational database. Cancer Registry data is supplied from MRS Cancer Registry (IMPAC Medical Systems, Inc., Cambridge, Massachusetts). GSI also contains historical pathology reports from TAPIOCA, a web-based archival search engine developed by the Emory Hospital Department of Pathology.
Design: Parallel queries were developed for the MRS and GSI oncology databases to identify all prostate cancer patients with Emory prostate biopsies. The resulting populations were compared to determine the extent of overlap. Patient lists derived from MRS were also imported into GSI in an effort to obtain treatment and outcomes data, follow-up history, and tissue bank specimen numbers.
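The population comparison described above amounts to a set operation on patient identifiers. The following is a minimal sketch in Python, assuming each query's result has been exported as a collection of identifiers. The function name `compare_cohorts` and the toy identifiers are hypothetical and are not part of GSI or MRS.

```python
def compare_cohorts(mrs_ids, gsi_ids):
    """Partition two patient-ID collections into overlap categories."""
    mrs, gsi = set(mrs_ids), set(gsi_ids)
    return {
        "mrs_only": mrs - gsi,   # found only by the registry query
        "both": mrs & gsi,       # found by both queries
        "gsi_only": gsi - mrs,   # found only by the linked-database query
        "combined": mrs | gsi,   # every patient found by either query
    }


# Toy identifiers for illustration only (not real patient data).
result = compare_cohorts(["P001", "P002", "P003"], ["P002", "P003", "P004"])
counts = {name: len(ids) for name, ids in result.items()}
print(counts)  # {'mrs_only': 1, 'both': 2, 'gsi_only': 1, 'combined': 4}
```

The three disjoint categories always sum to the combined total, which gives a quick consistency check on any reported overlap counts.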
Results: Queries against the MRS Cancer Registry database identified 2376 Emory prostate cancer patients with biopsies. A comparable GSI query returned 1730 patients. The smaller GSI population was traced, in part, to MRS diagnosis data going back to 1977, whereas GSI is based on diagnosis codes collected from a system implemented in 1994. All queries identified a combined total of 2430 patients: 647 unique to MRS, 1729 in both MRS and GSI, and 54 unique to GSI. Verification of diagnoses is ongoing to determine the sensitivity and specificity of each query. GSI also provided PSA results, dates of last patient contact, and tissue bank specimen numbers from pathology reports. TAPIOCA was used to retrieve prostate tissue specimen numbers from pathology reports too old to be found in GSI.
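Once diagnoses have been verified, the sensitivity and specificity of a query follow from its confusion counts. The sketch below uses the standard definitions. The function names and the counts are hypothetical placeholders, not results from this study.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of verified cancer patients that the query retrieved."""
    total = true_pos + false_neg
    if total == 0:
        raise ValueError("no verified positive cases")
    return true_pos / total


def specificity(true_neg, false_pos):
    """Fraction of verified non-cases that the query correctly excluded."""
    total = true_neg + false_pos
    if total == 0:
        raise ValueError("no verified negative cases")
    return true_neg / total


# Hypothetical counts for illustration only (not study data).
tp, fn, tn, fp = 90, 10, 80, 20
print(sensitivity(tp, fn))  # 0.9
print(specificity(tn, fp))  # 0.8
```

Specificity requires a defined pool of verified non-cases, so it can only be computed relative to a chosen reference population.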
Conclusion: Combining a legacy source system with a large linked database containing heterogeneous data proved most effective for identifying a cohort of cancer patients. While the legacy system may contain more historically complete data, the large linked database proved robust as a single source for an array of clinically relevant data points.
