2005 Scientific Session Abstracts

Quality Assurance of a Clinical Breast Disease Database and Preliminary Data Mining

Weihong Sun, MD MS 1 (w.sun@wriwindber.org), Yonghong Zhang, PhD 1, Henry Brzeski, PhD 1, David Rosendale, BS 1, Jeffrey Hooke, MD 2, Craig D. Shriver, MD 2, Michael N. Liebman, PhD 1, and Hai Hu, PhD 1
1 Windber Research Institute, Windber, PA
2 Walter Reed Army Medical Center, Washington, DC

Content: In the Clinical Breast Care Project, subjects give consent and are administered up to four questionnaires that together cover ~450 data fields. Pathological diagnoses are made or reviewed by a single pathologist. A new clinical data tracking system was implemented in November 2004 to track these processes, including double data-entry for the questionnaires. However, legacy data for ~1500 subjects, collected with single data-entry through questionnaires that evolved over time, needed to be reassessed and imported into the new system.
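The double data-entry idea mentioned above can be illustrated with a minimal sketch: two independent transcriptions of the same questionnaire are compared field by field, and any disagreement is flagged for review against the paper original. The function name `compare_entries` and the field names are hypothetical; the abstract does not describe how the tracking system implements this step.

```python
def compare_entries(first, second):
    """Return the fields whose values differ between two independent entries.

    Both arguments are dicts mapping a field name to the value as typed.
    A field present in only one entry is reported with None for the other.
    """
    mismatches = {}
    for field in sorted(set(first) | set(second)):
        a = first.get(field)
        b = second.get(field)
        if a != b:
            mismatches[field] = (a, b)
    return mismatches


# Hypothetical example: the second operator transposed two digits.
entry_a = {"age": "52", "height_cm": "165", "weight_kg": "70"}
entry_b = {"age": "52", "height_cm": "156", "weight_kg": "70"}

print(compare_entries(entry_a, entry_b))  # {'height_cm': ('165', '156')}
```

Only the flagged fields need to be checked by hand, which is the main advantage over the full manual comparison required for single-entry legacy data.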

Technology/Design: After an initial assessment, we added a QA step in which the electronic data were manually compared against the original questionnaires. Next, a preliminary data mining study was performed on four categories of patients: benign (Benign, n=542), atypical hyperplasia (Atypical, n=47), carcinoma in situ (IS, n=85), and invasive carcinoma (Invasive, n=258). Sixteen data fields, including age and BMI, were selected for statistical analysis using SPSS.

Results: During QA we corrected 5550 data-entry errors, mostly typos or misinterpretations of the original handwriting, corresponding to an error rate of <1% for single data-entry. Logical data inconsistencies were also noticed, e.g. (live birth #) < (pregnancy #), which prompted the development of a QA matrix. In data mining, one-way ANOVA indicated age differences among the four patient groups. Further analysis showed that Benign patients (mean±95% CI = 46.40±1.29) were younger than the other groups (p<0.001), whereas there were no differences among Atypical (58.19±4.30), IS (57.89±2.48) and Invasive (58.89±1.71). The BMI data had a skewed distribution, and non-parametric analysis indicated a difference among the four groups (n=818, p=0.015). Further analysis suggested that Benign had a lower BMI than Cancer (p<0.038) and IS (p<0.019) but not Atypical.
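As a rough illustration of the two QA ideas in these results, the sketch below applies a small table of logical consistency rules to one record and reproduces the order of magnitude of the reported error rate. Everything named here is hypothetical: the abstract does not give the contents of the QA matrix, so `QA_RULES` and its field names are assumptions, and the first rule encodes one possible reading of the live-birth/pregnancy check (live births should not exceed pregnancies). The error-rate denominator assumes every one of the ~1500 subjects has all ~450 fields filled in; because subjects completed "up to" four questionnaires, the true denominator may be smaller.

```python
# Hypothetical rule table: (rule name, predicate that is True when the record is consistent).
QA_RULES = [
    ("live_births_not_above_pregnancies",
     lambda r: r["live_births"] <= r["pregnancies"]),
    ("age_at_first_birth_not_above_age",
     lambda r: r["age_at_first_birth"] is None
     or r["age_at_first_birth"] <= r["age"]),
]


def check_record(record):
    """Return the names of all rules that the record violates."""
    return [name for name, is_consistent in QA_RULES if not is_consistent(record)]


def error_rate(errors, subjects, fields_per_subject):
    """Fraction of data fields in error, assuming every field is populated."""
    return errors / (subjects * fields_per_subject)


# Hypothetical record with more live births than pregnancies recorded.
record = {"age": 46, "pregnancies": 2, "live_births": 3, "age_at_first_birth": 24}
print(check_record(record))  # ['live_births_not_above_pregnancies']

# 5550 corrected errors over roughly 1500 subjects x 450 fields.
print(f"{error_rate(5550, 1500, 450):.2%}")  # 0.82%
```

Under these assumptions the estimate of about 0.8% is consistent with the reported rate of <1%. Keeping the rules in a table rather than in scattered conditionals makes it easy to add a check whenever a new kind of inconsistency is noticed.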

Conclusions: After this QA step and preliminary data mining, we are confident that these legacy data are of high quality. Our clinical database is potentially important to breast disease studies.