Presented at the 2000 APIII Conference
SET THEORY DEFINITION AND ALGORITHM FOR MEDICAL DE-IDENTIFICATION
Baltimore VA Medical Center, Baltimore, Maryland
G. William Moore, MD, PhD (1,2,3); Lawrence A. Brown, MD (2,3); Robert E. Miller, MD (3).
(1) Departments of Pathology, Baltimore VA Maryland Health Care System; (2) University of Maryland School of Medicine; (3) The Johns Hopkins Medical Institutions, Baltimore, MD.
Background: There is increasing interest in distributing individually identifiable medical records over the Internet for tissue-archival and epidemiologic studies. However, a record containing sufficient medical detail to have value for these applications might also point unambiguously to a specific patient. We propose a set-theory definition and algorithm for medical de-identification that would prevent even a person with complete knowledge of a particular medical record from positively identifying it in the public database.
Design: As a model, we employ a rectangular medical database in which rows are patients and columns are features expressed as Unified Medical Language System (UMLS) codes, each designated as positive, negative, or missing-value. Publicly known features (age, gender, etc.) are numbered consecutively from 1 to q, and private features from q+1 to q+r; the 'public set' is Q = {-q, ..., -1, 1, ..., q}. The 'posting' for patient i is the set P_i, where k belongs to P_i if the kth feature is positive, -k belongs to P_i if the kth feature is negative, and neither belongs to P_i if the kth feature is missing-value. Posting P_i is 'weakly private' if and only if there exists another posting, P_j, such that (P_i ^ Q) = (P_j ^ Q) and P_i is a subset of P_j; it is 'strongly private' if P_i = P_j (^ denotes set intersection).
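The definitions above can be checked mechanically. The following is a minimal Python sketch, not the authors' implementation: the function names (make_posting, public_set, is_weakly_private, is_strongly_private) and the encoding of a feature as True, False, or None are assumptions made for illustration only.

```python
from typing import FrozenSet, List, Optional, Sequence

Posting = FrozenSet[int]


def make_posting(features: Sequence[Optional[bool]]) -> Posting:
    """Build a posting from one database row (hypothetical encoding).

    features[k-1] is True (positive), False (negative), or None
    (missing-value) for the kth feature.
    """
    return frozenset(
        k if value else -k
        for k, value in enumerate(features, start=1)
        if value is not None  # missing values contribute neither k nor -k
    )


def public_set(q: int) -> Posting:
    """Q = {-q, ..., -1, 1, ..., q}."""
    return frozenset(k for k in range(-q, q + 1) if k != 0)


def is_weakly_private(i: int, postings: List[Posting], Q: Posting) -> bool:
    """True if another posting has the same public part and contains P_i."""
    p_i = postings[i]
    return any(
        j != i and (p_i & Q) == (p_j & Q) and p_i <= p_j
        for j, p_j in enumerate(postings)
    )


def is_strongly_private(i: int, postings: List[Posting]) -> bool:
    """True if another posting is identical to P_i."""
    p_i = postings[i]
    return any(j != i and p_i == p_j for j, p_j in enumerate(postings))


# Toy database with q = 2 public features and r = 2 private features.
rows = [
    [True, False, True, None],   # P_0 = {1, -2, 3}
    [True, False, True, False],  # P_1 = {1, -2, 3, -4}
    [False, True, None, None],   # P_2 = {-1, 2}
]
postings = [make_posting(row) for row in rows]
Q = public_set(2)
```

In this toy database, posting 0 is weakly private (posting 1 shares its public part and contains it) but not strongly private, while postings 1 and 2 are not private under either definition.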
Results: This privacy definition motivates an algorithm for removing ('scrubbing') data elements from the public posting, such that no posting can be matched to a specific patient. With the strong privacy condition, even the patient cannot know that a particular posting is his or her own.
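The abstract does not spell out the scrubbing steps, so the following Python sketch shows only one naive way such scrubbing could work, and should not be read as the authors' algorithm. It assumes postings are frozensets of signed feature numbers as in the Design section. The function name scrub_to_strong_privacy, and the choice to report postings with a unique public part rather than scrub them further, are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Posting = FrozenSet[int]


def scrub_to_strong_privacy(
    postings: List[Posting], Q: Posting
) -> Tuple[List[Posting], List[int]]:
    """Naive scrubbing sketch (an assumption, not the published method).

    Postings sharing the same public part (P ^ Q) are grouped, and every
    element not common to the whole group is removed, so all members of
    the group become identical (strongly private). Postings whose public
    part is unique are left unchanged and their indices are returned so
    that they can be handled separately, e.g. withheld.
    """
    groups: Dict[Posting, List[int]] = defaultdict(list)
    for idx, posting in enumerate(postings):
        groups[posting & Q].append(idx)

    scrubbed = list(postings)
    unmatched: List[int] = []
    for members in groups.values():
        if len(members) < 2:
            unmatched.extend(members)
            continue
        # Keep only the elements shared by every posting in the group.
        common = frozenset.intersection(*(postings[m] for m in members))
        for m in members:
            scrubbed[m] = common
    return scrubbed, sorted(unmatched)


# Toy example with public set Q = {-2, -1, 1, 2}.
Q = frozenset({-2, -1, 1, 2})
postings = [
    frozenset({1, -2, 3}),
    frozenset({1, -2, 3, -4}),
    frozenset({-1, 2}),
]
scrubbed, unmatched = scrub_to_strong_privacy(postings, Q)
```

Here the private element -4 is scrubbed from the second posting, making the first two postings identical, while the third posting has a unique public part and is reported in unmatched. Scrubbing turns data elements into missing values, which is where the handling of missing values in later statistical analysis becomes relevant.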
Conclusion: The proposed algorithm produces a de-identified medical database. A theoretical issue highlighted by the algorithm is the inadequacy of many statistical tests for managing missing values. Another important issue is the translation of existing medical records into UMLS-coded databases.
Related URL: http://www.netautopsy.org/apep00st.htm
