APIII - Advancing Practice, Instruction & Innovation Through Informatics

Marriott City Center, Pittsburgh, PA | September 20 - 23, 2009

Presented at the 2000 APIII Conference


LINGUISTIC INVENTORY OF THE JOHNS HOPKINS SURGICAL PATHOLOGY DATABASE

Baltimore VA Medical Center
Baltimore, Maryland
G. William Moore, MD, PhD

G. William Moore, MD, PhD (1,2,3); Robert E. Miller, MD (1)

(1) Departments of Pathology, The Johns Hopkins Medical Institutions
(2) Baltimore VA Maryland Health Care System
(3) University of Maryland School of Medicine, Baltimore, Maryland

Background: There is increasing interest in encoding free-text surgical pathology reports for data-mining applications, including tissue-archival and epidemiologic studies. A useful first step is to conduct a linguistic inventory of the database.

Design: The linguistic content of the Johns Hopkins Surgical Pathology (JHSP) database was tabulated using machine translation and natural language processing methods. The database spans sixteen years, from March 1984 to the present. Records include patient identifiers, accession and release dates, a brief free-text clinical history, and a free-text surgical pathology diagnosis.

Results: On June 1, 2000, the JHSP database contained 159,071 patients with surgical pathology cases, 361,957 surgical pathology cases, and 694,443 surgical pathology specimens. Age/sex demographics were complete for 99.3% of patients, including 60.1% females and 39.2% males. Organ systems in the database included: gastrointestinal, 28.7%; lymphoreticular, 15.1%; gynecologic, 14.0%; bone, 7.1%; breast, 5.8%. There were 9,004,337 words in total, 27,139 distinct words, and 15,589 words occurring more than once. Word frequencies ranged from 222,175 occurrences of the word 'and' down to a single occurrence for each of 11,550 words; the estimated misspelling rate was 0.1%. Common parts of speech included: nouns, 4,458,102; adjectives, 2,187,808; prepositions, 709,617; noun-or-verbs, 275,683; conjunctions, 262,589. Common multiple-word terms (collocations) included: chronic inflammation, 38,401; lymph nodes, 20,328; soft tissue, 16,104; bone marrow, 14,456. Among correctly spelled words, there were 8,406,088 (93.5%) exact or approximate matches to UMLS concept unique identifiers. In a pilot study, 82.8% of 2,302,366 sentences could be parsed using the Backus-Naur linguistic model.
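
As an illustration only, the kind of tabulation reported above (total words, distinct words, singly- and multiply-occurring words, and adjacent-word collocations) might be sketched as follows. This is not the authors' software: the function name `linguistic_inventory`, the tokenization rules, and the two sample report strings are assumptions invented for the example.

```python
import re
from collections import Counter


def linguistic_inventory(reports):
    """Tabulate word and adjacent-word-pair frequencies for a list of report strings.

    Hypothetical helper: tokenization here is deliberately crude (lowercased
    alphabetic runs), which is an assumption, not the method used in the study.
    """
    word_counts = Counter()
    bigram_counts = Counter()
    for text in reports:
        # Split into rough sentences so that word pairs do not span sentence breaks.
        for sentence in re.split(r"[.;:]\s*", text.lower()):
            words = re.findall(r"[a-z]+", sentence)
            word_counts.update(words)
            # Adjacent word pairs serve as candidate collocations.
            bigram_counts.update(zip(words, words[1:]))
    singles = sum(1 for count in word_counts.values() if count == 1)
    return {
        "total_words": sum(word_counts.values()),
        "distinct_words": len(word_counts),
        "singly_occurring": singles,
        "multiply_occurring": len(word_counts) - singles,
        "word_counts": word_counts,
        "bigram_counts": bigram_counts,
    }


# Invented sample text, not taken from the JHSP database.
reports = [
    "Lymph nodes, chronic inflammation. Soft tissue, no tumor.",
    "Bone marrow biopsy. Chronic inflammation of soft tissue.",
]
inv = linguistic_inventory(reports)
print(inv["total_words"], inv["distinct_words"], inv["singly_occurring"])
print(inv["bigram_counts"].most_common(2))
```

Note that the singly-occurring and multiply-occurring counts always sum to the distinct-word count, as they do in the figures reported above (11,550 + 15,589 = 27,139).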

Conclusion: Results suggest that short sentences with a low misspelling rate in free-text surgical pathology reports can be parsed and mapped to UMLS codes for use in pathology informatics studies.
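
To make the idea of "exact or approximate" matching concrete, here is a minimal sketch that looks a word up in a vocabulary and falls back to fuzzy string matching from Python's standard `difflib` module. The abstract does not describe the matching algorithm actually used, so this is a stand-in technique. The vocabulary `VOCAB`, its placeholder identifiers (which are NOT real UMLS concept unique identifiers), the function `match_concept`, and the 0.8 similarity cutoff are all assumptions.

```python
import difflib

# Hypothetical toy vocabulary. The identifiers are placeholders invented for
# this example and are NOT real UMLS concept unique identifiers.
VOCAB = {
    "inflammation": "X0000001",
    "lymph": "X0000002",
    "tissue": "X0000003",
    "marrow": "X0000004",
}


def match_concept(word, vocab=VOCAB, cutoff=0.8):
    """Return (identifier, match_type) for a word.

    match_type is "exact", "approximate", or "unmatched". The cutoff is an
    arbitrary assumption; a real system would need to tune it.
    """
    word = word.lower()
    if word in vocab:
        return vocab[word], "exact"
    # Fall back to the closest vocabulary entry above the similarity cutoff.
    close = difflib.get_close_matches(word, list(vocab), n=1, cutoff=cutoff)
    if close:
        return vocab[close[0]], "approximate"
    return None, "unmatched"


for w in ["Tissue", "inflamation", "and"]:
    print(w, match_concept(w))
```

In this sketch a misspelling such as "inflamation" still resolves to the entry for "inflammation", while a function word such as "and" stays unmatched because nothing in the toy vocabulary is similar enough.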

Related URL: http://www.netautopsy.org/apep00li.htm
