EXTRACTING CANCER TERMS FROM PUBLICLY AVAILABLE NOMENCLATURES
http://65.222.228.150/jjb/ca_terms.txt
Jules J. Berman, PhD, MD
National Institutes of Health
Rockville, MD USA
Technology: The National Library of Medicine's UMLS (Unified Medical Language System) is a compilation of approximately 100 medical nomenclatures. It contains many neoplastic terms that are missing from ICD-O-3 (International Classification of Diseases for Oncology, 3rd edition); these terms are scattered across the many UMLS source vocabularies.
A Perl script was created that draws on ICD-O-3 and the January 2003 version of the UMLS Metathesaurus to compile a coded listing of cancer terms automatically. Two Metathesaurus files are used: MRCON, a 151 MB file containing over 2 million medical terms, and MRCXT, a 1.7 GB file with more than 27 million records expressing the relationships of the terms contained in MRCON.
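Files of this size are best processed one record at a time rather than loaded whole. The following is a minimal sketch of that idea, written in Python rather than the Perl of ca_terms.pl. It assumes a pipe-delimited layout with the concept identifier in the first field, the language in the second, and the term string in the seventh; these positions are an assumption about MRCON and should be checked against the UMLS documentation for the release in use. The function name and the sample records are hypothetical.

```python
import io

def read_mrcon_like(handle, cui_idx=0, lat_idx=1, str_idx=6):
    """Yield (cui, language, term) tuples from pipe-delimited lines.

    The default column positions are assumed, not taken from the abstract.
    """
    needed = max(cui_idx, lat_idx, str_idx)
    for line in handle:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= needed:
            continue  # skip short or malformed records
        yield fields[cui_idx], fields[lat_idx], fields[str_idx]

# Hypothetical sample records standing in for a real file handle.
sample = io.StringIO(
    "C0019204|ENG|P|L0000001|PF|S0000001|Hepatocellular carcinoma|0|\n"
    "C0019204|ENG|S|L0000002|PF|S0000002|Liver cancer|0|\n"
    "malformed line\n"
)
records = list(read_mrcon_like(sample))
```

Because the function is a generator, the same loop works on an open file of any size without holding more than one line in memory.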
Design: The Perl script (ca_terms.pl) collects all UMLS terms with a "neoplasms" relationship and all ICD-O terms not included in UMLS, preserving the UMLS and ICD-O codes. It then applies three transformations to the terms: 1) expanding the number of terms by adding grammatically equivalent expressions (e.g., adenocarcinoma of colon -> colon adenocarcinoma -> colonic adenocarcinoma), 2) normalizing terms by converting every term to lower case and removing most plural forms by truncating the trailing "s" character, and 3) removing duplicate terms.
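The three transformations can be illustrated with a short sketch, again in Python and not taken from ca_terms.pl. It makes several simplifying assumptions: expansion handles only the "X of Y" to "Y X" reordering (the adjectival form, such as "colonic", would need a lookup table that is not shown), normalization strips one trailing "s" without any exception list, and the concept code C0000000 is a hypothetical placeholder. All function names are hypothetical.

```python
def expand(term):
    """Return the term plus a reordered variant for 'X of Y' -> 'Y X'."""
    variants = [term]
    head, sep, site = term.partition(" of ")
    if sep and head and site:
        variants.append(f"{site} {head}")
    return variants

def normalize(term):
    """Lower-case the term and truncate a single trailing 's'.

    Naive by design: it also shortens non-plural words ending in 's'.
    The exceptions used by the original script are not described.
    """
    term = term.strip().lower()
    if term.endswith("s"):
        term = term[:-1]
    return term

def compile_terms(coded_terms):
    """Expand, normalize, and de-duplicate (code, term) pairs, keeping order."""
    seen = {}  # dict keys preserve insertion order and drop duplicates
    for code, term in coded_terms:
        for variant in expand(term):
            seen.setdefault((code, normalize(variant)), None)
    return list(seen)

result = compile_terms([
    ("C0000000", "Adenocarcinoma of colon"),  # hypothetical code
    ("C0000000", "colon adenocarcinomas"),
])
```

Here the two input terms collapse to two distinct entries, "adenocarcinoma of colon" and "colon adenocarcinoma", because the plural form normalizes to a variant already produced by expansion. Over-truncation by the naive plural rule does no harm to matching as long as any text being looked up is normalized with the same function.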
Results: The Perl script produces an output file in about 7 minutes on a 1.6 GHz computer. The output contains approximately 29,500 cancer terms, of which 23,600 are English; the remainder come from thirteen other languages. An example of a single concept entry is hepatocellular carcinoma, which encompasses 78 terms under the UMLS identifier C0019204, including adenocarcinoma of liver, cancer of liver, carcinoma of liver, hcc, hepatic adenocarcinoma, hepatic cancer, hepatic carcinoma, hepatocarcinoma, hepatocellular adenocarcinoma, hepatocellular cancer, hepatocellular carcinoma, hepatoma, lcc, liver adenocarcinoma, liver cancer, liver carcinoma, liver cell carcinoma, and malignant hepatoma.
Conclusion: A Perl script has been placed in the public domain that extracts more than 29,000 coded cancer terms from two publicly available medical nomenclatures (UMLS and ICD-O). This is more than ten times the number of cancer terms contained in the most recent version of ICD-O.