EXTRACTING GENETIC CONDITIONS THAT PREDISPOSE TO CANCER, USING THE ONLINE MENDELIAN INHERITANCE IN MAN

http://65.222.228.150/jjb/omimpre.txt

Jules J. Berman, PhD, MD
National Institutes of Health
Rockville, MD USA


Background: A complete terminology of lesions/conditions related to cancer would contain: 1) a comprehensive nomenclature of tumors; 2) a comprehensive nomenclature of precancers (morphologically identifiable lesions that precede the development of cancer); 3) a comprehensive nomenclature of acquired conditions that increase the risk of cancer (e.g., AIDS and radiation exposure); and 4) a comprehensive nomenclature of genetic conditions that predispose to cancer (such as Li-Fraumeni syndrome and xeroderma pigmentosum). A complete cancer terminology is currently unavailable to researchers and pathologists. The purpose of such a terminology would be to facilitate the integration of biomedical data with lesions of interest to cancer researchers. Data integration enables researchers to discover the medical relevance of heterogeneous data elements. The author has published informatics techniques used to compile nomenclatures 1 and 2. This abstract describes a way of compiling nomenclature 4, using the Online Mendelian Inheritance in Man (OMIM).

Technology: OMIM is a publicly available, comprehensive, curated collection of all inherited conditions in man. It can be downloaded by anonymous ftp from ftp.ncbi.nih.gov, in the /repository/OMIM directory. The June 23, 2003 OMIM file was used. This file is 87,722,918 bytes in length and contains descriptions of 15,113 different inherited conditions. For conditions that are associated with the development of tumors, the OMIM record lists the tumors that have been reported.
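
As a rough illustration of how a flat file of this kind can be split into records, here is a minimal Python sketch. It is not the author's Perl script. It assumes the downloaded file marks each record with a "*RECORD*" line and each field with a "*FIELD* XX" line (NO for the OMIM number, TI for the title, TX for the text); that layout should be checked against the file actually obtained. The sample data, including the OMIM numbers, is invented for the example.

```python
import io


def iter_omim_records(handle):
    """Yield one dict per record, mapping a field code (e.g. 'NO') to its text.

    Assumption: records start with '*RECORD*' and fields with '*FIELD* XX'.
    """
    record, field = None, None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("*RECORD*"):
            # A new record begins; emit the previous one if it had content.
            if record:
                yield {k: "\n".join(v).strip() for k, v in record.items()}
            record, field = {}, None
        elif line.startswith("*FIELD*"):
            parts = line.split()
            field = parts[1] if len(parts) > 1 else None
            if record is None:
                record = {}
            if field is not None:
                record.setdefault(field, [])
        elif record is not None and field is not None:
            record[field].append(line)
    if record:
        yield {k: "\n".join(v).strip() for k, v in record.items()}


# Invented sample data in the assumed layout (not real OMIM entries).
SAMPLE = """*RECORD*
*FIELD* NO
999991
*FIELD* TI
999991 HYPOTHETICAL EXAMPLE SYNDROME; HES
*FIELD* TX
Affected individuals may develop osteosarcoma.
*RECORD*
*FIELD* NO
999992
*FIELD* TI
999992 ANOTHER HYPOTHETICAL CONDITION
*FIELD* TX
No tumors have been reported.
"""

records = list(iter_omim_records(io.StringIO(SAMPLE)))
for rec in records:
    print(rec["NO"], "|", rec["TI"])
```

Reading the file line by line keeps memory use small, which matters for a file of roughly 88 megabytes.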

Design: The Perl script (omimpre.pl) collects OMIM conditions that predispose to neoplastic development. It extracts the following information from OMIM records: 1) the OMIM number of the condition; 2) the name of the condition and its synonymous or closely related terms; and 3) the names of tumors associated with the condition. The script requires an external file (look-up list) containing a comprehensive listing of neoplastic terms. Instructions for obtaining such a file are available at http://65.222.228.150/jjb/ca_terms.txt. The extracted information is written to an XML file. A version of the raw XML output file can be downloaded from http://65.222.228.150/jjb/omimpre.xml.
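
The design described above (match record text against a look-up list of neoplastic terms, then write the number, names, and matched tumors as XML) can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of omimpre.pl: the function names, the whole-phrase matching rule, the XML element names (conditions, condition, name, tumor), and the sample records are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def find_tumor_terms(text, tumor_terms):
    """Return the look-up terms that occur as whole phrases in text, sorted."""
    lowered = text.lower()
    hits = set()
    for term in tumor_terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, lowered):
            hits.add(term.lower())
    return sorted(hits)


def build_xml(records, tumor_terms):
    """Build an XML tree holding only the records that mention a listed tumor."""
    root = ET.Element("conditions")  # hypothetical element names throughout
    for rec in records:
        tumors = find_tumor_terms(rec["text"], tumor_terms)
        if not tumors:
            continue  # no neoplastic term found, so the condition is skipped
        cond = ET.SubElement(root, "condition", {"omim": rec["number"]})
        for name in rec["names"]:
            ET.SubElement(cond, "name").text = name
        for tumor in tumors:
            ET.SubElement(cond, "tumor").text = tumor
    return root


# Invented look-up list and records, for illustration only.
tumor_terms = ["osteosarcoma", "basal cell carcinoma", "glioma"]
sample_records = [
    {"number": "999991",
     "names": ["HYPOTHETICAL EXAMPLE SYNDROME", "HES"],
     "text": "Patients develop osteosarcoma and basal cell carcinoma."},
    {"number": "999992",
     "names": ["ANOTHER HYPOTHETICAL CONDITION"],
     "text": "No tumors have been reported."},
]

root = build_xml(sample_records, tumor_terms)
print(ET.tostring(root, encoding="unicode"))
```

Matching on whole phrases rather than substrings is one possible design choice; it avoids counting a short term that happens to appear inside a longer, unrelated word, at the cost of missing inflected forms that are absent from the look-up list.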

Results: The Perl script produces an output file in about 10 seconds on a 1.6 GHz computer. The output contains 518 conditions. Lynch cancer family syndrome, hereditary nonpolyposis colorectal cancer, cheilitis glandularis, Pasini type epidermolysis bullosa dystrophica, hereditary desmoid disease, Aase-Smith syndrome, familial type thyroid carcinoma, Michelin tire baby syndrome, Oslam syndrome, and Maffucci syndrome are a small sampling of the extracted conditions.

Conclusion: A Perl script that extracts from OMIM the inherited conditions that predispose man to cancer has been entered into the public domain. The Perl script is available at: http://65.222.228.150/jjb/omimpre.txt. The output file is XML, which supports the ready integration of data elements (such as the OMIM identifier and the names of tumors) with other biological databases. The output file can be easily regenerated from newer versions of OMIM.