2005 Scientific Session Abstracts
Exploring Harmonization of UML models for caBIG™
Lewis J. Frey, Ph.D. (lewis.j.frey@vanderbilt.edu); Department of Biomedical Informatics, Vanderbilt University, Nashville, TN.
Context : Interoperability across the grid is an important issue for caBIG™. Interoperability will enable applications on the grid (e.g., pathology tools within caTISSUE) to communicate with each other effectively. To bring about interoperability, the caDSR currently uses the Semantic Connector to harmonize common data elements (CDE). This work explores the use of correlations between frequency counts of the words associated with UML classes. The aim is to find relationships between classes in different models, and points of harmonization between models, that can assist developers in achieving interoperability.
Technology : An approach similar to Latent Semantic Analysis (LSA) (http://lsa.colorado.edu/) is used to find similarity between classes in UML models. Rankings of class associations are obtained with Pearson’s correlation. The UML models are generated from information in the XML schemas for mass spectrometry (MS) data exchange (i.e., mzXML and mzData).
Design : Classes from mzData are ranked by similarity to classes in mzXML using a “bag of words” approach to find similarities between classes in different MS standards. The “bag of words” technique groups the words for each class in a UML model and counts the number of times each word occurs in each class. The words come from the name and attribute list of a class. For mzXML there are 32 classes with a total of 55 words occurring across the classes. mzData has 33 classes, each of which is compared against the 32 classes in mzXML. This gives a one-to-many ranking of correlation values between each class in mzData and all the classes in mzXML. These rankings are compared against a “gold” standard obtained by having an expert familiar with both models choose the best matching class from mzXML for each class in mzData. As a control, a shuffled version of the mzXML frequency counts is used to rank classes in relation to mzData. The comparison measure is how many of the “gold” standard matches occur in the top five out of thirty-two.
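The ranking step described in the Design can be illustrated with a minimal Python sketch. It is not the code used in this work. The class names, attribute names, camelCase word splitting, and the helper names words_of, pearson, and rank_classes are all hypothetical, and the vocabulary is assumed to be the set of words occurring in the candidate (mzXML-like) classes.

```python
import re
from collections import Counter
from math import sqrt


def words_of(name, attributes):
    """Bag of words for one class: word counts over its name and attribute names."""
    tokens = []
    for identifier in [name, *attributes]:
        # Assumption: identifiers are split on camelCase boundaries.
        tokens += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", identifier)
    return Counter(token.lower() for token in tokens)


def pearson(x, y):
    """Pearson correlation of two equal-length vectors, or None if undefined."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # constant vector, e.g. no words in the vocabulary
    return cov / sqrt(var_x * var_y)


def rank_classes(query, candidates, vocabulary):
    """Candidate class names ordered by decreasing correlation with the query."""
    query_vector = [query[word] for word in vocabulary]
    scored = []
    for candidate_name, counts in candidates.items():
        r = pearson(query_vector, [counts[word] for word in vocabulary])
        if r is not None:
            scored.append((r, candidate_name))
    return [name for r, name in sorted(scored, key=lambda pair: -pair[0])]


# Hypothetical classes, not taken from the actual mzXML or mzData models.
candidates = {
    "Scan": words_of("scan", ["msLevel", "peaksCount", "retentionTime"]),
    "PrecursorMz": words_of("precursorMz", ["precursorIntensity", "precursorCharge"]),
    "Software": words_of("software", ["name", "version", "type"]),
}
vocabulary = sorted(set().union(*candidates.values()))
query = words_of("precursor", ["msLevel", "spectrumRef"])

ranking = rank_classes(query, candidates, vocabulary)
print(ranking)
```

In this sketch, a query class that shares no words with the vocabulary produces a constant vector, so no correlation can be computed and the class receives no ranking.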
Results : Using the mzXML frequency data, 14 of 26 mzData classes have their “gold” standard match in the top five of the ranking. Because the frequency counts cover only 55 words, 7 of the 33 mzData classes do not have enough word overlap with the mzXML frequency data to receive a ranking, leaving 26. With the shuffled frequency data, 2 mzData classes have their “gold” standard match in the top five of the ranking.
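The top-five comparison against the “gold” standard and the shuffled control can be sketched as follows. This is an illustration only: the toy rankings and gold matches are hypothetical, and because the abstract does not say how the frequency counts were shuffled, shuffle_counts assumes that each class's counts are randomly reassigned across the vocabulary.

```python
import random


def hits_in_top_k(rankings, gold, k=5):
    """Count classes whose gold-standard match appears in their top-k candidates."""
    return sum(
        1 for cls, match in gold.items() if match in rankings.get(cls, [])[:k]
    )


def shuffle_counts(counts, vocabulary, rng):
    """Assumed control: permute one class's word counts across the vocabulary."""
    values = [counts.get(word, 0) for word in vocabulary]
    rng.shuffle(values)
    return dict(zip(vocabulary, values))


# Hypothetical toy data: each key is a query class, each list a ranking.
rankings = {"A": ["x", "y", "z"], "B": ["z", "y", "x"], "C": []}
gold = {"A": "x", "B": "x", "C": "y"}
print(hits_in_top_k(rankings, gold, k=2))

# Shuffling keeps the same counts but detaches them from their words.
shuffled = shuffle_counts({"scan": 2, "level": 1}, ["scan", "level", "time"],
                          random.Random(0))
print(shuffled)
```

A class with an empty ranking, such as "C" here, contributes no hit, which corresponds to the classes that could not be ranked.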
Conclusion : This initial exploration of finding similarity between UML model classes supports the application of LSA-like approaches to the task. With a limited vocabulary (i.e., 55 words), the approach outperformed the shuffled control by 14 to 2 “gold” standard matches in the top five. The UML models generated by projects will strongly influence the CDEs created for caBIG™, which in turn determine the standards that will be accepted by caBIG™. Tools that relate UML models can be used by the community to compare and contrast the models that different projects are proposing as candidate standards for caBIG™. The goal would be to help the community develop harmonized models that support interoperability.
Acknowledgement : Thanks to Patrick McConnell of the Duke Bioinformatics Shared Resource for help with the mzXML and mzData UML model generation from XML schema.
