Analysis of Microarray Data with THEA


Microarray technology makes it possible to measure thousands of variables and to compare their values under hundreds of conditions. Once microarray data have been quantified, normalized and classified, the analysis phase is essentially a manual and subjective task based on visual inspection of classes in the light of the vast amount of information available. Data interpretation is currently the bottleneck of such analyses, and there is a clear need for tools that bridge the gap between data processed with mathematical methods and existing biological knowledge. The THEA project (Tools for High-throughput Experiments Analysis) is dedicated to developing tools and methods suited to the analysis of post-genomic data. The first module developed within the project (available from …) makes it possible to automatically annotate data produced by classification systems with selected biological information drawn from a knowledge base, to manually search and browse these annotations, and to automatically generate meaningful generalizations according to statistical criteria. THEA makes use of ontologies, a popular way to model biological concepts and their relationships. THEA currently includes: Gene Ontology (GO), a controlled vocabulary used to annotate a gene product with respect to its molecular functions, cellular localizations and biological processes; specific vocabularies describing the developmental stages and the anatomy of Drosophila melanogaster; the Medical Subject Headings (MeSH), used for indexing biomedical and health-related documents; and InterPro, a collection of protein families, domains and functional sites. The graphical user interface of THEA lets users explore biological data conveniently: they can browse the ontologies, search for a particular field of knowledge and visualize the associated leaves in the classification tree with colored markers.
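The term-driven marking of leaves can be illustrated with a minimal sketch. It assumes that selecting a term also selects every more specific term below it in the ontology, as is usual for GO-style vocabularies; the abstract does not say how THEA implements this. The function names, the toy ontology and the gene identifiers are hypothetical.

```python
def descendants(term, children):
    """Return the term together with every term below it in the ontology."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def marked_leaves(term, children, annotations):
    """List the leaves annotated with the term or any of its descendants."""
    selected = descendants(term, children)
    return sorted(leaf for leaf, terms in annotations.items() if terms & selected)


# Toy ontology (parent -> more specific children); not real GO content.
children = {
    "biological process": ["cell cycle", "metabolism"],
    "cell cycle": ["mitosis"],
}

# Toy annotations for three leaves of a classification tree.
annotations = {
    "gene_a": {"mitosis"},
    "gene_b": {"metabolism"},
    "gene_c": {"cell cycle", "metabolism"},
}

print(marked_leaves("cell cycle", children, annotations))
```

Because the ontology may be a graph in which a term has several parents, the traversal keeps a set of visited terms rather than assuming a strict tree.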
Applying a few terms in succession immediately reveals which clusters pertain to several fields of knowledge at once, or whether the classification can be broadly divided into distinct parts. However, such manual exploration is biased, since it is driven by the user’s knowledge and field of interest. To overcome this bias, THEA includes several data mining algorithms that annotate an entire classification automatically. For each cluster of transcripts, THEA performs a statistical analysis under the null hypothesis of a uniform distribution of annotations. A given cluster is considered significantly enriched for a term if the number of transcripts associated with that term exceeds, at a statistically significant level, the number expected by chance. A second type of analysis handled by THEA highlights possible correlations between the expression levels of certain genes and their location on the genome. Under the null hypothesis of a uniform distribution of genes along each chromosome, we calculate a P-value for a group of co-localized genes appearing in the same cluster. The automatic description of nodes to interpret massive data sets in terms of concepts, only briefly described here, drastically changes the way chip data are analyzed. Using statistical criteria rather than manual inspection guided by selected hypotheses makes this analysis as unbiased as possible.
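The enrichment criterion can be made concrete with a small sketch. The abstract does not name the statistical test used by THEA; the example below assumes an upper-tail hypergeometric test, a common choice for this kind of over-representation analysis. The function names and the numbers in the usage lines are hypothetical.

```python
from math import comb


def expected_hits(cluster_size, population_size, hits_in_population):
    """Number of annotated transcripts expected in the cluster by chance."""
    return cluster_size * hits_in_population / population_size


def enrichment_p_value(cluster_size, hits_in_cluster, population_size, hits_in_population):
    """Upper-tail hypergeometric probability P(X >= hits_in_cluster).

    X is the number of annotated transcripts in a cluster drawn at random,
    without replacement, from the whole population of transcripts.
    """
    max_hits = min(cluster_size, hits_in_population)
    total = comb(population_size, cluster_size)
    tail = sum(
        comb(hits_in_population, k)
        * comb(population_size - hits_in_population, cluster_size - k)
        for k in range(hits_in_cluster, max_hits + 1)
    )
    return tail / total


# Hypothetical numbers: 20 transcripts in total, 5 carry the term,
# and a cluster of 5 transcripts contains 4 of them.
print(expected_hits(5, 20, 5))
print(enrichment_p_value(5, 4, 20, 5))
```

A cluster would then be reported as enriched for the term when this P-value falls below a chosen threshold. Since many terms and clusters are tested together, a correction for multiple testing would normally be applied as well; the abstract does not specify one.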

Lyon’s International Multidisciplinary Meeting on Post-Genomics: Integrative Post-Genomics (IPG'04)