Omics Data Mining

Overview of our analysis pipeline.

Active since 2000

Research rationale

In the late 1990s, new techniques emerged for measuring gene expression at the scale of entire genomes. Combined with biological knowledge, these quantitative measurements paved the way for deciphering the activity of genes, their interactions and their involvement in various biological processes. However, the analysis of this mass of data remained a manual task, for two reasons. First, although many different sources stored biological data, these sources were completely independent of each other. Second, tools to analyse these data in an automated way did not yet exist. Solutions to integrate heterogeneous data and to automate their analysis were needed more than ever.

Results

A methodology to ease data integration using Semantic Web technologies

Our research started with the idea that Semantic Web technologies, which provide a common framework allowing data to be shared and reused between applications, might be applied to the management of disseminated biological data. We studied and reported the specificities of biological data that made applying these technologies to the life sciences a real challenge. We then proposed a methodology to facilitate data integration using Semantic Web technologies, whose precepts were very close to the rules that currently govern the Web of Data 1 2. We implemented our ideas in AllOnto, a Knowledge Base System capable of storing and querying large sets of RDF/OWL specifications, including reified statements. The software was designed to handle the provenance of information and included reasoning capabilities dealing with type inference, transitivity and built-in OWL constructs such as owl:sameAs and owl:inverseOf.
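The kinds of inference listed above can be illustrated with a small forward-chaining sketch over (subject, predicate, object) triples. This is not AllOnto's implementation or API: the rule set is a simplified subset of RDFS/OWL semantics, and every `ex:` identifier is a hypothetical example chosen for illustration.

```python
# Minimal forward-chaining sketch over (subject, predicate, object) triples.
# Assumption: the rules below are a simplified subset of RDFS/OWL semantics,
# and all "ex:" names are made up; none of this is AllOnto's actual code.

TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SAME_AS = "owl:sameAs"
INVERSE_OF = "owl:inverseOf"
TRANSITIVE = "owl:TransitiveProperty"


def infer(triples):
    """Return the closure of `triples` under a handful of inference rules."""
    facts = set(triples)
    while True:
        new = set()
        transitive = {s for s, p, o in facts if p == TYPE and o == TRANSITIVE}
        inverses = {(s, o) for s, p, o in facts if p == INVERSE_OF}
        inverses |= {(o, s) for s, o in inverses}  # inverseOf works both ways
        for s, p, o in facts:
            # Type inference: instances of a class belong to its superclasses.
            if p == TYPE:
                new |= {(s, TYPE, sup) for c, q, sup in facts
                        if q == SUBCLASS and c == o}
            # owl:sameAs is symmetric; statements about s also hold for o.
            if p == SAME_AS:
                new.add((o, SAME_AS, s))
                new |= {(o, q, x) for y, q, x in facts
                        if y == s and q != SAME_AS}
            # owl:inverseOf: p(s, o) implies q(o, s).
            new |= {(o, q, s) for a, q in inverses if a == p}
            # Transitivity: p(s, o) and p(o, x) imply p(s, x).
            if p in transitive:
                new |= {(s, p, x) for y, q, x in facts if y == o and q == p}
        if new <= facts:  # fixed point reached, nothing left to derive
            return facts
        facts |= new


facts = infer({
    ("ex:partOf", TYPE, TRANSITIVE),
    ("ex:nucleolus", "ex:partOf", "ex:nucleus"),
    ("ex:nucleus", "ex:partOf", "ex:cell"),
    ("ex:encodes", INVERSE_OF, "ex:encodedBy"),
    ("ex:geneA", "ex:encodes", "ex:proteinA"),
    ("ex:geneA", SAME_AS, "ex:geneA_alias"),
    ("ex:geneA", TYPE, "ex:ProteinCodingGene"),
    ("ex:ProteinCodingGene", SUBCLASS, "ex:Gene"),
})
```

After the call, `facts` also contains derived statements such as `("ex:nucleolus", "ex:partOf", "ex:cell")` and `("ex:proteinA", "ex:encodedBy", "ex:geneA")`. A production system would index triples rather than rescan the whole set on each pass; the naive loop is kept here only for readability.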

AllOnto was applied to collect and integrate the data used in the three data mining approaches we investigated.

Automatic gene annotation through a data-driven approach

The data-driven approach involves first identifying groups of genes whose expression shows similar variation and then integrating knowledge about those genes. The research on this topic took the form of an integrated system called THEA (Tools for High-throughput Experiments Analysis). The software integrates several data mining algorithms to automatically annotate groups of genes sharing similar expression profiles with biological information (various ontologies, chromosomal localization, links with diseases). Experiments show that THEA not only makes it quick and easy to reproduce all manually highlighted results, but also helps pinpoint new findings 3 4.
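To make the second step of the data-driven approach concrete, the sketch below scores how strongly an annotation term is over-represented in a group of co-expressed genes, using a hypergeometric tail probability. The choice of this test is an assumption made for illustration (the text does not state which statistics THEA applies), and the gene names, the cluster and the two annotation terms are hypothetical toy data.

```python
from math import comb


def enrichment_p_value(cluster, annotated, population_size):
    """Probability of drawing at least as many annotated genes as observed
    in the cluster, if the cluster were a random sample of the population.
    Assumes `cluster` and `annotated` are subsets of the same population."""
    n = len(cluster)              # cluster size
    K = len(annotated)            # genes carrying the annotation
    k = len(cluster & annotated)  # annotated genes inside the cluster
    total = comb(population_size, n)
    tail = sum(comb(K, i) * comb(population_size - K, n - i)
               for i in range(k, min(K, n) + 1))
    return tail / total


# Toy data: 20 genes, one cluster assumed to come from a prior clustering
# of expression profiles, and two hypothetical annotation terms.
population = {f"g{i}" for i in range(20)}
cluster = {"g0", "g1", "g2", "g3", "g4"}
annotations = {
    "term:cell_cycle": {"g0", "g1", "g2", "g3"},
    "term:transport": {"g4", "g10", "g11", "g12", "g13", "g14", "g15", "g16"},
}

p_values = {term: enrichment_p_value(cluster, genes, len(population))
            for term, genes in annotations.items()}
for term, p in sorted(p_values.items(), key=lambda kv: kv[1]):
    print(f"{term}: p = {p:.4g}")
```

In this toy setting the first term, whose four genes all fall inside the cluster, receives a far smaller p-value than the second. A real analysis over many terms would also need a multiple-testing correction, which is omitted here.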

Automatic gene annotation through a knowledge-driven approach

The knowledge-driven approach consists of first finding co-annotated groups of genes and then integrating data on their expression profiles. The CGGA method (Co-expressed Gene Groups Analysis), which we developed, follows this approach. Tests show that the functional annotations provided by CGGA reduce the complexity of the data analysis problem by integrating various types of information about genes. The experimental results demonstrated the value of the approach and identified relevant information on the biological processes studied 5 6 7 8 9 10.
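The reversed order of operations can be sketched as follows: groups are defined by shared annotations first, and expression data is brought in afterwards to check whether each group behaves coherently. This is only an illustration of the general idea, not the CGGA algorithm; the coherence score (mean pairwise Pearson correlation), the gene names, the terms and the expression values are all assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_pairwise_correlation(genes, expression):
    """Average correlation over all gene pairs in a co-annotated group."""
    pairs = list(combinations(sorted(genes), 2))
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)


# Hypothetical expression profiles over four conditions.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
# Step 1: groups come from shared annotations (hypothetical terms).
groups = {"term:A": {"g1", "g2"}, "term:B": {"g1", "g3"}}
# Step 2: expression data is used to score each annotation-defined group.
coherence = {term: mean_pairwise_correlation(genes, expression)
             for term, genes in groups.items()}
```

Here the group for `term:A` is perfectly co-expressed, while the two genes of `term:B` vary in opposite directions, so only the first group would be retained as a co-expressed, co-annotated set.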

Extraction of association rules from a heterogeneous set of gene data

We proposed the use of Association Rule Discovery (ARD) as a method that can identify rules linking any pieces of biological data and that does not impose any ordering on the use of data sources. We developed an application called GenMiner to fully exploit the capacities of ARD in the context of biological data mining. GenMiner allows the joint use of knowledge about genes and their level of expression under certain conditions in order to discover the relationships between a priori knowledge and experimental measurements. Our method includes a new algorithm, called NorDI (Normal DIscretization algorithm), to discretize gene expression measurements and generate expression profiles. The experiments we conducted confirmed the advantages of GenMiner over known approaches. GenMiner makes it possible to search for association rules using a much smaller minimum support than is possible with traditional approaches. In addition, GenMiner significantly reduces the number of extracted rules, making them much easier for the end user to explore and interpret 11 12 13.
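The notions of support and confidence that underlie association rules can be shown on a toy dataset in which each gene is a transaction mixing annotations with discretized expression levels. The sketch below is not GenMiner or NorDI: the fixed log-ratio cutoff is a placeholder for the discretization step, rules are evaluated by brute force, and all gene names, annotation labels and values are hypothetical.

```python
def discretize(log_ratios, conditions, cutoff=1.0):
    """Turn expression log-ratios into items such as 'heat:up'.
    Assumption: a fixed cutoff stands in for a real discretization
    algorithm such as NorDI, which is not reproduced here."""
    items = set()
    for cond, value in zip(conditions, log_ratios):
        if value >= cutoff:
            items.add(f"{cond}:up")
        elif value <= -cutoff:
            items.add(f"{cond}:down")
    return items


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    s = support(antecedent | consequent, transactions)
    return s, s / support(antecedent, transactions)


conditions = ["heat", "cold"]
# Hypothetical genes: (annotations, expression log-ratios per condition).
genes = {
    "g1": ({"annot:stress_response"}, [2.1, -0.2]),
    "g2": ({"annot:stress_response"}, [1.7, 0.1]),
    "g3": ({"annot:stress_response", "annot:transport"}, [1.2, -1.5]),
    "g4": ({"annot:transport"}, [0.3, -1.1]),
    "g5": ({"annot:transport"}, [-0.4, 0.2]),
}
# One transaction per gene: annotations plus discretized expression items.
transactions = [annots | discretize(values, conditions)
                for annots, values in genes.values()]

s1, c1 = rule_metrics({"annot:stress_response"}, {"heat:up"}, transactions)
s2, c2 = rule_metrics({"annot:transport"}, {"cold:down"}, transactions)
```

The first rule holds for three of the five genes and for every gene carrying the annotation (support 0.6, confidence 1.0); the second holds for two genes out of the three annotated ones (support 0.4, confidence 2/3). Because annotations and expression items sit in the same transaction, rules can link them in either direction, which reflects the absence of any imposed ordering between data sources.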

Alongside these activities, research was also carried out on the parallelization of the BLAST algorithm 14. Collaborations with biologists are ongoing; in this joint research we are primarily responsible for data analysis 15 16 17 18 19 20 21.

Funding

Program: Inter-EPST Program on bioinformatics
Year: 2002-2004
Funder: CNRS, INSERM, INRA, INRIA, Ministry of Research
Grant name: The use of a knowledge base system to analyze microarray data
Project coordinator: Claude Pasquier

Program: CNRS Bio-STIC-LR
Year: 2005-2007
Funder: CNRS, INSERM, INRA, INRIA, Ministry of Research
Grant name: Towards an editor for the subdivision of trees into sub-tree collections: formal and functional criteria for the subdivision process, intra- and inter-collection tree comparisons
Project coordinator: François Chevenet

Program: CNRS post-doctoral grant
Year: 2008-2010
Funder: CNRS
Grant name: Transcriptome mass data use and interpretation using the Massively Parallel Signature Sequencing (MPSS) technologies
Grant recipient: Ronnie Alves
Project coordinator: Claude Pasquier

Program: ANR Methylclonome
Year: 2013-2015
Funder: ANR
Grant name: Analyse de l’héritabilité des traces épigénétiques dans la reproduction clonale (Analysis of the heritability of epigenetic marks in clonal reproduction)
Grant id: ANR-12-BSV6-0006
Project coordinator: Alain Robichon

Program: Gliosplice
Year: 2017-2020
Funder: Institut National du Cancer (INCa)
Grant name: Characterization of alternative splicing networks coordinating brain tumor heterogeneity and treatment resistance commitment
Project coordinator: Mathieu Gabut

Software

  • AllOnto: Knowledge Base System to store and query RDF/OWL specifications
  • CGGA: Extraction of bi-clusters of genes
  • GenMiner: Mining equivalence classes and minimal non-redundant association rules from gene expression data
  • NorDI: Discretization of gene expression data according to the distribution of the dataset
  • THEA: Integrated information processing system dedicated to the annotation of transcriptomic results
  • Thea-Interact: Analysis of the interaction network of Drosophila genes
  • Thea-Online: Web portal using Semantic Web technologies to integrate, query and display information from multiple sources

  1. ↩︎

  2. ↩︎

  3. ↩︎

  4. (2004). Analysis of Microarray Data with THEA. Lyon’s International Multidisciplinary Meeting on Post-Genomics: Integrative Post-Genomics (IPG'04).

    ↩︎

  5. (2005). Exploratory Analysis of Cancer SAGE Data. 9th European Conferences on Principles and Practice of Knowledge Discovery in Databases (PKDD'05), Discovery Challenge.

    ↩︎

  6. ↩︎

  7. ↩︎

  8. ↩︎

  9. ↩︎

  10. ↩︎

  11. (2007). GenMiner: Mining Informative Association Rules from Genomic Data. IEEE International Conference on Bioinformatics and Biomedicine (BIBM'07).

    ↩︎

  12. ↩︎

  13. (2008). Mining Association Rule Bases from Integrated Genomic Data and Annotations. 5th International Conference on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB'08).

    ↩︎

  14. (2004). Distributed BLAST with ProActive. 1st Grid PlugTest. ETSI Headquarters. ProActive User Group.

    ↩︎

  15. ↩︎

  16. ↩︎

  17. ↩︎

  18. ↩︎

  19. ↩︎

  20. ↩︎

  21. ↩︎

Claude Pasquier
Researcher in Computer Science / Computational Biology

Université Côte d’Azur, CNRS, I3S Laboratory, Sophia Antipolis
