Biologie Computationnelle

Biologie Computationnelle Pour quoi faire? Protein family Gene sequence Structure Gene networks Gene order Non-coding regions

Bioinformatique Translationnelle PLOS Computational Biology A Peer-Reviewed, Open Access Journal Translational Bioinformatics : a collection of Education Articles, 2012 http://www.ploscollections.org/article/browseissue.action?issue=info:doi/10.1371/issue.pcol.v03.i11 Terrain d'illustration de l'impact de la Biologie Computationnelle : - intégrer des quantités considérables de données moléculaires et cliniques pour une meilleure compréhension des bases moléculaires de maladies qui à son tour modifiera les pratiques cliniques et comme horizon améliorer la santé humaine. Informatics : the study of how to represent, store, search, retrieve and analyze information Imaging Informatics : focuses on images Medical Informatics : concerns medical information BioInformatics : concerns basic biological information Clinical Informatics : focuses on the clinical delivery part of medical informatics Biomedical Informatics : merges bioinformatics and medical informatics

Translationnelle : - Comment améliorer efficacement le diagnostic, le pronostic et le traitement des patients? Création d'appareils médicaux Diagnostic moléculaire Thérapeutique via de petites molécules Vaccins Etc. Maîtrise de l'explosion des connaissances en biologie moléculaire, génétique et génomique. Y a-t-il eu transfert de la connaissance de la structure en double hélice de l'adn vers une amélioration pratique avec un impact sur la santé humaine d'un point de vue de l'innovation technologique? Ce qui est certain, on est capable aujourd'hui de mesurer rapidement : - les séquences ADN (à l'échelle d'un génome entier!) - les séquences ARN et leur expression - les séquences protéique, leur structure, expression et modification - la structure, la présence et quantité de petits métabolites moléculaires

Genes ; DNA ; RNA messengers ; MicroRNAs ; Proteins Signaling molecules and cascades ; Metabolites ; Cellular communication processes ; Cellular organization PubMED : http://www.ncbi.nlm.nih.gov/pubmed UMLS : Unified Medical Language System http://www.nlm.nih.gov/research/umls/ Molecular Informatics : concerns molecular/cellular information Genbank http://www.ncbi.nlm.nih.gov/genbank/ Gene Expression Omnibus http://www.ncbi.nlm.nih.gov/geo Protein Data Bank http://www.wwpdb.org/ KEGG Clinical Informatics : focuses on the clinical delivery part of medical informatics Translational Bioinformatics : the development and application of informatics methods that connect molecular entities to clininal entities : http://www.genome.jp/kegg/ MetaCyc Impact : Médecine personnalisée? Reactome Association de génomes, fouille de marqueurs génétiques Analyse de données génomiques personnelles, fouille de données issues d'enregistrements électroniques etc. http://metacyc.org http://www.reactome.org

4 Grands Chapitres pour cette session : - Imagerie Quantitative - Machine Learning / Data Mining - Représentation par graphes et réseaux - Représentation des connaissances : données, banque de données, ontologies Des technologies/ environnements de développement : - Java : ImageJ, Weka - Python : Biopython, Numpy, Scipy, Matplotlib, Enthought Python Distribution et Canopy - scripts : Perl, Gawk - Inkscape, ImageMagik / Sphinx / XML, SBML, BioPax, GPML, JSON, SQL, nosql, Hadoop Des softwares : - Clustal T-coffee, PathwayAPI, BioGRID, PatternHunter. Des choix, un co-design du cours...

Des travaux pratiques sur : - Intégration de Connaissance (Ontologies, données) - Construction de Réseaux d'interaction de gènes : R, statistiques - Visualisation de l'interactome : PPI, maladies, Cytoscape software, analyse topologique de graphes (network analyzer, arbres de Steiner trees, MST) - Assemblage de connaissance et interprétation : GSEA, GO, KEGG, BioLattice, pathology and ontology based analysis - Priorisation de gènes pathogènes : MLP, Weka - Analyse d'images : ImageJ, cellprofiler, DcellIQ, NeuronStudio, ITK - Visualisation -omiques : R, python - Algorithmes de Graphes, Complexité

Une perspective biologique permanente : - Gene feature recognition : TIS (Translation Initiation Site) TSS (Transcriptional Start Sites) Feature Generation Feature Selection Feature integration Gene finding - Gene expression analysis : Affymetrix Gene Chip Data Gene expression Profile Classification Gene expression Profile Clustering Gene Regulatory Circuits Reconstruction : Differentially Expressed Genes, Gene Interaction Prediction - Sequence Alignment / Comparaison / Homology : Multiple Sequence Alignment (Dynamic Programming) Function assignment to protein sequence (Guilt by Association) Discovery of Active Site or Domain of a function PPI / Proteomic Profile Analysis key mutation site identification - Phylogenetic tree : Construction Comparison -Biological Networks / Graph of Interactions : Natural pathways PPI Networks Protein Complex Prediction

TIS : Translation Initiation Site Recognition/Prediction Un échantillon de cdna 299 HSU27655.1 CAT U27655 Homo sapiens CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT......iEEEEEEEEEEEEEEEEEEEEEEEEEEE EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE Pourquoi le second ATG est-il un TIS? 80 160 240 80 160 240

Gene prediction

T-Cell Epitopes Prediction By Artificial Neural Network Honeyman et al., Nature Biotechnology 16:966-969, 1998

PAS Prediction BEGI N I ncomi ng sequences Feat ure generat i on Feat ure sel ect i on Feat ure i nt egrat i on END SVM in Weka

Histone Promoter Recognition Programs

9 Motifs Discovered by MEME algo in Histone Promoter 5 Region [250,-1] among 127 histone promoters

Diagnosis of Childhood Acute Lymphoblastic Leukemia (ALL) and Optimization of Risk-Benefit Ratio of Therapy Immunophenotyping

Affymetrix GeneChip Micro Array Analysis E2APBX1 MLL T-ALL Hyperdiploid >50 BCR-ABL Novel TEL-AML1

Proteomics Data : Guilt-by-Association Compare T with seqs of known function in a db Assign to T same function as homologs Discard this function as a candidate Confirm with suitable wet experiments

Proteomics Data: Subgraphs in Protein Interaction 318 edges 329 nodes In nucleus of S. cerevisiae Maslov & Sneppen, Science, v296, 2002 Topology of Protein Interaction Networks: Hubs, Cores, Bipartites

Yeast SH3 domain-domain Interaction network: 394 edges, 206 nodes Tong et al. Science, v295. 2002 8 proteins containing SH3 5 binding at least 6 of them

a b Configuration a is less likely than b in protein interaction networks Graph/Network Mining

Prognosis based on Gene Expression Profiling > a1 Decision Tree Based In-Silico Cancer Diagnosis x1 x2 x3 > a2 x4 A B B A A Root node Internal nodes B A Leaf nodes A B B A A

Given a test sample, at most 3 of the 4 genes expression values are needed to make a decision! Yeoh et al., Cancer Cell 1:133-143, 2002; Differentiating MLL subtype from other subtypes of childhood leukemia Training data (14 MLL vs 201 others), Test data (6 MLL vs 106 others), Number of features: 12558

Discovery of Diagnostic Biomarkers for Ovarian Cancer Motivation: cure rate ~ 95% if correct diagnosis at early stage Proteomic profiling data obtained from patients serum samples The first data set by Petricoin et al was published in Lancet, 2002 Data set of June-2002. 253 samples: 91 controls and 162 patients suffering from the disease; 15154 features (proteins, peptides, precisely, mass/charge identities) SVM: 0 errors; Naïve Bayes: 19 errors; k-nn: 15 errors.

Mining Errors from Bio Databases Uninformative sequences Invalid values Undersized sequences Ambiguity Example Meaningless Seqs Dubious sequences Vector contaminated sequence Crossannotation error RECORD Annotation error Among the 5,146,255 protein records queried using Entrez to the major protein or translated nucleotide databases, 3,327 protein sequences are shorter than four residues (as of Sep, 2004). In Nov 2004, the total number of undersized protein sequences increases to 3,350. Among 43,026,887 nucleotide records queried using Entrez to major nucleotide databases, 1,448 records contain sequences shorter than six bases (as of Sep, 2004). In Nov 2004, the total number of undersized nucleotide sequences increases to 1,711. Sequence structure violation Undersized protein sequences in major databases SINGLE SOURCE DATABASE Sequence redundancy Data provenance flaws Number of records 1200 1015 DDBJ 1000 EMBL 800 600 400 200 GenBank 528 383 364 218 171 116 123 3 0 51 2 0 151 42 1 Erroneous data transformation Incompatible schema 2 Sequence Length PDB SwissProt 125 12 23 0 MULTIPLE SOURCE DATABASE Undersized nucleotide sequences in major databases 3 PIR Number of records ATTRIBUTE 233 228 250 200 DDBJ 150 108 100 50 115108 73 81 69 45 40 6 2 9 3 104 77 51 55 67 2 3 Sequence Length 4 GenBank PDB 24 0 1 EMBL 5

Duplicates detected by association rules 60 49.4 Rule 1. Identical sequences with the same sequence length and not originated from PDB are 99.7% likely to be duplicates. FP% and FN% 50 40 36.3 32.7 30 20 10 6 2.4 1.8 Rule 2. Identical sequences with the same sequence length and of the same species are 97.1% likely to be duplicates. Rule 3. Identical sequences with the same sequence length, of the same species and not originated from PDB are 96.8% likely to be duplicates. 3.8 0.3 0.1 0 le Ru 1 le Ru 2 le Ru 3 9.4 7.9 7.5 5.2 5.7 le Ru 4 le Ru 5 le Ru 6 le Ru Association rules FP% FN% x 1000 Rule 1 S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%) Rule 2 S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%) Rule 3 S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%) Rule 4 S(Seq)=1^ M(PDB)=0 ^ M(DB)=0 (93.1%) Rule 5 S(Seq)=1 ^ M(Seq Length)=1 ^ M(PDB)=0 ^ M(DB)=0 (92.8%) Rule 6 S(Seq)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.4%) Rule 7 S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.1%) 7

Phylogenetic tree construction Mbuti Pygmy Africa Europe Ethiopian Italian English Tibetan Root Asia Estimate order in which populations evolved Based on assimilated freq of many different genes But Japanese Navajo America Cherokee Indonesian Oceania Polynesian Papuan Australian Time since split Austalasia is human evolution a succession of population fissions? Is there such thing as a protoanglo-italian population which split, never to meet again, and became inhabitants of England and Italy?

Predicting interactions using phylogenetic profile Pellegrini et al. PNAS 96, 4285-4288 (1999)

Comparative genomics

GWAS Genome Wide Association Studies