Le cloud IFB et ses services bioinformatiques Christophe BLANCHET Institut Français de Bioinformatique - IFB French Institute of Bioinformatics - ELIXIR-FR CNRS UMS3601 - Gif-sur-Yvette - FRANCE Ecole Cumulo NumBio 2 juin 2015, Aussois
IFB - Ins(tut Français de Bioinforma(que Mission : to make available core bioinformatics resources to the national/ international life science research community. To provide support for national biology programs To provide an IT infrastructure devoted to management and analysis of biological data To act as a middleman between the life science community and the bioinformatics/computer science research community http://www.france-bioinformatique.fr ELIXIR French Node optimizing the interactions and coordination between the national level and ELIXIR and other ESFRI infrastructures in biomedical and environmental field, promoting consistency and complementarities between the components offered by the ELIXIR French node and those of other European nodes 2
Life Sciences Pla<orms in France National platforms (GIS IBISA) Nb Cellular imaging 19 Genomic, Transcriptomic 16 Proteomic 13 Structural biology, biophysic 11 NGS BI C IMG Biological platform (Genomics, IMaGing, PROteomics...) Bioinformatics center Cloud resources Scientists PRO C NGS BI NGS PRO BI C French NGS platforms PRO BI IMG PRO NGS PRO C IMG BI C C Source: omicsmaps.com Regional centers distribute the load in terms of computing and storage, and provide better interactions with scientists Des sites intermédiaires permettent de répartir la charge en terme de stockage et de puissance de calcul tout en assurant une meilleure proximité avec les scientifiques Un déluge de donnée. Blanchet C. et Collin O., 2011, Biofutur, 323: 64-67 3
Na(onal and European Infrastructures FR / EU 4
Bioinforma(cs gateways 5
Many public Galaxy instances Abims CBIB GalaxEast Genotoul Genouest IFB-core Institut Pasteur Institut Curie MetaboHub Migale PRABI Sigenae SouthGreen URGI Lot of bioinformatics tools and services to treat and vizualize the biological data 6
Deploy and operate IFB s na(onal IT infrastructure as a cloud 7
IFB e- Infrastructure Mission : to make available core bioinformatics resources to the national/international life science research community. To provide a French IT infrastructure devoted to management and analysis of biological data To collaborate with international infrastructure (ELIXIR) Goal : help scientists and engineers to deploy and use their tools e-infrastructure: provide hardware, data collections and bioinformatics tools Current resources A national hub: IFB-core hosted at CNRS IDRIS SC center A network of regional centers: 31 bioinformatics plateforms - 15,000 cores - 5 PB Create a federation of clouds for life sciences 8
Federa(on of Clouds APLIBIO IFB-NE IFB-GO IFB-core PRABI Current clouds IFB-core GenOuest URGI BISTRO/IPHC BiLille/Univ.Lille IFB-SO IFB-GS Next? 9
IFB- core s cloud IFB-core # Compute Cores # TB Storage # TB RAM Max VM size Technology Location Pilot 200 50 2 40c 256GB StratusLab CNRS-IDRIS, Paris 2016-S1 3,000 500 -?144c 3TB? StratusLab CNRS-IDRIS, Paris 2017 10,000 2,000 -?? StratusLab CNRS-IDRIS, Paris PaaS NGS, imaging, statistics, S Iaa RENATER 10giga Scientists Sha red FS launch jobs SaaS Master Workers Virtualization Layer Frontend Web portal Pdisk storage 10giga eth Hosted @ IDRIS CNRS SC-center iscsi 10giga eth Cloud Hypervisors - std nodes: 32c 128GB - bigmem nodes: 40c 256GB 10
A cloud driven through a web dashboard http://cloud.france-bioinformatique.fr/cloud 11
Storage for biological data CLI (scp, sftp), GUI (Cyberduck, Transmit, Filezilla, ) sftp/http Upload your data Public Data sources Genomes EMBL PDB UNIPROT PROSITE shared (NFS ro) BLAST, Clustal, etc. PaaS IaaS launch jobs ssh Shared FS Master & Storage VM ARIA Workers VM CNS Identity Mgmt j. doe e. martin you chb virtual disks Portal Bioinformatics Cloud cg User data sftp/http Get your results CLI (scp, sftp), GUI (Cyberduck, Transmit, Filezilla, ) 12
Monitor your usage 13
Move VMs rather than data NGS C BI NGS data PRO VM BI VM C PRO VMs IFB s marketplace & VMs repository for life sciences NGS data VM PRO BI C IMG PRO data IMG BI NGS BI IMG PRO Biological platform (Genomics, IMaGing, PROteomics...) Bioinformatics centre C C C Cloud resources Researchers 14
Provide scien(sts with bioinforma(cs resources - data and tools - as cloud appliances 15
Access to local compu(ng resources and tools Bioinformatics Center Scientist tool? pipeline? have access to local resources Availability? No Ask Yes No Run Installation Time++ Computer Resources Usually one cluster for all use Constraint? Yes? Engineers 16
IFB s Cloud Scientist tool? pipeline? have access to IFB s resources Services registry Ask App? No New Engineers, Developers Yes IFB s e-infrastructure Link Run Register Appliance Appliance Appliance Appliance Market place Cloud Bioinformatics or public cloud. Regional, national or a federation. 17
Use case : deploy simple bioinforma(cs applica(on Use case AuthN/Z VMs/disks lifecycle Cloud dashboard (W3) log in + data(+) deploy Application interface (W3) VM user data User AuthN/Z refdata Cloud SW IFB web dashboard StrastusLab Slispstream TRESOR Docker SAML/ XCAML Cluster OpenNaaS Hadoop ( ) HW User s LAN WAN academic + public DC A refdata(++) 18
Use case: deploy complex bioinforma(cs applica(on Use case AuthN/Z User Cloud dashboard (W3) VMs/disks lifecycle log in deploy log in (ssh) + data (scp) AuthN/Z Application interface (W3) master shared data viz worker worker worker worker worker worker worker worker worker worker isolated network Cloud SW IFB web dashboard StrastusLab Slispstream TRESOR Docker SAML/ XCAML Cluster OpenNaaS Hadoop ( ) refdata(+) HW User s LAN WAN academic + public SDN DC A geno mes Sync geno mes DC B refdata(+) 19
Use case: cloud processing of experimental data Use case AuthN/Z User VMs/disks lifecycle Cloud dashboard (W3) VMs/disks lifecycle deploy log in Sequencing Engineer AuthN/Z X11 log in (ssh) + data (scp) Application interface (W3) master shared data viz Archives worker worker worker worker worker worker worker worker worker worker isolated network Cloud SW IFB cloud dashboard StrastusLab Slispstream TRESOR Docker SAML/ XCAML Cluster OpenNaaS Hadoop ( ) HW User s LAN WAN academic + public DC A Genomes refdata(+++) 20
Use case: cloud processing of human biomedical data Use case AuthN/Z VMs/disks lifecycle Cloud dashboard (W3) log in + data(+) deploy Application interface (W3) VM Secure patient data User AuthN/Z refdata Cloud SW IFB web dashboard StrastusLab Slispstream TRESOR Docker SAML/ XCAML Cluster OpenNaaS Hadoop ( ) Secure HW User s LAN Secure WAN academic + public DC A hg19 refdata(++) 21
Integra(on of public data sources shared storage Databases Cloud IFB manage BIOMAJ (VM) All virtual machines 22
Integra(on of bioinforma(cs tools Integration Defining good practices Interoperability Technologies package: rpm, deb, tgz, msi container (LXC): docker, LXD virtual machine software environment infrastructure SaaS Paas IaaS Source: www.docker.com System configuration quattor, puppet, chef, ansible Software integration yum, apt, approver, docker, puppet 23
Create appliances tools BLAST FastA OMSSA ClustalW2 SSearch PeptideShaker ARIA BWA X!tandem HMMer TopHat samtools Galaxy Clustal Muscle fastqc Omega Create new cloud services Virtual Machines R + Linux system Bioinformatics Marketplace Description of the + Appliance Title Contact (and maintainer!) Description (w. controlled voc.) Structures Sequences Proteomics Galaxy... 24
Current bioinforma(cs appliances Remote desktop Web Proteomics Galaxy Scientific apps Galaxy Galaxy MODAL AVIESAN 2013 RSAT Galaxy RADseq PhyML R statistics CLI biocompute Aria Node Imaging Eco Pop Utilities biodata BioMaj BlobSeer biodata NFS Cassandra Neo4j Data mgmt biohadoop Docker CentOS Ubuntu Base OS 25
Assist developers to create appliance Appliances are created by the life science developers/experts of different domains Linux system Appliances in progress BioDataCloud-RNAseq ProFi SymBioWatch Clinical NGS for cancerology REPET manual Galaxy + Tools automatic TriAnnot Galaxy RAD-seq Bacterial genomics, metagenomics Versions 1,2..n & Updates Galaxy Galaxy Galaxy Galaxy NGS RAD-seq MODAL Aviesan Cloud Marketplace Publish VMs 26
Docking bioinforma(cs tools IFB docker hub (@GenOuest) Registry of images pull push IFB s Cloud abyss blast Docker Image size (MB) blast+ docker virtual machine Develop docker virtual machine Use bowtie bwa clustalw2 cufflinks fasta36 fastqc gor4 hmm meme mmseq cufflinks http://docker-ui.genouest.org 0 275 550 825 27
Avantages du cloud Pour les utilisateurs simplicité d utilisation possibilité de modifier son environnement avec de nombreux outils, pipelines et plateformes pré-configurés (cloud marketplace, docker hub, galaxy toolshed) sans perturber les autres usagers Pour les administrateurs se concentrer sur l administration du système (pas les données ni outils scientifiques) outils de déploiement : (quattor), puppet 28
Avantages du cloud (2) Pour les développeurs disposer de sa propre copie (ou plusieurs) de la version stable courante sans perturber les autres développeurs automatisation et intégration continue intégrer et décrire une seule fois leur logiciel sans se soucier de la gestion des identités et quotas proposer un environnement complet pré-configuré vecteurs multi-plateforme de taille légère figer une version de démonstration (congrès, formations, papiers ) pré-production production développement Mix de différentes infrastructures (phys., virtuelle) 29
Use bioinforma(cs cloud services 30
Browse the appliances marketplace and run yours! Proteomics Sequences Galaxy Structures?... IFB Marketplace! 31
RAINBio : Registry of bioinforma(cs tools and VMs topics? tools? Cloud Marketplace e.g. IFB VMs Services Registry e.g. ELIXIR Tools Life science researcher Prototype 32
App Biocompute Standard bioinformatics node With pre-installed standard bioinformatics tools BLAST, FastA, SSearch,HMM,... ClustalW2, Clustal-Omega, Muscle,.. Bowtie(2), BWA, samtools,... MEME, R, etc. Connected to public reference datasets Uniprot, EMBL, genomes, PDB, etc. Automatically shared with the VMs Cluster mode turn several instances in a single virtual cluster shared file system batch scheduling 33
AppS Cloud Galaxy Portal Galaxy portal is widely used in the community analyse NGS data (mainly but not only) connected to community knowledge: data and indexes, tools, workflows Cloud advantages : Versions 1,2..n & Updates User is administrator of his/her own Galaxy instance: install data and tools Preserve workflows and results in cloud storage Help the integration of monthly updates and new tools Cloud permit different appliances to be built from the same base: a basic one with common tools for NGS specific ones for a domain or a set of tools e.g. Galaxy-MODAL, Galaxy-RADseq for training: create a special appliance with dedicated datasets, tools or workflows e.g. AVIESAN school 2013 manual Galaxy NGS Run Linux system Galaxy + Tools Galaxy RADseq Galaxy automatic Galaxy MODAL Aviesan Cloud Marketplace Install Publish VMs 34
App A specialized software suite for the analysis of non-coding sequences motif discovery in promotors of coexpressed genes CHIPseq analysis evolutionary conserved motifs (phylogenetics footprints) Contact: J. van Helden (TGAC) RSAT offers a series of tools dedicated to the detection of regulatory signals in non-coding sequences input a list of genes of interest discover putative regulatory signals, search the matching positions for these signals in your original dataset or in whole genomes, display the results graphically in the form of a feature map. 35
Motivation App Proteomics desktop Collaboration with a mass spectroscopy platform Running out of space on their local resources Protein identification tools Mass experimental data Reference databases : nr, Swiss-Prot Reference screening tools: OMSSA, X! Tandem User interface Remote Virtual Desktop (NX) Reference GUI PeptidShaker 36
App R Sta(s(cal Compu(ng R software environment for statistical computing and graphics include common bioinformatics module RStudio IDE Biobase, BiocGenerics, BiocInstaller, GenomeInfoDb integrated development environment (IDE) for R features: console, syntax-highlighting editor Shiny web framework powerful web framework for building web applications using R. without requiring HTML, CSS, or JavaScript knowledge. Contact: S. Delmotte (PRABI-LBBE) 37
App Hadoop for Life Science Provide turnkey virtual machine with preconfigured mapreduce framework Accelerate biological bigdata analysis Hadoop MapReduce 1.0.4 Appliances (2) provide standard hadoop: including mapreduce and HDFS with integrated bioinformatics tools Example of sequence similarity searching FastA & SSearch deploy database of sequences in HDFS compare each structure to others Databank Developed within the French project MapReduce, ANR ARPEGE FastAMR splits the databank into subsets and puts them in the DFS along with the sequences file FastAMR subset #01 FastA #01 subset #02 Mappers FastA #02...... Each mapper send the score and sequences to reducers Reducers Results score sequence score sequence... Users run the FastAMR script with its sequences and the databank User's Sequences Each mapper runs a FastA program on a part of the databank Reducers copy the best scores of the whole experiment in the DFS 38
Anima(on de la communauté 39
GRISBI Groupe de réflexion sur les InfraStructures BioInformatiques Concertation technologique IFB-core/centres régionaux/ plateformes identification des besoins, choix des orientations, suivi des développements transfert de compétences en technologie cloud Lien avec les partenaires technologiques IDRIS, mésocentres, StratusLab, IDGC, ELIXIR, IFB-core APLIBIO PRABI IFB-GO IFB-SO IFB-GS IFB-NE Parten. Parten. IFB-NE 3 IFB-GS1 2 IFB-SO 2 IFB-GO 5 PRABI 2 IFB-core 6 APLIBIO 13 Grisbi-27 (32) 3 2 11 4 9 2 6 Grisbi-26 (28) 2014-11 2 2 2 2 2 2 4 6 Grisbi-25 (22) 2014-06 2 2 2 1 1 3 3 8 Grisbi-24 (22) 2013-11 http://www.france-bioinformatique.fr/?q=fr/groupe-de-travail/grisbi 2 2 1 2 4 4 Grisbi-23 (15) 2012-12 3 1 2 1 2 6 Grisbi-22 (15) 2012-06 40
Tutoriels Cloud pour la Biologie Objectifs Formation d'initiation à l utilisation du cloud computing pour l'analyse de données biologiques avec les outils usuels de bioinformatique Aborde les concepts et principes généraux du cloud, ainsi que la description des outils, des usages et de son intérêt pour la Bioinformatique. Une partie pratique sur les clouds concernés par la session (IFB-core ou Genocloud). 2014 19 juin 2014-23 participants - IFB-core, Gif-sur-Yvette 6 novembre 2014-20 participants - GenOuest, IFB-GO, Rennes 2015 en sept-oct au PRABI autres? contactez nous! FORMATEURS - Christophe BLANCHET (IFB- core) - Olivier COLLIN (GenOuest) - Jean- François GIBRAT (IFB- core) - Marie GROSJEAN (IFB- core) - Charles LOOMIS (LAL) - Cyril MONJEAUD (GenOuest) - Yvan LE BRAS(GenOuest) - Olivier SALLOU (GenOuest) 09h00-09h30 09h30 10h00 10h00-11h00 11h00-11h30 11h30 13h00 13h00 14h00 14h00 15h30 15h30 16h00 Accueil des participants Présentation de l IFB Salle Markov Cloud computing Salle Markov Pause- café Salle Markov Cloud pour la biologie Salle Markov Déjeuner Hall Amphi G Exemples d application de cloud Salle Markov Pause- café 41
Ecoles Ecole Cumulo Numbio 2015 Organisateur JDEV 2015 - Journées nationales du DEVeloppement LOGiciel de l'enseignement Supérieur et Recherche 30 juin-3 juillet 2015, Bordeaux http://devlog.cnrs.fr/jdev2015 T5 - Infrastructures et interopérabilité : le cloud et les architectures orientées service (SOA) Présentation du Cloud IFB «Une e-infrastructure nationale en bio-informatique» Animation d un groupe de travail (C.Blanchet, Y. Le Bras) «Comment mettre en oeuvre une infrastructure pour ma thématique scientifique?» 42
Conclusion - Current IFB cloud 20 bioinformatics appliances already available + 9 in progress Scientific production - 149 users (May 2015) opened to members of IFB (standard allocated resources) opened to partners, academic and industry, infrastructures and projects: e.g. BioDataCloud, ProFi, MetaboHub, extra resources allocation according to scientific and financial criteria Training - 70 users French tutorials Cloud pour la Biologie tutorial at ECCB 14 about Analysis of Cis-Regulatory Motifs from High-Throughput Sequence Sets Bioinformatics Master in Marseille (2014), in Rouen (2015) 43
Perspec(ves Create more bioinformatics appliances By the experts of the different life sciences domains To make them available to the community Current appliances in progress: BioDataCloud-RNAseq, ProFi, SymBioWatch, Clinical NGS for cancerology, REPET, TriAnnot, BWA-mpi IFB supports different domain-specific developments Microbial Bioinformatics, Evolutionary bioinformatics, Plant bioinformatics, Structural Biology, NGS data processing 2015 Call for proposals to support technological developments and integration of data and tools Pilots Multi-cloud deployment of complex applications Secure cloud for human biomedical data analysis Live remote cloud processing of sequencing data Data registry of multi-cloud datasets Docker hub for life sciences 44
Ques(ons? Acknowledgments IFB members IFB hub: Patricia, Marie, Jean-François, Mohamed, Quentin we are hiring! Appliances developers: Samuel Blanck (Inria Lille), Jacques van Helden (TAGC), Stéphane Delmotte (PRABI-Doua), Bruno Spataro (PRABI) You? CNRS IDRIS: R. Medeiros, C. Gauthey and staff StratusLab members IFB is funded by French program PIA INBS 2012 EU H2020 EGI-Engage (654142) and CYCLONE (644925) projects http://www.france-bioinformatique.fr 45
Sessions pra(ques IBI-1 Introduction au Cloud IFB (Salle du Râteau) Contenu : Pratique des fonctionnalités de base du cloud IFB: le tableau de bord en ligne (Web), le déploiement des machines virtuelles, le stockage des données avec des disques virtuels, leur gestion, les différents types de connexion aux VMs (SSH, Web et bureau à distance). Objectifs : Savoir utiliser le cloud IFB pour ses analyses de données biologiques, exécuter ses propres machines virtuelles et transférer ses données entre son poste de travail et le cloud. IDO-1 Navigation et recherche d'informations dans les bases des données biologiques publiques (Salle de l Echelle) Contenu : A partir d'un cas d'utilisation réel, recherche d'informations dans les bases, expression de requêtes (questions), inspection des résultats, croisement (manuel) des informations issues de plusieurs bases. Objectifs : Savoir utiliser les principaux moteurs de recherche d'informations de données biologiques publiques européens et d'amérique du nord; Savoir utiliser les fonctionnalités avancées de ces outils; Identifier les différents niveaux d'hétérogénéité des données; complémentarité, redondances et apparentes divergences des données. 46