A Pipeline Based Approach for Experimental Neuroscience Data Management

Thesis No. 4863 (2011)
Presented on 22 July 2011 to the Faculty of Life Sciences, Laboratory of Neural Microcircuitry, Doctoral Program in Neuroscience, École Polytechnique Fédérale de Lausanne, for the degree of Docteur ès Sciences, by Muhammad Asif Jan.
Accepted on the proposal of the jury: Prof. M. Herzog, jury president; Prof. H. Markram and Dr F. Schürmann, thesis directors; Prof. A. Ailamaki, examiner; Prof. J. G. Bjaalie, examiner; Dr A. Ph. Davison, examiner.
Switzerland, 2011

Abstract

The field of neuroscience is witnessing a huge influx of experimental data thanks to improvements in data acquisition tools and techniques. Most of this data is collected by thousands of experimenters located in institutions around the world. There is also growing interest in using this experimental data to build biologically realistic computational models of neuronal systems. One such project, the Blue Brain Project, is developing a bottom-up neuronal modeling and simulation framework using experimental data collected at laboratories worldwide. However, to be used effectively in computational neuroscience research, experimental data needs to be annotated with metadata such as details of the experimental protocols and conditions, the animal subject used, etc. Most platforms for experimental neuroscience data management operate under the assumption that the primary data has been validated and properly annotated before it is uploaded to the system, which places the responsibility for validating and annotating the experimental data on the experimenter conducting the study. Consequently, most experimenters maintain their own data using ad hoc systems with non-standard metadata schemes, and upload only the final set of data to the data management platform. As a result, the metadata is usually incomplete and does not always comply with the requirements of the data management platform, which reduces the reusability of the data. This thesis explores the question of experimental data management within the context of the Blue Brain Project, and designs a data management pipeline covering the acquisition, annotation and validation of experimental data and associated metadata, in order to ensure that the data is usable for computational modeling. The pipeline enforces a data review process in which incoming experimental data goes through a series of steps, analogous to the scientific review process, that systematically improve the quality of the experimental data and associated metadata.

Keywords: Data Management Pipeline, Neuroscience Data Management, Primary Data Management, Neocortical Microcircuit Data

Résumé

Thanks to the development of acquisition tools and techniques, the field of neuroscience is seeing a massive increase in data flows. Most of these data come from thousands of researchers spread across numerous institutions worldwide. There is therefore a growing interest in using these data to build biologically realistic computational models of neurons. One such project, the Blue Brain Project, is developing a complete platform for the modeling and simulation of neuronal systems using experimental data from laboratories located around the world. To be used effectively in computational neuroscience research, these experimental data must be annotated with metadata such as details of the experimental protocol, the conditions, the animal used as subject, etc. As it happens, the processes put in place by individual experimenters do not conform to the standards required by a data management platform. Most platforms for experimental data management in neuroscience rest on the assumption that the data have been validated and correctly annotated before being inserted into the system. The responsibility for validating and correctly annotating the data is thus entrusted to the experimenter conducting the research. Consequently, most experimenters manage their data using non-standard metadata schemes and upload all of their data to the data management platform only at a late stage. Their metadata are therefore mostly incomplete and do not comply with the rules of the data management platform, which negatively affects the reusability of their data.

This thesis addresses the questions of data management within the Blue Brain Project and proposes a pipeline architecture for the acquisition, annotation and validation of experimental data and their metadata, in order to ensure that these data are usable in computational models. The pipeline imposes a data review in which the experimental data must pass through a number of steps, in the manner of the scientific review process, systematically improving the quality of the experimental data and metadata.

Keywords: data management pipeline, neuroscience data management, primary data management, neocortical microcircuit data

Table Of Contents

LIST OF FIGURES

CHAPTER 1: INTRODUCTION
  1.1 RESEARCH ENVIRONMENT
  1.2 LNMC AND BLUE BRAIN PROJECT
    1.2.1 MORPHOLOGY
    1.2.2 ELECTROPHYSIOLOGY
    1.2.3 GENE EXPRESSION
  1.3 NEUROINFORMATICS AND THE DATA CHALLENGE
  1.4 BUILDING BLOCKS FOR EXPERIMENTAL NEUROSCIENCE DATA MANAGEMENT
  1.5 RESEARCH OBJECTIVES
  1.6 STRUCTURE OF THE THESIS DOCUMENT

CHAPTER 2: RELATED RESEARCH
  2.1 DATA MANAGEMENT APPROACHES
    2.1.1 FLAT FILES
    2.1.2 XML
    2.1.3 RELATIONAL DATABASE
    2.1.4 PERSISTENT COLLECTIONS
  2.2 NEUROSCIENCE DATABASES
    2.2.1 CELL CENTERED DATABASE (CCDB)
    2.2.2 SENSELAB
    2.2.3 NEUROSYS
    2.2.4 SUMATRA
    2.2.5 COCODAT
    2.2.6 ALLEN BRAIN ATLAS (ABA)
  2.3 NEED FOR DATA INTEROPERABILITY AND SHARING
    2.3.1 COMMON DATA MODEL (CDM) FOR THE NEUROPHYSIOLOGY DATA
    2.3.2 MINIMUM INFORMATION ABOUT A NEUROSCIENCE INVESTIGATION (MINI)
  2.4 ENABLING TOOLS FOR SEAMLESS DATA MANAGEMENT AND ACCESS
    2.4.1 STORAGE RESOURCE BROKER AND IRODS: ENABLING A SECURE DATA GRID ENVIRONMENT
    2.4.2 ECLIPSE RICH CLIENT PLATFORM (RCP)
  2.5 RESEARCH CONTRIBUTIONS
  2.6 CHAPTER SUMMARY

CHAPTER 3: A PIPELINE BASED APPROACH FOR END TO END MANAGEMENT OF EXPERIMENTAL NEUROSCIENCE DATA
  3.1 DATA MANAGEMENT PIPELINE
    3.1.1 DATA COLLECTION TIER
    3.1.2 DATA MANAGEMENT TIER
    3.1.3 DATA UTILIZATION TIER
  3.2 END TO END WORKFLOW
    3.2.1 DATA COLLECTION TIER
      A1/A2. Data Acquisition & Metadata Extraction
      A3. Data Staging
      A4. Data Validation and Annotation
    3.2.2 DATA MANAGEMENT TIER
      B1. Metadata Shredding and Import
      B2. Associating Primary Data with Cell Metadata
      B3. Connectivity Extraction
      B4. Associating Primary Data with Connection Metadata
    3.2.3 DATA UTILIZATION TIER
      C1. Data Quality Validation
      C2. Data Browsing and Download
      C3. Normalization
      C4. Automated Analysis
  3.3 CHAPTER SUMMARY

CHAPTER 4: DESIGN METHODOLOGY
  4.1 DESIGN APPROACH
    4.1.1 XML SCHEMA FOR METADATA MANAGEMENT
    4.1.2 PERSISTENT COLLECTION FOR PRIMARY DATA STORAGE
    4.1.3 RELATIONAL DATABASE FOR KEY FEATURE DATA
    4.1.4 PROCESSES ENABLING END TO END WORKFLOW
  4.2 COMPARING RELATIONAL DATABASE WITH PERSISTENT COLLECTION FOR PRIMARY DATA STORAGE
    4.2.1 FORMAT OF THE DATA
    4.2.2 DATA ACCESS BY USERS AND APPLICATION PROGRAMS
    4.2.3 AUTHENTICATION AND AUTHORIZATION
    4.2.4 DISTRIBUTED ARCHITECTURE
  4.3 GENERIC VERSUS DOMAIN SPECIFIC SCHEMA DESIGN
    4.3.1 GENERIC SCHEMA DESIGN
    4.3.2 DOMAIN SPECIFIC SCHEMA DESIGN
    4.3.3 COMPARING GENERIC AND DOMAIN SPECIFIC SCHEMA DESIGN
  4.4 CHAPTER SUMMARY

CHAPTER 5: RESULTS
  5.1 DATA MODEL FOR THE NEOCORTICAL MICROCIRCUIT DATA
    5.1.1 GENERAL INFORMATION
      Project
      Experiment
      Animal (Species)
      Setup
      Layer
    5.1.2 CELL INFORMATION
      Electrophysiology and Morphology Type of the Cell
    5.1.3 CONNECTION INFORMATION
    5.1.4 PRIMARY DATA
      Electrophysiology Data
      Morphology Data
    5.1.5 CELL AND CONNECTION PROFILES
  5.2 TOOLS CONSTITUTING END TO END DATA MANAGEMENT PIPELINE
    5.2.1 DATA COLLECTION TIER
      A1. Igor elabbook
      A2. Excel2Xml and Text2Xml
      A3. ExperimentCopy
      A4. Web elabbook
    5.2.2 DATA MANAGEMENT TIER
      B1. Xml2Sql
      B2. PopulateTraceDB
      B3/B4. Connectivity Extractor
    5.2.3 DATA UTILIZATION
      C1. TraceQualityCheck
      UserBase and LabBase Systems
        UserBase System
        LabBase System
      Facilitating Circuit Building for Simulations
  5.3 CHAPTER SUMMARY

CHAPTER 6: CONCLUSIONS
  A PIPELINE BASED APPROACH TO EXPERIMENTAL NEUROSCIENCE DATA MANAGEMENT
  DATA MODEL FOR THE NEOCORTICAL MICROCIRCUIT DATA METADATA MODEL AND RELATIONAL DATABASE MODEL
  DEMONSTRATION OF A HYBRID APPROACH FOR SCIENTIFIC DATA MANAGEMENT
  EXTENDING THE SEARCH FUNCTIONALITY OF A PERSISTENT COLLECTIONS SERVER

CHAPTER 7: OUTLOOK

REFERENCES
  WEB SITE REFERENCES:

APPENDIX A
  A REPORT ON COMPARISON OF EXISTING SOFTWARE STRUCTURES AND DATA MODELS IN SINGLE NEURON DATABASING
  PRESENTED TO FACETS CONSORTIUM (FP6-2004-IST-FETPI 15879)

APPENDIX B
  KEYWORD INDEXING OVER STORAGE RESOURCE BROKER
  PRESENTED AT INTERNATIONAL GRID COMPUTING, HIGH PERFORMANCE AND DISTRIBUTED COMPUTING (GADA) 2007

APPENDIX C

ACKNOWLEDGEMENTS

CURRICULUM VITAE