Qu est ce qu un super-calculateur?

Documents pareils

A1-1 - Aspects théoriques et algorithmiques du calcul réparti

Génération de code binaire pour application multimedia : une approche au vol

Règles et paramètres d'exploitation de Caparmor 2 au 11/12/2009. Pôle de Calcul Intensif pour la mer, 11 Decembre 2009

Développement logiciel pour le Cloud (TLC)

Architecture des ordinateurs

XtremWeb-HEP Interconnecting jobs over DG. Virtualization over DG. Oleg Lodygensky Laboratoire de l Accélérateur Linéaire

Architecture des calculateurs

Ecole Mohammadia d Ingénieurs Systèmes Répartis Pr. Slimane Bah, ing. PhD G. Informatique Semaine 24

PACKZ System Requirements. Version: Version: Copyright 2015, PACKZ Software GmbH. 1

Eléments d architecture des machines parallèles et distribuées

THÈSE. présentée à TÉLÉCOM PARISTECH. pour obtenir le grade de. DOCTEUR de TÉLÉCOM PARISTECH. Mention Informatique et Réseaux. par.

Chapitre 2. Cluster de calcul (Torque / Maui) Grid and Cloud Computing

Initiation au HPC - Généralités

Instructions Mozilla Thunderbird Page 1

Utiliser une WebCam. Micro-ordinateurs, informations, idées, trucs et astuces

T. Gasc 1,2,3, F. De Vuyst 1, R. Motte 3, M. Peybernes 4, R. Poncet 5

EN UNE PAGE PLAN STRATÉGIQUE

WEB page builder and server for SCADA applications usable from a WEB navigator

3615 SELFIE. HOW-TO / GUIDE D'UTILISATION

DOCUMENTATION - FRANCAIS... 2

Application Form/ Formulaire de demande

Chapitre 1. Infrastructures distribuées : cluster, grilles et cloud. Grid and Cloud Computing

UML : Unified Modeling Language

Quantification d incertitude et Tendances en HPC

Installation de VirtualPOPC-1 sur Ubuntu Server LTS 64bits

DOCUMENTATION - FRANCAIS... 2

Forthcoming Database

INFO-F-404 : Techniques avancées de systèmes d exploitation

Le projet Gaïa, le Big Data au service du traitement de données satellitaires CRIP - 16/10/2013 Pierre-Marie Brunet

Instructions pour mettre à jour un HFFv2 v1.x.yy v2.0.00

Plan. Department of Informatics

NIMBUS TRAINING. Administration de Citrix NetScaler 10. Déscription : Objectifs. Publics. Durée. Pré-requis. Programme de cette formation

Optimized Protocol Stack for Virtualized Converged Enhanced Ethernet

INF6500 : Structures des ordinateurs. Sylvain Martel - INF6500 1

Package Contents. System Requirements. Before You Begin

Cedric Dumoulin (C) The Java EE 7 Tutorial

Editing and managing Systems engineering processes at Snecma

Tier 1 / Tier 2 relations: Are the roles changing?

Outils d'analyse de la sécurité des réseaux. HADJALI Anis VESA Vlad

How to Login to Career Page

Kick Off SCC EMC l offre EXTREMIO. fmarti@fr.scc.com Philippe.rolland@emc.com. Vers de nouveaux horizons

Guide d'installation rapide TFM-560X YO.13

Once the installation is complete, you can delete the temporary Zip files..

SCHOLARSHIP ANSTO FRENCH EMBASSY (SAFE) PROGRAM APPLICATION FORM

Exemple PLS avec SAS

Face Recognition Performance: Man vs. Machine

Improving the breakdown of the Central Credit Register data by category of enterprises

Xtremweb-HEP : A best effort virtualization middleware

Services à la recherche: Data Management et HPC *

Comprendre l impact de l utilisation des réseaux sociaux en entreprise SYNTHESE DES RESULTATS : EUROPE ET FRANCE

Protection de l Investissement Virtualisation, Efficacité énergétique

La Forge INRIA : bilan et perspectives. Hervé MATHIEU - 11 mai 2010

Mon Service Public - Case study and Mapping to SAML/Liberty specifications. Gaël Gourmelen - France Telecom 23/04/2007

Les environnements de calcul distribué

Language requirement: Bilingual non-mandatory - Level 222/222. Chosen candidate will be required to undertake second language training.

Quick Start Guide This guide is intended to get you started with Rational ClearCase or Rational ClearCase MultiSite.

Anticiper et prédire les sinistres avec une approche Big Data

Object Oriented Parallel Discrete Event Simulation: The PROSIT Approach

Institut français des sciences et technologies des transports, de l aménagement

Un exemple de cloud au LUPM : Stratuslab

English Q&A #1 Braille Services Requirement PPTC Q1. Would you like our proposal to be shipped or do you prefer an electronic submission?

Cloud Computing. Introduction. ! Explosion du nombre et du volume de données

Motivations (many2many) Motivations (many2one) Sur le thème de la Version. La gestion de version. La gestion de configuration.

Le passé composé. C'est le passé! Tout ça c'est du passé! That's the past! All that's in the past!

Les mésocentres HPC àportée de clic des utilisateurs industriels

en SCÈNE RATIONAL Rational Démonstration SDP : automatisation de la chaîne de développement Samira BATAOUCHE sbataouche@fr.ibm.com

Génie logiciel. Systèmes et sous-systèmes. Modèliser des grands systèmes. Problématique. SS S-Syst1 SS S-Syst2 SS S-Syst3. Système.

SERVEUR DÉDIÉ DOCUMENTATION

physicien diplômé EPFZ originaire de France présentée acceptée sur proposition Thèse no. 7178

Optimiser votre reporting sans déployer BW

Gestion de clusters de calcul avec Rocks

M2-Images. Rendu Temps Réel - OpenGL 4 et compute shaders. J.C. Iehl. December 18, 2013

Retours d expériences et perspectives sur les aspects énergétiques

Lamia Oukid, Ounas Asfari, Fadila Bentayeb, Nadjia Benblidia, Omar Boussaid. 14 Juin 2013

LES APPROCHES CONCRÈTES POUR LE DÉPLOIEMENT D INFRASTRUCTURES CLOUD AVEC HDS & VMWARE

Pascale Borla-Salamet Consultante Avant Vente Oracle France. Oracle Exadata Performance et Optimisation de votre Datawarehouse

Software and Hardware Datasheet / Fiche technique du logiciel et du matériel

Windows Server Chapitre 1: Découvrir Windows Server 2008

APPENDIX 6 BONUS RING FORMAT

THE EVOLUTION OF CONTENT CONSUMPTION ON MOBILE AND TABLETS

INSTRUMENTS DE MESURE SOFTWARE. Management software for remote and/or local monitoring networks

ADHEFILM : tronçonnage. ADHEFILM : cutting off. ADHECAL : fabrication. ADHECAL : manufacturing.

MANAGEMENT SOFTWARE FOR STEEL CONSTRUCTION

WiFi Security Camera Quick Start Guide. Guide de départ rapide Caméra de surveillance Wi-Fi (P5)

SAP SNC (Supply Network Collaboration) Web Package. (Français / English) language. Edition 2013 Mars

VMware Infrastructure The New Computing Platform. Stéphane CROIX Systems Engineer

Plateforme Technologique Innovante. Innovation Center for equipment& materials

Performances. Gestion des serveurs (2/2) Clustering. Grid Computing

Quatre axes au service de la performance et des mutations Four lines serve the performance and changes

Administration de Citrix NetScaler 10 CNS-205-1I

MapReduce. Malo Jaffré, Pablo Rauzy. 16 avril 2010 ENS. Malo Jaffré, Pablo Rauzy (ENS) MapReduce 16 avril / 15

Thank you for choosing the Mobile Broadband USB Stick. With your USB Stick, you can access a wireless network at high speed.

Cycle Innovation & Connaissance 12 petit déjeuner Mardi 15 mai Cloud Computing & Green IT : nuages ou éclaircies?

TABLE DES MATIERES A OBJET PROCEDURE DE CONNEXION

lundi 3 août 2009 Choose your language What is Document Connection for Mac? Communautés Numériques L informatique à la portée du Grand Public

Acronymes et abréviations. Acronymes / Abbréviations. Signification

calls.paris-neuroscience.fr Tutoriel pour Candidatures en ligne *** Online Applications Tutorial

Deadline(s): Assignment: in week 8 of block C Exam: in week 7 (oral exam) and in the exam week (written exam) of block D

POSITION DESCRIPTION DESCRIPTION DE TRAVAIL

Post-processing of multimodel hydrological forecasts for the Baskatong catchment

Transcription:

Supercomputers Qu est ce qu un super-calculateur? La machine CURIE hébergée au TGCC de Bruyères-le-Châtel 44 / 81

Supercomputers Qu est ce qu un super-calculateur? La machine CURIE hébergée au TGCC de Bruyères-le-Châtel 45 / 81

Supercomputers Qu est ce qu un super-calculateur? La machine CURIE hébergée au TGCC de Bruyères-le-Châtel 46 / 81

Supercomputers Qu est ce qu un super-calculateur? Ce sont calculateurs dont la puissance de calcul est proche des limites de la techonologie contemporaine C est une infra-structure complexe: occupe souvent un bâtiment entier, consommation électrique ª qq MW à qq 10MW; refroidissement/climatisation très important assemblage très spécifique de composants / matériels informatiques; petit nombre de vendeurs (Cray, IBM, HP, DELL, SGI, Intel, Bull,...) Utilisation originale: le calcul scientifique 47 / 81

Supercomputers Définitions: Qu est ce qu un super-calculateur? Supercomputer: A computing system exhibiting high-end performance capabilities and resource capacities within practical constraints of technology, cost, power, and reliability. Thomas Sterling, 2007. Supercomputer: a large very fast mainframe used especially for scientific computations. Merriam-Webster Online. Supercomputer: any of a class of extremely powerful computers. The term is commonly applied to the fastest high-performance systems available at any given time. Such computers are used primarily for scientific and engineering work requiring exceedingly high-speed computations. Encyclopedia Britannica Online. 48 / 81

Supercomputers Survol historique Qu est ce qu un super-calculateur? CRAY-1: 1976, 80MHz, 64-bit/data, 24-bit/adress, vector register file, 160 MIPS, 250 MFLOPS, 8MB RAM, 5.5 tonnes, ª 200-kW (cooling included) les premiers supercalculateurs sont des machines utilisant du matériel spécialement conçu pour cette utilisation; les supercalculateurs des années 1980 et les ordinateurs personnels ont très peu de choses en commun depuis la fin des années 90, la tendance s inverse; on utilise de plus en plus de composants commerciaux (Off-the-shelf ) source: http://en.wikipedia.org/wiki/supercomputer 49 / 81

Supercomputers Qu est ce qu un super-calculateur? Leading technology paths (to exascale) using TOP500 ranks (Nov. 2012) Multicore: Maintain complex cores, and replicate (x86, SPARC, Power7) (#3, 6, and 10) Manycore/Embedded: Use many simpler, low power cores from embedded (IBM BlueGene) (#2, 4, 5 and 9) GPU/MIC/Accelerator: Use highly specialized processors from gaming/graphics market space (NVidia Fermi, Cell, Intel Phi (MIC) ), (# 1, 7, and 8) 50 / 81

Supercomputers Qu est ce qu un super-calculateur? Figure : Horst Simon, LBNL 51 / 81

Supercomputers Qu est ce qu un super-calculateur? Parallel processing models (not anymore?)-used in supercomputer Communicating Sequential Processing - MPI Shared memory multiple thread - OpenMP / pthread SIMD - vector instruction, lowest level Accelerators - GPU / XeonPhi Alternative models Vector machines: Hardware execution of value sequences to exploit pipelining Systolic: An interconnection of basic arithmetic units to match algorithm Data Flow: Data precedent constraint self-synchronizing fine grain execution units supporting functional (single assignment) execution 52 / 81

Supercomputers / Mesure de performance Comment mesurer / évaluer les performances un programme (parallèle)? Quelles métriques utiliser? Liste des supercalculateurs les plus puissants: TOP500 Ca peut dépendre du type d algorithme: Algorithme dit compute bound (ex: tri): FLOPS Algorithme dit memory bound (ex: tri): bande-passante mémoire Métriques une quantité mesurable représentant le taux d exécution d une tâche Instructions par seconde Puissance électrique (1 MW ª 1 Me) Performance par Watt (Green500) 53 / 81

Supercomputers / Mesure de performance Comment mesurer / évaluer les performances un programme (parallèle)? Quelles métriques utiliser? Utiliser un benchmark classement TOP500: GFLOPS obtenu sur l execution du LINPACK benchmarks parallèles: NPB - NAS parallel benchmark from NASA Advanced Supercomputing Division hpcc Parboil Rodinia (application sur architectures hétérogènes - GPU) SHOC (application sur architectures hétérogènes - GPU) HPCtoolkit HOMB: solveur de Laplace (hybrid MPI / OpenMP) benchmarks sur les IO parallèles (filesytem / hard drive): IOR, ParallelIO,... 54 / 81

Supercomputers / Mesure de performance / Benchmarks - Mini-apps Lawrence Livermore National Lab mini-apps LULESH Explicit lagrangian shock hydrodynamics on unstructured mesh representation AMG2013 Algebraic Multi Grid Mulard Unstructured mesh, finite element, implicit multigroup radiation diffusion UMT Unstructured mesh Transport MCB Monte Carlo Particle Transport LKS Suite of kernels in a unified framework for testing compilers SIMD and threading DRIP 2D interpolation on tabular data LUAU3D Material advection on an unstructured hexahedral mesh MANTEVO (Sandia National Lab) 55 / 81

Supercomputers / Mesure de performance Outils d aide à l analyse / à la mesure de performance: temps d éxecution: time, gettimeofday Profiling: gprof, perf, PAPI Tracing: TAU, scalasca, scorep 56 / 81

Supercomputers / Mesure de performance MPI MPIBlib: P2P, collective MPI communication benchmark 57 / 81

Serial hardware Von Neumann architecture Multi-core CPU Figure : source: multicore tutorial (SC12) by Georg Hager and Gerhard Figure : Peter Pacheco, SanFrancisco U. Wellein 58 / 81

Serial hardware Processor Core Micro architecture Execution Pipeline Stages of functionality to process issued instructions Hazards are conflicts with continued execution Forwarding supports closely associated operations exhibiting precedence constraints Out of Order Execution Uses reservation stations hides some core latencies and provide fine grain asynchronous operation supporting concurrency Branch Prediction Permits computation to proceed at a conditional branch point prior to resolving predicate value Overlaps follow-on computation with predicate resolution Requires roll-back or equivalent to correct false guesses Sometimes follows both paths, and several deep 59 / 81

Hardware: memory hierarchy - low / high latency Most programs have a high degree of locality in their access spatial locality: accessing things nearby previous accesses temporal locality: reusing an item that was previously accessed Main memory (DRAM - off) has high latency compared to on-chip register ) need for intermediate staging area: cache memory Memory hierarchy tries to exploit locality Figure : source: Scientific Computing for engineers, CS594, J. Dongarra 60 / 81

Hardware: memory hierarchy - low / high latency Figure : source: T. Sterling, Louisiana State University, SC12 Tutorial 61 / 81

Hardware: memory hierarchy - low / high latency Figure : source: multicore tutorial (SC12) by Georg Hager and Gerhard Wellein 62 / 81

Hardware: memory hierarchy - cache How Locality affects scheduling algorithm selection: poor locality leads to long latency to fetch data from main memory ) thread is blocked Figure : source: T. Sterling, Louisiana State University, SC12 Tutorial 63 / 81

Hardware: floating-point peak performance on multicore CPU Figure : source: multicore tutorial (SC12) by Georg Hager and Gerhard Wellein 64 / 81

Hardware: linux tools to probe hardware features cat /proc/cpuinfo /bin/egrep processor model name cache size core sibling physical numérotation attribuée par le noyau linux core id: numero d un cœur de CPU physical id: numero de socket siblings: nombre de processing unit (PU) / hardware thread d un CPU (socket) pour un même physical id, si le nombre de cœur est égal à siblings, alors l hyperthreading est déactivé (sudo) lspci hwloc-ls / lstopo (hwloc - hardware locality), outil utilisé par MPI pour le placement de tâche lshw et lshw-gtk (hardware LiSter) 65 / 81

Hardware: linux tools to probe hardware features cat /proc/cpuinfo /bin/egrep processor model name cache size core sibling physical numérotation attribuée par le noyau linux core id: numero d un cœur de CPU physical id: numero de socket siblings: nombre de processing unit (PU) / hardware thread d un CPU (socket) pour un même physical id, si le nombre de cœur est égal à siblings, alors l hyperthreading est déactivé (sudo) lspci hwloc-ls / lstopo (hwloc - hardware locality), outil utilisé par MPI pour le placement de tâche lshw et lshw-gtk (hardware LiSter) TODO: lancer hwloc-ls sur gin (environnement: module load hwloc) et retrouver les caractéristiques des nœuds de calcul 66 / 81

MPI on multicore One MPI process per core Each MPI process is a single thread One MPI process per node MPI processes are multithreaded One thread per core aka Hybrid model 67 / 81

Cluster GIN (ENSTA / UMA) - hardware Les nœuds de calcul Le cluster de l UMA: machine gin 1 frontale interactive gin.ensta.fr équipées de : 2 processeurs - quadcore - AMD Opteron 2378 8 Go de mémoire utilisation: compilation des applications; lancer les jobs 31 nœuds de calculs gin[1-31] hétérogènes: 68 / 81

Cluster GIN (ENSTA / UMA) - hardware Autres caractéristiques: Le cluster de l UMA: machine gin pas de système de fichiers parallèle Réseau: infiniband QDR Système d exploitation: Rocks 6.2 (dérivé CentOS 6.6) Gestionnaire des ressources / tâches (jobs): GridEngine (SGE) connection depuis la salle de cours: tappez simplement ssh gin. Vous êtes alors connecté sur la frontale de gin 69 / 81

Cluster GIN (ENSTA / UMA) - environnement logiciel Environnement module (http://modules.sourceforge.net/): La gestion de l environnement est faite via l utilisation de module : Liste les softs disponibles : module avail Liste les softs intégrés /chargés dans l environnement courant : module list Intègre/charge le logiciel tool dans l environnement courant : module load tool/version Supprime le soft tool de l environnement courant : module unload tool/version Remplace la version old-version du soft tool par la version new-version dans l environnement courant : module switch tool/old-version tool/new-version Affiche les variables d environnement modifiées/restaurées quand on charge/décharge un module : module show tool/version 70 / 81

Cluster GIN (ENSTA / UMA) - gestionnaire de travaux GridEngine (ex. SGE) de la distrib Rocks. Définition d un job: un script shell, qui contient les commandes à exécuter (e.g. mpirun) et des commentaires commençant par #$ permettant de spéficier les ressources nécessaires à l exécution du travail (e.g. nombre de tâches MPI, nombre de nœuds de calcul, mémoire, temps maximal d exécution, etc...) les commandes utiles de base: qsub: soumettre un travail au calculateur. Une fois dans la queue des travaux, le travail reçoit un numéro (jobid) qsub job.qsub qstat: liste des travaux en cours, et leur état (en exécution, en attente des ressources disponibles, etc...) qstat; qstat -j jobid qdel: enlever/tuer un job de la queue qdel jobid man sge_intro pour plus d information. Voir aussi http://en.wikipedia.org/wiki/oracle_grid_engine 71 / 81

Cluster GIN (ENSTA / UMA) - soumission de travaux script de soumission 1 #!/bin/bash # 3 # Name of the job ( used to build the name of the standard output stream) #$ N test_job 5 # # Number of MPI task requested 7 #$ pe orte 4 # 9 # The job is located in the current working directory #$ cwd 11 # # Merge standard error and standard output streams 13 #$ j y # 15 # openmpi 1.6 ( deprecated) 17 #mpirun bycore bind to core report bindings./helloworld 19 # openmpi 1.8 ( current) mpirun map by core bind to core report bindings../../code/submit_job_gin.qsub./ helloworld 72 / 81

Cluster GIN (ENSTA / UMA) - Grid Engine -qstat Exemple de sortie de la commande qstat: kestener@gin:~/coursmpi/cours_mpi_code/c$ qstat 2 job ID prior name user state submit/ s t a r t at queue s l o t s 4 11 0.00000 test_job kestener qw 09/02/2014 15:53:17 4 qstat_output.txt 73 / 81

Topologie de la frontale de GIN use command lstopo to explore hardware on gin 74 / 81

Parallel programming models Definition: the languages and libraries that create an abstract view of the machine (hide low-level details) Control How is parallelism created? How are dependencies enforced? Data Shared or private? How is shared data accessed or private data communicated? Synchronization What operations can be used to coordinate parallelism What are the atomic (indivisible) operations? Slide derived from M. Zahran 75 / 81

Parallel programming models You can run any paradigm on any hardware (e.g. an MPI on shared - memory) The same program can have different type of parallel paradigms The hardware itself can be heterogeneous The whole challenge of parallel programming is to make the best use of the underlying hardware to exploit the different type of parallelisms Slide derived from M. Zahran 76 / 81

Parallel programming models...... on multicore multisocket nodes Figure : source: multicore tutorial (SC12) by Georg Hager and Gerhard Wellein 77 / 81

Supercomputing trends Figure : Horst Simon, LBNL 78 / 81

Ten things every programmer must know about hardware Data types Boolean algebra Caches - memory hierarchies Cache coherence Virtual Memory Pipelining Memory layout of data structures (arrays, linked lists,...) Some assembly programming Basic compiler optimizations Memory bandwidth constratins source: http://www.futurechips.org/tips-for-power-coders/programmer-hardware.html 79 / 81

Optimizing parallel programs - OpenMP example http://www.futurechips.org/tips-for-power-coders/ writing-optimizing-parallel-programs-complete.html 80 / 81

Sources of Performance Degradation Latency: Waiting for access to memory or other parts of the system Overhead: Extra work that has to be done to manage program concurrency and parallel resources the real work you want to perform Starvation: Not enough work to do due to insufficient parallelism or poor load balancing among distributed resources Contention: Delays due to fighting over what task gets to use a shared resource next. Network bandwidth is a major constraint. 81 / 81