HIGH-PERFORMANCE SIMULATION FOR REACTOR PHYSICS USING COMPUTE ACCELERATORS
Performances of Krylov Solvers for Reactor Physics Simulation on Petascale Architectures
C. CALVIN (1), J. DUBOIS (1)
(1) Direction de l'Energie Nucléaire, CEA Saclay, France
TGCC colloquium «CURIE, un an après», May 2013
Christophe CALVIN, christophe.calvin@cea.fr

MAJOR TOOLS FOR NUCLEAR DEVELOPMENT
Simulation environment:
- Uncertainty management (URANIE)
- Computer architecture
- Numerical methods
- Software architecture
- SALOME (platform coupling)
Fuel cycle: natural uranium, fuel fabrication, plutonium, MOX, spent fuel, waste storage
Reactors: Gen 2, 3; Gen 4; experimental reactors; naval propulsion
Facility design through domain-specific applications:
- Explore regimes that are hard to reach by experiment
- Shorten study times
- Limit investment
Platforms for the main nuclear disciplines:
1 - Neutronics
2 - Thermal-hydraulics
3 - Mechanics, thermal analysis
4 - Fuel cycle chemistry
5 - Materials
Need to expand this activity: upstream studies, modelling, experimentation, basic scientific and technological research
Example: multi-scale coupling in thermal-hydraulics
- Scales: fine scale, local scale, component scale, system scale
- Diagram labels: fuel assembly, CATHARE

APOLLO3: DETERMINISTIC TRANSPORT CODE
New-generation deterministic 3D transport code developed at CEA.
Objectives: provide numerical toolboxes for building calculation schemes to:
- model and simulate GEN II, GEN III and GEN IV reactors
- carry out R&D studies and validation processes requiring refined energy and spatial meshes
Main features: advanced lattice and core solvers, parallel computation, and functionality for all kinds of reactors.
Goal-oriented project organized around 3 axes: development, integration and V&V
In partnership with AREVA and EDF.

SOLVERS IN APOLLO3
Solving the neutron transport equation
Different approaches and methods for solving this equation:
- Simplified approach, diffusion or simplified transport: MINOS
- Finite-element transport on unstructured meshes: MINARET
- Transport by the method of characteristics (short, long): IDT, TDT
Overall, this amounts to solving a generalized eigenvalue problem

KRYLOV SOLVER FOR THE NEUTRON TRANSPORT EQUATION
Example of different solvers for the neutron transport equation with DENOVO/CASL
Compared with the methods classically used, Krylov methods offer very promising potential for solving eigenvalue problems, both sequentially and in parallel

EXPLICITLY RESTARTED ARNOLDI METHOD
ERAM: a Krylov method for non-symmetric eigenvalue problems
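
The slide names the method without giving its steps, so the following is only a minimal dense NumPy sketch of a generic explicitly restarted Arnoldi iteration, not the authors' MKL/CUBLAS/MPI implementation. The function names `arnoldi` and `eram`, the choice of the restart vector (a plain sum of the wanted Ritz vectors) and the breakdown threshold are assumptions made for illustration.

```python
import numpy as np

def arnoldi(A, v0, m):
    """m-step Arnoldi with modified Gram-Schmidt.

    Returns (V, H, beta) with A @ V = V @ H + beta * v_next * e_m^T,
    where V has orthonormal columns and H is upper Hessenberg.
    """
    n = A.shape[0]
    V = np.zeros((n, m + 1), dtype=complex)
    H = np.zeros((m + 1, m), dtype=complex)
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):
            H[i, j] = np.vdot(V[:, i], w)   # vdot conjugates its first argument
            w -= H[i, j] * V[:, i]
        beta = np.linalg.norm(w)
        H[j + 1, j] = beta
        if beta < 1e-14:                    # invariant subspace: stop early
            return V[:, :j + 1], H[:j + 1, :j + 1], 0.0
        V[:, j + 1] = w / beta
    return V[:, :m], H[:m, :m], beta

def eram(A, k, m, tol=1e-8, max_restarts=200, seed=0):
    """Approximate the k eigenpairs of largest modulus with subspace size m."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    for _ in range(max_restarts):
        V, H, beta = arnoldi(A, v, m)
        theta, Y = np.linalg.eig(H)                  # Ritz pairs of the small matrix
        wanted = np.argsort(-np.abs(theta))[:k]
        theta, Y = theta[wanted], Y[:, wanted]
        residuals = beta * np.abs(Y[-1, :])          # ||A x - theta x|| for each Ritz pair
        if residuals.max() < tol:
            break
        v = V @ Y.sum(axis=1)                        # explicit restart (assumed choice)
    return theta, V @ Y, residuals
```

The only operation that touches the full matrix is `A @ V[:, j]`, which is why the matrix-vector product is the natural kernel to offload to BLAS on CPUs or GPUs and to distribute across processes.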

PARALLEL ERAM
Two-level parallelization
- Intra-node: multithreaded implementation based on BLAS kernels
  - On CPU: multi-threaded BLAS through MKL
  - On GPU: CUBLAS
- Inter-node: using MPI
Matrix decomposed using a chess pattern in order to minimize communication times

#MPI processes   1    2    3    4    5    6    7    8    9    10   11    12   13    14   15   16
rows x columns   1x1  2x1  3x1  2x2  5x1  3x2  7x1  4x2  3x3  5x2  11x1  4x3  13x1  7x2  5x3  4x4

Performance study
- Weak and strong scaling
- On CPUs and GPUs
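
The rows x columns table on this slide is consistent with picking, for each process count, the factorization that is closest to square. The slides do not state the rule, so the sketch below is an inference from the table; `process_grid` is a hypothetical helper name.

```python
import math

def process_grid(p):
    """Return (rows, cols) with rows * cols == p and rows >= cols,
    choosing the largest divisor of p that does not exceed sqrt(p) as cols."""
    cols = max(d for d in range(1, math.isqrt(p) + 1) if p % d == 0)
    return p // cols, cols

# Print the grid for each process count from 1 to 16.
for p in range(1, 17):
    rows, cols = process_grid(p)
    print(f"{p:2d} processes -> {rows}x{cols}")
```

Prime process counts degenerate to a single column of blocks (for example 13x1), which gives up the benefit of a two-dimensional decomposition.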

TEST MACHINE: CURIE
One of the European PRACE supercomputers, located at TGCC
CURIE thin nodes: 5040 bullx B510 nodes
- Each node: 2 x 8-core Intel Sandy Bridge EP (E5-2680) 2.7 GHz, 64 GB
- 80,640 cores, 1.6 PFlops
CURIE fat nodes: 360 bullx S6010 nodes
- Each node: 4 x 8-core Intel Nehalem-EX (X7560) 2.26 GHz, 128 GB
- 11,520 cores, 105 TFlops
CURIE hybrid nodes: 16 bullx B505 nodes
- Each node: 9 hybrid B505 blades with 2 Intel Westmere 2.66 GHz / 2 Nvidia M2090
- 288 Intel processors + 288 Nvidia processors, 192 TFlops

WEAK SCALING
- Dense square matrix, N = 26,000 ($6.76 \times 10^{8}$ elements)
- 8 eigenpairs
- Krylov subspace size = 16, $\varepsilon = 10^{-8}$

WEAK SCALING ANALYSIS
Communication time model for the matrix-vector product:

$$T_{comm\_mvp}(p, N) = 2\left(\alpha_{PCIe} + \frac{N}{\beta_{PCIe}}\right) + \log_2 p \left(\alpha_{MPI} + \frac{N}{\beta_{MPI}}\right)$$

The first term is the GPU transfer time and the second is the total MPI transfer time, where:
- $p$: number of GPUs
- $N$: volume of data to transfer (matrix order / p)
- $\alpha_{PCIe}$: PCIe latency (around 100 ns)
- $\alpha_{MPI}$: QDR/MPI latency (around 1 µs)
- $\beta_{PCIe}$: PCI-express bandwidth = 1 GB/s
- $\beta_{MPI}$: QDR/MPI bandwidth = 6 GB/s

Total execution time:

$$T_{exec}(p, N) = T_{comp} + T_{comm\_mvp}(p, N)$$

Figure: Communication for the Matrix-Vector product (x-axis from 1 to 361, y-axis from 0 to 0.001).
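
A latency-plus-bandwidth model of this kind is easy to evaluate numerically. The sketch below uses the latency and bandwidth figures quoted on the slide. The exact form of the formula (two PCIe transfers plus log2(p) MPI steps), the names `ALPHA_*`/`BETA_*`, the helper `transfer_bytes` and the 8-byte word size are assumptions, so the numbers it prints are illustrative and are not claimed to reproduce the slide's curve.

```python
import math

# Values quoted on the slide; the alpha/beta naming is an assumption.
ALPHA_PCIE = 100e-9   # PCIe latency, seconds
ALPHA_MPI = 1e-6      # QDR/MPI latency, seconds
BETA_PCIE = 1e9       # PCI-express bandwidth, bytes per second
BETA_MPI = 6e9        # QDR/MPI bandwidth, bytes per second

def transfer_bytes(matrix_order, p, word_size=8):
    """Data volume per transfer: matrix order / p, in double-precision words (assumed)."""
    return word_size * matrix_order / p

def t_comm_mvp(p, n_bytes):
    """Modelled communication time of one matrix-vector product on p GPUs."""
    gpu_time = 2 * (ALPHA_PCIE + n_bytes / BETA_PCIE)            # host/device transfers
    mpi_time = math.log2(p) * (ALPHA_MPI + n_bytes / BETA_MPI)   # log2(p) exchange steps
    return gpu_time + mpi_time

# Evaluate the model on square GPU counts, as on the slide's x-axis.
for q in range(1, 20, 2):
    p = q * q
    print(p, t_comm_mvp(p, transfer_bytes(26_000, p)))
```

With a single GPU the MPI term vanishes because log2(1) = 0, leaving only the PCIe transfers.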

STRONG SCALING
- Dense square matrix, N = 22,000 ($5.11 \times 10^{8}$ elements)
- 135 eigenpairs
- Krylov subspace size = 225, $\varepsilon = 10^{-8}$

STRONG SCALING ANALYSIS
For strong scaling, 2 parameters drive performance:
- Computing performance, which decreases as the amount of data per device decreases
- Communication ratio, which increases as the computing time decreases
Figure: performance evolution on a single GPU depending on the matrix size
Three main phases:
1. Execution time decreases: computation time is greater than communication time
2. Optimum point
3. Execution time increases: communication time is greater than computation time
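
The three phases can be illustrated with a deliberately simple toy model in which computation time shrinks as 1/p while communication time grows as log2(p). The functional form, the constants and the function names below are all hypothetical and are not fitted to the CURIE measurements. The model also ignores the drop in per-GPU computing performance on small matrices that the slide mentions.

```python
import math

def exec_time(p, t_comp_1=1.0, t_comm_step=0.02):
    """Toy strong-scaling model: T_exec(p) = T_comp(1) / p + c * log2(p)."""
    return t_comp_1 / p + t_comm_step * math.log2(p)

def best_process_count(max_p, **params):
    """Process count in 1..max_p that minimizes the modelled execution time."""
    return min(range(1, max_p + 1), key=lambda p: exec_time(p, **params))

best = best_process_count(256)
print("optimum at p =", best)
print("T(1), T(best), T(256):", exec_time(1), exec_time(best), exec_time(256))
```

Before the optimum, adding processes removes more computation time than it adds communication time. Past it, the balance reverses and the total time rises again.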

STRONG SCALING ON CPU
Figure panels: Fat Nodes, Thin Nodes

PERFORMANCE ON CPU AND GPUS

Partition  Execution time (s)  Perf. reached / Perf. max  # of nodes  # of cores or GPUs
Fat        4.38                169 / 302 = 56%            15          480
Thin       3.46                214 / 437 = 49%            24          384
Hybrid     2.93                306 / 900 = 34%            18          36
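
The percentages in the table are the ratio of reached to maximum performance, and the device counts are consistent with the node descriptions of the CURIE partitions. The short check below recomputes both from the table values. The variable and function names are illustrative, and the performance unit is left unspecified because the slide does not give it.

```python
# (partition, perf reached, perf max, nodes, cores or GPUs) copied from the table
rows = [
    ("Fat", 169, 302, 15, 480),
    ("Thin", 214, 437, 24, 384),
    ("Hybrid", 306, 900, 18, 36),
]

def efficiency_percent(reached, peak):
    """Fraction of peak performance reached, rounded to a whole percent."""
    return round(100 * reached / peak)

for name, reached, peak, nodes, devices in rows:
    print(name, efficiency_percent(reached, peak), "% of peak,",
          devices // nodes, "cores or GPUs per node")
```

The hybrid partition has the shortest execution time but the lowest fraction of peak.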

CONCLUSION / PERSPECTIVES
Efficient ERAM implementation on GPUs
- Improvement vs CPU
- Weak scaling: 60% efficiency on 256 GPUs
But scaling is quite limited on large configurations:
- This limit can be overcome using a co-method (MERAM)
Auto- and smart-tuning techniques are required to reach the best possible performance automatically
- Existing work on matrix formats for sparse computations
- Ongoing work on smart-tuning strategies for ERAM/MERAM
Need to optimize communications for ultra-scale supercomputers, especially when using accelerators
- Direct connection between accelerators
- Communication overlap
- Communication avoiding: redesign the algorithms in order to minimize global communications

CONCLUSIONS / PERSPECTIVES
There is more to life than the GPU
- We are also interested in other compute accelerators such as the Xeon Phi
- Even though its peak performance is currently less attractive, the MIC offers a number of advantages for industrial codes that have very long lifetimes and must run on multiple platforms
- Optimizations made for the MIC version directly benefit the CPU version

CEA, 10 April 2012
Commissariat à l'énergie atomique et aux énergies alternatives
Centre de Saclay, 91191 Gif-sur-Yvette Cedex
T. +33 (0)1 69 08 68 80, F. +33 (0)1 69 08 66 42
DEN/DANS DM2S, 28 May 2013
Etablissement public à caractère industriel et commercial, RCS Paris B 775 685 019