KRYLOV METHODS FOR PETASCALE EIGENVALUE PROBLEMS
C. Calvin (1), F. Boillod-Cerneux (1,2), T. Drummond (4), J. Dubois (1), N. Emad (3,5), S. Petiton (2,3), F. Ye (1,3)
(1) CEA Saclay, France; (2) CNRS/LIFL, Lille, France; (3) Maison de la Simulation, Saclay, France; (4) LBNL, Berkeley, US; (5) PRiSM, UVSQ, France
Séminaire MANON, 29 March 2013, INSTN, Saclay
Contact: Christophe Calvin, christophe.calvin@cea.fr
INTRODUCTION
SUPERCOMPUTING TRENDS
Petascale and post-petascale:
- The degree of parallelism is increasing
- The number of levels of coarse- and fine-grained parallelism is increasing
- The number of levels of memory is increasing
- Heterogeneous processing units (CPUs + accelerators)
- New algorithms and programming paradigms are needed
[Figure: TOP500 number of cores, 1993-2013, for the #1 machine, the #500 machine, and the list average]
SUPERCOMPUTING TRENDS
Main challenges for exascale:
- Power (energy-efficient computing): low-power processors (ARM); maximize Flops/watt (GPU, MIC); minimize memory per core
- Communication: minimize global communication; optimize communication
- Fault tolerance: MTBF < 1 h
[Figure: GBytes per core of the TOP500 #1 machine, 1993-2013]
IMPACT ON SCIENTIFIC APPLICATIONS AND NUMERICAL ALGORITHMS AT EXTREME SCALE
Algorithms and software architectures must be rethought.
Minimize:
- energy consumption
- global communications, including synchronizations
Maximize the degree of parallelism of the algorithms and their vectorization (SIMD):
Multi-/hyper-threading:
- 1 to 2 threads per core on Sandy Bridge, Ivy Bridge, Haswell: a few tens of threads per socket
- 4 threads per MIC core: 244 threads
- 1 thread per GPU core: > 1500 threads
Vectorization (SIMD): AVX1, AVX2, AVX3
IMPACT ON SCIENTIFIC APPLICATIONS AND NUMERICAL ALGORITHMS AT EXTREME SCALE
New scientific-code architectures and numerical algorithms must be designed to take these hardware changes into account:
- Polymorphic algorithms (adapting to the available degree of parallelism)
- Use of optimized components (such as numerical libraries)
- Auto-tuning / smart tuning
- Fault tolerance handled at the application and algorithm levels
- Efficient runtimes for dynamic multithreading management (e.g. StarPU)
NUMERICAL ALGORITHMS FOR EXTREME SCALE
Our R&D axes:
- Co-methods
- Auto-tuning and smart tuning
- Efficient programming of multi-core processors and accelerators
- Evaluation of programming models and languages
Our application field: Krylov methods for solving eigenvalue problems
EIGENVALUE PROBLEMS
A recurring problem in many applications: page ranking, the Boltzmann equation, oil drilling, modeling of epidemic spread.
Standard and generalized forms: Au = λu and Av = λBv
EIGENVALUE PROBLEMS
Many methods exist:
- Direct methods: tri-diagonalization followed by QR
- Semi-direct methods: Lanczos
- Iterative methods:
  - Power method
  - Krylov methods (based on Arnoldi):
    - IRAM (Implicitly Restarted Arnoldi Method)
    - ERAM (Explicitly Restarted Arnoldi Method)
KRYLOV SOLVERS FOR THE NEUTRON TRANSPORT EQUATION
Example of different solvers for the neutron transport equation with DENOVO/CASL
EXPLICITLY RESTARTED ARNOLDI METHOD
Krylov method for non-symmetric eigenvalue problems
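As a minimal illustration of the method named on this slide, here is a serial numpy sketch of ERAM: an m-step Arnoldi reduction, Ritz extraction from the Hessenberg matrix, and an explicit restart with the sum of the desired Ritz vectors (the "default" restarting strategy). The function names and the restart choice are illustrative, not the implementation discussed later in the talk.

```python
import numpy as np

def arnoldi(A, v0, m):
    """m-step Arnoldi reduction: orthonormal basis V of K_m(A, v0)
    and the m x m Hessenberg matrix H (modified Gram-Schmidt)."""
    n = A.shape[0]
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:           # happy breakdown
            return V[:, : j + 1], H[: j + 1, : j + 1]
        V[:, j + 1] = w / H[j + 1, j]
    return V[:, :m], H[:m, :m]

def eram(A, m, k, tol=1e-8, max_restarts=500):
    """Explicitly Restarted Arnoldi for the k dominant eigenpairs."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(A.shape[0])
    for _ in range(max_restarts):
        V, H = arnoldi(A, v, m)
        theta, y = np.linalg.eig(H)
        idx = np.argsort(-np.abs(theta))[:k]   # k largest Ritz values
        phi = V @ y[:, idx]                    # Ritz vectors
        res = np.linalg.norm(A @ phi - phi * theta[idx], axis=0)
        if np.all(res < tol):
            return theta[idx], phi
        v = np.real(phi @ np.ones(k))          # explicit restart vector
    return theta[idx], phi
```

On a small diagonal test matrix, `eram(np.diag(np.arange(1.0, 21.0)), m=12, k=2)` recovers the two dominant eigenvalues 19 and 20.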
ERAM IN PARALLEL ON MASSIVELY PARALLEL AND HETEROGENEOUS ARCHITECTURES
PARALLEL ERAM
Two-level parallelization:
- Intra-node: multithreaded implementation based on BLAS kernels (multi-threaded BLAS through MKL on CPU, CUBLAS on GPU)
- Inter-node: MPI, with the matrix decomposed on a checkerboard (2D) pattern to minimize communication times:

  #MPI processes : 1    2    3    4    5    6    7    8    9    10   11    12   13    14   15   16
  rows x columns : 1x1  2x1  3x1  2x2  5x1  3x2  7x1  4x2  3x3  5x2  11x1  4x3  13x1  7x2  5x3  4x4

Performance study: weak and strong scaling, on CPUs and GPUs
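The grid shapes in the table follow one simple rule: for p MPI processes, take the most-square factorization p = r x c. A small sketch of that rule (my reading of the table, which it reproduces for every entry):

```python
from math import isqrt

def grid_shape(p):
    """Most-square r x c process grid for p MPI processes (r >= c):
    c is the largest divisor of p not exceeding sqrt(p)."""
    c = max(d for d in range(1, isqrt(p) + 1) if p % d == 0)
    return p // c, c
```

For example `grid_shape(12)` gives `(4, 3)` and `grid_shape(16)` gives `(4, 4)`, matching the 4x3 and 4x4 entries; prime process counts such as 7 or 13 degenerate to a 1D decomposition.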
TEST MACHINE: CURIE
One of the European PRACE supercomputers, located at TGCC.
- CURIE thin nodes: 5040 bullx B510 nodes; each node has 2 8-core Intel Sandy Bridge EP (E5-2680) at 2.7 GHz and 64 GB; 80,640 cores, 1.6 PFlops
- CURIE fat nodes: 360 bullx S6010 nodes; each node has 4 8-core Intel Nehalem-EX (X7560) at 2.26 GHz and 128 GB; 11,520 cores, 105 TFlops
- CURIE hybrid nodes: 16 bullx B505 chassis of 9 hybrid B505 blades, each blade with 2 Intel Westmere at 2.66 GHz and 2 Nvidia M2090; 288 Intel processors + 288 Nvidia GPUs, 192 TFlops
WEAK SCALING
Dense square matrix, N = 26,000 (6.76 x 10^8 elements); 8 eigenpairs; Krylov subspace size = 16; tolerance ε = 10^-8
WEAK SCALING ANALYSIS
Communication time model for the matrix-vector product:

  T_exec(p, N) = T_comp(p, N) + T_comm_mvp(p, N)
  T_comm_mvp(p, N) = 2 (b_PCIe + N / t_PCIe) + 2 log2(p) (b_MPI + N / (p t_MPI))

where:
- p: number of GPUs
- N: volume of data to transfer (matrix order / p)
- b_PCIe: PCIe latency (around 100 ns)
- b_MPI: QDR/MPI latency (around 1 ms)
- t_PCIe: PCI-Express bandwidth = 1 GB/s
- t_MPI: QDR/MPI bandwidth = 6 GB/s
[Figure: total transfer time (at 6 GB/s) and GPU transfer time for the matrix-vector product, vs. number of GPUs from 1 to 361]
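The model can be evaluated numerically. The sketch below uses the latency and bandwidth values from this slide; the structure of the formula (two PCIe transfers plus a 2·log2(p)-depth MPI exchange) is my reconstruction of the garbled original, not a verified transcription:

```python
import math

def t_comm_mvp(p, n_bytes, b_pcie=100e-9, t_pcie=1e9,
               b_mpi=1e-3, t_mpi=6e9):
    """Communication time (s) per matrix-vector product for p GPUs
    and n_bytes of data to transfer per node (reconstructed model)."""
    t = 2 * (b_pcie + n_bytes / t_pcie)        # host <-> GPU over PCIe
    if p > 1:                                   # inter-node MPI exchange
        t += 2 * math.log2(p) * (b_mpi + n_bytes / (p * t_mpi))
    return t
```

With these constants the MPI latency term dominates as soon as p > 1, which is consistent with the slide's emphasis on minimizing global communication.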
STRONG SCALING
Dense square matrix, N = 22,000 (5.11 x 10^8 elements); 135 eigenpairs; Krylov subspace size = 225; tolerance ε = 10^-8
STRONG SCALING ANALYSIS
In strong scaling, two factors drive performance:
- Computing performance, which decreases as the amount of data per processor shrinks
- The communication ratio, which increases as the computing time decreases
Performance evolution on a single GPU, depending on the matrix size, shows three main phases:
1. Execution time decreases: computation time is greater than communication time
2. Optimum point
3. Execution time increases: communication time is greater than computation time
STRONG SCALING ON CPU
[Figures: strong scaling results on fat nodes and on thin nodes]
PERFORMANCE ON CPUs AND GPUS

  Partition | Execution time (s) | Perf. reached / peak | # of nodes | # of cores or GPUs
  Fat       | 4.38               | 169 / 302 = 56%      | 15         | 480
  Thin      | 3.46               | 214 / 437 = 49%      | 24         | 384
  Hybrid    | 2.93               | 306 / 900 = 34%      | 18         | 36
CO-METHODS: MERAM
MERAM
Principle: a co-method based on ERAM
- Launch several ERAM instances with different parameters
- Exchange information in order to improve the convergence of each ERAM
Parameter set: Krylov subspace size, restarting strategy, number of desired eigenpairs
[Diagram: MERAM coordinating three ERAM instances, each with its own parameter set]
MERAM
Well adapted to exascale:
- Smaller scalar products
- Fault tolerance
[Diagram: MERAM coordinating three ERAM instances with different parameter sets]
RESTARTING STRATEGIES: ERAM
General formula (Y. Saad): restart vector = Σ_i α_i φ_i, where the φ_i are the Ritz vectors
- Default: α_i = 1
- Linear: α_i = i
Proposed restarting strategies (idea: use Ritz-value information in the restart coefficients):
- Residual: α_i = 1 - R_i
- Lambda: α_i = λ_i / λ_{i-1}
- Linear Residual: α_i = i (1 - R_i)
- Lambda Residual: α_i = λ_i (1 - R_i)
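The coefficient strategies above (omitting the Lambda ratio variant) can be sketched in a few lines; `phi`, `lam` and `res` stand for the Ritz vectors, Ritz values and residual norms, and all names are mine:

```python
import numpy as np

def restart_vector(phi, lam, res, strategy="linear_residual"):
    """Build the restart vector sum_i alpha_i * phi_i.
    phi: (n, k) Ritz vectors; lam: (k,) Ritz values; res: (k,) residuals."""
    i = np.arange(1, len(lam) + 1, dtype=float)
    alpha = {
        "default":         np.ones(len(lam)),   # alpha_i = 1
        "linear":          i,                   # alpha_i = i
        "residual":        1.0 - res,           # alpha_i = 1 - R_i
        "linear_residual": i * (1.0 - res),     # alpha_i = i (1 - R_i)
        "lambda_residual": lam * (1.0 - res),   # alpha_i = lambda_i (1 - R_i)
    }[strategy]
    return phi @ alpha
```

The residual-weighted variants give more weight to the Ritz vectors that have already nearly converged (small R_i), which is the stated motivation for these strategies.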
MERAM ALGORITHM
Each ERAM process (1..n):
1. Initialization
2. Arnoldi reduction, then QR step
3. Send its restart parameters/results to the controller
4. Test for reception; if the received results are better than its own, complete its restarting vector with the received data
5. Compute the restarting vectors and restart
Controller:
1. Receive and order the restart parameters from all ERAM processes
2. Select the best restart vectors and send them back to the ERAM processes
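A serial mock of this control flow, assuming the simplest exchange policy (every instance restarts from the best instance's Ritz vectors); in the real asynchronous algorithm each ERAM only completes its own restart data with the received one, so this is a deliberately simplified sketch:

```python
import numpy as np

def arnoldi_cycle(A, v, m, k):
    """One ERAM cycle: m-step Arnoldi reduction, then the k dominant
    Ritz pairs and the worst residual norm."""
    n = A.shape[0]
    V = np.zeros((n, m)); H = np.zeros((m, m))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        if j + 1 < m:
            H[j + 1, j] = np.linalg.norm(w)
            V[:, j + 1] = w / H[j + 1, j]
    theta, y = np.linalg.eig(H)
    idx = np.argsort(-np.abs(theta))[:k]
    phi = np.real(V @ y[:, idx])
    res = np.linalg.norm(A @ phi - phi * np.real(theta[idx]), axis=0)
    return np.real(theta[idx]), phi, res.max()

def meram(A, subspace_sizes, k, tol=1e-8, max_cycles=200):
    """Serial mock of the MERAM controller: after each cycle, pick the
    instance with the smallest residual ('select & send best restart
    vectors') and restart every ERAM from its Ritz vectors."""
    rng = np.random.default_rng(1)
    starts = [rng.standard_normal(A.shape[0]) for _ in subspace_sizes]
    best = None
    for _ in range(max_cycles):
        results = [arnoldi_cycle(A, v, m, k)
                   for v, m in zip(starts, subspace_sizes)]
        best = min(results, key=lambda r: r[2])
        if best[2] < tol:
            break
        starts = [best[1].sum(axis=1) for _ in subspace_sizes]
    return best[0], best[1]
```

Running the three instances with different subspace sizes, as on the previous slides, lets the fastest-converging parameter set pull the others along.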
MERAM SCALABILITY ON TITANE
[Figure: execution time (s) of MERAM with three ERAM solvers vs. number of 8-core nodes, from 1 to 148 nodes (1184 cores for 148 nodes): 284.8 s on 1 node down to 4.4 s on 148 nodes]
MERAM VS ERAM
[Figure: execution time (s) of ERAM vs. MERAM on 1 to 148 dual quad-core nodes; MERAM is faster at every node count, with speedups ranging from 7x to 73x]
AUTO-TUNING AND SMART TUNING
AUTO-TUNED MATRIX-VECTOR PRODUCT
Why auto-tune the matrix-vector product?
- The performance of sparse computations on GPU depends on the matrix data pattern (the paper by Bell & Garland shows performance improvements of 10-15x) and on how the hardware (memory controller, caches) is used
- Peak performance: 12.8 GFlops (Nehalem) vs. 37.5 GFlops (Tesla); how can the GPU computing power be exploited efficiently?
- In a restarted Arnoldi iterative method, the matrix is fixed for all iterations
Candidate formats: CSR, CSC, COO, ELLPACK, Hybrid (ELL + COO)
Our strategy:
- Before starting the solver, micro-benchmark the matrix-vector product using the different matrix formats
- Run the matrix-vector product several times, following the well-known scheme A (A.x + x) + x
- For each format, the benchmarking is limited either by a number of runs or by a time limit
- The authorized overhead of the micro-benchmarking is controlled
Speed-up vs. reference: ~3x
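A sketch of this micro-benchmarking strategy, restricted to the formats scipy provides on CPU (CSR, CSC, COO; ELLPACK and HYB are GPU-side formats) and using the slide's A (A.x + x) + x benchmark scheme:

```python
import time
import numpy as np
import scipy.sparse as sp

def pick_format(A, n_runs=10, time_limit=0.5):
    """Micro-benchmark the sparse matrix-vector kernel in several
    storage formats and return the fastest one. Each format's run is
    bounded by a run count and a time budget, which controls the
    authorized tuning overhead."""
    x = np.ones(A.shape[1])
    best_fmt, best_t = None, float("inf")
    for fmt in ("csr", "csc", "coo"):
        M = A.asformat(fmt)
        t0 = time.perf_counter(); runs = 0
        while runs < n_runs and time.perf_counter() - t0 < time_limit:
            y = M @ (M @ x + x) + x      # the slide's benchmark scheme
            runs += 1
        per_run = (time.perf_counter() - t0) / runs
        if per_run < best_t:
            best_fmt, best_t = fmt, per_run
    return best_fmt
```

Since the matrix is fixed for all restarted-Arnoldi iterations, the one-time conversion and benchmarking cost is amortized over the whole solve.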
AUTO-TUNED MATRIX-VECTOR PRODUCT
Experiments on the MVP with matrix nlpkkt80 from the University of Florida sparse matrix collection: order 1,062,400, with 28,192,672 non-zeros

  Format     | Perf. (GFlops) | Speed-up
  CSR scalar | 2              | ref.
  CSR vector | 6.5            | 3.25x
  COO        | 3.5            | 1.75x
  ELL        | 17             | 8.5x
  HYB        | 17             | 8.5x

Improvement for the whole computation: restarted Arnoldi on multiple processors or GPUs with nlpkkt80 (order 1 M, 28 M non-zeros)
SMART TUNING APPLIED TO ERAM/MERAM
Krylov method for non-symmetric eigenvalue problems
SMART TUNING APPLIED TO ERAM/MERAM
Limitations of ERAM:
- The restarting-vector computation is a critical step: a poor restarting vector may disrupt ERAM convergence, and there is no universal method to choose an efficient restarting scheme
- The subspace size influences convergence: a subspace that is too small gives no convergence, and the optimal subspace size depends on the matrix, the desired eigenpairs, etc.
How can we find good restarting coefficients and an optimal subspace size?
SMART TUNING APPLIED TO ERAM/MERAM
We basically use the convergence rate (CR) as a measure. The auto-tuner has several parameters:
- max_idle_time: number of iterations during which the auto-tuner does nothing
- CRmin, CRmax: minimum and maximum CR thresholds
Experiments were conducted on a laptop with 2 cores (4 threads total), in two phases:
- Phase I: testing ERAM subspace autotuning, with the same or a different autotuning per restarting strategy (RS)
- Phase II: testing MERAM subspace autotuning, with the same or a different autotuning per ERAM and restarting strategy
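The CR-driven control loop can be sketched as follows. The parameter names mirror the slide (CRmin, CRmax, idle iterations), but the update rule itself (grow the subspace when convergence stalls, shrink it when it is fast) is my illustrative guess, not the tuner described in the talk:

```python
def autotune_subspace(cr_history, m, m_min=10, m_max=200,
                      cr_min=0.1, cr_max=0.9, step=5, idle=3):
    """Suggest a new Krylov subspace size m from the convergence-rate
    (CR) history. Stays idle until enough history has accumulated,
    then nudges m by `step` when the recent average CR leaves the
    [cr_min, cr_max] band."""
    if len(cr_history) < idle:
        return m                          # idle phase: not enough data
    recent = sum(cr_history[-idle:]) / idle
    if recent < cr_min:                   # stalling: enlarge the subspace
        return min(m + step, m_max)
    if recent > cr_max:                   # converging fast: m can shrink
        return max(m - step, m_min)
    return m
```

Keeping m inside [m_min, m_max] bounds both the memory footprint and the cost of the orthogonalization, which grows with the subspace size.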
COMPARING ERAM WITH AND WITHOUT AUTOTUNING
Matrix CRY10000; 20 desired eigenpairs; tolerance 10^-9; autotuning parameters: step = 5, idle = 15, avg = 5

  Subspace size  | With AT (MVs)                     | Without AT (MVs)
  ERAM(20) (min) | Def. 5560 / Lin. 6365 / Res. 5090 | no convergence (Def., Lin., Res.)
  ERAM(75)       | Res. 2550                         | Res. 19125
  ERAM(85)       | Res. 675                          | Res. 680

  AT ERAM(20) vs ERAM(75): 3.75x    AT ERAM(75) vs AT ERAM(20): 2x
  AT ERAM(75) vs ERAM(75): 7.5x     AT ERAM(85) vs AT ERAM(20): 7.5x
  AT ERAM(85) vs ERAM(85): similar
SMART TUNING APPLIED TO ERAM/MERAM
Smart tuning of the restarting strategies
SMART TUNING ON RS WITH MERAM
Matrix CRY10000
[Figure: iteration counts of MERAM vs. ERAM for each restarting strategy; ERAM needs up to 501 iterations or diverges (DIV) for some strategies, while MERAM converges in 76 to 108 iterations, counting the ERAM instance that converges first]
Commissariat à l'énergie atomique et aux énergies alternatives
Centre de Saclay | 91191 Gif-sur-Yvette Cedex
T. +33 (0)1 69 08 68 80 | F. +33 (0)1 69 08 66 42
DEN/DANS DM2S
Etablissement public à caractère industriel et commercial | RCS Paris B 775 685 019