Par4All: automatic parallelization for GPUs and moulticœurs (manycores)
Mehdi AMINI 1,2, Béatrice CREUSILLET 1,4, Stéphanie EVEN 3, Onil GOUBIER 1, Serge GUELTON 3,2, Ronan KERYELL 1,3, Janice ONANIAN McMAHON 1, François-Xavier PASQUIER 1, Grégoire PÉAN 1, Pierre VILLALON 1
1 HPC Project   2 Mines ParisTech/CRI   3 Institut TÉLÉCOM/TÉLÉCOM Bretagne/HPCAS   4 Université Pierre & Marie Curie/LIP6
29/06/2011
The Distributed Computer. Toward computing on spatial data: pattern recognition, mathematical morphology... Massive parallelism to reduce the cost: SIMD. S. H. Unger. «A Computer Oriented Toward Spatial Problems.» Proceedings of the IRE, p. 1744-1750, Oct. 1958
SOLOMON. Target applications: data reduction, communication, character recognition, optimization, guidance and control, orbit calculations, hydrodynamics, heat flow, diffusion, radar data processing, and numerical weather forecasting. Diode + transistor logic in 10-pin TO5 packages. Daniel L. Slotnick. «The SOLOMON computer.» Proceedings of the December 4-6, 1962, fall joint computer conference, p. 97-107, 1962
POMP & PompC @ LI/ENS, 1987-1992. Up to 256 parallel processors. [Architecture diagram: 88100-based processing elements with 512 KB SRAM and LIW code broadcast from a scalar 88100 control processor; HyperCom network, FIFOs, global exception network, scalar reduction; video output (PAV, DAC RGB); VME bus to a host with RAM and SCSI.]
HyperParallel Technologies (1992-1998). Parallel computer: proprietary 3D-torus network, DEC Alpha 21064 + FPGA. HyperC (follow-up of PompC @ LI/ENS Ulm): a PGAS (Partitioned Global Address Space) language, an ancestor of UPC... Already on the Saclay Plateau! Quite simple business model: customers just need to rewrite all their code in HyperC. Difficult entry cost... Niche market... American subsidiary with a data-parallel data-mining application acquired by Yahoo! in 1998. Closed technology lost for customers and... founders
Present motivations. MOORE's law: there are more transistors, but they cannot be used at full speed without melting. Superscalar and cache are less efficient compared to the transistor budget. Chips are too big to be globally synchronous at multi-GHz. What now costs is moving data and instructions between internal modules, not the computation! Huge time and energy cost to move information outside the chip. Parallelism is the only way to go... Research is just crossing reality! No one size fits all... The future will be heterogeneous: GPGPU, Cell, vector/SIMD, FPGA, PIM... But compilers are always behind...
ARM yourself (I). Do some computations where the sensors are... Smartphones and other sensor networks: trade-off between communication energy and local/remote computation. Texas Instruments OMAP4470, announced on 2011/06/02: 2 ARM Cortex-A9 MPCores @ 1.8 GHz with Neon vector instructions; 2 ARM Cortex-M3 cores (low power and real-time responsiveness, multimedia, avoiding waking up the Cortex-A9...); SGX544 graphics core with OpenCL 1.1 support, with 4 USSE2 cores @ 384 MHz, each producing 4 FMAD/cycle: 12.3 GFLOPS; 2D graphics accelerator; 3 HD displays and up to QXGA (2048x1536) resolution + stereoscopic 3D; dual-channel, 466 MHz LPDDR2 memory
ARM yourself (II). Current race to build non-x86 servers based on ARM... Experiments on low-power clusters. Think of evaluating the power consumption of your application
MPPA by Kalray (the French touch!) (I). 256 VLIW processors per chip (16 clusters of 16 processors). Shared memory and NoC with DMA. 28 nm CMOS technology, 5 W @ 400 MHz. FPU 32/64 bits IEEE 754: 205 GFLOPS SP, 1 TOPS 16 bits. 2 64-bit DDR3 memory controllers for high-bandwidth main memory transfers. Ethernet controllers (2x40 Gb/s or 8x10 Gb/s)
MPPA by Kalray (the French touch!) (II). 2 8-lane PCI Express Gen 3. 4 8-lane Interlaken interfaces for multi-MPPA system integration (8 MPPA/PCIe board) or connection to external FPGAs, I/O... Linux or bare metal with the AccessCore library. Multi-core compiler (gcc 4.5), simulator, debugger (gdb), profiler, Eclipse IDE. Programming from a high-level C-based language, AccessCore library
Off-the-shelf AMD/ATI Radeon HD 6970 GPU. 2.64 billion 40 nm transistors. 1536 stream processors @ 880 MHz: 2.7 TFLOPS SP, 675 GFLOPS DP. External 1 GB GDDR5 memory: 5.5 Gt/s, 176 GB/s, 384-bit GDDR5. 250 W on board (20 W idle), PCI Express 2.1 x16 bus interface. OpenGL, OpenCL. Radeon HD 6990: double-chip card. More integration: Llano APU (FUSION Accelerated Processing Unit): x86 multicore + GPU, 32 nm, OpenCL
Radeon HD 6870 big picture. [Figure: GPU architecture block diagram.]
IBM Power 7 (2010). 4 chips per quad-chip module. 8 cores per chip @ 4.0 GHz (4.25 GHz in TurboCore mode with 4 cores). 1.2 billion transistors, 45 nm SOI process, 567 mm². 4 SMT threads per core. 32 KB I + 32 KB D L1 cache/core. 256 KB L2 cache/core. 32 MB L3 cache in eDRAM. 100 GB/s DDR3 memory. 12 execution units per core: 2 fixed-point units, 2 load/store units, 4 double-precision floating-point units, 1 vector unit supporting VSX, 1 decimal floating-point unit, 1 branch unit, 1 condition register unit
Time to be back in parallelism! Good time for more start-ups! 2006: thinking of yet another start-up... People who met in 1990 at the SEH/ETCA French military parallel-computing lab (CM2, CM5...). They later became researchers in Computer Science, CINES director and ex-CEA/DAM, venture capitalists and more: ex-CEO of Thales Computer, HP marketing... HPC Project launched in December 2007. 30 colleagues in France (Montpellier, Meudon), Canada (Montréal, with Parallel Geometry) & USA (Santa Clara, CA)
HPC Project hardware: WildNode from Wild Systems. Through its Wild Systems subsidiary company. WildNode hardware desktop accelerator: low noise for in-office operation; x86 manycore; nvidia Tesla GPU computing; Linux & Windows. WildHive: aggregate 2-4 nodes, with 2 possible memory views, including distributed memory with Ethernet or InfiniBand. http://www.wild-systems.com
HPC Project software and services. Parallelize and optimize customer applications, co-branded as a bundled product in a WildNode (e.g. the Presagis Stage battle-field simulator, WildCruncher for Scilab//...). Acceleration software for the WildNode: GPU-accelerated libraries for C/Fortran/Scilab/Matlab/Octave/R; transparent execution on the WildNode; the Par4All automatic parallelization tool; remote display software for Windows on the WildNode. HPC consulting: optimization and parallelization of applications; High Performance?... not only TOP500-class systems: power efficiency, embedded systems, green computing... Embedded system and application design. Training in parallel programming (OpenMP, MPI, TBB, CUDA, OpenCL...)
Par4All. Outline: 1 Par4All (OpenMP code generation; GPU code generation); 2 Coding rules; 3 Wild Cruncher; 4 Results; 5 Future developments; 6 Conclusion; 7 Table des matières
Par4All: Use the Source, Luke... Hardware is moving quite (too) fast but... What has survived for 50+ years? Fortran programs... What has survived for 40+ years? IDL, Matlab, Scilab... What has survived for 30+ years? C programs, Unix... A lot of legacy code could be pushed onto parallel hardware (accelerators) with automatic tools... Plan B: no new language... Need automatic tools for source-to-source transformation to leverage existing software tools for a given hardware. Not as efficient as hand-tuned programs, but quick production phase
Par4All: We need software tools!!! Application development is a long-term business: long-term commitment in a tool that needs to survive (too fast) technology changes. HPC Project needs tools for its hardware accelerators (WildNodes from Wild Systems) and to parallelize, port & optimize customer applications
Par4All: Not reinventing the wheel... No NIH syndrome please! Want to create your own tool? House-keeping and infrastructure in a compiler is a huge task: unreasonable to begin yet another new compiler project... Many academic Open Source projects are available... but customers need products. Integrate your ideas and developments into an existing project... or buy one if you can afford it (ST with PGI...). Some projects to consider: old projects (gcc, PIPS... and many dead ones (SUIF...)), but new ones appear too: LLVM, RoseCompiler, Cetus... Par4All: funding an initiative to industrialize Open Source tools. PIPS is the first project to enter the Par4All initiative. http://www.par4all.org
Par4All: PIPS (I). PIPS (Interprocedural Parallelizer of Scientific Programs): an Open Source project from Mines ParisTech... 23 years old! Funded by many parties (French DoD, Industry & Research Departments, University, CEA, IFP, Onera, ANR (French NSF), European projects, regional research clusters...). One of the projects that introduced polytope model-based compilation. 456 KLOC according to David A. Wheeler's SLOCCount... but a modular and sensible approach to pass through the years. 300 phases (parsers, analyzers, transformations, optimizers, parallelizers, code generators, pretty-printers...) that can be combined for the right purpose. Abstract interpretation
Par4All: PIPS (II). Polytope lattice (sparse linear algebra) used for semantics analysis, transformations, code generation... to deal with big programs, not only loop nests. NewGen object description language for language-agnostic automatic generation of methods, persistence, object introspection, visitors, accessors, constructors, XML marshaling for interfacing with external tools... Interprocedural à la make engine to chain the phases as needed; lazy construction of resources. On-going efforts to extend the semantics analysis for C. Around 15 programmers currently developing in PIPS (Mines ParisTech, HPC Project, IT SudParis, TÉLÉCOM Bretagne, RPI) with public svn, Trac, git, mailing lists, IRC, Plone, Skype... and using it for many projects. But still... a huge need for documentation (even if PIPS uses literate programming...)
Par4All: PIPS (III). Need for industrialization. Need for further communication to increase the community size
Par4All: Current PIPS usage. Automatic parallelization (Par4All C & Fortran to OpenMP). Distributed memory computing with OpenMP-to-MPI translation [STEP project]. Generic vectorization for SIMD instructions (SSE, VMX, Neon, CUDA, OpenCL...) (SAC project) [SCALOPES]. Parallelization for embedded systems [SCALOPES]. Compilation for hardware accelerators (Ter@PIX, SPoC, SIMD, FPGA...) [FREIA, SCALOPES]. High-level hardware accelerator synthesis generation for FPGA [PHRASE, CoMap]. Reverse engineering & decompiler (reconstruction from binary to C). Genetic algorithm-based optimization [Luxembourg university+TB]. Code instrumentation for performance measures. GPU with CUDA & OpenCL [TransMedi@, FREIA, OpenGPU, MediaGPU]
Par4All usage. Generate from sequential C, Fortran & Scilab code: OpenMP for SMP; CUDA for nvidia GPU; SCMP task programs for the SCMP machine from CEA; OpenCL for GPU & ST Platform 2012 (on-going); code for various accelerators [SMECY], Kalray/MPPA [SIMILAN]... (on-going)
Par4All: PyPS scripting in the backstage (I). PIPS is a great tool-box for source-to-source compilation... but not really usable by the average end-user. Development of Par4All: add a user compiler-like infrastructure, with a p4a script as simple as: p4a --openmp toto.c -o toto, or p4a --cuda toto.c -o toto -lm. Be multi-target. Apply some adaptive transformations. Up to now PIPS was scripted with a special shell-like language: tpips. Not powerful enough (not a programming language). So: develop a SWIG Python interface to the PIPS phases and interfaces
Par4All: PyPS scripting in the backstage (II). Overview: all the power of a widely spread real language; automate with introspection through the compilation flow; easy to add any glue, pre-/post-processing to generate target code. [Toolchain diagram: sequential source code (.c, .f) → preprocessor → PyPS driving PIPS → postprocessor → parallel source code (.p4a.c, .p4a.cu) → back-end compilers (gcc, icc... for the OpenMP executable; nvcc + Par4All P4A Accel runtime for the CUDA executable) → parallel executables.] Invoke PIPS transformations, with different recipes according to the generated target; special treatments on kernels... Compilation and linking infrastructure: can use gcc, icc, nvcc, nvcc+gcc, nvcc+icc
Par4All: PyPS scripting in the backstage (III). Housekeeping code. Fundamental: colorizing and filtering some PIPS output, running cursor...
Par4All: OpenMP code generation. Outline: 1 Par4All (OpenMP code generation; GPU code generation); 2 Coding rules; 3 Wild Cruncher; 4 Results; 5 Future developments; 6 Conclusion; 7 Table des matières
Par4All / Parallelization to OpenMP: OpenMP code generation (I). The easy way... Already in PIPS; used to bootstrap the start-up with stage-0 investors. Indeed, we used only bash-generated tpips at this time (2008, no PyPS yet); needed a lot of bug squashing on the C support in PIPS... Several parallelization algorithms are available in PIPS. For example, the classical Allen & Kennedy algorithm uses loop distribution; it is more vector-oriented than kernel-oriented (or needs later loop fusion). Coarse-grain parallelization is based on the independence of the array regions used by different loop iterations; currently used because it generates GPU-friendly coarse-grain parallelism and accepts complex control code without if-conversion. Now in src/simple_tools/p4a_process.py, function process()
Par4All / Parallelization to OpenMP: OpenMP code generation (II)

    # First apply some generic parallelization:
    processor.parallelize(fine = input.fine,
                          apply_phases_before = input.apply_phases[abp],
                          apply_phases_after = input.apply_phases[aap])
    [...]
    if input.openmp and not input.accel:
        # Parallelize the code in an OpenMP way:
        processor.ompify(apply_phases_before = input.apply_phases[abo],
                         apply_phases_after = input.apply_phases[aao])

    # Write the output files.
    output.files = processor.save(input.output_dir,
                                  input.output_prefix,
                                  input.output_suffix)

src/simple_tools/p4a_process.py, function p4a_processor::parallelize()
Par4All / Parallelization to OpenMP: OpenMP code generation (III)

    def parallelize(self, fine = False, filter_select = None,
                    filter_exclude = None, apply_phases_before = [],
                    apply_phases_after = []):
        """Apply transformations to parallelize the code in the workspace
        """
        all_modules = self.filter_modules(filter_select, filter_exclude)

        if fine:
            # Set to False (mandatory) for A&K algorithm on C source files
            self.workspace.props.memory_effects_only = self.fortran

        # Apply requested phases before parallelization
        apply_user_requested_phases(all_modules, apply_phases_before)

        # Try to privatize all the scalar variables in loops:
        all_modules.privatize_module()

        if fine:
            # Use a fine-grain parallelization à la Allen & Kennedy:
            all_modules.internalize_parallel_code(concurrent = True)

        # Always use a coarse-grain parallelization with regions:
        all_modules.coarse_grain_parallelization(concurrent = True)

        if self.atomic:
            # Use a coarse-grain parallelization with regions:
Par4All / Parallelization to OpenMP: OpenMP code generation (IV)

            all_modules.flag_parallel_reduced_loops_with_atomic(concurrent = True)

        # Apply requested phases after parallelization:
        apply_user_requested_phases(all_modules, apply_phases_after)

Subliminal message to PIPS/Par4All developers: write clear code with good comments, since it can end up verbatim in presentations like this
Par4All: OpenMP output sample (OpenMP code generation)

    !$omp parallel do private(I, K, X)
    C     multiply the two square matrices of ones
          DO J = 1, N
    !$omp parallel do private(K, X)
             DO I = 1, N
                X = 0
    !$omp parallel do reduction(+:X)
                DO K = 1, N
                   X = X + A(I,K)*B(K,J)
                ENDDO
    !$omp end parallel do
                C(I,J) = X
             ENDDO
    !$omp end parallel do
          ENDDO
    !$omp end parallel do
Par4All: GPU code generation. Outline: 1 Par4All (OpenMP code generation; GPU code generation); 2 Coding rules; 3 Wild Cruncher; 4 Results; 5 Future developments; 6 Conclusion; 7 Table des matières
Par4All / GPU code generation: Basic GPU execution model. A sequential program on a host launches computation-intensive kernels on a GPU: allocate storage on the GPU; copy-in data from the host to the GPU; launch the kernel on the GPU; the host waits...; copy-out the results from the GPU to the host; deallocate the storage on the GPU. Generic scheme for other heterogeneous accelerators too
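A minimal CUDA sketch of this scheme, with a hypothetical kernel (illustrative only; the actual generated code uses the P4A Accel runtime shown later):

    #include <cuda_runtime.h>

    /* Hypothetical kernel: double every element of a vector. */
    __global__ void scale_kernel(float *d, int n) {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n)
        d[i] = 2.f*d[i];
    }

    void run_on_gpu(float *h, int n) {
      float *d;
      size_t size = n*sizeof(float);
      cudaMalloc((void **) &d, size);                  /* allocate storage on the GPU */
      cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);  /* copy-in host -> GPU */
      scale_kernel<<<(n + 255)/256, 256>>>(d, n);      /* launch the kernel */
      cudaThreadSynchronize();                         /* the host waits... */
      cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);  /* copy-out GPU -> host */
      cudaFree(d);                                     /* deallocate GPU storage */
    }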
Par4All / GPU code generation: Automatic parallelization. The most fundamental step for a parallel execution: finding parallelism! Several parallelization algorithms are available in PIPS. For example, the classical Allen & Kennedy algorithm uses loop distribution; it is more vector-oriented than kernel-oriented (or needs later loop fusion). Coarse-grain parallelization is based on the independence of the array regions used by different loop iterations; currently used because it generates GPU-friendly coarse-grain parallelism and accepts complex control code without if-conversion
Par4All / GPU code generation: Outlining (I). Parallel code → kernel code on GPU. Need to extract parallel source code into kernel source code: outlining of parallel loop nests.

Before:

    for (i = 1; i <= 499; i++)
      for (j = 1; j <= 499; j++) {
        save[i][j] = 0.25*(space[i - 1][j] + space[i + 1][j]
                           + space[i][j - 1] + space[i][j + 1]);
      }
Par4All / GPU code generation: Outlining (II). After:

    p4a_kernel_launcher_0(space, save);
    [...]
    void p4a_kernel_launcher_0(float_t space[SIZE][SIZE],
                               float_t save[SIZE][SIZE]) {
      for (i = 1; i <= 499; i += 1)
        for (j = 1; j <= 499; j += 1)
          p4a_kernel_0(i, j, save, space);
    }
    [...]
    void p4a_kernel_0(float_t space[SIZE][SIZE],
                      float_t save[SIZE][SIZE],
                      int i,
                      int j) {
      save[i][j] = 0.25*(space[i-1][j] + space[i+1][j]
                         + space[i][j-1] + space[i][j+1]);
    }
Par4All / GPU code generation: From array regions to GPU memory allocation (I). Memory accesses are summed up for each statement as regions for array accesses: an integer polytope lattice. There are regions for write accesses and regions for read accesses. A region can be exact, if PIPS can prove that only these points are accessed, or inexact, if PIPS can only find an over-approximation of what is really accessed
Par4All / GPU code generation: From array regions to GPU memory allocation (II). Example:

    for (i = 0; i <= n-1; i += 1)
      for (j = i; j <= n-1; j += 1)
        h_a[i][j] = 1;

can be decorated by PIPS with write array regions as:

    // <h_a[PHI1][PHI2] W EXACT {0<=PHI1, PHI2+1<=n, PHI1<=PHI2}>
    for (i = 0; i <= n-1; i += 1)
      // <h_a[PHI1][PHI2] W EXACT {PHI1==i, i<=PHI2, PHI2+1<=n, 0<=i}>
      for (j = i; j <= n-1; j += 1)
        // <h_a[PHI1][PHI2] W EXACT {PHI1==i, PHI2==j, 0<=i, i<=j, 1+j<=n}>
        h_a[i][j] = 1;

These read/write regions for a kernel are used to allocate, with a cudaMalloc() in the host code, the memory used inside the kernel, and to deallocate it later with a cudaFree()
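For illustration only (hypothetical variable names; real Par4All output goes through the P4A Accel runtime shown later), the exact write region h_a[0..n-1][0..n-1] above maps to host-side allocation code like:

    int *d_a;                                    /* GPU mirror of h_a, assuming int elements */
    size_t size = (size_t) n * n * sizeof(int);  /* extent derived from the region bounds */
    cudaMalloc((void **) &d_a, size);            /* allocation before the kernel launch */
    /* ... launch the kernel that writes d_a ... */
    cudaFree(d_a);                               /* deallocation afterwards */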
Par4All / GPU code generation: Communication generation (I). Conservative approach to generate communications: associate any GPU memory allocation with a copy-in, to keep its value in sync with the host code; associate any GPU memory deallocation with a copy-out, to keep the host code in sync with the updated values on the GPU. But a kernel could use an array as a local (private) array... PIPS does have many privatization phases. And a kernel could initialize an array, or use the initial values without writing into it, or use/touch only a part of it, or...
Par4All / GPU code generation: Communication generation (II). More subtle approach: PIPS gives 2 very interesting region types for this purpose. In-regions abstract what is really needed by a statement; Out-regions abstract what is really produced by a statement to be used later elsewhere. In-Out regions can be directly translated with CUDA into a copy-in:

    cudaMemcpy(accel_address, host_address,
               size, cudaMemcpyHostToDevice)

and a copy-out:

    cudaMemcpy(host_address, accel_address,
               size, cudaMemcpyDeviceToHost)
Par4All / GPU code generation: Loop normalization. Hardware accelerators use a fixed iteration space (thread indices starting from 0...); parallel loops have a more general iteration space → loop normalization.

Before:

    for (i = 1; i < SIZE - 1; i++)
      for (j = 1; j < SIZE - 1; j++) {
        save[i][j] = 0.25*(space[i - 1][j] + space[i + 1][j]
                           + space[i][j - 1] + space[i][j + 1]);
      }

After:

    for (i = 0; i < SIZE - 2; i++)
      for (j = 0; j < SIZE - 2; j++) {
        save[i+1][j+1] = 0.25*(space[i][j + 1] + space[i + 2][j + 1]
                               + space[i + 1][j] + space[i + 1][j + 2]);
      }
Par4All / GPU code generation: From preconditions to iteration clamping (I). Parallel loop nests are compiled into a CUDA kernel wrapper launch. The kernel wrapper itself gets its virtual processor index with some blockIdx.x*blockDim.x + threadIdx.x. Since only full blocks of threads are executed, if the number of iterations in a given dimension is not a multiple of the blockDim, there are incomplete blocks. An incomplete block means that some index overrun occurs if all the threads of the block are executed
Par4All / GPU code generation: From preconditions to iteration clamping (II). So we need to generate code such as:

    void p4a_kernel_wrapper_0(int k, int l, ...)
    {
      k = blockIdx.x*blockDim.x + threadIdx.x;
      l = blockIdx.y*blockDim.y + threadIdx.y;
      if (k >= 0 && k <= M - 1 && l >= 0 && l <= M - 1)
        kernel(k, l, ...);
    }

But how to insert these guards? The good news is that PIPS owns preconditions, which are predicates on integer variables. The preconditions at the entry of the kernel are:

    // P(i,j,k,l) {0<=k, k<=63, 0<=l, l<=63}

The guard is a direct translation in C of the preconditions on the loop indices that are GPU thread indices
Par4All / GPU code generation: Complexity analysis. Launching a GPU kernel is expensive, so we need to launch only kernels with a significant speed-up (launch overhead, CPU-GPU memory copy overhead...). Some systems use a #pragma to give go/no-go information to parallel execution:

    #pragma omp parallel if(size > 100)

A phase in PIPS symbolically estimates the complexity of statements. Based on preconditions. Uses a SuperSparc2 model from the 90s... Can be changed, but precise enough to give coarse go/no-go information. To be refined: use memory usage complexity to get information about memory reuse (even a big kernel could be more efficient on a CPU if there is good cache use)
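As a hedged sketch (illustrative names and threshold, not actual Par4All output), the go/no-go decision amounts to guarding the accelerated path with the statically estimated work:

    /* Hypothetical dispatch on a static complexity estimate of a loop nest. */
    void run_on_gpu(float a[], int n);   /* accelerated version (assumed to exist) */
    void run_on_cpu(float a[], int n);   /* sequential version (assumed to exist) */

    void run(float a[], int n) {
      /* Estimated cost grows with n: only pay the kernel launch and
         the CPU-GPU copies when the work is large enough to amortize them. */
      if (n > 100)
        run_on_gpu(a, n);
      else
        run_on_cpu(a, n);
    }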
Par4All / GPU code generation: Communication optimization. Naive approach: load/compute/store. Useless communications if data on the GPU is not used on the host between 2 kernels... Use static interprocedural data-flow analysis of communications. Fuse various GPU arrays: remove GPU (de)allocations. Remove redundant communications. New p4a --com-optimization option
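A hedged sketch of the effect (hypothetical kernels; the P4A copy calls are assumed to have the host/accel/size signature seen in the generated Hyantes code later): with two consecutive kernels reusing the same array, the naive scheme copies around each kernel, while the optimized scheme keeps the data on the GPU:

    #include <stddef.h>

    extern void P4A_copy_to_accel(void *host, void *accel, size_t size);
    extern void P4A_copy_from_accel(void *host, void *accel, size_t size);
    extern void kernel_1(float *d);   /* hypothetical kernels */
    extern void kernel_2(float *d);

    void naive(float *a, float *d_a, size_t size) {
      P4A_copy_to_accel(a, d_a, size);     /* load */
      kernel_1(d_a);                       /* compute */
      P4A_copy_from_accel(a, d_a, size);   /* store: redundant, a unused on host here */
      P4A_copy_to_accel(a, d_a, size);     /* reload: redundant */
      kernel_2(d_a);
      P4A_copy_from_accel(a, d_a, size);
    }

    void optimized(float *a, float *d_a, size_t size) {
      P4A_copy_to_accel(a, d_a, size);     /* single load */
      kernel_1(d_a);
      kernel_2(d_a);                       /* data stayed on the GPU in between */
      P4A_copy_from_accel(a, d_a, size);   /* single store */
    }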
Par4All / GPU code generation: Fortran to C-based GPU languages. A Fortran 77 parser is available in PIPS. CUDA & OpenCL are C++/C99 with some restrictions on the GPU-executed parts. Need a Fortran to C translator (f2c...)? Only one internal representation is used in PIPS: use the Fortran parser... and the C pretty-printer! But the Fortran I/O library is complex to use... and to translate. If you have I/O instructions in a Fortran loop nest, it is not parallelized anyway because of sequential side effects. So keep the Fortran output everywhere but in the parallel CUDA kernels. Apply a memory access transposition phase, a(i,j) → a[j-1][i-1], inside the kernels to be pretty-printed as C. Compile and link the C GPU kernel parts + the Fortran main parts. Quite harder than expected... Use Fortran 2003 for C interfaces...
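For instance (illustrative code, not actual Par4All output), a Fortran loop nest and its pretty-printed C counterpart after index transposition:

    /* Fortran source (column-major, 1-based):
          DO j = 1, m
            DO i = 1, n
              a(i,j) = b(i,j) + 1
            ENDDO
          ENDDO
       Pretty-printed as C (row-major, 0-based), with the accesses transposed
       so the memory layout still matches the Fortran arrays: */
    for (int j = 1; j <= m; j++)
      for (int i = 1; i <= n; i++)
        a[j-1][i-1] = b[j-1][i-1] + 1;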
Par4All / GPU code generation: Par4All Accel runtime (I). CUDA or OpenCL constructs such as __device__ or <<< >>> cannot be directly represented in the internal representation (IR, abstract syntax tree). PIPS motto: keep the IR as simple as possible by design. So use calls to intrinsic functions that can be represented directly. The intrinsic functions are implemented with (macro-)functions: p4a_accel.h indeed currently has 3 implementations: p4a_accel-cuda.h, which can be compiled with CUDA for nvidia GPU execution or emulation on CPU; p4a_accel-openmp.h, which can be compiled with an OpenMP compiler for simulation on a (multicore) CPU. Adds CUDA support for complex numbers
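A minimal sketch of the idea (hypothetical macro bodies and selector, not the real p4a_accel.h): one intrinsic in the IR, two back-ends selected at compile time:

    /* Sketch: the generated code only ever calls P4A_copy_to_accel(). */
    #ifdef P4A_ACCEL_CUDA
    #include <cuda_runtime.h>
    #define P4A_copy_to_accel(host, accel, size) \
      cudaMemcpy((accel), (host), (size), cudaMemcpyHostToDevice)
    #else  /* OpenMP emulation: the "accelerator" is plain host memory */
    #include <string.h>
    #define P4A_copy_to_accel(host, accel, size) \
      memcpy((accel), (host), (size))
    #endif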
Par4All / GPU code generation: Par4All Accel runtime (II). On-going support of OpenCL (p4a-opencl git branch), written in C/CPP/C++. Can be used to simplify manual programming too (OpenCL...): a manual radar electromagnetic simulation code @TB; one code targeting CUDA/OpenCL/OpenMP (without Par4All). OpenMP emulation comes almost for free: use Valgrind to debug GPU-like and communication code! (A nice side effect of source-to-source...) May even improve performance compared to native OpenMP generation, because of the memory layout change
Par4All / GPU code generation: Working around CUDA & OpenCL limitations. CUDA is not based on C99 but rather on C89 + a few C++ extensions. OpenCL is based on C99, but C99 variable-length arrays are not allowed in kernels... So some nice PIPS-generated code from C99 user code may not compile: use array linearization in some places
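For example (illustrative only; real kernels would also carry __global__ or __kernel qualifiers), a C99 variable-length array parameter has to be flattened:

    /* C99 code as PIPS would like to generate it: */
    void kernel_vla(int n, int m, float a[n][m], int i, int j) {
      a[i][j] = 0.f;
    }

    /* Linearized version accepted by CUDA/OpenCL kernels: */
    void kernel_linear(int n, int m, float *a, int i, int j) {
      a[i*m + j] = 0.f;
    }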
Coding rules. Outline: 1 Par4All (OpenMP code generation; GPU code generation); 2 Coding rules; 3 Wild Cruncher; 4 Results; 5 Future developments; 6 Conclusion; 7 Table des matières
Coding rules. Automatic parallelization is not magic. Use abstract interpretation to «understand» programs: undecidable in the generic case (halting problem), but quite easier for well-written programs. Develop a coding-rules manual to help parallelization and... sequential quality! Avoid useless pointers. Take advantage of C99 (arrays of non-static size...). Use higher-level C, do not linearize arrays... Organize execution in cleaner loops expressing better parallelism... A prototype of the coding rules report is on-line on par4all.org
Coding rules: Bad/good C programming example. Dynamically allocated 2-D array with dynamic size.

Bad:

    int n, m;
    double *t = malloc(sizeof(double)*n*m);
    for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++)
        t[i*m + j] = i + j;
    free(t);

Good:

    double (*t)[n][m] = malloc(sizeof(double[n][m]));
    for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++)
        (*t)[i][j] = i + j;
    free(t);

Good:

    {
      double t[n][m];
      for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
          t[i][j] = i + j;
    }

Before parallelization, good sequential programming...
Coding rules: Take advantage of C99 (I). Write C99impler C99ode, easier to parallelize automatiC99ally... BaC99k to C99imple C99ings: avoiding C99umbersome old C C99onstruC99ts C99an lead to C99leaner, more effiC99ient and more parallelizable C99ode (with Par4All...). C99 adds C99ympaC99etiC99 features that are unfortunately not well known: multidimenC99ional arrays with non-statiC99 size; avoid malloc() spam and useless pointer C99onstruC99ions; you C99an have arrays with a dynamiC99 size in funC99tion parameters (as in Fortran, see the sketch below); avoid useless linearizing C99omputaC99ions (a[i][j] instead of a[i+n*j]...); avoid non-affine constructs that are hard to analyze for parallelization, communications...; avoid most of alloca()
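A small C99 illustration of dynamic-size arrays as function parameters: no malloc(), no manual linearization, and the access pattern stays affine and easy to analyze:

    #include <stdio.h>

    /* C99: the bounds are ordinary parameters, the array type uses them. */
    void fill(int n, int m, double a[n][m]) {
      for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
          a[i][j] = i + j;
    }

    int main(void) {
      int n = 3, m = 4;
      double a[n][m];          /* variable-length array */
      fill(n, m, a);
      printf("%g\n", a[2][3]); /* prints 5 */
      return 0;
    }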
Coding rules: Take advantage of C99 (II). The TLS (Thread-Local Storage) extenC99ion C99an express independent storage. C99omplex numbers and booleans avoid further struC99tures or enums. C99 is good for you!
Wild Cruncher. Outline: 1 Par4All (OpenMP code generation; GPU code generation); 2 Coding rules; 3 Wild Cruncher; 4 Results; 5 Future developments; 6 Conclusion; 7 Table des matières
Wild Cruncher: Scilab language. Interpreted scientific language widely used, like Matlab. Free software, with roots in a free version of Matlab from the 80s. Dynamic typing (scalars, vectors, (hyper)matrices, strings...). Many scientific functions, graphics... Double precision everywhere, even for loop indices (now). Slow, because everything is decided at runtime, plus garbage collection. Implicit loops around each vector expression: huge memory bandwidth used, cache thrashing, redundant control flow
Wild Cruncher: Scilab & Matlab. Strong commitment to develop Scilab through Scilab Enterprise, backed by a big user community, INRIA... Reuse the Par4All infrastructure to parallelize the code. Scilab/Matlab input: sequential or array syntax. Compilation to C code. Type inference to guess the (crazy) legacy semantics; heuristic: the first encountered type is forever (see the sketch below). Parallelization of the generated C code. Speedup > 1000. WildCruncher: x86+GPU appliance with a nice interface. Scilab mathematical models & simulation. Par4All automatic parallelization. //Geometry polynomial-based 3D rendering & modelling. HPC Project WildNode appliance with Scilab parallelization presented at Ter@tec 2011/06/28-29
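A toy, hypothetical illustration (not actual Wild Cruncher output) of both points, the "first encountered type is forever" heuristic and implicit vector loops becoming explicit C loops:

    /* Scilab source:
         a = 0;        // first occurrence of a: a double scalar, forever
         v = 1:n;      // row vector
         w = 2*v + a;  // implicit loop over the whole vector
       Hypothetical generated C: */
    void compute(int n, double w[n]) {
      double a = 0;
      double v[n];
      for (int i = 0; i < n; i++)   /* 1:n */
        v[i] = i + 1;
      for (int i = 0; i < n; i++)   /* vector expression lowered to a loop */
        w[i] = 2*v[i] + a;
    }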
Wild Cruncher
Results. Outline: 1 Par4All (OpenMP code generation; GPU code generation); 2 Coding rules; 3 Wild Cruncher; 4 Results; 5 Future developments; 6 Conclusion; 7 Table des matières
Results: Hyantes (I). Geographical application: a library to compute neighbourhood population potential with scale control. WildNode with 2 Intel Xeon X5670 @ 2.93 GHz (12 cores) and an nvidia Tesla C2050 (Fermi), Linux/Ubuntu 10.04, gcc 4.4.3, CUDA 3.1: sequential execution time on CPU: 30.355 s; OpenMP parallel execution time on CPUs: 3.859 s, speed-up: 7.87; CUDA parallel execution time on GPU: 0.441 s, speed-up: 68.8. With single precision on an HP EliteBook 8730w laptop (Intel Core2 Extreme Q9300 @ 2.53 GHz (4 cores) and an nvidia Quadro FX 3700M GPU (16 multiprocessors, 128 cores, architecture 1.1)) with Linux/Debian/sid, gcc 4.4.5, CUDA 3.1: sequential execution time on CPU: 34.7 s; OpenMP parallel execution time on CPUs: 13.7 s, speed-up: 2.53; OpenMP emulation of GPU on CPUs: 9.7 s, speed-up: 3.6; CUDA parallel execution time on GPU: 1.57 s, speed-up: 24.2
Results: Hyantes (II). Original main C kernel:

    void run(data_t xmin, data_t ymin, data_t xmax, data_t ymax,
             data_t step, data_t range,
             town pt[rangex][rangey], town t[nb]) {
      size_t i, j, k;

      fprintf(stderr, "begin computation... \n");

      for (i = 0; i < rangex; i++)
        for (j = 0; j < rangey; j++) {
          pt[i][j].latitude = (xmin + step*i)*180/M_PI;
          pt[i][j].longitude = (ymin + step*j)*180/M_PI;
          pt[i][j].stock = 0.;
          for (k = 0; k < nb; k++) {
            data_t tmp = 6368.*acos(cos(xmin + step*i)*cos(t[k].latitude)
                                    * cos((ymin + step*j) - t[k].longitude)
                                    + sin(xmin + step*i)*sin(t[k].latitude));
            if (tmp < range)
              pt[i][j].stock += t[k].stock / (1 + tmp);
          }
        }
      fprintf(stderr, "end computation... \n");
    }
Results: Hyantes (III). Example given in par4all.org distribution
Results: Hyantes (IV). OpenMP code:

    void run(data_t xmin, data_t ymin, data_t xmax, data_t ymax,
             data_t step, data_t range, town pt[290][299], town t[2878])
    {
      size_t i, j, k;
      fprintf(stderr, "begin computation... \n");

    #pragma omp parallel for private(k, j)
      for (i = 0; i <= 289; i += 1)
        for (j = 0; j <= 298; j += 1) {
          pt[i][j].latitude = (xmin + step*i)*180/3.14159265358979323846;
          pt[i][j].longitude = (ymin + step*j)*180/3.14159265358979323846;
          pt[i][j].stock = 0.;
          for (k = 0; k <= 2877; k += 1) {
            data_t tmp = 6368.*acos(cos(xmin + step*i)*cos(t[k].latitude)
                                    * cos((ymin + step*j) - t[k].longitude)
                                    + sin(xmin + step*i)*sin(t[k].latitude));
            if (tmp < range)
              pt[i][j].stock += t[k].stock/(1 + tmp);
          }
        }
      fprintf(stderr, "end computation... \n");
    }

Results: Hyantes (V). OpenMP code (continued):

    void display(town pt[290][299])
    {
      size_t i, j;
      for (i = 0; i <= 289; i += 1) {
        for (j = 0; j <= 298; j += 1)
          printf("%lf %lf %lf\n", pt[i][j].latitude, pt[i][j].longitude,
                 pt[i][j].stock);
        printf("\n");
      }
    }
Results: Hyantes (VI). Generated GPU code:

    void run(data_t xmin, data_t ymin, data_t xmax, data_t ymax, data_t step,
             data_t range, town pt[290][299], town t[2878])
    {
      size_t i, j, k;
      //PIPS generated variable
      town (*P_0)[2878] = (town (*)[2878]) 0,
           (*P_1)[290][299] = (town (*)[290][299]) 0;
      fprintf(stderr, "begin computation... \n");
      P4A_accel_malloc(&P_1, sizeof(town[290][299])-1+1);
      P4A_accel_malloc(&P_0, sizeof(town[2878])-1+1);
      P4A_copy_to_accel(pt, *P_1, sizeof(town[290][299])-1+1);
      P4A_copy_to_accel(t, *P_0, sizeof(town[2878])-1+1);
      p4a_kernel_launcher_0(*P_1, range, step, *P_0, xmin, ymin);
      P4A_copy_from_accel(pt, *P_1, sizeof(town[290][299])-1+1);
      P4A_accel_free(*P_1);
      P4A_accel_free(*P_0);
      fprintf(stderr, "end computation... \n");
    }

    void p4a_kernel_launcher_0(town pt[290][299], data_t range, data_t step,
                               town t[2878], data_t xmin, data_t ymin)
    {
      //PIPS generated variable
      size_t i, j, k;
      P4A_call_accel_kernel_2d(p4a_kernel_wrapper_0, 290, 299, i, j, pt, range,
                               step, t, xmin, ymin);
    }

Results: Hyantes (VII). Generated GPU code (continued):

    P4A_accel_kernel_wrapper void p4a_kernel_wrapper_0(size_t i, size_t j,
        town pt[290][299], data_t range, data_t step, town t[2878],
        data_t xmin, data_t ymin)
    {
      // Index has been replaced by P4A_vp_0:
      i = P4A_vp_0;
      // Index has been replaced by P4A_vp_1:
      j = P4A_vp_1;
      // Loop nest P4A end
      p4a_kernel_0(i, j, &pt[0][0], range, step, &t[0], xmin, ymin);
    }

    P4A_accel_kernel void p4a_kernel_0(size_t i, size_t j, town *pt,
        data_t range, data_t step, town *t, data_t xmin, data_t ymin)
    {
      //PIPS generated variable
      size_t k;
      // Loop nest P4A end
      if (i <= 289 && j <= 298) {
        pt[299*i + j].latitude = (xmin + step*i)*180/3.14159265358979323846;
        pt[299*i + j].longitude = (ymin + step*j)*180/3.14159265358979323846;
        pt[299*i + j].stock = 0.;
        for (k = 0; k <= 2877; k += 1) {
          data_t tmp = 6368.*acos(cos(xmin + step*i)*cos((*(t+k)).latitude)
                                  * cos(ymin + step*j - (*(t+k)).longitude)
                                  + sin(xmin + step*i)*sin((*(t+k)).latitude));
          if (tmp < range)
            pt[299*i + j].stock += t[k].stock/(1 + tmp);
        }
      }
    }
Results: Stars-PM. Particle-Mesh N-body cosmological simulation. C code from the Observatoire Astronomique de Strasbourg. Uses 3D FFTs. Example given in par4all.org distribution
Results: Stars-PM time step.

    void iteration(coord pos[NP][NP][NP],
                   coord vel[NP][NP][NP],
                   float dens[NP][NP][NP],
                   int data[NP][NP][NP],
                   int histo[NP][NP][NP]) {
      /* Split space into a regular 3D grid: */
      discretisation(pos, data);
      /* Compute density on the grid: */
      histogram(data, histo);
      /* Compute attraction potential in Fourier's space: */
      potential(histo, dens);
      /* Compute in each dimension the resulting forces and
         integrate the acceleration to update the speeds: */
      forcex(dens, force);
      updatevel(vel, force, data, 0, dt);
      forcey(dens, force);
      updatevel(vel, force, data, 1, dt);
      forcez(dens, force);
      updatevel(vel, force, data, 2, dt);
      /* Move the particles: */
      updatepos(pos, vel);
    }
Results: Stars-PM & Jacobi results with p4a 1.1.2. 2 Xeon Nehalem X5670 (12 cores @ 2.93 GHz), 1 nvidia Tesla C2050 GPU, CUDA 3.2. Automatic call to CuFFT instead of FFTW (stubs...). 150 iterations of Stars-PM.

    Speed-up with p4a                                 Cosmological simulation     Jacobi
                                                      32^3     64^3    128^3
    Sequential (time in s, gcc -O3)                   0.68     6.30     98.4       24.5
    OpenMP, 6 threads (--openmp)                      4.25     4.92      5.9       1.78
    CUDA base (--cuda)                                0.77     1.21     3.13       0.36
    Comm. optim., 1.1 (--cuda --com-opt.)              3.4     5.38       11        3.8
    Reduction optim., 1.1.2 (--cuda --com-opt.)        6.8     19.7     46.9        6.4
    Manual optim. (gcc -O3)                           13.6     24.2     54.7

p4a 1.1.2 introduces the generation of CUDA atomic updates for PIPS-detected reductions. Another solution to investigate: CuDPP call generation
Future developments. Outline: 1 Par4All (OpenMP code generation; GPU code generation); 2 Coding rules; 3 Wild Cruncher; 4 Results; 5 Future developments; 6 Conclusion; 7 Table des matières
Future developments: Eclipse interfacing (I). Motivation: not easy to understand why some code is not parallel or inefficient. Need to present the internal representation to the user. Need interaction between tools and user. Useful also to debug the tools themselves. Reuse an existing IDE
Future developments: Eclipse interfacing (II). Integration into Eclipse: invoke the various compilation steps; need to attach information to source code; source code with hyperlinks and tooltips...; graph representations (data dependence, control flow...); hiding/merging information. Some ideas: rely on the CDT and PTP modes; improve the existing OpenCL mode; have a simple tool-interaction framework defined by an Eclipse expert; simple textual communication (YAML...); no need to develop inside Eclipse for tool providers. Still to be defined... WIP in OpenGPU
Future developments: Eclipse interfacing (III). In PIPS: rely on attachments to have IR information flowing through the prettyprinter (back to Emacs PIPS...)
Future developments Towards multi-gpu with OpenCL & *PU Side effect of MediaGPU project... LaBRI is working on *PU to distribute kernel code on any node, on any accelerator (GPU, SPU, CPU) *PU Recent developments: virtual OpenCL device that rely on *PU for execution Load balancing on CPU & GPU of kernels Only work on asynchronous and independent kernels Need new phases to detect independent kernels Need new phases to split loop nests into independent multi-gpu kernels that cannot shared memory... Some hard work to be efficient WIP in MediaGPU École d été francophone de traitement d images sur GPU 29/06/2011 Ronan KERYELL et al. 78 / 93
Future developments: Stub abstractions (I). Real code = source code + library code, intrinsic functions, system libraries... A parallelizer «understands» (at some level...) source code. How to parallelize source code mixed with library code, intrinsic functions, system libraries... whose source code is not available? Example: a programmer uses a special I/O function that is missed by the compiler → no results really used → dead-code elimination. Use stubs and stub recipes: provide stub functions to satisfy the analysis without having to describe them in the compiler. For PIPS: equivalent memory effects for interprocedural abstract interpretation; equivalent I/O effects; set variable values in a polyhedral-friendly way when available
Future developments: Stub abstractions (II). Provide some sequential implementation that can be well parallelized. Provide some parallelized implementation for GPU. Give information to the Par4All Accel runtime or a static estimator to know whether it is worth moving data to the GPU, or executing with OpenMP on the CPU if the data is not already on the GPU (reminiscent of HPF (re)distribution...). Provide Makefiles/glue/... to compile everything... Function use is only discovered during compilation: need to cope with dynamic linking... On-going generalization through stub concept abstraction + a stub broker
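A tiny hedged illustration of the stub idea (hypothetical library function, not from the Par4All distribution): give the analyzer a body with the same memory and I/O effects as the real, unavailable implementation:

    #include <stdio.h>

    /* Library function whose real source is unavailable to the compiler:
       void fancy_io_write(const float *buf, int n);
       Stub handed to the parallelizer: same effects, analyzable by PIPS
       (reads buf[0..n-1], performs I/O), so calls to it are neither
       misparallelized nor eliminated as dead code. */
    void fancy_io_write(const float *buf, int n) {
      for (int i = 0; i < n; i++)
        fprintf(stderr, "%f\n", buf[i]);   /* models the read + I/O effects */
    }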
Future developments: SCMP computer. Future revolutions: software radio, cognitive radio, passive radar, compressed sensing... An embedded accelerator developed at the French CEA: task-graph oriented parallel multiprocessor; hardware task-graph scheduler; synchronizations; communication through memory page sharing. Generating code from a THALES (TCF) GSM sensing application in the SCALOPES European project. Reuse the output of the PIPS GPU phases + specific phases: SCMP code with tasks, SCMP task descriptor files, an adapted Par4All Accel run-time
Future developments: SCMP tasks. In the general case, different tasks can produce data in an unpredictable way: use helper data-server tasks to deal with coherency when there are several producers
Future developments: SCMP task code (before/after).

Before:

    int main() {
      int i, t, a[20], b[20];
      for (t = 0; t < 100; t++) {
      kernel_tasks_1:
        for (i = 0; i < 10; i++)
          a[i] = i + t;
      kernel_tasks_2:
        for (i = 10; i < 20; i++)
          a[i] = 2*i + t;
      kernel_tasks_3:
        for (i = 10; i < 20; i++)
          printf("a[%d] = %d\n", i, a[i]);
      }
      return 0;
    }

After (excerpt):

    int main() {
      P4A_scmp_reset();
      int i, t, a[20], b[20];
      for (t = 0; t <= 99; t += 1) {
        [...]
        {
          //PIPS generated variable
          int (*P4A_a_1)[10] = (int (*)[10]) 0;
          P4A_scmp_malloc((void **) &P4A_a_1, sizeof(int)*10,
                          P4A_a_1_id, P4A_a_1_prod_p, P4A_a_1_cons_p, ...);
          if (scmp_task_2_p)
            for (i = 10; i <= 19; i += 1)
              (*P4A_a_1)[i-10] = 2*i + t;
          P4A_copy_from_accel_1d(sizeof(int), 20, 10,
                                 P4A_sesam_server_a_p ? &a[0] : NULL, *P4A_a_1,
                                 P4A_a_1_id, P4A_a_1_prod_p, ...);
          P4A_scmp_dealloc(P4A_a_1, P4A_a_1_id,
                           P4A_a_1_prod_p, P4A_a_1_cons_p, ...);
        }
        [...]
      }
      return ev_t004;
    }
Future developments: Performance of GSM sensing on SCMP. Speed-up on a 4-PE SCMP: 2.35 with manual parallelization by the SCMP team; 1.86 with automatic Par4All parallelization. Still a big memory overhead: to be optimized... Use these techniques to automate cloud-ification. Take advantage of source-to-source
Future developments: Saint Cloud gatekeeper & massive virtual I/O
Future developments: Future challenges for the real world (I). Make a compiler with features that compose: able to generate heterogeneous code for heterogeneous machines, with all of these together: MPI code generation between nodes; OpenMP parallel code for the SMP processors inside a node; multi-GPU, with each SMP thread controlling a GPU; work distribution (à la *PU?) between GPU and OpenMP; CUDA/OpenCL GPU or other accelerator code; SIMD vector code in OpenMP; SIMD vector code in GPU code; KPN-style task code for special accelerators (SCMP...) [SCALOPES, SMECY] or cloud computing. Source-to-source transformation is a key technology. Rely a lot on the Par4All Accel run-time. Define good minimal abstractions. Simplify the compiler infrastructure
Future developments: Future challenges for the real world (II). Improve target portability: find a good ratio between specific architecture features and global efficiency. The future is static compilation + run-time optimizations... PyPS as a generic source-to-source compiler: started as a Python abstraction of the PIPS pass manager, it evolved into a generic source-to-source compiler infrastructure. Parallel evolution of Par4All & PyPS: refactoring of Par4All back into future PyPS features. Python in PyPS (through multiple inheritance, mix-ins, dynamicity...) helps a lot. Common refactoring of PyPS+Par4All soon. Light-weight (YAML...) interaction with Eclipse [OpenGPU]. Combine different source-to-source compilers at the phase level. Use source code + #pragma/decorations + intrinsic functions [SMECY, SIMILAN]. Useful to have non-compiler experts writing specialized compilers
Future developments: Future challenges for the real world (III). Software engineering issue: how to have compiler parts combine seamlessly?
Conclusion. Outline: 1 Par4All (OpenMP code generation; GPU code generation); 2 Coding rules; 3 Wild Cruncher; 4 Results; 5 Future developments; 6 Conclusion; 7 Table des matières
Conclusion (I). GPUs (and other heterogeneous accelerators): impressive peak performance and memory bandwidth, power efficient. The domain is maturing: many languages, libraries, applications, tools... Just choose the right one. Real codes are often not written to be parallelized well... even by a human being. At least writing clean C99/Fortran/Scilab... code should be a prerequisite. Take a positive attitude... Parallelization is a good opportunity for deep cleaning (refactoring, modernization...): it also improves the original code. Open standards, to avoid sticking to some architectures. Need software tools and environments that will last through business plans or companies
Conclusion (II). Open implementations are a warranty of long-time support for a technology (cf. the current tendency in military and national-security projects). p4a motto: keep things simple. Open Source for the community network effect. An easy way to begin with parallel programming. Source-to-source: gives some programming examples; a good start that can be reworked upon. Entry cost... exit cost! Do not lose control of your code and your data! Subliminal message: we're hiring...
Conclusion: How to translate manycores in French? A side effect of RenPar/SympA/CFSE 2008 in Fribourg... Visit of the castle of Gruyère, with many old French texts with moult usages of moult, and even moultement. Moult jokes with Lionel LACASSAGNE and Joël FALCOU. Lionel LACASSAGNE (IEF) recently coined «moulticœur» for manycore. Wikipedia article still to be written...
Table des matières. Outline: 1 Par4All (OpenMP code generation; GPU code generation); 2 Coding rules; 3 Wild Cruncher; 4 Results; 5 Future developments; 6 Conclusion; 7 Table des matières
Table des matières

Introduction: The Distributed Computer · SOLOMON · POMP & PompC @ LI/ENS 1987-1992 · HyperParallel Technologies (1992-1998) · Present motivations · ARM yourself · MPPA by Kalray (the French touch!) · Off-the-shelf AMD/ATI Radeon HD 6970 GPU · Radeon HD 6870 big picture · IBM Power 7 (2010) · Time to be back in parallelism! · HPC Project hardware: WildNode from Wild Systems · HPC Project software and services

1 Par4All: Use the Source, Luke... · We need software tools!!! · Not reinventing the wheel... · PIPS · Current PIPS usage · Par4All usage · Par4All PyPS scripting in the backstage · OpenMP code generation (Parallelization to OpenMP · OpenMP output sample) · GPU code generation (Basic GPU execution model · Automatic parallelization · Outlining · From array regions to GPU memory allocation · Communication generation · Loop normalization · From preconditions to iteration clamping · Complexity analysis · Communication optimization · Fortran to C-based GPU languages · Par4All Accel runtime · Working around CUDA & OpenCL limitations)

2 Coding rules: Coding rules · Bad/good C programming example · Take advantage of C99

3 Wild Cruncher: Scilab language · Scilab & Matlab · Wild Cruncher

4 Results: Hyantes · Stars-PM · Stars-PM time step · Stars-PM & Jacobi results with p4a 1.1.2

5 Future developments: Eclipse interfacing · Towards multi-GPU with OpenCL & *PU · Stub abstractions · SCMP computer · SCMP tasks · SCMP task code (before/after) · Performance of GSM sensing on SCMP · Saint Cloud gatekeeper & massive virtual I/O · Future challenges for the real world

6 Conclusion: Conclusion · How to translate manycores in French?

7 Table des matières: Vous êtes ici! (You are here!)