Distributed Algorithms on Grids and Clusters of Clusters
7 December 2007
Serge G. Petiton
Goals and hypothesis
Questions:
- Is parallel matrix computing adaptable to large-scale P2P platforms? Or, is it possible to define a new programming paradigm for scientific matrix-based computing on large-scale P2P platforms, other than the task-farming one?
- What algorithms for classical linear algebra problems? Is the class of well-adapted methods large enough?
- What language and framework for the end-users?
- What evaluation and economic model?
Large-scale P2P computing, hypothesis
Each peer can:
- compute
- send and receive data
- participate in the system coordination
Several thousand peers targeted:
- between 250 Mflops and 4 Gflops per peer
- different memory sizes, between 128 MBytes and 1 GByte
- different software on each peer (BLAS and LAPACK available)
Fault-tolerant system (including MPI communications); indeterminism
MPICH-V as a fault-tolerant MPI, with multicasts
Platforms
Two configurations for XtremWeb:
- A LAN-based one with 128 PCs, largely non-dedicated (Polytech'Lille, LIFL, INRIA).
- A WAN-based one with 128 PCs + 133 PCs distributed over two geographic sites (Polytech'Lille/LIFL and Paris XI University/INRIA), also non-dedicated.
One XtremWeb-CH testbed of 128 PCs is also deployed.
Platforms
Polytech'Lille/LIFL/INRIA platform (128 PCs): 28 PIII, 450 MHz; 81 Intel Celerons, 600 MHz to 2.4 GHz; 15 AMD Durons, 750 MHz; 4 PIV, 2.4 GHz. Memory sizes from 128 MBytes to 512 MBytes.
Paris XI Univ./LRI/INRIA (133 PCs): 101 AMD Athlons and K7s (7 bi-proc.), 600 MHz to 2.8 GHz; 10 PIV, 2 GHz; 12 PIII, 500 MHz; 9 PII, 400 MHz. Memory sizes from 128 MBytes to 1 GByte.
Ethernets with heterogeneous speeds, from 10 Mbits/s to 100 Mbits/s.
Not dedicated to our experimentations: students and seti@home also use these computers.
Algorithms, Evaluations and/or Experimentations
- BLAS computations (for iterative methods)
- Eigenvalues and eigenvectors
Algorithms, Evaluations and/or Experimentations
- BLAS computations (for iterative methods)
- Eigenvalues and eigenvectors
- Block Gauss-Jordan
Distributed Block Dense Matrix-Vector Multiplication with XtremWeb
The matrix-vector product is a simple but fundamental operation in matrix computing: it forms the core of iterative methods, for example, and it is often one of the primary operations evaluated to benchmark supercomputers. The two traditional ways to distribute matrices in block matrix algebra generate two different algorithms:
- Version 1: A is partitioned into square blocks, and b into matching vector blocks.
- Version 2: A is partitioned into full-width horizontal slices.
[Figure: the product x = A · b under the two block decompositions]
The matrix A remains unchanged, which enables us to pre-deploy data and use persistent data storage (for iterative methods, for example).
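As an illustration, the two decompositions can be written as block computations. This is our own NumPy sketch, not the XtremWeb code; the function and variable names are ours:

```python
import numpy as np

def matvec_version1(A_blocks, x_blocks):
    """Version 1: A_blocks[i][j] is the (i,j) square block, x_blocks[j] the
    j-th vector block; each block product is one independent task, and the
    partial results of a block row are summed."""
    p = len(A_blocks)
    return [sum(A_blocks[i][j] @ x_blocks[j] for j in range(p)) for i in range(p)]

def matvec_version2(A_slices, x):
    """Version 2: A_slices[i] is a full-width horizontal slice; one task per slice."""
    return [S @ x for S in A_slices]

# Tiny check that both decompositions match the dense product.
n, p, b = 6, 3, 2
A = np.arange(n * n, dtype=float).reshape(n, n)
x = np.ones(n)
A_blocks = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(p)] for i in range(p)]
x_blocks = [x[j*b:(j+1)*b] for j in range(p)]
y1 = np.concatenate(matvec_version1(A_blocks, x_blocks))
y2 = np.concatenate(matvec_version2([A[i*b:(i+1)*b, :] for i in range(p)], x))
assert np.allclose(y1, A @ x) and np.allclose(y2, A @ x)
```

Version 1 generates p² small tasks plus p sums; version 2 generates p larger tasks in a single step, which matches the task counts reported in the results.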
Results (1)
The computing times of different runs vary greatly; we present the average computing times (in seconds) of each execution. Version 1, with or without persistent storage, in the LAN and WAN configurations. For a block size of 1500, we introduce 420 tasks into the XtremWeb network at each execution (more tasks than available workers); 110 tasks for a block size of 3000; 72 tasks for a block size of 3750.
[Chart: average times, with and without persistent storage, for block sizes 1500, 3000 and 3750 in LAN, and 3000 and 3750 in WAN]
Results (2)
Version 2: only one step. At each execution, we introduce 100 tasks into the XtremWeb network (block size 300-by-30000).
[Chart: average times in the LAN and WAN configurations]
These results show the lack of an efficient scheduler.
P2P out-of-core programming (1)
The out-of-core technique solves large problems on standard computers. We use it to increase the size of the problem each peer can handle. Classical out-of-core matrix-vector product. Average computing times (in seconds): for the out-of-core version (size 30000), we introduce between 3 and 30 tasks into the XtremWeb network.
[Chart: matrix size 30000; laptop out-of-core (P4, 2.4 GHz, 512 MB); XtremWeb in LAN (version 2, 100 tasks); XtremWeb out-of-core in LAN with persistent storage (3 tasks and 30 tasks)]
P2P out-of-core programming (2)
The matrix-vector product is scheduled out-of-core in a very simple and efficient way. Average computing times (in seconds): for the out-of-core version (size 30000), we introduce between 3 and 30 tasks into the XtremWeb network.
[Chart: matrix sizes 30000 and 300000; laptop out-of-core (P4, 2.4 GHz, 512 MB); XtremWeb in LAN (version 2, 100 tasks); XtremWeb out-of-core in LAN with persistent storage (3 and 30 tasks); XtremWeb out-of-core with persistent storage (size 300000)]
P2P with out-of-core programming
For a matrix size of 30000, we obtain a speed-up of up to 20 compared to the 2.4 GHz laptop execution, using the out-of-core product. The vector and all matrix blocks are pre-deployed. We then increase the matrix size to 300000 (100 times the initial matrix, >700 GB). We introduce 30 tasks into the LAN XtremWeb network (each task with blocks of size 1000-by-300000): the average computing time is 575 seconds!
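The out-of-core product itself can be sketched in a few lines. This is a toy illustration with names of our choosing, using NumPy files on disk to stand in for the pre-deployed matrix blocks, so that only one block is resident in memory at a time:

```python
import numpy as np
import os
import tempfile

def write_blocks(A, block_rows, directory):
    """Pre-deploy A on disk as horizontal row-block files; return their paths."""
    paths = []
    for k, start in enumerate(range(0, A.shape[0], block_rows)):
        path = os.path.join(directory, f"block_{k}.npy")
        np.save(path, A[start:start + block_rows, :])
        paths.append(path)
    return paths

def out_of_core_matvec(paths, x):
    """Load one block at a time, multiply it by x, and concatenate the results."""
    return np.concatenate([np.load(p) @ x for p in paths])

with tempfile.TemporaryDirectory() as d:
    A = np.random.default_rng(0).standard_normal((8, 8))
    x = np.ones(8)
    paths = write_blocks(A, 2, d)
    assert np.allclose(out_of_core_matvec(paths, x), A @ x)
```

Because the blocks persist on disk between iterations, the same pre-deployed data can be reused by an iterative method without re-sending the matrix.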
Conclusion
- The central management of communications is a strong bottleneck.
- We need a smart scheduler and efficient persistent P2P data storage.
- Out-of-core computing on each peer as a programming paradigm.
- It is possible to obtain numerical results that were not computable on a single computer.
- New end-users.
Block Gauss-Jordan
[Figure: A partitioned into p-by-p blocks of order n; B = A⁻¹, so that A B = B A = I]
Matrix size N = p · n
To invert a matrix: 2N³ operations
Challenge: N = 10⁶
[Figure: the p-by-p block grid at one step; each block is labeled with its task type 1, 2 or 3]
1. Element Gauss-Jordan (LAPACK): cx = 2n³ + O(n²)
2. A = ±A·B (BLAS3): cx = 2n³; or A = B
3. A = A - B·C (BLAS3): cx = 2n³
Each block holds n² 64-bit floating point numbers.
[Figure: block grid with task labels]
Each computing task: 1 up to 3 blocks maximum, so n² < (memory size of one peer) / 3
Up to (p-1)² peers
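The block-size bound follows from one line of arithmetic: a task holding three n-by-n blocks of 64-bit numbers needs 3 · 8 · n² bytes. A minimal sketch (the function name is ours):

```python
def max_block_size(mem_bytes, blocks=3, word_bytes=8):
    """Largest n such that `blocks` n-by-n blocks of `word_bytes`-byte words fit."""
    return int((mem_bytes / (blocks * word_bytes)) ** 0.5)

# For the smallest peers (128 MBytes), the block order n is bounded by about 2364.
n_max = max_block_size(128 * 2**20)
assert n_max == 2364
```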
[Figure: block grid with task labels]
- Computation of «new» blocks on the peer which minimizes communications
- «Update» of a block at step k on the peer which updated that block at step k-1
- Data sent to the dedicated peer ASAP
One step of the block Gauss-Jordan method (p = 4):
- 1 element Gauss-Jordan
- 2(p-1) products A = A·B (BLAS3)
- (p-1)² triads A = A - B·C (BLAS3)
Each block: n² double-precision floating point numbers = 8n² bytes
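A full inversion built from these three kernels can be sketched in NumPy. This is our illustration of the block Gauss-Jordan scheme on the augmented matrix [A | I], not the deployed XtremWeb code:

```python
import numpy as np

def block_gauss_jordan_inverse(A, p):
    """Invert A via block Gauss-Jordan elimination on the augmented matrix [A | I]."""
    N = A.shape[0]
    n = N // p                                   # block order
    M = np.hstack([A.astype(float), np.eye(N)])  # a p-by-2p grid of n-by-n blocks
    for k in range(p):
        rows = slice(k * n, (k + 1) * n)
        # 1: element Gauss-Jordan task — invert the pivot block.
        pivot_inv = np.linalg.inv(M[rows, k * n:(k + 1) * n])
        # 2: A = A·B products — scale the pivot block row.
        M[rows, :] = pivot_inv @ M[rows, :]
        # 3: triads A = A - B·C — eliminate the other block rows.
        for i in range(p):
            if i != k:
                irows = slice(i * n, (i + 1) * n)
                M[irows, :] -= M[irows, k * n:(k + 1) * n] @ M[rows, :]
    return M[:, N:]                              # the right half now holds A^{-1}

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8)) + 8.0 * np.eye(8)   # well-conditioned test matrix
Ainv = block_gauss_jordan_inverse(A, p=4)
assert np.allclose(Ainv @ A, np.eye(8), atol=1e-8)
```

In the distributed version, each pivot inversion, each product of the pivot row, and each of the (p-1)² triads of a step becomes a task mapped onto a peer.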
Pinning and volatility
[Figure: task T1 produces result R1 on peer 1; task T2 then needs R1 on peer 1 (timeline)]
What if peer 1 withdraws from the set of peers? The task must be duplicated to guarantee the pinning:
[Figure: T1's result is replicated (R1 = D2, D3), so that T2 can run on peer 2]
Anticipation and volatility
[Figure: task T1 on peer 1 produces R1, which is replicated (R1 = D2, D3) towards peers 2 and 3 for the anticipated task T2]
The pinning problem is handled by MPICH-V.
[Figure: block grid mixing task labels from two consecutive steps]
Nevertheless, peers are not stable, so blocks from several steps of the method may be computed in parallel. We have to use an inter- and intra-step dependency graph.
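A toy sketch of such a dependency graph and of a "run as soon as ready" schedule; the task names and the particular dependency pattern below are illustrative only:

```python
# Each task is keyed by (kind, step[, block]); its value is the set of tasks
# it depends on. A block update at step k depends on the pivot work of step k
# and on its own value from step k-1, so work from consecutive steps overlaps.

def ready_tasks(deps, done):
    """Tasks not yet done whose dependencies are all satisfied."""
    return {t for t, ins in deps.items() if t not in done and ins <= done}

deps = {
    ("inv", 0): set(),
    ("upd", 0, "b1"): {("inv", 0)},
    ("upd", 0, "b2"): {("inv", 0), ("upd", 0, "b1")},
    ("inv", 1): {("upd", 0, "b1")},
}
done, schedule = set(), []
while len(done) < len(deps):
    batch = ready_tasks(deps, done)   # all tasks in a batch may run in parallel
    schedule.append(sorted(batch))
    done |= batch

# The last batch mixes a step-0 update with the step-1 pivot inversion,
# which is exactly the inter-step overlap the graph makes possible.
```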
<?xml version="1.0" encoding="iso-8859-1"?>
<component type="graph">
  <name>gaussjordan</name>
  <description>
    This component inverts a block matrix using the Gauss-Jordan algorithm.
  </description>
  <parameters>
    <parameter name="minput" type="bloc_matrix_real" access="in"/>
    <parameter name="moutput" type="bloc_matrix_real" access="out"/>
    <parameter name="bloc_count" type="integer" access="in"/>
    <parameter name="size" type="integer" access="in"/>
  </parameters>
  <graph>
    /* Declarations */
    const twop := 2 * bloc_count, pp1 := bloc_count + 1,
          pm1 := bloc_count - 1, m := size / bloc_count;
    domain a[1..bloc_count, 1..twop] of MATRIX_MATRIX_DOUBLE;
    bind a[1..bloc_count, 1..bloc_count] to Minput;
    bind a[1..bloc_count, pp1..twop] to Moutput;
    event cal[1..bloc_count, 1..bloc_count, 1..twop];
    /* Computation */
    compute invmatcp(a[1,1], a[1,pp1], m);
    par
      par (j := 2, bloc_count) do
        compute diad3(a[1,j], a[1,1], m);
        signal(cal[1,1,j], cal[1,1,j+bloc_count]);
      end par do //
      par (i := 2, bloc_count) (j := 2, bloc_count) do
        wait(cal[1,1,j]);
        if (i eq 2 and j eq 2) then
          compute triad3inv(a[2,2], a[2,2+bloc_count], a[2,1], a[1,2], m);
          signal(cal[2,2,2], cal[2,2,2+bloc_count]);
        else
          compute triad3(a[i,j], a[i,1], a[1,j], m);
          signal(cal[1,i,j]);
        endif
      end par do //
      par (i := 2, bloc_count) do
        compute mdiad3(a[i,1+bloc_count], a[1,1], m);
        signal(cal[1,i,pp1]);
      end par do //
Asynchronous hybrid methods
Multiple Explicitly Restarted Arnoldi Method (MERAM)
The Arnoldi method has 3 phases per iteration:
- projection onto a subspace of size m << n
- computation in this subspace (low complexity)
- restart: new iteration
Project onto several different subspaces of distinct sizes: the additional information accelerates convergence towards the sought eigenvalues.
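A minimal sequential sketch of the underlying explicitly restarted Arnoldi process. MERAM runs several of these with distinct subspace sizes m and exchanges restart vectors asynchronously; this toy version, with names of our choosing, computes one dominant eigenvalue:

```python
import numpy as np

def arnoldi(A, v, m):
    """m-step Arnoldi factorization: orthonormal basis V and Hessenberg H."""
    n = A.shape[0]
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):               # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        V[:, j + 1] = w / H[j + 1, j]
    return V, H

def eram(A, m, restarts=15):
    """Explicitly restarted Arnoldi: project, solve the small problem, restart."""
    v = np.ones(A.shape[0])
    ritz_value = 0.0
    for _ in range(restarts):
        V, H = arnoldi(A, v, m)
        evals, evecs = np.linalg.eig(H[:m, :m])   # low-complexity small problem
        k = int(np.argmax(np.abs(evals)))
        ritz_value = evals[k].real
        v = V[:, :m] @ evecs[:, k].real           # restart from the Ritz vector
    return ritz_value

A = np.diag([10.0, 5.0, 2.0, 1.0]) + 0.01         # small symmetric test matrix
lam = eram(A, m=2)
assert abs(lam - np.linalg.eigvals(A).real.max()) < 1e-6
```

MERAM's gain comes from restarting each co-method with the best Ritz information produced by any of them, in parallel with the communications; this sequential sketch does not capture that asynchrony.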
MERAM: 1993, SEH, ETCA, with N. Emad and G. Edjlali
- Completely asynchronous and coarse-grained
- Several restart strategies selected
- Communications in parallel with the computations
- Smooths convergence curves, which often exhibit plateaus
- Experiments with two parallel machines and several workstations
MERAM: 1993, SEH, ETCA
- asynchronous tasks
- communications in parallel with the computations
- possibility of other offloaded computations
MERAM: 1993, SEH, ETCA, numerical results (1)
Experiments with a CM200, a CM5 and SUN workstations
MERAM: 1993, SEH, ETCA, numerical results (2)
Experiments with a CM200, a CM5 and SUN workstations
GMRES/LS-Arnoldi: 1995, LIFL, USTL, with Guy Bergère and Azedine Essai
- Computation of the dominant eigenvalues to optimize the convergence of GMRES
- Least-squares computation of the optimal parameters: degree of the polynomial, frequency of the hybridization, ...
- Experiments with two parallel machines and several workstations
GMRES/LS-Arnoldi, LIFL, USTL
GMRES/LS-Arnoldi: 1996, LIFL, USTL, numerical results
Experiments with 2 parallel machines and SUN workstations