Calcul Québec - Université Laval. CUDA/GPU Workshop

Transcript

1 1 CUDA/GPU Workshop. Maxime Boissonneault, Université Laval, October 2014. Adapted from the "CUDA/GPU Workshop" by Dan Mazur <[email protected]>, Université McGill.

2 2 Prerequisites. Basic knowledge of a Linux system: remote connection with SSH; a text editor (Vim, Emacs, nano). Good knowledge of the C language: syntax; the GCC compiler. Familiarity with threads of execution.

3 3 Windows/Linux workstations. To log in to the workstations: username: formation; password: «Automne14».

4 4 Outline. GPGPU overview; terminology and architecture; connecting and preparing the environment; introduction to CUDA-C; examples; global memory; blocks and threads; common syntax; error handling; shared memory and synchronization; unified memory (CUDA 6); multi-GPU.

5 5 GPGPU overview

6 6 What is a GPU? A device used to speed up time-consuming computations (an accelerator). A large number of cores that are inexpensive in both energy and $$. Impressive theoretical compute speed: several teraflops. Hybrid programming model: CPU and GPU working at the same time.

7 7 History. 1980s: first graphics controllers (Intel, TI, IBM). 1990s: first 2D and 3D GPUs (gaming). Introduction of CUDA. Revisions of the compute capabilities... OpenCL. OpenACC. K20, Titan: compute capability 3.5.

8 8 Well-known applications: ABAQUS, Amber, GROMACS, LAMMPS, MATLAB, PETSc, OpenFOAM (some solvers).

9 9 Libraries: cuBLAS, MAGMA, cuFFT, Thrust.

10 10 Programming. CUDA-{C,Fortran} (Compute Unified Device Architecture); OpenCL; OpenACC.

12 11 Reminder: C and pointers
    int x = 1;
    int *p;
    p = &x;
    *p = 0;
    printf("x=%d\n", x);
What will be displayed on screen? A) x=1  B) x=0  C) x=0x7ffff957  D) Error or SEGFAULT
Answer: B. The value 0 is assigned to the memory that p points to, which is x.

13 12 Terminology and architecture

14 13 GPU

15 SM (GK110) functional units. fp32 cores: 32-bit floating-point arithmetic and (simple) 32-bit integer arithmetic. 64 fp64 cores. 32 SFU: integer multiplications, sin, cos, etc. LD/ST: loading from and storing to memory. 16 TEX: textures.

16 15 Connecting and preparing the environment
    $ ssh [email protected]
    password:
    [~]$ prepare_formation cuda
    [~]$ prepare_formation job
    [user41@gpu-k20-02 ~]$ cd ~/formation/cuda/formation
Training nodes provided by:

17 16 Structure of the exercises
    formation]$ ls -lh
    total 36K
    drwxr-xr-x 2 mboisson clumeq 4,0K  7 oct 10:58 1-compilation
    drwxr-xr-x 3 mboisson clumeq 4,0K  9 oct 14:59 3-remplir-vecteur
    drwxr-xr-x 3 mboisson clumeq 4,0K  9 oct 15:10 4-multiplications-matrices
    drwxr-xr-x 3 mboisson clumeq 4,0K  7 oct 10:58 5-erreurs
    drwxr-xr-x 3 mboisson clumeq 4,0K  8 oct 16:05 7-produit-scalaire
    drwxr-xr-x 3 mboisson clumeq 4,0K  8 oct 16:04 8-multiplications-matricecuda6
    drwxr-xr-x 3 mboisson clumeq 4,0K 20 oct 10:08 9-multiples-gpus
    -rw-r--r-- 1 mboisson clumeq oct 15:30 makefile-levels.mk
    -rw-r--r-- 1 mboisson clumeq oct 10:59 makefile.mk

18 17 Minimal code (trivial.cu)
    #include <stdio.h>

    __global__ void foo() {}

    int main() {
        foo<<<1,1>>>();
        printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
        return 0;
    }

19 18 Compilation
    $ nvcc <file.cu> -o <exe>
    $ ./<exe>

21 19 Exercise 1: compilation. File: trivial.cu. Description: compile and run a CUDA-C program. Hint: use "nvcc" as the compiler.

23 20 Exercise 1: solution
    $ nvcc trivial.cu -o trivial
    $ ./trivial
    CUDA error: no error
    $
nvcc is the "compiler" to use.

25 21 Exercise 1: solution (alternative)
    $ nvcc -arch=sm_12 trivial.cu -o trivial
    $ ./trivial
    CUDA error: no error
    $
The -arch option matters for getting access to basic atomic operations.

26 22 Exercise 1: for the remaining exercises
    $ make
    nvcc -arch=sm_12 -c trivial.cu -o trivial.o
    nvcc trivial.o -o trivial
    $ ./trivial
    CUDA error: no error
    $

30 23 Minimal code
    __global__ void foo() {}    // the kernel; __global__ gives the scope of the function (kernel)

    int main() {
        foo<<<1,1>>>();         // <<<1,1>>>: execution parameters of the kernel
        printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
        return 0;
    }

31 24 CUDA samples. $CUDA_HOME/samples contains: sample programs (utilities, imaging, finance, simulations, libraries); documentation; management tools (nvidia-smi).

32 25 Exercise 2: CUDA samples. Directory: $CUDA_HOME/samples/1_Utilities/deviceQuery/. Description: copy, compile and run the deviceQuery program.

42 26 deviceQuery output
    Detected 1 CUDA Capable device(s)                                     <- number of cards
    Device 0: "GeForce GT 330M"                                           <- card number
      CUDA Driver Version / Runtime Version          5.5 / 5.5            <- CUDA version
      CUDA Capability Major/Minor version number:    1.2                  <- "capability"
      Total amount of global memory:                 256 MBytes ( bytes)  <- total memory
      ( 6) Multiprocessors x ( 8) CUDA Cores/MP:     48 CUDA Cores        <- number of cores
      Total amount of shared memory per block:       bytes                <- shared memory
      Maximum number of threads per block:           512                  <- max. number of threads per block
      Maximum sizes of each dimension of a block:    512 x 512 x 64       <- maximum dimensions
      Maximum sizes of each dimension of a grid:     x x 1
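
The same fields can also be read from a program through the CUDA runtime. The sketch below is not the deviceQuery source; it is a minimal illustration that assumes a working CUDA toolkit, and the helper name count_devices is hypothetical.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical helper: number of CUDA devices, or -1 if the runtime reports an error. */
int count_devices(void) {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess) return -1;
    return n;
}

int main(void) {
    int n = count_devices();
    printf("Detected %d CUDA capable device(s)\n", n);
    for (int dev = 0; dev < n; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        printf("Device %d: \"%s\"\n", dev, prop.name);
        printf("  Compute capability:       %d.%d\n", prop.major, prop.minor);
        printf("  Global memory:            %zu bytes\n", prop.totalGlobalMem);
        printf("  Multiprocessors:          %d\n", prop.multiProcessorCount);
        printf("  Shared memory per block:  %zu bytes\n", prop.sharedMemPerBlock);
        printf("  Max threads per block:    %d\n", prop.maxThreadsPerBlock);
        printf("  Max block dimensions:     %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("  Max grid dimensions:      %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    }
    return 0;
}
```

The limits printed here (threads per block, grid dimensions) are the ones that constrain the launch parameters used in the later exercises.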

46 27 CUDA execution flow. Serial code: allocate memory on the GPU; copy data to the GPU; launch the kernel. Parallel code: execution of the kernel. Serial code: copy the results back to the host; free the GPU memory.

47 28 Memory management

51 29 Memory management
    cudaMalloc(void **devPtr, size_t size);    equivalent of malloc in C (new in C++)
    cudaFree(void *devPtr);                    equivalent of free in C (delete in C++)
    cudaMemcpy(void *dst, const void *src, size_t size, enum cudaMemcpyKind kind);
        kind: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost
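
Putting the execution flow and these three functions together, a complete round trip might look like the sketch below. The kernel double_elements, the helper run_flow and the size N are hypothetical names chosen for illustration; a single block is used so that threadIdx.x alone identifies the element.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 256  /* hypothetical problem size; small enough to fit in one block */

/* Hypothetical kernel: each thread doubles one element. */
__global__ void double_elements(int *data) {
    data[threadIdx.x] *= 2;
}

/* Runs the whole flow; returns 0 if every element was doubled, 1 otherwise. */
int run_flow(void) {
    int *data_h = (int *)malloc(N * sizeof(int));   /* serial code on the host */
    int *data_d = NULL;
    int failed = 0;
    if (data_h == NULL) return 1;
    for (int i = 0; i < N; ++i) data_h[i] = i;

    /* Allocate memory on the GPU. */
    if (cudaMalloc((void **)&data_d, N * sizeof(int)) != cudaSuccess) {
        free(data_h);
        return 1;
    }
    /* Copy data to the GPU, launch the kernel, copy the results back. */
    cudaMemcpy(data_d, data_h, N * sizeof(int), cudaMemcpyHostToDevice);
    double_elements<<<1, N>>>(data_d);
    cudaMemcpy(data_h, data_d, N * sizeof(int), cudaMemcpyDeviceToHost);
    /* Free the GPU memory. */
    cudaFree(data_d);

    for (int i = 0; i < N; ++i)
        if (data_h[i] != 2 * i) failed = 1;
    free(data_h);
    return failed;
}

int main(void) {
    puts(run_flow() == 0 ? "OK" : "FAILED");
    return 0;
}
```

The blocking cudaMemcpy back to the host also waits for the kernel to finish, which is why no explicit synchronization appears in this sketch.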

52 30 Quiz on pointers and memory management

60 31 Output 1 of 3
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void add2(int *a) {
        *a = *a + 2;
    }

    int main(void) {
        int *data_h, *data_d;
        cudaMalloc((void**)&data_d, sizeof(int));
        data_h = (int *)malloc(sizeof(int));
        *data_h = 5;
        cudaMemcpy(data_d, data_h, sizeof(int), cudaMemcpyHostToDevice);
        add2<<<1,1>>>(data_d);
        cudaMemcpy(data_h, data_d, sizeof(int), cudaMemcpyDeviceToHost);
        printf("data: %d\n", *data_h);
        free(data_h);
        cudaFree(data_d);
        return 0;
    }
A. data: 5  B. data: 7  C. Error or segfault  D. Does not compile  E. Other
Answer: B. data: 7

62 32 Output 2 of 3
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void add2(int *a) {
        *a = *a + 2;
    }

    int main(void) {
        int *data_h, *data_d;
        cudaMalloc((void**)&data_d, sizeof(int));
        data_h = (int *)malloc(sizeof(int));
        *data_h = 5;
        /* cudaMemcpy(data_d, data_h, sizeof(int), cudaMemcpyHostToDevice); */
        add2<<<1,1>>>(data_d);
        cudaMemcpy(data_h, data_d, sizeof(int), cudaMemcpyDeviceToHost);
        printf("data: %d\n", *data_h);
        free(data_h);
        cudaFree(data_d);
        return 0;
    }
A. data: 5  B. data: 7  C. Error or segfault  D. Does not compile  E. Other
Answer: E. Other. With the copy to the device commented out, the device memory is never initialized, so the printed value is undefined.

64 33 Output 3 of 3
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void add2(int *a) {
        *a = *a + 2;
    }

    int main(void) {
        int *data_h, *data_d;
        cudaMalloc((void**)&data_d, sizeof(int));
        data_h = (int *)malloc(sizeof(int));
        *data_h = 5;
        cudaMemcpy(data_d, data_h, sizeof(int), cudaMemcpyHostToDevice);
        add2<<<1,1>>>(data_d);
        cudaMemcpy(data_h, data_d, sizeof(int), cudaMemcpyDeviceToHost);
        printf("data: %d\n", *data_d);
        free(data_h);
        cudaFree(data_d);
        return 0;
    }
A. data: 5  B. data: 7  C. Error or segfault  D. Does not compile  E. Other
Note that this time the printf dereferences data_d, a device pointer, from host code.

65 34 Index calculation

69 35 Minimal code
    __global__ void foo() {}

    int main() {
        foo<<< 1, 1 >>>();
    }
In foo<<< 1, 1 >>>, the first parameter is the size of the grid (number of blocks) and the second is the size of a block (number of threads).

76 36 Grid? Block? Kernel = a grid of blocks of threads (fixed dimensions). Grid = 3D array of uniform blocks. Block = 3D array of threads. 1 block = at most 1024 threads. 1 block runs on 1 SM (streaming multiprocessor) and is made of several "warps". 1 warp = SIMD (single instruction, multiple data).

77 37 Index calculation. The programmer defines the number of blocks in the grid, then the number of threads per block. The kernel code knows its thread ID, its block ID and the size of the block. [Figure: thread IDs laid out in Block 0, Block 1 and Block 2; block size: 6; kernel<<<3, 6>>>()]

81 38 Index calculation. [Figure: thread IDs laid out in Block 0, Block 1 and Block 2; block size: 6; kernel<<<3, 6>>>()] threadIdx.x: number of the thread within the block. blockIdx.x: number of the block. blockDim.x: dimension of a block (equivalent to the number of threads per block).

83 39 Index calculation. Which expression computes a (unique) index for each thread?
    A. idx = threadIdx.x + blockIdx.x
    B. idx = threadIdx.x * blockIdx.x
    C. idx = threadIdx.x * blockDim.x + blockIdx.x
    D. idx = threadIdx.x + blockDim.x * blockIdx.x
Answer: D.
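
To make the index formula concrete, the following sketch launches the 3-block, 6-thread configuration from the figure and has every thread store its own global index. The names write_index and check_indices are hypothetical; the expression itself is option D.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define NUM_BLOCKS 3
#define BLOCK_SIZE 6
#define NUM_THREADS (NUM_BLOCKS * BLOCK_SIZE)

/* Each thread writes its global index at the position given by that index. */
__global__ void write_index(int *out) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    out[idx] = idx;
}

/* Returns the number of positions that do not hold their own index, or -1 on a CUDA error. */
int check_indices(void) {
    int out_h[NUM_THREADS];
    int *out_d = NULL;
    int mismatches = 0;
    if (cudaMalloc((void **)&out_d, sizeof(out_h)) != cudaSuccess) return -1;
    cudaMemset(out_d, 0xFF, sizeof(out_h));      /* every int starts as -1 */
    write_index<<<NUM_BLOCKS, BLOCK_SIZE>>>(out_d);
    cudaError_t err = cudaMemcpy(out_h, out_d, sizeof(out_h), cudaMemcpyDeviceToHost);
    cudaFree(out_d);
    if (err != cudaSuccess) return -1;
    for (int i = 0; i < NUM_THREADS; ++i)
        if (out_h[i] != i) ++mismatches;
    return mismatches;
}

int main(void) {
    printf("mismatches: %d\n", check_indices());
    return 0;
}
```

With options A, B or C several threads would compute the same idx, so some positions would keep their initial value of -1 and the mismatch count would be nonzero.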

85 40 Exercise 3: fill a vector. File: integers.cu. Description: move the code that fills a vector from the CPU to the GPU. Hint: do not forget to copy the results back to the host.

88 41 Exercise 3: fill a vector. Solution 1
    #define BLOCK_SIZE 1
    const int N = 100;
    fillarray<<<N, BLOCK_SIZE>>>( );
What if N is much larger (const uint32_t N = ;)? The number of blocks is bounded by the "Maximum sizes of each dimension of a grid" line reported by ./deviceQuery.

91 42 Exercise 3: fill a vector. Solution 2
    #define BLOCK_SIZE 128
    const int N = 100;
    fillarray<<<N/BLOCK_SIZE, BLOCK_SIZE>>>( );

    $ ./integers
    integers: integers.cu:39: int main(int, char**): Assertion `data[i] == i' failed.
    Aborted (core dumped)
Why? N/BLOCK_SIZE is an integer division: 100/128 is 0, so the grid contains no block and the vector is never filled.

94 43 Exercise 3: fill a vector. Solution 3
    #define BLOCK_SIZE 128

    __global__ void fillarray(int *data, int N) {
        int idx = threadIdx.x + blockIdx.x*blockDim.x;
        data[idx] = idx;
    }

    const int N = 100;
    fillarray<<<(N+BLOCK_SIZE-1)/BLOCK_SIZE, BLOCK_SIZE>>>( );

    $ ./integers
    Correct!
Really? Here idx takes the values {0, ..., 127}, and every thread with idx > 99 overwrites memory that does not belong to us.

95 44 Exercise 3: fill a vector. Solution 4 (valid)
    #define BLOCK_SIZE 128

    __global__ void fillarray(int *data, int N) {
        int idx = threadIdx.x + blockIdx.x*blockDim.x;
        if (idx < N) {
            data[idx] = idx;
        }
    }

    const int N = 100;
    fillarray<<<(N+BLOCK_SIZE-1)/BLOCK_SIZE, BLOCK_SIZE>>>( );
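
For reference, here is one way the fragments of solution 4 could be assembled into a complete program. This is a sketch, not the workshop's integers.cu: the names fill_array, blocks_for and fill_and_check are hypothetical.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 128

/* Guarded fill: threads whose index falls past the end of the vector do nothing. */
__global__ void fill_array(int *data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) data[idx] = idx;
}

/* Number of blocks needed to cover n elements (integer division rounded up). */
int blocks_for(int n) {
    return (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/* Fills a vector of n integers on the GPU; returns 0 if data[i] == i everywhere. */
int fill_and_check(int n) {
    int *data_h = (int *)malloc(n * sizeof(int));
    int *data_d = NULL;
    int failed = 0;
    if (data_h == NULL) return 1;
    if (cudaMalloc((void **)&data_d, n * sizeof(int)) != cudaSuccess) {
        free(data_h);
        return 1;
    }
    fill_array<<<blocks_for(n), BLOCK_SIZE>>>(data_d, n);
    if (cudaMemcpy(data_h, data_d, n * sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess)
        failed = 1;
    cudaFree(data_d);
    for (int i = 0; i < n && !failed; ++i)
        if (data_h[i] != i) failed = 1;
    free(data_h);
    return failed;
}

int main(void) {
    printf("N/BLOCK_SIZE = %d, rounded up = %d\n", 100 / BLOCK_SIZE, blocks_for(100));
    puts(fill_and_check(100) == 0 ? "Correct!" : "Failed");
    return 0;
}
```

For N = 100 the plain division gives 0 blocks while the rounded-up form gives 1, which is the difference between solutions 2 and 3; the guard on idx is what separates solution 3 from solution 4.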

96 45 Exercise 4: matrix multiplication. Files: matrixmul.cu, matrixmul_med.cu, matrixmul_adv.cu. Description: complete the matrix multiplication program using CUDA functions.

97 46 Dividing up the grid
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(MSIZE/BLOCK_SIZE, MSIZE/BLOCK_SIZE);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

98 47 Exercise 4: algorithm
    for (int i = 0; i < MSIZE; ++i) {
        Cvalue += A[row * MSIZE + i] * B[i * MSIZE + col];
    }
    C[row * MSIZE + col] = Cvalue;
[Figure: Thread 0, Block 0, idx = 0, idy = 0, MSIZE = ; Thread 3, Block 1, idx = 1, idy = 1, MSIZE = ]

105 48 Exercise 4: matrix multiplication
    __global__ void MatMulKernel(float* A, float* B, float* C) {
        // Compute the row and column
        int col = threadIdx.x + blockIdx.x * blockDim.x;
        int row = threadIdx.y + blockIdx.y * blockDim.y;

        // Accumulate the dot product of row `row` of A and column `col` of B
        float Cvalue = 0;
        for (int i = 0; i < MSIZE; ++i) {
            Cvalue += A[row * MSIZE + i] * B[i * MSIZE + col];
        }
        C[row*MSIZE+col] = Cvalue;
    }
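
A self-contained version of this kernel, with the host code around it, could look like the sketch below. It is not the workshop's matrixmul.cu: the values of MSIZE and BLOCK_SIZE, the test matrices and the helper run_mat_mul are assumptions made for illustration. The inputs are small integers so that the float sums are exact and can be compared with a CPU reference.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define MSIZE 64        /* assumed matrix size, a multiple of BLOCK_SIZE */
#define BLOCK_SIZE 16   /* assumed block edge: 16 x 16 = 256 threads per block */

__global__ void mat_mul(const float *A, const float *B, float *C) {
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    if (row >= MSIZE || col >= MSIZE) return;
    float Cvalue = 0.0f;
    for (int i = 0; i < MSIZE; ++i)
        Cvalue += A[row * MSIZE + i] * B[i * MSIZE + col];
    C[row * MSIZE + col] = Cvalue;
}

/* Multiplies two test matrices on the GPU; returns 0 if the result matches the CPU. */
int run_mat_mul(void) {
    static float A[MSIZE * MSIZE], B[MSIZE * MSIZE], C[MSIZE * MSIZE];
    const size_t bytes = sizeof(A);
    float *d_A = NULL, *d_B = NULL, *d_C = NULL;
    for (int k = 0; k < MSIZE * MSIZE; ++k) { A[k] = (float)(k % 7); B[k] = (float)(k % 5); }

    if (cudaMalloc((void **)&d_A, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_B, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_C, bytes) != cudaSuccess) {
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);   /* cudaFree(NULL) is a no-op */
        return 1;
    }
    cudaMemcpy(d_A, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(MSIZE / BLOCK_SIZE, MSIZE / BLOCK_SIZE);
    mat_mul<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    cudaError_t err = cudaMemcpy(C, d_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    if (err != cudaSuccess) return 1;

    /* CPU reference check. */
    for (int row = 0; row < MSIZE; ++row)
        for (int col = 0; col < MSIZE; ++col) {
            float ref = 0.0f;
            for (int i = 0; i < MSIZE; ++i)
                ref += A[row * MSIZE + i] * B[i * MSIZE + col];
            if (C[row * MSIZE + col] != ref) return 1;
        }
    return 0;
}

int main(void) {
    puts(run_mat_mul() == 0 ? "OK" : "FAILED");
    return 0;
}
```

The x dimension is mapped to columns and the y dimension to rows, as on the slide, so neighbouring threads in x access neighbouring elements of B and C.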

106 49 Error checking

107 50 Error checking. All CUDA functions return a cudaError_t, which must be equal to cudaSuccess when there is no error. cudaGetLastError() returns the last error (used for kernels). Call cudaDeviceSynchronize() to wait for a kernel to finish.

109 51 Exercise 5: error checking. File: errorcheck.cu. Description: add CUDA error checks to the code, then fix the errors. Hint: at first glance this program compiles and runs without error. Check carefully the return value of every CUDA-C call.

115 52 Exercise 5: solution
    int main(void) {
        int *data_d = 0, *data_h = 0;    // pointers initialized to 0
        cudaError_t err;                 // return type of the CUDA functions

        // cudaSuccess is the value that is always expected
        if ((err = cudaMalloc((void**)&data_d, sizeof(int))) != cudaSuccess) {
            printf("could not allocate that much memory. \n%s", cudaGetErrorString(err));
            exit(1);
        }

        setdata<<<1,1>>>(0);
        cudaDeviceSynchronize();         // important after a kernel call, in order to check for errors
        err = cudaGetLastError();        // previous error
        if (err != cudaSuccess) {
            printf("error calling setdata. \n%s", cudaGetErrorString(err));
            goto cleanup;
        }

    cleanup:
        if ((err = cudaFree(data_d)) != cudaSuccess) {
            printf("could not free memory (free #1) \n%s", cudaGetErrorString(err));
            exit(1);
        }
    }
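
Writing the same test after every call quickly becomes repetitive. A common idiom, shown here as a sketch rather than as the workshop's solution, is to wrap each call in a macro. The names CUDA_CHECK, set_data and run_checked are hypothetical; only the runtime calls inside them are part of CUDA.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Hypothetical helper macro: abort with file and line if a CUDA call fails. */
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

/* Hypothetical kernel: stores 42 at the address it is given. */
__global__ void set_data(int *p) {
    *p = 42;
}

/* Runs the kernel with every call checked; returns the value read back. */
int run_checked(void) {
    int *data_d = NULL;
    int value = 0;
    CUDA_CHECK(cudaMalloc((void **)&data_d, sizeof(int)));
    set_data<<<1, 1>>>(data_d);
    CUDA_CHECK(cudaGetLastError());        /* errors detected at launch */
    CUDA_CHECK(cudaDeviceSynchronize());   /* errors raised while the kernel ran */
    CUDA_CHECK(cudaMemcpy(&value, data_d, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(data_d));
    return value;
}

int main(void) {
    printf("value: %d\n", run_checked());
    return 0;
}
```

The trade-off compared with the goto cleanup pattern above is that the macro exits immediately, so it suits short programs where releasing resources on failure is not a concern.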

117 53 Exercise 6: dot product. File: none. Description: think of a parallel algorithm for the dot product; list the problems encountered. Pseudocode:
    __global__ void dot(float *a, float *b, float *c) {
        c = a0*b0 + a1*b1 + a2*b2 + ... + an*bn;
    }

121 54 Possible solution
    __global__ void dot(float *a, float *b, float *c) {
        if (threadIdx.x + blockIdx.x*blockDim.x == 0) {
            for (int i = 0; i < N; i++) {
                *c += a[i]*b[i];
            }
        }
    }
Problem 1: it does not use any parallelism. Problem 2: how to store the intermediate values. Problem 3: how to perform the sum (reduction).

122 55 Reminder: reduction. A reduction applies an aggregation function (for example, sum) to a set of values in order to produce a single return value.

124 57 Types of memory
    Type            | Bandwidth                      | Visibility                | Notes                                 | Example
    Global memory   | Slow, high latency             | All threads               | cudaMalloc()                          | __device__ float data[N];
    Constant memory | Slow, with cache               | All threads               | Read-only                             | __constant__ float data[N];
    Shared memory   | 150x faster than global memory | Threads of the same block | Limited lifetime. 50 kB/block         | __shared__ float data[N];
    Register        | Fastest, no latency            | Current thread            | Limited (visibility, space, lifetime) | float data[N];
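
As an illustration of the four rows of the table, the sketch below uses each kind of memory once. All names (scale_c, result_d, scale_reverse, run_memory_types) and the size N are hypothetical. The __syncthreads() call is a block-wide barrier, needed here because each thread reads a shared-memory element written by another thread.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 64  /* hypothetical size; one block of N threads */

__constant__ float scale_c;     /* constant memory: read-only in kernels, cached */
__device__ float result_d[N];   /* global memory, declared statically */

__global__ void scale_reverse(const float *in) {
    __shared__ float tile[N];   /* shared memory: visible to the threads of this block */
    int t = threadIdx.x;        /* local scalar, normally held in a register */
    tile[t] = in[t];
    __syncthreads();            /* wait until the whole tile has been written */
    result_d[t] = scale_c * tile[N - 1 - t];
}

/* Reverses and scales 0..N-1 on the GPU; returns 0 if the result is as expected. */
int run_memory_types(void) {
    float in_h[N], out_h[N];
    const float scale = 2.0f;
    float *in_d = NULL;
    for (int i = 0; i < N; ++i) in_h[i] = (float)i;

    if (cudaMalloc((void **)&in_d, sizeof(in_h)) != cudaSuccess) return 1;
    cudaMemcpy(in_d, in_h, sizeof(in_h), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(scale_c, &scale, sizeof(float));

    scale_reverse<<<1, N>>>(in_d);

    cudaError_t err = cudaMemcpyFromSymbol(out_h, result_d, sizeof(out_h));
    cudaFree(in_d);
    if (err != cudaSuccess) return 1;

    for (int i = 0; i < N; ++i)
        if (out_h[i] != 2.0f * (float)(N - 1 - i)) return 1;
    return 0;
}

int main(void) {
    puts(run_memory_types() == 0 ? "OK" : "FAILED");
    return 0;
}
```

Memory obtained with cudaMalloc (in_d here) is global memory as well; the __device__ declaration is simply the static way of obtaining it.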

125 58 Reduction strategy: parallel multiplications, followed by successive reductions (additions).


127 60 Reminder: race condition. Two or more threads compete to update a shared value.
    1.  #define ARRAY_SIZE 128   // assumed value
    2.  __global__ void shiftl(int *array) {
    3.      int idx = threadIdx.x;
    4.      int tmp = array[idx];
    5.      if (idx == 0)
    6.          array[ARRAY_SIZE-1] = tmp;
    7.      else
    8.          array[idx-1] = tmp;
    9.  }

128 61 Reminder: race condition. Expected order for threads 0, 1 and 2 running the code above: index calculation; copy of the value into a register; write of the new value.

137 62 Reminder: race condition. A problematic order for threads 1 and 2 running the same code:
    3. index calculation (threads 1 and 2)
    4. copy of the value into a register (thread 2)
    8. write of the new value (thread 2)
    4. copy of the (wrong) value into a register (thread 1)
    8. write of the (wrong) new value (thread 1)
Thread 2 has already overwritten array[1] by the time thread 1 reads it.

139 63 Solution? Use a synchronization point: __syncthreads();

141 64 A synchronization point, but where?
    1.  #define ARRAY_SIZE 128   // assumed value
    2.  __global__ void shiftl(int *array) {
    3.      int idx = threadIdx.x;
    4.      int tmp = array[idx];
    5.      if (idx == 0)
    6.          array[ARRAY_SIZE-1] = tmp;
    7.      else
    8.          array[idx-1] = tmp;
    9.  }

142 65 Reminder: race condition
1. #define ARRAY_SIZE ...
2. __global__ void shiftl(int *array) {
3.   int idx = threadIdx.x;
4.   int tmp = array[idx];
5.   __syncthreads();
6.   if (idx == 0)
7.     array[ARRAY_SIZE-1] = tmp;
8.   else
9.     array[idx-1] = tmp;
10. }
Interleaving of two threads with the barrier (numbers refer to the code lines above):
  Threads 1 and 2: index computation (line 3)
  Thread 2, line 4: copies the value into a register
  Thread 2, line 5: synchronization point
  Thread 1, line 4: copies the (correct) value into a register
  Thread 1, line 5: synchronization point
  Threads 1 and 2: write the (correct) new value
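
The barrier placement discussed on this slide can be tried end to end with a sketch like the one below. It is only an illustration: the value of `ARRAY_SIZE` (128) and the host helper `shiftl_demo` are assumptions that do not come from the slides, and the kernel is launched with a single block because `__syncthreads()` only synchronizes the threads of one block.

```cuda
#include <cuda_runtime.h>

#define ARRAY_SIZE 128  // assumed value; must fit in a single block

// Rotate the array one position to the left, within a single block.
__global__ void shiftl(int *array) {
    int idx = threadIdx.x;
    int tmp = array[idx];   // every thread first reads its own element
    __syncthreads();        // barrier: all reads complete before any write
    if (idx == 0)
        array[ARRAY_SIZE - 1] = tmp;
    else
        array[idx - 1] = tmp;
}

// Hypothetical helper: returns the number of misplaced elements
// (0 on success), or -1 if a CUDA call fails (e.g. no GPU available).
int shiftl_demo(void) {
    int h[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++) h[i] = i;

    int *d = NULL;
    if (cudaMalloc((void **)&d, sizeof(h)) != cudaSuccess) return -1;
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    shiftl<<<1, ARRAY_SIZE>>>(d);  // one block of ARRAY_SIZE threads

    // cudaMemcpy waits for the kernel before copying the result back.
    cudaError_t err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return -1;

    int errors = 0;
    for (int i = 0; i < ARRAY_SIZE; i++)
        if (h[i] != (i + 1) % ARRAY_SIZE) errors++;
    return errors;
}
```

Calling `shiftl_demo()` from a `main` function and checking for a return value of 0 is enough to verify the rotation. Removing the `__syncthreads()` line may or may not produce errors on a given run, which is what makes race conditions hard to spot.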

150 66 Reduction strategy: synchronization points

154 67 Exercise 7: implement the dot product
File: dotproduct.cu
Description: implement the parallel dot product algorithm.
Hint 1: use shared memory (performance).
Hint 2: do not forget synchronization.

156 68 Exercise 7: implement the dot product
__global__ void dotproduct(float *a, float *b, float *c, float *result, int N) {
  // At best, our K20 card can handle 49152 bytes of
  // shared storage (see deviceQuery output).
  // That means at most 49152/sizeof(float) elements.
  __shared__ float cache[BLOCK_SIZE];
  int idx = threadIdx.x + blockIdx.x*blockDim.x;
  cache[threadIdx.x] = (idx < N) ? a[idx]*b[idx] : 0.0f;
  __syncthreads();
  for (int i = blockDim.x/2; i > 0; i /= 2) {
    if (threadIdx.x < i) {
      cache[threadIdx.x] += cache[threadIdx.x+i];
    }
    // The barrier must be reached by every thread of the block, so it stays outside the if.
    __syncthreads();
  }
  if (threadIdx.x == 0) {
    c[blockIdx.x] = cache[0];
  }
}
Is this complete? What happens if there are several blocks?

159 69 Exercise 7: implement the dot product
__global__ void dotproduct(float *a, float *b, float *c, float *result, int N) {
  ...
  if (threadIdx.x == 0) {
    c[blockIdx.x] = cache[0];
  }
  __syncthreads();
  for (int i=0; i<blockDim.x; i++)
    c[0] += c[i];
}
Would this work? No: blocks cannot be synchronized with each other inside a kernel. The only way to synchronize two blocks is to launch two kernels. => The computation must be finished on the CPU, or in another kernel.
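
One way to finish the computation on the CPU, as suggested above, is sketched below. It is not the workshop's reference solution: `BLOCK_SIZE` (256), the helper `dot_demo` and its test data are assumptions, and the unused `result` parameter of the slide's signature is dropped. Each block writes one partial sum, and the host adds the partial sums once `cudaMemcpy` has returned.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed; must be a power of two for this reduction

// Each block reduces its slice of a[i]*b[i] and writes one partial sum to c.
__global__ void dotproduct(const float *a, const float *b, float *c, int N) {
    __shared__ float cache[BLOCK_SIZE];
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    cache[threadIdx.x] = (idx < N) ? a[idx] * b[idx] : 0.0f;
    __syncthreads();
    for (int i = blockDim.x / 2; i > 0; i /= 2) {
        if (threadIdx.x < i)
            cache[threadIdx.x] += cache[threadIdx.x + i];
        __syncthreads();  // reached by every thread of the block
    }
    if (threadIdx.x == 0)
        c[blockIdx.x] = cache[0];
}

// Hypothetical helper: dot product of a = (1, 1, ...) and b = (2, 2, ...).
// Returns 2*N on success, or -1.0f if a CUDA call fails.
float dot_demo(int N) {
    size_t bytes = N * sizeof(float);
    int blocks = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(blocks * sizeof(float));
    for (int i = 0; i < N; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a = NULL, *d_b = NULL, *d_c = NULL;
    float result = -1.0f;
    if (cudaMalloc((void **)&d_a, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_b, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_c, blocks * sizeof(float)) == cudaSuccess) {
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
        dotproduct<<<blocks, BLOCK_SIZE>>>(d_a, d_b, d_c, N);
        // The copy waits for the kernel: all blocks are done when it returns.
        if (cudaMemcpy(h_c, d_c, blocks * sizeof(float),
                       cudaMemcpyDeviceToHost) == cudaSuccess) {
            result = 0.0f;
            for (int i = 0; i < blocks; i++)  // final reduction on the CPU
                result += h_c[i];
        }
    }
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return result;
}
```

The end of the kernel acts as the global synchronization point that is missing inside the kernel. For a very large number of blocks, a second kernel launch over the partial sums would be the other option mentioned on the slide.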

160 V1.0 CUDA 6: Unified memory 70

161 71 CUDA 6: Unified memory
Lets the host and the GPU use the same pointer.
Requires a recent card (Kepler or later).
Must be compiled with -arch=sm_35.
cudaMalloc(void **, size_t); cudaMemcpy(h_c, d_c, size_t, cudaMemcpyDeviceToHost);
=> cudaMallocManaged(void **, size_t)

162 72 Exercise 8: matrix multiplication + unified memory
File: matrixmul.cu
Description: allocate the memory with cudaMallocManaged and modify MatMul accordingly.

163 73 Solution (part 1)
Before:
  // Allocate space for the matrices
  matA = (float *) malloc(size);
  matB = (float *) malloc(size);
  matC = (float *) malloc(size);
  ...
  free(matC); free(matB); free(matA);
After:
  // Allocate space for the matrices (cudaMallocManaged takes the address of the pointer)
  cudaMallocManaged(&matA, size);
  cudaMallocManaged(&matB, size);
  cudaMallocManaged(&matC, size);
  ...
  cudaFree(matC); cudaFree(matB); cudaFree(matA);

164 74 Solution (part 2), starting point
void MatMul(float* A, float* B, float* C) {
  size_t size = MSIZE * MSIZE * sizeof(float);
  // TODO: Remove all memory management calls
  // Allocate space for matrix A on device
  float *d_A = 0;
  cudaMalloc(&d_A, size);
  // Copy matrix A to device
  cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
  // Allocate space for matrix B on device
  float *d_B = 0;
  cudaMalloc(&d_B, size);
  // Copy matrix B to device
  cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
  // Allocate C in device memory
  float *d_C = 0;
  cudaMalloc(&d_C, size);
  // Invoke kernel
  dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
  dim3 dimGrid(MSIZE/BLOCK_SIZE, MSIZE/BLOCK_SIZE);
  MatMulKernel<<<dimGrid,dimBlock>>>(d_A, d_B, d_C);
  // Read C from device memory
  cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
  // Free device memory
  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}

166 74 Solution (part 2), with unified memory
void MatMul(float* A, float* B, float* C) {
  // All memory management calls (cudaMalloc, cudaMemcpy, cudaFree) are removed:
  // A, B and C are managed pointers and are passed to the kernel directly.
  // Invoke kernel
  dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
  dim3 dimGrid(MSIZE/BLOCK_SIZE, MSIZE/BLOCK_SIZE);
  MatMulKernel<<<dimGrid,dimBlock>>>(A, B, C);
  cudaDeviceSynchronize();
}

167 75 cudaDeviceSynchronize
By default, all CUDA calls are issued in the same "stream" (0).
Stream: a sequence of operations that execute in order on the GPU (kernels, cudaMemcpy, cudaMemset, ...).
Everything is synchronous on the GPU... except unified memory: there is no cudaMemcpy left to wait for the kernel, so the host must call cudaDeviceSynchronize() before reading the results.
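
The role of `cudaDeviceSynchronize()` with unified memory can be illustrated by a short sketch. The `scale` kernel and the `managed_demo` helper are hypothetical names chosen for this example; only `cudaMallocManaged`, `cudaDeviceSynchronize` and `cudaFree` come from the slides.

```cuda
#include <cuda_runtime.h>

// Multiply every element of v by factor.
__global__ void scale(float *v, float factor, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) v[idx] *= factor;
}

// Hypothetical helper: returns the number of wrong elements
// (0 on success), or -1 if a CUDA call fails.
int managed_demo(int n) {
    float *v = NULL;
    if (cudaMallocManaged((void **)&v, n * sizeof(float)) != cudaSuccess)
        return -1;

    for (int i = 0; i < n; i++) v[i] = 1.0f;  // host writes through the pointer

    scale<<<(n + 255) / 256, 256>>>(v, 3.0f, n);  // same pointer on the GPU

    // The launch returns immediately; wait before the host touches v again.
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(v); return -1; }

    int errors = 0;
    for (int i = 0; i < n; i++)
        if (v[i] != 3.0f) errors++;
    cudaFree(v);
    return errors;
}
```

Without the `cudaDeviceSynchronize()` call, the host loop could read `v` while the kernel is still running, which on Kepler-class cards is an error rather than just a stale read.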

168 V1.0 Multi-GPU 76

170 77 CUDA context
All CUDA operations are executed within a "CUDA context".
A CUDA context is associated with one GPU.
Switch context by calling cudaSetDevice(int devnum);
Get the number of devices with cudaGetDeviceCount(int *dev_count);
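
As a minimal sketch of these two calls, the helper below (the name `list_devices` is an assumption) walks over the visible GPUs, makes each one current in turn, and prints its name. `cudaGetDeviceProperties` is a standard runtime call that is not shown on the slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints one line per visible GPU and returns the
// number of devices, or -1 if the device count cannot be queried.
int list_devices(void) {
    int dev_count = 0;
    if (cudaGetDeviceCount(&dev_count) != cudaSuccess) return -1;

    for (int d = 0; d < dev_count; d++) {
        cudaDeviceProp prop;
        cudaSetDevice(d);  // later CUDA calls from this thread target GPU d
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) return -1;
        printf("Device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return dev_count;
}
```

Allocations and kernel launches issued after `cudaSetDevice(d)` go to device `d`, which is why the multi-GPU exercise selects the device at the top of each loop iteration.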

171 V1.0 Exercise 9: Multiplication of N matrices 78

172 79 Problem #1
Performance is limited by initialization, so the matrices must be initialized on the GPU.
Problem: a typical random number generator is sequential!

173 80 Random numbers
__device__ unsigned int hash(unsigned int x) {
  x = (x+0x7ed55d16) + (x<<12);
  x = (x^0xc761c23c) ^ (x>>19);
  x = (x+0x165667b1) + (x<<5);
  x = (x+0xd3a2646c) ^ (x<<9);
  x = (x+0xfd7046c5) + (x<<3);
  x = (x^0xb55a4f09) ^ (x>>16);
  return x;
}
__global__ void RandomFillKernel(float * A, unsigned int seed) {
  unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
  // Convert to float before dividing; an integer division by UINT_MAX would almost always give 0.
  A[idx] = float(hash(idx+seed)) / float(UINT_MAX);
}

177 81 Problem #2
Unified memory does not work very well with several GPUs.
Which GPU should have priority over the unified memory?
Solution: go back to cudaMemcpy, cudaMalloc, etc.

178 82 Exercise 9: multi-GPU matrix multiplication
File: matrixmul.cu
Description: modify the code so that it runs on 2 GPUs.
Job submission: from the interactive node, run: prepare_formation job2

179 83 Solution (part 1)
struct timespec ts;
int dev_count = 0;
cudaGetDeviceCount(&dev_count);
float * matAs[N], * matAs_h[N];
float * matBs[N], * matBs_h[N];
float * matCs[N], * matCs_h[N];

181 84 Solution (part 2)
#pragma omp parallel for num_threads(5) private(ts)
for (int d=0; d<N; d++) {
  cudaSetDevice(d % dev_count);
  clock_gettime(CLOCK_MONOTONIC,&ts);
  print_duration("Clock started. Duration %fs\n",&ts);
  ...
  // Allocate space for the matrices
  print_duration("Malloc done. Duration %fs\n",&ts);
  // Seed the random number generator
  RandomFillKernel<<<MSIZE*MSIZE/BLOCK_SIZE,BLOCK_SIZE>>>(matAs[d],rand());
  RandomFillKernel<<<MSIZE*MSIZE/BLOCK_SIZE,BLOCK_SIZE>>>(matBs[d],rand());
  print_duration("Initialization done. Duration %fs\n",&ts);
  // Multiply the matrices
  MatMul(matAs[d], matBs[d], matCs[d]);
  print_duration("Multiplication done. Duration %fs\n",&ts);
  // Copy the results back to host (destination first, then source)
  cudaMemcpy(matCs_h[d],matCs[d],size,cudaMemcpyDeviceToHost);
  print_duration("Copied data back. Duration %fs\n",&ts);
  cudaFree(matCs[d]);
  print_duration("Freed device memory. Duration %fs\n",&ts);
}

184 85 CUDA_VISIBLE_DEVICES
Lists the numbers of the available cards, e.g. CUDA_VISIBLE_DEVICES=0,1
To change the value: export CUDA_VISIBLE_DEVICES=0

185 86 Running with a single card
Clock started. Duration ... s
Clock started. Duration ... s
Malloc done. Duration ... s
Initialization done. Duration ... s
Malloc done. Duration ... s
Initialization done. Duration ... s
Multiplication done. Duration ... s
Copied data back. Duration ... s
Multiplication done. Duration ... s
Copied data back. Duration ... s
Freed device memory. Duration ... s
Freed device memory. Duration ... s
What happens with the second multiplication?

186 V1.0 Asynchronous operations 87

187 88 Streams
By default, all CUDA calls are issued in the same "stream" (0).
Stream: a sequence of operations that execute in order on the GPU.

188 89 Streams (continued)
cudaStream_t stream;
cudaStreamCreate(&stream);
kernel<<<grid,block,0,stream>>>(...);
cudaMemcpyAsync(...,stream);
cudaStreamSynchronize(stream);
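
The calls listed on this slide fit together roughly as in the sketch below. The `add_one` kernel and the `stream_demo` helper are hypothetical. The use of `cudaMallocHost` is also an addition that is not on the slides: asynchronous copies only overlap with other work when the host buffer is page-locked (pinned).

```cuda
#include <cuda_runtime.h>

// Add 1 to every element of v.
__global__ void add_one(float *v, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) v[idx] += 1.0f;
}

// Hypothetical helper: returns the number of wrong elements
// (0 on success), or -1 if a CUDA call fails.
int stream_demo(int n) {
    size_t bytes = n * sizeof(float);
    float *h = NULL, *d = NULL;
    cudaStream_t stream;

    // Pinned host memory, so that the copies can be truly asynchronous.
    if (cudaMallocHost((void **)&h, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) {
        cudaFreeHost(h);
        return -1;
    }
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        cudaFree(d);
        cudaFreeHost(h);
        return -1;
    }
    for (int i = 0; i < n; i++) h[i] = (float)i;

    // The three operations are queued in order in the same stream;
    // each call returns to the host without waiting for completion.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
    add_one<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);

    // Block the host until everything queued in this stream has finished.
    cudaError_t err = cudaStreamSynchronize(stream);

    int errors = 0;
    if (err == cudaSuccess) {
        for (int i = 0; i < n; i++)
            if (h[i] != (float)i + 1.0f) errors++;
    }
    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return (err == cudaSuccess) ? errors : -1;
}
```

With one stream per matrix (or per GPU), the copy of one result can overlap with the multiplication of the next, which is the motivation for this last section.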

189 V1.0 Conclusion 90

190 91 Additional resources
Wiki: [email protected] [email protected] [email protected]
