Cluster Analysis
An Example on Product Positioning
Characteristics of 24 car models (Source: L'argus de l'automobile, 2004)

Model                           Displacement   Power   Top speed   Weight   Width   Length
                                       (cm³)    (ch)      (km/h)     (kg)    (mm)     (mm)
Citroën C2. Base                          24       6          58      932     659     3666
Smart Fortwo Coupé                       698      52          35      730      55     2500
Mini.6 70                                598      70          28       25     690     3625
Nissan Micra.2 65                        240      65          54      965     660      375
Renault Clio 3.0 V6                     2946     255         245      400      80      382
Audi A3.9 TDI                            896      05          87      295     765     4203
Peugeot 307.4 HDI 70                     398      70          60       79     746     4202
Peugeot 407 3.0 V6 BVA                  2946       2         229      640       8     4676
Mercedes Classe C 270 CDI               2685      70         230      600     728     4528
BMW 530d                                2993      28         245      595     846      484
Jaguar S-Type 2.7 V6 Bi-Turbo           2720     207         230      722      88     4905
BMW 745i                                4398     333         250      870     902     5029
Mercedes Classe S 400 CDI               3966     260         250       95    2092     5038
Citroën C3 Pluriel.6i                    587       0          85       77     700     3934
BMW Z4 2.5i                             2494      92         235      260      78      409
Audi TT.8T 80                             78      80         228      280     764      404
Aston Martin Vanquish                   5935     460         306      835     923     4665
Bentley Continental GT                  5998     560          38     2385      98     4804
Ferrari Enzo                            5998     660         350      365    2650     4700
Renault Scenic.9 dci 20                  870      20          88      430     805     4259
Volkswagen Touran.9 TDI 05               896      05          80      498     794      439
Land Rover Defender Td5                 2495      22          35      695     790     3883
Land Rover Discovery Td5                2495      38          57      275     290     4705
Nissan X-Trail 2.2 dci                   284      36          80      520     765     4455
Data reduction
To neutralize the problem of units, the original data are replaced by the standardized data:
$$X_1^* = \frac{X_1 - \bar{x}_1}{s_1}, \quad \ldots, \quad X_p^* = \frac{X_p - \bar{x}_p}{s_p}$$
Each standardized variable has mean 0 and standard deviation 1.
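The standardization can be sketched in a few lines of Python. This is only an illustration: the function name `standardize` and the toy values are hypothetical, and the sketch assumes the sample standard deviation (denominator n − 1), which is consistent with a total sum of squares of (n − 1)p for standardized data.

```python
from statistics import mean, stdev

def standardize(rows):
    """Return z-scores column by column, using the sample standard deviation (n - 1)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, sds)]
        for row in rows
    ]

# Hypothetical toy data: 4 units measured on 2 variables with very different units.
toy = [[1000.0, 1.5], [2000.0, 1.7], [3000.0, 1.6], [4000.0, 1.8]]
z = standardize(toy)
```

After this step every column has mean 0 and standard deviation 1, so no variable dominates the distances merely because of its unit.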
The standardized data (Zscore)
Case Summaries

Model                           Displacement   Power   Top speed   Weight   Width   Length
Citroën C2. Base                       -.054   -.935       -.002     -.43    -.82    -.052
Smart Fortwo Coupé                     -.335   -.993       -.409    -.952   -.464   -3.057
Mini.6 70                              -.742   -.235        .058     -.70   -.672     -.23
Nissan Micra.2 65                      -.978    -.90       -.073    -.346   -.808    -.968
Renault Clio 3.0 V6                      .47     .30        .535    -.223    -.29     -.80
Audi A3.9 TDI                          -.545   -.653       -.490    -.494   -.332     -.29
Peugeot 307.4 HDI 70                   -.873   -.878       -.967    -.794    -.48     -.30
Peugeot 407 3.0 V6 BVA                   .47    .028        .253     .396    -.24     .685
Mercedes Classe C 270 CDI              -.025   -.235        .270     .293   -.500     .430
BMW 530d                                 .78    .073        .535     .280    .034     .968
Jaguar S-Type 2.7 V6 Bi-Turbo          -.002    .002        .270     .608   -.092     .079
BMW 745i                                 .05      .8        .624     .989    .288     .292
Mercedes Classe S 400 CDI               .820    .342        .624      .06     .48     .307
Citroën C3 Pluriel.6i                  -.749    -.62       -.525    -.799   -.627     -.59
BMW Z4 2.5i                              -.5   -.094        .359    -.585   -.260     -.32
Audi TT.8T 80                           -.62     -.7        .235    -.533   -.337    -.407
Aston Martin Vanquish                    2.8    .627         .64     .899    .383     .666
Bentley Continental GT                  2.60   2.269        .826     2.38    .360     .905
Ferrari Enzo                            2.60     2.9        2.39     -.34   3.675     .726
Renault Scenic.9 dci 20                -.562   -.557       -.472     -.46     -.5    -.032
Volkswagen Touran.9 TDI 05             -.545   -.653        -.64     .029    -.20      .95
Land Rover Defender Td5                 -.50   -.544       -.409     .538    -.29    -.679
Land Rover Discovery Td5                -.50    -.44       -.020     .777    .592     .735
Nissan X-Trail 2.2 dci                 -.355   -.454        -.64     .086   -.332     .305
Total   Mean                            .000    .000        .000     .000    .000     .000
        Std. Deviation                 1.000   1.000       1.000    1.000   1.000    1.000
9. Building a typology of the statistical units
Searching for homogeneous groups of individuals (clusters) in the population:
- Two individuals belonging to the same group are close to each other (similar behaviors);
- Two individuals belonging to different groups are far from each other (different behaviors).
Build a partition of the population into homogeneous clusters (low within-cluster variability) that differ from one another (high between-cluster variability).
Dendrogram
[Figure: dendrogram with candidate cutting levels giving 9, 8, 7, 6, 5, 4, 3 or 2 groups. Choosing the cutting level defines the clusters.]
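Hierarchical Classification (Ward Criterion)
[Figure: three clusters with centroids $g_1$, $g_2$, $g_3$ in the space of the variables $X_1, X_2, \ldots, X_p$.]
Ward distance:
$$D^2(G_i, G_j) = \frac{n_i \, n_j}{n_i + n_j} \, d^2(g_i, g_j)$$
where $n_i$ is the number of cases in cluster $G_i$ and $g_i$ is its centroid.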
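The Ward distance between two clusters can be computed directly from the formula. The sketch below is illustrative only: the helper names (`centroid`, `squared_distance`, `ward_distance`, `within_ss`) and the toy points are hypothetical. It also checks a standard property of the criterion: the Ward distance equals the increase in within-cluster sum of squares caused by merging the two clusters.

```python
def centroid(points):
    """Mean point of a cluster."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ward_distance(cluster_i, cluster_j):
    """n_i * n_j / (n_i + n_j) times the squared distance between centroids."""
    n_i, n_j = len(cluster_i), len(cluster_j)
    return n_i * n_j / (n_i + n_j) * squared_distance(
        centroid(cluster_i), centroid(cluster_j)
    )

def within_ss(cluster):
    """Sum of squared distances from each point to its cluster centroid."""
    g = centroid(cluster)
    return sum(squared_distance(x, g) for x in cluster)

# Hypothetical toy clusters.
g1 = [[0.0, 0.0], [2.0, 0.0]]   # centroid (1, 0)
g2 = [[4.0, 4.0]]               # centroid (4, 4)
d = ward_distance(g1, g2)       # 2 * 1 / 3 * (9 + 16)
increase = within_ss(g1 + g2) - within_ss(g1) - within_ss(g2)
```

At each step, the hierarchical procedure merges the pair of clusters with the smallest such distance, that is, the merge that degrades within-cluster homogeneity the least.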
Distance Matrix between 24 cars
Proximity Matrix (Squared Euclidean Distance). This is a dissimilarity matrix.

Case                               1        2        3        4   ...       23       24
1: Citroën C2. Base             .000    4.965     2.27     .026   ...   20.325    5.246
2: Smart Fortwo Coupé          4.965     .000     9.06     5.42   ...   39.487    8.625
3: Mini.6 70                    2.27     9.06     .000    2.249   ...    6.268    3.420
4: Nissan Micra.2 65            .026     5.42    2.249     .000   ...     9.36    4.703
...                              ...      ...      ...      ...   ...      ...      ...
23: Land Rover Discovery      20.325   39.487    6.268     9.36   ...     .000    6.953
24: Nissan X-Trail 2.2 d       5.246    8.625    3.420    4.703   ...    6.953     .000

$$d^2(x_k, x_l) = \sum_{j=1}^{p} (x_{jk} - x_{jl})^2$$

$$D_{\text{Ward}}(\text{Citroën C2}, \text{Nissan Micra}) = 0.026 \times \frac{1 \times 1}{1 + 1} = 0.013$$
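A proximity matrix of squared Euclidean distances can be built as follows. This is a minimal sketch: the function names and the z-score values are hypothetical, not taken from the car data. For two single cases, the Ward distance is simply half the squared Euclidean distance, since n_i = n_j = 1.

```python
def squared_euclidean(a, b):
    """Sum over the p variables of the squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def proximity_matrix(rows):
    """Symmetric matrix of squared Euclidean distances between all pairs of rows."""
    return [[squared_euclidean(a, b) for b in rows] for a in rows]

# Hypothetical standardized values for 3 cases on 2 variables.
z = [[-1.0, -0.9], [-1.3, -2.0], [0.5, 0.1]]
d2 = proximity_matrix(z)

# Ward distance between two singletons: n_i = n_j = 1, so the factor is 1/2.
ward_0_1 = d2[0][1] * (1 * 1) / (1 + 1)
```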
Hierarchical Cluster Analysis
Dendrogram using Ward Method
[Figure: SPSS dendrogram; horizontal axis "Rescaled Distance Cluster Combine", from 0 to 25. Cases from top to bottom: Citroën C2. Base, Nissan Micra.2 65, Peugeot 307.4 HDI, Citroën C3 Pluriel, BMW Z4 2.5i, Audi TT.8T 80, Renault Clio 3.0 V6, Mini.6 70, Volkswagen Touran., Nissan X-Trail 2.2 d, Audi A3.9 TDI, Renault Scenic.9 d, Land Rover Defender, Smart Fortwo Coupé, Peugeot 407 3.0 V6 B, BMW 530d, Jaguar S-Type 2.7 V6, Mercedes Classe C 27, BMW 745i, Mercedes Classe S 40, Land Rover Discovery, Aston Martin Vanquis, Bentley Continental, Ferrari Enzo.]
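The tree behind such a dendrogram comes from repeatedly merging the two closest clusters under the Ward criterion. The sketch below is a naive illustration of that procedure, not the SPSS implementation: all names and the toy points are hypothetical, ties are broken by taking the first pair found, and every pairwise distance is recomputed at each step, which is acceptable only for small data sets.

```python
def centroid(points):
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ward(ci, cj):
    n_i, n_j = len(ci), len(cj)
    return n_i * n_j / (n_i + n_j) * squared_distance(centroid(ci), centroid(cj))

def ward_clustering(points, k):
    """Merge clusters greedily until k remain.

    Returns the clusters (lists of case indices) and the merge schedule,
    where each entry is (cluster_a, cluster_b, cumulative within-cluster SS).
    """
    clusters = [[i] for i in range(len(points))]
    schedule = []
    within = 0.0  # singletons have zero within-cluster sum of squares
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = ward([points[i] for i in clusters[a]],
                         [points[i] for i in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        within += d  # each merge adds its Ward distance to the within SS
        schedule.append((clusters[a], clusters[b], within))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, schedule

# Hypothetical toy data: two tight pairs and one isolated point.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]]
clusters, schedule = ward_clustering(points, 3)
```

Stopping at k clusters plays the role of cutting the dendrogram at a chosen level.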
Clusters Interpretation
Report: means by cluster (Ward Method)

Ward Method        Displacement     Power   Top speed    Weight     Width    Length
1       Mean              885.3     30.08       88.69    295.85    748.38     402.3
        N                    13        13          13        13        13        13
2       Mean             698.00     52.00       35.00    730.00     55.00   2500.00
        N                     1         1           1         1         1         1
3       Mean              37.86     29.57      227.29     788.4     92.43    487.43
        N                     7         7           7         7         7         7
4       Mean            5966.50     50.00       32.00     20.00    920.50   4734.50
        N                     2         2           2         2         2         2
5       Mean            5998.00    660.00      350.00    365.00   2650.00   4700.00
        N                     1         1           1         1         1         1
Total   Mean            2722.54    206.67        24.7    486.58    838.42   4277.83
        N                    24        24          24        24        24        24
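A report of this kind is a group-by: for each cluster, the number of cases and the mean of every variable, computed on the original (unstandardized) units so that the clusters can be interpreted. The sketch below is illustrative; the function name `cluster_report`, the rows and the labels are hypothetical.

```python
from collections import defaultdict

def cluster_report(rows, labels):
    """Return {cluster label: {"N": size, "mean": [mean of each variable]}}."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    report = {}
    for label, members in sorted(groups.items()):
        n = len(members)
        report[label] = {"N": n, "mean": [sum(col) / n for col in zip(*members)]}
    return report

# Hypothetical data: 4 cases, 2 variables, 3 clusters.
rows = [[1000.0, 50.0], [1200.0, 60.0], [3000.0, 200.0], [5000.0, 400.0]]
labels = [1, 1, 2, 3]
report = cluster_report(rows, labels)
```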
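Decomposition of the total sum of squares
[Figure: clusters with centroids $g_1$, $g_2$, $g_3$ and overall centroid $g$ in the space of the variables $X_1, X_2, \ldots, X_p$.]
$$\sum_{i=1}^{n} d^2(x_i, g) = \sum_{k=1}^{K} n_k \, d^2(g_k, g) + \sum_{k=1}^{K} \sum_{i \in G_k} d^2(x_i, g_k)$$
Total sum of squares = Between-group sum of squares + Within-group sum of squares
For standardized data, the total sum of squares equals $(n-1)p$.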
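The decomposition can be checked numerically on any partition. The sketch below uses hypothetical points and labels, and the function names are illustrative.

```python
def centroid(points):
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def decompose(points, labels):
    """Return (total, between, within) sums of squares for a partition."""
    g = centroid(points)
    groups = {}
    for x, label in zip(points, labels):
        groups.setdefault(label, []).append(x)
    total = sum(squared_distance(x, g) for x in points)
    between = sum(len(m) * squared_distance(centroid(m), g) for m in groups.values())
    within = sum(squared_distance(x, centroid(m)) for m in groups.values() for x in m)
    return total, between, within

# Hypothetical toy partition into 3 clusters.
points = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0], [10.0, 1.0]]
labels = [0, 0, 1, 1, 2]
total, between, within = decompose(points, labels)
```

A good typology has a small within-group term and therefore a large between-group term, since their sum is fixed by the data.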
Agglomeration Schedule
SPSS results: the Coefficients column gives the within-cluster sum of squares of the typology in K clusters. The first coefficient is the Ward distance between cases 1 and 4. The last coefficient is the total sum of squares, p(n − 1) = 6 × 23 = 138. The Cluster columns identify each group by a case it contains.

Columns: Stage; Cluster Combined (Cluster 1, Cluster 2); Coefficients; Stage Cluster First Appears (Cluster 1, Cluster 2); Next Stage.
Rows for stages 1 to 23, in order: 4.03 0 0 3 2 24.067 0 0 8 6 20.54 0 0 8 8 0.255 0 0 5 8.377 4 0 9 5 6.506 0 0 7 4.772 0 0 3 6 2.056 3 2 5 8 9.460 5 0 6 2 3.988 0 0 6 5 5 2.567 0 6 2 3 5 3.373 0 7 7 4.384 7 8 7 8 5.650 0 0 20 6 22 7.70 8 0 7 8 2 0.798 9 0 9 3 6 5.7 2 5 8 3 20.448 3 7 2 8 23 25.850 6 0 22 7 9 36.5 4 0 22 2 47.523 8 0 23 8 7 73.86 9 20 23 8 38.000 2 22 0

Share of the total sum of squares explained by the typology in K clusters:
$$\frac{138 - \text{Coeff}[n-K]}{138}$$
For K = 2, the within-cluster sum of squares is 73.86, so the share explained is (138 − 73.86)/138 = 0.465.
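The explained share is a one-line computation once the total and within-cluster sums of squares are known. The sketch below uses the figures from this example (n = 24 cars, p = 6 standardized variables, within-cluster sum of squares 73.86 for K = 2); the function name `explained_share` is illustrative.

```python
def explained_share(total_ss, within_ss):
    """Share of the total sum of squares explained by the typology."""
    return (total_ss - within_ss) / total_ss

n, p = 24, 6
total_ss = p * (n - 1)                        # total sum of squares for standardized data
share_k2 = explained_share(total_ss, 73.86)   # typology in K = 2 clusters
```

Computing this share for successive values of K is one way to choose the cutting level: one looks for the point where adding a cluster no longer increases the share noticeably.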
Another method: k-means clustering
1. Choose a few "seeds", individuals who are quite different from one another (the seeds can be provided by the analyst or chosen by the program).
2. Go over the file, allocating each individual to the closest seed.
3. Then compute the mean of each group. This becomes the new seed for the group. If all the new seeds are close to the previous ones, stop. Otherwise, go to step 2.
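The three steps translate almost line by line into code. The sketch below is a minimal illustration under stated assumptions: seeds are supplied by the analyst, "close" means a squared distance below a hypothetical tolerance `tol`, a group left empty keeps its previous seed, and `max_iter` is a safety cap that the slides do not mention.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def k_means(points, seeds, tol=1e-9, max_iter=100):
    """Return (labels, seeds) after iterating steps 2 and 3 until the seeds stabilize."""
    seeds = [list(s) for s in seeds]
    labels = []
    for _ in range(max_iter):
        # Step 2: allocate each individual to the closest seed.
        labels = [
            min(range(len(seeds)), key=lambda k: squared_distance(x, seeds[k]))
            for x in points
        ]
        # Step 3: the mean of each group becomes its new seed.
        new_seeds = []
        for k in range(len(seeds)):
            members = [x for x, label in zip(points, labels) if label == k]
            new_seeds.append(centroid(members) if members else seeds[k])
        # Stop when every new seed is close to the previous one.
        if all(squared_distance(a, b) <= tol for a, b in zip(seeds, new_seeds)):
            return labels, new_seeds
        seeds = new_seeds
    return labels, seeds

# Hypothetical toy data with two obvious groups; seeds chosen by the analyst.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels, seeds = k_means(points, seeds=[[0.0, 0.0], [5.0, 6.0]])
```

Unlike the hierarchical method, the number of clusters must be fixed in advance, and the result can depend on the initial seeds.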