Introduction to Stata
Measures and tests of association
Christophe Lalanne
www.aliquote.org
Synopsis

- Tests for comparing two means
- Tests for comparing k means
- Tests for comparing two proportions
- Analysis of a contingency table
- Measures of association in epidemiology

d2e5ca9 2 / 47
Illustration data

German socio-economic survey carried out in 2009: "GSOEP" (3).

Socio-demographic data
    ybirth     year of birth
    hhnr2009   household
    sex        sex
    mar        marital status
    edu        level of education
    yedu       number of years of education
    voc        secondary level or university

Employment and income
    emp        type of employment
    egp        socio-professional category
    income     income (€)
    hhinc      household income (€)

Housing
    size       size of the dwelling
    hhsize     number of people in the household
Data file: gsoep09.dta

    . use data/gsoep09
    (SOEP 2009 (Kohler/Kreuter))

Preprocessing:

    . gen age = 2009 - ybirth
    . mvdecode income, mv(0=.c)
          income: 1369 missing values generated
    . gen lincome = log(income)
    (2001 missing values generated)
Tests for comparing two means
Comparing two means

Student's t test, available through the ttest command, compares means in the one-sample case (H0: µ = 0) or the two-sample case (independent or paired samples).

Illustration: does mean income differ by sex?

    . bysort sex: summarize lincome

    -> sex = Male

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
         lincome |       1746    10.08129    1.083648   3.828641   13.70765

    -> sex = Female

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
         lincome |       1664    9.443893    1.073004    5.09375   13.32572
    . graph box lincome, over(sex) ytitle("Income (log)")

[Figure: box plots of lincome by sex (Male, Female); y-axis: log income, ranging from about 4 to 14.]
Student's t test

Menu: Statistics > Summaries, tables, and tests > Classical tests of hypotheses > t test

    . ttest lincome, by(sex)

    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
        Male |    1746    10.08129    .0259338    1.083648    10.03043    10.13216
      Female |    1664    9.443893    .0263042    1.073004      9.3923    9.495486
    ---------+--------------------------------------------------------------------
    combined |    3410    9.770257    .0192551    1.124407    9.732504    9.808009
    ---------+--------------------------------------------------------------------
        diff |            .6374003    .0369475                .5649587    .7098419
    ------------------------------------------------------------------------------
        diff = mean(male) - mean(female)                              t =  17.2515
    Ho: diff = 0                                     degrees of freedom =     3408

        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
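As a cross-check outside Stata, the equal-variance t statistic can be recomputed from the group summaries alone. The sketch below is a minimal Python illustration (standard library only); the function and variable names are assumptions of this example, and small discrepancies with Stata's output are expected because the inputs are rounded.

```python
import math

# Summary statistics copied from the `bysort sex: summarize lincome` output.
n_m, mean_m, sd_m = 1746, 10.08129, 1.083648   # Male
n_f, mean_f, sd_f = 1664, 9.443893, 1.073004   # Female


def pooled_t(n1, m1, s1, n2, m2, s2):
    """Equal-variance two-sample t statistic, its standard error and df."""
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, se, df


t_stat, se_diff, df = pooled_t(n_m, mean_m, sd_m, n_f, mean_f, sd_f)
print(f"t = {t_stat:.3f}, SE = {se_diff:.6f}, df = {df}")
```

The result can be compared with the `diff` row and the `t` and `degrees of freedom` entries of the ttest table.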
Student's t test (continued)

Without assuming equal population variances (Welch's correction, option welch; with the option unequal alone, Stata uses Satterthwaite's approximation) (5):

    . ttest lincome, by(sex) welch

    Two-sample t test with unequal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
        Male |    1746    10.08129    .0259338    1.083648    10.03043    10.13216
      Female |    1664    9.443893    .0263042    1.073004      9.3923    9.495486
    ---------+--------------------------------------------------------------------
    combined |    3410    9.770257    .0192551    1.124407    9.732504    9.808009
    ---------+--------------------------------------------------------------------
        diff |            .6374003    .0369388                .5649759    .7098247
    ------------------------------------------------------------------------------
        diff = mean(male) - mean(female)                              t =  17.2556
    Ho: diff = 0                             Welch's degrees of freedom =  3405.02

        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

To actually compare two variances, the sdtest command has the same syntax as ttest.
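The two degrees-of-freedom approximations differ only in the denominators used. The following Python sketch (standard library only; names are illustrative, inputs are the rounded group summaries) computes the unequal-variance t statistic together with Satterthwaite's and Welch's degrees of freedom, using the formulas as they are usually stated for these two corrections.

```python
import math

# Group summaries (n, mean, sd) taken from the ttest output.
n_m, mean_m, sd_m = 1746, 10.08129, 1.083648   # Male
n_f, mean_f, sd_f = 1664, 9.443893, 1.073004   # Female


def unequal_variance_t(n1, m1, s1, n2, m2, s2):
    """Return (t, Satterthwaite df, Welch df) without pooling the variances."""
    v1, v2 = s1**2 / n1, s2**2 / n2          # squared standard errors
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df_satterthwaite = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    df_welch = -2 + (v1 + v2) ** 2 / (v1**2 / (n1 + 1) + v2**2 / (n2 + 1))
    return t, df_satterthwaite, df_welch


t_stat, df_satt, df_welch = unequal_variance_t(
    n_m, mean_m, sd_m, n_f, mean_f, sd_f
)
print(f"t = {t_stat:.3f}")
print(f"Satterthwaite df = {df_satt:.2f}, Welch df = {df_welch:.2f}")
```

With samples this large the two approximations are very close to each other and to the equal-variance value of 3408, so the choice has no practical effect here.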
Confidence intervals

The ci command builds confidence intervals for a given confidence level (level()):

    . bysort sex: ci lincome

    -> sex = Male

        Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
    -------------+---------------------------------------------------------------
         lincome |       1746    10.08129    .0259338        10.03043    10.13216

    -> sex = Female

        Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
    -------------+---------------------------------------------------------------
         lincome |       1664    9.443893    .0263042          9.3923    9.495486

Additional command: mean, which reports the same 95% confidence interval.
    . mean lincome if sex == 1

    Mean estimation                     Number of obs    =    1746

    --------------------------------------------------------------
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
         lincome |   10.08129   .0259338      10.03043    10.13216
    --------------------------------------------------------------

By hand, using the normal distribution (lower bound of the two-sided 95% interval, hence the 0.975 quantile):

    . local zc = invnormal(0.975)
    . display 10.08129 - `zc' * .0259338
    10.030461

To build confidence intervals based on Student's t distribution, use invt instead (tprob returns probabilities rather than quantiles):

    . display 10.08129 - invt(1745, 0.975) * .0259338
    10.030425
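The normal-based interval can be checked with a few lines of Python (standard library only; variable names are illustrative). A two-sided 95% interval uses the 0.975 quantile of the standard normal distribution, about 1.96, multiplied by the standard error.

```python
from statistics import NormalDist

# Mean and standard error for men, copied from the Stata output.
mean, se = 10.08129, 0.0259338

# Two-sided 95% interval: 2.5% in each tail, hence the 0.975 quantile.
z = NormalDist().inv_cdf(0.975)
lower, upper = mean - z * se, mean + z * se

print(f"z = {z:.6f}")
print(f"95% CI (normal): [{lower:.6f}, {upper:.6f}]")
```

The t-based bounds reported by Stata are very slightly wider, since the t quantile with 1745 degrees of freedom is marginally larger than the normal one.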
Nonparametric alternative

The Wilcoxon test (not to be confused with the median command) is a nonparametric alternative to Student's t test.

    . ranksum lincome, by(sex)

    Two-sample Wilcoxon rank-sum (Mann-Whitney) test

             sex |      obs    rank sum    expected
    -------------+---------------------------------
            Male |     1746   3551869.5     2977803
          Female |     1664   2263885.5     2837952
    -------------+---------------------------------
        combined |     3410     5815755     5815755

    unadjusted variance    8.258e+08
    adjustment for ties   -16.745225
                          ----------
    adjusted variance      8.258e+08

    Ho: lincome(sex==male) = lincome(sex==female)
                 z =  19.976
        Prob > |z| =   0.0000
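The z statistic reported by ranksum follows from the rank sum of one group, its expectation under the null hypothesis, and the tie-adjusted variance. A minimal Python sketch (standard library only; variable names are illustrative, all numbers are copied from the output above):

```python
import math

n1, n2 = 1746, 1664            # group sizes (Male, Female)
rank_sum_1 = 3551869.5         # observed rank sum for the first group
tie_adjustment = -16.745225    # "adjustment for ties" line of the output

N = n1 + n2
# Under H0 the rank sum of group 1 has mean n1(N+1)/2
# and (before the tie correction) variance n1*n2*(N+1)/12.
expected = n1 * (N + 1) / 2
variance = n1 * n2 * (N + 1) / 12 + tie_adjustment

z = (rank_sum_1 - expected) / math.sqrt(variance)
print(f"expected = {expected:.0f}, variance = {variance:.4g}, z = {z:.3f}")
```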
Tests for comparing k means
One-way analysis of variance

Analysis of variance (ANOVA) is used to compare more than two means (H0: µ1 = µ2 = ... = µk). Stata offers two commands (without going through the linear model): oneway and anova.

Illustration: does mean income differ by type of employment?

    . recode egp (1/2=1) (3/5=2) (8/9=3) (15/18=.), ///
          gen(egp4)
    (4435 differences between egp and egp4)
    . label define egp4 1 "Service class 1/2" ///
          2 "Non-manuals & self-employed" 3 "Manuals"
    . label values egp4 egp4
Distributions by group

    . histogram lincome, by(egp4, col(3)) freq

[Figure: frequency histograms of lincome in three panels (Service class 1/2, Non-manuals & self-employed, Manuals); x-axis: lincome, y-axis: Frequency. Graphs by RECODE of egp (Social Class (EGP)).]
    . twoway (kdensity lincome), by(egp4)

[Figure: kernel density estimates of lincome in three panels (Service class 1/2, Non-manuals & self-employed, Manuals); x-axis: x, y-axis: kdensity lincome. Graphs by RECODE of egp (Social Class (EGP)).]
    . graph box lincome, over(egp4) ytitle("Income (log)")

[Figure: box plots of lincome by egp4 (Service class 1/2, Non-manuals & self-employed, Manuals); y-axis: log income, ranging from about 4 to 14.]
Conditional means

    . tabstat lincome, by(egp4) stats(mean sd count)

    Summary for variables: lincome
         by categories of: egp4 (RECODE of egp (Social Class (EGP)))

                egp4 |      mean        sd         N
    -----------------+------------------------------
    Service class 1/ |  10.29525  .9454878      1085
    Non-manuals & se |  9.776857  .9735212       868
             Manuals |  9.615197  1.002863      1102
    -----------------+------------------------------
               Total |  9.902652  1.018826      3055
    ------------------------------------------------
ANOVA table

Menu: Statistics > Linear models and related > ANOVA/MANOVA > Oneway ANOVA

    . oneway lincome egp4

                            Analysis of Variance
        Source              SS         df      MS            F     Prob > F
    ------------------------------------------------------------------------
    Between groups      272.026782      2   136.013391    143.24     0.0000
     Within groups       2898.0461   3052   .949556388
    ------------------------------------------------------------------------
        Total           3170.07288   3054   1.03800684

    Bartlett's test for equal variances:  chi2(2) =   3.7888  Prob>chi2 = 0.150

Syntax:

    oneway [response_var] [factor_var] [if] [in] [, options]

- tabulate: displays means, standard deviations and group sizes
- bonferroni: pairwise comparisons of means with Bonferroni correction
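The sums of squares in a one-way ANOVA table can be rebuilt from the per-group sizes, means and standard deviations reported by tabstat. The Python sketch below (standard library only; names are illustrative) does this with the rounded summaries, so the results agree with Stata's table only up to rounding.

```python
# (n, mean, sd) of lincome per egp4 group, copied from the tabstat output.
groups = [
    (1085, 10.29525, 0.9454878),   # Service class 1/2
    (868, 9.776857, 0.9735212),    # Non-manuals & self-employed
    (1102, 9.615197, 1.002863),    # Manuals
]

n_total = sum(n for n, _, _ in groups)
grand_mean = sum(n * m for n, m, _ in groups) / n_total

# Between-group SS: squared distance of each group mean to the grand mean,
# weighted by group size. Within-group SS: (n - 1) times each group variance.
ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
ss_within = sum((n - 1) * s**2 for n, _, s in groups)

df_between = len(groups) - 1
df_within = n_total - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"SS between = {ss_between:.3f} (df = {df_between})")
print(f"SS within  = {ss_within:.3f} (df = {df_within})")
print(f"F = {f_stat:.2f}")
```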
Checking the assumptions

- independence of the observations
- normality of the residuals
- equality of the (population) variances
Normality of the residuals

The swilk command provides the Shapiro-Wilk test. As a general rule, however, graphical methods are preferable:

    . quietly: anova lincome egp4
    . predict r, resid
    (2356 missing values generated)
    . qnorm r

[Figure: normal quantile-quantile plot of the residuals; x-axis: Inverse Normal, y-axis: Residuals.]
Equality of variances

Stata reports Bartlett's test for equal variances with the oneway command. Levene's test is obtained with the robvar command (W0):

    . robvar lincome, by(egp4)

      RECODE of |
    egp (Social |
          Class |        Summary of lincome
         (EGP)) |        Mean   Std. Dev.       Freq.
    ------------+------------------------------------
      Service c |   10.295247   .94548776        1085
      Non-manua |   9.7768571   .97352115         868
        Manuals |   9.6151967   1.0028632        1102
    ------------+------------------------------------
          Total |   9.9026521   1.0188262        3055

    W0  = 12.5051486   df(2, 3052)     Pr > F = 0.00000390
    W50 =  7.9388574   df(2, 3052)     Pr > F = 0.00036403
    W10 = 10.6968625   df(2, 3052)     Pr > F = 0.00002348
Pairwise comparisons of means

Correction options for post-hoc tests: bonferroni, scheffe or sidak.

    . oneway lincome egp4, bonferroni noanova

           Comparison of lincome by RECODE of egp (Social Class (EGP))
                                  (Bonferroni)
    Row Mean-|
    Col Mean |    Service   Non-manu
    ---------+----------------------
    Non-manu |    -.51839
             |      0.000
             |
     Manuals |   -.680051    -.16166
             |      0.000      0.001

Similar conclusions are reached by applying the Bonferroni correction to the results of simple Student's t tests.

    . quietly: ttest lincome if egp4 != 1, by(egp4)
    . display r(p)*3
    .0009856
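The factor 3 used above is the number of distinct pairs among the three groups. A small Python sketch of the Bonferroni adjustment (standard library only; the helper name is illustrative, and the unadjusted p-value is back-calculated from the displayed value .0009856, which is an assumption of this example):

```python
from math import comb


def bonferroni(p_value, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_comparisons)


k_groups = 3
n_pairs = comb(k_groups, 2)          # number of pairwise comparisons

# Unadjusted p-value for Manuals vs Non-manuals, recovered from r(p)*3.
p_unadjusted = 0.0009856 / 3
p_adjusted = bonferroni(p_unadjusted, n_pairs)

print(f"{n_pairs} comparisons, adjusted p = {p_adjusted:.7f}")
```

The cap at 1 matters only for large unadjusted p-values, where the product would otherwise exceed a valid probability.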
Alternative to oneway

The oneway command is limited to the case of a single explanatory factor. The anova command is more general and covers:

- factorial and nested designs,
- balanced or unbalanced designs (cf. computation of the sums of squares),
- repeated measures,
- analysis of covariance.

    . anova lincome egp4

                           Number of obs =    3055     R-squared     =  0.0858
                           Root MSE      = .974452     Adj R-squared =  0.0852

          Source |  Partial SS    df       MS           F     Prob > F
      -----------+----------------------------------------------------
           Model |  272.026782     2  136.013391     143.24     0.0000
                 |
            egp4 |  272.026782     2  136.013391     143.24     0.0000
                 |
        Residual |   2898.0461  3052  .949556388
      -----------+----------------------------------------------------
           Total |  3170.07288  3054  1.03800684
Multiple comparisons

With anova, pairwise comparisons of means are obtained with pwcompare, a more general command than pwmean. The correction options (mcompare()) additionally include tukey, snk, duncan and dunnett.

    . pwcompare egp4, cformat(%3.2f)

    Pairwise comparisons of marginal linear predictions

    Margins      : asbalanced

    ------------------------------------------------------------------------------
                                 |                            Unadjusted
                                 |   Contrast   Std. Err.     [95% Conf. Interval]
    -----------------------------+------------------------------------------------
                            egp4 |
     Non-manuals & self-employed |
                              vs |
               Service class 1/2 |      -0.52        0.04        -0.61       -0.43
    Manuals vs Service class 1/2 |      -0.68        0.04        -0.76       -0.60
                      Manuals vs |
                             ... |
Tests for comparing two proportions
Exact and approximate tests of proportions

In addition to Pearson's χ² test for the cross-tabulation of two binary variables, Stata provides the commands bitest (binomial test) and prtest (test based on the normal approximation). In the univariate case, the binary variable must be coded 0/1. Several types of confidence intervals are available (4).

Illustration: balanced distribution of the two sexes in the sample.

    . generate sexb = sex - 1
    . tabulate sexb

           sexb |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      2,585       47.77       47.77
              1 |      2,826       52.23      100.00
    ------------+-----------------------------------
          Total |      5,411      100.00
Binomial test

Menu: Statistics > Summaries, tables, and tests > Classical tests of hypotheses > Binomial probability test

    . bitest sexb == 0.5

        Variable |        N   Observed k   Expected k   Assumed p   Observed p
    -------------+------------------------------------------------------------
            sexb |     5411       2826       2705.5       0.50000      0.52227

      Pr(k >= 2826)              = 0.000551  (one-sided test)
      Pr(k <= 2826)              = 0.999500  (one-sided test)
      Pr(k <= 2585 or k >= 2826) = 0.001102  (two-sided test)

    . ci sexb, binomial

                                                             -- Binomial Exact --
        Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
    -------------+---------------------------------------------------------------
            sexb |       5411    .5222695    .0067905         .508859    .5356559
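The one-sided probability of the exact test is a binomial tail sum. For an assumed proportion of 0.5 it can be computed exactly with integer arithmetic, as in this Python sketch (standard library only; the function name is illustrative). The two-sided value is obtained here by doubling, which is valid because the binomial distribution is symmetric when p = 0.5.

```python
from math import comb

n, k = 5411, 2826   # sample size and observed number of ones in sexb


def binom_upper_tail_half(k, n):
    """P(K >= k) for K ~ Binomial(n, 1/2), via exact integer counts."""
    favourable = sum(comb(n, j) for j in range(k, n + 1))
    # Python divides arbitrarily large integers correctly.
    return favourable / 2**n


p_upper = binom_upper_tail_half(k, n)
p_two_sided = min(1.0, 2 * p_upper)   # symmetry of Binomial(n, 1/2)

print(f"Pr(k >= {k}) = {p_upper:.6f}")
print(f"two-sided p  = {p_two_sided:.6f}")
```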
One-sample test of proportion

Menu: Statistics > Summaries, tables, and tests > Classical tests of hypotheses > Proportion test

    . prtest sexb == 0.5

    One-sample test of proportion                   sexb: Number of obs =     5411
    ------------------------------------------------------------------------------
        Variable |       Mean   Std. Err.                     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            sexb |   .5222695   .0067905                      .5089604    .5355785
    ------------------------------------------------------------------------------
        p = proportion(sexb)                                          z =   3.2763
    Ho: p = 0.5

         Ha: p < 0.5                 Ha: p != 0.5                   Ha: p > 0.5
     Pr(Z < z) = 0.9995         Pr(|Z| > |z|) = 0.0011          Pr(Z > z) = 0.0005
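The approximate test uses two different standard errors: one under the null hypothesis for the z statistic, and one based on the observed proportion for the (Wald) confidence interval. A minimal Python sketch of both (standard library only; variable names are illustrative, counts come from the tabulate sexb output):

```python
import math
from statistics import NormalDist

n, successes, p0 = 5411, 2826, 0.5
p_hat = successes / n

# z statistic: standard error evaluated under H0 (p = p0).
se_null = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se_null
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

# Wald confidence interval: standard error evaluated at the estimate.
se_wald = math.sqrt(p_hat * (1 - p_hat) / n)
z_crit = NormalDist().inv_cdf(0.975)
ci_low, ci_high = p_hat - z_crit * se_wald, p_hat + z_crit * se_wald

print(f"z = {z:.4f}, two-sided p = {p_two_sided:.4f}")
print(f"SE = {se_wald:.7f}, 95% CI = [{ci_low:.7f}, {ci_high:.7f}]")
```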
Two-sample test of proportions

    . generate egpb = egp4 == 1
    . prtest egpb, by(sexb)

    Two-sample test of proportions                     0: Number of obs =     2585
                                                       1: Number of obs =     2826
    ------------------------------------------------------------------------------
        Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               0 |   .2201161   .0081491                      .2041441     .236088
               1 |   .1854211   .0073107                      .1710923    .1997498
    -------------+----------------------------------------------------------------
            diff |    .034695   .0109478                      .0132376    .0561523
                 |  under Ho:   .0109269     3.18   0.001
    ------------------------------------------------------------------------------
            diff = prop(0) - prop(1)                                  z =   3.1752
        Ho: diff = 0

        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(Z < z) = 0.9993         Pr(|Z| > |z|) = 0.0015          Pr(Z > z) = 0.0007

Note that the expression egp4 == 1 evaluates to 0 when egp4 is missing, so observations without an EGP class are counted in the denominators (2585 and 2826).
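The two-sample z statistic uses a standard error computed from the pooled proportion under the null hypothesis. The Python sketch below (standard library only; names are illustrative) reproduces it. The numerators 569 and 524 are an assumption of this example: they are the "Service class" counts from the sex by egp4 cross-tabulation, and they match the proportions .2201161 and .1854211 shown by prtest.

```python
import math
from statistics import NormalDist

x0, n0 = 569, 2585   # sexb == 0: egpb == 1 count (assumed), group size
x1, n1 = 524, 2826   # sexb == 1: egpb == 1 count (assumed), group size

p0, p1 = x0 / n0, x1 / n1
diff = p0 - p1

# Under H0 both groups share one proportion, estimated by pooling.
pooled = (x0 + x1) / (n0 + n1)
se_null = math.sqrt(pooled * (1 - pooled) * (1 / n0 + 1 / n1))

z = diff / se_null
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"diff = {diff:.6f}, SE under H0 = {se_null:.7f}")
print(f"z = {z:.4f}, two-sided p = {p_two_sided:.4f}")
```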
Immediate commands

Several Stata commands accept "immediate" forms.

    prtesti #obs1 #p1 #obs2 #p2 [, level(#) count]

Menu: Statistics > Summaries, tables, and tests > Classical tests of hypotheses > Proportion test calculator

    . prtesti 2585 0.2201 2826 0.1854

The count option makes it possible to work with observed counts rather than relative frequencies.
Analysis of a contingency table
Building a two-way table

Menu: Statistics > Summaries, tables, and tests > Frequency tables > Two-way table with measures of association

The tabulate (twoway) command builds a table of counts or relative frequencies and has options for the Pearson and Fisher statistics (1).

    . tabulate sex egp4

                          | RECODE of egp (Social Class (EGP))
                   Gender | Service c  Non-manua    Manuals |     Total
    ----------------------+---------------------------------+----------
                     Male |       569        290        717 |     1,576
                   Female |       524        592        396 |     1,512
    ----------------------+---------------------------------+----------
                    Total |     1,093        882      1,113 |     3,088
Row and column profiles

    . tabulate sex egp4, row

    +----------------+
    | Key            |
    |----------------|
    |   frequency    |
    | row percentage |
    +----------------+

                          | RECODE of egp (Social Class (EGP))
                   Gender | Service c  Non-manua    Manuals |     Total
    ----------------------+---------------------------------+----------
                     Male |       569        290        717 |     1,576
                          |     36.10      18.40      45.49 |    100.00
    ----------------------+---------------------------------+----------
                   Female |       524        592        396 |     1,512
                          |     34.66      39.15      26.19 |    100.00
    ----------------------+---------------------------------+----------
                    Total |     1,093        882      1,113 |     3,088
                          |     35.40      28.56      36.04 |    100.00
χ² test of association

    . tabulate sex egp4, chi

                          | RECODE of egp (Social Class (EGP))
                   Gender | Service c  Non-manua    Manuals |     Total
    ----------------------+---------------------------------+----------
                     Male |       569        290        717 |     1,576
                   Female |       524        592        396 |     1,512
    ----------------------+---------------------------------+----------
                    Total |     1,093        882      1,113 |     3,088

              Pearson chi2(2) = 196.5961   Pr = 0.000
Expected counts

The expected option displays the expected counts.

    . tabulate sex egp4, expected

    +--------------------+
    | Key                |
    |--------------------|
    |     frequency      |
    | expected frequency |
    +--------------------+

                          | RECODE of egp (Social Class (EGP))
                   Gender | Service c  Non-manua    Manuals |     Total
    ----------------------+---------------------------------+----------
                     Male |       569        290        717 |     1,576
                          |     557.8      450.1      568.0 |   1,576.0
    ----------------------+---------------------------------+----------
                   Female |       524        592        396 |     1,512
                          |     535.2      431.9      545.0 |   1,512.0
    ----------------------+---------------------------------+----------
                    Total |     1,093        882      1,113 |     3,088
                          |   1,093.0      882.0    1,113.0 |   3,088.0
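Expected counts under independence are (row total × column total) / N, and Pearson's statistic sums (observed − expected)² / expected over all cells. A minimal Python sketch of both steps for the sex by egp4 table (standard library only; variable names are illustrative):

```python
# Observed counts: rows = Male, Female; columns = Service class,
# Non-manuals & self-employed, Manuals.
observed = [
    [569, 290, 717],
    [524, 592, 396],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count of each cell under independence of rows and columns.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

for exp_row in expected:
    print("  ".join(f"{e:8.1f}" for e in exp_row))
print(f"Pearson chi2({df}) = {chi2:.4f}")
```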
Fisher's exact test

    . tabulate sex egp4, exact

    Enumerating sample-space combinations:
    stage 3:  enumerations = 1
    stage 2:  enumerations = 351
    stage 1:  enumerations = 0

                          | RECODE of egp (Social Class (EGP))
                   Gender | Service c  Non-manua    Manuals |     Total
    ----------------------+---------------------------------+----------
                     Male |       569        290        717 |     1,576
                   Female |       524        592        396 |     1,512
    ----------------------+---------------------------------+----------
                    Total |     1,093        882      1,113 |     3,088

               Fisher's exact =                 0.000
Measures of association in epidemiology
Measures of risk

Menu: Statistics > Epidemiology and related > Tables for epidemiologists

Stata offers a wide variety of tests of association and measures of risk commonly used in epidemiology.
Odds ratio

The tabodds command is used for case-control or cross-sectional studies. It computes the odds ratio and its asymptotic confidence interval (other options: cornfield or woolf), and tests the homogeneity of the odds ratios across strata (Mantel-Haenszel test).

Other available commands: cc and mcc (case-control studies), ir and cs (cohort studies).

All these commands also have an alternative "immediate" form.

Manual: [ST] epitab
Illustration data

Study on birth weights (2).

    low     birth weight < 2.5 kg
    age     mother's age
    lwt     mother's weight (pounds) at last menstrual period
    race    mother's ethnicity ("w", "b", "o")
    smoke   mother's smoking status during pregnancy
    ht      history of hypertension
    ui      presence of uterine irritability
    ftv     number of visits to the gynecologist during the first trimester
    ptl     number of previous preterm deliveries
    bwt     baby's weight (grams)
    . clear all
    . webuse lbw
    (Hosmer & Lemeshow data)
    . list in 1/5

         +-----------------------------------------------------------------------+
         | id   low   age   lwt    race       smoke   ptl   ht   ui   ftv    bwt |
         |-----------------------------------------------------------------------|
      1. | 85     0    19   182   black   nonsmoker     0    0    1     0   2523 |
      2. | 86     0    33   155   other   nonsmoker     0    0    0     3   2551 |
      3. | 87     0    20   105   white      smoker     0    0    0     1   2557 |
      4. | 88     0    21   108   white      smoker     0    0    1     2   2594 |
      5. | 89     0    18   107   white      smoker     0    0    1     0   2600 |
         +-----------------------------------------------------------------------+
Computing the odds ratio

    . tabodds low smoke, or

    ---------------------------------------------------------------------------
           smoke |  Odds Ratio       chi2       P>chi2     [95% Conf. Interval]
    -------------+-------------------------------------------------------------
       nonsmoker |    1.000000          .           .              .          .
          smoker |    2.021944       4.90      0.0269       1.069897   3.821169
    ---------------------------------------------------------------------------
    Test of homogeneity (equal odds): chi2(1)  =     4.90
                                      Pr>chi2  =   0.0269

    Score test for trend of odds:     chi2(1)  =     4.90
                                      Pr>chi2  =   0.0269
    . cc low smoke, woolf

                     | smoked during pregnancy|             Proportion
                     |   Exposed   Unexposed  |      Total     Exposed
    -----------------+------------------------+------------------------
               Cases |        30          29  |         59       0.5085
            Controls |        44          86  |        130       0.3385
    -----------------+------------------------+------------------------
               Total |        74         115  |        189       0.3915
                     |                        |
                     |      Point estimate    |    [95% Conf. Interval]
                     |------------------------+------------------------
          Odds ratio |         2.021944       |     1.08066    3.783112 (Woolf)
     Attr. frac. ex. |         .5054264       |    .0746392    .7356673 (Woolf)
     Attr. frac. pop |         .2569965       |
                     +-------------------------------------------------
                                   chi2(1) =     4.92  Pr>chi2 = 0.0265
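The odds ratio and its Woolf interval can be recomputed by hand from the four cells of the 2x2 table: the interval is built on the log scale, with a standard error equal to the square root of the sum of the reciprocal cell counts. A minimal Python sketch (standard library only; variable names are illustrative):

```python
import math
from statistics import NormalDist

# 2x2 table from the cc output.
a, b = 30, 29    # cases:    exposed, unexposed
c, d = 44, 86    # controls: exposed, unexposed

odds_ratio = (a * d) / (b * c)

# Woolf's method: normal approximation for log(OR).
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
z_crit = NormalDist().inv_cdf(0.975)
ci_low = math.exp(math.log(odds_ratio) - z_crit * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + z_crit * se_log_or)

print(f"OR = {odds_ratio:.6f}")
print(f"95% CI (Woolf) = [{ci_low:.5f}, {ci_high:.5f}]")
```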
Computing the relative risk

    . cs low smoke

                     | smoked during pregnancy|
                     |   Exposed   Unexposed  |      Total
    -----------------+------------------------+------------
               Cases |        30          29  |         59
            Noncases |        44          86  |        130
    -----------------+------------------------+------------
               Total |        74         115  |        189
                     |                        |
                Risk |  .4054054    .2521739  |   .3121693
                     |                        |
                     |      Point estimate    |    [95% Conf. Interval]
                     |------------------------+------------------------
     Risk difference |         .1532315       |    .0160718    .2903912
          Risk ratio |         1.607642       |    1.057812    2.443262
     Attr. frac. ex. |          .377971       |    .0546528    .5907112
     Attr. frac. pop |         .1921887       |
                     +-------------------------------------------------
                                   chi2(1) =     4.92  Pr>chi2 = 0.0265
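The risk ratio is the ratio of the two risks in the table, and the risk difference is their difference. The Python sketch below (standard library only; names are illustrative) recomputes both, together with a confidence interval for the risk ratio built on the log scale. That interval formula is an assumption of this example; it happens to match the bounds printed above.

```python
import math
from statistics import NormalDist

# Counts from the cs output.
cases_exp, n_exp = 30, 74        # exposed (smokers)
cases_unexp, n_unexp = 29, 115   # unexposed (nonsmokers)

risk_exp = cases_exp / n_exp
risk_unexp = cases_unexp / n_unexp

risk_difference = risk_exp - risk_unexp
risk_ratio = risk_exp / risk_unexp

# Normal approximation for log(RR).
se_log_rr = math.sqrt(
    1 / cases_exp - 1 / n_exp + 1 / cases_unexp - 1 / n_unexp
)
z_crit = NormalDist().inv_cdf(0.975)
rr_low = math.exp(math.log(risk_ratio) - z_crit * se_log_rr)
rr_high = math.exp(math.log(risk_ratio) + z_crit * se_log_rr)

print(f"risks: {risk_exp:.7f} vs {risk_unexp:.7f}")
print(f"RD = {risk_difference:.7f}, RR = {risk_ratio:.6f}")
print(f"95% CI for RR = [{rr_low:.6f}, {rr_high:.6f}]")
```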
References

1. I Campbell. Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Statistics in Medicine, 26(19):3661-3675, 2007.
2. D Hosmer and S Lemeshow. Applied Logistic Regression. New York: Wiley, 1989.
3. U Kohler and F Kreuter. Data Analysis Using Stata. College Station: Stata Press, 2012.
4. RG Newcombe. Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine, 17(8):857-872, 1998.
5. BL Welch. On the comparison of several mean values: An alternative approach. Biometrika, 38:330-336, 1951.
Index of commands

anova, bitest, bysort, cc, ci, clear, cs, display, generate, graph box, histogram, invt, kdensity, label define, label values, list, local, log, mean, mvdecode, normal, oneway, predict, prtest, prtesti, pwcompare, pwmean, qnorm, quietly, ranksum, recode, robvar, sqrt, summarize, tabodds, tabstat, tabulate, ttest, twoway, use, webuse