Un réseau de neurones pour l adaptation de domaine

Un réseau de neurones pour l adaptation de domaine Pascal Germain Travail conjoint avec Hana Ajakan, Hugo Larochelle, François Laviolette et Mario Marchand Groupe de recherche en apprentissage automatique de l Université Laval (GRAAL) 3 janvier 04 Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 / 4

Plan Mise en contexte Apprentissage automatique et classification Adaptation de domaine Talk : NIPS 04 Workshop on Transfer and Multi-task learning Domain Adaptation Setting Theoretical Foundations Neural Network for Domain Adaptation Empirical Results 3 Conclusion Travaux futurs Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 / 4

Une définition de l apprentissage automatique Field of study that gives computers the ability to learn without being explicitly programmed Arthur Samuel, 959 Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 3 / 4

Exemple critiques de films - + + Algorithme d'apprentissage -??? Classificateur Pascal Germain (GRAAL) Re seau de neurones adaptatif + + 3 janvier 04 4 / 4

Définitions de base Échantillon de données S def = { (x, y ), (x, y ),..., (x m, y m ) } (X Y) m Pour cette présentation, Espace d entré : x X R d (Attributs à valeurs réelles) Espace de sortie : y Y = {, } (Classification binaire) Algorithme d apprentissage A(S) η Classificateur Cas général : η : X Y Pour cette présentation : η : R d {, } Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 5 / 4

Apprentissage inductif Hypothèse Les exemples sont générées i.i.d. par une distribution D sur X Y. S D m Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 6 / 4

Apprentissage inductif Hypothèse (classification binaire) Les exemples sont générées i.i.d. par une distribution D sur R d {, }. S D m Un algorithme d apprentissage reçoit un échantillon d entraînement S = {(x i, y i )} m i= D m, et retourne un classificateur η : R d {, }. 0 On veut un classificateur avec un faible risque R D (η) def = Pr (x,y) D ( η(x) y 4 3 0 3 4 Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 7 / 4 ).

Exemple critiques de livres critiques de films -??? +?????? Algorithme d'apprentissage + -?????? Classificateur Pascal Germain (GRAAL) Re seau de neurones adaptatif - + 3 janvier 04 8 / 4

Définitions Deux distributions de données Distribution source D S ; Distribution cible D T. Deux chantillons d entraînement Échantillon source : S def = { (x s, y s ), (xs, y s ),..., (xs m, y s m) } (D S ) m Échantillon cible : Algorithme d apprentissage Classificateur T def = { x t, xt,,..., xt m} (D T ) m (non étiqueté!) A(S, T ) η η : X Y doit classifier des exemples cibles (x t D T ). Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 9 / 4

Adaptation de domaine Hypothèse Les exemples sont générées i.i.d. par les distributions D S et D T sur X Y. S (D S ) m et T (D T ) m Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 0 / 4

Adaptation de domaine (deux approches) Question Dans quelle(s) situation(s) l adaptation est-elle possible de D S vers D T? Réponse (partielle) Lorsque les domaines D S et D T sont semblables. Un outil Notion de distance d η (D S, D T ) entre les distributions. Deux approches pour la conception d algorithmes. Trouver un classificateur η H tel que d η (D S, D T ) est faible.. Modifier la représentation des exemples : Trouver une fonction h telle que d η (h(d S ), h(d T )) est faible. Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 / 4

Our Domain Adaptation Setting Binary classification tasks Input space : R d Labels : {, } Two different data distributions Source domain : D S Target domain : D T A domain adaptation learning algorithm is provided with a labeled source sample S = {(x s i, y s i )} m i= (D S ) m, an unlabeled target sample T = {x t i } m i= (D T ) m. 4 4 3 3 0 0 3 3 4 4 3 0 3 4 4 4 3 0 3 4 The goal is to build a classifier η : R d {, } with a low target risk R DT (η) def = Pr (x t,y t ) D T [η(x t ) y t ]. Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 / 4

Divergence between source and target domains Definition (Ben David et al., 006) Given two domain distributions D S and D T, and a hypothesis class H, the H-divergence between D S and D T is def [ d H (D S, D T ) = sup Pr η(x s ) = ] [ + Pr η(x t ) = ] η H x s D S x t D T. The H-divergence measures the ability of an hypothesis class H to discriminate between source D S and target D T distributions. 4 3 0 3 4 4 3 0 3 4 Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 3 / 4

Bound on the target risk Theorem (Ben David et al., 006) Let H be a hypothesis class of VC-dimension d. With probability δ over the choice of samples S (D S ) m and T (D T ) m, for every η H : R DT (η) R S (η)+ 4 m with β inf η H [R D S (η )+R DT (η )]. d log e m d + log 4 δ +ˆd H (S, T )+ 4 m d log m d + log 4 δ +β Empirical risk on the source sample : def R S (η) = m I [η(x s i ) y s i ]. m Empirical H-divergence : [ ˆd H (S, T ) def = max η H m i= m I [η(x s i ) = ] + m i= ] m I [η(x t i ) = ]. Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 4 / 4 i=

Bound on the target risk Theorem (Ben David et al., 006) Let H be a hypothesis class of VC-dimension d. With probability δ over the choice of samples S (D S ) m and T (D T ) m, for every η H : R DT (η) R S (η)+ 4 m with β inf η H [R D S (η )+R DT (η )]. d log e m d + log 4 δ +ˆd H (S, T )+ 4 m d log m d + log 4 δ +β Target risk R DT (η) is low if, given S and T, R S (η) is small, i.e., η H is good on and ˆd H(S, T ) is small, i.e., all η H are bad on 4 4 4 3 3 3 0 0 0 3 3 3 4 4 3 0 3 4 4 4 3 0 3 4 4 4 3 0 3 4 Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 5 / 4

Standard Neural Network Let consider a neural network architecture with one hidden layer h(x) = sigm(b + Wx), and f(h(x)) = softmax(c + Vh(x)). min W,V,b,c [ m m ( ) ] log f y s i (x s i ). i= } {{ } source loss where f y (x) denotes the conditional probability that the neural network assigns x to class y. Given a source sample S = {(x s i, y s i )}m i= (D S) m,. Pick a x s S. Update V towards f(h(x s )) = y s 3. Update W towards f(h(x s )) = y s h(x) f(h(x))... h(x) j... x... x j... The hidden layer learns a representation h( ) from which linear hypothesis f( ) can classify source examples. Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 6 / 4 V x d h(x) n W

Domain-Adversarial Neural Network (DANN) Empirical H-divergence [ ˆd H (S, T ) def = max η H m m I [η(x s i ) = ] + m i= ] m I [η(x t i ) = ]. i= We estimate the H-divergence by a logistic regressor that model the probability that a given input (either x s or x t ) is from the source domain : o(h(x)) def = sigm(d + w h(x)). Given a representation output by the hidden layer h( ) : [ ) m ˆd H (h(s), h(t ) max log ( o(h(x s i )) ) + m log ( o(h(x t i )) ) ]. w,d m m i= i= Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 7 / 4

Domain-Adversarial Neural Network (DANN) min W,V,b,c [ m log ( f y s m i (x s i ) ) i= }{{} source loss + λ max w,d ( m m log ( o(h(x s i )) ) m + log ( o(h(x t m i )) )) ], i= i= } {{ } adaptation regularizer where λ > 0 weights the domain adaptation regularization term. Given a source sample S = {(x s i, y s i )}m i= (D S) m, and a target sample T = {(x t i )}m i= (D T ) m,. Pick a x s S and x t T. Update V towards f(h(x s )) = y s 3. Update W towards f(h(x s )) = y s 4. Update w towards o(h(x s )) = and o(h(x t )) = 5. Update W towards o(h(x s )) = and o(h(x t )) = f(h(x)) DANN finds a representation h( ) that are good on S ; but unable to discriminate between S and T. V h(x)... h(x) j... x... x j... o(h(x)) w h(x) n x d W Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 8 / 4

Toy Dataset Standard Neural Network (NN) f(h(x)) Trained to classify source Trained to classify domains V h(x)... h(x) j... h(x) n 0 0 W x...... x j x d 4 3 0 3 4 4 3 0 3 4 Domain-Adversarial Neural Networks (DANN) f(h(x)) o(h(x)) Classification output : f(h(x)) Domain output : o(h(x)) V w h(x)... h(x) j... h(x) n 0 0 W x...... x j x d 4 3 0 3 4 4 3 0 3 4 Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 9 / 4

Amazon Reviews Input : product review (bag of words) Output : positive or negative rating. Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 0 / 4

Amazon Reviews Input : product review (bag of words) Output : positive or negative rating. Dataset DANN NN books dvd 0.0 0.99 books electronics 0.46 0.5 books kitchen 0.30 0.35 dvd books 0.47 0.6 dvd electronics 0.47 0.56 dvd kitchen 0.7 0.7 electronics books 0.80 0.8 electronics dvd 0.73 0.77 electronics kitchen 0.48 0.49 kitchen books 0.83 0.88 kitchen dvd 0.6 0.6 kitchen electronics 0.6 0.6 Note : We use a small labeled subset of 00 target examples to select the hyperparameters. Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 / 4

Marginalized Stacked Denoising Autoencoders (msda) Question Does DANN can be combined with other representation learning techniques for domain adaptation? The autoencoders msda (Chen et al. 0) provides a new common representation for source and target (unsupervised) With msda+svm, Chen et al. (0) obtained state-of-the-art results on Amazon Reviews : Train a linear SVM on msda source representations. We try msda+dann : Train DANN on source representations and target representations. Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 / 4

Amazon Reviews Input : product review (bag of words) Output : positive or negative rating. Dataset msda+dann msda+svm books dvd 0.76 0.75 books electronics 0.97 0.44 books kitchen 0.69 0.7 dvd books 0.76 0.76 dvd electronics 0.8 0.0 dvd kitchen 0.5 0.78 electronics books 0.37 0.9 electronics dvd 0.6 0.6 electronics kitchen 0.8 0.37 kitchen books 0. 0.34 kitchen dvd 0.08 0.09 kitchen electronics 0.4 0.38 Note : We use a small labeled subset of 00 target examples to select the hyperparameters. The noise parameter of msda representations is fixed to 50%. Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 3 / 4

Travaux futurs Quelques avenues à explorer : Réseau de neurones profonds (deep learning). Problèmes multi-classes et multi-étiquettes. Adaptation avec plusieurs ensembles sources. Merci! Pascal Germain (GRAAL) Réseau de neurones adaptatif 3 janvier 04 4 / 4