Optimization and statistical learning. An introduction to statistical learning.
Stéphane Canu, asi.insa-rouen.fr/enseignants/~scanu
BasMatI 2015 summer school, Porquerolles, June 2, 2015
Outline
1 A definition of statistical learning
2 Learning with deep neural networks: learning a linear model; the margin; linear SVM: the problem; deep learning
3 The main learning algorithms
Learning: human vs. machine
What a child learns: walking, one year; talking, two years; reasoning, the rest.
Learning to reason
Two definitions of learning
Arthur Samuel (1959): Machine learning: field of study that gives computers the ability to learn without being explicitly programmed.
Tom Mitchell (The Discipline of Machine Learning, 2006): How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes? A computer program CP is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
A "learnable" task T
Machine learning: a computer program CP learns from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Key points: experience E: statistics; performance measure P: optimization; tasks T: usefulness (machine translation, playing chess... doing what humans do).
The tasks T
The experience E: the data
sensors: qualitative / ordinal / quantitative variables
text: character strings
speech: time, time series
images/videos: 2/3-d dependencies
networks: graphs
games: sequences of interactions
streams: sales receipts, web logs, traffic...
labels: evaluation information
Big data (volume, velocity, variety, veracity): data that arrive without anyone having decided to collect them.
The importance of preprocessing (cleaning, normalization, coding...) and of representation: from the raw data to the vector.
Objectives, tasks and performance measures P
unsupervised: clustering (group homogeneity)
supervised: two-class classification (number of errors, 0/1), multiclass, detection (number of missed detections), ranking (rank error), regression (squared error)
reinforcement (probability of winning)
explaining... (quality of the summary)
and also semi-supervised, transductive, sequential, active...
Generalize: do well (minimize P) on future data that we do not know.
A better understanding of the world from observations, with prediction in mind. That is nice, but how does it work?
Outline
1 A definition of statistical learning
2 Learning with deep neural networks: learning a linear model; the margin; linear SVM: the problem; deep learning
3 The main learning algorithms
The formal neuron
Defines a hyperplane $w^\top x + b = 0$.
Inputs $x_1, \dots, x_p$ with weights $w_1, \dots, w_p$ and bias $b$; output $y = \varphi(w^\top x + b)$.
$x$ input, $w$ weights, $b$ bias, $\varphi$ activation function, $y$ output; for instance $\varphi(t) = \tanh(t)$.
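The slide translates directly into a few lines of code. Below is a minimal sketch (plain NumPy, with illustrative names and values of my own) of the forward computation $y = \varphi(w^\top x + b)$ with $\varphi = \tanh$.

```python
import numpy as np

def formal_neuron(x, w, b, phi=np.tanh):
    """Output of a formal neuron: y = phi(w'x + b)."""
    return phi(np.dot(w, x) + b)

# a neuron in R^3 with weights w, bias b and tanh activation
x = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 0.3, -0.7])
b = 0.1
print("output y:", formal_neuron(x, w, b))           # scalar in (-1, 1)
print("signed value w'x + b:", np.dot(w, x) + b)      # the decision border is where this is 0
```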
Hyperplanes: formal definition
Given a vector $w \in \mathbb{R}^d$ and a bias $b \in \mathbb{R}$:
Hyperplane as a function: $h : \mathbb{R}^d \to \mathbb{R}$, $x \mapsto h(x) = w^\top x + b$.
Hyperplane as a border in $\mathbb{R}^d$ (an implicit function): $\Delta(w, b) = \{x \in \mathbb{R}^d \mid w^\top x + b = 0\}$.
The border invariance property: $\forall k \neq 0$, $\Delta(kw, kb) = \Delta(w, b)$.
(Figure: the value $h(x) = w^\top x + b$ is proportional to the distance $d(x, \Delta)$ from $x$ to the decision border $\{x \in \mathbb{R}^2 \mid w^\top x + b = 0\}$.)
The formal neuron as a learning machine
Fitting the data for the two-class classification problem: given $n$ input-output pairs $(x_i, y_i)$, $i = 1, \dots, n$, find $w$ such that $\underbrace{\varphi(w^\top x_i)}_{\text{prediction of the model}} = \underbrace{y_i}_{\text{ground truth}}$.
Iterate: make some initial guess for $w$; pick some data; extract information; improve $w$.
(Figure: a toy two-class data set in the plane.)
What about new data? Generalization?
Fitting the data with an energy-based model
Fitting the data: $\min_{w \in \mathbb{R}^{p+1}} \sum_{i=1}^{n} \big(\underbrace{\varphi(w^\top x_i)}_{\text{prediction}} - \underbrace{y_i}_{\text{truth}}\big)^2$
Improve $w$: $w^{\text{new}} = w^{\text{old}} - \rho \, \underbrace{(\varphi(w^\top x_i) - y_i)\, \nabla_w \varphi(w^\top x_i)}_{\text{extracted information}}$
Algorithm 1: Adaline
Data: $w$ initialization, $\rho$ fixed stepsize
Result: $w$
while not converged do
  pick a point $(x_i, y_i)$ at random
  $d \leftarrow (w^\top x_i - y_i)\, x_i$
  $w \leftarrow w - \rho\, d$
  check that the cost decreases
end
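The Adaline update translates almost line for line into code. Here is a minimal sketch (NumPy, with $\varphi$ taken as the identity; the synthetic data, the fixed stepsize and the iteration budget are assumptions of mine, not part of the slide).

```python
import numpy as np

def adaline(X, y, rho=0.01, n_iter=1000, seed=0):
    """Stochastic gradient descent on the squared loss:
    pick a point at random, d = (w'x_i - y_i) x_i, then w <- w - rho * d."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])   # append a constant 1 to handle the bias b
    w = np.zeros(p + 1)                    # initialization
    for _ in range(n_iter):
        i = rng.integers(n)                # pick a point at random
        d = (Xb[i] @ w - y[i]) * Xb[i]     # extracted information (gradient of the squared error)
        w -= rho * d                       # improve w
    return w

# toy two-class problem, labels coded as +1 / -1
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
w = adaline(X, y)
acc = np.mean(np.sign(np.hstack([X, np.ones((100, 1))]) @ w) == y)
print("training accuracy:", acc)
```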
Fitting the data with an energy-based model
Coding the output (the truth?): $y_i = +1$ if $x_i \in$ class 1, $y_i = -1$ else.
Fitting the data, as a loss: $\min_{w \in \mathbb{R}^{p+1}} \sum_{i=1}^{n} \big(\underbrace{\varphi(w^\top x_i)}_{\text{prediction}} - \underbrace{y_i}_{\text{truth}}\big)^2$.
Since $y_i^2 = 1$, $(\varphi(w^\top x_i) - y_i)^2 = (\underbrace{y_i \varphi(w^\top x_i)}_{\text{score } z_i} - 1)^2$, so the problem can be written $\min_{w \in \mathbb{R}^{p+1}} \sum_{i=1}^{n} \text{cost}(z_i)$ with $z_i = y_i \varphi(w^\top x_i)$.
0/1 loss: $\text{cost}(z_i) = 0$ if $\text{sign}(w^\top x_i) = \text{sign}(y_i)$, $1$ else.
As a constraint: $\text{sign}(w^\top x_i)\,\text{sign}(y_i) \geq 0$.
(Figure: square loss, 0/1 loss, hinge loss and logistic loss as functions of the score $z$.)
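To make the comparison between losses concrete, here is a small sketch evaluating each cost as a function of the score $z_i = y_i \varphi(w^\top x_i)$; the function names and sample values are mine, and plotting is left out.

```python
import numpy as np

def square_loss(z):    return (z - 1.0) ** 2          # (y f(x) - 1)^2
def zero_one_loss(z):  return (z <= 0).astype(float)  # 1 if misclassified, 0 otherwise
def hinge_loss(z):     return np.maximum(0.0, 1.0 - z)
def logistic_loss(z):  return np.log1p(np.exp(-z))

z = np.linspace(-2, 2, 9)                              # scores around the decision value 0
for name, loss in [("square", square_loss), ("0/1", zero_one_loss),
                   ("hinge", hinge_loss), ("logistic", logistic_loss)]:
    print(f"{name:9s}", np.round(loss(z), 2))
```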
Separating hyperplanes
Find a line to separate (classify) blue from red: $D(x) = \text{sign}(v^\top x + a)$.
The decision border: $v^\top x + a = 0$.
How to choose a solution? There are many solutions... the problem is ill posed.
This is not the problem we want to solve
$\{(x_i, y_i),\ i = 1, \dots, n\}$: a training sample, i.i.d. according to $\mathbb{P}(x, y)$, unknown.
We want to be able to classify new observations: minimize $\mathbb{P}(\text{error})$.
Looking for a universal approach: use the training data (a few errors allowed); prove that $\mathbb{P}(\text{error})$ remains small; scalable, i.e. low algorithmic complexity.
With high probability (for the canonical hyperplane), Vapnik's book, 1982:
$\mathbb{P}(\text{error}) \leq \underbrace{\hat{\mathbb{P}}(\text{error})}_{=0 \text{ here}} + \phi\big(\underbrace{1/\text{margin}}_{=\|v\|}\big)$
Margin guarantees
$\underbrace{\min_{i \in [1,n]} \text{dist}(x_i, \Delta(v, a))}_{\text{margin: } m}$
(Figure: points inside a ball of radius $R$, separated by the hyperplane $v^\top x = 0$ with margin $m$.)
Theorem (Margin Error Bound). Let $R$ be the radius of the smallest ball $B_R(c) = \{x \in \mathbb{R}^d \mid \|x - c\| < R\}$ containing the points $(x_1, \dots, x_n)$, i.i.d. from some unknown distribution $\mathbb{P}$. Consider a decision function $D(x) = \text{sign}(v^\top x)$ associated with a separating hyperplane $v$ of margin $m$ (no training error). Then, with probability at least $1 - \delta$ for any $\delta > 0$, the generalization error of this hyperplane is bounded by
$\mathbb{P}(\text{error}) \leq \frac{2}{n}\,\frac{R^2}{m^2} + 3\sqrt{\frac{\ln(2/\delta)}{2n}}$
(theorem 4.17, p. 102 in J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge, 2004).
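The bound is easy to evaluate numerically; the sketch below simply plugs numbers into the displayed formula (the values of $R$, $m$, $n$ and $\delta$ are made up for illustration).

```python
import numpy as np

def margin_error_bound(R, m, n, delta):
    """Margin error bound (separable case, no training error):
    P(error) <= (2/n) * R^2 / m^2 + 3 * sqrt(ln(2/delta) / (2n))."""
    return 2.0 / n * (R / m) ** 2 + 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * n))

# illustrative numbers: radius 5, margin 0.5, 10000 points, 95% confidence
print(margin_error_bound(R=5.0, m=0.5, n=10_000, delta=0.05))
```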
Statistical machine learning: computational learning theory (COLT)
(Figure: a training sample $\{(x_i, y_i),\ i = 1, \dots, n\}$, drawn i.i.d. from $\mathbb{P}$, is fed to a learning algorithm $A$ that returns $f(x) = v^\top x + a$; predictions $y_p = f(x)$ are scored by a loss $L$.)
Empirical error: $\hat{\mathbb{P}}(\text{error}) = \frac{1}{n} \sum_i L(f(x_i), y_i)$; true error: $\mathbb{P}(\text{error}) = \mathbb{E}(L)$.
Generalization guarantee: $\text{Prob}\big(\mathbb{P}(\text{error}) \leq \hat{\mathbb{P}}(\text{error}) + \phi(\|v\|)\big) \geq 1 - \delta$.
Vapnik's book, 1982.
Linear discrimination
Find a line to classify blue and red: $D(x) = \text{sign}(v^\top x + a)$; the decision border: $v^\top x + a = 0$.
There are many solutions... the problem is ill posed.
How to choose a solution? Choose the one with the largest margin.
Maximize our confidence = maximize the margin
The decision border: $\Delta(v, a) = \{x \in \mathbb{R}^d \mid v^\top x + a = 0\}$.
Maximize the margin: $\max_{v, a} \ \underbrace{\min_{i \in [1,n]} \text{dist}(x_i, \Delta(v, a))}_{\text{margin: } m}$
Maximize the confidence: $\max_{v, a} m$ with $\min_{i = 1, n} \frac{|v^\top x_i + a|}{\|v\|} \geq m$.
The problem is still ill posed: if $(v, a)$ is a solution, then for any $k > 0$, $(kv, ka)$ is also a solution...
From the geometrical to the numerical margin
(Figure: value of the margin in the one-dimensional case: the band between the points scoring $-1$ and $+1$ around $\{x \mid w^\top x = 0\}$ has half-width $1/\|w\|$.)
Maximize the (geometrical) margin: $\max_{v, a} m$ with $\min_{i = 1, n} \frac{|v^\top x_i + a|}{\|v\|} \geq m$.
If the min is greater than $m$, every term is (with $y_i \in \{-1, 1\}$): $\max_{v, a} m$ with $\frac{y_i(v^\top x_i + a)}{\|v\|} \geq m$, $i = 1, \dots, n$.
Change of variables: $w = \frac{v}{m \|v\|}$ and $b = \frac{a}{m \|v\|}$, so that $m = \frac{1}{\|w\|}$:
$\min_{w, b} \|w\|^2$ with $y_i(w^\top x_i + b) \geq 1$, $i = 1, \dots, n$.
The canonical hyperplane
$\min_{w, b} \|w\|^2$ with $y_i(w^\top x_i + b) \geq 1$, $i = 1, \dots, n$.
Definition (The canonical hyperplane). A hyperplane $(w, b)$ in $\mathbb{R}^d$ is said to be canonical with respect to the set of vectors $\{x_i \in \mathbb{R}^d, i = 1, \dots, n\}$ if $\min_{i = 1, n} |w^\top x_i + b| = 1$, so that the distance of the closest point to the hyperplane is $\min_{i = 1, n} \text{dist}(x_i, \Delta(w, b)) = \min_{i = 1, n} \frac{|w^\top x_i + b|}{\|w\|} = \frac{1}{\|w\|}$.
The maximal margin (= minimal norm) canonical hyperplane.
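Checking the definition numerically: any separating $(v, a)$ can be rescaled to its canonical representative by dividing by $\min_i |v^\top x_i + a|$. A small sketch (the data and names are mine):

```python
import numpy as np

def make_canonical(v, a, X):
    """Rescale (v, a) so that min_i |v'x_i + a| = 1 (canonical hyperplane)."""
    k = 1.0 / np.min(np.abs(X @ v + a))
    return k * v, k * a

X = np.array([[1.0, 2.0], [3.0, -1.0], [-2.0, 0.5]])
v, a = np.array([0.4, -0.2]), 0.3
w, b = make_canonical(v, a, X)
print(np.min(np.abs(X @ w + b)))        # 1.0 by construction
print(1.0 / np.linalg.norm(w))          # the margin of the canonical hyperplane
```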
Linear SVM: the problem
The maximal margin (= minimal norm) canonical hyperplane.
Linear SVMs are the solution of the following problem (called the primal). Let $\{(x_i, y_i),\ i = 1, \dots, n\}$ be a set of labelled data with $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$. A support vector machine (SVM) is a linear classifier associated with the following decision function: $D(x) = \text{sign}(w^\top x + b)$, where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ are given through the solution of the following problem:
$\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \frac{1}{2}\|w\|^2$ with $y_i(w^\top x_i + b) \geq 1$, $i = 1, \dots, n$.
This is a quadratic program (QP): $\min_z \frac{1}{2} z^\top A z - d^\top z$ with $Bz \leq e$.
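To show the QP form in action, here is a minimal sketch that solves the hard-margin primal with the cvxopt QP solver (assuming the cvxopt package is available; the toy data, the tiny ridge on $b$ and the helper name are assumptions of mine, not part of the slide).

```python
import numpy as np
from cvxopt import matrix, solvers   # generic QP solver

def hard_margin_svm(X, y):
    """Solve min 1/2 ||w||^2  s.t.  y_i (w'x_i + b) >= 1,
    written as the QP  min 1/2 z'Pz  s.t.  Gz <= h  with z = (w, b)."""
    n, d = X.shape
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)            # only w is penalised, not b
    P[d, d] = 1e-8                   # tiny ridge on b to keep the KKT system well conditioned
    q = np.zeros(d + 1)
    G = -y[:, None] * np.hstack([X, np.ones((n, 1))])   # rows: -y_i (x_i, 1)
    h = -np.ones(n)
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:d], z[d]               # w, b

# toy linearly separable data
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))
```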
So far so good
a linear model with parameter $w$
an energy-based "model" to learn $w$: minimize a loss (square, 0/1, hinge, ...) or satisfy constraints
generalization: minimize another loss or satisfy other constraints (simplicity!)
reformulate and optimize
What about non-linearity?
Non-linearity: combining linear neurons (Alpaydın, Introduction to Machine Learning, 2010)
Neural networks as universal approximators
Running several neurons at the same time: $h^{(1)} = \varphi(W_1 x)$, $h^{(2)} = \varphi(W_2 h^{(1)})$, $y = \varphi(W_3 h^{(2)})$.
Multilayered neural network: learning the internal representations $W_1, W_2, W_3$ (from L. Arnold's PhD).
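The three-line composition above is literally what a forward pass computes. A minimal sketch with random weights (the layer sizes and the scale of the weights are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.tanh

# layer sizes: input 4 -> hidden 8 -> hidden 8 -> output 1 (arbitrary)
W1 = rng.normal(scale=0.5, size=(8, 4))
W2 = rng.normal(scale=0.5, size=(8, 8))
W3 = rng.normal(scale=0.5, size=(1, 8))

def forward(x):
    h1 = phi(W1 @ x)        # h^(1) = phi(W1 x)
    h2 = phi(W2 @ h1)       # h^(2) = phi(W2 h^(1))
    return phi(W3 @ h2)     # y     = phi(W3 h^(2))

print(forward(np.array([0.1, -0.4, 1.2, 0.7])))
```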
OCR: the MNIST database (Y. LeCun, 1989)
The Caltech 101 database (2004): 101 classes, 30 training images per category... and the winner is NOT a deep network: the dataset is too small.
Deep architecture: lessons from Caltech 101 (2009)
R_abs: rectification layer; N: local contrast normalization layer; P_A: average pooling and subsampling layer.
In "What is the Best Multi-Stage Architecture for Object Recognition?", Jarrett et al., 2009.
The ImageNet database (Deng et al., 2012)
ImageNet = 15 million labeled high-resolution images in 22,000 categories.
Large Scale Visual Recognition Challenge (a subset of ImageNet): 1,000 categories; 1.2 million training images, 50,000 validation images, 150,000 test images.
www.image-net.org/challenges/LSVRC/2012/
Deep architecture and ImageNet
Illustration of the architecture of a CNN: 60 million parameters, trained on 2 GPUs.
Regularization: data augmentation, dropout, weight decay.
A new fashion in image processing Baidu: 5% Y. LeCun StatLearn tutorial
Learning a deep architecture
$\min_{W \in \mathbb{R}^d} \sum_{i=1}^{n} \|f(x_i, W) - y_i\|^2 + \lambda \|W\|^2$
with $d = 60 \times 10^6$, $n = 1{,}200{,}000{+}$, $\lambda = 0.0005$ (Y. Bengio tutorial).
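At a much smaller scale, this objective reads directly as a penalized least-squares training loss. The sketch below evaluates it and runs plain gradient descent with weight decay for a linear $f$; everything here (the toy model, data, stepsize and iteration count) is a stand-in of mine for the 60-million-parameter network, not the actual training procedure.

```python
import numpy as np

def regularized_loss(W, X, Y, lam=0.0005):
    """sum_i ||f(x_i, W) - y_i||^2 + lam * ||W||^2, with the toy model f(x, W) = W x."""
    residuals = X @ W.T - Y
    return np.sum(residuals ** 2) + lam * np.sum(W ** 2)

def gradient_step(W, X, Y, lam=0.0005, lr=1e-3):
    grad = 2.0 * (X @ W.T - Y).T @ X + 2.0 * lam * W   # data term + weight decay term
    return W - lr * grad

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(100, 5)), rng.normal(size=(100, 2))
W = np.zeros((2, 5))
for _ in range(200):
    W = gradient_step(W, X, Y)
print(regularized_loss(W, X, Y))
```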
Then a miracle occurs: learning internal representations (Y. LeCun, StatLearn tutorial)
Deep architecture: more contests and benchmark records
speech (phoneme) recognition
automatic translation
Optical Character Recognition (OCR)
ICDAR Chinese handwriting recognition benchmark
Grand Challenge on Mitosis Detection
road sign recognition
Higgs boson challenge
Machine learning in high energy physics
The winning model is "brute force": a bag of 70 dropout neural networks, each with three hidden layers of 600 neurons, produced by 2-fold stratified cross-validation.
https://www.kaggle.com/c/higgs-boson
How does it work? A universal approximator; an energy-based model (fit the data); regularization mechanisms; an optimization procedure.
Outline
1 A definition of statistical learning
2 Learning with deep neural networks: learning a linear model; the margin; linear SVM: the problem; deep learning
3 The main learning algorithms
The goals of machine learning
Learning algorithms: good generalization, genericity, scalability.
Theoretical questions (Vapnik's book, 1982; Valiant, 1984): learnability, under which conditions? complexity (in time, in sample size).
A brief history of machine learning
1940s: study of logic as a mathematical object (Shannon, Gödel et al.)
1950: the Turing test; 1956: the Dartmouth conference: artificial intelligence
1960s: "Within a generation... the problem of creating artificial intelligence will substantially be solved."
1960s: the perceptron
1970s: expert systems
1980s: AI winter & neural networks (the non-linear perceptron)
1990s: SVM & COLT
2000s: applications: Google, Amazon (recommendation), spam filtering, OCR, speech, translation...
2010s: big data, Go player
http://www.cs.manchester.ac.uk/ugt/comp24111
Top 10 algorithms in data mining
Decision trees (and forests): C4.5, CART; SVM; AdaBoost; kNN; Naive Bayes (Bayesian); k-means; Apriori; EM; PageRank.
Identified by the IEEE International Conference on Data Mining (ICDM) in December 2006.
In 2014, an 11th: deep networks.
A brief history of fashions in machine learning
Methodology
1 define the objectives, the constraints and the costs
2 preliminary study: collect data; build a sandbox; refine the objectives and costs by interacting with the data; propose a model and choose a solution
3 deployment on real data streams: randomization; appropriate tools
New trends in ML fashion: big data
ICML 2014 workshops:
Designing Machine Learning Platforms for Big Data (Xiangxiang Meng, Wayne Thompson, Xiaodong Lin)
New Learning Frameworks and Models for Big Data (Massih-Reza Amini, Eric Gaussier, James Kwok, Yiming Yang)
Deep Learning Models for Emerging Big Data Applications (Shan Suthaharan, Jinzhu Jia)
Unsupervised Learning for Bioacoustic Big Data (Hervé Glotin, P. Dugan, F. Chamroukhi, C. Clark, Yann LeCun)
Knowledge-Powered Deep Learning for Text Mining (Bin Gao, Scott Yih, Richard Socher, Jiang Bian)
Optimizing Customer Lifetime Value in Online Marketing (Georgios Theocharous, Mohammad Ghavamzadeh, Shie Mannor)
How to handle this big data: data science.
New trends in ML fashion: big data hiring boom
A few predictions... about the future
Data science and applications: multiple skills, processing pipelines.
Tools for learning: parameter-free (off-the-shelf) learning; scalability (the 4 V's, big data).
Learning algorithms: optimization (big data, non-convex); dynamics (interactions); learning to learn (transfer).
Learning theory: the nature of information.
To learn more
Books:
Bishop, C. M. 1995. Neural Networks for Pattern Recognition. Oxford: Oxford University Press.
Duda, R. O., P. E. Hart, and D. G. Stork. 2001. Pattern Classification, 2nd ed. New York: Wiley.
Hastie, T., R. Tibshirani, and J. Friedman. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.
A. Cornuéjols & L. Miclet. 2010. L'apprentissage artificiel. Concepts et algorithmes, 2nd ed. Eyrolles.
Conferences: xCML, NIPS, COLT, the SMILE seminar, the Paris Machine Learning Applications Group.
Journals: JMLR, Machine Learning, Foundations and Trends in Machine Learning.
Machine learning surveys: http://www.mlsurveys.com/