Scalable Density Clustering for Spark THOMAS TRIPLET, PH.D., ENG. MARCH 9 TH 2016 Principal partenaire financier WWW.CRIM.CA
TECHNOLOGIES BIG-DATA Hadoop Core HDFS: Système de fichiers distribué YARN: Gestion des ressources CPU et planification MapReduce: Traitement en lot (batch) des données à grande échelle Écosystème Hadoop NoSQL: HBase, Cassandra, Accumulo, etc SQL: Hive, Stinger (Hortonworks), Impala (Cloudera), Presto (FB), Tajo, Drill (MapR) Transfert: Sqoop, Flume Calcul/ML: Spark, Storm, Giraph, Mahout Scripts: Pig, Cascading Administration: Hue, ZooKeeper, Knox Recherche: Solr, ElasticSearch 2
APACHE Popular distributed in-memory computing framework 10-100x faster than Hadoop MapReduce and low latency Linear horizontal scalability Fault tolerant (RDDs) Applications range from long-running batch jobs to stream processing High-level Scala, Java, Python and R APIs 3
AGENDA Clustering algorithms (unsupervised learning) Distance-based (k-means) Density-based (DBSCAN) PatchWork Algorithm Results Performance Conclusion Future Work 4
INTRODUCTION: MACHINE LEARNING Supervised Learning Unsupervised Learning (clustering) Class labels are known and predefined Training and testing datasets are (manually) labeled with same classes Goal is to learn function/rule that can classify new data points Examples: SVMs, Neural nets, Bayesian classifiers, Decision trees Class labels of the data are unknown Group/cluster similar data points without prior knowledge Goal is to discover structure or pattern in the data Examples: k-means, EM, DBScan, HCA 5
INTRODUCTION: MACHINE LEARNING Supervised Learning Unsupervised Learning (clustering) Class labels are known and predefined Training and testing datasets are (manually) labeled with same classes Goal is to learn function/rule that can classify new data points Examples: SVMs, Neural nets, Bayesian classifiers, Decision trees PatchWork Class labels of the data are unknown Group/cluster similar data points without prior knowledge Goal is to discover structure or pattern in the data Examples: k-means, EM, DBScan, HCA 5
INTRODUCTION: CLUSTERING Distance-based Density-based Popular algorithm: k-means (implemented in MLLib) Relies on distance function between data points Easy to implement Linear complexity (big-data) Easy to distribute Discovers spherical clusters of similar sizes only Sensitive to noise and local optima Prior knowledge of k. Popular algorithm: DBScan (not in MLLib) Relies on the density of data points in feature space Natural protection against noise and outliers Discovers clusters of arbitrary shape and size No prior knowledge of k Discovers clusters of similar densities only Quadratic complexity: not scalable 6
INTRODUCTION: CLUSTERING Distance-based Density-based Popular algorithm: k-means (implemented in MLLib) Relies on distance function between data points Easy to implement Linear complexity (big-data) Easy to distribute PatchWork Discovers spherical clusters of similar sizes only Sensitive to noise and local optima Prior knowledge of k. Popular algorithm: DBScan (not in MLLib) Relies on the density of data points in feature space Natural protection against noise and outliers Discovers clusters of arbitrary shape and size No prior knowledge of k Discovers clusters of similar densities only Quadratic complexity: not scalable 6
PATCHWORK ALGORITHM 2 main steps: 1. createcells( datapoints ) à cells à RDD[(string, int)] 2. createclusters( cells) à clusters 7
STEP 1: CELL CREATION 8
STEP 1: CELL CREATION 9
STEP 1: CELL CREATION ( -1,2 ; 4 ) ( -1,3 ; 4 ) ( -2,2 ; 4 ) ( -3,4 ; 1 ) ( 2,3 ; 4 ) ( 2,4 ; 3 ) ( 3,3 ; 3 ) ( 3,4 ; 3 ) 10
STEP 1: CELL CREATION ( -1,2 ; 1 ) ( -2,2 ; 1 ) ( 3,4 ; 1 ) ( -1,2 ; 1 ) setofcells = datapoints.map(pà(cellid(p),1)).reducebykey(_ + _) ( -1,2 ; 4 ) ( -1,3 ; ( -2,2 ; 4 4 ) )... ( -3,4 ; ( 2,3 ; 1 4 ) ) ( 3,4 ; 1 ) ( 2,4 ; 3 ) ( -1,2 ; 1 ) ( 3,3 ; 3 ) ( 3,4 ; 1 ) ( 3,4 ; 3 ) 11
STEP 2: CLUSTER CREATION 12
EXPERIMENTAL SETUP 6 servers, each with: Intel Xeon E5-2650 8 cores @2.6GHz 192GB memory 30TB storage Cloudera CDH 5.4.0 Apache Spark 1.3 13
DATASETS Aggregation Compound Jain Spiral 14
RESULTS (JAIN DATASET) K-means DBScan PatchWork 15
RESULTS (SPIRAL DATASET) K-means DBScan PatchWork 16
RESULTS (AGGREGATION DATASET) K-means DBScan PatchWork 17
RESULTS (COMPOUND DATASET) K-means DBScan PatchWork 18
PERFORMANCE DBSCAN PatchWork MLLib k-means 100,000 10,000 Running Time (seconds) 1,000 100 10 1 10,000.0 100,000.0 1,000,000.0 10,000,000.0 100,000,000.0 1,000,000,000.0 10,000,000,000.0 Millions of data points 19
PERFORMANCE: SCALABILITY MLLib k-means PatchWork 1 0.75 Normalized execution-time 0.5 0.25 0 1 2 3 4 5 Number of servers 20
CONCLUSION 21
FUTURE WORK Tests against new clustering algorithms available in Spark 1.6 Better distribution of step 2 Indexing for region query using R-trees Streaming version 22
Q & A Contact: thomas.triplet@crim.ca Availability: https://github.com/crim-ca/patchwork (MIT Licence) Reference: Frank Gouineau, Tom Landry, Thomas Triplet (2016) PatchWork, a Scalable Density-Grid Clustering Algorithm. In Proc. 31th ACM Symposium On Applied Computing, Data-Mining track
WWW.CRIM.CA Thomas Triplet, Ph.D., Eng. thomas.triplet@crim.ca Suivez-nous Dialoguez avec nous Suivez-nous #CRIM_ca wwwcrimca Le CRIM est un centre de recherche appliquée en TI qui développe, en mode collaboratif avec ses clients et partenaires, des technologies innovatrices et du savoir-faire de pointe, et les transfère aux entreprises et aux organismes québécois afin de les rendre plus productifs et plus compétitifs localement et mondialement. Le CRIM dispose de quatre équipes de recherche en TI de calibre mondial. Le CRIM œuvre principalement dans les domaines des interactions et interfaces personne-système, de l analytique avancée et des architectures et technologies avancées de développement et tests. Détenteur d une certification ISO 9001:2008, son action s inscrit dans les politiques et stratégies pilotées par le ministère de l'économie, de l'innovation et des Exportations (MEIE), son principal partenaire financier. Principal partenaire financier Tous droits réservés 2016 CRIM. 405, avenue Ogilvy, bureau 101, Montréal (Québec) H3N 1M3/514 840-1234/1 877 840-2746