Scalable Density Clustering for Spark


1 Scalable Density Clustering for Spark
Thomas Triplet, Ph.D., Eng.
March 9th, 2016

2 BIG-DATA TECHNOLOGIES
Hadoop Core
- HDFS: distributed file system
- YARN: CPU resource management and scheduling
- MapReduce: large-scale batch data processing
Hadoop ecosystem
- NoSQL: HBase, Cassandra, Accumulo, etc.
- SQL: Hive, Stinger (Hortonworks), Impala (Cloudera), Presto (FB), Tajo, Drill (MapR)
- Data transfer: Sqoop, Flume
- Compute/ML: Spark, Storm, Giraph, Mahout
- Scripting: Pig, Cascading
- Administration: Hue, ZooKeeper, Knox
- Search: Solr, ElasticSearch

3 APACHE SPARK
- Popular distributed in-memory computing framework
- 10-100x faster than Hadoop MapReduce, with low latency
- Linear horizontal scalability
- Fault tolerant (RDDs)
- Applications range from long-running batch jobs to stream processing
- High-level Scala, Java, Python and R APIs

4 AGENDA
- Clustering algorithms (unsupervised learning)
  - Distance-based (k-means)
  - Density-based (DBSCAN)
- PatchWork
  - Algorithm
  - Results
  - Performance
- Conclusion
- Future Work


5 INTRODUCTION: MACHINE LEARNING
Supervised Learning
- Class labels are known and predefined
- Training and testing datasets are (manually) labeled with the same classes
- Goal is to learn a function/rule that can classify new data points
- Examples: SVMs, neural nets, Bayesian classifiers, decision trees
Unsupervised Learning (clustering)
- Class labels of the data are unknown
- Group/cluster similar data points without prior knowledge
- Goal is to discover structure or patterns in the data
- Examples: k-means, EM, DBSCAN, HCA

6 INTRODUCTION: MACHINE LEARNING
Same comparison as slide 5, with PatchWork, a clustering algorithm, added to the slide.

7 INTRODUCTION: CLUSTERING
Distance-based
- Popular algorithm: k-means (implemented in MLlib)
- Relies on a distance function between data points
- Easy to implement
- Linear complexity (big-data)
- Easy to distribute
- Discovers spherical clusters of similar sizes only
- Sensitive to noise and local optima
- Requires prior knowledge of k
Density-based
- Popular algorithm: DBSCAN (not in MLlib)
- Relies on the density of data points in feature space
- Natural protection against noise and outliers
- Discovers clusters of arbitrary shape and size
- No prior knowledge of k
- Discovers clusters of similar densities only
- Quadratic complexity: not scalable
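To make the "quadratic complexity" point concrete, here is a minimal sketch of the density test that DBSCAN-style algorithms rely on: a point counts as a core point when enough points lie within a radius of it. This is not a full DBSCAN and is not taken from the slides; the names `NaiveDensity`, `neighbourCount`, `corePoints`, `eps` and `minPts` are illustrative assumptions, and the code runs on a local Scala collection rather than on Spark.

```scala
object NaiveDensity {
  type Point = (Double, Double)

  // Euclidean distance between two 2-D points.
  def dist(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // Number of points within eps of p, including p itself.
  // Every query scans the whole dataset, so running it for all
  // points costs O(n^2) distance computations.
  def neighbourCount(p: Point, points: Seq[Point], eps: Double): Int =
    points.count(q => dist(p, q) <= eps)

  // Core points: those whose eps-neighbourhood holds at least minPts points.
  def corePoints(points: Seq[Point], eps: Double, minPts: Int): Seq[Point] =
    points.filter(p => neighbourCount(p, points, eps) >= minPts)

  def main(args: Array[String]): Unit = {
    val points = Seq((0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0))
    // The three nearby points are core points; the isolated one is not.
    println(corePoints(points, eps = 0.5, minPts = 3))
  }
}
```

Without a spatial index, each neighbourhood query touches every point, which is the cost that grid-based approaches try to avoid.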


8 INTRODUCTION: CLUSTERING
Same comparison as slide 7, with PatchWork overlaid between the distance-based and density-based properties.

9 PATCHWORK ALGORITHM
2 main steps:
1. createcells(datapoints) → cells → RDD[(String, Int)]
2. createclusters(cells) → clusters

10-11 STEP 1: CELL CREATION
[Figures]

12 STEP 1: CELL CREATION
Cells, written as (cell coordinates ; number of points):
(-1,2 ; 4)  (-1,3 ; 4)  (-2,2 ; 4)  (-3,4 ; 1)
(2,3 ; 4)  (2,4 ; 3)  (3,3 ; 3)  (3,4 ; 3)

13 STEP 1: CELL CREATION
Each data point is mapped to a (cell ID ; 1) pair:
(-1,2 ; 1)  (-2,2 ; 1)  (3,4 ; 1)  (-1,2 ; 1)  ...
setofcells = datapoints.map(p => (cellid(p), 1)).reduceByKey(_ + _)
Pairs that share a cell ID are summed into (cell ID ; number of points):
(-1,2 ; 4)  (-1,3 ; 4)  (-2,2 ; 4)  (-3,4 ; 1)
(2,3 ; 4)  (2,4 ; 3)  (3,3 ; 3)  (3,4 ; 3)
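The map and reduceByKey step above can be sketched on a local Scala collection, which behaves the same way on a small scale. The slides do not say how the cell ID is computed; the version below assumes a uniform grid and takes the floor of each coordinate divided by a hypothetical `cellSize`. The names `CellCreation`, `cellId` and `createCells` are likewise illustrative, and `groupMapReduce` (Scala 2.13+) stands in for Spark's `map` followed by `reduceByKey`.

```scala
object CellCreation {
  type Point = Seq[Double]
  type CellId = Seq[Int]

  // Assumed cell ID: index of the grid cell containing p along each
  // dimension, for a uniform grid of side cellSize.
  def cellId(p: Point, cellSize: Double): CellId =
    p.map(x => math.floor(x / cellSize).toInt)

  // Local equivalent of points.map(p => (cellId(p), 1)).reduceByKey(_ + _):
  // emit 1 per point, keyed by cell, then sum the ones per key.
  def createCells(points: Seq[Point], cellSize: Double): Map[CellId, Int] =
    points.groupMapReduce(p => cellId(p, cellSize))(_ => 1)(_ + _)

  def main(args: Array[String]): Unit = {
    val points = Seq(Seq(-0.5, 2.2), Seq(-0.1, 2.9), Seq(3.4, 4.1), Seq(-0.7, 2.5))
    createCells(points, cellSize = 1.0).foreach { case (id, n) =>
      println(s"(${id.mkString(",")} ; $n)")
    }
  }
}
```

Because the per-cell sum is associative and commutative, the reduction can be applied partition by partition, which is why this step distributes well.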

14 STEP 2: CLUSTER CREATION
[Figure]
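The slide for this step survives only as a figure, so the following is a guess at the general idea rather than the authors' method: one plausible way to turn counted cells into clusters is to keep the cells that hold enough points and group those that touch each other. The threshold `minPoints`, the adjacency rule (cell IDs differing by at most 1 in every dimension) and the names `ClusterCreation`, `neighbours` and `createClusters` are all assumptions made for illustration.

```scala
object ClusterCreation {
  type CellId = Seq[Int]

  // Cells whose IDs differ by at most 1 in every dimension (the cell itself excluded).
  def neighbours(c: CellId): Seq[CellId] = {
    val offsets = c.foldLeft(Seq(Seq.empty[Int])) { (acc, _) =>
      for (prefix <- acc; d <- Seq(-1, 0, 1)) yield prefix :+ d
    }
    offsets.filter(_.exists(_ != 0)).map(o => c.zip(o).map { case (x, d) => x + d })
  }

  // Keeps cells with at least minPoints points (assumed density rule) and
  // groups them into connected components of adjacent cells.
  def createClusters(cells: Map[CellId, Int], minPoints: Int): Seq[Set[CellId]] = {
    var unvisited = cells.collect { case (id, n) if n >= minPoints => id }.toSet
    var clusters = Vector.empty[Set[CellId]]
    while (unvisited.nonEmpty) {
      var frontier = List(unvisited.head)
      var cluster = Set.empty[CellId]
      while (frontier.nonEmpty) {
        val cell = frontier.head
        frontier = frontier.tail
        if (unvisited.contains(cell)) {
          unvisited -= cell
          cluster += cell
          // Continue the flood fill through dense cells not yet assigned.
          frontier = neighbours(cell).filter(unvisited.contains).toList ++ frontier
        }
      }
      clusters :+= cluster
    }
    clusters
  }

  def main(args: Array[String]): Unit = {
    val cells = Map(
      Seq(-1, 2) -> 4, Seq(-1, 3) -> 4, Seq(-2, 2) -> 4, Seq(-3, 4) -> 1,
      Seq(2, 3) -> 4, Seq(2, 4) -> 3, Seq(3, 3) -> 3, Seq(3, 4) -> 3)
    createClusters(cells, minPoints = 3).foreach(println)
  }
}
```

Under these assumptions, the eight cells listed on slide 12 with `minPoints = 3` fall into two groups, and the sparse cell (-3,4) is left out as noise. The sketch works on cells rather than on raw points, so its cost depends on the number of occupied cells, not on the number of data points.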

15 EXPERIMENTAL SETUP
6 servers, each with:
- Intel Xeon E[...]
- [...] GB memory
- 30 TB storage
- Cloudera CDH
- Apache Spark

16 DATASETS
- Aggregation
- Compound
- Jain
- Spiral

17 RESULTS (JAIN DATASET)
[Figure panels: K-means, DBSCAN, PatchWork]

18 RESULTS (SPIRAL DATASET)
[Figure panels: K-means, DBSCAN, PatchWork]

19 RESULTS (AGGREGATION DATASET)
[Figure panels: K-means, DBSCAN, PatchWork]

20 RESULTS (COMPOUND DATASET)
[Figure panels: K-means, DBSCAN, PatchWork]

21 PERFORMANCE
[Figure: running time (seconds) against millions of data points, for DBSCAN, PatchWork and MLlib k-means]

22 PERFORMANCE: SCALABILITY
[Figure: normalized execution time against number of servers, for MLlib k-means and PatchWork]

23 CONCLUSION

24 FUTURE WORK
- Tests against new clustering algorithms available in Spark 1.6
- Better distribution of step 2
- Indexing for region query using R-trees
- Streaming version

25 Q & A
Contact: thomas.triplet@crim.ca
Availability: https://github.… (MIT Licence)
Reference: Frank Gouineau, Tom Landry, Thomas Triplet (2016) PatchWork, a Scalable Density-Grid Clustering Algorithm. In Proc. 31st ACM Symposium On Applied Computing, Data-Mining track

26 Thomas Triplet, Ph.D., Eng.
Follow us: #CRIM_ca, www.crim.ca
CRIM is an applied IT research centre that, working collaboratively with its clients and partners, develops innovative technologies and leading-edge know-how and transfers them to Quebec businesses and organizations to make them more productive and more competitive locally and globally. CRIM has four world-class IT research teams. It works mainly in the fields of human-system interaction and interfaces, advanced analytics, and advanced development and testing architectures and technologies. ISO 9001:2008 certified, CRIM operates within the policies and strategies led by the ministère de l'Économie, de l'Innovation et des Exportations (MEIE), its principal financial partner.
All rights reserved 2016 CRIM. 405, avenue Ogilvy, bureau 101, Montréal (Québec) H3N 1M3
